sky's blog

Python Web Scraping Tutorial 03

loskyertt Unknown

Created: 2024-10-17 16:51:40, Updated: 2025-02-17 04:36:55

Python Web Scraping

python


1. Requests without headers

If you send a request to the target site without building any request headers:

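import requests
from lxml import etree

url = "https://spiderbuf.cn/playground/s02"

# Send the request without custom headers (requests uses its own default User-Agent).
html = requests.get(url=url).text

# Save the raw response so it can be inspected later.
with open('./课程/02course/02.html', 'w', encoding='utf-8') as f:
    f.write(html)

print(html)
root = etree.HTML(html)
trs = root.xpath('//tr')

# Write each table row to a text file, with cells separated by ' | '.
with open('./课程/02course/data02.txt', 'w', encoding='utf-8') as f:
    for tr in trs:
        tds = tr.xpath('./td')
        s = ''
        for td in tds:
            s = s + str(td.text) + ' | '
        print(s)
        # Skip rows that have no <td> cells (for example, header rows).
        if s != '':
            f.write(s + '\n')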

Output:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>tengine</center>
</body>
</html>

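The server answers with 403 Forbidden. Without custom headers, requests identifies itself with its default User-Agent (python-requests/<version>), so the site can easily tell that the request comes from a crawler.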
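Because a blocked request still returns an HTML body, the parsing code above runs without error and simply finds no table rows. A minimal sketch of checking the status code before parsing is shown below. The names BlockedError and extract_html are hypothetical helpers, not part of requests or of the original tutorial; the function only assumes that the response object has status_code and text attributes, as the object returned by requests.get() does.

```python
class BlockedError(Exception):
    """Raised when the server refuses the request (hypothetical helper)."""


def extract_html(response):
    """Return the body of a successful response, or raise if it was refused.

    `response` is any object with `status_code` and `text` attributes,
    such as the object returned by requests.get().
    """
    if response.status_code == 403:
        raise BlockedError("403 Forbidden: the server rejected the request")
    if response.status_code != 200:
        raise BlockedError(f"unexpected status code {response.status_code}")
    return response.text
```

With this helper, the fetch step could be written as html = extract_html(requests.get(url=url)), so a 403 page stops the script instead of being written to the output files.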

2. Adding request headers

For this reason, you will almost always build a set of HTTP request headers before sending a request:

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
}

html = requests.get(url=url, headers=headers).text

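Sometimes you also need to add a Cookie to the headers. To further reduce the chance of being detected as a crawler, you may need to vary the User-Agent, for example by using a Firefox string or strings from different browser versions. These strings can be looked up online.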
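The idea of varying the User-Agent and optionally adding a Cookie can be sketched as follows. This is only an illustration under stated assumptions: build_headers and USER_AGENTS are hypothetical names, the Firefox string is an example of the usual format rather than a value from this tutorial, and the cookie value in the usage note is a placeholder.

```python
import random

# Example User-Agent strings. The Chrome string is the one used above;
# the Firefox string is an assumed example and should be replaced with
# a current value looked up online.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0",
]


def build_headers(cookie=None):
    """Build request headers with a randomly chosen User-Agent.

    If `cookie` is given, it is sent as the Cookie header.
    """
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

The result is passed to requests in the same way as before, for example requests.get(url=url, headers=build_headers(cookie="name=value")), where "name=value" stands for whatever cookie the site actually issues.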

Title: Python Web Scraping Tutorial 03
Author: loskyertt
Created at : 2024-10-17 16:51:40
Updated at : 2025-02-17 04:36:55
Link: https://redefine.ohevan.com/2024/10/17/03Python爬虫/
License: This work is licensed under CC BY-NC-SA 4.0.
