
Python Web Scraper Example (A Detailed Python Scraping Walkthrough)

  • python
  • 2024-03-12 09:03:36
Python web scraper example
Key points:

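1. Fetch the page source with the Requests library

import requests

url = "http://www.example.com/"
# A timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)
html = response.text  # response body decoded to text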

2. Parse the HTML with the BeautifulSoup library

from bs4 import BeautifulSoup

# "html.parser" is the parser bundled with the Python standard library
soup = BeautifulSoup(html, "html.parser")

3. Find and extract the data you need

# Collect every <h1> element and print its text
titles = soup.find_all("h1")
for title in titles:
    print(title.text)
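
The same lookup can also pull attributes, not just text. The following is a minimal sketch under stated assumptions: the sample markup and the helper name extract_headings are made up for illustration, and the HTML is inlined so that nothing is fetched over the network.

```python
from bs4 import BeautifulSoup

# Inline sample markup (made up for illustration) so the sketch runs offline
SAMPLE_HTML = """
<html><body>
  <h1>  First headline </h1>
  <h1><a href="/news/2">Second headline</a></h1>
  <p>Not a heading</p>
</body></html>
"""

def extract_headings(html):
    """Return (text, href) pairs for every <h1>; href is None when there is no link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h1 in soup.find_all("h1"):
        text = h1.get_text(strip=True)  # strip surrounding whitespace
        link = h1.find("a")             # first <a> inside the heading, if any
        href = link.get("href") if link is not None else None
        rows.append((text, href))
    return rows

headings = extract_headings(SAMPLE_HTML)
for text, href in headings:
    print(text, href)
```

Returning plain tuples instead of tag objects keeps the later storage step independent of BeautifulSoup.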

4. Process and store the data

# Write to a CSV file
import csv

with open("data.csv", "w", newline="", encoding="utf-8") as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(["Title", "URL"])  # header row
    for title in titles:
        csvwriter.writerow([title.text, url])
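
The step above stores the titles as they were scraped. A light cleaning pass before writing is often useful, and one possible version is sketched here. The helper names clean_titles and titles_to_csv and the sample strings are assumptions for illustration; the CSV is written to an in-memory buffer instead of data.csv so the result can be inspected directly.

```python
import csv
import io

def clean_titles(raw_titles):
    """Collapse whitespace, drop empty strings, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for raw in raw_titles:
        title = " ".join(raw.split())  # collapse runs of spaces and newlines
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned

def titles_to_csv(titles, page_url):
    """Serialize titles to CSV text with a Title/URL header row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["Title", "URL"])
    for title in titles:
        writer.writerow([title, page_url])
    return buffer.getvalue()

# Made-up sample input: stray whitespace, an empty string, and a duplicate
raw = ["  Hello,\n world ", "", "Hello, world", "Second  title"]
csv_text = titles_to_csv(clean_titles(raw), "http://www.example.com/")
print(csv_text)
```

The csv module quotes the title that contains a comma, so reading the text back with csv.reader yields the original fields.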

5. Handle errors and response status codes

try:
    response = requests.get(url, timeout=10)
    # Treat any status other than 200 OK as a failure
    if response.status_code != 200:
        raise Exception("Error: " + str(response.status_code))
except Exception as e:
    print(e)
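
Printing the error is enough for a one-off script, but transient failures are frequently worth retrying. The sketch below is one possible approach, not part of the original example: fetch_with_retries and flaky_fetch are hypothetical names, and the fetch function is passed in as a parameter so the retry logic can be exercised without network access.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times; re-raise the last error if all fail."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # with Requests, requests.RequestException is narrower
            last_error = exc
            print(f"attempt {attempt} for {url} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)  # pause before the next attempt
    raise last_error

# Hypothetical stand-in for a network call: fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

page = fetch_with_retries(flaky_fetch, "http://www.example.com/")
print(page)
```

In a real scraper, the function passed as fetch could wrap requests.get followed by response.raise_for_status(), which raises an exception for 4xx and 5xx responses.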

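6. Speed things up with multithreading or a concurrency library

# Fetch several pages in parallel with ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor

import requests

def get_html(url):
    response = requests.get(url, timeout=10)
    html = response.text
    return html

urls = ["url1", "url2", "url3"]  # placeholders; replace with real URLs
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in the same order as urls
    results = executor.map(get_html, urls)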
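
With executor.map, an exception from one URL surfaces only when its result is read, and it interrupts iteration over the rest. A possible alternative is to submit each URL separately and collect failures per URL, as sketched below. The names fetch_all and fake_get_html are hypothetical, and fake_get_html stands in for get_html so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(fetch, urls, max_workers=4):
    """Run fetch(url) in worker threads; return ({url: result}, {url: error})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was submitted for
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc  # record the failure and keep going
    return results, errors

# Hypothetical stand-in for get_html so the sketch runs offline
def fake_get_html(url):
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<html>{url}</html>"

results, errors = fetch_all(fake_get_html, ["url1", "bad-url", "url3"])
print(sorted(results), sorted(errors))
```

Because the results are keyed by URL, duplicate URLs in the input collapse into a single entry.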