Building a Simple Web Crawler with Python


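In today's data-driven world, a web crawler is a handy tool for collecting information from the internet automatically. This article shows how to write a simple web crawler in Python that extracts data from a web page and saves it to a local file. We will use two popular libraries, requests and BeautifulSoup, to build it.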

Prerequisites

Before writing any code, make sure the following libraries are installed in your development environment:

- requests
- beautifulsoup4

You can install them with the following command:

    pip install requests beautifulsoup4

In addition, to respect a site's crawling policy, always check the target site's /robots.txt file to see which pages you are allowed to crawl.
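The robots.txt check can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser; the rules in SAMPLE_ROBOTS_TXT, the helper build_parser, and the user agent name my-crawler are all made up for illustration, and the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# 'my-crawler' is a placeholder user agent name.
allowed = parser.can_fetch('my-crawler', 'https://example-news-site.com/news/1')
blocked = parser.can_fetch('my-crawler', 'https://example-news-site.com/private/1')
print(allowed, blocked)
```

Against a live site you would instead call parser.set_url() with the site's robots.txt address and then parser.read() before asking can_fetch().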

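Project Structure

Our crawler will perform the following tasks:

1. Send an HTTP request to a given URL.
2. Parse the HTML that comes back.
3. Extract the text or links we are interested in.
4. Save the extracted data to a local file.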

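Sending the HTTP Request

First, we use the requests library to send a GET request to the target site. Here is a basic example:

    import requests

    def fetch_url(url):
        try:
            # A timeout keeps the call from hanging forever on an unresponsive server
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raise an exception for 4xx/5xx status codes
            return response.text
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return None

This code defines a function fetch_url() that takes a URL and tries to download the page content. If the request fails, it prints the error and returns None.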

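Parsing the HTML

Next, we use BeautifulSoup to parse the HTML document. Suppose we want to scrape article titles and links from a news site.

    from bs4 import BeautifulSoup

    def parse_html(html_content):
        # 'html.parser' is the parser bundled with Python, so no extra install is needed
        soup = BeautifulSoup(html_content, 'html.parser')
        return soup

This function takes an HTML string and returns a BeautifulSoup object, which we can use to look up specific tags and attributes.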

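Extracting the Data

Taking https://example-news-site.com as an example, suppose each news headline sits in an <h3> tag with the class title and contains an <a> link. We can extract all the titles and links like this:

    def extract_news(soup):
        news_list = []
        for item in soup.find_all('h3', class_='title'):
            title = item.get_text(strip=True)
            anchor = item.find('a')
            # Skip headings that have no link instead of crashing on None
            if anchor is None or not anchor.get('href'):
                continue
            news_list.append({'title': title, 'link': anchor['href']})
        return news_list

Note: the HTML structure here is hypothetical. Adjust the selectors to match the structure of the page you are actually targeting.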
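The href values pulled from a page are often relative paths such as /news/1, which are not usable on their own once saved to a file. One possible way to handle this, sketched here under the assumption that each record is a dict with a link key as in extract_news(), is to resolve every link against the page URL with urllib.parse.urljoin. The helper name absolutize_links and the sample records are hypothetical.

```python
from urllib.parse import urljoin


def absolutize_links(news_list, base_url):
    """Return a copy of news_list with every 'link' resolved against base_url."""
    return [
        {**item, 'link': urljoin(base_url, item['link'])}
        for item in news_list
    ]


# Made-up records in the same shape that extract_news() produces.
sample = [
    {'title': 'A', 'link': '/news/1'},
    {'title': 'B', 'link': 'https://other.example.com/x'},
]
resolved = absolutize_links(sample, 'https://example-news-site.com/')
for item in resolved:
    print(item['title'], item['link'])
```

Links that are already absolute pass through unchanged, so the helper is safe to apply to every record.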

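Saving the Data to a File

Finally, we save the extracted news data as a JSON file:

    import json

    def save_to_file(data, filename='news.json'):
        with open(filename, 'w', encoding='utf-8') as f:
            # ensure_ascii=False writes non-ASCII characters as-is rather than as \uXXXX escapes
            json.dump(data, f, ensure_ascii=False, indent=4)
        print(f"Data saved to {filename}")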
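To see what ensure_ascii=False changes, and to confirm that a saved file loads back unchanged, the following small sketch writes a made-up record to a temporary file and reads it back. The record and the temporary path are illustrative only.

```python
import json
import os
import tempfile

# Made-up record containing a non-ASCII character.
records = [{'title': 'Café opens', 'link': '/news/1'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'news.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
    with open(path, encoding='utf-8') as f:
        raw_text = f.read()

loaded = json.loads(raw_text)

# With the default ensure_ascii=True, the same data is written with escapes.
escaped_text = json.dumps(records)

print(raw_text)
print(escaped_text)
```

Both forms parse back to the same data; the difference is only in how readable the file is when opened in a text editor.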

Putting It All Together

Now we combine the pieces above into one complete program:

    import json

    import requests
    from bs4 import BeautifulSoup


    def fetch_url(url):
        try:
            # A timeout keeps the call from hanging forever on an unresponsive server
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raise an exception for 4xx/5xx status codes
            return response.text
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return None


    def parse_html(html_content):
        return BeautifulSoup(html_content, 'html.parser')


    def extract_news(soup):
        news_list = []
        for item in soup.find_all('h3', class_='title'):
            title = item.get_text(strip=True)
            anchor = item.find('a')
            # Skip headings that have no link instead of crashing on None
            if anchor is None or not anchor.get('href'):
                continue
            news_list.append({'title': title, 'link': anchor['href']})
        return news_list


    def save_to_file(data, filename='news.json'):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        print(f"Data saved to {filename}")


    if __name__ == '__main__':
        target_url = 'https://example-news-site.com'
        html = fetch_url(target_url)
        if html:
            soup = parse_html(html)
            news_data = extract_news(soup)
            save_to_file(news_data)

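⚠️ Things to keep in mind:
- Replace target_url with the real address you want to crawl.
- Make sure you have permission to crawl the site's content, and follow its robots.txt rules.
- Consider adding a User-Agent request header that mimics a browser, so the request is less likely to be blocked by anti-crawling measures.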

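Extending the Crawler

8.1 Adding a User-Agent

Many sites try to detect crawler traffic, and by default requests identifies itself with a python-requests User-Agent, so we set a User-Agent header that mimics a browser:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers)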

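8.2 Crawling Multiple Pages

If you need to crawl several pages, you can loop over the page links:

    base_url = 'https://example-news-site.com/page/'
    news_data = []  # collects the results from every page
    for page in range(1, 6):  # crawl the first 5 pages
        url = base_url + str(page)
        html = fetch_url(url)
        if html:
            soup = parse_html(html)
            news_data.extend(extract_news(soup))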
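When looping over many pages it is usually worth pausing between requests and dropping items that appear on more than one page. The sketch below is one possible shape for such a loop, not part of the original program: crawl_pages, its parameters, and the fake_pages data are hypothetical, and the download and extraction steps are passed in as functions so the example runs offline.

```python
import time


def crawl_pages(page_urls, fetch_page, extract_items, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests, and merge the results."""
    collected = []
    seen_links = set()
    for index, url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        html = fetch_page(url)
        if html is None:
            continue  # skip pages that failed to download
        for item in extract_items(html):
            if item['link'] not in seen_links:  # drop duplicates across pages
                seen_links.add(item['link'])
                collected.append(item)
    return collected


# Offline stand-ins for the download and extraction steps (made-up data).
fake_pages = {
    'page/1': [{'title': 'A', 'link': '/a'}],
    'page/2': [{'title': 'A', 'link': '/a'}, {'title': 'B', 'link': '/b'}],
}
result = crawl_pages(
    ['page/1', 'page/2', 'page/3'],
    fetch_page=lambda url: url if url in fake_pages else None,
    extract_items=lambda html: fake_pages[html],
    delay_seconds=0,
)
print(result)
```

With the functions from this article, you would pass fetch_page=fetch_url and extract_items=lambda html: extract_news(parse_html(html)), and keep a non-zero delay.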

8.3 Storing the Data as CSV

Besides JSON, you can also save the data in CSV format:

    import csv

    def save_to_csv(data, filename='news.csv'):
        # newline='' lets the csv module control line endings (avoids blank rows on Windows)
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['title', 'link'])
            writer.writeheader()
            writer.writerows(data)
        print(f"Data saved to {filename}")

Simply replace save_to_file(news_data) with save_to_csv(news_data).
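One detail to watch if you later add more fields to each record: csv.DictWriter raises a ValueError when a row contains a key that is not listed in fieldnames. A possible way around this, sketched below with made-up rows and an in-memory buffer instead of a real file, is to pass extrasaction='ignore' so that extra keys are silently dropped.

```python
import csv
import io

# Made-up rows; the first one carries an extra 'author' key.
rows = [
    {'title': 'A', 'link': '/a', 'author': 'x'},
    {'title': 'B', 'link': '/b'},
]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['title', 'link'], extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back to confirm that only the listed columns were written.
read_back = list(csv.DictReader(io.StringIO(csv_text, newline='')))
print(read_back)
```

Whether to ignore or reject unexpected keys is a design choice: ignoring is convenient, while the default behavior catches mistakes in the extraction step earlier.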

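Summary

This article showed how to build a basic web crawler in Python, covering the key steps of sending a request, parsing HTML, extracting data, and saving the results. The crawler is simple, but it is a good starting point for building more sophisticated ones.

As you learn more about crawling, you can try the Scrapy framework to build more powerful and maintainable crawlers, or combine your code with Selenium to handle dynamic pages rendered by JavaScript.

I hope this article was helpful! If you have any questions, feel free to leave a comment.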
