Building a Simple Web Crawler in Python


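In the era of big data, web crawlers have become an important tool for collecting information from the internet. A crawler program can automatically fetch web pages, extract structured data, and feed it into data analysis, monitoring, or knowledge-graph construction.

This article walks through writing a simple but functionally complete web crawler in Python, covering the following techniques:

- Sending HTTP requests with requests
- Parsing HTML with BeautifulSoup
- Extracting specific information with re regular expressions
- Saving the scraped data to a local CSV file
- Respecting the site's crawling rules (robots.txt)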

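Project goal

We use https://books.toscrape.com/ as the example. It is a demo site built specifically for practicing scraping. The goal is to collect the title, price, and rating of every book on every page and save the data to a CSV file.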

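Development environment and dependencies

Required modules:

- requests: sends HTTP requests
- beautifulsoup4: parses HTML
- csv: writes data to CSV files
- re: regular-expression handling
- urllib.robotparser: parses robots.txt files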

Install the third-party libraries with pip:

pip install requests beautifulsoup4

Note: csv, re, and urllib.robotparser are part of the standard library and need no separate installation.

Implementation

Step 1: Check robots.txt

To respect the site's rules, check its robots.txt file before crawling. For example:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://books.toscrape.com/robots.txt")
    rp.read()

    # Check whether crawling is allowed
    can_fetch = rp.can_fetch("*", "https://books.toscrape.com/")
    print("Crawling allowed:", can_fetch)

If the output is True, crawling is allowed.
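To see how RobotFileParser evaluates rules without touching the network, the sketch below feeds it a small in-memory robots.txt through parse(). The rules, the example.com URLs, and the build_parser helper are made up for illustration; they are not the rules of books.toscrape.com.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rp = build_parser(SAMPLE_ROBOTS)

# A path outside /private/ is allowed, a path inside it is not
print(rp.can_fetch("*", "https://example.com/catalogue/page-1.html"))
print(rp.can_fetch("*", "https://example.com/private/data.html"))

# Crawl-delay, when present, suggests how long to wait between requests
print(rp.crawl_delay("*"))
```

In the real crawler, rp.read() does the fetching. In CPython's implementation, a missing robots.txt (a 4xx response other than 401 or 403) is treated as "everything allowed", so a True result does not necessarily mean the site published explicit rules.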

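Step 2: Send the request and fetch the page

Next, we define a function that fetches the page content. The book information is parsed from it in the following step.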

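    import requests
    from bs4 import BeautifulSoup

    def fetch_page(url):
        """Return the page HTML as text, or None if the request fails."""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0 Safari/537.36'
        }
        try:
            # A timeout keeps the crawler from hanging on a stalled connection
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException as exc:
            print(f"Request failed: {exc}")
            return None
        if response.status_code == 200:
            # Assumes the page is UTF-8; set it explicitly so "£" is not
            # decoded as ISO-8859-1 when the header names no charset
            response.encoding = 'utf-8'
            return response.text
        print(f"Request failed, status code: {response.status_code}")
        return None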
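One detail worth knowing when reading response.text: if a text response's Content-Type header names no charset, requests falls back to ISO-8859-1. If the page is actually UTF-8, a pound sign then shows up as "Â£". The sketch below reproduces the effect with plain bytes and no network access; the price string is just a sample value.

```python
# "£" takes two bytes in UTF-8: 0xC2 0xA3
raw = "£51.77".encode("utf-8")

# Decoding with the wrong codec turns each byte into its own character
wrong = raw.decode("iso-8859-1")

# Decoding with the right codec restores the original text
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Setting response.encoding = 'utf-8' before reading response.text, or passing response.content to BeautifulSoup, avoids the problem, assuming the page really is UTF-8.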

Step 3: Parse the book information

Use BeautifulSoup to extract each book's title, price, and star rating.

    def parse_books(html):
        soup = BeautifulSoup(html, 'html.parser')
        books = []
        for item in soup.select('.product_pod'):
            title = item.select_one('h3 a')['title']
            price = item.select_one('.price_color').text
            # The rating is the second class name, e.g. class="star-rating Three"
            rating_class = item.select_one('p.star-rating')['class'][1]
            rating = convert_rating(rating_class)
            books.append({
                'title': title,
                'price': price,
                'rating': rating
            })
        return books

Helper function: converts the star-rating class name to a numeric rating (e.g. One -> 1)

    def convert_rating(rating_str):
        ratings = {
            'One': 1,
            'Two': 2,
            'Three': 3,
            'Four': 4,
            'Five': 5
        }
        # Unknown class names map to 0
        return ratings.get(rating_str, 0)
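The price extracted above is still a string such as "£51.77". Since re is among the tools this article lists, here is a minimal sketch of turning that string into a number with a regular expression. The parse_price name and the pattern are our own illustration, not part of the original crawler.

```python
import re

# Matches the first decimal number in a string, e.g. "51.77" in "£51.77"
PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text):
    """Return the numeric part of a price string as a float, or None."""
    match = PRICE_RE.search(text)
    return float(match.group()) if match else None


print(parse_price("£51.77"))
```

Because the pattern only looks for digits, it also tolerates a mis-decoded currency symbol in front of the number. For real money calculations, decimal.Decimal would be a safer target type than float.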

Step 4: Follow pagination

We also need to follow the pagination links until there is no next page.

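    def get_next_page(soup):
        next_link = soup.select_one('ul.pager li.next a')
        if next_link:
            # href is 'catalogue/page-2.html' on the index page but only
            # 'page-3.html' on later pages, so keep just the file name
            page_name = next_link['href'].split('/')[-1]
            return 'https://books.toscrape.com/catalogue/' + page_name
        return None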
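The next-page links on this site are relative to the page they appear on: "catalogue/page-2.html" on the index page, then "page-3.html" on the pages inside the catalogue directory. Gluing every href onto the site root would therefore produce a wrong URL from the second page onward. One general way to handle relative links is urljoin from the standard library, resolving each href against the URL of the page it came from. The resolve_next helper below is a hypothetical name used only for this sketch.

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Resolve a relative href against the page it was found on."""
    return urljoin(current_url, href)


index_url = "https://books.toscrape.com/index.html"

# From the index page, the link points into the catalogue directory
page2 = resolve_next(index_url, "catalogue/page-2.html")

# From a catalogue page, the link is relative to that directory
page3 = resolve_next(page2, "page-3.html")

print(page2)
print(page3)
```

Using this approach inside get_next_page would mean passing the current page URL in as an extra argument.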

Putting it together in the main function:

    import csv

    def main():
        base_url = "https://books.toscrape.com/index.html"
        current_url = base_url
        all_books = []

        while current_url:
            html = fetch_page(current_url)
            if not html:
                break
            soup = BeautifulSoup(html, 'html.parser')
            books = parse_books(html)
            all_books.extend(books)
            current_url = get_next_page(soup)
            print(f"Scraped {len(books)} books, moving on to the next page...")

        # Save to a CSV file
        with open('books.csv', mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['title', 'price', 'rating'])
            writer.writeheader()
            writer.writerows(all_books)

        print(f"Scraped {len(all_books)} books in total, saved to books.csv")
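To make the CSV step easier to inspect, the sketch below writes a couple of rows to an in-memory buffer instead of a file and reads them back. The rows are placeholders in the same shape as the dictionaries parse_books returns; they are not real data from the site.

```python
import csv
import io

# Placeholder rows, not real site data
sample_books = [
    {"title": "Example Book A", "price": "£10.00", "rating": 3},
    {"title": "Example, with comma", "price": "£5.50", "rating": 5},
]

# Write to an in-memory buffer with the same DictWriter calls as main()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price", "rating"])
writer.writeheader()
writer.writerows(sample_books)

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: every value, including the rating, is now a string
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[1]["title"], rows[1]["rating"])
```

Two things stand out: DictWriter quotes a title that contains a comma, so it survives the round trip, and numeric fields come back as strings and need converting when the file is loaded again.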

Step 5: Run the crawler

Finally, call the main function to start the crawler:

    if __name__ == "__main__":
        main()

Complete code

Here is the complete code for the project:

    import csv
    from urllib.robotparser import RobotFileParser

    import requests
    from bs4 import BeautifulSoup


    def fetch_page(url):
        """Return the page HTML as text, or None if the request fails."""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0 Safari/537.36'
        }
        try:
            # A timeout keeps the crawler from hanging on a stalled connection
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException as exc:
            print(f"Request failed: {exc}")
            return None
        if response.status_code == 200:
            # Assumes the page is UTF-8; set it explicitly so "£" is not
            # decoded as ISO-8859-1 when the header names no charset
            response.encoding = 'utf-8'
            return response.text
        print(f"Request failed, status code: {response.status_code}")
        return None


    def convert_rating(rating_str):
        ratings = {
            'One': 1,
            'Two': 2,
            'Three': 3,
            'Four': 4,
            'Five': 5
        }
        # Unknown class names map to 0
        return ratings.get(rating_str, 0)


    def parse_books(html):
        soup = BeautifulSoup(html, 'html.parser')
        books = []
        for item in soup.select('.product_pod'):
            title = item.select_one('h3 a')['title']
            price = item.select_one('.price_color').text
            # The rating is the second class name, e.g. class="star-rating Three"
            rating_class = item.select_one('p.star-rating')['class'][1]
            rating = convert_rating(rating_class)
            books.append({
                'title': title,
                'price': price,
                'rating': rating
            })
        return books


    def get_next_page(soup):
        next_link = soup.select_one('ul.pager li.next a')
        if next_link:
            # href is 'catalogue/page-2.html' on the index page but only
            # 'page-3.html' on later pages, so keep just the file name
            page_name = next_link['href'].split('/')[-1]
            return 'https://books.toscrape.com/catalogue/' + page_name
        return None


    def main():
        base_url = "https://books.toscrape.com/index.html"
        current_url = base_url
        all_books = []

        # Check robots.txt
        rp = RobotFileParser()
        rp.set_url("https://books.toscrape.com/robots.txt")
        rp.read()
        if not rp.can_fetch("*", base_url):
            print("This site does not allow crawling!")
            return

        while current_url:
            html = fetch_page(current_url)
            if not html:
                break
            soup = BeautifulSoup(html, 'html.parser')
            books = parse_books(html)
            all_books.extend(books)
            current_url = get_next_page(soup)
            print(f"Scraped {len(books)} books, moving on to the next page...")

        # Save to a CSV file
        with open('books.csv', mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['title', 'price', 'rating'])
            writer.writeheader()
            writer.writerows(all_books)

        print(f"Scraped {len(all_books)} books in total, saved to books.csv")


    if __name__ == "__main__":
        main()

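Summary and suggested improvements

This project implements a basic web crawler with the following features:

- Respects the site's robots.txt
- Parses HTML with BeautifulSoup and CSS selectors
- Follows pagination across multiple pages
- Stores the data in a clear format (CSV)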

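Possible further improvements:

- Add concurrency: use concurrent.futures or aiohttp to fetch pages concurrently and improve throughput.
- Throttle requests: add time.sleep() between requests to avoid putting too much load on the server.
- Exception handling: add timeout retries, proxy IP rotation, and similar mechanisms.
- Database storage: store the data in a database such as MySQL or MongoDB.
- Logging: use the logging module to record key events while the crawler runs.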
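As a rough illustration of three of these suggestions together (request throttling, retries, and logging), here is a minimal sketch. The fetch_with_retry name, its parameters, and the doubling back-off schedule are our own assumptions, not part of the original crawler. The fetch argument can be any callable that returns page text or None on failure, such as the fetch_page function from this article.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("crawler")


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or retries run out.

    The wait between attempts starts at `delay` seconds and doubles after
    each failure. `sleep` is injectable so the waiting can be stubbed out.
    """
    wait = delay
    for attempt in range(1, retries + 1):
        result = fetch(url)
        if result is not None:
            logger.info("fetched %s on attempt %d", url, attempt)
            return result
        logger.warning("attempt %d for %s failed", attempt, url)
        if attempt < retries:
            sleep(wait)
            wait *= 2
    return None
```

In main(), this could replace the direct call, for example html = fetch_with_retry(fetch_page, current_url), with an additional short time.sleep() after each successful page to keep the request rate modest.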

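Having worked through the material above, you now know how to build a basic but complete web crawler. From here you can extend it to more practical scenarios and take on more complex data collection tasks.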
