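Crawl4AI People's Daily Online News Crawler

Crawls news from People's Daily Online (人民网) with the Crawl4AI framework. A locally installed Chrome browser can be used instead of the Playwright-bundled one.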

Installing dependencies

pip install crawl4ai playwright
playwright install chromium  # or use a local Chrome installation instead

Usage

Basic usage

# Use the default configuration (local Chrome is picked up automatically)
python crawl4ai/main.py [category] [limit] [output_file]

# Example
python crawl4ai/main.py politics 20 output/news.json

Specifying the Chrome path

# Pass the path to the Chrome executable as the fourth argument
python crawl4ai/main.py politics 20 output/news.json "C:\Program Files\Google\Chrome\Application\chrome.exe"
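
The commands above take up to four positional arguments. As a rough illustration of how such arguments could be mapped onto named values, here is a minimal sketch. The names CliArgs and parse_cli, and the default values, are assumptions made for this example; they are not taken from the project's main.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CliArgs:
    category: str
    limit: int
    output_file: str
    chrome_path: Optional[str] = None


def parse_cli(argv: List[str]) -> CliArgs:
    """Map up to four positional arguments onto named fields.

    The defaults are placeholders chosen for this sketch, not values
    confirmed by the project.
    """
    defaults = ["politics", "20", "output/news.json", None]
    if len(argv) > len(defaults):
        raise SystemExit(
            "usage: main.py [category] [limit] [output_file] [chrome_path]"
        )
    # Fill any missing trailing arguments from the defaults
    merged = list(argv) + defaults[len(argv):]
    category, limit, output_file, chrome_path = merged
    return CliArgs(category, int(limit), output_file, chrome_path)


print(parse_cli(["politics", "20", "output/news.json"]))
```

Because all arguments are positional, a Chrome path can only be supplied when the three preceding arguments are also given.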

Using it in code

import asyncio
from crawl4ai.PeopleNetCrewer import PeopleNetCrewer

async def main():
    # Use the default Chrome (auto-detected)
    crewer = PeopleNetCrewer()

    # Or specify the Chrome path explicitly
    # crewer = PeopleNetCrewer(chrome_path="C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe")

    try:
        news_list = await crewer.crawl(category="politics", limit=20)

        for news in news_list:
            print(f"Title: {news.title}")
            print(f"URL: {news.url}")
            print("-" * 50)
    finally:
        # Release the browser even if the crawl raises
        await crewer.close()

if __name__ == "__main__":
    asyncio.run(main())
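
The command-line examples write results to a file such as output/news.json. When calling the crawler from code, the results could be persisted in a similar way. The sketch below is only an illustration: NewsItem is a hypothetical stand-in for the objects returned by crawl() (only the title and url attributes appear in the example above), and save_news is a helper invented for this example, not part of the project.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class NewsItem:
    # Hypothetical stand-in for the objects returned by crawl()
    title: str
    url: str


def save_news(news_list: List[NewsItem], output_file: str) -> int:
    """Write the news items to a UTF-8 JSON file and return the item count."""
    path = Path(output_file)
    # Create the output directory (e.g. "output/") if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    records = [asdict(item) for item in news_list]
    # ensure_ascii=False keeps Chinese headlines readable in the file
    path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return len(records)
```

If the real news objects are not dataclasses, the list comprehension would need to build the dictionaries by hand, for example from news.title and news.url.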

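Configuration

Using local Chrome

The code automatically tries to use the locally installed Chrome browser. If chrome_path is not specified, it selects the system's default Chrome through the channel="chrome" parameter.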

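Browser configuration

In the PeopleNetCrewer class, browser behavior can be adjusted by modifying browser_config in the _get_crawler method:

headless: whether to run in headless mode (default True)
verbose: whether to print detailed logs (default False)
channel: browser channel ("chrome" means use the local Chrome)
executable_path: path to the browser executable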
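
The selection rule described in this section (use executable_path when a Chrome path is given, otherwise fall back to channel="chrome") can be sketched as a small helper. This is an illustration only: build_browser_options is a hypothetical function, and it returns a plain dictionary rather than the project's actual browser_config object, whose exact type is not shown here.

```python
from typing import Any, Dict, Optional


def build_browser_options(
    chrome_path: Optional[str] = None,
    headless: bool = True,
    verbose: bool = False,
) -> Dict[str, Any]:
    """Assemble the four options listed above into one dictionary."""
    options: Dict[str, Any] = {"headless": headless, "verbose": verbose}
    if chrome_path:
        # An explicit path takes priority over channel-based lookup
        options["executable_path"] = chrome_path
    else:
        # No path given: ask for the system's default Chrome
        options["channel"] = "chrome"
    return options


print(build_browser_options())
print(build_browser_options(chrome_path="/usr/bin/google-chrome", headless=False))
```

Setting only one of channel and executable_path keeps it unambiguous which browser binary is meant to be launched.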

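Notes

Make sure the Chrome browser is installed.
If you hit a "Playwright browser not found" error, run playwright install chromium to install the browser bundled with Playwright.
When using local Chrome, make sure the Chrome version is compatible with Playwright.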
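
To check the first note before starting a crawl, one option is to look for a Chrome executable up front. The following is a minimal sketch using only the standard library. The find_chrome helper and its candidate list are assumptions made for this example (apart from the Windows path, which is the one used earlier in this README); the project's own auto-detection may work differently.

```python
import os
import shutil
from typing import Iterable, Optional

# Candidate command names and paths are illustrative, not exhaustive
DEFAULT_CANDIDATES = (
    "google-chrome",
    "chrome",
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",
)


def find_chrome(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the first candidate that resolves to an executable, else None."""
    for candidate in candidates:
        # Accept a full path that points at an existing file
        if os.path.isfile(candidate):
            return candidate
        # Otherwise try to resolve a bare command name via PATH
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None


chrome = find_chrome()
print(chrome if chrome else "Chrome not found; consider: playwright install chromium")
```

A non-None result could then be passed as chrome_path, while None suggests falling back to the Playwright-bundled Chromium.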
