Important news

2025-11-21 14:55:50 +08:00
parent 0e7cee3070
commit 7ccec2b624
4 changed files with 2018 additions and 219 deletions
--- a/schoolNewsCrawler/crawler/xxqg/README_important_crawler.md
+++ b/schoolNewsCrawler/crawler/xxqg/README_important_crawler.md
@@ -0,0 +1,140 @@
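+# Xuexi Qiangguo (学习强国) Important News Crawler Usage Guide
+
+## Overview
+
+The `XxqgCrawler` class gains a new `crawl_important` method, which crawls article content from the "Important News" (重要新闻) column of Xuexi Qiangguo.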
+
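+## How It Works
+
+The method combines the strengths of the legacy `myQiangguo` crawler and the newer Selenium-based crawler:
+
+1. **Fetch the article list**: following the legacy crawler's approach, use the `requests` library to call the JSON endpoint directly and obtain the article list
+   - JSON endpoint: `https://www.xuexi.cn/lgdata/1jscb6pu1n2.json?_st=26095725`
+   - Returns a list with basic information such as each article's URL, title, and source
+
+2. **Parse article details**: use the existing `parse_news_detail` method (Selenium-based) to parse the full content of each article
+   - Extract the title, publish time, and source
+   - Extract the body content (text, images, video)
+   - Save the complete article structure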
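+
+Step 1 can be sketched roughly as follows. This is an illustration under assumptions, not the code inside `crawl_important`: the helper names `extract_articles` and `fetch_article_list` are hypothetical, and the sketch assumes the endpoint returns a top-level JSON list whose entries carry `url`, `title`, and `source` keys, which this README does not confirm.
+
+```python
+from typing import Any, Dict, List
+
+LIST_URL = "https://www.xuexi.cn/lgdata/1jscb6pu1n2.json?_st=26095725"
+
+
+def extract_articles(payload: Any, max_count: int = 60) -> List[Dict[str, str]]:
+    """Keep entries that carry a URL, up to max_count.
+
+    ASSUMPTION: payload is a list of dicts with "url", "title" and "source" keys.
+    """
+    articles: List[Dict[str, str]] = []
+    if not isinstance(payload, list):
+        return articles
+    for entry in payload:
+        if len(articles) >= max_count:
+            break
+        # Skip anything that is not a dict or has no usable URL.
+        if not isinstance(entry, dict) or not entry.get("url"):
+            continue
+        articles.append({
+            "url": entry["url"],
+            "title": entry.get("title", ""),
+            "source": entry.get("source", ""),
+        })
+    return articles
+
+
+def fetch_article_list(list_url: str = LIST_URL, max_count: int = 60) -> List[Dict[str, str]]:
+    """Request the JSON endpoint and return the basic article info."""
+    import requests  # imported lazily so extract_articles works without it
+
+    response = requests.get(list_url, timeout=10)
+    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
+    return extract_articles(response.json(), max_count)
+```
+
+Keeping the parsing step separate from the network call makes it possible to check the field handling against a saved response without contacting the site.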
+
+## Usage
+
+### Basic usage
+
+```python
+from crawler.xxqg.XxqgCrawler import XxqgCrawler
+
+# Initialize the crawler
+crawler = XxqgCrawler()
+
+# Crawl Important News (up to 60 articles by default)
+result = crawler.crawl_important()
+
+# Check the result
+if result.success:
+    print(f"Crawled {len(result.dataList)} news articles successfully")
+    for news in result.dataList:
+        print(f"Title: {news.title}")
+        print(f"Source: {news.source}")
+        print(f"Publish time: {news.publishTime}")
+else:
+    print(f"Crawl failed: {result.message}")
+
+# Close the browser
+crawler.driver.quit()
+```
+
+### Custom article count
+
+```python
+# Crawl only the first 10 articles
+result = crawler.crawl_important(max_count=10)
+```
+
+### Run the test script
+
+```bash
+cd f:/Project/schoolNews/schoolNewsCrawler/crawler/xxqg
+python test_important_crawler.py
+```
+
+## Output
+
+When the crawl finishes, the results are saved automatically to `Xxqg_important_news.json`, which contains the following information:
+
+```json
+[
+    {
+        "title": "Article title",
+        "url": "Article URL",
+        "source": "Source",
+        "publishTime": "Publish time",
+        "contentRows": [
+            {
+                "type": "text",
+                "content": "Paragraph text"
+            },
+            {
+                "type": "img",
+                "content": "<img src='image URL' />"
+            }
+        ]
+    }
+]
+```
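+
+As a rough illustration of consuming this file, the sketch below loads it and joins the `text` rows of one article into plain text. The helper names `load_articles` and `plain_text` are hypothetical, and the sketch assumes the file is UTF-8 encoded and matches the structure shown above.
+
+```python
+import json
+from typing import Any, Dict, List
+
+
+def load_articles(path: str = "Xxqg_important_news.json") -> List[Dict[str, Any]]:
+    """Load the saved crawl result (ASSUMPTION: UTF-8 encoded JSON list)."""
+    with open(path, encoding="utf-8") as handle:
+        return json.load(handle)
+
+
+def plain_text(article: Dict[str, Any]) -> str:
+    """Join the "text" rows of one article, skipping image rows and other types."""
+    rows = article.get("contentRows", [])
+    return "\n".join(
+        row.get("content", "") for row in rows if row.get("type") == "text"
+    )
+```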
+
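+## Parameters
+
+### `crawl_important(max_count=60)`
+
+- **max_count**: maximum number of articles to crawl, 60 by default
+- **Return value**: a `ResultDomain` object
+  - `success`: whether the crawl succeeded
+  - `code`: status code (0 means success, 1 means failure)
+  - `message`: status message
+  - `dataList`: list of news items (`List[NewsItem]`)
+
+## Notes
+
+1. **Browser initialization**: on first run, a Chrome browser opens automatically and visits the Xuexi Qiangguo home page to obtain cookies
+2. **CAPTCHA handling**: if a CAPTCHA appears, the program pauses for 30 seconds so the user can complete the verification manually
+3. **Crawl rate**: a random delay of 1-2 seconds is inserted between articles to avoid being blocked for requesting too quickly
+4. **Cleanup**: remember to call `crawler.driver.quit()` when finished to close the browser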
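+
+The random delay in note 3 could look roughly like the sketch below. It is not the code in `XxqgCrawler`: `crawl_with_delay`, its `parse` callback, and the injectable `sleep` parameter are hypothetical names chosen for illustration.
+
+```python
+import random
+import time
+from typing import Callable, Iterable, List, TypeVar
+
+T = TypeVar("T")
+
+
+def crawl_with_delay(
+    urls: Iterable[str],
+    parse: Callable[[str], T],
+    min_delay: float = 1.0,
+    max_delay: float = 2.0,
+    sleep: Callable[[float], None] = time.sleep,
+) -> List[T]:
+    """Call parse(url) for each URL, pausing a random interval between articles."""
+    results: List[T] = []
+    for index, url in enumerate(urls):
+        if index > 0:
+            # Wait between articles only; there is nothing to wait for before the first.
+            sleep(random.uniform(min_delay, max_delay))
+        results.append(parse(url))
+    return results
+```
+
+Passing `sleep` as a parameter lets a test replace the real pause with a recorder, so the pacing logic can be checked without waiting.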
+
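+## Comparison with the Legacy Crawler
+
+### Legacy crawler (myQiangguo)
+- Uses `requests` + `BeautifulSoup` to parse static HTML
+- Depends on the specific `data+MD5.js` endpoint format
+- Has to handle URLs in different formats (.html and .json)
+
+### New crawler (XxqgCrawler)
+- Combines `requests` for the list with `Selenium` for the article details
+- Can handle dynamically loaded content
+- Unified interface and return format
+- Better error handling and logging
+
+## Extending
+
+To crawl other columns, follow the implementation of `crawl_important` and change the JSON endpoint URL accordingly.
+
+JSON endpoints for common columns:
+- Important News (重要新闻): `https://www.xuexi.cn/lgdata/1jscb6pu1n2.json?_st=26095725`
+- Important Activities (重要活动): `https://www.xuexi.cn/lgdata/1jpuhp6fn73.json?_st=26095746`
+- Important Meetings (重要会议): `https://www.xuexi.cn/lgdata/19vhj0omh73.json?_st=26095747`
+- Important Speeches (重要讲话): `https://www.xuexi.cn/lgdata/132gdqo7l73.json?_st=26095749`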
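+
+One possible way to keep these endpoints in one place is a small lookup table, sketched below. The dictionary keys and the `list_url_for` helper are hypothetical; `crawl_important` as described here is not known to accept a URL or column argument, so a new method modeled on it would be what consumes the returned URL.
+
+```python
+# Endpoints copied from the list above; the keys are illustrative names only.
+COLUMN_LIST_URLS = {
+    "important_news": "https://www.xuexi.cn/lgdata/1jscb6pu1n2.json?_st=26095725",
+    "important_activities": "https://www.xuexi.cn/lgdata/1jpuhp6fn73.json?_st=26095746",
+    "important_meetings": "https://www.xuexi.cn/lgdata/19vhj0omh73.json?_st=26095747",
+    "important_speeches": "https://www.xuexi.cn/lgdata/132gdqo7l73.json?_st=26095749",
+}
+
+
+def list_url_for(column: str) -> str:
+    """Return the JSON list endpoint for a column, or raise ValueError if unknown."""
+    try:
+        return COLUMN_LIST_URLS[column]
+    except KeyError:
+        known = ", ".join(sorted(COLUMN_LIST_URLS))
+        raise ValueError(f"unknown column {column!r}; known columns: {known}") from None
+```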
+
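+## Architecture
+
+```
+crawl_important()
+├── requests fetches the JSON list
+│   └── Parse article URLs and basic info
+├── Iterate over the URL list
+│   ├── parse_news_detail() (Selenium)
+│   │   ├── Visit the article page
+│   │   ├── Extract title, time, source
+│   │   └── Parse content (text, images, video)
+│   └── Fill in missing fields
+└── Save results to a JSON file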
+```