schoolNews/schoolNewsCrawler/crawler/人民网界面结构.md
2025-11-10 15:22:44 +08:00

# Crawler pages
## [People's Daily Online search](http://search.people.cn/) and hot-topic ranking
```python
CrawlerConfig(
    base_url="http://www.people.com.cn",
    urls={
        # Full-text search endpoint
        "search": UrlConfig(
            url="http://search.people.cn/search-platform/front/search",
            method="POST",
            params={
                "key": "",
                "page": 1,
                "limit": 10,
                "hasTitle": True,
                "hasContent": True,
                "isFuzzy": True,
                "type": 0,  # 0 = all, 1 = news, 2 = interactive, 3 = newspapers and periodicals, 4 = images, 5 = video
                "sortType": 2,  # 1 = by relevance, 2 = by time
                "startTime": 0,
                "endTime": 0
            }
        ),
        # Hot-topic ranking endpoint
        "hot_point_rank": UrlConfig(
            url="http://search.people.cn/search-platform/front/searchRank",
            method="GET",
            params={}
        )
    },
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Content-Type': 'application/json;charset=UTF-8'
    }
)
```
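To illustrate how the `search` entry might be turned into a request, here is a minimal sketch using only the standard library. It assumes the parameters are sent as a JSON body (suggested by the `Content-Type` header, but not confirmed in these notes). `build_search_request` is a hypothetical helper, not part of the crawler, and the keyword is a made-up example.

```python
import json
import urllib.request

SEARCH_URL = "http://search.people.cn/search-platform/front/search"

# Default parameters copied from the config above.
DEFAULT_PARAMS = {
    "key": "", "page": 1, "limit": 10,
    "hasTitle": True, "hasContent": True, "isFuzzy": True,
    "type": 0, "sortType": 2, "startTime": 0, "endTime": 0,
}

HEADERS = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Content-Type": "application/json;charset=UTF-8",
}


def build_search_request(keyword, page=1, limit=10):
    """Hypothetical helper: build (but do not send) the search POST request."""
    payload = dict(DEFAULT_PARAMS, key=keyword, page=page, limit=limit)
    # ensure_ascii=False keeps non-ASCII keywords readable in the body.
    # Assumption: the endpoint expects the parameters as a JSON body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(SEARCH_URL, data=body, headers=HEADERS, method="POST")


req = build_search_request("education", page=2)
```

The request object can then be passed to `urllib.request.urlopen(req)`; sending is left out here so the sketch has no network dependency.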
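## [Featured headlines](http://www.people.com.cn/GB/59476/index.html)
> Fetch all featured headlines for a given date

`http://www.people.com.cn/GB/59476/review/yyyyMMdd.html`

The response is a single HTML file that contains all of that day's featured news.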
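The date-based URL can be built as in the sketch below. It assumes `yyyyMMdd` means a four-digit year followed by zero-padded month and day; `review_url` is a hypothetical helper name.

```python
from datetime import date

REVIEW_URL_TEMPLATE = "http://www.people.com.cn/GB/59476/review/{day}.html"


def review_url(day: date) -> str:
    """Hypothetical helper: URL of the featured-headlines page for one date."""
    # Assumption: yyyyMMdd corresponds to strftime's %Y%m%d.
    return REVIEW_URL_TEMPLATE.format(day=day.strftime("%Y%m%d"))


print(review_url(date(2025, 11, 10)))
# http://www.people.com.cn/GB/59476/review/20251110.html
```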
## Data structure inside a news detail page's HTML
```sh
---------------------------------------------------------------------------
Navigation bars such as xxxx
---------------------------------------------------------------------------
# Two-column layout:  col col-1 fr                  | col col-2 fr
News title            h1                            | Hot ranking  rm_ranking cf
News author           author cf                     |
Time, channel         channel cf col-1-1 fl         |
News content          rm_txt_con cf                 | QR code      tjewm1 cf
  (contains img and video tags)
  img and video paths are relative; prepend baseUrl
---------------------------------------------------------------------------
```
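The layout above could be read with a parser along the following lines. This is a sketch under assumptions: it treats `rm_txt_con` as a `div`, takes the first-column `h1` as the title, and resolves media paths against `http://www.people.com.cn`. `DetailMediaParser` and the sample HTML (including its file paths) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.people.com.cn"


class DetailMediaParser(HTMLParser):
    """Collects the h1 title and absolute img/video URLs inside rm_txt_con."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.media = []
        self._in_h1 = False
        self._content_depth = 0  # > 0 while inside the rm_txt_con div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h1":
            self._in_h1 = True
        if tag == "div":
            # Track nesting so we know when the content div closes.
            if self._content_depth or "rm_txt_con" in classes:
                self._content_depth += 1
        elif self._content_depth and tag in ("img", "video", "source") and attrs.get("src"):
            # Media paths are relative, so join them with the base URL.
            self.media.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "div" and self._content_depth:
            self._content_depth -= 1

    def handle_data(self, data):
        if self._in_h1:
            self.title += data.strip()


# Hypothetical sample page; the real markup may differ.
sample = """
<div class="col col-1 fr">
  <h1>Sample title</h1>
  <div class="rm_txt_con cf">
    <p><img src="/NMediaFile/2025/1110/a.jpg"></p>
    <video src="/mediafile/b.mp4"></video>
  </div>
  <img src="/outside.png">
</div>
"""
parser = DetailMediaParser(BASE_URL)
parser.feed(sample)
print(parser.title, parser.media)
```

Only media inside the content block is collected; the image outside `rm_txt_con` is ignored.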
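## [Anti-corruption](http://fanfu.people.com.cn/index1.html)
Paginated listing: `http://fanfu.people.com.cn/index{page}.html`
Build the URL from the page number, send a GET request, and wait for the HTML.
Visit each individual news link found in the listing, then process it as a news detail page.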
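The pagination and link-collection steps might look like the sketch below. The notes do not say how article links are distinguished from other anchors, so this version simply collects every `href` on the page and leaves filtering to the caller. `list_page_url`, `LinkCollector`, and `extract_links` are hypothetical names, and fetching the HTML is omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

LIST_URL_TEMPLATE = "http://fanfu.people.com.cn/index{page}.html"


def list_page_url(page: int) -> str:
    """Build the listing URL for a given page number."""
    return LIST_URL_TEMPLATE.format(page=page)


class LinkCollector(HTMLParser):
    """Collects unique absolute URLs from all <a href=...> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the listing page URL.
            url = urljoin(self.page_url, href)
            if url not in self.links:
                self.links.append(url)


def extract_links(page_url: str, html: str) -> list:
    """Return de-duplicated absolute links found in a listing page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links
```

Each returned URL would then be fetched and handled as a detail page; a real crawler would likely narrow the list to article links first.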