# Crawler pages

## [People's Daily Online search](http://search.people.cn/) and hot-topic ranking

```python
CrawlerConfig(
    base_url="http://www.people.com.cn",
    urls={
        # Full-text search endpoint
        "search": UrlConfig(
            url="http://search.people.cn/search-platform/front/search",
            method="POST",
            params={
                "key": "",
                "page": 1,
                "limit": 10,
                "hasTitle": True,
                "hasContent": True,
                "isFuzzy": True,
                "type": 0,  # 0 all, 1 news, 2 interactive, 3 newspapers, 4 images, 5 video
                "sortType": 2,  # 1 by relevance, 2 by time
                "startTime": 0,
                "endTime": 0,
            },
        ),
        # Hot-topic ranking endpoint
        "hot_point_rank": UrlConfig(
            url="http://search.people.cn/search-platform/front/searchRank",
            method="GET",
            params={},
        ),
    },
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Content-Type': 'application/json;charset=UTF-8',
    },
)
```

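
The configuration above does not show how `CrawlerConfig` and `UrlConfig` are defined or used. The following is a minimal sketch, assuming both are plain dataclasses and that `POST` parameters are sent as a JSON body (a guess based on the `Content-Type` header). The class definitions and the `build_request` helper are hypothetical and only build a request object; nothing is sent over the network.

```python
import json
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class UrlConfig:
    """Hypothetical stand-in for the project's UrlConfig."""
    url: str
    method: str = "GET"
    params: dict = field(default_factory=dict)


@dataclass
class CrawlerConfig:
    """Hypothetical stand-in for the project's CrawlerConfig."""
    base_url: str
    urls: dict
    headers: dict = field(default_factory=dict)


def build_request(config, name, **overrides):
    """Build (but do not send) a urllib Request for the named endpoint."""
    endpoint = config.urls[name]
    # Default params from the config, overridden per call.
    params = {**endpoint.params, **overrides}
    if endpoint.method.upper() == "POST":
        # Assumption: the API expects the params as a JSON body.
        body = json.dumps(params).encode("utf-8")
        return Request(endpoint.url, data=body,
                       headers=config.headers, method="POST")
    url = endpoint.url
    if params:
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers=config.headers, method="GET")


config = CrawlerConfig(
    base_url="http://www.people.com.cn",
    urls={
        "search": UrlConfig(
            url="http://search.people.cn/search-platform/front/search",
            method="POST",
            params={"key": "", "page": 1, "limit": 10},
        ),
        "hot_point_rank": UrlConfig(
            url="http://search.people.cn/search-platform/front/searchRank",
        ),
    },
    headers={"Content-Type": "application/json;charset=UTF-8"},
)

search_req = build_request(config, "search", key="example", page=2)
rank_req = build_request(config, "hot_point_rank")
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call. Keeping construction separate from sending makes the request easy to inspect before any traffic goes out.
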
## [Featured headlines (精彩头条)](http://www.people.com.cn/GB/59476/index.html)

> Look up all featured headlines for a given date

`http://www.people.com.cn/GB/59476/review/yyyyMMdd.html`

The response is a single HTML file that contains all featured news items for that day.

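
As an illustration of the `yyyyMMdd` pattern, the daily review URL can be built from a `datetime.date`. This is a small sketch; the function name `review_url` is made up for this example.

```python
from datetime import date

REVIEW_URL_TEMPLATE = "http://www.people.com.cn/GB/59476/review/{day}.html"


def review_url(day: date) -> str:
    """Return the featured-headlines review URL for the given day."""
    # %Y%m%d is the strftime equivalent of the yyyyMMdd pattern.
    return REVIEW_URL_TEMPLATE.format(day=day.strftime("%Y%m%d"))


print(review_url(date(2024, 1, 5)))
# → http://www.people.com.cn/GB/59476/review/20240105.html
```
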
## Data structure of a news detail page (HTML)

```text
---------------------------------------------------------------------------
Navigation bar (xxxx, etc.)
---------------------------------------------------------------------------
# Two-column layout: col col-1 fr                | col col-2 fr
News title: h1                                   | Hot ranking: rm_ranking cf
News author: author cf                           |
Time and channel: channel cf (col-1-1 fl)        |
News body (with img/video tags): rm_txt_con cf   | QR code: tjewm1 cf
img and video sources are relative paths; join them with baseUrl
---------------------------------------------------------------------------
```

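
The layout above can be turned into a small parser. The sketch below uses only the standard library and assumes that the news body is a `div` whose class list contains `rm_txt_con` and that the title is the page's only `h1`. The class `DetailPageParser` and the sample HTML (including its file paths) are invented for illustration and are not taken from a real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.people.com.cn"  # base_url from the crawler config


class DetailPageParser(HTMLParser):
    """Collect the h1 title and media URLs found inside the rm_txt_con div."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.media_urls = []
        self._in_h1 = False
        self._body_depth = 0  # > 0 while inside the rm_txt_con div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h1":
            self._in_h1 = True
        elif tag == "div":
            # Track nested divs so we know when the body div closes.
            if self._body_depth > 0 or "rm_txt_con" in classes:
                self._body_depth += 1
        elif tag in ("img", "video", "source") and self._body_depth > 0:
            src = attrs.get("src")
            if src:
                # Relative paths are joined with the base URL.
                self.media_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "div" and self._body_depth > 0:
            self._body_depth -= 1

    def handle_data(self, data):
        if self._in_h1:
            self.title += data.strip()


SAMPLE_HTML = """
<div class="col col-1 fl">
  <h1>Sample title</h1>
  <div class="rm_txt_con cf">
    <p>Text</p>
    <img src="/img/pic.jpg">
    <div><video src="video/clip.mp4"></video></div>
  </div>
  <img src="/outside.png">
</div>
"""

parser = DetailPageParser(BASE_URL)
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.media_urls)
```

The image outside the body `div` is skipped on purpose. If a real page resolves relative paths against the article URL rather than the site root, pass that URL as `base_url` instead.
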
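## [Anti-corruption (反腐)](http://fanfu.people.com.cn/index1.html)

Paginated listing: `http://fanfu.people.com.cn/index{page}.html`

Build the URL from the page number and send a GET request; the response is an HTML page.

Visit each individual news link found in that page, then process it as a detail page using the structure described above.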
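
The pagination and link-following steps could look roughly like this. It is a sketch under stated assumptions: article links are taken to be any `<a href>` ending in `.html`, which may be too broad or too narrow for the real listing page, and the helper names and sample HTML are invented. Fetching is left out so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

LIST_URL_TEMPLATE = "http://fanfu.people.com.cn/index{page}.html"


def list_page_url(page: int) -> str:
    """Return the listing URL for the given page number."""
    return LIST_URL_TEMPLATE.format(page=page)


class LinkCollector(HTMLParser):
    """Collect absolute URLs of links that look like article pages."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: article links end in ".html".
        if href and href.endswith(".html"):
            url = urljoin(self.page_url, href)
            if url not in self.links:  # drop duplicates, keep order
                self.links.append(url)


def extract_article_links(html: str, page_url: str) -> list:
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links


SAMPLE_LIST_HTML = """
<ul>
  <li><a href="/news/a.html">First</a></li>
  <li><a href="news/b.html">Second</a></li>
  <li><a href="javascript:void(0)">More</a></li>
</ul>
"""

links = extract_article_links(SAMPLE_LIST_HTML, list_page_url(1))
print(links)
```

Each URL in `links` would then be fetched and parsed as a detail page.
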