Debug and modify the crawler

This commit is contained in:
2025-11-12 19:16:50 +08:00
parent 675e6da7d7
commit e55a52f20b
27 changed files with 1023 additions and 601 deletions

View File

@@ -1,7 +0,0 @@
{
"code": "0",
"message": "获取搜索结果失败",
"success": false,
"data": null,
"dataList": []
}

View File

@@ -1,324 +0,0 @@
{
"code": 0,
"message": "",
"success": true,
"data": null,
"dataList": [
{
"title": "",
"contentRows": [],
"url": "http://cpc.people.com.cn/n1/2025/1109/c435113-40599647.html",
"publishTime": "",
"author": "",
"source": "人民网",
"category": ""
},
{
"title": "习近平在广东考察",
"contentRows": [
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/mediafile/pic/BIG/20251108/12/10441932996427049992.jpg' />"
},
{
"tag": "p",
"content": "<p>  11月7日至8日中共中央总书记、国家主席、中央军委主席习近平在广东考察。这是7日下午习近平在位于梅州市梅县区雁洋镇的叶剑英纪念馆参观叶剑英生平事迹陈列。</p>"
},
{
"tag": "p",
"content": "<p>  新华社记者 谢环驰 摄</p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/img/2020wbc/imgs/share.png' />"
}
],
"url": "http://pic.people.com.cn/n1/2025/1108/c426981-40599554.html",
"publishTime": "2025年11月08日17:22",
"author": "",
"source": "新华社",
"category": ""
},
{
"title": "",
"contentRows": [],
"url": "http://cpc.people.com.cn/n1/2025/1031/c64094-40593715.html",
"publishTime": "",
"author": "",
"source": "人民网",
"category": ""
},
{
"title": "习近平抵达韩国",
"contentRows": [
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "img",
"content": "<img style='text-align: center;' src='http://www.people.com.cn/mediafile/pic/20251031/24/17044241366860047372.jpg' />"
},
{
"tag": "p",
"content": "<p style=\"text-align: center;\"><span style=\"color: #0000cd;\">当地时间十月三十日上午,国家主席习近平乘专机抵达韩国,应大韩民国总统李在明邀请,出席亚太经合组织第三十二次领导人非正式会议并对韩国进行国事访问。这是习近平抵达釜山金海国际机场时,韩国外长赵显等高级官员热情迎接。新华社记者 黄敬文摄</span></p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  本报韩国釜山10月30日电 记者莽九晨、杨翘楚当地时间10月30日上午国家主席习近平乘专机抵达韩国应大韩民国总统李在明邀请出席亚太经合组织第三十二次领导人非正式会议并对韩国进行国事访问。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  习近平抵达釜山金海国际机场时韩国外长赵显等高级官员热情迎接。礼兵分列红地毯两侧致敬军乐团演奏行进乐机场鸣放21响礼炮。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  蔡奇、王毅、何立峰等陪同人员同机抵达。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  先期抵达的香港特别行政区行政长官李家超、中国驻韩国大使戴兵也到机场迎接。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  中国留学生和中资企业代表挥舞中韩两国国旗,热烈欢迎习近平到访。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  本报北京10月30日电 10月30日上午国家主席习近平乘专机离开北京应大韩民国总统李在明邀请赴韩国庆州出席亚太经合组织第三十二次领导人非正式会议并对韩国进行国事访问。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  陪同习近平出访的有:中共中央政治局常委、中央办公厅主任蔡奇,中共中央政治局委员、外交部部长王毅,中共中央政治局委员、国务院副总理何立峰等。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-align: justify;\">  《人民日报》2025年10月31日 第01版</p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/img/2020wbc/imgs/share.png' />"
}
],
"url": "http://korea.people.com.cn/n1/2025/1031/c407366-40594082.html",
"publishTime": "2025年10月31日13:38",
"author": "",
"source": "人民网-人民日报",
"category": ""
},
{
"title": "习近平抵达韩国",
"contentRows": [
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "p",
"content": "<p>  当地时间十月三十日上午,国家主席习近平乘专机抵达韩国,应大韩民国总统李在明邀请,出席亚太经合组织第三十二次领导人非正式会议并对韩国进行国事访问。这是习近平抵达釜山金海国际机场时,韩国外长赵显等高级官员热情迎接。<br/>  新华社记者 黄敬文摄</p>"
},
{
"tag": "p",
"content": "<p>   本报韩国釜山10月30日电  记者莽九晨、杨翘楚当地时间10月30日上午国家主席习近平乘专机抵达韩国应大韩民国总统李在明邀请出席亚太经合组织第三十二次领导人非正式会议并对韩国进行国事访问。</p>"
},
{
"tag": "p",
"content": "<p>  习近平抵达釜山金海国际机场时韩国外长赵显等高级官员热情迎接。礼兵分列红地毯两侧致敬军乐团演奏行进乐机场鸣放21响礼炮。</p>"
},
{
"tag": "p",
"content": "<p>  蔡奇、王毅、何立峰等陪同人员同机抵达。</p>"
},
{
"tag": "p",
"content": "<p>  先期抵达的香港特别行政区行政长官李家超、中国驻韩国大使戴兵也到机场迎接。</p>"
},
{
"tag": "p",
"content": "<p>  中国留学生和中资企业代表挥舞中韩两国国旗,热烈欢迎习近平到访。</p>"
},
{
"tag": "p",
"content": "<p>  本报北京10月30日电  10月30日上午国家主席习近平乘专机离开北京应大韩民国总统李在明邀请赴韩国庆州出席亚太经合组织第三十二次领导人非正式会议并对韩国进行国事访问。</p>"
},
{
"tag": "p",
"content": "<p>  陪同习近平出访的有:中共中央政治局常委、中央办公厅主任蔡奇,中共中央政治局委员、外交部部长王毅,中共中央政治局委员、国务院副总理何立峰等。 </p>"
},
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "p",
"content": "<p><span id=\"paper_num\">  《 人民日报 》( 2025年10月31日 01 版)</span></p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/img/2020wbc/imgs/share.png' />"
}
],
"url": "http://politics.people.com.cn/n1/2025/1031/c1024-40593454.html",
"publishTime": "2025年10月31日06:10",
"author": "",
"source": "人民网-人民日报",
"category": ""
},
{
"title": "习近平回到北京",
"contentRows": [
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">本报北京11月1日电  11月1日晚国家主席习近平结束出席亚太经合组织第三十二次领导人非正式会议和对韩国的国事访问后回到北京。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">中共中央政治局常委、中央办公厅主任蔡奇,中共中央政治局委员、外交部部长王毅等陪同人员同机返回。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">本报韩国釜山11月1日电  记者王嵘、朱笑熺当地时间11月1日晚国家主席习近平结束出席亚太经合组织第三十二次领导人非正式会议和对韩国的国事访问返回北京。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">离开釜山时,韩国外长赵显等高级官员到机场送行。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">前往机场途中,中国留学生和中资企业代表在道路两旁挥舞中韩两国国旗,热烈祝贺习近平主席访问圆满成功。</p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/img/2020wbc/imgs/share.png' />"
}
],
"url": "http://gd.people.com.cn/n2/2025/1102/c123932-41398959.html",
"publishTime": "2025年11月02日11:15",
"author": "",
"source": "人民网-人民日报",
"category": ""
},
{
"title": "习近平回到北京",
"contentRows": [
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "p",
"content": "<p>   本报北京11月1日电  11月1日晚国家主席习近平结束出席亚太经合组织第三十二次领导人非正式会议和对韩国的国事访问后回到北京。</p>"
},
{
"tag": "p",
"content": "<p>  中共中央政治局常委、中央办公厅主任蔡奇,中共中央政治局委员、外交部部长王毅等陪同人员同机返回。</p>"
},
{
"tag": "p",
"content": "<p>  本报韩国釜山11月1日电  记者王嵘、朱笑熺当地时间11月1日晚国家主席习近平结束出席亚太经合组织第三十二次领导人非正式会议和对韩国的国事访问返回北京。</p>"
},
{
"tag": "p",
"content": "<p>  离开釜山时,韩国外长赵显等高级官员到机场送行。</p>"
},
{
"tag": "p",
"content": "<p>  前往机场途中,中国留学生和中资企业代表在道路两旁挥舞中韩两国国旗,热烈祝贺习近平主席访问圆满成功。 </p>"
},
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "p",
"content": "<p><span id=\"paper_num\">  《 人民日报 》( 2025年11月02日 01 版)</span></p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/img/2020wbc/imgs/share.png' />"
}
],
"url": "http://politics.people.com.cn/n1/2025/1102/c1024-40594763.html",
"publishTime": "2025年11月02日05:46",
"author": "",
"source": "人民网-人民日报",
"category": ""
},
{
"title": "",
"contentRows": [],
"url": "http://cpc.people.com.cn/n1/2025/1102/c64094-40594809.html",
"publishTime": "",
"author": "",
"source": "人民网",
"category": ""
},
{
"title": "《习近平的文化情缘》《习近平经济思想系列讲读》在澳门启播",
"contentRows": [
{
"tag": "p",
"content": "<p></p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">人民网澳门9月28日电 记者富子梅《习近平的文化情缘》及《习近平经济思想系列讲读》两部专题片在澳门启播仪式28日举行。澳门特区行政长官岑浩辉中宣部副部长、中央广播电视总台台长兼总编辑慎海雄中央政府驻澳门特区联络办公室主任郑新聪出席活动并致辞。</p>"
},
{
"tag": "img",
"content": "<img style='text-align: center;' src='http://www.people.com.cn/NMediaFile/2025/0928/MAIN1759049114282Z17GV1PI43.jpg' />"
},
{
"tag": "p",
"content": "<p style=\"text-align: center;\"><span desc=\"desc\">《习近平的文化情缘》《习近平经济思想系列讲读》澳门启播仪式。(澳门特区政府新闻局供图)</span></p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">岑浩辉表示,《习近平的文化情缘》《习近平经济思想系列讲读》在澳门落地启播,高度契合澳门中西荟萃、内联外通的优势和功能,具有重大而且深远的意义。期待以此为契机,持续深化推动广大澳门同胞和海内外人士对习近平新时代中国特色社会主义思想的关注、理解和实践,共同讲好中国故事、促进国际交流、不断扩大“朋友圈”</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">慎海雄指出,两部精品节目是助力澳门各界更好学习领会领袖思想的一次生动实践,是让澳门居民深切感悟中华文明深厚底蕴和新时代伟大成就的一场文化盛宴。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">郑新聪表示,两部精品节目在澳门播出,有力促进习近平文化思想、习近平经济思想的宣传普及、落地生根,将为澳门打造中西文明交流互鉴的重要窗口、推动经济适度多元发展提供精神动力和科学指引。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">9月28日起电视专题片《习近平的文化情缘》在澳门广播电视股份有限公司的澳视澳门频道、澳门有线电视股份有限公司互动新闻台、澳门莲花卫视传媒有限公司网站以及《澳门日报》《大众报》《市民日报》《濠江日报》《正报》《澳门商报》《澳门焦点报》《莲花时报》等媒体的新媒体平台陆续上线。大型专题节目《习近平经济思想系列讲读》9月28日起在澳广视旗下电视频道及新媒体平台上线播出。</p>"
},
{
"tag": "p",
"content": "<p style=\"text-indent: 2em;\">启播仪式后举行的“盛世莲开颂华章 - 中央广播电视总台与澳门各界深化合作仪式”上双方代表分别交换《中央广播电视总台与澳门特别行政区政府深化战略合作框架协议》、《国家电影局与澳门特别行政区政府社会文化司关于电影产业合作框架协议》、《十五运会和残特奥会澳门赛区筹备办公室与中央广播电视总台合作意向书》、《中央广播电视总台与澳门广播电视股份有限公司关于整频道转播央视CCTV-5体育频道的协议》、《中央广播电视总台亚太总站与澳门大学深化战略合作框架协议》等5份合作文件。</p>"
},
{
"tag": "img",
"content": "<img style='None' src='http://www.people.com.cn/img/2020wbc/imgs/share.png' />"
}
],
"url": "http://gba.people.cn/n1/2025/0928/c42272-40573895.html",
"publishTime": "2025年09月28日16:44",
"author": "",
"source": "人民网-大湾区频道",
"category": ""
},
{
"title": "",
"contentRows": [],
"url": "http://cpc.people.com.cn/n1/2025/0926/c64094-40572435.html",
"publishTime": "",
"author": "",
"source": "人民网",
"category": ""
}
]
}

View File

@@ -2,10 +2,10 @@
from typing import Callable, Dict, Optional, List, Any, Union from typing import Callable, Dict, Optional, List, Any, Union
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
import requests import requests
from bs4 import BeautifulSoup from bs4 import BeautifulSoup, NavigableString
from loguru import logger from loguru import logger
from pydantic import BaseModel, Field, HttpUrl from pydantic import BaseModel, Field, HttpUrl
import json
class UrlConfig(BaseModel): class UrlConfig(BaseModel):
"""URL配置数据模型""" """URL配置数据模型"""
@@ -49,6 +49,8 @@ class NewsItem(BaseModel):
author: Optional[str] = Field(default=None, description="作者") author: Optional[str] = Field(default=None, description="作者")
source: Optional[str] = Field(default=None, description="来源") source: Optional[str] = Field(default=None, description="来源")
category: Optional[str] = Field(default=None, description="分类") category: Optional[str] = Field(default=None, description="分类")
executeStatus: Optional[int] = Field(default=0, description="执行状态")
executeMessage: Optional[str] = Field(default=None, description="执行消息")
class BaseCrawler(ABC): class BaseCrawler(ABC):

View File

@@ -6,12 +6,15 @@ from loguru import logger
import re import re
import chardet import chardet
from datetime import datetime, timedelta from datetime import datetime, timedelta
from bs4 import NavigableString
from urllib.parse import urlparse
import json
class RmrbCrawler(BaseCrawler): class RmrbCrawler(BaseCrawler):
"""人民日报新闻爬虫""" """人民日报新闻爬虫"""
def __init__(self): def __init__(self):
"""初始化人民日报爬虫""" """初始化人民日报爬虫"""
config = CrawlerConfig( config = CrawlerConfig(
base_url="http://www.people.com.cn", base_url="http://www.people.com.cn",
@@ -62,6 +65,12 @@ class RmrbCrawler(BaseCrawler):
}, },
) )
super().__init__(config) super().__init__(config)
self.detail_map = {
"gba": self.parse_base_news_detail,
"politics": self.parse_base_news_detail,
"finance": self.parse_base_news_detail,
"cpc": self.parse_cpc_news_detail,
}
def search(self, key: str, total: int, news_type: int = 0) -> ResultDomain: def search(self, key: str, total: int, news_type: int = 0) -> ResultDomain:
""" """
@@ -104,17 +113,25 @@ class RmrbCrawler(BaseCrawler):
records = response_json.get("data", {}).get("records", []) records = response_json.get("data", {}).get("records", [])
for record in records: for record in records:
news = self.parse_news_detail(record.get("url")) news = self.parse_news_detail(record.get("url"))
if news['title'] == '': if news.title == '':
news['title'] = record.get("title") news.title = record.get("title")
if news['contentRows'] == []: if news.contentRows == []:
news['contentRows'] = record.get("contentOriginal") # 如果contentOriginal是字符串,转换为列表格式
if news['publishTime'] == '': content_original = record.get("contentOriginal")
news['publishTime'] = datetime.datetime.fromtimestamp(record.get("displayTime") / 1000).date() if isinstance(content_original, str):
if news['author'] == '': news.contentRows = [{"type": "text", "content": content_original}]
news['author'] = record.get("author") elif isinstance(content_original, list):
if news['source'] == '': news.contentRows = content_original
news['source'] = record.get("originName") if not news.contentRows:
news.executeStatus= 1
news.executeMessage = "直接从接口响应获取"
if news.publishTime == '':
news.publishTime = str(datetime.fromtimestamp(record.get("displayTime", 0) / 1000).date())
if news.author == '':
news.author = record.get("author")
if news.source == '':
news.source = record.get("originName")
news_list.append(news) news_list.append(news)
else: else:
resultDomain.code = response_json.get("code") resultDomain.code = response_json.get("code")
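The fallback step in the hunk above (fill empty fields of the parsed item from the search API record) can be sketched in isolation. This is a minimal sketch, not the project's code: `fill_from_record`, `ms_to_date_str`, and the plain dict standing in for `NewsItem` are hypothetical, and the `tz` parameter is an addition so the conversion is reproducible (the original call uses the local time zone).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def ms_to_date_str(ms: int, tz: Optional[timezone] = None) -> str:
    """Convert a millisecond epoch timestamp to a YYYY-MM-DD string."""
    return str(datetime.fromtimestamp(ms / 1000, tz=tz).date())


def fill_from_record(news: Dict[str, Any], record: Dict[str, Any],
                     tz: Optional[timezone] = None) -> Dict[str, Any]:
    """Fill empty fields of a parsed item from the API record (hypothetical helper)."""
    if not news.get("title"):
        news["title"] = record.get("title")
    if not news.get("contentRows"):
        original = record.get("contentOriginal")
        # A plain string is wrapped into the list-of-rows shape
        if isinstance(original, str):
            news["contentRows"] = [{"type": "text", "content": original}]
        elif isinstance(original, list):
            news["contentRows"] = original
    if not news.get("publishTime"):
        news["publishTime"] = ms_to_date_str(record.get("displayTime", 0), tz)
    return news


item = fill_from_record(
    {"title": "", "contentRows": [], "publishTime": ""},
    {"title": "t", "contentOriginal": "body", "displayTime": 1762603200000},
    tz=timezone.utc,
)
```

Passing an explicit time zone keeps the date stable across machines; with `tz=None` the result depends on the host's local zone, as in the original code.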
@@ -259,6 +276,27 @@ class RmrbCrawler(BaseCrawler):
return resultDomain return resultDomain
def parse_news_detail(self, url: str) -> Optional[NewsItem]: def parse_news_detail(self, url: str) -> Optional[NewsItem]:
# Derive the category from the URL's subdomain
netloc = urlparse(url).netloc
category = "gba"
if netloc.endswith('.people.com.cn'):
category = netloc.split('.')[0]
# Look up the matching parser in detail_map
parser_func = self.detail_map.get(category)
if parser_func is None:
logger.error(f"未找到对应解析器category={category}, url={url}")
return NewsItem(
url=url,
executeStatus=0,
executeMessage=f"不支持的新闻类型: {category}"
)
# Call the matching parser (a bound method, so self is already captured)
return parser_func(url)
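The subdomain-based dispatch added here can be tried outside the class. A minimal sketch under stated assumptions: `channel_of`, `dispatch`, and the two stub parsers are hypothetical names, and the stubs only return a label instead of fetching the page.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


def channel_of(url: str, default: str = "gba") -> str:
    """Return the first host label of a *.people.com.cn URL, else the default."""
    netloc = urlparse(url).netloc
    if netloc.endswith(".people.com.cn"):
        return netloc.split(".")[0]
    return default


def parse_base(url: str) -> str:
    """Stub standing in for the generic detail parser."""
    return f"base:{url}"


def parse_cpc(url: str) -> str:
    """Stub standing in for the cpc-specific detail parser."""
    return f"cpc:{url}"


DETAIL_MAP: Dict[str, Callable[[str], str]] = {
    "gba": parse_base,
    "politics": parse_base,
    "finance": parse_base,
    "cpc": parse_cpc,
}


def dispatch(url: str) -> Optional[str]:
    """Pick a parser by channel; None means the channel is unsupported."""
    parser = DETAIL_MAP.get(channel_of(url))
    return parser(url) if parser else None
```

Two consequences are worth noting. A host such as `gba.people.cn` does not end in `.people.com.cn`, so it falls through to the `"gba"` default. Hosts such as `korea.people.com.cn` or `pic.people.com.cn`, which appear in the deleted sample output, have no entry in the map and are reported as unsupported.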
def parse_base_news_detail(self, url: str) -> Optional[NewsItem]:
""" """
解析人民日报新闻详情 解析人民日报新闻详情
@@ -277,10 +315,14 @@ class RmrbCrawler(BaseCrawler):
publishTime="", publishTime="",
author="", author="",
source="人民网", source="人民网",
category="" category="",
executeStatus=1,
executeMessage="成功解析新闻"
) )
if not response: if not response:
logger.error(f"获取响应失败: {url}") logger.error(f"获取响应失败: {url}")
news.executeStatus = 0
news.executeMessage = f"获取响应失败: {url}"
return news return news
# BeautifulSoup 可以自动检测并解码编码,直接传入字节数据即可 # BeautifulSoup 可以自动检测并解码编码,直接传入字节数据即可
@@ -288,18 +330,24 @@ class RmrbCrawler(BaseCrawler):
soup = self.parse_html(response.content) soup = self.parse_html(response.content)
if not soup: if not soup:
logger.error("解析HTML失败") logger.error("解析HTML失败")
news.executeStatus = 0
news.executeMessage = f"解析HTML失败"
return news return news
# 提取主内容区域 # 提取主内容区域
main_div = soup.find("div", class_="layout rm_txt cf") main_div = soup.select_one("div.layout.rm_txt.cf")
if not main_div: if not main_div:
logger.error("未找到主内容区域") logger.error("未找到主内容区域")
news.executeStatus = 0
news.executeMessage = f"未找到主内容区域"
return news return news
# 提取文章区域 # 提取文章区域
article_div = main_div.find("div", class_="col col-1") article_div = main_div.select_one("div.col.col-1")
if not article_div: if not article_div:
logger.error("未找到文章区域") logger.error("未找到文章区域")
news.executeStatus = 0
news.executeMessage = f"未找到文章区域"
return news return news
# 提取标题 # 提取标题
@@ -380,4 +428,215 @@ class RmrbCrawler(BaseCrawler):
except Exception as e: except Exception as e:
logger.error(f"解析新闻详情失败 [{url}]: {str(e)}") logger.error(f"解析新闻详情失败 [{url}]: {str(e)}")
return None news.executeStatus = 0
news.executeMessage = f"解析新闻详情失败: {str(e)}"
return news
def parse_cpc_news_detail(self, url: str) -> Optional[NewsItem]:
"""
Parse a People's Daily news detail page (cpc channel layout)
"""
try:
response = self.fetch(url)
news = NewsItem(
title="",
contentRows=[], # fix: use contents rather than content
url=url,
publishTime="",
author="",
source="人民网",
category="",
executeStatus=1,
executeMessage="成功解析新闻"
)
if not response:
logger.error(f"获取响应失败: {url}")
news.executeStatus = 0
news.executeMessage = f"获取响应失败: {url}"
return news
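# BeautifulSoup can detect the encoding and decode on its own, so the raw bytes are passed in directly
# Given only bytes, it detects the encoding from the document itself, e.g. the HTML <meta charset> declaration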
soup = self.parse_html(response.content)
if not soup:
logger.error("解析HTML失败")
news.executeStatus = 0
news.executeMessage = f"解析HTML失败"
return news
# Extract the main content area
main_div = soup.select_one("div.text_con.text_con01")
if not main_div:
logger.error("未找到主内容区域")
news.executeStatus = 0
news.executeMessage = "未找到主内容区域"
return news
# Extract the article area
article_div = main_div.select_one("div.text_c")
if not article_div:
logger.error("未找到文章区域")
news.executeStatus = 0
news.executeMessage = "未找到文章区域"
return news
# Extract the title
title_tag = article_div.select_one("h1")
title = title_tag.get_text(strip=True) if title_tag else ""
# Extract the author
author_tag = article_div.select_one("div.author.cf")
author = author_tag.get_text(strip=True) if author_tag else ""
# Extract the publish time and source
channel_div = article_div.select_one("div.sou")
publish_time = ""
source = ""
if channel_div:
# Extract the time: take the first non-empty text node
for child in channel_div.children:
if isinstance(child, str) and child.strip():
publish_time = child.strip().split("来源:")[0].strip()
break
# Extract the source
a_tag = channel_div.find("a")
source = a_tag.get_text(strip=True) if a_tag else ""
# Strip invisible whitespace
publish_time = publish_time.replace("\xa0", " ").replace(" ", " ").strip()
# Extract the content
content_div = article_div.select_one('div.show_text')
contents = [] # rich-text rows built from the article body
# Convert each block into a Quill rich-text row
# Iterate over the direct children of show_text, preserving order
for child in content_div.children:
# Skip bare text nodes (line breaks, spaces)
if isinstance(child, NavigableString):
continue
tag_name = child.name
if tag_name is None:
continue
# Case 1: check whether this is a video container (by id pattern or inner structure)
video_tag = child.find('video') if tag_name != 'video' else child
if video_tag and video_tag.get('src'):
src = str(video_tag['src'])
p_style = video_tag.get("style", "")
if not src.startswith("http"):
src = self.config.base_url + src
contents.append({
"tag": "video",
"content": f"<video style='{p_style}' src='{src}'></video>"
})
continue
img_tag = child.find('img') if tag_name != 'img' else child
if img_tag and img_tag.get('src'):
src = str(img_tag['src'])
p_style = child.get("style", "")
if not src.startswith("http"):
src = self.config.base_url + src
contents.append({
"tag": "img",
"content": f"<img style='{p_style}' src='{src}' />"
})
continue
if tag_name == 'p':
p_style = child.get("style", "")
img_tag = child.find('img')
video_tag = child.find('video')
# Case 1: an <img> or <video> tag is present (static resource)
if img_tag or video_tag:
src = img_tag.get('src') if img_tag else video_tag.get('src')
if src:
src = str(src)
if not src.startswith(('http://', 'https://')):
src = self.config.base_url.rstrip('/') + '/' + src.lstrip('/')
tag_type = "img" if img_tag else "video"
if img_tag:
content_html = f"<img style='{p_style}' src='{src}' />"
else:
content_html = f"<video style='{p_style}' src='{src}' controls></video>"
contents.append({
"tag": tag_type,
"content": content_html
})
else:
# No src: treat it as an ordinary paragraph
contents.append({"tag": "p", "content": str(child)})
continue
# Case 2: check for a people.com.cn showPlayer script (dynamic video)
script_tags = child.find_all('script', string=True)
video_src = None
poster_url = None
for script in script_tags:
script_text = script.string or ""
if "showPlayer" not in script_text:
continue
# Use regexes to extract src and posterUrl (tolerating spaces and line breaks)
src_match = re.search(r"src\s*:\s*'([^']*)'", script_text)
poster_match = re.search(r"posterUrl\s*:\s*'([^']*)'", script_text)
if src_match:
video_src = src_match.group(1)
if poster_match:
poster_url = poster_match.group(1)
if video_src:
break # stop once a video source is found
if video_src:
# Complete the URL (make sure it is absolute)
if not video_src.startswith(('http://', 'https://')):
video_src = self.config.base_url.rstrip('/') + '/' + video_src.lstrip('/')
if poster_url and not poster_url.startswith(('http://', 'https://')):
poster_url = self.config.base_url.rstrip('/') + '/' + poster_url.lstrip('/')
# Build the video tag attributes
attrs_parts = []
if p_style:
attrs_parts.append(f"style='{p_style}'")
if poster_url:
attrs_parts.append(f"poster='{poster_url}'")
attrs_parts.append("controls")
attrs = " ".join(attrs_parts)
contents.append({
"tag": "video",
"content": f"<video {attrs} src='{video_src}'></video>"
})
else:
# Ordinary paragraph text
contents.append({
"tag": "p",
"content": str(child)
})
continue
news.title=title
news.contentRows=contents # fix: use contents rather than content
news.url=url
news.publishTime=publish_time
news.author=author
news.source=source or "人民网"
news.category=""
logger.info(f"成功解析新闻: {title}")
return news
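except Exception as e:
logger.error(f"解析新闻详情失败 [{url}]: {str(e)}")
# Return a failed item rather than falling through to None, so callers can read the status
return NewsItem(url=url, executeStatus=0, executeMessage=f"解析新闻详情失败: {str(e)}")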
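The `showPlayer` handling in the hunk above pulls `src` and `posterUrl` out of inline script text with two regexes and then makes the URLs absolute. A minimal standalone sketch follows; `extract_player` and `absolutize` are hypothetical helper names, and the sample script string is invented for illustration rather than copied from a real page.

```python
import re
from typing import Optional, Tuple

# Same patterns as in the diff: a key, a colon, then a single-quoted value
SRC_RE = re.compile(r"src\s*:\s*'([^']*)'")
POSTER_RE = re.compile(r"posterUrl\s*:\s*'([^']*)'")


def absolutize(url: str, base: str) -> str:
    """Prefix relative URLs with the site base; leave absolute URLs alone."""
    if url.startswith(("http://", "https://")):
        return url
    return base.rstrip("/") + "/" + url.lstrip("/")


def extract_player(script_text: str,
                   base: str = "http://www.people.com.cn"
                   ) -> Tuple[Optional[str], Optional[str]]:
    """Return (video_src, poster_url); either may be None if not found."""
    if "showPlayer" not in script_text:
        return None, None
    src_match = SRC_RE.search(script_text)
    poster_match = POSTER_RE.search(script_text)
    src = src_match.group(1) if src_match else ""
    poster = poster_match.group(1) if poster_match else ""
    return (
        absolutize(src, base) if src else None,
        absolutize(poster, base) if poster else None,
    )


# Invented sample input, only to exercise the helper
sample = ("showPlayer({id: 'p1', src: '/mediafile/video/a.mp4',\n"
          "  posterUrl : 'http://www.people.com.cn/img/a.jpg'});")
video_src, poster_url = extract_player(sample)
```

The patterns only match single-quoted values, as in the diff; a page that emits double-quoted values would need a second pattern.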

View File

@@ -51,7 +51,7 @@ def main():
"message": result.message, "message": result.message,
"success": result.success, "success": result.success,
"data": None, "data": None,
"dataList": [item.dict() for item in result.dataList] if result.dataList else [] "dataList": [item.model_dump() for item in result.dataList] if result.dataList else []
} }
if output_file: if output_file:

View File

@@ -81,20 +81,19 @@ def main():
try: try:
logger.info(f"开始搜索: 关键词='{key}', 数量={total}, 类型={news_type}") logger.info(f"开始搜索: 关键词='{key}', 数量={total}, 类型={news_type}")
crawler = RmrbCrawler() crawler = RmrbCrawler()
# result = crawler.search(key=key.strip(), total=total, news_type=news_type) result = crawler.search(key=key.strip(), total=total, news_type=news_type)
output = {
"code": result.code,
"message": result.message,
"success": result.success,
"data": None,
"dataList": [item.model_dump() for item in result.dataList] if result.dataList else []
}
result = None result = None
with open("../output/output.json", "r", encoding="utf-8") as f: with open(r"F:\Project\schoolNews\schoolNewsCrawler\output\output.json", "r", encoding="utf-8") as f:
result = json.load(f) result = json.load(f)
print(result)
output = result output = result
# output = {
# "code": result["code"],
# "message": result["message"],
# "success": result["success"],
# "data": None,
# "dataList": [item.model_dump() for item in result["dataList"]] if result["dataList"] else []
# }
if output_file: if output_file:
output_path = Path(output_file) output_path = Path(output_file)
output_path.parent.mkdir(parents=True, exist_ok=True) output_path.parent.mkdir(parents=True, exist_ok=True)
@@ -102,8 +101,11 @@ def main():
json.dump(output, f, ensure_ascii=False, indent=2) json.dump(output, f, ensure_ascii=False, indent=2)
logger.info(f"结果已保存到: {output_file}") logger.info(f"结果已保存到: {output_file}")
print(json.dumps(output, ensure_ascii=False, indent=2))
crawler.close() crawler.close()
# sys.exit(0 if result.success else 1)
# print(json.dumps(output, ensure_ascii=False, indent=2))
sys.exit(0 if result["success"] else 1) sys.exit(0 if result["success"] else 1)
except Exception as e: except Exception as e:
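The debug read in this hunk swaps a path relative to the working directory for a hard-coded Windows path. One portable alternative is to resolve the file relative to the script itself. This is a minimal sketch, assuming the JSON lives in an `output` directory next to the script's parent directory; that layout, the function name `output_json_path`, and the sample script path are all hypothetical.

```python
from pathlib import Path


def output_json_path(script_file: str) -> Path:
    """Resolve <project>/output/output.json from a script at <project>/<dir>/<script>."""
    return Path(script_file).resolve().parent.parent / "output" / "output.json"


# A raw string keeps backslashes literal, so sequences like "\P" or "\s"
# are not interpreted as (invalid) escape sequences.
WINDOWS_PATH = r"F:\Project\schoolNews"

# In a real script this would be output_json_path(__file__)
p = output_json_path("/tmp/schoolNewsCrawler/crawler/rmrb_search.py")
```

Resolving from the script location means the read no longer depends on the directory the command was launched from.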

View File

@@ -132,7 +132,7 @@ def main():
"message": result.message, "message": result.message,
"success": result.success, "success": result.success,
"data": None, "data": None,
"dataList": [item.dict() for item in result.dataList] if result.dataList else [] "dataList": [item.model_dump() for item in result.dataList] if result.dataList else []
} }
# 保存到文件 # 保存到文件

File diff suppressed because one or more lines are too long

View File

@@ -78,6 +78,8 @@ CREATE TABLE `tb_data_collection_item` (
`images` TEXT DEFAULT NULL COMMENT '图片列表JSON', `images` TEXT DEFAULT NULL COMMENT '图片列表JSON',
`tags` VARCHAR(500) DEFAULT NULL COMMENT '标签(逗号分隔)', `tags` VARCHAR(500) DEFAULT NULL COMMENT '标签(逗号分隔)',
`status` TINYINT(1) NOT NULL DEFAULT 0 COMMENT '状态0未处理 1已转换为资源 2已忽略', `status` TINYINT(1) NOT NULL DEFAULT 0 COMMENT '状态0未处理 1已转换为资源 2已忽略',
`execute_status` TINYINT(1) NOT NULL DEFAULT 0 COMMENT '执行状态0未执行 1已执行',
`execute_message` TEXT DEFAULT NULL COMMENT '执行结果信息',
`resource_id` VARCHAR(64) DEFAULT NULL COMMENT '转换后的资源ID', `resource_id` VARCHAR(64) DEFAULT NULL COMMENT '转换后的资源ID',
`crawl_time` DATETIME DEFAULT NULL COMMENT '爬取时间', `crawl_time` DATETIME DEFAULT NULL COMMENT '爬取时间',
`process_time` DATETIME DEFAULT NULL COMMENT '处理时间', `process_time` DATETIME DEFAULT NULL COMMENT '处理时间',

View File

@@ -127,5 +127,15 @@ public interface DataCollectionItemService {
* @since 2025-11-08 * @since 2025-11-08
*/ */
ResultDomain<Long> countByStatus(String taskId, Integer status); ResultDomain<Long> countByStatus(String taskId, Integer status);
/**
* @description Update the status of a collection item
* @param itemId collection item ID
* @param status status
* @return ResultDomain<String> operation result
* @author yslg
* @since 2025-11-08
*/
ResultDomain<String> updateItemStatus(String itemId, int status);
} }

View File

@@ -105,6 +105,16 @@ public class TbDataCollectionItem extends BaseDTO {
*/ */
private String processor; private String processor;
/**
* @description Execution status of a single news item (0: failed, 1: succeeded)
*/
private Integer executeStatus;
/**
* @description Execution message of a single news item (error details or a success note)
*/
private String executeMessage;
public String getTaskId() { public String getTaskId() {
return taskId; return taskId;
} }
@@ -248,5 +258,21 @@ public class TbDataCollectionItem extends BaseDTO {
public void setProcessor(String processor) { public void setProcessor(String processor) {
this.processor = processor; this.processor = processor;
} }
public Integer getExecuteStatus() {
return executeStatus;
}
public void setExecuteStatus(Integer executeStatus) {
this.executeStatus = executeStatus;
}
public String getExecuteMessage() {
return executeMessage;
}
public void setExecuteMessage(String executeMessage) {
this.executeMessage = executeMessage;
}
} }

View File

@@ -111,6 +111,16 @@ public class DataCollectionItemVO implements Serializable {
*/ */
private String processor; private String processor;
/**
* Execution status of a single news item (0: failed, 1: succeeded)
*/
private Integer itemExecuteStatus;
/**
* Execution message of a single news item
*/
private String itemExecuteMessage;
/** /**
* 创建时间 * 创建时间
*/ */
@@ -336,6 +346,22 @@ public class DataCollectionItemVO implements Serializable {
this.processor = processor; this.processor = processor;
} }
public Integer getItemExecuteStatus() {
return itemExecuteStatus;
}
public void setItemExecuteStatus(Integer itemExecuteStatus) {
this.itemExecuteStatus = itemExecuteStatus;
}
public String getItemExecuteMessage() {
return itemExecuteMessage;
}
public void setItemExecuteMessage(String itemExecuteMessage) {
this.itemExecuteMessage = itemExecuteMessage;
}
public Date getCreateTime() { public Date getCreateTime() {
return createTime; return createTime;
} }

View File

@@ -63,6 +63,11 @@ public class DataCollectionItemController {
return itemService.convertToResource(request.getItemId(), request.getTagId()); return itemService.convertToResource(request.getItemId(), request.getTagId());
} }
@PutMapping("/{itemId}/status/{status}")
public ResultDomain<String> updateItemStatus(@PathVariable(name = "itemId") String itemId, @PathVariable(name = "status") int status) {
return itemService.updateItemStatus(itemId, status);
}
/** /**
* @description 转换请求 * @description 转换请求
*/ */

View File

@@ -28,6 +28,15 @@ public interface CrontabLogMapper extends BaseMapper<TbCrontabLog> {
*/ */
int insertLog(@Param("log") TbCrontabLog log); int insertLog(@Param("log") TbCrontabLog log);
/**
* @description Update a log entry
* @param log log data
* @return int number of affected rows
* @author yslg
* @since 2025-11-12
*/
int updateLog(@Param("log") TbCrontabLog log);
/** /**
* @description 根据ID查询日志 * @description 根据ID查询日志
* @param logId 日志ID * @param logId 日志ID

View File

@@ -84,6 +84,16 @@ public interface DataCollectionItemMapper extends BaseMapper<TbDataCollectionIte
*/ */
long countByStatus(@Param("taskId") String taskId, @Param("status") Integer status); long countByStatus(@Param("taskId") String taskId, @Param("status") Integer status);
/**
* @description Update the status of a collection item
* @param itemId collection item ID
* @param status status
* @return int number of affected rows
* @author yslg
* @since 2025-11-08
*/
int updateItemStatus(@Param("itemId") String itemId, @Param("status") Integer status);
// ==================== VO查询方法(使用JOIN返回完整VO) ==================== // ==================== VO查询方法(使用JOIN返回完整VO) ====================
/** /**

View File

@@ -58,6 +58,10 @@ public class TaskExecutor {
log.setDeleted(false); log.setDeleted(false);
try { try {
log.setExecuteStatus(0);
log.setExecuteMessage("执行中");
int i = logMapper.insertLog(log);
// 检查是否允许并发执行 // 检查是否允许并发执行
if (task.getConcurrent() == 0) { if (task.getConcurrent() == 0) {
// TODO: 可以添加分布式锁来防止并发执行 // TODO: 可以添加分布式锁来防止并发执行
@@ -84,7 +88,7 @@ public class TaskExecutor {
log.setEndTime(endTime); log.setEndTime(endTime);
log.setExecuteDuration((int) (endTime.getTime() - startTime.getTime())); log.setExecuteDuration((int) (endTime.getTime() - startTime.getTime()));
log.setExecuteStatus(1); log.setExecuteStatus(1);
log.setExecuteMessage("执行成功"); log.setExecuteMessage(null);
logger.info("任务执行成功: {} [{}ms]", task.getTaskName(), log.getExecuteDuration()); logger.info("任务执行成功: {} [{}ms]", task.getTaskName(), log.getExecuteDuration());
} catch (Exception e) { } catch (Exception e) {
@@ -100,7 +104,7 @@ public class TaskExecutor {
} finally { } finally {
// 保存执行日志 // 保存执行日志
try { try {
logMapper.insertLog(log); logMapper.updateLog(log);
} catch (Exception e) { } catch (Exception e) {
logger.error("保存任务执行日志失败: {}", task.getTaskName(), e); logger.error("保存任务执行日志失败: {}", task.getTaskName(), e);
} }

View File

@@ -17,10 +17,13 @@ import org.xyzh.common.utils.IDUtils;
import org.xyzh.common.vo.DataCollectionItemVO; import org.xyzh.common.vo.DataCollectionItemVO;
import org.xyzh.common.vo.ResourceVO; import org.xyzh.common.vo.ResourceVO;
import org.xyzh.crontab.mapper.DataCollectionItemMapper; import org.xyzh.crontab.mapper.DataCollectionItemMapper;
import org.xyzh.crontab.mapper.CrontabLogMapper;
import org.xyzh.crontab.mapper.CrontabTaskMapper; import org.xyzh.crontab.mapper.CrontabTaskMapper;
import org.xyzh.common.dto.crontab.TbCrontabLog;
import org.xyzh.common.dto.crontab.TbCrontabTask; import org.xyzh.common.dto.crontab.TbCrontabTask;
import org.xyzh.system.utils.LoginUtil; import org.xyzh.system.utils.LoginUtil;
import java.util.ArrayList;
import java.util.Date; import java.util.Date;
import java.util.List; import java.util.List;
@@ -42,6 +45,9 @@ public class DataCollectionItemServiceImpl implements DataCollectionItemService
@Autowired @Autowired
private CrontabTaskMapper taskMapper; private CrontabTaskMapper taskMapper;
@Autowired
private CrontabLogMapper logMapper;
@Autowired @Autowired
private ResourceService resourceService; private ResourceService resourceService;
@@ -100,11 +106,23 @@ public class DataCollectionItemServiceImpl implements DataCollectionItemService
int successCount = 0; int successCount = 0;
Date now = new Date(); Date now = new Date();
List<TbDataCollectionItem> newItems = new ArrayList<>();
int result = itemMapper.batchInsertItems(itemList); for (TbDataCollectionItem it : itemList) {
if (result > 0) { TbDataCollectionItem existing = itemMapper.selectBySourceUrl(it.getSourceUrl());
successCount = result; if (existing == null) {
newItems.add(it);
}
} }
if (!newItems.isEmpty()) {
successCount = itemMapper.batchInsertItems(newItems);
}
String logId = itemList.get(0).getLogId();
TbCrontabLog log = new TbCrontabLog();
log.setID(logId);
log.setExecuteStatus(1);
log.setExecuteMessage("爬取成功,共" + itemList.size() + "条,新增" + successCount + "");
int i = logMapper.updateLog(log);
logger.info("批量创建采集项成功,共{}条,成功{}条", itemList.size(), successCount); logger.info("批量创建采集项成功,共{}条,成功{}条", itemList.size(), successCount);
resultDomain.success("批量创建采集项成功", successCount); resultDomain.success("批量创建采集项成功", successCount);
@@ -404,5 +422,21 @@ public class DataCollectionItemServiceImpl implements DataCollectionItemService
return resultDomain; return resultDomain;
} }
@Override
public ResultDomain<String> updateItemStatus(String itemId, int status) {
ResultDomain<String> resultDomain = new ResultDomain<>();
try {
int result = itemMapper.updateItemStatus(itemId, status);
if (result > 0) {
resultDomain.success("更新采集项状态成功", itemId);
} else {
resultDomain.fail("更新采集项状态失败");
}
} catch (Exception e) {
logger.error("更新采集项状态异常: ", e);
resultDomain.fail("更新采集项状态异常: " + e.getMessage());
}
return resultDomain;
}
} }
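The batch-create change in this file filters out items whose source URL is already stored before inserting. The idea can be sketched in Python (the language of the crawler side of this commit) as follows. This is an illustrative analogue, not the Java implementation: `filter_new_items` and the dict-shaped items are hypothetical, and a set of known URLs stands in for the per-item `selectBySourceUrl` lookup.

```python
from typing import Dict, Iterable, List


def filter_new_items(items: List[Dict[str, str]],
                     existing_urls: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only items whose sourceUrl has not been seen before."""
    seen = set(existing_urls)
    fresh: List[Dict[str, str]] = []
    for item in items:
        url = item["sourceUrl"]
        if url in seen:
            continue
        seen.add(url)  # also guards against duplicates inside the same batch
        fresh.append(item)
    return fresh


batch = [
    {"sourceUrl": "http://politics.people.com.cn/n1/2025/1031/c1024-40593454.html"},
    {"sourceUrl": "http://politics.people.com.cn/n1/2025/1102/c1024-40594763.html"},
    {"sourceUrl": "http://politics.people.com.cn/n1/2025/1102/c1024-40594763.html"},
]
stored = ["http://politics.people.com.cn/n1/2025/1031/c1024-40593454.html"]
new_items = filter_new_items(batch, stored)
```

Unlike the loop in the diff, this sketch also drops a URL that appears twice within one batch, and it checks membership in memory instead of issuing one query per item.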

View File

@@ -23,6 +23,9 @@ public class ArticleStruct {
private String publishTime; private String publishTime;
private String author; private String author;
private String source; private String source;
private String logId;
private Integer executeStatus;
private String executeMessage;
private List<RowStruct> contentRows; private List<RowStruct> contentRows;
@Data @Data

View File

@@ -158,7 +158,8 @@ public class NewsCrawlerTask extends PythonCommandTask {
item.setTaskId(taskId); item.setTaskId(taskId);
item.setLogId(logId); item.setLogId(logId);
item.setTitle(news.getTitle()); item.setTitle(news.getTitle());
item.setExecuteStatus(news.getExecuteStatus());
item.setExecuteMessage(news.getExecuteMessage());
// 拼接HTML内容 // 拼接HTML内容
if (news.getContentRows() != null && !news.getContentRows().isEmpty()) { if (news.getContentRows() != null && !news.getContentRows().isEmpty()) {
StringBuilder html = new StringBuilder(); StringBuilder html = new StringBuilder();

View File

@@ -99,6 +99,20 @@
</trim> </trim>
</insert> </insert>
<!-- updateLog -->
<update id="updateLog">
UPDATE tb_crontab_log
SET
<if test="log.executeStatus != null">execute_status = #{log.executeStatus},</if>
<if test="log.executeMessage != null">execute_message = #{log.executeMessage},</if>
<if test="log.exceptionInfo != null">exception_info = #{log.exceptionInfo},</if>
<if test="log.endTime != null">end_time = #{log.endTime},</if>
<if test="log.executeDuration != null">execute_duration = #{log.executeDuration},</if>
update_time = NOW()
WHERE id = #{log.ID} AND deleted = 0
</update>
<!-- 根据ID查询日志 --> <!-- 根据ID查询日志 -->
<select id="selectLogById" resultMap="BaseResultMap"> <select id="selectLogById" resultMap="BaseResultMap">
SELECT SELECT

View File

@@ -25,6 +25,8 @@
<result column="crawl_time" property="crawlTime" /> <result column="crawl_time" property="crawlTime" />
<result column="process_time" property="processTime" /> <result column="process_time" property="processTime" />
<result column="processor" property="processor" /> <result column="processor" property="processor" />
<result column="execute_status" property="executeStatus" />
<result column="execute_message" property="executeMessage" />
<result column="create_time" property="createTime" /> <result column="create_time" property="createTime" />
<result column="update_time" property="updateTime" /> <result column="update_time" property="updateTime" />
<result column="delete_time" property="deleteTime" /> <result column="delete_time" property="deleteTime" />
@@ -53,6 +55,8 @@
<result column="crawl_time" property="crawlTime" /> <result column="crawl_time" property="crawlTime" />
<result column="process_time" property="processTime" /> <result column="process_time" property="processTime" />
<result column="processor" property="processor" /> <result column="processor" property="processor" />
<result column="item_execute_status" property="itemExecuteStatus" />
<result column="item_execute_message" property="itemExecuteMessage" />
<result column="item_create_time" property="createTime" /> <result column="item_create_time" property="createTime" />
<result column="item_update_time" property="updateTime" /> <result column="item_update_time" property="updateTime" />
@@ -74,7 +78,7 @@
<sql id="Base_Column_List"> <sql id="Base_Column_List">
id, task_id, log_id, title, content, summary, source, source_url, category, author, id, task_id, log_id, title, content, summary, source, source_url, category, author,
publish_time, cover_image, images, tags, status, resource_id, crawl_time, process_time, publish_time, cover_image, images, tags, status, resource_id, crawl_time, process_time,
processor, create_time, update_time, delete_time, deleted processor, execute_status, execute_message, create_time, update_time, delete_time, deleted
</sql> </sql>
<!-- VO查询字段列表(包含关联表) --> <!-- VO查询字段列表(包含关联表) -->
@@ -98,6 +102,8 @@
i.crawl_time, i.crawl_time,
i.process_time, i.process_time,
i.processor, i.processor,
i.execute_status as item_execute_status,
i.execute_message as item_execute_message,
i.create_time as item_create_time, i.create_time as item_create_time,
i.update_time as item_update_time, i.update_time as item_update_time,
t.task_name, t.task_name,
@@ -259,7 +265,7 @@
INSERT INTO tb_data_collection_item ( INSERT INTO tb_data_collection_item (
id, task_id, log_id, title, content, summary, source, source_url, id, task_id, log_id, title, content, summary, source, source_url,
category, author, publish_time, cover_image, images, tags, status, category, author, publish_time, cover_image, images, tags, status,
resource_id, crawl_time, process_time, processor, resource_id, crawl_time, process_time, processor, execute_status, execute_message,
create_time, update_time, deleted create_time, update_time, deleted
) )
VALUES VALUES
@@ -269,7 +275,7 @@
#{item.summary}, #{item.source}, #{item.sourceUrl}, #{item.category}, #{item.summary}, #{item.source}, #{item.sourceUrl}, #{item.category},
#{item.author}, #{item.publishTime}, #{item.coverImage}, #{item.images}, #{item.author}, #{item.publishTime}, #{item.coverImage}, #{item.images},
#{item.tags}, #{item.status}, #{item.resourceId}, #{item.crawlTime}, #{item.tags}, #{item.status}, #{item.resourceId}, #{item.crawlTime},
#{item.processTime}, #{item.processor}, #{item.processTime}, #{item.processor}, #{item.executeStatus}, #{item.executeMessage},
NOW(), NOW(), 0 NOW(), NOW(), 0
) )
</foreach> </foreach>
@@ -397,4 +403,12 @@
ORDER BY i.create_time DESC ORDER BY i.create_time DESC
</select> </select>
<!-- updateItemStatus -->
<update id="updateItemStatus">
UPDATE tb_data_collection_item
SET status = #{status}
WHERE id = #{itemId}
AND deleted = 0
</update>
</mapper> </mapper>

View File

@@ -0,0 +1,9 @@
-- Add per-news-item execution status and message columns to tb_data_collection_item
-- Execution date: 2025-11-12
ALTER TABLE tb_data_collection_item
ADD COLUMN execute_status INT DEFAULT 1 COMMENT '单条新闻执行状态(0:失败 1:成功)' AFTER processor,
ADD COLUMN execute_message VARCHAR(500) DEFAULT NULL COMMENT '单条新闻执行消息(记录错误信息或成功提示)' AFTER execute_status;
-- Set the default value for existing rows
UPDATE tb_data_collection_item SET execute_status = 1 WHERE execute_status IS NULL;

View File

@@ -0,0 +1,383 @@
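# Backend: New Endpoint Implementation Guide

## Endpoint Overview

Endpoint for updating the status of a data collection item. It is called after a resource has been created, to update the item's status and record the associated resource ID.

---

## 1. Controller Layer

**File path**: `schoolNewsServ/crontab/src/main/java/org/xyzh/crontab/controller/DataCollectionItemController.java`
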
```java
package org.xyzh.crontab.controller;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.xyzh.api.crontab.DataCollectionItemService;
import org.xyzh.common.domain.ResultDomain;
import org.xyzh.common.dto.crontab.TbDataCollectionItem;

import java.util.Date;

/**
 * Controller for data collection items
 */
@RestController
@RequestMapping("/crontab/collection/item")
public class DataCollectionItemController {

    @Autowired
    private DataCollectionItemService dataCollectionItemService;

    /**
     * Update the status of a data collection item
     * PUT /api/crontab/collection/item/{itemId}/status
     *
     * @param itemId  collection item ID
     * @param request update request body
     * @return update result
     */
    @PutMapping("/{itemId}/status")
    public ResultDomain<String> updateCollectionItemStatus(
            @PathVariable String itemId,
            @RequestBody UpdateStatusRequest request) {
        // success()/fail() are called on an instance, matching their use in DataCollectionItemServiceImpl
        ResultDomain<String> resultDomain = new ResultDomain<>();
        try {
            // Validate parameters
            if (itemId == null || itemId.trim().isEmpty()) {
                resultDomain.fail("采集项ID不能为空");
                return resultDomain;
            }
            if (request.getResourceId() == null || request.getResourceId().trim().isEmpty()) {
                resultDomain.fail("资源ID不能为空");
                return resultDomain;
            }

            // Look up the collection item
            TbDataCollectionItem item = dataCollectionItemService.getById(itemId);
            if (item == null) {
                resultDomain.fail("采集项不存在");
                return resultDomain;
            }

            // Update the item; status defaults to 1 (converted)
            item.setStatus(request.getStatus() != null ? request.getStatus() : 1);
            item.setResourceId(request.getResourceId());
            item.setProcessTime(new Date());
            // Update the processor only when one is supplied
            if (request.getProcessor() != null) {
                item.setProcessor(request.getProcessor());
            }

            // Persist the update
            boolean success = dataCollectionItemService.updateById(item);
            if (success) {
                resultDomain.success("更新成功", request.getResourceId());
            } else {
                resultDomain.fail("更新失败");
            }
        } catch (Exception e) {
            e.printStackTrace();
            resultDomain.fail("更新失败: " + e.getMessage());
        }
        return resultDomain;
    }

    /**
     * Request body for a status update
     */
    public static class UpdateStatusRequest {
        /** Status: 0 = unprocessed, 1 = converted, 2 = ignored */
        private Integer status;
        /** Resource ID */
        private String resourceId;
        /** Processor (who handled the item) */
        private String processor;

        public Integer getStatus() {
            return status;
        }

        public void setStatus(Integer status) {
            this.status = status;
        }

        public String getResourceId() {
            return resourceId;
        }

        public void setResourceId(String resourceId) {
            this.resourceId = resourceId;
        }

        public String getProcessor() {
            return processor;
        }

        public void setProcessor(String processor) {
            this.processor = processor;
        }
    }
}
```
---
## 2. Service Layer (if needed)

**File path**: `schoolNewsServ/api/api-crontab/src/main/java/org/xyzh/api/crontab/DataCollectionItemService.java`

If the Service interface does not already have the `getById` and `updateById` methods, add them:

```java
package org.xyzh.api.crontab;

import org.xyzh.common.dto.crontab.TbDataCollectionItem;

/**
 * Service interface for data collection items
 */
public interface DataCollectionItemService {

    /**
     * Look up a collection item by ID
     * @param itemId collection item ID
     * @return the collection item
     */
    TbDataCollectionItem getById(String itemId);

    /**
     * Update a collection item
     * @param item collection item data
     * @return whether the update succeeded
     */
    boolean updateById(TbDataCollectionItem item);

    // ... other methods
}
```
---
## 3. Service Implementation Layer

**File path**: `schoolNewsServ/crontab/src/main/java/org/xyzh/crontab/service/impl/DataCollectionItemServiceImpl.java`

```java
package org.xyzh.crontab.service.impl;

import com.baomidou.mybatisplus.core.conditions.query.QueryWrapper;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.xyzh.api.crontab.DataCollectionItemService;
import org.xyzh.common.dto.crontab.TbDataCollectionItem;
import org.xyzh.crontab.mapper.DataCollectionItemMapper;

/**
 * Service implementation for data collection items
 */
@Service
public class DataCollectionItemServiceImpl implements DataCollectionItemService {

    @Autowired
    private DataCollectionItemMapper dataCollectionItemMapper;

    @Override
    public TbDataCollectionItem getById(String itemId) {
        QueryWrapper<TbDataCollectionItem> wrapper = new QueryWrapper<>();
        wrapper.eq("id", itemId);
        return dataCollectionItemMapper.selectOne(wrapper);
    }

    @Override
    public boolean updateById(TbDataCollectionItem item) {
        int rows = dataCollectionItemMapper.updateById(item);
        return rows > 0;
    }

    // ... other method implementations
}
```
---
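## 4. Confirm the Database Table Structure

Make sure the `tb_data_collection_item` table has the following columns:

```sql
-- Add these columns if they are missing.
-- Note: ADD COLUMN IF NOT EXISTS is MariaDB syntax; MySQL rejects it, so on MySQL
-- check the schema first and run a plain ADD COLUMN for any column that is missing.
ALTER TABLE tb_data_collection_item
    ADD COLUMN IF NOT EXISTS resource_id VARCHAR(50) COMMENT '转换后的资源ID';
ALTER TABLE tb_data_collection_item
    ADD COLUMN IF NOT EXISTS process_time DATETIME COMMENT '处理时间';
ALTER TABLE tb_data_collection_item
    ADD COLUMN IF NOT EXISTS processor VARCHAR(50) COMMENT '处理人';
```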
---
## 5. Testing the Endpoint

### Example Request

**Request URL**: `PUT http://localhost:8080/api/crontab/collection/item/{itemId}/status`

**Request headers**:
```
Content-Type: application/json
Authorization: Bearer {token}
```

**Request body**:
```json
{
"status": 1,
"resourceId": "resource_123456",
"processor": "admin"
}
```

### Example Responses

**Success response**:
```json
{
"success": true,
"message": "更新成功",
"data": "resource_123456",
"code": 200
}
```

**Failure response**:
```json
{
"success": false,
"message": "采集项不存在",
"code": 500
}
```
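
On the client, the two response shapes above can be handled through one envelope type. The following is a minimal sketch, assuming the field names exactly as they appear in the JSON examples; `ResultDomainLike` and `unwrap` are hypothetical names introduced for illustration and are not part of the project's code.

```typescript
// Hypothetical client-side model of the response envelope shown above.
interface ResultDomainLike<T> {
  success: boolean;
  message: string;
  code: number;
  data?: T; // absent in the failure example
}

// Returns the payload on success; throws with the server message otherwise.
function unwrap<T>(result: ResultDomainLike<T>): T {
  if (!result.success || result.data === undefined) {
    throw new Error(result.message || `request failed with code ${result.code}`);
  }
  return result.data;
}

// The two example responses from this section, as typed values.
const okResponse: ResultDomainLike<string> = {
  success: true,
  message: "更新成功",
  data: "resource_123456",
  code: 200,
};
const failedResponse: ResultDomainLike<string> = {
  success: false,
  message: "采集项不存在",
  code: 500,
};
```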
---
## 6. Frontend Usage Example

The frontend call has already been added in `src/apis/crontab/index.ts`:

```typescript
/**
 * Mark a collection item as converted
 * @param itemId collection item ID
 * @param resourceId resource ID
 * @returns Promise<ResultDomain<string>>
 */
async updateCollectionItemStatus(itemId: string, resourceId: string): Promise<ResultDomain<string>> {
  const response = await api.put<string>(`${this.baseUrl}/collection/item/${itemId}/status`, {
    status: 1, // converted
    resourceId
  });
  return response.data;
}
```

Usage:
```typescript
await crontabApi.updateCollectionItemStatus('item_123', 'resource_456');
```
---
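## 7. End-to-End Flow

```
1. The user edits the article in the ArticleAdd component
2. The user clicks "Publish Now" (立即发布)
3. The frontend calls: POST /api/news/resources/resource
   to create the resource, which returns a resourceID
4. The frontend calls: PUT /api/crontab/collection/item/{itemId}/status
   with the body:
   {
     "status": 1,
     "resourceId": "<the newly created resourceID>"
   }
5. The backend updates tb_data_collection_item:
   - status = 1 (converted)
   - resource_id = resourceID
   - process_time = NOW()
6. The frontend refreshes the list, and the item's status shows "Converted" (已转换)
```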
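
Steps 3 and 4 of the flow can be sketched on the client as two calls, where the second is best-effort: the created resource is the primary result, and a failed status update is logged rather than surfaced. This is a sketch under assumptions, not the project's implementation: `createResource` and `updateItemStatus` are hypothetical stand-ins passed in as parameters, and `ConvertOutcome` is an illustrative return type.

```typescript
// Hypothetical stand-ins for the two backend calls in steps 3 and 4.
type CreateResource = () => Promise<{ success: boolean; resourceID?: string; message?: string }>;
type UpdateItemStatus = (itemId: string, status: number, resourceId: string) => Promise<void>;

interface ConvertOutcome {
  resourceID: string | null; // null when resource creation failed
  statusUpdated: boolean;
}

const STATUS_CONVERTED = 1; // 0 = unprocessed, 1 = converted, 2 = ignored

async function convertItem(
  itemId: string,
  createResource: CreateResource,
  updateItemStatus: UpdateItemStatus,
): Promise<ConvertOutcome> {
  // Step 3: create the resource; stop here if that fails.
  const created = await createResource();
  if (!created.success || !created.resourceID) {
    return { resourceID: null, statusUpdated: false };
  }

  // Step 4: mark the collection item as converted. A failure here is only
  // logged, because the resource already exists at this point.
  try {
    await updateItemStatus(itemId, STATUS_CONVERTED, created.resourceID);
    return { resourceID: created.resourceID, statusUpdated: true };
  } catch (error) {
    console.warn("status update failed:", error);
    return { resourceID: created.resourceID, statusUpdated: false };
  }
}
```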
---
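## 8. Notes

1. **Access control**: add a permission check so that only administrators can call this endpoint
2. **Transactions**: if the operation touches multiple tables, add the `@Transactional` annotation
3. **Audit logging**: record an operation log of who converted which collection item, and when
4. **Concurrency control**: if concurrent conversions are possible, add optimistic locking
5. **Parameter validation**: annotations such as `@Valid` and `@NotBlank` can be used to validate parameters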
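
The optimistic-locking idea in note 4 can be illustrated with a compare-and-set update. The sketch below uses an in-memory `Map` as a stand-in for the table, and the `version` field is an assumption: nothing in this document says `tb_data_collection_item` has such a column. It only shows the shape of an update that succeeds when the row is unchanged since it was read.

```typescript
interface ItemRow {
  id: string;
  status: number; // 0 = unprocessed, 1 = converted, 2 = ignored
  resourceId: string | null;
  version: number; // hypothetical optimistic-lock column
}

// In-memory stand-in for the table, for illustration only.
const table = new Map<string, ItemRow>();

// Mirrors an update of the form
//   UPDATE ... SET status = 1, resource_id = ?, version = version + 1
//   WHERE id = ? AND version = ?
// and returns true only when the row was actually updated.
function markConverted(id: string, resourceId: string, expectedVersion: number): boolean {
  const row = table.get(id);
  if (!row || row.version !== expectedVersion) {
    return false; // row is missing, or was modified since it was read
  }
  table.set(id, { ...row, status: 1, resourceId, version: row.version + 1 });
  return true;
}

// Two callers read version 0; only the first update is applied.
table.set("item_123", { id: "item_123", status: 0, resourceId: null, version: 0 });
const firstApplied = markConverted("item_123", "resource_456", 0);
const secondApplied = markConverted("item_123", "resource_789", 0);
```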
---
## 9. Optional: Add an Operation Log

```java
/**
 * Update the status of a data collection item
 */
@PutMapping("/{itemId}/status")
@Log(title = "更新采集项状态", businessType = BusinessType.UPDATE)
public ResultDomain<String> updateCollectionItemStatus(
        @PathVariable String itemId,
        @RequestBody UpdateStatusRequest request) {
    // ... implementation
}
```
---
## 10. Mapper Layer (if using MyBatis Plus)

**File path**: `schoolNewsServ/crontab/src/main/java/org/xyzh/crontab/mapper/DataCollectionItemMapper.java`

```java
package org.xyzh.crontab.mapper;

import com.baomidou.mybatisplus.core.mapper.BaseMapper;
import org.apache.ibatis.annotations.Mapper;
import org.xyzh.common.dto.crontab.TbDataCollectionItem;

/**
 * Mapper for data collection items
 */
@Mapper
public interface DataCollectionItemMapper extends BaseMapper<TbDataCollectionItem> {
    // MyBatis Plus already provides selectById, updateById, and similar methods
}
```
---
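## Checklist

- [ ] Copy the Controller code into `DataCollectionItemController.java`
- [ ] Check whether the Service interface has the `getById` and `updateById` methods
- [ ] If not, add the Service implementation
- [ ] Confirm that the table has the `resource_id`, `process_time`, and `processor` columns
- [ ] If not, run the SQL to add them
- [ ] Restart the backend service
- [ ] Test the endpoint with Postman or from the frontend
- [ ] Verify that the database rows are updated correctly

---

Once these steps are complete, the frontend conversion feature should work as expected.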

View File

@@ -243,6 +243,17 @@ export const crontabApi = {
tagId tagId
}); });
return response.data; return response.data;
} },
/**
* Update the status of a collection item (for example, to converted)
* @param itemId collection item ID
* @param status status
* @returns Promise<ResultDomain<string>>
*/
async updateCollectionItemStatus(itemId: string, status: number): Promise<ResultDomain<string>> {
const response = await api.put<string>(`${this.baseUrl}/collection/item/${itemId}/status/${status}`);
return response.data;
},
}; };

View File

@@ -140,6 +140,9 @@ export interface DataCollectionItem extends BaseDTO {
processTime?: string; processTime?: string;
/** 处理人 */ /** 处理人 */
processor?: string; processor?: string;
/** Parse status (0: failed, 1: succeeded) */
itemExecuteStatus?: number;
itemExecuteMessage?: string;
} }
/** /**

View File

@@ -135,13 +135,24 @@
<!-- 爬虫解析结果 --> <!-- 爬虫解析结果 -->
<el-table-column label="解析结果" width="220"> <el-table-column label="解析结果" width="220">
<template #default="{ row }">
<el-tag
:type="getStatusTagType(row.itemExecuteStatus)"
size="small"
>
{{ getAnalyzeStatus(row.itemExecuteStatus) }}
</el-tag>
</template>
</el-table-column>
<!-- 来源 -->
<el-table-column label="来源" width="220">
<template #default="{ row }"> <template #default="{ row }">
<div class="parse-result"> <div class="parse-result">
<div v-if="row.category" class="result-item"> <div v-if="row.category" class="result-item">
<el-tag size="small" type="info">{{ row.category }}</el-tag> <el-tag size="small" type="info">{{ row.category }}</el-tag>
</div> </div>
<div v-if="row.source" class="result-item"> <div v-if="row.source" class="result-item">
来源: {{ row.source }} {{ row.source }}
</div> </div>
<div v-if="row.tags" class="result-item"> <div v-if="row.tags" class="result-item">
标签: {{ row.tags }} 标签: {{ row.tags }}
@@ -341,6 +352,7 @@
<ArticleAdd <ArticleAdd
v-if="convertDialogVisible" v-if="convertDialogVisible"
:initial-data="convertFormData" :initial-data="convertFormData"
:collection-item-id="convertItem?.id"
:show-back-button="false" :show-back-button="false"
@publish-success="handleConvertSuccess" @publish-success="handleConvertSuccess"
@back="convertDialogVisible = false" @back="convertDialogVisible = false"
@@ -501,7 +513,8 @@ function handleViewDetail(row: DataCollectionItem) {
// ==================== 转换操作 ==================== // ==================== 转换操作 ====================
/** /**
* 处理富文本内容,清理不必要的样式 * 处理富文本内容,清理可能导致冲突的样式
* 采用温和策略:只移除明显有问题的样式,保留大部分原始格式
*/ */
function cleanHtmlContent(html: string): string { function cleanHtmlContent(html: string): string {
if (!html) return ''; if (!html) return '';
@@ -510,32 +523,61 @@ function cleanHtmlContent(html: string): string {
const tempDiv = document.createElement('div'); const tempDiv = document.createElement('div');
tempDiv.innerHTML = html; tempDiv.innerHTML = html;
// 移除所有内联样式中的字体大小、字体族等可能导致显示问题的样式 // 需要移除的问题样式属性(这些通常会导致显示问题)
const problematicStyles = [
'font-family', // 字体族可能不存在
'font-size', // 字体大小可能过大或过小
'line-height', // 行高可能不适配
'width', // 固定宽度可能导致布局问题
'height', // 固定高度可能导致内容截断
'max-width', // 最大宽度限制
'max-height', // 最大高度限制
'position', // 定位可能导致布局混乱
'top', 'left', 'right', 'bottom', // 定位相关
'z-index', // 层级可能冲突
'float', // 浮动可能导致布局问题
];
// 处理所有带有内联样式的元素
const elementsWithStyle = tempDiv.querySelectorAll('[style]'); const elementsWithStyle = tempDiv.querySelectorAll('[style]');
elementsWithStyle.forEach((el) => { elementsWithStyle.forEach((el) => {
const element = el as HTMLElement; const element = el as HTMLElement;
const style = element.style; const style = element.style;
// 保留一些重要的样式,移除可能冲突的样式 // 收集所有当前样式
const preservedStyles: string[] = []; const preservedStyles: string[] = [];
for (let i = 0; i < style.length; i++) {
const property = style[i];
const value = style.getPropertyValue(property);
// 保留文本颜色 // 如果不在问题样式列表中,则保留
if (style.color) preservedStyles.push(`color: ${style.color}`); if (!problematicStyles.includes(property) && value) {
// 保留背景色 preservedStyles.push(`${property}: ${value}`);
if (style.backgroundColor) preservedStyles.push(`background-color: ${style.backgroundColor}`); }
// 保留文本对齐 }
if (style.textAlign) preservedStyles.push(`text-align: ${style.textAlign}`);
// 保留边距
if (style.marginTop) preservedStyles.push(`margin-top: ${style.marginTop}`);
if (style.marginBottom) preservedStyles.push(`margin-bottom: ${style.marginBottom}`);
element.setAttribute('style', preservedStyles.join('; ')); // 重新设置样式
if (preservedStyles.length > 0) {
element.setAttribute('style', preservedStyles.join('; '));
} else {
element.removeAttribute('style');
}
}); });
// 移除可能的外部类名,避免样式冲突 // 移除明显的外部框架类名(如bootstrap、tailwind等)
const problematicClassPrefixes = ['col-', 'row-', 'container', 'flex-', 'grid-', 'd-', 'p-', 'm-', 'w-', 'h-'];
const elementsWithClass = tempDiv.querySelectorAll('[class]'); const elementsWithClass = tempDiv.querySelectorAll('[class]');
elementsWithClass.forEach((el) => { elementsWithClass.forEach((el) => {
el.removeAttribute('class'); const classList = el.className.split(' ').filter(cls => {
// 如果类名以问题前缀开头,则移除
return !problematicClassPrefixes.some(prefix => cls.startsWith(prefix));
});
if (classList.length > 0) {
el.className = classList.join(' ');
} else {
el.removeAttribute('class');
}
}); });
return tempDiv.innerHTML; return tempDiv.innerHTML;
@@ -632,6 +674,14 @@ function getStatusText(status: number | undefined): string {
} }
} }
function getAnalyzeStatus(executeStatus: number | undefined): string {
switch (executeStatus) {
case 0: return '解析失败';
case 1: return '解析成功';
default: return '未知';
}
}
/** /**
* 获取状态标签类型 * 获取状态标签类型
*/ */

View File

@@ -115,6 +115,7 @@ import { RichTextComponent } from '@/components/text';
import { FileUpload } from '@/components/file'; import { FileUpload } from '@/components/file';
import { ArticleShow } from '.'; import { ArticleShow } from '.';
import { resourceTagApi, resourceApi } from '@/apis/resource'; import { resourceTagApi, resourceApi } from '@/apis/resource';
import { crontabApi } from '@/apis/crontab';
import { ResourceVO, Tag, TagType } from '@/types/resource'; import { ResourceVO, Tag, TagType } from '@/types/resource';
defineOptions({ defineOptions({
@@ -126,6 +127,7 @@ interface Props {
showBackButton?: boolean; showBackButton?: boolean;
backButtonText?: string; backButtonText?: string;
initialData?: ResourceVO; initialData?: ResourceVO;
collectionItemId?: string;
} }
const props = withDefaults(defineProps<Props>(), { const props = withDefaults(defineProps<Props>(), {
@@ -221,30 +223,61 @@ async function handlePublish() {
await formRef.value?.validate(); await formRef.value?.validate();
publishing.value = true; publishing.value = true;
if (isEdit.value) {
const result = await resourceApi.updateResource(articleForm.value); // 如果是从数据采集转换过来的,使用转换接口
if (result.success) { if (props.collectionItemId) {
ElMessage.success('保存成功'); await handleConvertFromCollection();
emit('publish-success', result.data?.resource?.resourceID || '');
} else {
ElMessage.error(result.message || '保存失败');
}
} else { } else {
// 普通创建资源
const result = await resourceApi.createResource(articleForm.value); const result = await resourceApi.createResource(articleForm.value);
if (result.success) { if (result.success) {
const resourceID = result.data?.resource?.resourceID || '';
ElMessage.success('发布成功'); ElMessage.success('发布成功');
emit('publish-success', result.data?.resource?.resourceID || ''); emit('publish-success', resourceID);
} else { } else {
ElMessage.error(result.message || '发布失败'); ElMessage.error(result.message || '发布失败');
} }
} }
} catch (error) { } catch (error) {
console.error('发布失败:', error); console.error('发布失败:', error);
ElMessage.error('发布失败');
} finally { } finally {
publishing.value = false; publishing.value = false;
} }
} }
// Convert a collected item into a resource
async function handleConvertFromCollection() {
if (!props.collectionItemId) return;
try {
// 1. Create the resource first, using the content as edited by the user
const createResult = await resourceApi.createResource(articleForm.value);
if (!createResult.success) {
ElMessage.error(createResult.message || '创建资源失败');
return;
}
const resourceID = createResult.data?.resource?.resourceID || '';
// 2. Mark the collection item as converted; only the status is sent, the resourceID is not passed to this call
try {
await crontabApi.updateCollectionItemStatus(props.collectionItemId, 1);
console.log('采集项状态已更新, resourceID:', resourceID);
} catch (error) {
console.warn('更新采集项状态失败:', error);
// Does not affect the main flow: the resource has already been created
}
ElMessage.success('转换成功');
emit('publish-success', resourceID);
} catch (error) {
console.error('转换失败:', error);
throw error;
}
}
// 保存草稿 // 保存草稿
async function handleSaveDraft() { async function handleSaveDraft() {
savingDraft.value = true; savingDraft.value = true;