A focused crawler algorithm based on Naive Bayes model and PageRank on power grid accidents

Abstract:

To reduce personal safety accidents in the power grid, it is necessary to use data mining techniques to construct and analyze multi-dimensional accident data and to build accurate early-warning models. One particularly challenging problem is how to automatically collect personal-accident sample data from a massive number of web pages. This paper proposes a focused crawler algorithm that combines a Naive Bayes model with the PageRank algorithm. First, the data are preprocessed by Chinese text segmentation and by setting keyword frequencies. After feature selection, a Naive Bayes classification model is constructed and trained, which significantly improves the classification accuracy for power grid accidents. Then, the PageRank algorithm is used to rank the accurately classified web pages by topic relevance, which effectively avoids the topic drift that ordinary crawler methods often suffer from. The experimental results show that, whether measured over the same time budget or over the same number of crawled pages, the page harvest rate of the proposed method is higher than that of the Naive Bayes classifier or PageRank used alone. The method can therefore crawl power grid accident information from a large number of web pages more efficiently and accurately.
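The two stages described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the tokenized toy documents, the link graph, the labels `accident` and `other`, and the function names are all hypothetical, and the way the classifier output feeds PageRank (ranking only the sub-graph of pages classified as relevant) is an assumption. Harvest rate is taken here to be the fraction of crawled pages judged relevant, a common definition that the abstract itself does not spell out. Real Chinese text would first need a word segmenter; the sketch assumes tokens are already available.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model on pre-tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outgoing pages]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:  # dangling page: spread its rank evenly over all pages
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical training data (already segmented into tokens).
train_docs = [
    ["substation", "electric", "shock", "injury"],
    ["line", "worker", "fall", "accident"],
    ["football", "match", "score"],
    ["stock", "market", "price"],
]
train_labels = ["accident", "accident", "other", "other"]
model = train_naive_bayes(train_docs, train_labels)

# Hypothetical crawled pages and the links between them.
pages = {
    "a": ["worker", "electric", "shock"],
    "b": ["substation", "accident", "injury"],
    "c": ["market", "price", "score"],
}
links = {"a": ["b"], "b": ["a"], "c": ["a", "b"]}

# Stage 1: keep pages the classifier labels as accident-related.
relevant = {p for p, toks in pages.items() if predict(model, toks) == "accident"}
# Stage 2 (assumed combination): rank only the relevant sub-graph.
sub_links = {p: [q for q in links.get(p, []) if q in relevant] for p in relevant}
ranks = pagerank(sub_links)
crawl_order = sorted(relevant, key=lambda p: ranks[p], reverse=True)
harvest_rate = len(relevant) / len(pages)
```

Filtering before ranking keeps off-topic pages from passing rank to their neighbours, which is one plausible reading of how the combination limits topic drift.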

Authors:

Xie Shuyong, Liu Zhiliang

Affiliations:

Huizhou Power Supply Bureau, Guangdong Power Grid Co., Ltd.; China Southern Power Grid Co., Ltd., Guangzhou

Cite this article:

Xie Shuyong, Liu Zhiliang. A focused crawler algorithm based on Naive Bayes model and PageRank on power grid accidents[J]. Journal of Henan Normal University (Natural Science Edition), 2025, 53(2): 124-130. DOI: 10.16366/j.cnki.1000-2367.2023.12.26.0001.

Funding:

National Natural Science Foundation of China; China Southern Power Grid Science and Technology Project

Keywords:

power grid accident; personal safety; Naive Bayes model; PageRank algorithm; focused crawler

CLC number:

TP391

