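A web crawler is a robot program that locates large numbers of Web pages by scanning pages on the Internet. Topic customization lets a crawler serve a specific application, in this case the monitoring and acquisition of online media. This article first introduces how web crawlers work and how they are applied in information retrieval and information extraction. Building on the basic functions of a traditional crawler, it then studies a topic-focused crawler, with emphasis on how to exploit the resources of existing search engines to narrow the crawl scope and improve crawler efficiency. It uses a general matching and extraction method based on regular expressions. Through the analysis of many examples, it summarizes the streaming media link styles currently in use and proposes corresponding methods for analyzing and extracting them. It also makes some preliminary attempts at a semi-structured analysis and extraction method for HTML.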
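Keywords: web crawler, topic customization, streaming media, regular expression, dynamic monitoring of network information, automatic information acquisition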
Abstract
A crawler is a robot program that can search a large number of web pages; it is used to scan pages on the Internet. The topic-focused method allows a crawler to be used in specific fields; in this case, it is used for the detection and extraction of online media resources. This article first introduces the principles of the web crawler. Then, building on the basic functions of a traditional spider, I investigate how to make good use of search engines in order to reduce the cost of the crawler's processing; apply a general matching and extraction method based on regular expressions; summarize several types of streaming media links and the ways to analyze and extract them; and, at the same time, attempt a semi-structured analysis method for HTML.
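
To make the idea of regular-expression-based extraction of streaming media links concrete, the following is a minimal sketch only, not the method developed in this work. The pattern `MEDIA_LINK`, the function `extract_media_links`, and the chosen schemes (`mms`, `rtsp`) and file extensions (`asf`, `asx`, `wmv`, `rm`, `ram`) are all illustrative assumptions.

```python
import re

# Hypothetical pattern: the attributes, schemes and extensions below are
# illustrative assumptions, not the link types identified in this work.
MEDIA_LINK = re.compile(
    r"""(?:href|src|value)\s*=\s*["']?          # attribute that may carry a link
        (?P<url>
            (?:mms|rtsp)://[^\s"'<>]+           # URL with a streaming scheme
          |
            [^\s"'<>]+\.(?:asf|asx|wmv|rm|ram)  # path ending in a media extension
            (?=["'\s>]|$)                       # the extension must end the value
        )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_media_links(html):
    """Return candidate media links in document order, without duplicates."""
    seen = set()
    links = []
    for match in MEDIA_LINK.finditer(html):
        url = match.group("url")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


sample = (
    '<a href="mms://media.example.com/news.wmv">live</a> '
    '<embed src="/clips/a.rm"> '
    '<a href="/index.html">home</a> '
    '<a href="/clips/a.rm">again</a>'
)
links = extract_media_links(sample)
print(links)
```

On the sample string, the ordinary `/index.html` link is skipped and the repeated `/clips/a.rm` link is reported once. A real crawler would additionally need to resolve relative paths against the page URL and to follow metafiles that merely point at the actual stream.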