网络爬虫
- Web crawler; web spider
-
基于WEB信息采集的分布式网络爬虫搜索引擎的研究
Research on a Distributed Web Crawler Search Engine Based on Web Information Collection
-
网络爬虫的其他应用在于监测WEB页面和搜索引擎。
Other applications of web crawlers are in monitoring Web pages and in search engines.
-
基于Java技术的主题网络爬虫的研究与实现
Research and Implementation of a Topic-Focused Web Crawler Based on Java Technology
-
网络爬虫是一种智能化的程序,能够自动抓取Web上的页面。
A web crawler is an intelligent program that can automatically fetch pages on the Web.
-
基于JavaScript切片的AJAX框架网络爬虫技术研究
Research on Web Crawler Technology for AJAX Frameworks Based on JavaScript Slicing
-
支持AJAX的网络爬虫系统设计与实现
Design and Implementation of a Web Crawler System Supporting AJAX
-
分布式网络爬虫URL去重策略的改进
Improvement of the URL Deduplication Strategy for Distributed Web Crawlers
-
在这个由关于Applebot(苹果(Apple)的网络爬虫——译者注)的议论和硅谷(SiliconValley)狂热者主导的商业世界里,马尔凯蒂是个异类。
In a commercial world dominated by chatter about Applebot and Silicon Valley evangelists, Marchetti is an exception.
-
创建和维护索引的任务由网络爬虫完成,网络爬虫代表搜索引擎递归地遍历和下载Web页面。
The creation and maintenance of Web indices is done by Web crawlers, which recursively traverse and download Web pages on behalf of search engines.
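The recursive traverse-and-download loop described here can be sketched as a breadth-first crawl. This is a minimal generic illustration, not any particular system from these entries; the `fetch` callback is injected so the sketch stays self-contained (a real crawler would fetch over HTTP and respect robots.txt):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=10):
    """Breadth-first traversal starting from a seed URL.

    `fetch(url) -> html` is passed in so the sketch is testable offline;
    swap in urllib.request or an HTTP client for real crawling.
    Returns a dict mapping each visited URL to its HTML.
    """
    seen = {seed}              # every URL ever enqueued
    queue = deque([seed])      # frontier of URLs to visit
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The injected `fetch` also makes it easy to add politeness (rate limiting, per-host queues) without touching the traversal logic.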
-
基于HTMLParser信息提取的网络爬虫设计
Design of a Web Crawler Based on HTMLParser Information Extraction
-
海量URL的管理一直是提高网络爬虫性能的一个瓶颈。
The management of massive numbers of URLs has long been a bottleneck in improving web crawler performance.
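A common answer to this URL-management bottleneck is approximate deduplication with a Bloom filter, which costs a few bits per URL instead of storing every URL string, at the price of a small false-positive rate. A minimal sketch (the bit-array size and hash scheme here are illustrative assumptions, not taken from any of the papers above):

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for approximate URL deduplication.

    A false positive means a URL may be skipped as "already seen";
    a Bloom filter never yields false negatives.
    """

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive num_hashes positions by salting the URL with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

In a distributed crawler, each node would typically hold the filter for its own URL partition, which is one way the deduplication strategy in the entry above can be distributed.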
-
通用网络爬虫会从一个或者多个种子URL链接开始,爬行整个网络上的网页。
Starting from one or more seed URLs, a general-purpose web crawler crawls web pages across the entire Web.
-
近日,微软准备推出自己最新的计划,他们称之为“ProjectBarcelona”。这个项目将用于企业级搜索和信息存储所用的网络爬虫工具。
Project Barcelona, a new project in the works from Microsoft, will give enterprises Web crawler-like tools for searching and storing information.
-
Webrobot(网络爬虫)作为一种网络资源获取程序,在广泛应用于信息搜索的同时,也带来了一些负面影响。
A Web robot is a program for accessing network resources. While widely used in areas such as search engines, it has also brought some negative effects.
-
一种维护WAP网站的网络爬虫的设计
Design of a Web Crawler for Maintaining WAP Sites
-
对现有的网络爬虫技术进行了改进,借助ODP专题实现了网页分类。
Existing web crawler technology is improved, and web page classification is implemented with the help of ODP topic categories.
-
各网络爬虫根据站点类型采用相应的采集策略以实现精确采集,并支持脚本执行、RSS解析。
Each web crawler adopts a collection strategy suited to the site type to achieve precise crawling, with support for script execution and RSS parsing.
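RSS parsing of the kind mentioned here can be done with the standard library alone. The sketch below is a generic illustration for RSS 2.0 `item`/`title`/`link` elements, not the actual parser of the system in this entry:

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Extract (title, link) pairs from the items of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):  # <item> elements under <channel>
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items
```

A crawler can feed the extracted links back into its frontier, which is how RSS support naturally slots into a per-site collection strategy.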
-
然后,研究网络爬虫技术和网页预处理技术包括网页DOM模型、网页清洗、和网页结构图形化显示。
Then, web crawler technology and web page preprocessing techniques are studied, including the web page DOM model, page cleaning, and graphical display of page structure.
-
它的原理是用聚焦网络爬虫对目标网站群的数据进行抓取、分析和处理,然后提供RSS推送。
Its principle is to use a focused web crawler to fetch, analyse and process data from the target group of sites, and then provide RSS feeds.
-
其次介绍了本文用到的并行计算技术,并对并行网络爬虫系统的各个重要组成部分所实现的功能进行了分析与设计,最后对网页URL消重算法和并行爬行效率分别进行性能测试。
Next, the parallel computing techniques used in this paper are introduced, and the functions of each major component of the parallel web crawler system are analysed and designed. Finally, performance tests are given for the URL deduplication algorithm and for parallel crawling efficiency.
-
首先,spock使用语义学技术、通过网络爬虫系统采集到的所有常用短语都会成为标签。
First, all frequent phrases that Spock extracts using its semantic technology via its web crawler become tags.
-
本文提出了一种维护WAP网站的网络爬虫系统,该系统可以自动遍历WAP网站,并对网页进行分析,检查语法和语义的错误。
This paper presents a web crawler system for maintaining WAP sites. The system can automatically traverse a WAP site, parse every page, and check for syntactic and semantic errors.
-
最后本文对不良信息过滤系统以及WAP网络爬虫系统实现涉及到的关键技术进行了详细的分析和讨论,同时也开发了原型系统并进行实验测试。
Finally, this paper analyses and discusses in detail the key technologies involved in implementing the harmful-information filtering system and the WAP web crawler system, and also develops a prototype system and runs experimental tests.
-
本文还提出并设计了用于主动发现不良信息网站的WAP网络爬虫系统,作为不良信息过滤系统的补充,主动抓取和分析WAP网页内容,识别不良WAP网站。
This paper also proposes and designs a WAP web crawler system for proactively discovering harmful-information websites; as a complement to the harmful-information filtering system, it proactively fetches and analyses WAP page content to identify harmful WAP sites.
-
最后通过设计使用分布式网络爬虫nutch来查找分布式存储模式下的资源,期待为今后的研究者们提供一种思路。
Finally, the distributed web crawler Nutch is used to locate resources under the distributed storage model, in the hope of providing an approach for future researchers.
-
该系统包含针对BT种子文件的网络爬虫和种子文件解析器,它能自动采集网络中的BT种子文件;再从中分离出共享文件的描述信息,建立索引和描述信息的历史纪录。
This system includes a web crawler and a parser for BT seed files, so it can automatically collect BT seed files from the network, extract from them the description information of the shared files, and build an index and a history of the description information.
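BT seed (.torrent) files are bencoded, so the seed-file parser mentioned here needs a bencode decoder at its core. A minimal sketch of one (it handles the four bencode types; a real .torrent parser would then read keys such as `info` and `announce` from the decoded dictionary):

```python
def bdecode(data, i=0):
    """Decode one bencoded value from `data` starting at offset `i`.

    Returns (value, next_offset). Bencode has four types:
    integers i...e, byte strings <len>:<bytes>, lists l...e, dicts d...e.
    """
    c = data[i:i + 1]
    if c == b"i":                      # integer: i42e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                      # list: l...e
        i += 1
        out = []
        while data[i:i + 1] != b"e":
            val, i = bdecode(data, i)
            out.append(val)
        return out, i + 1
    if c == b"d":                      # dict: d...e, keys are byte strings
        i += 1
        out = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            val, i = bdecode(data, i)
            out[key] = val
        return out, i + 1
    colon = data.index(b":", i)        # byte string: 4:spam
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length
```

Keys and string values stay as `bytes` because .torrent files may contain raw binary (e.g. SHA-1 piece hashes).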
-
Spider(网络爬虫)是一种网络资源获取程序,它加速网络流通的同时也增加了网络负载,有必要监控spider对网站的访问。
A spider (Web robot) is a program for harvesting Internet resources; while it speeds up the flow of information, it also increases network load, so it is necessary to regulate and monitor spiders' visits to websites.
-
首先,优化了基于Nutch的分布式网络爬虫系统,实现了爬虫系统的并行化同步运行方式,提升了爬虫处理性能。
Firstly, we optimize the Nutch-based distributed web crawler system, implementing a parallel, synchronized mode of operation that improves the crawler's processing performance.
-
其中非法关键字扫描和图片视频监控模块作为平台的核心模块,创新性地将网络爬虫、FTP以及视频切片等技术应用在了网站安全领域,为网站的内容安全提供实时的监控。
Among these, the illegal-keyword scanning and the picture and video monitoring modules are designed as the platform's core modules; the platform innovatively applies web crawler, FTP and video slicing technologies to the field of website security, providing real-time monitoring of website content security.
-
此外,系统利用网络爬虫抓取的网页建立了主题文档库,对于FAQ无法解答的问题,系统将从主题文档库中检索答案,这部分是对问答系统的补充和完善。
Besides, the system constructs a theme document library from the web pages fetched by the web crawler; for questions that the FAQ cannot answer, the system retrieves answers from this library, supplementing and improving the question answering system.