The Viuva Negra crawler

Bibliographic Details
Main Author: Gomes, Daniel (author)
Other Authors: Silva, Mário J. (author)
Format: report
Language: por
Published: 2009
Online Access: http://hdl.handle.net/10451/14117
Country: Portugal
Oai: oai:repositorio.ul.pt:10451/14117
Description
Summary: This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe situations found on the web that are hazardous to crawling and the solutions adopted to mitigate their effects. The gathered information was integrated into a web warehouse that supports its automatic processing by text mining applications.