The Viuva Negra crawler

Bibliographic details
Main author: Gomes, Daniel (author)
Other authors: Silva, Mário J. (author)
Format: report
Language: por
Published in: 2009
Subjects:
Full text: http://hdl.handle.net/10451/14117
Country: Portugal
OAI: oai:repositorio.ul.pt:10451/14117
Description
Abstract: This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe situations found on the web that are hazardous to crawling and the solutions adopted to mitigate their effects. The gathered information was integrated in a web warehouse that supports its automatic processing by text mining applications.