Collecting Statistics about the Portuguese Web
This report presents a characterization of text documents from the Portuguese Web. This characterization was produced from a crawl of over 4 million URLs and 131 thousand sites in 2003. We describe the rules that we established for defining its boundaries and the methodology used to gather statistics....
Main author: | |
---|---|
Other authors: | |
Format: | report |
Language: | por |
Published in: | 2009 |
Subjects: | |
Full text: | http://hdl.handle.net/10451/14211 |
Country: | Portugal |
OAI: | oai:repositorio.ul.pt:10451/14211 |
Abstract: | This report presents a characterization of text documents from the Portuguese Web. This characterization was produced from a crawl of over 4 million URLs and 131 thousand sites in 2003. We describe the rules that we established for defining its boundaries and the methodology used to gather statistics. We also show how crawling constraints and abnormal situations on the Web can influence the results. |