Term frequency dynamics in collaborative articles

Bibliographic details
Main author: Sérgio Nunes (author)
Other authors: Cristina Ribeiro (author), Gabriel David (author)
Format: book
Language: eng
Published: 2010
Subjects:
Full text: https://hdl.handle.net/10216/70210
Country: Portugal
OAI: oai:repositorio-aberto.up.pt:10216/70210
Description
Abstract: Documents on the World Wide Web are dynamic entities. Mainstream information retrieval systems and techniques focus primarily on the latest version of a document and generally ignore its evolution over time. In this work, we study term frequency dynamics in web documents over their lifespan. We use Wikipedia as a document collection because it is a broad, public resource and, more importantly, because it provides access to the complete revision history of each document. We investigate the progression of similarity values over two projection variables, namely revision order and revision date. Based on this investigation, we find that term frequency in encyclopedic documents (i.e. documents that are comprehensive and focused on a single topic) exhibits a rapid and steady progression towards the document's current version. The content of early versions quickly becomes very similar to the present version of the document.
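
The kind of measurement the abstract describes can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the tokenizer, so the cosine similarity, the regular-expression tokenizer, the function names, and the sample revision texts below are all assumptions made for illustration, not the authors' method or data.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens (a simple tokenizer, assumed here)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two term-frequency vectors (assumed measure)."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty revision shares no terms with anything
    return dot / (norm_a * norm_b)


def similarity_to_current(revisions):
    """Similarity of each revision to the last one, in revision order."""
    current = term_frequencies(revisions[-1])
    return [cosine_similarity(term_frequencies(r), current) for r in revisions]


# Hypothetical revision texts, oldest first (not data from the study).
revisions = [
    "Porto is a city.",
    "Porto is a city in Portugal.",
    "Porto is a coastal city in northwest Portugal.",
]
scores = similarity_to_current(revisions)
```

Here the list index stands in for revision order. To project over revision date instead, each score would be paired with the timestamp of its revision, which this sketch leaves out.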