A theoretical model for n-gram distribution in big data corpora

There is a wide diversity of applications relying on the identification of the sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach for determining the statistical distribution of the n-grams but are usually constrained by the corpora sizes, whic...

Bibliographic details
Main author: Silva, Joaquim F. (author)
Other authors: Gonçalves, Carlos Jorge de Sousa (author), Cunha, José C. (author)
Format: conferenceObject
Language: eng
Published: 2017
Subjects:
Full text: http://hdl.handle.net/10400.21/6829
Country: Portugal
OAI: oai:repositorio.ipl.pt:10400.21/6829