A theoretical model for n-gram distribution in big data corpora

There is a wide diversity of applications relying on the identification of the sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach for determining the statistical distribution of the n-grams but are usually constrained by the corpora sizes, whic...

Bibliographic Details
Main Author: Silva, Joaquim F. (author)
Other Authors: Gonçalves, Carlos Jorge de Sousa (author), Cunha, José C. (author)
Format: conferenceObject
Language: eng
Published: 2017
Online Access: http://hdl.handle.net/10400.21/6829
Country: Portugal
Oai: oai:repositorio.ipl.pt:10400.21/6829