Tokenization of Portuguese: resolving the hard cases

This research note addresses the issue of ambiguous strings: strings of non-whitespace characters whose tokenization, depending on the specific occurrence, yields either one token or more than one. Strings of this sort, typically coinciding with orthographically contracted forms, are shown to raise the prob...
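
To make the notion of an ambiguous string concrete, the following is a minimal sketch only, not the method described in the report. The lexicon entries, the function names `tokenize` and `toy_decider`, and the decision rule are all assumptions introduced for illustration. It shows a string such as "deste" being expanded into two tokens ("de" + "este") in one occurrence and kept as a single token (a form of the verb "dar") in another.

```python
from typing import Callable, List

# Illustrative lexicons (assumptions, not taken from the report).
# Strings treated here as always being contractions.
CONTRACTIONS = {"do": ["de", "o"], "das": ["de", "as"]}
# Strings that are a contraction in some occurrences and a single word in others.
AMBIGUOUS = {
    "deste": ["de", "este"],    # or a form of the verb "dar"
    "consigo": ["com", "si"],   # or a form of the verb "conseguir"
    "pelo": ["por", "o"],       # or a form of the verb "pelar"
}

def tokenize(text: str, is_contraction: Callable[[List[str], int], bool]) -> List[str]:
    """Split on whitespace, expanding contractions into their parts.

    For ambiguous strings, the caller-supplied `is_contraction` decides,
    per occurrence, whether to expand. Expanded parts are lowercased.
    """
    strings = text.split()
    tokens: List[str] = []
    for i, s in enumerate(strings):
        key = s.lower()
        if key in CONTRACTIONS:
            tokens.extend(CONTRACTIONS[key])
        elif key in AMBIGUOUS and is_contraction(strings, i):
            tokens.extend(AMBIGUOUS[key])
        else:
            tokens.append(s)
    return tokens

def toy_decider(strings: List[str], i: int) -> bool:
    """Hypothetical stand-in for a real disambiguator.

    Crude assumption: right after a subject pronoun or "não", the string
    is read as a verb form (one token); otherwise as a contraction.
    """
    return i == 0 or strings[i - 1].lower() not in {"eu", "tu", "não"}

print(tokenize("gosto deste livro", toy_decider))
print(tokenize("tu deste o livro", toy_decider))
```

In practice the decision would need far more context than the preceding word; the callback is there only to mark where an occurrence-level decision has to be made.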

Bibliographic Details
Main Author: Branco, António Horta (author)
Other Authors: Silva, João (author)
Format: report
Language: por
Published: 2009
Subjects:
Online Access: http://hdl.handle.net/10451/14199
Country: Portugal
OAI: oai:repositorio.ul.pt:10451/14199