Using named entity recognition for relevance detection in social network messages

The continuous growth of social networks in the past decade has led to massive amounts of information being generated on a daily-basis. While a lot of this information is merely personal or simply irrelevant to a general audience, relevant news being transmitted through social networks is an increas...

Full description

Bibliographic Details
Main Author: Filipe Daniel da Gama Batista (author)
Format: masterThesis
Language:eng
Published: 2017
Subjects:
Online Access:https://repositorio-aberto.up.pt/handle/10216/106160
Country:Portugal
Oai:oai:repositorio-aberto.up.pt:10216/106160
Description
Summary:The continuous growth of social networks in the past decade has led to massive amounts of information being generated on a daily-basis. While a lot of this information is merely personal or simply irrelevant to a general audience, relevant news being transmitted through social networks is an increasingly common phenomenon, and therefore detecting such news automatically has become a field of interest and active research.The contribution of the present thesis consisted in studying the importance of named entities in the task of relevance detection. With that in mind, the goal of this work was twofold: 1) to implement or find the best named entity recognition tools for social media texts, and 2) to analyze the importance of extracted entities from posts as features for relevance detection with machine learning. There are already well-known named entity recognition tools, however, most state-of-the-art tools for named entity recognition show significant decrease of performance when tested on social media texts, in comparison to news media texts. This is mainly due to the informal character of social media texts: the absence of context, the lack of proper punctuation, wrong capitalization, the use of characters to represent emoticons, spelling errors and even the use of different languages in the same text. To address these problems, four different state-of-the-art toolkits - Stanford NLP, GATE with TwitIE, Twitter NLP tools and OpenNLP - were tested on social media datasets. In addition, we tried to understand how differently these toolkits predicted Named Entities, in terms of their precision and recall for three different entity types (Person, Location, Organization), and how they could complement each other in this task in order to achieve a combined performance superior to each individual one, creating an ensemble of toolkits.Following the extraction of entities using the developed Ensemble, different features were generated based on these entities. These features included the number of persons, locations and organizations mentioned in a post, statistics retrieved from The Guardian's open API, and were also combined with word embeddings features. Multiple machine learning models were then trained on a manually annotated datasets of tweets. The obtained performances of different combinations of selected features, ML algorithms, hyperparameters, and datasets, were analyzed. Our results showed that using an ensemble of toolkits can improve the recognition of specific entity types, depending on the criteria used for the voting, and even the overall performance average of the entity types Person, Location, and Organization. The relevance analysis showed that Named Entities can indeed be useful for relevance detection, proving to be useful not only when used alone, achieving up to 74% of AUC, but also helpful when combined with other features such as word embeddings, achieving a maximum AUC of 94%, a 2.6% improve over word embeddings alone.