Resumo: | To correctly assess the precision of a classification model, previously labeled data is needed to validate the model's output. Data can be labeled either manually by humans or automatically by computers. In this dissertation, an automatic system was designed and built to assess the precision of a classification model without any human involvement in the data labeling process. The goal of the classification model used as the basis of this project is to identify newsworthy social network messages. The model takes advantage of the vast amount of information spread across social networks and aims to filter relevant data that may contain important information from a journalistic point of view. To assess the precision of the classification model, social network messages need to be labeled as newsworthy or not, which can be done manually. While manual labeling is fundamental to train the model at a first stage, its monetary, time and precision costs do not allow the procedure to be repeated regularly. Yet labeled data is essential to train our models and to determine their accuracy. For this reason, and to avoid the downsides of manual labeling, a four-stage automatic system was created. The first stage is the collection of data, both messages and news articles. The collected messages will be classified based on the news articles also gathered. The second stage is information extraction. Here, the system analyzes the information present in the different texts using several information extraction techniques, such as named entity recognition and keyword detection. The results are represented as a standardized feature vector for both messages and news articles. The third stage is the matching of news articles and social media messages based on content similarity. When a message is associated with the content of a news article, it is labeled as news related. 
The final stage, message classification, distinguishes news-relevant messages from non-relevant ones. This process is assisted by a filtering model, which helps exclude weak matches: cases where a message and a news article share similar information, but that information is not relevant or newsworthy. The matching method was validated during its development. The final system achieves a precision of over 80% in labeling newsworthy social network messages. Beyond this, the techniques and mechanisms developed in this dissertation can be applied to other uses within the media and journalism world. For example, the research can be directed at finding potentially contradictory information in social network messages, helping news organizations update their stories as live information comes in. Another application might be the detection of breaking news and crisis events.
|