Resumo: | Social media has proven to be an excellent resource for connecting people and creating a parallel community, which makes it a suitable source for extracting information about real-world events and about its users. This information can be carefully rearranged for social monitoring purposes and for the good of the community. To extract health evidence from social media, we started by analyzing and identifying postpartum depression in social media posts. We participated in an online challenge, eRisk 2020, continuing the previous participation of BioInfo@UAVR, in which we predicted users at risk of self-harm based on their publications on Reddit. We built an algorithm based on Natural Language Processing methods that pre-processes text data and vectorizes it. We use linguistic features based on the frequency of specific sets of words, as well as other widely used models that represent whole documents as vectors, such as Tf-Idf and Doc2Vec. The vectors and their corresponding labels are then passed to a Machine Learning classifier to train it. Based on the patterns it finds, the model predicts a classification for unlabeled users. We use multiple classifiers to find the one that performs best on the data. To get the most out of the model, an optimization step is performed in which we remove stop words and set the text vectorization algorithms and the classifier to run in parallel. An analysis of feature importance is integrated and a validation step is performed. The results are discussed and presented in various plots, and include a comparison between different tuning strategies and the relation between the parameters and the score. We conclude that the choice of parameters is essential for achieving a better score and that, for finding them, there are strategies more efficient than the widely used Grid Search.
Finally, we compare several approaches for building an incremental classification based on the post timeline of the users, and conclude that it is possible to have a chronological perception of certain traits of Reddit users, specifically evaluating the risk of self-harm with an F1 score of 0.73.
|