Resumo: | The internet plays an important role in our society, particularly in the circulation of political ideas [2]. Aware of this, political actors have been using the web’s potential to invigorate their campaigns [4]; Obama’s 2008 presidential campaign is a well-known example [1]. In this study, which is part of a larger one about European elections, we examine how the Portuguese used online media in relation to political involvement in the 2019 European Parliament election. The project, which is being developed in partnership with Netquest (an opinion and market research company), uses a database of web navigation actions (WNA) from the company’s Internet user panel in Portugal. The data set covers navigation actions on computer and mobile devices for a sample of 1,288 users. The data were collected between April 26 and June 26, 2019 (a two-month period around the election, held in Portugal on May 26) and contain 20,137,355 WNA. To analyze this data set we applied the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology [3]. After business understanding and data understanding, we are now in phase 3, data preparation, which is the most time-consuming task. In this phase, a binary variable was added to indicate whether or not a WNA refers to an online media outlet. This classification was based on the list of media provided by the Entidade Reguladora para a Comunicação Social (the Portuguese Regulatory Authority for the Media). The next step is to identify which of the WNA on Portuguese media are about politics. First, we select the WNA whose URL contains, as a subdomain or tab, one of the words “politica” (politics), “eleicao” (election) or “eleicoes” (elections). Other options are to apply text mining to the news titles in the WNA URL, or to use HTML scraping and text mining algorithms to analyze the online news content. This study focuses on the challenges we faced during data preparation.
To ensure that the database is consistent and free of duplicate or redundant information, it was necessary to understand what each variable actually represents. We then recoded the data to numerical values and grouped some variables, such as region, level of education and area of study. We also converted date of birth into age and standardized the time spent online. In addition, the mobile and desktop WNA information was harmonized so that both are expressed in the same way. Finally, we present the preliminary results of the identification of WNA related to political issues; an exploratory analysis of this information will also be carried out.
|
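The keyword-based selection of political WNA described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `example.pt` URLs, and the reading of "tab" as the first path segment of the URL are all assumptions made for illustration. Accents are stripped so that "política" and "politica" are treated alike.

```python
import unicodedata
from urllib.parse import unquote, urlsplit

# Keywords taken from the abstract (unaccented forms).
POLITICS_KEYWORDS = ("politica", "eleicao", "eleicoes")


def strip_accents(text):
    """Remove diacritics, e.g. 'política' -> 'politica'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_politics_url(url):
    """Return True if a subdomain label or the first path segment
    (assumed here to be the site's section, or 'tab') contains a keyword."""
    parts = urlsplit(url.lower())
    labels = (parts.hostname or "").split(".")
    segments = [s for s in parts.path.split("/") if s]
    candidates = labels + segments[:1]  # assumption: only the first segment is the tab
    for token in candidates:
        token = strip_accents(unquote(token))
        if any(keyword in token for keyword in POLITICS_KEYWORDS):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "https://politica.example.pt/noticia/123",
        "https://www.example.pt/eleicoes/europeias-2019",
        "https://www.example.pt/desporto/futebol",
    ):
        print(url, is_politics_url(url))
```

Restricting the match to subdomains and the first path segment keeps article slugs that merely mention a keyword from being flagged. Those cases would fall under the title-based text mining option mentioned in the abstract.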