In Multimedia Tools and Applications
Social media is increasingly dominant in everyday life around the world. YouTube content is a resource that may be useful in computational social science for understanding key questions about society. Using this resource, we performed web scraping to create a dataset of 644,575 video transcriptions concerning net activism and whistleblowing. We automatically extracted linguistic features to represent each video by its title, description and transcription (downloaded metadata). We then cleaned the dataset, using automatic clustering over this linguistic representation to identify unmatched videos and noisy keywords. Excluding videos by these keywords reduced the dataset by about 95%, to 35,730 video transcriptions. Next, we clustered the videos again automatically using a lexical representation and split the dataset into subsets, yielding hundreds of clusters that we interpreted manually to identify a hierarchy of topics of interest concerning whistleblowing. We used the dataset to learn a lexical representation for a specific topic and to detect unknown whistleblowing videos for this topic; the accuracy of this detection is 57.4%. We also used the dataset to identify interesting contextual linguistic markers around the names of whistleblowers. From a given list of names, we automatically extracted all 5-gram word sequences from the dataset and identified interesting markers in the left and right contexts of each name by manual interpretation. The study yields the following results: a dataset (raw and cleaned collections) concerning whistleblowing, a hierarchy of topics about whistleblowing, the automatic prediction of whistleblowing videos, and a semi-automatic semantic analysis of markers around whistleblower names. This text mining analysis can be exploited for digital sociology and e-democracy studies.
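The 5-gram extraction around whistleblower names described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the function names (`tokenize`, `ngrams`, `context_markers`), the restriction to single-token names, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n=5):
    """Yield every contiguous window of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield tokens[i:i + n]


def context_markers(texts, names, n=5):
    """Count the words found left and right of each name inside n-gram windows.

    Assumes single-token names. Overlapping windows are counted separately,
    so a word close to the name is counted more often than a distant one.
    """
    targets = {name.lower() for name in names}
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        for gram in ngrams(tokenize(text), n):
            for pos, tok in enumerate(gram):
                if tok in targets:
                    left[tok].update(gram[:pos])
                    right[tok].update(gram[pos + 1:])
    return left, right


# Hypothetical one-sentence "transcription", used only to exercise the functions.
sample = ["The whistleblower Snowden leaked secret documents"]
left_ctx, right_ctx = context_markers(sample, ["Snowden"])
print(left_ctx["snowden"].most_common(3))
print(right_ctx["snowden"].most_common(3))
```

The resulting counts only rank candidate context words; as in the study, deciding which of them are meaningful markers would remain a manual interpretation step.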
Turenne Nicolas
2022-Sep-29
Computational social science, Machine learning, Natural language processing, Net activism, Social media, Text mining