ArXiv Preprint
Billions of people across the globe have been using social media platforms in
their local languages to voice their opinions about the various topics related
to the COVID-19 pandemic. Several organizations, including the World Health
Organization, have developed automated social media analysis tools that
classify COVID-19-related tweets into various topics. However, these tools that
help combat the pandemic are limited to very few languages, making several
countries unable to take their benefit. While multi-lingual or low-resource
language-specific tools are being developed, they still need to expand their
coverage, such as for the Nepali language. In this paper, we identify the eight
most common COVID-19 discussion topics among the Twitter community using the
Nepali language, set up an online platform to automatically gather Nepali
tweets containing the COVID-19-related keywords, classify the tweets into the
eight topics, and visualize the results across the period in a web-based
dashboard. We compare the performance of two state-of-the-art multi-lingual
language models for Nepali tweet classification, one generic (mBERT) and the
other Nepali language family-specific model (MuRIL). Our results show that the
models' relative performance depends on the data size, with MuRIL doing better
for a larger dataset. The annotated data, models, and the web-based dashboard
are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.
Rabin Adhikari, Safal Thapaliya, Nirajan Basnet, Samip Poudel, Aman Shakya, Bishesh Khanal
2022-10-11