arXiv Preprint
Recent years have seen a surge of research interest in automatic mental health
detection (MHD) from social media data, driven by advances in natural language
processing and machine learning. While significant
progress has been achieved in this interdisciplinary research area, the vast
majority of work has treated MHD as a binary classification task. The
multiclass classification setup is, however, essential if we are to uncover the
subtle differences among the statistical patterns of language use associated
with particular mental health conditions. Here, we report on experiments aimed
at predicting six conditions (anxiety, attention deficit hyperactivity
disorder, bipolar disorder, post-traumatic stress disorder, depression, and
psychological stress) from Reddit social media posts. We explore and compare
the performance of hybrid and ensemble models leveraging transformer-based
architectures (BERT and RoBERTa) and BiLSTM neural networks trained on
within-text distributions of a diverse set of linguistic features. This set
encompasses measures of syntactic complexity, lexical sophistication and
diversity, readability, and register-specific n-gram frequencies, as well as
sentiment and emotion lexicons. In addition, we conduct feature ablation
experiments to investigate which types of features are most indicative of
particular mental health conditions.
Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz
2022-12-19
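
The feature ablation idea mentioned in the abstract can be illustrated with a
minimal sketch: train a classifier on all feature groups, then retrain with one
group removed at a time and record the drop in accuracy. Everything below is an
assumption made for illustration and is not the authors' setup: the group
names only echo the abstract, the toy data is invented, and a nearest-centroid
classifier stands in for the BiLSTM and transformer models.

```python
from statistics import mean

# Hypothetical feature groups mapped to column indices (illustrative only).
FEATURE_GROUPS = {
    "syntactic_complexity": [0],
    "lexical_richness": [1],
    "sentiment_emotion": [2],
}

# Invented toy data: three of the six conditions, two training posts each.
X_TRAIN = [
    [0.5, 0.0, 0.0], [0.5, 0.1, 0.0],   # anxiety
    [0.5, 1.0, 0.1], [0.5, 0.9, 0.1],   # depression
    [0.5, 0.8, 1.0], [0.5, 0.8, 0.9],   # stress
]
Y_TRAIN = ["anxiety", "anxiety", "depression", "depression", "stress", "stress"]
X_TEST = [[0.5, 0.05, 0.02], [0.5, 0.95, 0.0], [0.5, 0.9, 0.95]]
Y_TEST = ["anxiety", "depression", "stress"]


def fit_centroids(X, y, cols):
    """Return the per-class mean vector over the selected columns."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(r[c] for r in rows) for c in cols]
    return centroids


def predict(centroids, x, cols):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    def dist(label):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))
    return min(sorted(centroids), key=dist)


def accuracy(X_train, y_train, X_test, y_test, cols):
    """Fit on the training split and score on the test split."""
    centroids = fit_centroids(X_train, y_train, cols)
    hits = sum(predict(centroids, x, cols) == lab
               for x, lab in zip(X_test, y_test))
    return hits / len(y_test)


def ablation(X_train, y_train, X_test, y_test, groups):
    """Return baseline accuracy and the accuracy drop per removed group."""
    all_cols = sorted(c for cols in groups.values() for c in cols)
    baseline = accuracy(X_train, y_train, X_test, y_test, all_cols)
    drops = {}
    for name, cols in groups.items():
        kept = [c for c in all_cols if c not in cols]
        drops[name] = baseline - accuracy(X_train, y_train, X_test, y_test, kept)
    return baseline, drops


baseline, drops = ablation(X_TRAIN, Y_TRAIN, X_TEST, Y_TEST, FEATURE_GROUPS)
for name, drop in drops.items():
    print(f"without {name}: accuracy drop = {drop:.2f}")
```

A larger drop suggests that the removed group carries more class-specific
signal. In this toy data the first column is constant by construction, so
removing it changes nothing, while removing either of the other two groups
causes one of the three test posts to be misclassified.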