Ranking Subreddits by Classifier Indistinguishability in the Reddit Corpus

Alquaddoomi, Faisal; Estrin, Deborah

eKNOW 2018, The Tenth International Conference on Information, Process, and Knowledge Management

Authors:
Faisal Alquaddoomi
Deborah Estrin

Keywords: Natural language processing; Web mining; Clustering methods

Abstract:
Reddit, a popular online forum, provides a wealth of content for behavioral science researchers to analyze. These data are spread across various "subreddits", subforums dedicated to specific topics. Social support subreddits are common, and users' behavior there differs from behavior on Reddit at large; most significantly, users often use single-use "throwaway" accounts to disclose especially sensitive information. This work focuses specifically on identifying depression-relevant posts and, consequently, subreddits, by relying only on posting content. We use posts to r/depression as labeled examples of depression-relevant posts and train a classifier to discriminate posts like them from posts randomly selected from the rest of the Reddit corpus, achieving 90% accuracy at this task. We argue that this high accuracy implies that the classifier is descriptive of "depression-like" posts, and we use its ability (or lack thereof) to distinguish posts from other subreddits as a measure of the "distance" between r/depression and those subreddits. To test this approach, we performed a pairwise comparison of classifier performance between r/depression and 229 candidate subreddits. Subreddits that are thematically very close to r/depression, such as r/SuicideWatch, r/offmychest, and r/anxiety, were the most difficult to distinguish. Comparing this ranking of subreddits similar to r/depression with the rankings produced by existing methods (some of which require extra data, such as user posting co-occurrence across multiple subreddits) yields similar results. Aside from the benefit of relying only on posting content, our method yields per-word importance values (heavily weighting words such as "I", "me", and "myself"), which recapitulate previous research on the linguistic phenomena that accompany mental health self-disclosure.
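
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier, so a small multinomial naive Bayes model is assumed here, and the posts, the subreddit names `toy_anxiety` and `toy_cooking`, and every function name are hypothetical stand-ins invented for illustration. For each candidate, a classifier is trained to separate reference posts from candidate posts, and the held-out accuracy serves as the "distance": the lower the accuracy, the harder the candidate is to tell apart from the reference.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization is an assumption made for brevity.
    return text.lower().split()


def train_naive_bayes(docs_by_label):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    counts = {label: Counter() for label in docs_by_label}
    for label, docs in docs_by_label.items():
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set().union(*counts.values())
    n_docs = sum(len(docs) for docs in docs_by_label.values())
    priors = {label: len(docs) / n_docs for label, docs in docs_by_label.items()}
    return counts, vocab, priors


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    counts, vocab, priors = model
    best_label, best_score = None, -math.inf
    for label in counts:
        total = sum(counts[label].values()) + len(vocab)
        score = math.log(priors[label])
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((counts[label][tok] + 1) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def pairwise_accuracy(ref_posts, cand_posts, n_test=1):
    """Train on all but the last n_test posts of each side, test on the rest."""
    model = train_naive_bayes({
        "reference": ref_posts[:-n_test],
        "candidate": cand_posts[:-n_test],
    })
    held_out = [(p, "reference") for p in ref_posts[-n_test:]]
    held_out += [(p, "candidate") for p in cand_posts[-n_test:]]
    correct = sum(predict(model, post) == label for post, label in held_out)
    return correct / len(held_out)


def rank_by_indistinguishability(ref_posts, candidates):
    """Sort candidates by accuracy, ascending: hardest to distinguish first."""
    scores = {name: pairwise_accuracy(ref_posts, posts)
              for name, posts in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])


# Made-up toy posts; the last post in each list is held out for testing.
REFERENCE = [
    "i feel so alone and tired",
    "i cannot sleep and i feel empty",
    "i feel tired and empty",
]
CANDIDATES = {
    "toy_cooking": [
        "roast the chicken with garlic",
        "bake the bread for an hour",
        "roast the bread with garlic",
    ],
    "toy_anxiety": [
        "i feel so anxious and tired",
        "i cannot sleep and i feel scared",
        "i feel alone and tired",
    ],
}

ranking = rank_by_indistinguishability(REFERENCE, CANDIDATES)
for name, accuracy in ranking:
    print(f"{name}: held-out accuracy {accuracy:.2f}")
```

On this toy data the candidate that shares most of its vocabulary with the reference ranks first, because the classifier mislabels its held-out post. A realistic experiment would need far more posts per subreddit and a proper cross-validation scheme than this single held-out split provides.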

Pages: 128 to 133

Copyright: Copyright (c) IARIA, 2018

Publication date: March 25, 2018

Published in: conference

ISSN: 2308-4375

ISBN: 978-1-61208-620-0

Location: Rome, Italy

Dates: from March 25, 2018 to March 29, 2018