HUSO 2018, The Fourth International Conference on Human and Social Analytics


Fast Extraction of Statistically Relevant Descriptor Words for Social Media Communities

Authors:
Arces A. Talavera
Arnulfo P. Azcarraga

Keywords: Random Projection; Dimensionality Reduction; Social Media; Text Analytics

Abstract:
Social media communities can be characterized by descriptor words that are frequently used by their members but less often used in other communities. These words can be extracted by computing a descriptor index for each word and choosing the words with the highest index. The novel descriptor index proposed here is based on the z-score, which measures the frequency of a word in a given community relative to its frequency in all the communities combined, scaled by a statistical standard error. The z-score measure is validated by comparing the words it extracts with the words extracted using the fairly popular Term Frequency-Inverse Document Frequency (TF-IDF) and the Lagus method. Once it is established that z-scores can be used to extract descriptor words, the next hurdle is to reduce the dimensionality of the vector space model, in which each word that appears in any of the communities' messages constitutes one dimension. The solution explored here, used in tandem with z-scores as the descriptor index, is the Random Projection method. With this dimensionality reduction method, more than 40,000 unique words (dimensions) are randomly projected to as few as 400 dimensions (a 99% reduction), and yet the proposed scheme still extracts essentially the same descriptor words for each community. To evaluate the combined use of z-scores and Random Projection, and to determine suitable parameter values for the Random Projection method, 10 communities on Facebook were selected. Despite using only 1% of the original number of dimensions, 85% of the top 10 descriptor words extracted with only 400 dimensions match those extracted with all 40,000 dimensions.
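As an illustration of the z-score idea described in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formula, so the sketch assumes a standard one-sample proportion z-score: the word's relative frequency in a community minus its relative frequency in the pooled corpus, divided by the standard error of the pooled proportion at the community's sample size. The function names `descriptor_z_scores` and `top_descriptors` and the toy data are hypothetical. Random Projection is not shown.

```python
import math
from collections import Counter


def descriptor_z_scores(community_tokens):
    """Score every word in every community.

    community_tokens maps a community name to its list of word tokens.
    ASSUMPTION: z = (p_c - p) / sqrt(p * (1 - p) / n_c), where p_c is the
    word's relative frequency in the community, p its relative frequency
    in all communities combined, and n_c the community's token count.
    The paper's exact index may differ.
    """
    counts = {c: Counter(toks) for c, toks in community_tokens.items()}
    sizes = {c: len(toks) for c, toks in community_tokens.items()}

    # Pool the word counts of all communities.
    pooled = Counter()
    for cnt in counts.values():
        pooled.update(cnt)
    total = sum(sizes.values())

    scores = {}
    for c, cnt in counts.items():
        n = sizes[c]
        scores[c] = {}
        if n == 0:
            continue  # an empty community has no defined scores
        for word, k in pooled.items():
            p = k / total
            se = math.sqrt(p * (1 - p) / n)
            if se == 0:
                continue  # degenerate case: the corpus has a single word
            scores[c][word] = (cnt[word] / n - p) / se
    return scores


def top_descriptors(scores, community, k=10):
    """Return the k words with the highest z-score for one community."""
    ranked = sorted(scores[community].items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Toy data, invented for illustration only.
communities = {
    "cooking": "recipe oven recipe bake the the".split(),
    "cycling": "bike ride bike gear the the".split(),
}
scores = descriptor_z_scores(communities)
print(top_descriptors(scores, "cooking", k=1))
```

In this toy example a word such as "the", which is equally frequent in both communities, receives a z-score of zero, while a word concentrated in one community scores positively there and negatively elsewhere.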

Pages: 24 to 30

Copyright: Copyright (c) IARIA, 2018

Publication date: June 24, 2018

Published in: conference

ISSN: 2519-8351

ISBN: 978-1-61208-648-4

Location: Venice, Italy

Dates: from June 24, 2018 to June 28, 2018