TY - GEN
T1 - Surpassing the limit
T2 - 26th ACM Conference on Hypertext and Social Media, HT 2015
AU - Sampson, Justin
AU - Morstatter, Fred
AU - Maciejewski, Ross
AU - Liu, Huan
N1 - Funding Information: This work is sponsored, in part, by Office of Naval Research grant N000141410095. Publisher Copyright: © 2015 ACM.
PY - 2015/8/24
Y1 - 2015/8/24
AB - Social media services have become a prominent source of research data for both academia and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site Twitter has garnered a particularly large following from researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Since Twitter maintains sole control of the sampling process, it leaves us with no knowledge of how the data that we collect for research is selected. Additionally, past research has shown that there are sources of bias present in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter limit track as a sample population estimator. After that, we provide methods to mitigate bias by improving sample population coverage using clustering techniques.
KW - Clustering
KW - Social media
KW - Text processing
UR - http://www.scopus.com/inward/record.url?scp=84951875749&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84951875749&partnerID=8YFLogxK
U2 - 10.1145/2700171.2791030
DO - 10.1145/2700171.2791030
M3 - Conference contribution
T3 - HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media
SP - 237
EP - 245
BT - HT 2015 - Proceedings of the 26th ACM Conference on Hypertext and Social Media
PB - Association for Computing Machinery, Inc
Y2 - 1 September 2015 through 4 September 2015
ER -