TY - GEN
T1 - Can One Tamper with the Sample API?
AU - Morstatter, Fred
AU - Dani, Harsh
AU - Sampson, Justin
AU - Liu, Huan
N1 - Funding Information: This work is sponsored, in part, by Office of Naval Research (ONR) grant N000141410095 and by the Department of Defense under the MINERVA initiative through the ONR N00014131083. Publisher Copyright: © 2016 owner/author(s).
PY - 2016/4/11
Y1 - 2016/4/11
N2 - While social media mining continues to be an active area of research, obtaining data for research is a perennial problem. Even more, obtaining unbiased data is a challenge for researchers who wish to study human behavior, and not technical artifacts induced by the sampling algorithm of a social media site. In this work, we evaluate one social media data outlet that gives data to its users in the form of a stream: Twitter's Sample API. We show that in its current form, this API can be poisoned by bots or spammers who wish to promote their content, jeopardizing the credibility of the data collected through this API. We design a proof-of-concept algorithm that shows how malicious users could increase the probability of their content appearing in the Sample API, thus biasing the content towards spam and bot content and harming the representativity of this data outlet.
KW - data mining
KW - data sampling
KW - sample bias
UR - http://www.scopus.com/inward/record.url?scp=85129823587&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129823587&partnerID=8YFLogxK
U2 - 10.1145/2872518.2889372
DO - 10.1145/2872518.2889372
M3 - Conference contribution
T3 - WWW 2016 Companion - Proceedings of the 25th International Conference on World Wide Web
SP - 81
EP - 82
BT - WWW 2016 Companion - Proceedings of the 25th International Conference on World Wide Web
PB - Association for Computing Machinery, Inc
T2 - 25th International Conference on World Wide Web, WWW 2016
Y2 - 11 April 2016 through 15 April 2016
ER -