Use Twitter as a Data Source with Caution

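#Research Share# 【Use Twitter as a data source with caution】 “What does Twitter think?” As a data journalist, you will hear this kind of question sooner or later. As a conversational platform or medium, Twitter is well suited to gauging what people think about events, but different versions of Twitter data often lead to different, sometimes mutually contradictory, conclusions. For example, a major drawback of the Twitter API is the lack of documentation about what data is collected and how much, which leads people to question how representative the sample is. This is not to say that Twitter has no value as a data source; rather, users should stay alert to its limitations before relying on it.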

【Full text】Think twice before using Twitter as a data source

At Source, Jacob Harris notes that, as appealing as the Twitter API may be as a source of what-people-are-thinking data, it’s hardly perfect. Twitter’s demographics aren’t representative of either Internet users as a whole or the broader population, and geocoding data is sparse and inconsistent.

 

“But what does Twitter think?”

 

If you are a data journalist in a newsroom, you will hear this question sooner or later in your career. It doesn’t really matter the context—I’ve heard it asked about everything from the Academy Awards to the Westminster Dog Show to presidential debates or the death of Osama bin Laden. And why not? It certainly sounds like a great idea. Twitter is such a conversational medium, it seems like an easy way to dip into the mindset of the world to see what they think. But as with all great ideas, it’s very easy to go wrong.

 

I’d add that different versions of Twitter data can return different, sometimes even mutually contradictory, results, as explored in this paper by Morstatter, Pfeffer, Liu, and Carley, which compares the results generated by Twitter’s Streaming API (which provides only a sampled subset of all tweets) with those from the full (expensive) Twitter firehose:

 

The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter’s sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.
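
To make the idea of comparing a sample against the full stream concrete, here is a minimal sketch in Python. It is not the paper's code: the data is synthetic, the helper names (`top_k`, `jaccard`, `kendall_tau`) are hypothetical, and the sample is drawn uniformly at random, which a real API sample may not be. It checks two things: how much the top hashtag lists overlap, and how closely their orderings agree.

```python
from collections import Counter
from itertools import combinations
import random


def top_k(tags, k):
    """Return the k most frequent hashtags, most frequent first."""
    counts = Counter(tags)
    # Sort by count descending, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:k]


def jaccard(a, b):
    """Overlap of two collections as |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a over the items present in both rankings (no ties)."""
    shared = [t for t in rank_a if t in rank_b]
    if len(shared) < 2:
        return float("nan")
    pos_b = {t: i for i, t in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        # x precedes y in rank_a; check whether rank_b agrees.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a complete stream: 50 hashtags, skewed popularity.
    vocab = [f"#tag{i}" for i in range(50)]
    weights = [1 / (i + 1) for i in range(50)]
    full = rng.choices(vocab, weights=weights, k=100_000)
    # Assumption: a uniform 1% sample of the full stream.
    sample = rng.sample(full, k=1_000)

    full_top, sample_top = top_k(full, 10), top_k(sample, 10)
    print("Jaccard overlap of top 10:", jaccard(full_top, sample_top))
    print("Kendall tau on shared tags:", kendall_tau(full_top, sample_top))
```

With real data, `full` and `sample` would be the hashtags extracted from the two collections for the same time window. Values near 1 on both measures suggest the sample preserves the top of the distribution; lower values are a sign that conclusions drawn from the sample may not hold for the full stream.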

 

All that’s not to say that Twitter’s not incredibly useful as a data source — lord knows we like what it does for Fuego, and few tools can provide the slice of broader sentiment it does as quickly and easily. Just be aware of its limits and be careful about assuming what it produces represents any larger reality.

【Author】Joshua Benton

【Source】niemanlab

【Link】http://www.niemanlab.org/2013/07/think-twice-before-using-twitter-as-a-data-source/


1 comment

  1. fanboyang said:

    Thanks for sharing, http://e.weibo.com/1711479641/A0BTmnedZ


