Big Data Analysis: Challenges and Problems

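#Research Sharing# [Big Data Analysis: Challenges and Problems] Researcher Tufekci argues that big data analysis faces methodological and conceptual challenges: 1. insufficient attention to the explicit and implicit structural biases of the platforms used to collect data; 2. most big data analyses draw on a single platform and so neglect the ecology of information flows; 3. network methods borrowed from other fields need careful evaluation of their applicability when used to analyze people's social media interactions; 4. many big datasets contain only information on node-to-node interactions, yet "field effects" are also an important part of human socio-cultural experience; 5. the relationship between network structure and other attributes is complex and multi-faceted; 6. human reflexivity needs to be taken into account in the analysis.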

[Full article] Big Data: Pitfalls, Methods and Concepts for an Emergent Field


New research suggests using big data, particularly social media data, can lead to a biased representation of the data based on societal factors.

 

Striking new research out of Princeton University’s Center for Information Technology Policy and the University of North Carolina at Chapel Hill suggests that inferences based on how people use social media platforms like Twitter and Facebook should be reconsidered. The reason? These platforms represent skewed samples from which it is difficult to draw accurate conclusions.

 

In her draft paper, Big Data: Pitfalls, Methods and Concepts for an Emergent Field, UNC professor and Princeton CITP fellow Zeynep Tufekci (@zeynep) compares the methodological challenges of developing socially-based big data insights using Twitter to biological testing on Drosophila flies, better known as fruit flies. Drosophila flies are usually chosen because they’re relatively easy to use in lab settings, easy to breed, have rapid and “stereotypical” life cycles, and the adults are pretty small. The problem? They’re not necessarily representative of non-lab (read: real-life) scenarios. Tufekci posits that the dominance of Twitter as the “model organism” for social media in big data analyses similarly skews analysis:

 

Each social media platform carries with it certain affordances which structure its social norms and interactions and may not be representative of other social media platforms, or general human social behavior …

 

Twitter is used by about 10% of the U.S. population, which is certainly far, far from a representative sample. While Facebook has a wider diffusion rate, its rates of use are structured by race, gender, class and other factors and are not representative. Using these sources as “big data” model organisms raises important questions of representation and visibility as demographic or social groups may have different behavior — online and offline — and may not be fully represented or even sampled via current methods.
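A rough way to see why a skewed user base matters is to compare the overall rate of some behavior implied by a platform's mix of users with the rate implied by the population's mix. The sketch below is only an illustration: the group names, shares, and behavior rates are all invented, and `weighted_rate` is a hypothetical helper, not something taken from the paper.

```python
# Hypothetical shares of two demographic groups in the general population
# and among a platform's users. All numbers are invented for illustration.
population_share = {"group_a": 0.5, "group_b": 0.5}
platform_share = {"group_a": 0.8, "group_b": 0.2}

# Hypothetical rate of some behavior within each group.
behavior_rate = {"group_a": 0.30, "group_b": 0.10}


def weighted_rate(shares: dict, rates: dict) -> float:
    """Overall rate implied by a given mix of groups."""
    return sum(shares[group] * rates[group] for group in shares)


# What an analyst would measure on the platform vs. the population value.
platform_estimate = weighted_rate(platform_share, behavior_rate)
population_value = weighted_rate(population_share, behavior_rate)

print(round(platform_estimate, 2))  # 0.26
print(round(population_value, 2))   # 0.2
```

Reweighting by known population shares only helps when every group appears in the sample and behaves on the platform as it does elsewhere; it cannot recover groups that are never sampled at all, which is part of the concern raised in the passage above.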

 

Tufekci says that one of the biggest methodological dangers of big data analysis is “insufficient understanding of the underlying samples.” In her words,

 

It’s not enough to understand how many people have “liked” a Facebook status update, clicked on a link, or “retweeted” a message, without having a sense of how many people saw and chose to (or not to) take that option. That kind of normalization is rarely done, or may even be actively decided against because the results start appearing more complex or more trivial.
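The normalization described in this passage can be sketched in a few lines of Python. The post names, like counts, and view counts are invented for illustration, and `engagement_rate` is a hypothetical helper rather than anything defined in the paper.

```python
# Hypothetical posts: raw "like" counts alongside how many people saw each post.
posts = {
    "post_a": {"likes": 500, "views": 100_000},
    "post_b": {"likes": 120, "views": 2_000},
}


def engagement_rate(likes: int, views: int) -> float:
    """Share of viewers who chose to like the post (the normalized measure)."""
    if views <= 0:
        raise ValueError("views must be positive to compute a rate")
    return likes / views


# Rank by the raw numerator alone, then by the normalized rate.
by_raw_count = sorted(posts, key=lambda p: posts[p]["likes"], reverse=True)
by_rate = sorted(
    posts,
    key=lambda p: engagement_rate(posts[p]["likes"], posts[p]["views"]),
    reverse=True,
)

print(by_raw_count)  # ['post_a', 'post_b']
print(by_rate)       # ['post_b', 'post_a']
```

In this made-up case the two rankings disagree: the post with more likes was also seen by far more people, so a smaller share of its viewers responded.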

 

On the conceptual side of the big data analysis challenge, Tufekci posits that more in-depth research is needed to understand exactly what social media footprints mean, and what can legitimately be inferred from big data analysis of those footprints.

 

A case in point: while retweets or mentions are often treated as a measure of “influence,” the meaning of a retweet could actually be something far different from influence, ranging from “affirmation to denunciation to sarcasm to approval to disgust.”
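As a small illustration of this point, the sketch below compares a naive retweet count with the same retweets broken down by intent. It assumes each retweet has already been hand-labeled with an intent; the users and labels are invented, and nothing here comes from the paper itself.

```python
from collections import Counter

# Hypothetical retweets of one message, each hand-labeled with the
# retweeter's apparent intent. The labels are illustrative assumptions.
retweets = [
    {"user": "u1", "intent": "approval"},
    {"user": "u2", "intent": "sarcasm"},
    {"user": "u3", "intent": "denunciation"},
    {"user": "u4", "intent": "approval"},
    {"user": "u5", "intent": "disgust"},
]

# Naive "influence" score: every retweet counts the same.
naive_score = len(retweets)

# Breaking the same count down by intent shows how mixed it is.
by_intent = Counter(r["intent"] for r in retweets)
endorsing = by_intent["approval"]

print(naive_score)  # 5
print(endorsing)    # 2
```

Real data rarely arrives with intent labels, so in practice this step would require manual coding or some other interpretive method.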

 

Tufekci makes three additional points regarding conceptual analysis of big data that can be applied in a business setting:

 

Not all networks operate the same way.

 

Are social media networks similar to airline networks? Methodologies need to rely on more than “they’re both networks” as a basis of comparison; it’s crucial to examine the specific properties of nodes, edges, connectivity, flow, interaction and structure in different networks to understand which methods can be carried over from one type of network to another.

 

Humans do not interact only in networks.

 

Human social information flows do not occur only through node-to-node networks, but also through field effects — large-scale societal events that impact a large group … through changes within whole social, cultural and political fields — that must be taken into consideration.

 

You name it, humans will game it.

 

People will create false hashtag trends. They will “subtweet” as a way of talking about a topic or person and deliberately misspell something, or leave out the @ sign, in order to not be visible in a measurable way. They will game algorithms and metrics. This should be expected in all analysis.
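To make the measurement gap concrete, here is a minimal sketch of a naive mention counter that looks only for explicit @mentions. The handle and the tweets are invented for illustration; real subtweets are far more varied than these examples.

```python
import re

# Naive metric: count explicit @mentions of a handle.
# The handle and the tweets below are invented for illustration.
HANDLE = "example_user"
mention_pattern = re.compile(rf"@{re.escape(HANDLE)}\b", re.IGNORECASE)

tweets = [
    "Great thread from @example_user today",   # explicit mention
    "example_user is at it again, I see",      # @ sign left out
    "Some people (exmple_user) never learn",   # deliberate misspelling
    "Not naming names, but you know who",      # pure subtweet
]

# Only tweets containing the literal @handle are picked up.
measured = [t for t in tweets if mention_pattern.search(t)]

print(len(tweets))    # 4
print(len(measured))  # 1
```

Three of the four tweets are about the same person, yet only one is visible to the metric.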

 

When I asked Tufekci how she thinks her research applies to business managers using online and social media data, she said it’s important to keep in mind that more data does not necessarily mean more insight.

 

“A lot of big data research is done in an isolated, one-shot, single-method manner with no way to assess, interpret or contextualize the findings,” she said. “There is great potential for error and misunderstanding; worse, with a lot of money flowing into this space, there is a lot of pressure to produce ‘results’ and overlook the fact that methods that were not developed to study humans, and do not necessarily work the same way, are being applied widely.

 

“The online imprints that create these large, aggregate datasets are not just mere ‘mirrors’ of human activity; rather, they are partial, filtered, distorted and complex reflections.”

 

 

Abstract:

 

Big Data, large-scale aggregate databases of imprints of online and social media activity, has captured scientific and policy attention. However, this emergent field is challenged by inadequate attention to methodological and conceptual issues.

 

I review key methodological and conceptual challenges including: 1) Inadequate attention to the implicit and explicit structural biases of the platform(s) most frequently used to generate datasets (the model organism problem). 2) The common practice of selecting on the dependent variable without corresponding attention to the complications of this path. 3) Lack of clarity with regard to sampling, universe and representativeness (the denominator problem). 4) Most big data analyses come from a single platform (hence missing the ecology of information flows).

 

Conceptual issues reviewed in this paper include: 1) More research is needed to interpret aggregated mediated interactions. Clicks, status updates, links, retweets, etc. are complex social interactions. 2) Network methods imported from other fields need to be carefully reconsidered to evaluate appropriateness for analyzing human social media imprints. 3) Most big datasets contain information only on “node-to-node” interaction. However, “field” effects – events that affect a society or a group in a wholesale fashion either through shared experience or through broadcast media – are an important part of human socio-cultural experience. 4) Human reflexivity – that humans will alter behaviors around metrics – needs to be assumed and built into the analysis. 5) Assuming additivity and counting interactions so that each new interaction is seen as (n+1) without regard to the semantics or context can be misleading. 6) The relationship between network structure and other attributes is complex and multi-faceted.

 

Number of Pages in PDF File: 24

 

Keywords: big data, social science, Twitter, Facebook, computer science, data science

 

working papers series

 

[Author] Zeynep Tufekci

[Source] MIT

[Article link] http://sloanreview.mit.edu/article/the-pitfalls-of-using-online-and-social-data-in-big-data-analysis/


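2 comments

  1. yangliming said:

    This paper has great academic value, but because it contains too much technical terminology and its content is too abstract for general readers to follow easily, it was not used on Weibo. I suggest trying to turn this paper into several Weibo posts that explain the six points it raises in more detail; that would help readers understand it and would work better.

  2. yangliming said:

    This paper has great academic value, but because it contains too much technical terminology and its content is too abstract for general readers to follow easily, it was not used on Weibo; apologies for that. I suggest trying to turn this paper into several Weibo posts that explain the six points it raises in more detail; that would help readers understand it and would work better.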


