I’ve come away from the DataEdge conference with some answers…and some more questions. While I don’t intend to recap the conference itself, I do want to take advantage of time spent with this diverse group of participants and their varied perspectives to try to offer the bigger-picture sense I’m starting to develop of the big data/data analytics trend.
The idea that big data might usher in a new era of automatic research, and along with it researcher de-skilling, or that it would render the scientific method obsolete did not prove to be a popular sentiment (*phew* sigh of relief). The point that data isn’t self-explanatory, that it needs to be interpreted, was reasserted many times during the conference by people who occupy very different roles in this data science world. No need to panic; let’s move along to some answers to those questions I raised in part I.
What is big data? Ok, this was not a question I raised going into the conference, but I should have. Perhaps unsurprisingly there wasn’t a clear consensus or a consistent definition that carried through the talks. I found myself at certain points wondering, “are we still talking about ‘big data’ or are we just talking about your standard, garden-variety statistics now?” At any rate, this confusion was productive and led me to identify three things that appear to be new in this discussion of data, statistics, and analysis.
(1) Type of data – we’re talking about a lot of new behavioral data, log data following from pervasive digitalization. One part of this is the trend towards the non-anonymous Internet. Lawrence Lessig points out, “as the Internet has matured, the technologies for linking behavior with an identity have increased dramatically” … (let’s set aside for now how the word ‘identity’ is being used/abused in that quote). Magnifying this trend is the more recent diffusion of smart phones and location-based services. We are getting a potentially finer-grained record of where people go and their patterns of moving through and inhabiting space (both online and offline). A whole lot of this data is collected without conscious awareness on the part of the individuals generating that data. (And of course this raises some privacy issues – a topic that received some focused discussion, specifically in a talk by danah boyd.)
Related to this was something Matthew Salganik (Asst Prof at Princeton) said in a session about the way sociologists study the things that concern them – about inequality and, for example, its relationship to race. One (of many) ways sociologists do this is by identifying where Caucasian, African-American, Asian, and Hispanic people live, looking at home address and the ethnic category of inhabitants to show clustering and spaces of segregation. Possibilities for much more fine-grained data are now emerging – we may be able to answer a question like: how do people segregate themselves across time and space over the course of days or weeks?
(2) Quantity of data – following from the fact that such data is automatically logged from platforms that may have millions of users (like Tweets or Google searches), we’re talking tera- or petabytes rather than merely mega- or gigabytes. That sheer quantity means that processing (cleaning, analyzing) such data will require algorithmic work. Using human labor to go through such data line by line or entry by entry is simply too costly and time-consuming. Additionally, the ‘rawness’ of much of this data means that even sampling from the data and then going to line-by-line analysis is still not really the answer to the challenge. The data may need to be pieced back together first – for example to relate coordinates to semantically understandable locations, or to link measured behaviors to individuals.
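To make the "piecing back together" step a little more concrete, here is a minimal sketch of relating raw coordinates to semantically understandable locations. It is only an illustration under my own assumptions: the place names, their coordinates, the `label_point` helper, and the 150-metre cutoff are all invented for the example, and a real pipeline would lean on a proper gazetteer or geocoding service.

```python
import math

# Hypothetical gazetteer: names and coordinates are invented for illustration.
PLACES = {
    "library": (37.8720, -122.2600),
    "cafe": (37.8700, -122.2680),
    "transit station": (37.8650, -122.2700),
}

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def label_point(point, places=PLACES, max_m=150):
    """Return the nearest named place within max_m metres, or None if none is close."""
    name, dist = min(
        ((n, haversine_m(point, c)) for n, c in places.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= max_m else None
```

Even in this toy version the interpretive choices are visible: someone has to decide which places count, and how close is close enough to say a person was "at" one of them.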
(3) Range and variety of data – where data is logged automatically, the work to gather (certain kinds of) information may be minimal (compared to sending out an army of workers to get 1000s of questionnaires completed). That means analysts are starting to consider the possibilities of looking at lots and lots of potential correlations, rather than driving such an exploration from a strong theoretical basis with a few carefully chosen variables (with warnings of course about the increased risk of finding spurious correlations). If such data may be processed algorithmically then it is possible to analyze from a total enumeration of a given population rather than a sample of that population.
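The warning about spurious correlations can be illustrated with a small simulation. This is a sketch under my own assumptions, not anything presented at the conference: every variable below is pure random noise, the function names are made up, and the 0.44 threshold is roughly the conventional 5% significance cutoff for a correlation computed over 20 observations.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def count_chance_hits(n_vars=200, n_obs=20, threshold=0.44, seed=0):
    """Count noise variables whose |r| with a noise target exceeds the threshold."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(target, noise)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # None of these variables is related to the target by construction,
    # so every "hit" reported here is a spurious correlation.
    print(count_chance_hits())
```

With a 5% cutoff, screening 200 unrelated variables should turn up about ten apparent "findings" on average, which is why looking at lots and lots of correlations without a theoretical basis calls for caution.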
Of course, I would step in here to note that such logging processes capture some kinds of things straightforwardly, but not others – the whole realm of opinion, intent, and meaning…things that are generally read interpretively like body language, or the more subtle semantics in many language practices. One participant at the conference, when pressed, could only come up with ‘emotion’ as the kind of thing that remains for humans to interpret and analyze…but the domain that is beyond the easy reach of ‘big data’ I believe is much more vast.
So where do we stand in relation to this phenomenon as ethnographers, or more generally, as researchers with a bent towards qualitative and interpretivist approaches?
There’s something that ethnographers have in common with big data enthusiasts, though neither group perhaps realizes this. Though ethnography has sometimes inaptly been equated out in the wider world with interview studies, it is the immersion of the ethnographer in a social world and the attempt to observe the phenomenon of interest as it unfolds that more distinctively characterizes such a methodological stance. Howard Becker states on the value attributed to this close observation by ethnographers, “the nearer we get to the conditions in which [the people we are studying] actually do attribute meanings to objects and events, the more accurate our description of those meanings are likely to be” (Becker 1996). It is this closeness to the phenomenon of interest that is a shared concern. There is a common understanding that what people say (out of context, in a private interview or survey) is not a transparent representation of what they do. Ethnographers get at this the labor-intensive way, by hanging around and witnessing things first hand. Big data people do it a different way, by figuring out ways to capture actions in the moment, i.e. someone clicked on this link, set that preference, moved from this wireless access point to that one at a particular time.
Of course a major and very important point here – ethnographers’ observations are NOT equivalent to what data logs record…and a critical point is that ethnographers don’t stop with the observation or treat it as inherently meaningful, but do a whole lot of complementary work to try to connect apparent behavior to underlying meaning. They often do this through casual in situ conversations or more formal interviews. Problematically, there is a notion that log data, because it is so close to the phenomenon, because it is captured automatically, is transparently truthful as opposed to the kind of self-reported information that is met with skepticism and shouts of … “people lie! they forget! they estimate poorly!” One panelist (who shall remain nameless) commented early in the conference “data has become the source of truth.” But ‘big data’ is still just as prone to misinterpretation as other kinds of captured data. The misinterpretations may be arrived at differently, but they are still very much possible.
- Now on to a question I posed before the conference: “How do ‘big data’ analysts connect data on behavior to the meaning/intent underlying that behavior? How do they avoid (or how do they think they can avoid) getting this wrong?” I found no profound or new answers to this question at the conference. It’s a thorny and intractable challenge. That data must be interpreted and that this work can be tricky was a recurring refrain. A shortage of data scientists was lamented. Much talk focused on ways of simplifying and automating some of the simpler applications of big data, mostly for marketing and other business uses. On the thorniness of arriving at ‘meaning’ from digital traces, danah boyd, in her talk, drew a connection between privacy claims and the practices of coded communication among teens on social networking sites. Their traces of behavior and even tweets, Facebook status updates, etc. become as impossible to decipher as ancient runes. How populations cope with marginality, the careful boundary management of subcultural groups, is something I’m quite familiar with from time spent with Internet scammers in Ghana whose coded language draws from pidgin, Twi, Hausa, and English. Jay-Z’s book Decoded, in which page after page of rap music lyrics are parsed and patiently explained in relation to experiences of urban poverty and struggle, also comes to mind.
- Another question I posed is a relevant follow-on: “The data analytics discussion appears to be US-centric debate … how well are researchers grappling with the analysis of ‘big data’ when dealing with data collected from across heterogeneous, international populations?” This issue was brought up briefly in Hal Varian’s talk, in the context of an explanation of how to use Google search data to develop models that make better economic predictions. He noted, of some related work in Chile where only 30% or so of the population accesses the Internet, that these users are not by any stretch of the imagination “representative.” Yet, for economic predictions, as the most affluent members of society who thus lead economic trends, they are sufficient. We should always keep in mind, even when we have data that is very, very, very big, that some people, some experiences, some forms of expression are missing and may need to be accounted for. Tapan Parikh in the ‘global development’ breakout session noted that the populations often targeted for development assistance, the poorest of the poor, are often those most likely not to show up in the data.
It seems now that a Part III will be necessary to finish up my thoughts on methods, epistemology, (big) data, statistics, and where ethnography fits into this discussion. I hope to reconsider whether it makes sense to talk about what we do as “small data.” Finally, I wish to answer the questions I started off with about applications of big data. To what uses is it being put, and what are some key applications? And how might such work complement projects that are principally ethnographic?
Read the rest of the posts in the “The Ethnographer’s Complete Guide to Big Data” series:
The Ethnographer’s Complete Guide to Big Data: Small Data People in a Big Data World (part 1 of 3)
The Ethnographer’s Complete Guide to Big Data: Conclusions (part 3 of 3)