Since the earliest days of the internet, along with all of the near-utopian promises of openness, freedom and all that, came questions about how the technology would reproduce social inequality. The worry, styled the “digital divide,” was that the separation of the public into those who had access to the internet and those who lacked it would advantage the already advantaged, and further isolate the disadvantaged from the social and informational resources needed to participate fully in society.
Eszter Hargittai has done some of the most comprehensive work on this topic. She has tracked a diverse undergraduate population with longitudinal surveys, showing how internet participation across racial and socioeconomic groups has changed over time. Even recently, she has shown how racial and socioeconomic differences persist in usage of online social networking sites.
Given her work, I, like many others, was very interested to see a release from Facebook today of a study in which their data science team estimated the racial diversity of their population by statistical analysis of members’ surnames. What has always set Facebook apart from other online services is the use of real names. While I’m “redlog” most everywhere else (Twitter, and so on), I’m “Scott Golder” on Facebook. It’s long been noted that this kind of “real” or “honest” data is immensely useful, but I think this is among the first times it’s been shown exactly why.
In short, Facebook used statistical data from the Census on race-surname mappings to estimate the racial makeup of their user base. For example, if 73% of people named Smith are white, then multiply the number of Smiths by 0.73 and add the result to the count of white users. In the blog post, they note that this method assumes Facebook users are randomly sampled from the population, and that they used a mixture model to correct for the error this assumption introduces (though more details on the modeling would’ve been great).
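To make the arithmetic concrete, here is a minimal Python sketch of the naive version of that estimate. It is my own illustration, not Facebook’s code: the function name, the two-category breakdown, the user counts, and every probability other than the 0.73 for Smith are made up for the example, and it leaves out the mixture-model correction entirely.

```python
from collections import defaultdict


def estimate_composition(surname_counts, surname_race_probs):
    """Naive expected racial shares from surname counts.

    surname_counts: {surname: number of users with that surname}
    surname_race_probs: {surname: {race: P(race | surname)}}
    Both structures are hypothetical stand-ins for the real data.
    """
    totals = defaultdict(float)
    covered = 0
    for surname, count in surname_counts.items():
        probs = surname_race_probs.get(surname)
        if probs is None:
            continue  # surname missing from the lookup table, so skip it
        covered += count
        for race, p in probs.items():
            # e.g. 1000 Smiths * 0.73 adds 730 expected white users
            totals[race] += count * p
    if covered == 0:
        return {}
    # Normalize by the users we could actually look up
    return {race: total / covered for race, total in totals.items()}


# Illustrative numbers only; just the 0.73 comes from the example in the text.
surname_race_probs = {
    "Smith": {"white": 0.73, "other": 0.27},
    "Garcia": {"white": 0.10, "other": 0.90},
}
surname_counts = {"Smith": 1000, "Garcia": 500, "Unlisted": 50}

shares = estimate_composition(surname_counts, surname_race_probs)
```

Note that this simply treats each surname’s national breakdown as if it applied to that surname’s users on the site, which is exactly the random-sampling assumption the blog post says needs correcting.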
They present several findings, but I was most interested in the “saturation” plot. Though whites were slightly overrepresented initially, over time this has disappeared. Asians have consistently been overrepresented. Most strikingly, blacks and Hispanics were significantly underrepresented until recently, and are only now approaching proportionate representation. This proportionate representation is good news, but I’d caution against thinking of it as evidence that the digital divide is over. Indeed, I’d argue that racial diversity online is less interesting than socioeconomic diversity, which this study doesn’t address.
Originally I was going to talk in this post about how this is an example of why using internet data to do social science is awesome. But it’s actually the opposite — it’s the use of social science data to do internet research. Facebook has cleverly used data generated at great expense by the American public in order to make their own data more valuable. In addition to using Census data, they could use Social Security Administration data on first names to improve the analysis, as well as perhaps the Census’ race atlas data and zip codes.
I’m very happy that Facebook is doing many things that are of interest to social scientists and to the public interest in general. But now that they’ve leveraged public data for their own use, I’d ask them to think about what they can do with their data to help the public in return.