Scaling Social Science

A friend at Cloudera recently invited me to write a post for their corporate blog about how social scientists are using large-scale computation.
I’ve been using Hadoop and MapReduce to study some really large datasets this year. I think tools like these will become more and more important and will open the world of scientific computing to social scientists, so I’m happy to evangelize for them.

One of the ideas that didn’t make its way into the final version is that even though the tools and data are becoming more widely available to laypeople, asking good social science questions, and answering them correctly, is still hard. It’s comparatively easy to ask the wrong question, use the wrong data, draw the wrong inference, and so on, especially if the wrongness is subtle. As an example, I think the OkCupid blog is interesting, but it’s not social science.

Social science has long been concerned with sampling methods precisely because it’s dangerously easy to incorrectly extrapolate findings from a non-representative sample to an entire population. Drawing conclusions from internet-based interactions can be problematic because the sample frame doesn’t match the population of interest. Even though I learned to make a cigar box guitar from Make Magazine, I don’t assume I know that much about acoustic engineering. Likewise, recreational data analysis is fun, illuminating and perhaps suggestive of how our social world works, but one ought not conclude that correlations or trends tell the whole, correct story. However, if exploring and experimenting with data can spark an interest in quantitative analysis of our social world, then I think it’s all for the better.
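
To make the sample-frame problem concrete, here is a minimal simulation sketch in Python. Everything in it is invented for illustration: the uniform age distribution, the assumed rule that the probability of being online falls with age, and the function name `simulate_frame_bias` are assumptions, not real data or anyone's actual method. It compares the mean age estimated from a simple random sample of the whole population with the mean estimated from a sample drawn only from people who are online.

```python
import random


def simulate_frame_bias(seed=0, n_population=50_000, n_sample=1_000):
    """Compare a random sample with an online-only sample (toy data)."""
    rng = random.Random(seed)

    # Build a synthetic population of (age, is_online) pairs.
    population = []
    for _ in range(n_population):
        age = rng.uniform(18, 90)
        # Assumption for illustration: the chance of being online
        # declines linearly with age, from 1.0 at 18 to 0.1 at 90.
        p_online = 1.0 - (age - 18) / 80
        population.append((age, rng.random() < p_online))

    def mean_age(people):
        return sum(age for age, _ in people) / len(people)

    # Sample frame matches the population of interest.
    random_sample = rng.sample(population, n_sample)

    # Sample frame covers only people who are online.
    online_frame = [person for person in population if person[1]]
    online_sample = rng.sample(online_frame, n_sample)

    return mean_age(population), mean_age(random_sample), mean_age(online_sample)


if __name__ == "__main__":
    true_mean, random_mean, online_mean = simulate_frame_bias()
    print(f"population mean age:    {true_mean:.1f}")
    print(f"random-sample estimate: {random_mean:.1f}")
    print(f"online-only estimate:   {online_mean:.1f}")
```

Under these assumptions the online-only estimate skews young no matter how large the sample gets, because the error comes from who is in the frame rather than from how many people are drawn.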

Link: http://www.cloudera.com/blog/2010/04/scaling-social-science-with-hadoop

This entry was posted on Tuesday, April 6th, 2010 at 1:19 pm and is filed under Uncategorized.

4 Responses to Scaling Social Science

Michael Bishop says:

April 6, 2010 at 4:32 pm

Fascinating stuff Scott! I agree that the future is very bright for research with these enormous datasets.

Revision Control Statistics Bleg « Permutations says:

April 21, 2010 at 8:04 pm

[…] working with a big dataset (ok Scott, not that big) and I’ve written a fair bit of code. Nothing too complicated, it is half data […]

guillermoparedes says:

April 29, 2010 at 12:47 pm

Here’s a paper you might find interesting:

http://docs.google.com/viewer?a=v&q=cache:-AfoyDvzIpAJ:e-science.lancs.ac.uk/multiR/multiR-paper.pdf+distributed+computing+for+social+research&hl=es&gl=mx&pid=bl&srcid=ADGEEShEpE6yRQVgmjsquKm-jsNR63QIpiR1iFnHe3ELaAana98L8FgSIz3Njsh7wbmZa8wpw4V6VHx1h9akzzC24Kpb3Bx9Vcj6aKbepgYgvbqAMhqYF_rMBZZ5ZfJI3RMdqZYM_f-j&sig=AHIEtbQleW1MttWfkfKes9ncjvBQSkHk1Q

A Quantitative Exploration of the CRwM Stripper Genre Collision Hypothesis « Conflated Automatons says:

April 21, 2011 at 12:45 pm

[…] the approach might be naive enough to be described as folk computational sociology, I prefer to think of it as punk […]
