Neal Caren is on github, replication in social science!

December 11, 2012

I’m passionate about open-source science, so I had to give Big Ups to Neal Caren who I just learned is sharing code on github.  His latest offering  essentially replicates the Mark Regnerus study of children whose parents had same-sex relationships.  The writeup of this exercise is at Scatterplot.

My previous posts on github and sharing code are here and here.  If you’re on github, follow me.


The Success of Stack Exchange: Crowdsourcing + Reputation Systems

May 3, 2012

You’ve heard me say it before… Crowdsourced websites like StackOverflow and Wikipedia are changing the world.  Everyone is familiar with Wikipedia, but most people still haven’t heard about the StackExchange brand question and answer sites.  If you look into their success, I think you’ll begin to see how the combination of crowdsourcing and online reputation systems is going to revolutionize academic publishing and peer-review.

Do you know what’s happened to computer programming since the founding of StackOverflow, the first StackExchange question and answer site?  It has become a key part of every programmer’s continuing education, and for many it is such an essential tool that they can’t imagine working a single day without it.

StackOverflow began in 2008, and since then more than 1 million people have created accounts, more than 3 million questions have been asked, and more than 6 million answers provided (see Wikipedia entry).  Capitalizing on that success, StackExchange, the company which started StackOverflow, has begun a rapid expansion into other fields where people have questions.  Since most of my readers do more statistics than programming, you might especially appreciate the Stack Exchange for statistics (aka CrossValidated).  You can start exploring at my profile on the site or check out this interesting discussion of machine learning and statistics.

How do the Stack Exchange sites work?

The four most common forms of participation are question asking, question answering, commenting, and voting/scoring.  Experts are motivated to answer questions because they enjoy helping, and because good answers increase their prominently advertised reputation score.  Indeed, each question, answer, and comment someone makes can be voted up or down by anyone with a certain minimum reputation score.  Questions/answers/comments each have a score next to them, corresponding to their net-positive votes.  Users have an overall reputation score.  Answers earn their author 10 points per up-vote, questions earn 5, and comments earn 2.  As users gain reputation, they earn administrative privileges, and more importantly, respect in the community.  Administrative privileges include the ability to edit, tag, or even delete other people’s responses.  These and other administrative contributions also earn reputation, but most reputation is earned through questions and answers.  Users also earn badges, which focus attention on the different types of contributions.
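
As a rough illustration of the scoring rules just described, here is a minimal Python sketch. The point values are the ones quoted in this post (the live sites may use different rules), and the function name and the example user are hypothetical.

```python
# Points per up-vote as quoted in the post (assumption: the live sites may differ).
POINTS_PER_UPVOTE = {"answer": 10, "question": 5, "comment": 2}


def reputation(upvotes_by_kind):
    """Sum reputation from a mapping of contribution kind -> number of up-votes."""
    return sum(
        POINTS_PER_UPVOTE[kind] * count
        for kind, count in upvotes_by_kind.items()
    )


# Hypothetical user: 12 up-votes on answers, 3 on questions, 4 on comments.
example_user = {"answer": 12, "question": 3, "comment": 4}
print(reputation(example_user))
```

The sketch only counts up-votes and ignores down-votes, badges, and reputation earned through administrative contributions.
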
Crowdsourcing is based on the idea that knowledge is diffuse, but web technology makes it much easier to harvest distributed knowledge.  A voting and reputation system isn’t necessary for all forms of crowdsourcing, but as the web matures, we’re seeing voting and reputation systems being applied in more and more places with amazing results.
To name a handful off the top of my head:
  • A couple of my friends are involved in a startup called ScholasticaHQ which is facilitating peer-review for academic journals, and also offers social networking and question and answer features.
  • stats.stackexchange.com has an open-source competitor in http://metaoptimize.com/qa/ which works quite similarly.  Their open-source software can be, and is being, applied to other topics.
  • http://www.reddit.com is a popular news story sharing and discussion site where users vote on stories and comments.
  • http://www.quora.com/ is another general-purpose question and answer site.

It isn’t quite as explicit, but internet giants like Google and Facebook are also based on the idea of rating and reputation.

A growing number of academics blog, and people have been discussing how academics could get credit for blogging.  People like John Ioannidis are calling attention to how difficult it is to interpret the scientific literature because of publication bias and other problems.  Of course thoughtful individuals have other concerns about academic publishing.  Many of these concerns will be addressed soon, with the rise of crowdsourcing and online reputation systems.


Open Science: what we can learn from open-source software engineers

April 1, 2012

I believe that sharing data, and the code to analyze it, should be more common in academia – especially in sociology.  It will save a lot of researcher time, and ensure that the work that we do is reproducible.  My last post announced my contribution to github which consists of code to prepare some data from the exceptionally popular Add Health data.  I’ll use this post to briefly describe github, revision control, and open-source programming more generally.

Let’s start with something everyone knows about… Wikipedia.  It’s one of the marvels of our time, and it was created by hundreds of thousands of people collaborating with little top-down control.

In the world of computer programming, sites like github have done something similar to Wikipedia in the way they’ve harnessed the energy of the crowd.   A legitimate complaint about Wikipedia, often made by experts, is that non-experts have difficulty accounting for the uneven quality of its articles.  Sites like github mostly avoid this problem.  There are plenty of people posting code that isn’t very good, but this isn’t really a problem because most users never even come across it, and they have a pretty good idea how much to trust something based on the reputation of the author.  For some of the most popular projects “editor” or “maintainer” would be a more accurate title than “author,” because many individuals are contributing and the owner of the project spends more time approving changes other people make than writing their own changes.

Unlike Wikipedia, sites hosting open-source software, like github, often embrace having multiple versions (known as “forks”) of the same code.  For every contributor to a project, many more people merely fork it and make minor changes for their personal use.  Still, a significant number of programmers do seek the satisfaction, and the badge of honor, that comes with suggesting a change in the code of a popular project, and having it approved.

An absolutely integral part of all collaborative software engineering, not just open-source, is version control (aka revision control) software.  This software allows users to keep a history of all previous versions of a piece of code (or sometimes other kinds of documents), to instantly highlight differences, and to merge improvements made on separate copies.
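
To make "highlight differences" concrete, here is a minimal sketch using Python's standard difflib module. It is not how git works internally; it only shows the kind of line-by-line comparison a version control tool displays. The file names and the R snippets being compared are hypothetical.

```python
import difflib

# Two hypothetical versions of a small analysis script, one line per list item.
old_version = [
    "data <- read.csv('survey.csv')",
    "summary(data)",
]
new_version = [
    "data <- read.csv('survey.csv')",
    "data <- na.omit(data)",
    "summary(data)",
]

# unified_diff marks added lines with '+' and removed lines with '-'.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="analysis_v1.R",
        tofile="analysis_v2.R",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

In the printed diff, the added na.omit line is prefixed with a plus sign, while unchanged lines appear as context.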

I recommend most data analysts begin using version control software, even if they don’t plan on writing code collaboratively, because it’s easy to use and it facilitates backup and good organization.  I first discussed version control for data analysis here, but there is a much better discussion for data analysts here.  I took that advice about a year ago and started using git.  It was simpler than I thought, and it is necessary to contribute to (though not to copy code from) github.  Note that while I’m using R for the Add Health code I’m sharing, there is absolutely no reason not to use tools like git and github with SPSS, Stata, SAS, html, etc.


Revision Control Statistics Bleg

April 21, 2010

A revision control system, for those with even less programming experience than myself, manages “changes to documents, programs, and other information stored as computer files.”  The most advanced ones are used by teams of programmers who simultaneously edit the same code.  Simpler revision control is built into things like wikis and word processors.

I’m wondering whether a revision control system would be helpful for me now, or in the future, even if all I’m doing is statistics.

I’m working with a big dataset (ok Scott, not that big) and I’ve written a fair bit of code.  Nothing too complicated: it is half data preparation, and half analysis and graphics.  Every so often I save my code under a new name; that way, if I accidentally save bad changes, I can always revert to a previous state.  I do the same thing with the dataset itself, and, in R, with my workspace.  In fact, I have an extra reason to do this with the data and my R workspace: memory management.  R often complains that it’s running out of memory, so I respond by deleting variables that I probably won’t need or could recreate without too much trouble.

It is sometimes annoying to find code that I wrote simply because there is a lot of text to go through.  I can only organize it one way, e.g. I could put all the code that makes graphs together, but then the code that makes graphs wouldn’t be placed next to the code that creates the data the graphs are based on.

Is a revision control system overkill for what I’m doing?  Any other thoughts?


Scaling Social Science

April 6, 2010

A friend at Cloudera recently invited me to write a post for their corporate blog about how social scientists are using large scale computation.
I’ve been using Hadoop and MapReduce to study some really large datasets this year. I think it’s going to become more and more important and open the world of scientific computing to social scientists. I’m happy to evangelize for it.

One of the ideas that didn’t make its way into the final version is that even though the tools and data are becoming more widely available to laypeople, asking good social science questions, and answering them correctly, is still hard. It’s comparatively easy to ask the wrong question, use the wrong data, draw the wrong inference, and so on, especially if the wrongness is subtle. As an example, I think the OkCupid blog is interesting, but it’s not social science.

Social science has long been concerned with sampling methods precisely because it’s dangerously easy to incorrectly extrapolate findings from a non-representative sample to an entire population. Drawing conclusions from internet-based interactions can be problematic because the sample frame doesn’t match the population of interest. Even though I learned to make a cigar box guitar from Make Magazine, I don’t assume I know that much about acoustic engineering. Likewise, recreational data analysis is fun, illuminating and perhaps suggestive of how our social world works, but one ought not conclude that correlations or trends tell the whole, correct story. However, if exploring and experimenting with data can spark an interest in quantitative analysis of our social world, then I think it’s all for the better.
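
The sampling-frame problem can be shown with a small worked example. This is a minimal Python sketch with entirely made-up numbers: a hypothetical population of 1,000 people, 300 of whom are reachable online, where a trait is more common among the online group. None of the figures come from real data.

```python
# Hypothetical population split by whether people appear in an internet sample frame.
population = {
    "online": {"size": 300, "with_trait": 210},   # 70% of this group has the trait
    "offline": {"size": 700, "with_trait": 210},  # 30% of this group has the trait
}


def trait_rate(groups):
    """Share of people with the trait across the given groups."""
    total = sum(group["size"] for group in groups)
    with_trait = sum(group["with_trait"] for group in groups)
    return with_trait / total


# True rate in the whole population versus the rate seen in the online-only frame.
population_rate = trait_rate(population.values())
frame_rate = trait_rate([population["online"]])

print(f"whole population: {population_rate:.2f}")
print(f"online-only frame: {frame_rate:.2f}")
```

Even a perfect census of the online group would report 0.70 here, while the rate in the full population is 0.42, so collecting more internet data would not close the gap.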

Link: http://www.cloudera.com/blog/2010/04/scaling-social-science-with-hadoop