I’m passionate about open-source science, so I had to give Big Ups to Neal Caren, who I just learned is sharing code on github. His latest offering essentially replicates the Mark Regnerus study of children whose parents had same-sex relationships. The writeup of this exercise is at Scatterplot.
You’ve heard me say it before… Crowdsourced websites like StackOverflow and Wikipedia are changing the world. Everyone is familiar with Wikipedia, but most people still haven’t heard about the StackExchange brand question and answer sites. If you look into their success, I think you’ll begin to see how the combination of crowdsourcing and online reputation systems is going to revolutionize academic publishing and peer-review.
Do you know what’s happened to computer programming since the founding of StackOverflow, the first StackExchange question and answer site? It has become a key part of every programmer’s continuing education, and for many it is such an essential tool that they can’t imagine working a single day without it.
StackOverflow began in 2008, and since then more than 1 million people have created accounts, more than 3 million questions have been asked, and more than 6 million answers provided (see Wikipedia entry). Capitalizing on that success, StackExchange, the company which started StackOverflow, has begun a rapid expansion into other fields where people have questions. Since most of my readers do more statistics than programming, you might especially appreciate the Stack Exchange for statistics (aka CrossValidated). You can start exploring at my profile on the site or check out this interesting discussion of machine learning and statistics.
How do the Stack Exchange sites work?
- A couple of my friends are involved in a startup called ScholasticaHQ which is facilitating peer-review for academic journals, and also offers social networking and question and answer features.
- stats.stackexchange.com has an open-source competitor in http://metaoptimize.com/qa/ which works quite similarly. Its open-source software can be, and is being, applied to other topics.
- http://www.reddit.com is a popular news story sharing and discussion site where users vote on stories and comments.
- http://www.quora.com/ is another general-purpose question and answer site.
It isn’t quite as explicit, but internet giants like Google and Facebook are also based on the idea of rating and reputation.
A growing number of academics blog, and people have been discussing how they could get academic credit for blogging. People like John Ioannidis are calling attention to how difficult it is to interpret the scientific literature because of publication bias and other problems. Of course, thoughtful individuals have other concerns about academic publishing. Many of these concerns will be addressed soon, with the rise of crowdsourcing and online reputation systems.
I believe that sharing data, and the code to analyze it, should be more common in academia – especially in sociology. It will save a lot of researcher time, and ensure that the work that we do is reproducible. My last post announced my contribution to github which consists of code to prepare some data from the exceptionally popular Add Health data. I’ll use this post to briefly describe github, revision control, and open-source programming more generally.
Let’s start with something everyone knows about… Wikipedia. It’s one of the marvels of our time, and it was created by hundreds of thousands of people collaborating with little top-down control.
Unlike Wikipedia, sites hosting open-source software, like github, often embrace having multiple versions (known as “forks”) of the same code. For every contributor to a project, many more people merely fork it and make minor changes for their personal use. Still, a significant number of programmers do seek the satisfaction, and the badge of honor, that comes with suggesting a change in the code of a popular project, and having it approved.
An absolutely integral part of all collaborative software engineering, not just open-source, is version control (aka revision control) software. This software allows users to keep a history of all previous versions of a piece of code (or sometimes other kinds of documents), to instantly highlight the differences between versions, and to merge improvements made on separate copies.
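To make the "highlight differences" idea concrete, here is a minimal sketch using Python's standard `difflib` module. It is not how git works internally; it only illustrates the kind of line-by-line comparison a version control tool shows you. The two script versions, and the names `load`, `fit`, and `drop_missing` inside them, are hypothetical and exist only as text to be compared.

```python
import difflib

# Two hypothetical versions of a small analysis script (illustrative text only;
# the function names inside these strings are made up and never executed).
old_version = [
    "data = load('survey.csv')",
    "model = fit(data)",
    "print(model)",
]
new_version = [
    "data = load('survey.csv')",
    "data = drop_missing(data)",
    "model = fit(data)",
    "print(model)",
]


def show_diff(old, new):
    """Return the unified diff between two lists of lines as a list of strings."""
    return list(
        difflib.unified_diff(old, new, fromfile="v1", tofile="v2", lineterm="")
    )


diff_lines = show_diff(old_version, new_version)
for line in diff_lines:
    print(line)
```

Lines that start with `+` were added in the newer version and lines that start with `-` were removed. A real version control system stores the whole history of such changes, so any earlier version can be recovered.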
I recommend most data analysts begin using version control software, even if they don’t plan on writing code collaboratively, because it’s easy to use and it facilitates backup and good organization. I first discussed version control for data analysis here, but there is a much better discussion of it for data analysts here. I took that advice about a year ago and started using git. It was simpler than I thought, and it is necessary to contribute to (though not to copy code from) github. Note that while I’m using R for the Add Health code I’m sharing, there is absolutely no reason not to use tools like git and github with SPSS, Stata, SAS, html, etc.
The National Longitudinal Study of Adolescent Health, aka Add Health, has been in use for more than a decade. Thousands of researchers have used it. This is fantastic. There are great economies of scale in the data collection.
Sadly, we researchers have wasted years doing things that others have already done. Anyone beginning a new project must first clean their data. Add Health doesn’t require as much cleaning as some other, messier sources of data, thanks to people like Joyce Tabor, James Moody, Ken Frank, and many others. Still, I think research would be sped up quite a lot, and communication greatly enhanced, if people shared their code more widely. Therefore I’ve created my first github code repository which prepares the variables from the widely used in-school questionnaire portion of Add Health.
This will be of most use to people using R, but the data could be exported. The script also includes cross tabulations and fairly detailed comments which I hope will help people think about the data. Some time soon I’ll upload more code.
p.s. Do share links to other code repositories in the comments!