I believe that sharing data, and the code to analyze it, should be more common in academia – especially in sociology. It would save researchers a lot of time and help ensure that the work we do is reproducible. My last post announced my contribution to github: code to prepare some of the exceptionally popular Add Health data. I’ll use this post to briefly describe github, revision control, and open-source programming more generally.
Let’s start with something everyone knows about… Wikipedia. It’s one of the marvels of our time, and it was created by hundreds of thousands of people collaborating with little top-down control.
In the world of computer programming, sites like github have done something similar to Wikipedia in the way they’ve harnessed the energy of the crowd. A legitimate complaint about Wikipedia, often made by experts, is that non-experts have difficulty accounting for the uneven quality of its articles. Sites like github mostly avoid this problem. There are plenty of people posting code that isn’t very good, but this isn’t really a problem because most users never even come across it, and they have a pretty good idea how much to trust something based on the reputation of the author. For some of the most popular projects “editor” or “maintainer” would be a more accurate title than “author,” because many individuals are contributing and the owner of the project spends more time approving changes other people make than writing their own changes.
Unlike Wikipedia, sites hosting open-source software, like github, often embrace having multiple versions (known as “forks”) of the same code. For every contributor to a project, many more people merely fork it and make minor changes for their personal use. Still, a significant number of programmers do seek the satisfaction, and the badge of honor, that comes with suggesting a change in the code of a popular project, and having it approved.
An absolutely integral part of all collaborative software engineering, not just open-source, is version control (aka revision control) software. This software allows users to keep a history of all previous versions of a piece of code (or sometimes other kinds of documents), to instantly highlight differences between versions, and to merge improvements made on separate copies.
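To make the "highlight differences" idea concrete, here is a minimal sketch using Python's standard `difflib` module. It is not how git works internally, just an illustration of the kind of comparison version control software shows you. The two "versions" are made-up lines of an analysis script, and the names `old_version`, `new_version`, and `highlight_differences` are my own, chosen for illustration.

```python
import difflib

# Two hypothetical versions of a tiny analysis script (illustrative only).
old_version = [
    "data <- read.csv('survey.csv')",
    "model <- lm(y ~ x, data = data)",
    "summary(model)",
]
new_version = [
    "data <- read.csv('survey.csv')",
    "data <- na.omit(data)",
    "model <- lm(y ~ x + age, data = data)",
    "summary(model)",
]


def highlight_differences(old, new):
    """Return unified-diff lines describing how `new` differs from `old`."""
    return list(
        difflib.unified_diff(old, new, fromfile="v1", tofile="v2", lineterm="")
    )


diff_lines = highlight_differences(old_version, new_version)

if __name__ == "__main__":
    # Lines starting with "-" were removed; lines starting with "+" were added.
    print("\n".join(diff_lines))
```

Real version control software does this comparison for every saved version of every file, which is what makes it practical to see exactly what changed and when.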
I recommend that most data analysts begin using version control software, even if they don’t plan on writing code collaboratively, because it’s easy to use and it facilitates backups and good organization. I first discussed version control for data analysis here, but there is a much better discussion of it for data analysts here. I took that advice about a year ago and started using git. It was simpler than I thought, and it is required for contributing to github (though not for copying code from it). Note that while I’m using R for the Add Health code I’m sharing, there is absolutely no reason not to use tools like git and github with SPSS, Stata, SAS, HTML, etc.