Neal Caren is on github, replication in social science!

December 11, 2012

I’m passionate about open-source science, so I had to give Big Ups to Neal Caren who I just learned is sharing code on github.  His latest offering  essentially replicates the Mark Regnerus study of children whose parents had same-sex relationships.  The writeup of this exercise is at Scatterplot.

My previous posts on github and sharing code are here and here.  If you’re on github, follow me.


Finding Data

July 16, 2012
A friend asked me about where he might find education data to practice/play with.
 
Here are some links I came up with (a short sketch for reading a downloaded file into R follows the list):
http://www.cpc.unc.edu/projects/addhealth/ (has both open-access and restricted data)
http://www.infochimps.com/tags/school
http://www.factual.com/product/data?selected=education
http://stats.stackexchange.com/questions/27237/what-are-the-most-useful-sources-of-economics-data
http://stats.stackexchange.com/questions/7/locating-freely-available-data-samples
http://stats.stackexchange.com/questions/27061/how-is-research-based-on-the-u-s-census-organized
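
If you download a file from one of these sources, pulling it into R to poke around is straightforward. Here is a minimal sketch, assuming a CSV download; the URL and file name are hypothetical placeholders, not real links from the sites above.

```r
# Minimal sketch for grabbing and exploring a downloaded dataset in R.
# The URL and file name are hypothetical placeholders.
data_url <- "http://example.com/some-education-data.csv"
download.file(data_url, destfile = "education_data.csv")

edu <- read.csv("education_data.csv", stringsAsFactors = FALSE)

str(edu)      # inspect variable names and types
summary(edu)  # quick descriptive statistics
```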

Open Science: what we can learn from open-source software engineers

April 1, 2012

I believe that sharing data, and the code to analyze it, should be more common in academia – especially in sociology. It will save a lot of researcher time and ensure that the work we do is reproducible. My last post announced my contribution to github, which consists of code for preparing variables from the exceptionally popular Add Health data. I’ll use this post to briefly describe github, revision control, and open-source programming more generally.

Let’s start with something everyone knows about… Wikipedia.  It’s one of the marvels of our time, and it was created by hundreds of thousands of people collaborating with little top-down control.

In the world of computer programming, sites like github have done something similar to Wikipedia in the way they’ve harnessed the energy of the crowd.   A legitimate complaint about Wikipedia, often made by experts, is that non-experts have difficulty accounting for the uneven quality of its articles.  Sites like github mostly avoid this problem.  There are plenty of people posting code that isn’t very good, but this isn’t really a problem because most users never even come across it, and they have a pretty good idea how much to trust something based on the reputation of the author.  For some of the most popular projects “editor” or “maintainer” would be a more accurate title than “author,” because many individuals are contributing and the owner of the project spends more time approving changes other people make than writing their own changes.

Unlike Wikipedia, sites hosting open-source software, like github, often embrace having multiple versions (known as “forks”) of the same code.  For every contributor to a project, many more people merely fork it and make minor changes for their personal use.  Still, a significant number of programmers do seek the satisfaction, and the badge of honor, that comes with suggesting a change in the code of a popular project, and having it approved.

An absolutely integral part of all collaborative software engineering, not just open-source, is version control (aka revision control) software. This software allows users to keep a history of all previous versions of a piece of code (or sometimes other kinds of documents), to instantly highlight differences, and to merge improvements made on separate copies.

I recommend most data analysts begin using version control software, even if they don’t plan on writing code collaboratively, because it’s easy to use and it facilitates backup and good organization. I first discussed version control for data analysis here, but there is a much better discussion of version control for data analysts here. I took that advice about a year ago and started using git. It was simpler than I thought, and it is necessary for contributing to (though not for copying code from) github. Note that while I’m using R for the Add Health code I’m sharing, there is absolutely no reason not to use tools like git and github with SPSS, Stata, SAS, HTML, etc.
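
To make that concrete, here is a minimal sketch of the kind of workflow I have in mind, driven from R with system() calls; the project folder and file names are hypothetical, and you could just as easily type the same git commands in a terminal.

```r
# A minimal git workflow for a data-analysis project, run from R via system().
# The folder and file names are hypothetical placeholders.
setwd("~/projects/addhealth-example")            # your project folder

system("git init")                               # start tracking this folder
system("git add prepare_data.R analysis.R")      # stage the scripts to version
system('git commit -m "First pass at data preparation"')

# ...edit your scripts, then review and record the changes:
system("git diff")                               # see exactly what changed
system('git commit -am "Fix recoding of parental education"')

# To share on github, add a remote and push (the repository URL is a placeholder):
# system("git remote add origin https://github.com/yourname/yourrepo.git")
# system("git push -u origin master")
```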


Share your code! (Here is some for Add Health)

March 26, 2012

The National Longitudinal Study of Adolescent Health, aka Add Health, has been in use for more than a decade. Thousands of researchers have used it. This is fantastic. There are great economies of scale in the data collection.

Sadly, we researchers have wasted years doing things that others have already done. Anyone beginning a new project must first clean their data. Add Health doesn’t require as much cleaning as some other, messier sources of data, thanks to people like Joyce Tabor, James Moody, Ken Frank, and many others. Still, I think research would be sped up quite a lot, and communication greatly enhanced, if people shared their code more widely. I’ve therefore created my first github code repository, which prepares the variables from the widely used in-school questionnaire portion of Add Health.

https://github.com/MichaelMBishop/inschoolAddHealth

This will be of most use to people using R, but the data could be exported.  The script also includes cross tabulations and fairly detailed comments which I hope will help people think about the data.  Some time soon I’ll upload more code.
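
To give a flavor of what the script does, here is a tiny sketch in the same spirit; the file name and variable names below are made-up placeholders rather than actual Add Health variables, so consult the repository and the codebook for the real ones.

```r
# Hypothetical sketch of the kind of preparation the repository does.
# The file name and variable names are placeholders, not real Add Health names.
inschool <- read.csv("inschool.csv", stringsAsFactors = FALSE)

# Recode a raw numeric item into a labeled factor
inschool$sex <- factor(inschool$sex_raw, levels = c(1, 2),
                       labels = c("male", "female"))

# A quick cross tabulation to sanity-check the recode
table(inschool$sex, inschool$grade, useNA = "ifany")
```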

I recommend Jeremy Freese on reproducibility in sociological research here and here.  Andy Abbott’s best objections don’t apply to a widely used data source like Add Health.

p.s.  Do share links to other code repositories in the comments!


Transparency from the ASA and US Government

April 1, 2011

Intuition suggests that transparency shouldn’t cost much money, but that it has the potential to be a powerful force for improving institutional incentives.

Recently, the sociology blogosphere has been discussing the ASA’s proposed dues increase (See here, here, here, and here). Many are skeptical that the dues increase is in the best interest of the members. But even those who might support the increase can get behind the call for more transparency from the ASA.

In a related story, The Sunlight Foundation reports:

Some of the most important technology programs that keep Washington accountable are in danger of being eliminated. Data.gov, USASpending.gov, the IT Dashboard and other federal data transparency and government accountability programs are facing a massive budget cut, despite only being a tiny fraction of the national budget. Help save the data and make sure that Congress doesn’t leave the American people in the dark.


Ngrams of Social Science Disciplines

January 24, 2011

An “ngram” is an n-word phrase. So, “human science” is a 2-gram. If you’ve been living under a rock, you may not have heard about the latest gift from google: having scanned most published books in a number of major languages, they recently released the data on the relative popularity of words and phrases over time, along with a tool for easy visualization. I thought I’d explore some terms of broad interest to sociologists with no particular idea about what I’d find. Please take a look and help me interpret them.

Below you’ll find the relative frequency with which the major social scientific disciplines (plus psychology) are mentioned in books. Let me explain the numbers on the Y-axis. “psychology” is the most common of these words. In 1950, it accounted for about 0.0034% of all words published. In other words, google takes all the books published in a given year, counts how many occurrences there are of each word, and then divides that number by the total number of words published. There are many methodological considerations… for example, each book only counts once, regardless of how many copies are sold.
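
As a toy illustration of that arithmetic (the counts below are invented for the example, not Google’s actual figures):

```r
# Toy illustration of how the y-axis values are computed.
# Both counts are invented placeholders, not Google's actual figures.
word_count  <- 3.4e6   # hypothetical occurrences of "psychology" in one year
total_words <- 1e11    # hypothetical total words published that year

relative_freq <- word_count / total_words
relative_freq * 100    # expressed as a percentage: 0.0034%
```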

So what do we see? Well, the rank order doesn’t really change over time. Psychology gets the most mentions, then economics, sociology, anthropology, and finally political science. It’s tempting to interpret this as measuring the prominence of each discipline, but this isn’t a great test. For starters, authors aren’t generally referring to the academic discipline when they use the word “psychology,” but they are when they use the phrase “political science.” Sociology is probably between the two in terms of the share of word-mentions that actually refer to the academic discipline.

I feel a bit more comfortable making inferences based on how each of these terms changes over time. For example, in 1950 sociology received almost twice as many mentions as anthropology. The situation was similar in 1980. But in 1999, anthropology achieved parity with sociology, and the two have been close to even in the decade since. This appears to be evidence that anthropology gained prominence, relative to sociology, in the last half of the twentieth century. Naturally, I don’t think we should put too much stock in this single measure of prominence. We might want to look at trends in the number of students and people working in each discipline. We could count mentions in periodicals or citations to academic articles. We could look at how each word is used, and how much their usage changes over time. Do these other measures corroborate, counter, or otherwise contextualize these trends?

I can’t give you easy access to all that data, but you can explore ngrams for yourself!

So readers, what do you see in this graph? Care to nominate and discuss assumptions (plausible, potentially useful, or plainly dangerous) that either help us interpret these ngrams or lead us astray?


facebook has a soul?*

February 10, 2010

i genuinely hope to get back to more (semi-)regular blogging here soon. But, in the meantime, in case you haven’t seen this one yet – here’s a wild potential data release that may interest some of you. (ht BW)

____
*it’s highly possible i saw this same title in someone else’s mention of this elsewhere today. but if so, i can’t for the life of me recall where.