Revision Control Statistics Bleg

A revision control system, for those with even less programming experience than myself, manages “changes to documents, programs, and other information stored as computer files.” The most advanced ones are used by teams of programmers who simultaneously edit the same code. Simpler revision control is built into things like wikis and word processors.

I’m wondering whether a revision control system would be helpful for me now, or in the future, even if all I’m doing is statistics.

I’m working with a big dataset (ok Scott, not that big) and I’ve written a fair bit of code. Nothing too complicated: it is half data preparation and half analysis and graphics. Every so often I save my code under a new name so that, if I accidentally save bad changes, I can always revert to a previous state. I do the same thing with the dataset itself and, in R, with my workspace. In fact, I have an extra reason to do this with the data and my R workspace: memory management. R often complains that it’s running out of memory, so I respond by deleting variables that I probably won’t need or could recreate without too much trouble.

It is sometimes annoying to find code that I wrote simply because there is a lot of text to go through. I can only organize it one way. For example, I could put all the code that makes graphs together, but then it wouldn’t sit next to the code that creates the data the graphs are based on.

Is a revision control system overkill for what I’m doing?  Any other thoughts?

20 Responses to Revision Control Statistics Bleg

  1. Drew Conway says:

    Version control is key, no matter how small the project.

    I use Git for everything now, even collaborating on LaTeX documents remotely. It is great, and super easy to learn.

  2. I posted my own very simple version control approach last year. Basically, in each of my project directories I have an “archive” subdirectory, and I wrote a shell script (which I have as a right-click menu option) that lets me save a timestamped snapshot of any file to the archive subdirectory. This works for me because my work is either solo or with a clear division of labor, but if and when I get into a very tight collaboration I’ll seriously consider Git.
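The timestamped-snapshot idea in the comment above could look something like this in Python. This is only a sketch, not the commenter's actual shell script: the function name `snapshot`, the `archive_name` and `now` parameters, and the file-name format are all assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot(path, archive_name="archive", now=None):
    """Copy a file into an archive subdirectory under a timestamped name.

    `archive_name` and `now` are hypothetical parameters for this sketch;
    `now` exists only so the timestamp can be fixed when testing.
    """
    source = Path(path)

    # The archive lives next to the file, as in the comment's description.
    archive_dir = source.parent / archive_name
    archive_dir.mkdir(exist_ok=True)

    # Assumed naming scheme: <stem>_<YYYYMMDD-HHMMSS><suffix>
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = archive_dir / f"{source.stem}_{stamp}{source.suffix}"

    # copy2 also preserves the file's modification time where possible.
    shutil.copy2(source, target)
    return target
```

Calling `snapshot("analysis.R")` from a project directory would leave the original file untouched and add a dated copy under `archive/`. Unlike Git, this keeps whole copies rather than diffs and records no commit messages, which is the trade-off the comment accepts for solo work.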

  3. Kieran says:

    Yes, use version control. I use git for everything at this point, but mercurial is very good too. Each is easy to use via any modern programmer’s editor. It’s worth it. Git’s model of easy/cheap branching seems like a natural fit for the problem you’re having. Incidentally, in addition to all the vc benefits, a repository is also a full backup, and with github, an instant offsite backup + collaboration tool. (More, possibly too much more, from me on this and related stuff here. I should expand the revision control section.)

  4. Michael Bishop says:

    I can always count on you three when I need technical advice! Thanks!

  5. With git for windows it looks like I have the choice between

    Cygwin http://www.cygwin.com/setup.exe

    and

    msysGit http://code.google.com/p/msysgit/downloads/list

    any more words of wisdom?

  6. I’m also in the situation of evaluating how to incorporate revision control into my data analysis workflow. Your post gave me the push to post a question on Stack Overflow on the topic: http://stackoverflow.com/questions/2712421/r-and-version-control-for-the-solo-data-analyst

  7. lingpipe says:

    I use version control for everything from papers to R code to my address book. Even if it’s just you on a single machine, the ability to back off to tagged (or even dated) versions is something I can’t live without. I can’t even imagine working on multiple machines (e.g. home and office) or with multiple people without version control.

    I don’t think which version control you use makes much difference. They offer similar functionality with slightly different interfaces.

    We use devguard.com, which, for a monthly fee, offers very solid Subversion (aka svn) support.

  8. Not using version control is like not using a condom. Everyone who works with text should use it.

  9. I second the recommendation for Git. I settled on Git after trying both darcs and Mercurial. The msys version works fine on Windows for me. It’s saved my bacon a couple times.

  10. […] analysts that I respect use version control. For example: http://github.com/hadley/ See comments on https://permut.wordpress.com/2010/04/21/revision-control-statistics-bleg/ However, I’m evaluating whether adopting a version control system such as git would be […]

  11. […] lately. This is useful both as backup and as a user-friendly (but limited capacity) alternative to RCS on the Subversion model. The way it works is that Dropbox creates a folder called ~/Dropbox, then […]

  12. […] it facilitates backup, and good organization. I first discussed version control for data analysis here, but there is a much better discussion of it for data analysts here. I took that advice, about a […]

  13. […] See the comments on https://permut.wordpress.com/2010/04/21/revision-control-statistics-bleg/ […]
