Apache Hadoop Transitions to Git

September 8, 2014

The Apache Infrastructure team has gotten Git migrations down pat. Just ask the Apache Hadoop project, which moved from Subversion to Git in less than 10 days.

Hadoop committer Karthik Kambatla says that getting from the request to Infra to a usable repo “took less than 10 days. The actual migration itself took 3-4 days. Daniel (Gruno, an Infrastructure team member) smartly suggested we start the process on a weekend to minimize disruption… he updated us fairly frequently on the status and made useful suggestions. There were a couple of follow-up items that were fixed fairly quickly as well.”

Git is an Option?

Chris Douglas, PMC Chair for Hadoop, says that Git wasn’t an option when Hadoop spun out of Lucene. “At least, there was no discussion about it. Git wasn’t available, in any case.”

In fact, some projects may not even be aware that it’s an option today. Git is still labeled “work in progress” (WIP), and it may not be obvious that Infra can switch projects to Git. David Nalley, VP of Infrastructure at the Apache Software Foundation (ASF), explains that the “WIP” label is just that. “With git used by so many projects, it’s just a label at this point. Git is fully supported and in production at the ASF.”

“Historically, the original git.apache.org service was a svn-to-git mirror, and today that also provides the mirroring capabilities to Github,” says Nalley. “There is a project slated to merge what is git-wip and git.apache.org into a single service to rid us of the WIP moniker.”

Reasons to Switch

Kambatla says that the Hadoop project wanted to switch for a number of reasons. First, he says, “most users and developers were using git for development for a number of reasons: local commits, easy patch updates against latest committed versions, sharing code with others, etc. SVN, in my opinion was being used primarily only to commit code. Using git would avoid this duplication.”

Second, Kambatla says that working on feature branches “is easier with git, to keep up with the work on main branches.” Finally, Git provides the “potential for better code review tools.”

Douglas agrees that Git was a better fit for Hadoop. “Hadoop often has several active development branches that require backporting features. I don’t know if subversion added better support for this workflow subsequently, but git made it much easier to manage multiple patches, branches, and review. Because Hadoop uses review-then-commit (RTC), one often has multiple patches in flight that require quick context switches. So most developers had already switched to git for their work on Hadoop.”

Git does introduce some trade-offs, though. Douglas notes that authorization is less fine-grained with Git. However, he expresses little concern there. “Our experiments with branch committers suggest that we won’t regret relaxing the strict authorization we exercised with Subversion.”

Daniel Gruno, who performed Hadoop’s migration, says that code provenance remains intact, but Git isn’t SVN. “There are a few things like property settings that are lost, and the overall structure of a repository changes, which can make it somewhat difficult to browse what happened in the past – until you suddenly learn how it’s all set up, and then it gets easy.”

Mechanics of Switching

Though Kambatla and Douglas make it sound simple, the actual migration from SVN to Git isn’t trivial. First, says Gruno, it involves “a lot of paperwork. We have to be very careful when we migrate, and so we have a set of rules we always follow.”

The process involves locking the SVN repository, “so someone doesn’t commit something while we migrate and it gets lost or messes up the migration,” says Gruno. “Once a project decides to move to Git, the Subversion repository is basically voided.”

The physical migration requires “a big ol’ Perl-based system” that “takes ages to complete,” says Gruno. Ages, as in “1.5 hours per 1,000 commits.” Hadoop took more than two days to migrate, and he says that OpenOffice took more than a week. Hence the weekend migration plan.
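As a rough back-of-the-envelope check on those figures, the sketch below applies Gruno’s quoted rate of 1.5 hours per 1,000 commits. It assumes the rate scales linearly with history size, which the article does not state. The function names and the 10,000-commit sample are invented for illustration and say nothing about Hadoop’s actual commit count.

```python
HOURS_PER_1000_COMMITS = 1.5  # rate quoted by Gruno


def estimated_migration_hours(commit_count: int) -> float:
    """Estimate conversion time, assuming the quoted rate scales linearly."""
    return commit_count / 1000 * HOURS_PER_1000_COMMITS


def commits_for_duration(hours: float) -> float:
    """Invert the estimate: how many commits would take `hours` to convert."""
    return hours / HOURS_PER_1000_COMMITS * 1000


if __name__ == "__main__":
    # Hypothetical repository size, chosen only for illustration.
    print(estimated_migration_hours(10_000))  # 15.0
    # Two full days (48 hours) at the quoted rate.
    print(commits_for_duration(48))  # 32000.0
```

Under that linear assumption, a migration lasting more than two days would correspond to a history of more than roughly 32,000 commits.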

After the migration, the repo is put into read-only mode so the PMC can “inspect” it. After the project’s PMC signs off, write access is allowed. And, finally, the old SVN repository is “partially opened” for write access so the project can update its website.

Git by the Numbers

Gruno says that quite a few projects are currently using Git as their Version Control System (VCS) of choice. Out of 151 Top-Level Projects (TLPs), 69 are using Git as the primary VCS (Gruno points out that all Web sites are using SVN), as are 19 podlings in the Apache Incubator (out of around 30 incubating projects). So, projects at the ASF are about evenly split between Git and Subversion.
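The “about evenly split” claim can be checked with a few lines of arithmetic. This sketch uses only the counts quoted above. The total of 30 incubating projects is approximate (“around 30”), so the podling and combined shares are approximate too. The helper name `share` is illustrative.

```python
def share(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


TLP_TOTAL, TLP_GIT = 151, 69          # top-level projects
PODLING_TOTAL, PODLING_GIT = 30, 19   # "around 30" incubating projects

if __name__ == "__main__":
    print(share(TLP_GIT, TLP_TOTAL))          # 45.7
    print(share(PODLING_GIT, PODLING_TOTAL))  # 63.3
    # Combined share across TLPs and podlings.
    print(share(TLP_GIT + PODLING_GIT, TLP_TOTAL + PODLING_TOTAL))  # 48.6
```

Taken together, a little under half of the counted projects use Git as their primary VCS, with podlings leaning more toward Git than established TLPs.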

Furthermore, the Git repositories are pretty active, says Gruno. “When commits are concerned, Git usually outnumbers subversion commits by 1.5/2 to 1, but I suspect this is mainly due to the nature of git commits. In April, 2014, we hit the magic mark where we had more Git commits than Subversion commits.”

Finally, Gruno encourages projects to make the most out of Git if they’re going to switch (or even if they don’t). “If projects do move to git, they really really really should take advantage of our GitHub integration! It opens up ASF to a whole new bunch of wonderful people and for a lot of projects, it results in a ton of new ideas, pull requests, comments, you name it. We can enable GitHub integration for any project (even Subversion projects) quite quickly.”

Want to learn more about using Git at the ASF? ApacheCon Europe will include a session on using Git with Apache projects.