Tuesday, April 27, 2010

Subversion replication at Atlassian

SkyHi @ Tuesday, April 27, 2010

It's cool working for an international company with an open philosophy, but our decentralised setup can cause some real headaches for sysadmins. One of these is giving fast access to the source-code repository to our developers and support staff spread over 3 continents, all working on a common code base.

Subversion is the existing version-control system here, primarily for the tool support and well-understood workflow. But it's not without its problems, not least that its chatty-on-the-wire nature causes problems when latency is introduced. And when your developers are in Sydney and your servers in St. Louis that's about as high-latency as you're going to get on the internet.

While Subversion 1.5 introduced the concept of a write-through proxy, the devil is in the details. The documentation of how to do this is sparse, and developing a robust method of replication is "left as an exercise to the reader". This post documents some of the considerations that need to be taken into account and the method we are using at Atlassian to get reliable high-speed Subversion servers in a distributed environment.

The basic replication architecture is straightforward: the slave server serves up checkouts and meta-data from a local cache but transparently proxies checkins to the master server. The concept of how checkins are replicated to the slaves is also simple enough:

  1. A user checks-out a working copy from the slave and makes changes
  2. User issues 'svn commit' which pushes the changes to the slave
  3. The slave transparently pushes the commit to the master
  4. The master completes the commit and invokes its post-commit hook
  5. The post-commit hook contains code to push the update to all the known slaves
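
The five steps above can be modelled with a toy in-memory sketch. This is not how Subversion or mod_dav_svn is implemented; the `Master` and `Slave` classes and their methods are hypothetical names used only to show who talks to whom: reads come from the slave's local cache, commits pass straight through to the master, and the master's post-commit step brings every slave up to date.

```python
class Master:
    """Toy stand-in for the master repository (hypothetical, for illustration)."""

    def __init__(self):
        self.revisions = []  # committed changes, oldest first
        self.slaves = []     # slaves to update after each commit

    def commit(self, change):
        # Step 4: complete the commit, then invoke the post-commit hook.
        self.revisions.append(change)
        self.post_commit()
        return len(self.revisions)

    def post_commit(self):
        # Step 5: push the update to all known slaves.
        for slave in self.slaves:
            slave.sync()


class Slave:
    """Toy stand-in for a write-through proxy with a local mirror."""

    def __init__(self, master):
        self.master = master
        self.cache = []
        master.slaves.append(self)

    def checkout(self):
        # Step 1: checkouts are served from the local cache only.
        return list(self.cache)

    def commit(self, change):
        # Steps 2 and 3: the commit is proxied transparently to the master.
        return self.master.commit(change)

    def sync(self):
        # Copy only the revisions this mirror does not have yet.
        self.cache.extend(self.master.revisions[len(self.cache):])
```

In this model a commit made through one slave becomes visible on every other slave as soon as the master's post-commit step has run, which is the behaviour the rest of this post tries to achieve reliably.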

However the devil is in the details. The exact method of push-to-slave operation is poorly documented; there is a brief suggested method in the readme file that is unfortunately highly synchronous. As already mentioned, an alternative method using svnsync is "left as an exercise to the reader". We need a method of doing this that minimises commit time while keeping all slaves up to date.

The problem

The problem with the documented SSH + dump/restore method is that it will tie up the committing client and the server for the entire time it takes to upload and import the incremental dump. Worse, if a slave is unavailable for any reason the commit will hang until the TCP session times out. Furthermore that slave will then be out of sync with the master repository and future commits will fail. What we need is a method where the slaves are updated asynchronously and will compensate for missed commits.

The solution

Enter svnsync. This allows mirroring of a Subversion repository in a transaction-aware manner, only pulling down revisions it does not currently have. It performs its own local locking on the mirrored repository so collisions are not an issue.

However there is still the question of when to run the updates. We could just poll the repository with a cron-script, but this creates a window where the slaves are out of sync unless the sync is run constantly, which would be wasteful. However a purely event-driven system suffers from some of the same problems as the SSH dump/restore system above; if an update is missed the slave is out of sync until the next update is received. Furthermore if the event is implemented synchronously the post-commit script is tied up.

In the end I opted for a hybrid solution where each slave runs a server that accepts a single UDP packet to trigger an update (allowing the post-commit script to fire-and-forget), with an intermittent scheduled update to compensate for missed events.
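
As a rough illustration of the hybrid idea, here is a minimal standard-library sketch, separate from the Twisted server shown later in this post. The names `wait_for_trigger`, `serve`, `run_sync` and `max_runs` are hypothetical. It is also a simplification: the sync runs inline, and the timer restarts after each sync rather than ticking on a fixed schedule.

```python
import socket
import time


def wait_for_trigger(sock, deadline):
    """Block until a datagram arrives or the deadline passes.

    Returns "packet" if a trigger datagram was received, "timer" otherwise.
    """
    remaining = max(0.0, deadline - time.monotonic())
    sock.settimeout(remaining)
    try:
        sock.recvfrom(64)  # the payload is ignored; any packet is a trigger
        return "packet"
    except (socket.timeout, BlockingIOError):
        # BlockingIOError covers the case where the remaining time is zero
        return "timer"


def serve(sock, run_sync, interval=120.0, max_runs=None):
    """Call run_sync(reason) on every trigger and at least every `interval` seconds.

    max_runs exists only so the loop can be stopped in tests.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        reason = wait_for_trigger(sock, time.monotonic() + interval)
        run_sync(reason)
        runs += 1
```

A lost UDP packet costs at most one interval of staleness, because the timer branch runs the same sync. Packets that arrive while a sync is in progress simply wait in the socket buffer and trigger another (cheap, usually empty) sync afterwards.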

Setting up the mirror

The first step is to initialise the svnsync mirror. This requires setting up a new repository then initialising it from the master. To ensure repository integrity only a special svnsync user can write to the repository:

sudo su - svnsync
svnadmin create /opt/svn/repositories/atlassian.com/private-mirror

Before synchronisation, revision-property changes must be enabled on the mirror. Again, only the special user can perform this action. Create the file /opt/svn/repositories/atlassian.com/private-mirror/hooks/pre-revprop-change and add the following:

#!/bin/sh
USER="$3"

if [ "$USER" = "svnsync" ]; then
    # Allow
    exit 0
fi

echo "Only the svnsync user can change revprops" >&2
exit 1

Then convert the repository to a synchronisable one by setting the remote source. Then perform the initial sync-up:

svnsync init file:///opt/svn/repositories/atlassian.com/private-mirror https://svn.atlassian.com/svn/private
svnsync sync file:///opt/svn/repositories/atlassian.com/private-mirror

This copies the entire history of the master to the slave, so depending on your repository size it may take some time. Once this is done the following will update the mirror to the latest master revision:

svnsync sync file:///opt/svn/repositories/atlassian.com/private-mirror

One problem you are likely to hit with this setup is that because we created a new repository from scratch it has a different UUID from the master. This is fine for checkouts but will fail on commits. However we can manually copy the UUID across from the master:

cd /opt/svn/repositories/atlassian.com/private-mirror/db/
scp svn.atlassian.com:/opt/svn/repositories/atlassian.com/private-mirror/db/uuid .
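
After copying the file it is worth confirming that the two repositories really do agree. The sketch below is one way to check, assuming the UUID is stored on the first line of `db/uuid` (the file copied above) and that both repositories are readable from the machine running the check. The function names `read_repo_uuid` and `uuids_match` are hypothetical helpers, not part of Subversion.

```python
from pathlib import Path


def read_repo_uuid(repo_path):
    """Return the repository UUID, assumed to be the first line of db/uuid."""
    uuid_file = Path(repo_path) / "db" / "uuid"
    with uuid_file.open() as fh:
        return fh.readline().strip()


def uuids_match(master_repo, mirror_repo):
    """True if both repositories report the same UUID."""
    return read_repo_uuid(master_repo) == read_repo_uuid(mirror_repo)
```

Only the first line is compared, so any additional lines in the file are ignored. Subversion 1.5 and later also provide `svnadmin setuuid`, which may be a tidier alternative to copying the file by hand.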

You should now have a working mirror which can be made available via the SVN 1.5 proxy in Apache (authentication is ignored in this example; the directives belong inside the <Location> block for the repository):

   
DAV svn
SVNPath /opt/svn/repositories/atlassian.com/private-mirror
SVNMasterURI https://svn.atlassian.com/svn/private

The next step is to keep the mirror up-to-date ...

The Update Event Server

So we need a server that will accept UDP packets, fork off and monitor sub-processes, and trigger time-based events. We could probably monkey-up something with inetd and cron but I like to keep all the variables in one place so I implemented my own server that handles all the tasks in the same place. Of course, reinventing the wheel sucks so I turned to the Python Twisted framework which supplies all of the necessary pieces ...

import sys
from twisted.internet.protocol import DatagramProtocol, ProcessProtocol
from twisted.internet import reactor, task

# Command used to bring the local mirror up to date
cmdline = ['svnsync', 'sync', 'file:///opt/svn/repositories/atlassian.com/private-mirror']
lockmsg = b"Failed to get lock"

_debug = False

def debug(msg):
    if _debug:
        print(msg, file=sys.stderr)

def error(msg):
    print(msg, file=sys.stderr)

def log(msg):
    print(msg, file=sys.stdout)


class SyncProcess(ProcessProtocol):
    """Tracks a single svnsync child process and logs its output."""

    def __init__(self):
        self.running = False

    def connectionMade(self):
        self.running = True
        log("SVN sync process started")

    def outReceived(self, data):
        # Child output arrives as bytes
        log("stdout> %s" % data.decode(errors="replace"))
        if lockmsg in data:
            error("ERROR: The mirror repo has a lock on it")

    def errReceived(self, data):
        log("stderr> %s" % data.decode(errors="replace"))

    def inConnectionLost(self):
        debug("inConnectionLost! stdin is closed! (we probably did it)")

    def outConnectionLost(self):
        debug("outConnectionLost! The child closed their stdout!")

    def errConnectionLost(self):
        debug("errConnectionLost! The child closed their stderr.")

    def processEnded(self, status):
        self.running = False
        log("Sync process ended, status %s" % status.value.exitCode)


class SyncListener(DatagramProtocol):
    """Runs a sync on every UDP packet and on a fixed timer."""

    def __init__(self):
        self.prochandler = SyncProcess()
        self.timeout = task.LoopingCall(self.runsync)

    def startProtocol(self):
        log("Starting UDP server and timeout")
        # Scheduled sync every 120 seconds to catch missed packets
        self.timeout.start(120, now=False)

    def datagramReceived(self, data, addr):
        host, port = addr
        log("Received packet from %s:%d" % (host, port))
        self.runsync()

    def runsync(self):
        # Never run two syncs at once
        if self.prochandler.running:
            log("Not running sync as another process is present")
        else:
            reactor.spawnProcess(self.prochandler, cmdline[0], cmdline, {})


reactor.listenUDP(9999, SyncListener())
reactor.run()

This server runs constantly on the slave server listening on port 9999. On receiving a packet it forks off an svnsync process (unless one is already running). Additionally, every two minutes it runs a sync regardless. The server is started via daemontools, which ensures that if the server quits for any reason it is restarted.

Triggering updates

When the master receives a commit it triggers an update on each slave by sending a UDP packet to them. This is done in the post-commit script using the netcat network tool:

echo 1 | nc -w1 -u svn.sydney.atlassian.com 9999
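
If the post-commit hook is written in Python rather than shell, the same fire-and-forget trigger can be sent with the standard `socket` module. This is a sketch only: the function name `notify_slaves` and the `SLAVES` list are hypothetical, and the host and port are the ones used in the netcat example above.

```python
import socket

# Hypothetical list of (host, port) pairs, one per slave
SLAVES = [("svn.sydney.atlassian.com", 9999)]


def notify_slaves(slaves, payload=b"1\n"):
    """Send one trigger datagram to each slave and return those that failed.

    Errors such as a failed DNS lookup are collected rather than raised,
    so a broken slave never causes the hook itself to fail.
    """
    failed = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in slaves:
            try:
                sock.sendto(payload, (host, port))
            except OSError:
                failed.append((host, port))
    return failed
```

Calling `notify_slaves(SLAVES)` from the hook does not wait for any reply. If a packet is lost, the slave's scheduled sync picks up the missed revisions.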

And that's it, with the exception of some caveats ...

Locking

It's not clear how locking interacts with replication; however distributed locking is not something that should be taken lightly. For this reason I've disabled locking on both the master and slave repositories. This is just a matter of putting the following in the pre-lock hook:

#!/bin/sh

# Disable locking as we are doing replication and it's not clear how
# they will interact.
echo "Locking is disabled due to replication" >&2
exit 1

This will return a meaningful error message if someone attempts to lock a file.

Client version issue

There is a known issue with some versions of Subversion clients when adding files to replicated slaves. The list of clients I've tested is below:

Client                  Version    Working
Subversion commandline  1.4.*      Yes
Subversion commandline  1.5.0      Yes
Subversion commandline  1.5.[1-4]  No
TortoiseSVN             1.4.*      Yes
TortoiseSVN             1.5.*      No
IDEA                    7.0.*      Yes
IDEA                    8.0M1      Yes

Distributed VCS

The elephant in the room here is that none of this should really be necessary. There are now a number of version-control systems, commercial and open-source, that are distributed in nature and so don't need this special treatment. With these systems commits are two phase, with a local checkin followed (optionally) by a merge to a remote repository (or a pull depending on your development model). This is undoubtedly the way of the future and there has already been discussion about trialling them internally at Atlassian. However there are two short-term issues that prevent an immediate migration:

  • Tool support. Fisheye, Crucible, Maven, IDEA; until these parts of our tool-chain have native support for these next-gen systems our workflow would have to be severely modified.
  • Developer process. Because a local commit does not automatically propagate changes to the master repository more discipline is required from developers. In practice this would probably require creating the role of a merge-master on each team who would make sure all working trees are regularly merged and conflicts resolved.

Neither of these problems is insurmountable though, and I expect that in time distributed source-control will become the norm rather than the niche it currently is.


REFERENCES

http://blogs.atlassian.com/developer/2008/11/subversion_replication_at_atla.html