Lessons learned: migrating a complicated repository from Subversion to Mercurial

13 jan 2012
Written by Boris Drajer
Published in Development Environment

Migrating from SVN to Mercurial is a simple process only if the SVN repository has a straight-and-square structure: that is, there are trunk, branches and tags folders in the root and nothing else, neither in its present state nor at any point in its history. If you used your SVN repository in a way that was convenient in SVN but not in Mercurial (for example, you created branches in various subdirectories), it still shouldn’t be too hard to migrate. But if you, like me, decided late in the game to create the mentioned folders in SVN and then moved and renamed your folders, you will need to invest serious time if you don’t want to lose parts of your history. You need to plot your migration very thoroughly and do a lot of test runs.

The reason for this is that Mercurial’s ConvertExtension is somewhat of a low-level tool. (In other words, although reliable it is not too bright). Browsing the internet you may get the impression that it’s an automated conversion system: it isn’t. It does fully automated migration only for straight SVN repositories, but for the rest it’s more like something to use in your migration script. It seems to do its primary purpose – converting revisions from one repository format to another – quite well but the rest of the tool is not so intelligent and it needs help. So, lesson number one: if you have a complex repository, don’t take the migration lightly.

A small disclaimer is in order: this post is not intended to be a complete step-by-step guide to migration. Rather, it’s something to fill in the blanks left by what little is available on the internet. I’ve done a complex migration and I want to do a brain dump for my future reference or for “whomeverother it may concern”.

Between the two alternatives I perceived as most promising, hgsubversion and the convert extension, I chose the latter. Hgsubversion was claimed by some to be the better tool for this job, but it was somewhat troublesome. The problem with hgsubversion was that it had a memory leak and broke easily in the middle of the conversion (note that this happened a couple of months ago: things may have changed in the meantime). The solution, they say, was to do hg pull repeatedly until it finishes. I wanted to do an hg clone with a filemap, but when the import broke I was in trouble because hg pull doesn’t accept filemaps. (It could be that the filemap was cached somewhere inside the new repository and my worries were unfounded, I don’t really know.) I may try that in the future. One other way around it would be to do a straight clone of SVN, with no branches or anything, into an intermediate Mercurial repository and then split that into separate final repositories. In that case, hgsubversion could be a viable solution, maybe even better than the convert extension. I had more success with the convert extension so this is what we’ll talk about here.

Part 1 – splitting import by revision

The repository in question here – that is, a folder within the SVN repository – started as part of another project and only a thousand revisions later was moved into its separate folder - thankfully, at that point I at least created the proper trunk and branches folders. So, I had one part of history where everything was trunk but it moved around the repository, and another part where nothing moved but I had a trunk and a couple of branches. Luckily, I had a buffer zone of a couple of hundred revisions where everything was trunk and nothing moved so I didn’t have to pinpoint the exact revision on which I had to split the import.

Let’s say that the early version had a structure like this:

Crm/Fwk1 
Crm/Fwk2 
Crm/Other unimportant folders (that is, not to be imported)

At revision 1000 the first two folders were moved to

Framework/trunk/Fwk1 
Framework/trunk/Fwk2

The first branches appeared at revision 1200 in Framework/branches.

So, it was to be done like this:

Step one, import everything up to revision 1100 into the default branch. Include the Crm/Fwk* folders and the Framework/trunk/Fwk* folders. As I said, the first branches appeared only at revision 1200, so at revision 1100 the Framework/branches folder was still empty and we don’t lose anything.
Step two, import the rest but tell the convert extension that the branches are in the Framework/branches folder so that it picks them up properly.

Sounds simple? Note that I had to do some serious research into my repository: had a branch been created earlier than revision 1000, or had I made a branch at the same moment I created the trunk, things would have been more complicated. That is to say, I would probably have had to split the import into more steps, repeat some operations and do more testing to see at which exact revision I should stop the first step. Lesson number two: know thy repository.

The first step of this import is not so hard. I used the --rev 1100 argument to stop it at revision 1100: the convert extension purportedly remembers what it has imported so far and, when called again, continues from that point (well, not exactly… read on).

"c:\Program Files\TortoiseHg\hg.exe" convert d:\data\Subversion Framework --rev 1100 ^
  -s svn --filemap=fwkmap_step1.txt

Note that I have access to a local SVN repository – for some reason, the convert extension didn’t want to access the local repository using an svn:// url (possibly the firewall had something to do with it).

The only thing left is to make a good filemap. Something like this:

include Crm/Fwk1 
include Crm/Fwk2 
include Framework/trunk 

rename Framework/trunk .

What we want to do here is to include the old folders – this is the first pair of lines: the folders exist only in earlier revisions since they were removed (that is, moved) later. I was under the impression that it would have been sufficient to include just “Framework/trunk” and that the convert extension would somehow detect where this path originated from and include the full history, but it didn’t work out. On another repository I tried I was surprised to see that it actually did something like that, but it may have been a coincidence (a combination of other includes, possibly). In any case, it doesn’t hurt to specify the filemap as precisely as possible since you may have to fiddle with various parameters and do repeat runs. Make the filemap tight so that nothing unexpected leaks through it and eliminate any uncertainty.

The last line of the file map - “rename Framework/trunk .” tells it to make the trunk folder root. This is to make sure the structure of the folders is the same as it will be in the second step, where we use different parameters and import a completely different structure into the same folders.
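To reason about what a filemap will let through before starting a long import, a small model can help. The following is a minimal sketch in Python of how I understand the include/exclude/rename rules to work (the longest matching path prefix wins, and "." as a rename target means the repository root). It is a simplified model for experimenting, not the convert extension's actual code, and the function names are made up for this illustration.

```python
import shlex


def parse_filemap(text):
    """Parse include/exclude/rename directives into three collections."""
    includes, excludes, renames = set(), set(), {}
    for raw in text.splitlines():
        parts = shlex.split(raw, comments=True)
        if not parts:
            continue  # blank line or comment
        command, args = parts[0], parts[1:]
        if command == "include":
            includes.add(args[0])
        elif command == "exclude":
            excludes.add(args[0])
        elif command == "rename":
            renames[args[0]] = args[1]
    return includes, excludes, renames


def longest_prefix(path, candidates):
    """Return the longest directory prefix of path found in candidates, or ''."""
    parts = path.split("/")
    for end in range(len(parts), 0, -1):
        prefix = "/".join(parts[:end])
        if prefix in candidates:
            return prefix
    return ""


def map_path(path, includes, excludes, renames):
    """Return the destination path, or None if the filemap drops the file."""
    inc = longest_prefix(path, includes) if includes else path
    exc = longest_prefix(path, excludes)
    if (not includes and exc) or len(inc) <= len(exc):
        return None
    prefix = longest_prefix(path, renames)
    if not prefix:
        return path
    rest = path[len(prefix):].lstrip("/")
    target = renames[prefix]
    if target == ".":
        return rest
    return f"{target}/{rest}" if rest else target


STEP1_FILEMAP = """
include Crm/Fwk1
include Crm/Fwk2
include Framework/trunk
rename Framework/trunk .
"""
```

Calling map_path("Framework/trunk/Fwk1/a.cs", *parse_filemap(STEP1_FILEMAP)) gives "Fwk1/a.cs" under this model, while "Crm/Other/b.cs" is dropped and "Crm/Fwk1/a.cs" keeps its original path because no rename covers it. The file name a.cs is only an illustration.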

Always keep in mind that the filemap (probably as well as everything else) is case sensitive. I spent hours debugging my imports because I didn’t notice the difference in case. Also, if you have a folder (or file) whose case has changed through history, it may be wise to add a rename statement in your filemap to make it consistent so that the conversion logic understands that it’s the same file/folder in different revisions (otherwise I’m not sure it would?).
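A quick sanity check can catch case differences before they cost hours. The sketch below assumes you have already collected a list of folder paths that actually occur in the SVN history (how you collect it is up to you), and it reports filemap paths that match one of them only when letter case is ignored. The function name and the sample paths are hypothetical.

```python
def find_case_mismatches(filemap_paths, known_paths):
    """Return (filemap_path, actual_path) pairs that differ only by letter case."""
    by_folded = {}
    for path in known_paths:
        by_folded.setdefault(path.casefold(), set()).add(path)

    mismatches = []
    for filemap_path in filemap_paths:
        candidates = by_folded.get(filemap_path.casefold(), set())
        # An exact match is fine; only flag paths that exist in another casing.
        if candidates and filemap_path not in candidates:
            mismatches.extend((filemap_path, actual) for actual in sorted(candidates))
    return mismatches
```

For example, find_case_mismatches(["crm/Fwk1"], ["Crm/Fwk1"]) flags the pair, which is a hint that the filemap line would silently match nothing.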

In step two, we tell it where the trunk and branches are, using the --config convert.svn.trunk and --config convert.svn.branches parameters. I’ve come to the conclusion that this changes the game for everything: the convert extension regards the trunk and each branch as a root folder, so a filemap like the one from the first step wouldn’t work. I haven’t tried it with pathnames relative to trunk and/or branches, though, and it may be worth investigating. In this case I didn’t need a filemap because after revision 1100 everything was done “by the book” in the Framework/trunk and Framework/branches folders.

So, when I ran both steps I got two distinct revision lineages: one that started at 0 and finished at 1100 and included the first run, and another that had revisions from 1000 onwards, but there was no connection between the two, each ended with its own head. And – oh, yeah, I got the branches the way I wanted them in the second part. But, how to connect the two parts?

The thing is, when running an import from a local repository (be it SVN or another HG repo), a file called SHAMAP is stored in the .hg folder of the destination repository (if you import from a remote repository, there’s an equivalent file the name of which I forgot; I believe it’s stored somewhere in .hg/svnsomething). The SHAMAP file contains pairs of revision identifiers so that the tool knows which source revision was converted into which destination revision. For an SVN import, the source side consists of a GUID for the repository and a revision number, in the format “SVN_REPO_GUID@SVNREV”. I’m also under the impression that revisions stored here won’t be imported again on subsequent conversions. As far as I can tell this is the wrong behaviour, because a filemap include/exclude may cause only part of a revision to be imported, and the rest of it may need to be imported in a later step. In such cases it is you who needs to help by supplying your own revision mapping file, and that’s probably what the convert extension authors also thought, because it can be done by supplying the REVMAP parameter to hg convert. Remember what I said about it being a low-level tool? This is it. You need to write your own script to do the import properly, and hg convert is a tool used in the script. The bottom line: at the end, you should know what you’re doing. You can (and probably will) learn as you go, though, so don’t be afraid to experiment. (And while we’re at it: if you’re doing a time-consuming import in multiple steps, test each step separately and, when you’re satisfied with it, zip the resulting repository so that you don’t have to repeat that step while testing the next one.)

But I digress… Back to SHAMAP: the fact that it remembers revisions already imported and doesn’t allow repeated imports didn’t bother me here because I don’t have overlaps – that is, I don’t need to import the same revision (but different files) in multiple passes. The Crm/* folders disappear long before revision 1000 and at that point I only need Framework/trunk, which is also true in the second step that comes in after revision 1100.

Ok, but I did get duplicate revisions. It imported Framework/trunk up to rev 1100 in the first step and then again imported Framework/trunk from its inception to the end. Looking at SHAMAP shows why: the revisions were registered in a different way here in the second step. Instead of the “SVN_REPO_GUID@SVNREV” format, it stored something like “SVN_REPO_GUID/Framework/trunk@SVNREV”. Why? I’m not sure, it may have something to do with treating the trunk and branch folders as roots. It’s probably an attempt to prevent the problem mentioned above, when a revision needs to be imported multiple times. But this is far from complete, because in that case the filemap also needs to have similar influence on the SHAMAP so that it reflects both filtering and renaming policies set by it. (Mission impossible, I know… That’s probably why the convert extension is badly documented – when you need to explain something like this you risk receiving questions like “so why didn’t you make it better to prevent this problem?”).
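To see this kind of duplication without reading the whole file by eye, something like the following sketch could be used. It assumes each SHAMAP line holds two whitespace-separated fields (source identifier, then destination hash) and that SVN source identifiers end in “@” plus the revision number, as described above. It lists the SVN revisions that were registered under more than one key prefix. The function name is made up.

```python
from collections import defaultdict


def revisions_with_multiple_keys(shamap_lines):
    """Map SVN revision number -> sorted key prefixes, for revisions seen under 2+ prefixes."""
    prefixes_by_rev = defaultdict(set)
    for line in shamap_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        prefix, separator, rev = fields[0].rpartition("@")
        if not separator or not rev.isdigit():
            continue  # not in the assumed "...@REV" form
        prefixes_by_rev[int(rev)].add(prefix)
    return {
        rev: sorted(prefixes)
        for rev, prefixes in prefixes_by_rev.items()
        if len(prefixes) > 1
    }
```

Feeding it the lines of a copy of the SHAMAP file after both steps should show exactly the overlapping revisions, each with both the plain and the Framework/trunk prefix.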

One solution for this could be the splice map: it’s a file wherein you can define which revision needs to be connected to which during import. I tried this without success (I suspect that I didn’t pick the right revisions – probably the two spliced revisions need to be identical) but found a hack that produced immediate results: I opened SHAMAP and did a quick find/replace of “SVN_REPO_GUID@” with “SVN_REPO_GUID/Framework/trunk@”. This converted the revision hashes into the format the second step used, so it understood them and connected them correctly.
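The find/replace can also be scripted, which makes it repeatable across test runs. This is a minimal sketch under the same assumption about the line format (source identifier first, starting with the repository GUID). The repo_id and subpath arguments are placeholders you would fill in from your own SHAMAP, and it is safest to run this on a copy of the file.

```python
def rewrite_shamap_keys(lines, repo_id, subpath):
    """Turn 'REPO@REV ...' keys into 'REPO/subpath@REV ...', leaving other lines alone."""
    old_prefix = repo_id + "@"
    new_prefix = f"{repo_id}/{subpath}@"
    rewritten = []
    for line in lines:
        if line.startswith(old_prefix):
            line = new_prefix + line[len(old_prefix):]
        rewritten.append(line)
    return rewritten
```

Lines that already carry the Framework/trunk form do not start with “GUID@”, so they pass through unchanged and the function can be run more than once without doing damage.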

Here’s the command used for conversion.

"c:\Program Files\TortoiseHg\hg.exe" convert d:\data\Subversion Framework -s svn ^
  --config convert.svn.trunk=Framework/trunk --config convert.svn.branches=Framework/branches ^
  --config convert.svn.tags=Framework/tags --branchmap=fwkbranchmap.txt

The fwkbranchmap.txt file has one line (I’m not sure why it is needed anyway, I supposed that the convert extension understands that “trunk” in SVN is “default” in Mercurial):

trunk default

So this is one way to do it. I thought I would need to investigate the things further for the import of other repositories, but came up with a different strategy. So it’s left at this state, a bit unpolished but usable.

Part 2 – splitting import by trunk and branches

For the second repository, I had three projects of which two were partial branches of the third one which from now on is to be considered the trunk. So I thought I could import them one by one: the trunk has moved a bit through the repository, and the branches have stayed mostly in place. This is what it looks like:

trunk - up to revision 2000:

Crm/Crm1  
Crm/Crm2 
Crm/Fwk* which were imported in part 1 and need to be ignored now.

trunk - after revision 2000:

Crm/trunk/Crm1  
Crm/trunk/Crm2 
Crm/branches - which we will ignore to make things simpler, as they are obsolete anyway

branch for client1:

Client1/trunk/Crm1 
Client1/trunk/Crm2 
Client1/branches – ignored for simplicity

branch for client2:

Client2/trunk/Crm1 
Client2/trunk/Crm2 
Client2/branches – ignored for simplicity

What do we do now? Split the repository vertically – import the trunk folders into default branch in step 1. Then import the first client in step 2 using branchmap to move the default branch into a new “client1” branch and repeat for client2 branch. We won’t use the convert.svn.branches parameter but import each branch explicitly. This we can do even with a straight Mercurial-to-Mercurial conversion, which I tried to do: I imported the full SVN repo into a Mercurial repo and then did the next conversion from it. I thought it would be faster: it wasn’t. Also, the difference in speed between importing from an SVN repository folder and importing from a local SVN server is not significant.

In a case like this, you need to exert full control over import. Treat the convert extension like it doesn’t know much and tell it all the details about the conversion. From what I’ve seen, its logic is somewhat counter-intuitive: it would make sense for it to reconstruct the revision history by following each file through its revisions, whether through branching or moving around in the repository. In that case it could reconstruct a file’s history from its creation till today, and all you would need is to tell it where the file is today. But it doesn’t do it like that: instead, you give it a bunch of include/exclude filters to tell it which files to retrieve, and these filters are applied at any point in history. If you made an accidental move of a folder at some point in the past, make sure you include that path also or your revision history will stop at that point.

We will possibly be importing each revision multiple times (in case multiple branches were committed to SVN in a single commit - which is improbable but possible), but each time with a different include/exclude filter in the filemap – for this, we have to make sure the filters don’t overlap unless necessary. Here we come to a new problem: it seems that the conversion extension doesn’t want to do repeated conversions of old history. It seems to remember what was the last revision imported and only imports newer ones. When convert.svn.branches or convert.svn.trunk is used, it views the revisions differently (relative to a different root) and doesn’t mind importing them again. But we won’t use those here, at least in this case, and it wouldn’t have made much difference anyway – I think it would just help with this particular problem and nothing else.

We solve this by using an empty REVMAP file. An empty REVMAP will replace the SHAMAP and make it look like nothing was imported yet. And, better still, we can also use it to get rid of the splicing problem: put in it the mapping for revisions where the source was branched so that the two branches connect at the proper point. Otherwise, the conversion will create branches that aren’t connected. How do we do this? After the first step, in which we import the trunk with all of its history from day one (and here I assume that all branches originate from it), we should have all junction points in the repository. The next step is to view the history for each branch and note at what revision in the trunk it was branched. We find that revision in our SHAMAP file (which was updated at step 1, the trunk import, but won’t be used afterwards) and add that line to our REVMAP file for that branch. It may happen that multiple revisions from the source repository were imported into this one revision in our new trunk; in that case I put all of them in the REVMAP, just in case.

One important thing to note here is that we have full power over the outcome. If we misconnect the branches, we may get odd results. I don’t expect the source files to be screwed up, but the history may become a bit strange. In fact, the complicated combination of filemap filtering, splicing and the rest may produce some odd contents in the repository. I managed to get a folder that was deleted years ago to reappear in the latest version of the repository. It seems that it was deleted on the trunk after a branch was split from it, but the branch didn’t include it in the first place. The filemap filters for the branch probably made it ignore the folder completely, and since the folder wasn’t mentioned anywhere in the branch history (primarily not as being deleted), it appeared in the branch as it was at the point where the branch was created. Strange stuff, but instead of trying to tune everything to get it correctly imported (and risk getting more unneeded garbage in the process), I simply deleted the folder and committed this change in the destination repository.

There’s another lesson learned here: in order to make the import for this part easier, I tried moving folders around in the SVN repository to recreate the canonical trunk/branches/tags structure. If you want to do your conversion without convert.svn.branches (that is, convert each branch separately), don’t do it. It will only make your life harder because you will also have to include that folder in your filemap, and probably rename its contents to become root. I myself stripped this revision from the destination repository as if it never happened.

The command sequence looks something like this. First the trunk import:

"c:\Program Files\TortoiseHg\hg.exe" convert Full Crm -v --filemap=crmfullmap_step1.txt

(I added the “-v” switch so that I can see what’s going on, remove if the output is too verbose… This switch is useful because it causes the printout of all files included, so you can check to see if there’s anything suspicious – it won’t be of too much help determining whether something’s missing but you’ll be able to see if there’s anything you don’t want).

At this point we need to look at the SHAMAP file generated and create REVMAP files for each branch, as described above. Since the file will be overwritten by the import logic, you may want to make a revmap template file and copy it to the real file used each time this step is run (I zipped the repository after the first step so that I can repeat the second part as many times as needed until I get it right). It looks something like this:

del crmfullrevmap_step2.txt
copy "crmfull revmap template step 2.txt" crmfullrevmap_step2.txt
"c:\Program Files\TortoiseHg\hg.exe" convert Full Crm --filemap crmfullmap_step2.txt ^
  --branchmap crmfullbranchmap_step2.txt crmfullrevmap_step2.txt
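Since this step tends to be repeated many times, it can be wrapped in a small script. The following is a sketch only: the helper names are made up, the file names are the ones from the commands above, and it assumes hg is reachable under the path or name you pass in. It resets the REVMAP from the template and then builds the same hg convert command line.

```python
import shutil
import subprocess


def reset_revmap(template_path, revmap_path):
    """Overwrite the working REVMAP with a fresh copy of the template."""
    shutil.copyfile(template_path, revmap_path)


def convert_argv(hg, source, dest, filemap, branchmap, revmap):
    """Build the argument list for: hg convert SOURCE DEST --filemap ... --branchmap ... REVMAP."""
    return [
        hg, "convert", source, dest,
        "--filemap", filemap,
        "--branchmap", branchmap,
        revmap,
    ]


def run_step(hg, template_path, source, dest, filemap, branchmap, revmap):
    """Reset the REVMAP, then run the conversion and fail loudly if hg reports an error."""
    reset_revmap(template_path, revmap)
    subprocess.run(convert_argv(hg, source, dest, filemap, branchmap, revmap), check=True)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as the TortoiseHg install folder or the template file name.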

Note that here I have a Mercurial copy of SVN in the folder named “Full”, and I import from it. The same thing could probably be done directly from SVN, only the REVMAP file would look different.

The "crmfull revmap template step 2.txt" file:

d379848121be332d162b1df014670558e2fa8dd4   be7726e1e2c98b3694b0c28ca5f058769a382018 
3d753bea786bdbc1d25747c93e2554aa4134dd0c   be7726e1e2c98b3694b0c28ca5f058769a382018 
d3c9339835df79c8d6cb10d3e9e589669cd993bd   be7726e1e2c98b3694b0c28ca5f058769a382018

Here be7726e1e2c98b3694b0c28ca5f058769a382018 is the hash ID of the revision at which the branch and trunk join. I got this number by going to the original SVN history to see what revision it was joined at, then went to my destination repository (imported at step 1) to find the equivalent Mercurial revision – luckily, the SVN revision numbers are kept even in svn-to-mercurial-to-mercurial conversion. Just to be sure, I put in all the lines from SHAMAP file where this revision appears in the right column.
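Picking those lines out of SHAMAP can be automated once the junction hash is known. This is a minimal sketch, assuming each SHAMAP line has two whitespace-separated fields with the destination hash in the right column. The function name is made up for the example.

```python
def revmap_lines_for_junction(shamap_lines, junction_hash):
    """Return every 'source destination' line whose destination is the junction revision."""
    selected = []
    for line in shamap_lines:
        fields = line.split()
        if len(fields) == 2 and fields[1] == junction_hash:
            selected.append(f"{fields[0]} {fields[1]}")
    return selected
```

Writing the returned lines, one per line, into the template file gives a REVMAP of the same shape as the one shown above.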

The branch file crmfullbranchmap_step2.txt is a one-liner to move everything into the appropriate branch, called “client1”:

default client1

This process is then repeated for other branches.

Conclusion

The end result here is that we have working repositories in Mercurial. We have been using them for a couple of months now (yes, this post is a bit old but hindsight is also worth something) and all seems right. I haven’t noticed losing any SVN revisions (although I may have) and the imported Mercurial repositories behave just like any others: they even exhibit Mercurial’s flaws (like problems with Unicode comments) the same way in the imported and newly created revisions. So, this procedure may be far from perfect but it did the job. If someone creates a better and more automated one, I’ll be sure to try it since I have a couple of low-key projects still left in SVN and awaiting migration.
