I have a Subversion server with a few different projects in the standard layout like so:
    ProjectA/
        trunk/
        branches/
        tags/
    ProjectB/
        trunk/
            FolderOfBinaries/
            SourceFolderA/
            SourceFolderB/
            SourceFolderC/
        branches/
        tags/
            v1.0/
            v1.1/
            v2.0/
    ProjectC/
        trunk/
        branches/
        tags/
ProjectB is going to be migrated to Git, but not with a standard clone. I want to split the project into two Git repositories: one for the folder full of large binaries that change relatively often, and another for everything else. I did a full clone of the repository and it’s a few GB, but the binaries folder is probably 90% of that, and running git gc takes a long time. I’d rather have a small, fast repository and then add the binaries folder as a submodule if the developer requires it.
I’ve found two potential options so far. First, I could use git filter-branch to try to remove the folder of binaries from the history, as shown in the Git Book. Second, I could use svndumpfilter to split the current Subversion repository into two and then git svn clone each separately.
My question, though, is what will happen to all the history, particularly the branches and tags. I’d still like to know what the folder of binaries looked like at every tag in the project, even though the binaries may not have changed between two tags. Is that possible?
Edit: The folder of binaries is not full of build artefacts (*.class, *.o, *.dll etc) so I can’t just strip it out and make them external. It’s full of binaries that are output from a third-party program that need to be versioned (think OpenOffice documents, Photoshop files etc.).
Well, I’ve managed to do this, but it wasn’t all that straightforward. There may be a better way but not one that I could work out. I did the following:
1. Create a dump of the current repository:

    svnadmin dump /opt/repo > full_dump

2. Filter the dump to remove the binaries folder:

    svndumpfilter exclude "*folderofbinaries*" --pattern --renumber-revs --drop-empty-revs < full_dump > filtered_dump

   I needed to make folderofbinaries a pattern because, way back in the past, someone had checked a binary directly into a tag (!), so the next step was failing due to a missing folder. The pattern is quoted so that the shell does not expand it.

3. Create a local SVN repository with the filtered dump:

    mkdir repo-filtered
    svnadmin create repo-filtered
    svnadmin load repo-filtered < filtered_dump

4. Clone both the full and filtered repo into different folders (I used svn2git). The filtered repo will not contain any of the binaries. If, in the full repo, only the binaries folder changed between tags A and B, in the new filtered Git repo the two tags will point to the same commit, which is exactly what I wanted.

5. In the full Git repo, use Git to strip out everything except the binaries folder.

The reason I had to use Git to isolate the binaries folder was that I couldn’t work out how to maintain the tags using svndumpfilter alone (especially given I had a binary committed directly into a tag). After the conversion I get the same behaviour as in the filtered repo: if no binaries changed between two tags, they both point to the same commit.

The commands for the final step were:
which I got from this question.
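For readers who want to try the final step, one common way to isolate a single folder while keeping tags is git filter-branch with --subdirectory-filter. The sketch below is an assumption about the general approach, not necessarily the exact commands used here. It builds a throwaway repository in a temporary directory, and the file names (asset.bin, main.txt) and commit messages are hypothetical stand-ins for the real project.

```shell
set -e
work=$(mktemp -d)

# Build a small stand-in for the full Git repo (hypothetical file names).
git init -q "$work/full"
cd "$work/full"
git config user.email "demo@example.com"
git config user.name "Demo"
mkdir FolderOfBinaries SourceFolderA
echo bin1 > FolderOfBinaries/asset.bin
echo src1 > SourceFolderA/main.txt
git add .
git commit -q -m "Initial import"
git tag v1.0
echo src2 > SourceFolderA/main.txt
git commit -q -am "Source-only change"
git tag v1.1
echo bin2 > FolderOfBinaries/asset.bin
git commit -q -am "Binary change"
git tag v2.0

# Work on a copy so the full repo stays intact.
cp -R "$work/full" "$work/binaries"
cd "$work/binaries"

# Make FolderOfBinaries the new root, drop commits that become empty,
# and carry every tag over to its rewritten commit.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --prune-empty \
    --subdirectory-filter FolderOfBinaries \
    --tag-name-filter cat -- --all

git ls-tree --name-only v2.0    # lists only asset.bin
git rev-parse v1.0 v1.1 v2.0    # v1.0 and v1.1 print the same hash
```

In this sketch, v1.0 and v1.1 end up on the same commit because only sources changed between them, which matches the behaviour described above. The FILTER_BRANCH_SQUELCH_WARNING variable only silences the notice that newer Git versions print before running filter-branch.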
Now I have an 80MB sources repository and a 1.5GB binaries repository from my original 4.4GB SVN dump file! I can recreate the exact state of the original SVN repo by adding the binaries folder as a Git submodule of the sources repo and checking out the same tag on each (which is why I needed to preserve all the tag info) whilst not having one mammoth Git repo that’s slow to work with.
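Recreating a tagged state from the two repositories can be sketched roughly as follows. This is a minimal illustration under stated assumptions: both repositories are tiny stand-ins created in a temporary directory, the file names are hypothetical, and the submodule is cloned from a local path. A real setup would use the URL of the hosted binaries repository instead.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the converted binaries repository, with two tags.
git init -q "$work/binaries"
cd "$work/binaries"
git config user.email "demo@example.com"
git config user.name "Demo"
echo bin1 > asset.bin
git add asset.bin
git commit -q -m "Binaries at v1.0"
git tag v1.0
echo bin2 > asset.bin
git commit -q -am "Binaries at v2.0"
git tag v2.0

# Stand-in for the converted sources repository, with the same tags.
git init -q "$work/sources"
cd "$work/sources"
git config user.email "demo@example.com"
git config user.name "Demo"
echo src1 > main.txt
git add main.txt
git commit -q -m "Sources at v1.0"
git tag v1.0
echo src2 > main.txt
git commit -q -am "Sources at v2.0"
git tag v2.0

# Check out the wanted tag in the sources repo.
git checkout -q v1.0

# Attach the binaries repo at the path the folder used to occupy.
# protocol.file.allow is only needed because this demo clones from a
# local path; newer Git versions block that for submodules by default.
git -c protocol.file.allow=always submodule add "$work/binaries" FolderOfBinaries

# Check out the same tag inside the submodule.
(cd FolderOfBinaries && git checkout -q v1.0)

cat main.txt FolderOfBinaries/asset.bin
```

Both working trees now reflect v1.0, so the combined checkout corresponds to the state of the original repository at that tag. The submodule is only staged here, not committed, since the point is to pair matching tags rather than to record a pinned submodule revision.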