---
title: Subversion migration to Git
date: 2017-05-12 18:30:00
---

Some time ago I was tasked with migrating our Subversion repositories to Git. This article was only written recently because, well, I had forgotten about the notes I took during the migration and only stumbled on them again recently.

Our largest repository was something like 500 GB and contained a little more than 50'000 commits. The goal was to recover the svn history into git, keeping as much information as possible about the commits and the links between them, and to keep the branches. Over the years, a number of periodic database dumps had been committed that now weighed down the repository without serving any purpose. There were also a number of branches that were never used and contained nothing of interest. The decision was also taken to split some of the tools into their own repositories instead of keeping them in the same repository, cleaning up the main repository so that it only keeps the main project and related sources.

## Principles

* After some experiments, I decided to use svn2git, a tool used by KDE for their migration. It has the advantage of taking a rules file that allows splitting a repository by svn path, processing tags and branches and transforming them, ignoring other paths, and so on.
* As the import of such a large repository is slow, I decided to mount a btrfs partition so that each step can be snapshotted, allowing me to test the next step without any fear of having to start again from the beginning.
* Some binary files had been added to the svn history and it made sense to keep them. I decided to migrate them to git-lfs to reduce the history size without losing them completely.
* A lot of commit messages contain references to other commits. I wanted to process these commit messages and transform each `rXXXX` revision reference into a git hash so that tools can create a link automatically.

## Tools

The first tool to retrieve is [svn2git](https://github.com/svn-all-fast-export/svn2git). The compilation should be easy: install the dependencies and compile it.

```
$ git clone https://github.com/svn-all-fast-export/svn2git.git
$ sudo apt install libqt4-dev libapr1-dev libsvn-dev
$ qmake .
$ make
```

Once the tool is compiled, we can prepare the btrfs mount in which we will run the migration steps.

```
$ mkdir repositories
$ truncate -s 300G repositories.btrfs
$ sudo mkfs.btrfs repositories.btrfs
$ sudo mount repositories.btrfs repositories
$ sudo chown 1000:1000 repositories
```

We will also write a small tool in Go to process the commit messages.

```
sudo apt install golang
```

Finally, we will need `bfg`, a git cleansing tool. You can download the jar file from the [BFG Repo-Cleaner website](https://rtyley.github.io/bfg-repo-cleaner/).

## First steps

The first step of the migration is to retrieve the svn repository itself on the local machine. This is not a checkout of the repository; we need the server folder directly, with the whole history and metadata.

```
rsync -avz --progress sshuser@svn.myserver.com:/srv/svn_myrepository/ .
```

In this case I had SSH access to the server, allowing me to simply rsync the repository. Doing so allowed me to prepare the migration in advance, copying only the new commits on each synchronisation instead of the whole repository with its large history. Most of the repository files are never updated, so this step is only slow on the first execution.

### User mapping

The next step is to create a mapping file that maps the svn users to git users.
A user in svn is just a username, whereas in git it is a name and an email address. To get a list of user accounts, we can use the svn command directly on the local repository like this:

```
svn log file:///home/tsc/svn_myrepository \
  | egrep '^r.*lines?$' \
  | awk -F'|' '{print $2;}' \
  | sort \
  | uniq
```

This returns the list of users appearing in the logs. For each of these users, you should create a line in a mapping file, like so:

```
auser Albert User
aperson Anaelle Personn
```

This file will be given as input to `svn2git` and must be complete, otherwise the import will fail.

### Path mapping

The second mapping for the svn to git migration of a repository is the svn2git rules file. This file tells the program what goes where. In our case, the repository was not strictly adhering to the standard svn layout: it contained the usual trunk, tags and branches structure, but also some other folders for "out-of-branch" projects.

```txt
# We create the main repository
create repository svn_myrepository
end repository

# We create repositories for external tools that will move
# to their own repositories
create repository aproject
end repository
create repository bproject
end repository
create repository cproject
end repository

# We declare a variable to ease the declaration of the
# migration rules further down
declare PROJECTS=aproject|bproject|cproject

# We create repositories for out-of-branch folders
# that will migrate to their own repositories
create repository aoutofbranch
end repository
create repository boutofbranch
end repository

# We always ignore database dumps wherever they are.
# In our case, the database dumps are named "database-dump-20100112"
# or forms close to that.
match /.*/database([_-][^/]+)?[-_](dump|oracle|mysql)[^/]+
end match

# There are also dumps stored in their own folder
match /.*/database/backup(/old)?/.*(.zip|.sql|.lzma)
end match

# At some point the build results were also added to the history, we want
# to ignore them
match /.*/(build|dist|cache)/
end match

# We process our external tools only on the master branch.
# We use the previously declared variable to reduce the repetition
# and use the pattern match to move it to the correct repository.
match /trunk/(tools/)?(${PROJECTS})/
  repository \2
  branch master
end match

# And we ignore them if they are on tags or branches
match /.*/(tools/)?${PROJECTS}/
end match

# We start processing our main project after r10, as the
# first commits were missing the trunk and moved the branches, trunk and tags
# folders around.
match /trunk/
  min revision 10
  repository svn_myrepository
  branch master
end match

# There are branches that are hierarchically organized.
# Such cases have to be explicitly configured.
match /branches/(old|dev|customers)/([^/]+)/
  repository svn_myrepository
  branch \1/\2
end match

# Other branches are, as expected, directly in the branches folder.
match /branches/([^/]+)/
  repository svn_myrepository
  branch \1
end match

# The tags were used in a strange fashion before commit r2500,
# so we ignore everything before that refactoring
match /tags/([^/]+)/
  max revision 2500
end match

# After that, we create a branch for each tag as the svn tags
# were not used correctly and were committed to. We just name
# them differently and will process them afterwards.
match /tags/([^/]+)/([^/]+)/
  min revision 2500
  repository svn_myrepository
  branch \1-\2
end match

# Our out-of-branch folders will be processed directly, only creating
# a master branch.
match /aoutofbranch/
  repository aoutofbranch
  branch master
end match

match /boutofbranch/
  repository boutofbranch
  branch master
end match

# Everything else is discarded and ignored
match /
end match
```

This file will quickly grow with the number of migration operations that you want to do. Ignore files here if possible, as it reduces the migration time as well as the postprocessing that needs to be done afterwards. In my case, a number of files were too complex to match during the migration or were spotted only afterwards and had to be cleaned in a second pass with other tools.

### Migration

This step will take a lot of time as it reads the whole svn history, processes the declared rules and generates the git repositories and every commit.

```
$ cd repositories
$ ~/workspace/svn2git/svn-all-fast-export \
    --add-metadata \
    --svn-branches \
    --identity-map ~/workspace/migration-tools/accounts-map.txt \
    --rules ~/workspace/migration-tools/svnfast.rules \
    --commit-interval 2000 \
    --stat \
    /home/tsc/svn_myrepository
```

If there is a crash during this step, it means that you are either missing an account in your mapping, that one of your rules emits an erroneous branch or repository, or that no rule matches at all. Once this step has finished, I like to do a btrfs snapshot so that I can come back to this point while putting the next steps into place.

```
btrfs subvolume snapshot -r repositories repositories/snap-1-import
```

## Cleanup

The next phase is to clean up our import. There will always be a number of branches that are unused, named incorrectly, contain only temporary files, or are so far from the standard naming that our rules cannot process them correctly. We will simply delete or rename them using git.

```
$ cd svn_myrepository
$ git branch -D oldbranch-0.3.1
$ git branch -D customer/backup_temp
$ git branch -m customer/stable_v1.0 stable-1.0
```

The goal at this step is to keep only the branches that should survive the migration. We do this now to reduce the repository size early on and thus reduce the time needed for the next steps. If you spot branches that can be deleted or renamed further down the road, you can also remove or rename them then. I like to take a snapshot at this stage, as the next stage usually involves a lot of tests and manually building a list of things to remove.

```
btrfs subvolume snapshot -r repositories repositories/snap-2a-cleanup
```

We can also remove files that should never have been added, by generating a list of every file ever checked into our new git repository, inspecting it manually and putting the identifiers of the files to remove into a new file:

```sh
$ git rev-list --objects --all > ./all-files
$ cat ./all-files | your-filter | cut -d' ' -f1 > ./to-delete-ids
$ java -jar ~/Downloads/bfg-1.12.15.jar --private --no-blob-protection --strip-blobs-with-ids ./to-delete-ids
```

We will take a snapshot again, as the next step also involves checks and tests.

```
btrfs subvolume snapshot -r repositories repositories/snap-2b-cleanup
```

Next, we will convert the binary files that we still want to keep in our repository to Git-LFS. This allows git to only keep track of the hash of the file in the history instead of storing the whole binary in the repository, thus reducing the size of the clones. BFG does this quickly and efficiently, removing every file matching the given name from the history and storing it in Git-LFS. This step requires some exploration of the previous `all-files` file to identify which files need to be converted.
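One possible way to do that exploration is to attach a size to every object and keep only the largest blobs. This is just a sketch, not the exact filter I used; the 1 MiB threshold and the `head -50` limit are arbitrary and should be adapted to your repository:

```sh
# List every object with its type, id, size and path,
# then keep only blobs larger than 1 MiB, biggest first.
$ git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '$1 == "blob" && $3 > 1048576 { print $3, $4 }' \
  | sort -rn \
  | head -50
```

Once the patterns of the files to keep are identified, BFG converts them to Git-LFS: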
```sh
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs 'my-important-archive*.zip'
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs '*.ear'
```

After the cleanup, I also like to do a btrfs snapshot so that the history rewrite step can be executed and tested multiple times.

```
btrfs subvolume snapshot -r repositories repositories/snap-2c-cleanup
```

### Linking an svn revision to a git commit

The import log prints, for each revision, a line mapping it to a mark in the git marks file. In the git repository, there is then a marks file that maps each mark to a commit hash. We can use this information to build a mapping database that stores the relation for later. In our case, I wrote a Java program that parses both files and stores the resulting mapping in a LevelDB database. This database will then be used by a Golang server that reads it into memory and exposes an RPC interface that we will call from small Golang binaries during a `git filter-branch` run. The Golang server also needs to keep track of the modifications to the git commit hashes, as the history rewrite changes them.

First, the Java tool that reads the logs and generates the LevelDB database:

```java
import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;

public class CommitMapping {

    public static String FILE_LOG_IMPORT = "../log-svn_myrepository";
    public static String FILE_MARKS = "marks-svn_myrepository";
    public static String FILE_BFG_DIR = "../svn_myrepository.bfg-report";

    public static Pattern PATTERN_LOG = Pattern.compile("^progress SVN (r\\d+) branch .* = (:\\d+)");

    public static void main(String[] args) throws Exception {
        List<String> importLines = IOUtils.readLines(new FileReader(new File(FILE_LOG_IMPORT)));
        List<String> marksLines = IOUtils.readLines(new FileReader(new File(FILE_MARKS)));

        // Collect the "object-id-map.old-new.txt" file of every BFG pass
        Collection<File> passFilesCol = FileUtils.listFiles(new File(FILE_BFG_DIR), new IOFileFilter() {
            @Override
            public boolean accept(File pathname, String name) {
                return name.equals("object-id-map.old-new.txt");
            }

            @Override
            public boolean accept(File path) {
                return this.accept(path, path.getName());
            }
        }, DirectoryFileFilter.DIRECTORY);

        List<File> passFiles = new ArrayList<>(passFilesCol);
        Collections.sort(passFiles,
                (File o1, File o2) -> o1.getParentFile().getName().compareTo(o2.getParentFile().getName()));

        Map<String, String> commitToIdentifier = new LinkedHashMap<>();
        Map<String, String> identifierToHash = new HashMap<>();

        // Map each svn revision to the mark emitted by svn2git in the import log
        for (String importLine : importLines) {
            Matcher marksMatch = PATTERN_LOG.matcher(importLine);
            if (marksMatch.find()) {
                String dest = marksMatch.group(2);
                if (dest == null || dest.length() == 0 || ":0".equals(dest))
                    continue;
                commitToIdentifier.put(marksMatch.group(1), dest);
            } else {
                System.err.println("Unknown line : " + importLine);
            }
        }

        File dbFile
                = new File(System.getenv("HOME") + "/mapping-db");
        File humanFile = new File(System.getenv("HOME") + "/mapping");
        FileUtils.deleteQuietly(dbFile);

        Options options = new Options();
        options.createIfMissing(true);
        DB db = Iq80DBFactory.factory.open(dbFile, options);

        // Map each mark to the git hash of the initial import
        marksLines.stream()
                .map((line) -> line.split("\\s", 2))
                .forEach((parts) -> identifierToHash.put(parts[0], parts[1]));

        BiMap<String, String> commitMapping = HashBiMap.create(commitToIdentifier.size());
        for (String commit : commitToIdentifier.keySet()) {
            String importId = commitToIdentifier.get(commit);
            String hash = identifierToHash.get(importId);
            if (hash == null)
                continue;
            commitMapping.put(commit, hash);
        }

        System.err.println("Got " + commitMapping.size() + " svn -> initial import entries.");

        // Follow the hash rewrites of every BFG pass
        for (File file : passFiles) {
            System.err.println("Processing file " + file.getAbsolutePath());
            List<String> bfgPass = IOUtils.readLines(new FileReader(file));
            Map<String, String> hashMapping = bfgPass.stream()
                    .map((line) -> line.split("\\s", 2))
                    .collect(Collectors.toMap(parts -> parts[0], parts -> parts[1]));

            for (String hash : hashMapping.keySet()) {
                String rev = commitMapping.inverse().get(hash);
                if (rev != null) {
                    String newHash = hashMapping.get(hash);
                    System.err.println("Replacing " + rev + ", was " + hash + ", is " + newHash);
                    commitMapping.replace(rev, newHash);
                }
            }
        }

        // Write both the human readable mapping and the LevelDB database
        PrintStream fos = new PrintStream(humanFile);
        for (Map.Entry<String, String> entry : commitMapping.entrySet()) {
            String commit = entry.getKey();
            String target = entry.getValue();
            fos.println(commit + "\t" + target);
            db.put(Iq80DBFactory.bytes(commit), Iq80DBFactory.bytes(target));
        }
        db.close();
        fos.close();
    }
}
```

We will use RPC between a client and a server so that the LevelDB database can be kept open and the clients stay very light, as they will be executed once per commit. During my tests, opening the database each time proved really time consuming, hence this approach, even though the server does very little.

The structure of our go project is the following:

```txt
go-gitcommit/client-common: rpc.go
go-gitcommit/client-insert: insert-mapping.go
go-gitcommit/client-query:  query-mapping.go
go-gitcommit/server:        server.go
```

First, some plumbing for the RPC in `rpc.go`:

```go
package Client

import (
	"net"
	"net/rpc"
	"time"
)

type (
	// Client wraps the RPC connection to the mapping server
	Client struct {
		connection *rpc.Client
	}

	// MappingItem is the response from the cache or the item to insert into the cache
	MappingItem struct {
		Key   string
		Value string
	}

	// BulkQuery allows to mass query the DB in one go.
	BulkQuery []MappingItem
)

// NewClient connects to the mapping server listening on the given address
func NewClient(dsn string, timeout time.Duration) (*Client, error) {
	connection, err := net.DialTimeout("tcp", dsn, timeout)
	if err != nil {
		return nil, err
	}
	return &Client{connection: rpc.NewClient(connection)}, nil
}

// InsertMapping sends an old -> new commit hash pair to the server
func (c *Client) InsertMapping(item MappingItem) (bool, error) {
	var ack bool
	err := c.connection.Call("RPC.InsertMapping", item, &ack)
	return ack, err
}

// GetMapping resolves a batch of svn revisions into git hashes
func (c *Client) GetMapping(bulk BulkQuery) (BulkQuery, error) {
	var bulkResponse BulkQuery
	err := c.connection.Call("RPC.GetMapping", bulk, &bulkResponse)
	return bulkResponse, err
}
```

Next, the Golang server that reads this database, in `server.go`:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"os"
	"time"

	"github.com/syndtr/goleveldb/leveldb"

	Client "../client-common"
)

var (
	cacheDBPath = os.Getenv("HOME") + "/mapping-db"
	cacheDB     *leveldb.DB
	flowMap     map[string]string
	f           *os.File
	g           *os.File
)

type (
	// RPC is the base class of our RPC system
	RPC struct {
	}
)

func main() {
	var cacheDBerr error
	cacheDB, cacheDBerr = leveldb.OpenFile(cacheDBPath, nil)
	if cacheDBerr != nil {
		fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
		log.Fatal(cacheDBerr)
	}
	roErr := cacheDB.SetReadOnly()
	if roErr != nil {
		fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
		log.Fatal(roErr)
	}

	flowMap = make(map[string]string)

	f, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.log")
	defer f.Close()
	g, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.ins")
	defer g.Close()

	rpc.Register(NewRPC())
	l, e := net.Listen("tcp", ":9876")
	if e != nil {
		log.Fatal("listen error:", e)
	}
	go flushLog()
	rpc.Accept(l)
}

// flushLog periodically syncs the query log to disk
func flushLog() {
	for {
		time.Sleep(100 * time.Millisecond)
		f.Sync()
	}
}

// NewRPC creates the RPC handler
func NewRPC() *RPC {
	return &RPC{}
}

// InsertMapping records an old -> new hash pair produced by the commit filter
func (r *RPC) InsertMapping(mappingItem Client.MappingItem, ack *bool) error {
	old := mappingItem.Key
	new := mappingItem.Value
	flowMap[old] = new
	g.WriteString(fmt.Sprintf("Inserted mapping %s -> %s\n", old, new))
	*ack = true
	return nil
}

// GetMapping resolves svn revisions to the rewritten git hashes
func (r *RPC) GetMapping(bulkQuery Client.BulkQuery, resp *Client.BulkQuery) error {
	for i := range bulkQuery {
		key := bulkQuery[i].Key
		response, _ := cacheDB.Get([]byte(key), nil)
		gitCommit := key
		if response != nil {
			responseStr := string(response[:])
			responseUpdated := flowMap[responseStr]
			if responseUpdated != "" {
				gitCommit = responseUpdated[:12] + "(" + key + ")"
				f.WriteString(fmt.Sprintf("Response to mapping %s -> %s\n", bulkQuery[i].Key, gitCommit))
			} else {
				f.WriteString(fmt.Sprintf("No git mapping for entry %s\n", responseStr))
			}
		} else {
			f.WriteString(fmt.Sprintf("Unknown revision %s\n", key))
		}
		bulkQuery[i].Value = gitCommit
	}
	*resp = bulkQuery
	return nil
}
```

And finally our clients. The insert client is called from `git filter-branch` with the previous and the new commit hash after each commit is processed. We store this information so that the hash returned when mapping a revision is the final, rewritten one.
The code goes into `insert-mapping.go`:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	Client "../client-common"
)

func main() {
	old := os.Args[1]
	new := os.Args[2]

	rpcClient, err := Client.NewClient("localhost:9876", time.Millisecond*500)
	if err != nil {
		log.Fatal(err)
	}

	mappingItem := Client.MappingItem{
		Key:   old,
		Value: new,
	}
	ack, err := rpcClient.InsertMapping(mappingItem)
	if err != nil || !ack {
		log.Fatal(err)
	}

	// The commit filter expects the new commit hash on stdout
	fmt.Println(new)
}
```

The query client receives the commit message of each commit, checks whether it contains `rXXXX` revision references and queries the server for the corresponding git hashes. It goes into `query-mapping.go`:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"regexp"
	"strings"
	"time"

	client "../client-common"
)

func main() {
	// Read the whole commit message passed by the msg-filter on stdin
	raw, _ := ioutil.ReadAll(os.Stdin)
	text := string(raw)

	re := regexp.MustCompile(`\Wr[0-9]+`)
	matches := re.FindAllString(text, -1)
	if matches == nil {
		fmt.Print(text)
		return
	}

	rpcClient, err := client.NewClient("localhost:9876", time.Millisecond*500)
	if err != nil {
		log.Fatal(err)
	}

	var bulkQuery client.BulkQuery
	for i := range matches {
		if matches[i][0] != '-' {
			// Strip the leading non-word character kept by the regexp
			key := matches[i][1:]
			bulkQuery = append(bulkQuery, client.MappingItem{Key: key})
		}
	}

	gitCommits, _ := rpcClient.GetMapping(bulkQuery)
	for i := range gitCommits {
		gitCommit := gitCommits[i].Value
		key := gitCommits[i].Key
		text = strings.Replace(text, key, gitCommit, 1)
	}
	fmt.Print(text)
}
```

For this step, we first need to compile and execute the Java program. Once it has created the database, we compile and start the Go server in the background. Then we can launch `git filter-branch` on our repository to rewrite the history:

```sh
$ git filter-branch \
    --commit-filter 'NEW=`git_commit_non_empty_tree "$@"`; \
        ${HOME}/migration-tools/go-gitcommit/client-insert/client-insert $GIT_COMMIT $NEW' \
    --msg-filter "${HOME}/migration-tools/go-gitcommit/client-query/client-query" \
    -- --all --author-date-order
```

As after each step, we generate a snapshot, even though this should be the last step that cannot easily be repeated.

```
btrfs subvolume snapshot -r repositories repositories/snap-3-mapping
```

We now clean the repository, which still contains a lot of unused blobs, branches and commits:

```sh
$ git reflog expire --expire=now --all
$ git prune --expire=now --progress
$ git repack -adf --window-memory=512m
```

We now have a repository that should be more or less clean. You will still have to check the history, the size of the blobs and whether some branches can be deleted before pushing it to your server.
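Once you are happy with the result, the repositories can be pushed to their new home. A minimal sketch, assuming a hypothetical remote at `git.myserver.com` and that `git-lfs` is installed on the machine doing the push:

```sh
# Make sure the LFS hooks are in place so the LFS objects are pushed too
$ git lfs install
# "git.myserver.com" is a placeholder for your actual git server
$ git remote add origin git@git.myserver.com:svn_myrepository.git
# Push every branch, then the tags if any were created
$ git push origin --all
$ git push origin --tags
```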