---
title: Subversion migration to Git
date: 2017-05-12 18:30:00
---

Some time ago I was tasked with migrating our Subversion repositories to Git. This article was only written long after the fact because, well, I had forgotten about the notes I had taken during the migration and only stumbled upon them recently.

Our largest repository was around 500 GB and contained a little more than 50'000 commits. The goal was to recover the svn history into git, keeping as much information as possible about the commits and the links between them, and to keep the branches. Over the course of that history, a number of periodic database dumps had been committed that now weighed down the repository without serving any purpose. There were also a number of branches that were never used and contained nothing of interest.

The decision was also taken to split some of the tools into their own repositories instead of keeping them in the same repository, cleaning up the main repository so that it only contains the main project and related sources.

## Principles
* After some experiments, I decided to use svn2git, a tool used by KDE for their migration. It has the advantage of taking a rules file that allows splitting a repository by svn path, processing tags and branches and transforming them, ignoring other paths, and so on.
* As the import of such a large repository is slow, I decided to mount a btrfs partition so that each step could be snapshotted, allowing me to test the next step without any fear of having to start again from the beginning.
* Some binary files were added to the svn history and it made sense to keep them. I decided to migrate them to git-lfs to reduce the history size without losing them completely.
* A lot of commit messages contain references to other commits; I wanted to process these messages and transform each reference to an `rXXXX` revision into a git hash so that tools can create a link automatically.

## Tools
The first tool to retrieve is [svn2git](https://github.com/svn-all-fast-export/svn2git).

The compilation should be easy: install the dependencies, then build it.

```
$ git clone https://github.com/svn-all-fast-export/svn2git.git
$ sudo apt install libqt4-dev libapr1-dev libsvn-dev
$ cd svn2git
$ qmake .
$ make
```

Once the tool is compiled, we can prepare the btrfs mount in which we will run the migration steps.

```
$ mkdir repositories
$ truncate -s 300G repositories.btrfs
$ sudo mkfs.btrfs repositories.btrfs
$ sudo mount repositories.btrfs repositories
$ sudo chown 1000:1000 repositories
```

We will also write a small tool in Go to process the commit messages, so we need the Go toolchain.

```
sudo apt install golang
```

We will also need `bfg`, a git history cleaning tool. You can download the jar file from the [BFG Repo-Cleaner website](https://rtyley.github.io/bfg-repo-cleaner/).

## First steps
The first step of the migration is to retrieve the svn repository itself on the local machine. This is not a checkout of the repository; we need the server-side folder directly, with the whole history and metadata.

```
rsync -avz --progress sshuser@svn.myserver.com:/srv/svn_myrepository/ .
```

In this case I had SSH access to the server, allowing me to simply rsync the repository. Doing so allowed me to prepare the migration in advance, copying only the new commits on each synchronisation instead of the whole repository with its large history. Most of the repository files are never updated, so this step is only slow on the first execution.

### User mapping
The first step is to create a mapping file that maps the svn users to git users. A user in svn is a simple username, whereas in git it is a name and an email address.

To get the list of user accounts, we can use the svn command directly on the local repository like this:

```
svn log file:///home/tsc/svn_myrepository \
| egrep '^r.*lines?$' \
| awk -F'|' '{print $2;}' \
| sort \
| uniq
```

This will return the list of users found in the logs. For each of these users, you should create a line in a mapping file, like so:

```
auser Albert User <albert.user@example.com>
aperson Anaelle Personn <anaelle.personn@example.com>
```

This file will be given as input to `svn2git` and must be complete, otherwise the import will fail.

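To avoid typing every entry from scratch, the extraction pipeline above can pre-fill the file. This is only a convenience sketch; the `accounts-map.txt` name and the placeholder addresses are assumptions to be edited by hand afterwards:

```sh
# Pre-fill the identity map with one placeholder entry per svn user,
# then fix up the names and email addresses manually.
svn log file:///home/tsc/svn_myrepository \
| egrep '^r.*lines?$' \
| awk -F'|' '{print $2;}' \
| sort -u \
| while read -r user; do
    echo "$user $user <$user@example.com>"
  done > accounts-map.txt
```
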
### Path mapping
The second mapping for the svn to git migration is the svn2git rules file. This file tells the program what goes where. In our case, the repository did not strictly adhere to the standard svn layout: it contained the usual trunk, tags and branches structure, but also some other folders for "out-of-branch" projects.

```txt
# We create the main repository
create repository svn_myrepository
end repository

# We create repositories for external tools that will move
# to their own repositories
create repository aproject
end repository
create repository bproject
end repository
create repository cproject
end repository

# We declare a variable to ease the declaration of the
# migration rules further down
declare PROJECTS=aproject|bproject|cproject

# We create repositories for out-of-branch folders
# that will migrate to their own repositories
create repository aoutofbranch
end repository
create repository boutofbranch
end repository

# We always ignore database dumps wherever they are.
# In our case, the database dumps are named "database-dump-20100112"
# or something close to that.
match /.*/database([_-][^/]+)?[-_](dump|oracle|mysql)[^/]+
end match

# There are also dumps stored in their own folder
match /.*/database/backup(/old)?/.*(.zip|.sql|.lzma)
end match

# At some point the build results were also added to the history; we want
# to ignore them
match /.*/(build|dist|cache)/
end match

# We process our external tools only on the master branch.
# We use the previously declared variable to reduce repetition
# and use the pattern match to move each tool to the correct repository.
match /trunk/(tools/)?(${PROJECTS})/
  repository \2
  branch master
end match

# And we ignore them if they are on tags or branches
match /.*/(tools/)?${PROJECTS}/
end match

# We start processing our main project only after r10, as the
# first commits were missing the trunk and moved the branches, trunk and tags
# folders around.
match /trunk/
  min revision 10
  repository svn_myrepository
  branch master
end match

# There are branches that are hierarchically organized.
# Such cases have to be explicitly configured.
match /branches/(old|dev|customers)/([^/]+)/
  repository svn_myrepository
  branch \1/\2
end match

# Other branches are, as expected, directly in the branches folder.
match /branches/([^/]+)/
  repository svn_myrepository
  branch \1
end match

# The tags were used in a strange fashion before commit r2500,
# so we ignore everything before that refactoring
match /tags/([^/]+)/
  max revision 2500
end match

# After that, we create a branch for each tag, as the svn tags
# were not used correctly and were committed to. We just name
# them differently and will process them afterwards.
match /tags/([^/]+)/([^/]+)/
  min revision 2500
  repository svn_myrepository
  branch \1-\2
end match

# Our out-of-branch folders are processed directly, only creating
# a master branch.
match /aoutofbranch/
  repository aoutofbranch
  branch master
end match

match /boutofbranch/
  repository boutofbranch
  branch master
end match

# Everything else is discarded and ignored
match /
end match
```

This file will quickly grow with the number of migration operations that you want to perform. Ignore files here if possible, as doing so reduces both the migration time and the amount of postprocessing needed afterwards. In my case, a number of files were too complex to match during the migration or were only spotted afterwards, and had to be cleaned in a second pass with other tools.

### Migration
This step will take a lot of time, as it reads the whole svn history, processes the declared rules and generates the git repositories and every commit.

```
$ cd repositories
$ ~/workspace/svn2git/svn-all-fast-export \
    --add-metadata \
    --svn-branches \
    --identity-map ~/workspace/migration-tools/accounts-map.txt \
    --rules ~/workspace/migration-tools/svnfast.rules \
    --commit-interval 2000 \
    --stat \
    /home/tsc/svn_myrepository
```

If there is a crash during this step, it means that you are either missing an account in your mapping file, that one of your rules is emitting an erroneous branch or repository, or that no rule is matching.

Once this step has finished, I like to take a btrfs snapshot so that I can return to this point while putting the next steps into place.

```
btrfs subvolume snapshot -r repositories repositories/snap-1-import
```

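If a later step goes wrong, the read-only snapshot can be turned back into a writable subvolume and the work resumed from there; a minimal sketch (the target name is arbitrary):

```sh
# Recover the state saved after the import into a fresh writable subvolume
sudo btrfs subvolume snapshot repositories/snap-1-import repositories/work-after-import
```
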
## Cleanup
The next phase is to clean up our import. There will always be a number of branches that are unused, named incorrectly, contain only temporary files, or are so far from the standard naming that our rules could not process them correctly.

We will simply delete or rename them using git.

```
$ cd svn_myrepository
$ git branch -D oldbranch-0.3.1
$ git branch -D customer/backup_temp
$ git branch -m customer/stable_v1.0 stable-1.0
```

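To decide which branches can go, it helps to list them by last activity; this one-liner is a convenience on top of the original notes:

```sh
# List branches oldest-first with the date of their last commit
git for-each-ref --sort=committerdate \
  --format='%(committerdate:short) %(refname:short)' refs/heads/
```
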
The goal at this step is to clean up the branches that will be kept after the migration. We do this now to reduce the repository size early on and thus reduce the time needed for the next steps.

If you spot branches that can be deleted or renamed further down the road, you can also remove or rename them then.

I like to take a snapshot at this stage, as the next stage usually involves a lot of tests and manually building a list of things to remove.

```
btrfs subvolume snapshot -r repositories repositories/snap-2a-cleanup
```

We can also remove files that should never have been added, by generating a list of every file ever checked into our new git repository, inspecting it manually, and adding the identifiers of the files to remove to a new file:

```sh
$ git rev-list --objects --all > ./all-files
$ cat ./all-files | your-filter | cut -d' ' -f1 > ./to-delete-ids
$ java -jar ~/Downloads/bfg-1.12.15.jar --private --no-blob-protection --strip-blobs-with-ids ./to-delete-ids
```

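As an illustration, for the database dumps described earlier the filter could be a simple `grep`; the pattern below is hypothetical and has to be adapted to whatever the inspection of `all-files` reveals:

```sh
# Hypothetical filter: select the blob ids of anything that looks like a dump
grep -E '(^|/)database[-_](dump|oracle|mysql)' ./all-files \
  | cut -d' ' -f1 > ./to-delete-ids
```
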
We take a snapshot again, as the next step also involves checks and tests.

```
btrfs subvolume snapshot -r repositories repositories/snap-2b-cleanup
```

Next, we will convert the binary files that we still want to keep in the repository to Git-LFS. This allows git to keep track of only the hash of the file in the history instead of storing the whole binary in the repository, thus reducing the size of the clones.

BFG does this quickly and efficiently, removing every file matching the given name from the history and storing it in Git-LFS. This step will require some exploration of the previous `all-files` list to identify which files need to be converted.

```sh
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs 'my-important-archive*.zip'
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs '*.ear'
```

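To sanity-check the conversion, a converted file in the new history should now be a small Git-LFS pointer; a quick probe (the path is a placeholder):

```sh
# A Git-LFS pointer blob starts with the spec version line
git show HEAD:libs/my-important-archive-1.0.zip | head -n 1
# Expected: version https://git-lfs.github.com/spec/v1
```
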
After the cleanup, I again take a btrfs snapshot so that the history rewrite step can be executed and tested multiple times.

```
btrfs subvolume snapshot -r repositories repositories/snap-2c-cleanup
```

### Linking a svn revision to a git commit
For each revision, the import log prints a line that maps the svn revision to a mark in the git marks file. In the git repository, there is then a marks file that maps each mark to a commit hash. We can use this information to build a mapping database that stores that relation for later.

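For reference, the two inputs look roughly like this; the revision, mark and hash values are illustrative:

```txt
log-svn_myrepository, one progress line per imported revision:
  progress SVN r1234 branch master = :5678

marks-svn_myrepository, one line per mark:
  :5678 89abcdef0123456789abcdef0123456789abcdef
```
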
In our case, I wrote a Java program that parses both files and stores the resulting mapping into a LevelDB database.

This database is then used by a Go server that reads the mapping into memory and serves an RPC interface that we call from small Go binaries in a `git filter-branch` run. The Go server also needs to keep track of the modifications to the git commit hashes, as the history rewrite changes them.

First, the Java tool to read the logs and generate the LevelDB database:

```java
import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import java.io.File;
import java.io.FileReader;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;

public class CommitMapping {

    public static String FILE_LOG_IMPORT = "../log-svn_myrepository";
    public static String FILE_MARKS = "marks-svn_myrepository";
    public static String FILE_BFG_DIR = "../svn_myrepository.bfg-report";

    // Lines such as "progress SVN r1234 branch master = :42" in the import log
    public static Pattern PATTERN_LOG = Pattern.compile("^progress SVN (r\\d+) branch .* = (:\\d+)");

    public static void main(String[] args) throws Exception {

        List<String> importLines = IOUtils.readLines(new FileReader(new File(FILE_LOG_IMPORT)));
        List<String> marksLines = IOUtils.readLines(new FileReader(new File(FILE_MARKS)));

        // Collect the "object-id-map.old-new.txt" files written by each BFG pass
        Collection<File> passFilesCol = FileUtils.listFiles(new File(FILE_BFG_DIR), new IOFileFilter() {
            @Override
            public boolean accept(File pathname, String name) {
                return name.equals("object-id-map.old-new.txt");
            }

            @Override
            public boolean accept(File path) {
                return this.accept(path, path.getName());
            }
        }, DirectoryFileFilter.DIRECTORY);

        List<File> passFiles = new ArrayList<>(passFilesCol);

        // The report directories are timestamped, so this sorts the passes chronologically
        Collections.sort(passFiles, (File o1, File o2) -> o1.getParentFile().getName().compareTo(o2.getParentFile().getName()));

        Map<String, String> commitToIdentifier = new LinkedHashMap<>();
        Map<String, String> identifierToHash = new HashMap<>();

        // svn revision -> fast-import mark
        for (String importLine : importLines) {
            Matcher marksMatch = PATTERN_LOG.matcher(importLine);

            if (marksMatch.find()) {
                String dest = marksMatch.group(2);
                if (dest == null || dest.length() == 0 || ":0".equals(dest)) continue;

                commitToIdentifier.put(marksMatch.group(1), dest);
            } else {
                System.err.println("Unknown line: " + importLine);
            }
        }

        File dbFile = new File(System.getenv("HOME") + "/mapping-db");
        File humanFile = new File(System.getenv("HOME") + "/mapping");

        FileUtils.deleteQuietly(dbFile);

        Options options = new Options();
        options.createIfMissing(true);
        DB db = Iq80DBFactory.factory.open(dbFile, options);

        // fast-import mark -> git hash of the initial import
        marksLines.stream().map((line) -> line.split("\\s", 2)).forEach((parts) -> identifierToHash.put(parts[0], parts[1]));

        // svn revision -> git hash
        BiMap<String, String> commitMapping = HashBiMap.create(commitToIdentifier.size());
        for (String commit : commitToIdentifier.keySet()) {

            String importId = commitToIdentifier.get(commit);
            String hash = identifierToHash.get(importId);

            if (hash == null) continue;
            commitMapping.put(commit, hash);
        }

        System.err.println("Got " + commitMapping.size() + " svn -> initial import entries.");

        // Each BFG pass rewrote the hashes; follow the old -> new chains in order
        for (File file : passFiles) {
            System.err.println("Processing file " + file.getAbsolutePath());

            List<String> bfgPass = IOUtils.readLines(new FileReader(file));
            Map<String, String> hashMapping = bfgPass.stream().map((line) -> line.split("\\s", 2)).collect(Collectors.toMap(parts -> parts[0], parts -> parts[1]));

            for (String hash : hashMapping.keySet()) {
                String rev = commitMapping.inverse().get(hash);
                if (rev != null) {
                    String newHash = hashMapping.get(hash);
                    System.err.println("Replacing " + rev + ", was " + hash + ", is " + newHash);
                    commitMapping.replace(rev, newHash);
                }
            }
        }

        // Persist both a human-readable file and the LevelDB database
        PrintStream fos = new PrintStream(humanFile);
        for (Map.Entry<String, String> entry : commitMapping.entrySet()) {
            String commit = entry.getKey();
            String target = entry.getValue();

            fos.println(commit + "\t" + target);
            db.put(Iq80DBFactory.bytes(commit), Iq80DBFactory.bytes(target));
        }

        db.close();
        fos.close();
    }
}
```

We will use RPC between a client and a server so that the LevelDB database can be kept open, with very light clients querying the running server, as they will be executed once per commit. In my tests, opening the database was really time consuming, hence this approach, even though the server itself does very little.

The structure of our Go project is the following:

```txt
go-gitcommit/client-common:
    rpc.go

go-gitcommit/client-insert:
    insert-mapping.go

go-gitcommit/client-query:
    query-mapping.go

go-gitcommit/server:
    server.go
```

First, some plumbing for the RPC in `rpc.go`:

```go
package Client

import (
	"net"
	"net/rpc"
	"time"
)

type (
	// Client wraps the RPC connection to the mapping server
	Client struct {
		connection *rpc.Client
	}

	// MappingItem is the response from the cache or the item to insert into the cache
	MappingItem struct {
		Key   string
		Value string
	}

	// BulkQuery allows mass-querying the DB in one go
	BulkQuery []MappingItem
)

// NewClient connects to the mapping server
func NewClient(dsn string, timeout time.Duration) (*Client, error) {
	connection, err := net.DialTimeout("tcp", dsn, timeout)
	if err != nil {
		return nil, err
	}
	return &Client{connection: rpc.NewClient(connection)}, nil
}

// InsertMapping records an old hash -> new hash pair on the server
func (c *Client) InsertMapping(item MappingItem) (bool, error) {
	var ack bool
	err := c.connection.Call("RPC.InsertMapping", item, &ack)
	return ack, err
}

// GetMapping resolves a batch of svn revisions to git hashes
func (c *Client) GetMapping(bulk BulkQuery) (BulkQuery, error) {
	var bulkResponse BulkQuery
	err := c.connection.Call("RPC.GetMapping", bulk, &bulkResponse)
	return bulkResponse, err
}
```

Next, the Go server that reads this database, in `server.go`:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"os"
	"time"

	"github.com/syndtr/goleveldb/leveldb"

	Client "../client-common"
)

var (
	cacheDBPath = os.Getenv("HOME") + "/mapping-db"

	cacheDB *leveldb.DB
	// flowMap tracks the hash rewrites done by the current filter-branch run
	flowMap map[string]string

	f *os.File
	g *os.File
)

type (
	// RPC is the base class of our RPC system
	RPC struct {
	}
)

func main() {
	var cacheDBerr error

	cacheDB, cacheDBerr = leveldb.OpenFile(cacheDBPath, nil)
	if cacheDBerr != nil {
		fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
		log.Fatal(cacheDBerr)
	}

	roErr := cacheDB.SetReadOnly()
	if roErr != nil {
		fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
		log.Fatal(roErr)
	}

	flowMap = make(map[string]string)

	f, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.log")
	defer f.Close()
	g, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.ins")
	defer g.Close()

	rpc.Register(NewRPC())

	l, e := net.Listen("tcp", ":9876")
	if e != nil {
		log.Fatal("listen error:", e)
	}

	go flushLog()

	rpc.Accept(l)
}

// flushLog periodically syncs the query log to disk
func flushLog() {
	for {
		time.Sleep(100 * time.Millisecond)
		f.Sync()
	}
}

// NewRPC -
func NewRPC() *RPC {
	return &RPC{}
}

// InsertMapping records that a commit hash was rewritten by filter-branch
func (r *RPC) InsertMapping(mappingItem Client.MappingItem, ack *bool) error {
	oldHash := mappingItem.Key
	newHash := mappingItem.Value

	flowMap[oldHash] = newHash

	g.WriteString(fmt.Sprintf("Inserted mapping %s -> %s\n", oldHash, newHash))

	*ack = true

	return nil
}

// GetMapping resolves svn revisions to the current git hashes
func (r *RPC) GetMapping(bulkQuery Client.BulkQuery, resp *Client.BulkQuery) error {
	for i := range bulkQuery {
		key := bulkQuery[i].Key

		// The LevelDB database maps a svn revision to the hash of the initial import
		response, _ := cacheDB.Get([]byte(key), nil)

		gitCommit := key
		if response != nil {
			responseStr := string(response[:])
			// flowMap then maps that initial hash to its rewritten value
			responseUpdated := flowMap[responseStr]
			if responseUpdated != "" {
				gitCommit = string(responseUpdated[:])[:12] + "(" + key + ")"

				f.WriteString(fmt.Sprintf("Response to mapping %s -> %s\n", bulkQuery[i].Key, gitCommit))
			} else {
				f.WriteString(fmt.Sprintf("No git mapping for entry %s\n", responseStr))
			}
		} else {
			f.WriteString(fmt.Sprintf("Unknown revision %s\n", key))
		}

		bulkQuery[i].Value = gitCommit
	}

	*resp = bulkQuery

	return nil
}
```

And finally our clients. The insert client will be called from `git filter-branch` with the previous and current commit hashes after each commit is processed. We store this information so that the hashes stay correct when mapping a revision. The code goes into `insert-mapping.go`:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	Client "../client-common"
)

func main() {
	// filter-branch gives us the original and the rewritten commit hash
	oldHash := os.Args[1]
	newHash := os.Args[2]

	rpcClient, err := Client.NewClient("localhost:9876", time.Millisecond*500)
	if err != nil {
		log.Fatal(err)
	}

	mappingItem := Client.MappingItem{
		Key:   oldHash,
		Value: newHash,
	}

	ack, err := rpcClient.InsertMapping(mappingItem)
	if err != nil || !ack {
		log.Fatal(err)
	}

	// The commit filter must print the final commit hash
	fmt.Println(newHash)
}
```

The query client receives the commit message of each commit on its standard input, checks whether it contains `r` revision references, and queries the server for the matching hashes. It goes into `query-mapping.go`:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"regexp"
	"strings"
	"time"

	client "../client-common"
)

func main() {
	// Read the whole commit message from stdin, not just the first line
	raw, _ := ioutil.ReadAll(os.Stdin)
	text := string(raw)

	re := regexp.MustCompile(`\Wr[0-9]+`)
	matches := re.FindAllString(text, -1)

	if matches == nil {
		fmt.Print(text)
		return
	}

	rpcClient, err := client.NewClient("localhost:9876", time.Millisecond*500)
	if err != nil {
		log.Fatal(err)
	}

	var bulkQuery client.BulkQuery

	for i := range matches {
		// Skip matches whose separator is a dash; they are likely part of a
		// name, not a revision reference
		if matches[i][0] != '-' {
			// Strip the non-word character captured before the revision
			key := matches[i][1:]
			bulkQuery = append(bulkQuery, client.MappingItem{Key: key})
		}
	}

	gitCommits, _ := rpcClient.GetMapping(bulkQuery)

	for i := range gitCommits {
		gitCommit := gitCommits[i].Value
		key := gitCommits[i].Key

		text = strings.Replace(text, key, gitCommit, 1)
	}

	fmt.Print(text)
}
```

For this step, we first need to compile and execute the Java program. Once it has succeeded in creating the database, we compile the Go binaries and start the server in the background.

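The exact build commands depend on your layout; under the structure above, a plausible sequence for the Go parts is the following sketch, assuming the Java program has already produced `~/mapping-db`:

```sh
# Build the two clients and the server, then keep the server running
$ cd ~/migration-tools/go-gitcommit
$ go build -o client-insert/client-insert ./client-insert
$ go build -o client-query/client-query ./client-query
$ go build -o server/server ./server
$ ./server/server &
```
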
Then, we can launch `git filter-branch` on our repository to rewrite the history:

```sh
$ git filter-branch \
    --commit-filter 'NEW=`git_commit_non_empty_tree "$@"`; \
        ${HOME}/migration-tools/go-gitcommit/client-insert/client-insert $GIT_COMMIT $NEW' \
    --msg-filter "${HOME}/migration-tools/go-gitcommit/client-query/client-query" \
    -- --all --author-date-order
```

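With the server running, the message rewrite can also be checked in isolation by piping a sample message through the query client; the hash in the output is illustrative:

```sh
$ echo "Fix the regression introduced in r1234" | ./client-query/client-query
Fix the regression introduced in 89abcdef0123(r1234)
```
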
As after each step, we take a snapshot, even though this should be the last step that cannot be repeated easily.

```
btrfs subvolume snapshot -r repositories repositories/snap-3-mapping
```

We now clean the repository, which at this point still contains a lot of now-unused blobs, branches and commits.

```sh
$ git reflog expire --expire=now --all
$ git prune --expire=now --progress
$ git repack -adf --window-memory=512m
```

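To measure what the cleanup gained, `git count-objects` reports the on-disk size of the repacked repository:

```sh
# Show object counts and the human-readable size of the packs
$ git count-objects -v -H
```
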
We now have a repository that should be more or less clean. You will still have to check the history, the size of the blobs and whether some branches can be deleted before pushing it to your server.