---
title: Subversion migration to Git
date: 2017-05-12 18:30:00
---
Some time ago I was tasked with migrating our Subversion repositories to Git. This article was only written
recently because, well, I had forgotten about the notes I had taken during the migration and only stumbled
upon them again a short while ago.

Our largest repository was about 500 GB and contained a little more than 50,000 commits. The goal was to
recover the svn history into git, keeping as much information as possible about the commits and the links
between them, and keeping the branches. Over the years, a number of periodic database dumps had been
committed; they now weighed down the repository without serving any purpose. There were also a number of
branches that had never been used and contained nothing of interest.

The decision was also taken to split some of the tools into their own repositories instead of keeping them
in the same repository, cleaning up the main repository so that it only contains the main project and its
related sources.
## Principles
* After some experiments, I decided to use svn2git, a tool used by KDE for their migration. It has the
  advantage of taking a rule file that allows splitting a repository by svn path, processing tags and
  branches and transforming them, ignoring other paths, ...
* As the import of such a large repository is slow, I decided to mount a btrfs partition so that each
  step could be snapshotted, letting me test the next step without any fear of having to start again
  from the beginning.
* Some binary files were added to the svn history and it made sense to keep them. I decided to migrate
  them to git-lfs to reduce the history size without losing them completely.
* A lot of commit messages contain references to other commits; I wanted to process these messages and
  transform each reference to an svn revision (`r1234`) into a git hash so that tools can create a link
  automatically.
## Tools
The first tool to retrieve is [svn2git](https://github.com/svn-all-fast-export/svn2git).
Compilation should be easy: first install the dependencies, then build.
```
$ git clone https://github.com/svn-all-fast-export/svn2git.git
$ sudo apt install libqt4-dev libapr1-dev libsvn-dev
$ cd svn2git
$ qmake .
$ make
```
Once the tool is compiled, we can prepare the btrfs mount in which we will run the migration steps.
```
$ mkdir repositories
$ truncate -s 300G repositories.btrfs
$ sudo mkfs.btrfs repositories.btrfs
$ sudo mount repositories.btrfs repositories
$ sudo chown 1000:1000 repositories
```
We will also write a few small tools in Go to process the commit messages, so we install the Go toolchain as well.
```
sudo apt install golang
```
We will also need `bfg`, a git cleansing tool. You can download the jar
file on the [BFG Repo-Cleaner website](https://rtyley.github.io/bfg-repo-cleaner/).
## First steps
The first step of the migration is to retrieve the svn repository itself onto the local machine. This is not a
checkout of the repository: we need the server's repository folder directly, with the whole history and metadata.
```
rsync -avz --progress sshuser@svn.myserver.com:/srv/svn_myrepository/ .
```
In this case I had SSH access to the server, allowing me to simply rsync the repository. Doing so also allowed
me to prepare the migration in advance, copying only the new commits on each synchronisation instead of the
whole repository with its large history. Most of the repository files are never updated, so this step is
only slow on the first execution.
### User mapping
We first create a mapping file that maps the svn users to git users. A user in svn is a simple username,
whereas in git it is a name and an email address.
To get the list of user accounts, we can use the svn command directly on the local repository like this:
```
svn log file:///home/tsc/svn_myrepository \
| egrep '^r.*lines?$' \
| awk -F'|' '{print $2;}' \
| sort \
| uniq
```
This will return the list of users appearing in the logs. For each of these users, you should create a line
in a mapping file, like so:
```
auser Albert User <albert.user@example.com>
aperson Anaelle Personn <anaelle.personn@example.com>
```
This file will be given as input to `svn2git` and must be complete, otherwise the import will fail.
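A quick sanity check before launching the hours-long import can save a lot of time. The following sketch
assumes the user list from the command above was saved to `svn-users.txt` and that the mapping file is
called `accounts-map.txt` (both names are just examples); any output means an account is still missing:
```sh
# Users present in svn but absent from the mapping file (hypothetical file names).
$ comm -23 <(tr -d ' ' < svn-users.txt | sort) <(awk '{print $1}' accounts-map.txt | sort)
```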
### Path mapping
The second mapping for the svn to git migration of a repository is the svn2git rules file. This file tells
the program what goes where. In our case, the repository was not strictly adhering to the standard svn tree:
it contained a trunk, tags and branches structure as well as some other folders for "out-of-branch" projects.
```txt
# We create the main repository
create repository svn_myrepository
end repository

# We create repositories for external tools that will move
# to their own repositories
create repository aproject
end repository

create repository bproject
end repository

create repository cproject
end repository

# We declare a variable to ease the declaration of the
# migration rules further down
declare PROJECTS=aproject|bproject|cproject

# We create repositories for out-of-branch folders
# that will migrate to their own repositories
create repository aoutofbranch
end repository

create repository boutofbranch
end repository

# We always ignore database dumps wherever they are.
# In our case, the database dumps are named "database-dump-20100112"
# or something close to that.
match /.*/database([_-][^/]+)?[-_](dump|oracle|mysql)[^/]+
end match

# There are also dumps stored in their own folder
match /.*/database/backup(/old)?/.*(.zip|.sql|.lzma)
end match

# At some point the build results were also added to the history; we want
# to ignore them
match /.*/(build|dist|cache)/
end match

# We process our external tools only on the master branch.
# We use the previously declared variable to reduce repetition
# and use the pattern match to move each tool to the correct repository.
match /trunk/(tools/)?(${PROJECTS})/
    repository \2
    branch master
end match

# And we ignore them if they are on tags or branches
match /.*/(tools/)?${PROJECTS}/
end match

# We start processing our main project after r10, as the
# first commits were missing the trunk and moved the branches, trunk and tags
# folders around.
match /trunk/
    min revision 10
    repository svn_myrepository
    branch master
end match

# There are branches that are organized hierarchically.
# Such cases have to be configured explicitly.
match /branches/(old|dev|customers)/([^/]+)/
    repository svn_myrepository
    branch \1/\2
end match

# Other branches are, as expected, directly in the branches folder.
match /branches/([^/]+)/
    repository svn_myrepository
    branch \1
end match

# The tags were used in a strange fashion before r2500,
# so we ignore everything before that refactoring
match /tags/([^/]+)/
    max revision 2500
end match

# After that, we create a branch for each tag, as the svn tags
# were not used correctly and were committed to. We just name
# them differently and will process them afterwards.
match /tags/([^/]+)/([^/]+)/
    min revision 2500
    repository svn_myrepository
    branch \1-\2
end match

# Our out-of-branch folders are processed directly, only creating
# a master branch.
match /aoutofbranch/
    repository aoutofbranch
    branch master
end match

match /boutofbranch/
    repository boutofbranch
    branch master
end match

# Everything else is discarded and ignored
match /
end match
```
This file will quickly grow with the number of migration operations that you want to perform. Ignore files
here, at import time, whenever possible: it reduces both the migration time and the amount of post-processing
that has to be done afterwards. In my case, a number of files were too complex to match during the migration
or were only spotted afterwards, and had to be cleaned in a second pass with other tools.
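Before writing the rules, it helps to have the actual layout of the repository in front of you. Plain `svn`
commands against the local copy give a quick overview of the top-level folders and of the existing branch
and tag names:
```sh
# Quick look at the repository layout to base the rules on.
$ svn ls file:///home/tsc/svn_myrepository
$ svn ls file:///home/tsc/svn_myrepository/branches
$ svn ls file:///home/tsc/svn_myrepository/tags
```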
### Migration
This step will take a lot of time as it will read the whole svn history, process the declared rules and generate
the git repositories and every commit.
```
$ cd repositories
$ ~/workspace/svn2git/svn-all-fast-export \
--add-metadata \
--svn-branches \
--identity-map ~/workspace/migration-tools/accounts-map.txt \
--rules ~/workspace/migration-tools/svnfast.rules \
--commit-interval 2000 \
--stat \
/home/tsc/svn_myrepository
```
If there is a crash during this step, it usually means that you are missing an account in your mapping file,
that one of your rules is emitting an erroneous branch or repository, or that no rule matches a given path.
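The last revision mentioned in the output is usually the offending one; inspecting that revision directly in
svn shows its author and the paths it touches, which makes it easier to see which account or rule is missing
(the revision number below is only an example):
```sh
# Show the author and the paths touched by a problematic revision.
$ svn log -v -r 12345 file:///home/tsc/svn_myrepository
```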
Once this step has finished, I like to take a btrfs snapshot so that I can return to this point while putting
the next steps into place.
```
btrfs subvolume snapshot -r repositories repositories/snap-1-import
```
## Cleanup
The next phase is to clean up our import. There will always be a number of branches that are unused, named
incorrectly, contain only temporary files, or are so far from the standard naming that our rules could not
process them correctly.

We will simply delete or rename them using git.
```
$ cd svn_myrepository
$ git branch -D oldbranch-0.3.1
$ git branch -D customer/backup_temp
$ git branch -m customer/stable_v1.0 stable-1.0
```
The goal at this step is to clean up the branches that will be kept after
the migration. We do this now to reduce the repository size early on and
thus reduce the time needed for the next steps.

If you see branches that can be deleted or renamed further down the road,
you can also remove or rename them at that point.
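To decide what is worth keeping, listing the branches together with the date of their last commit is a good
starting point, as stale customer or experiment branches stand out immediately:
```sh
# Branches sorted by the date of their last commit.
$ git for-each-ref --sort=committerdate \
    --format='%(committerdate:short) %(refname:short)' refs/heads/
```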
I like to take a snapshot at this stage as the next stage usually involves
a lot of tests and manually building a list of things to remove.
```
btrfs subvolume snapshot -r repositories repositories/snap-2a-cleanup
```
We can also remove files that should never have been added, by building a list of every file ever checked
into our new git repository, inspecting it manually, and adding the identifiers of the files to remove to
a new file:
```sh
$ git rev-list --objects --all > ./all-files
$ cat ./all-files | your-filter | cut -d' ' -f1 > ./to-delete-ids
$ java -jar ~/Downloads/bfg-1.12.15.jar --private --no-blob-protection --strip-blobs-with-ids ./to-delete-ids
```
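The `your-filter` part depends entirely on what was committed over the years; as a purely illustrative
example, selecting the blobs whose path ends in `.sql` or `.dmp` would look like this:
```sh
# Illustrative filter: select leftover database dumps by extension, keep the blob ids.
$ grep -E '\.(sql|dmp)$' ./all-files | cut -d' ' -f1 > ./to-delete-ids
```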
We will take a snapshot again, as the next step also involves checks and
tests.
```
btrfs subvolume snapshot -r repositories repositories/snap-2b-cleanup
```
Next, we will convert the binary files that we still want to keep in our
repository to Git-LFS. This lets git track only a small pointer to the file
in the history instead of storing the whole binary in the repository,
thus reducing the size of the clones.

BFG does this quickly and efficiently, removing every file matching the
given name from the history and storing it in Git-LFS. This step will
require some exploration of the previous `all-files` file to identify which
files need to be converted.
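A rough, purely illustrative way to spot candidates in that list is to filter by extension:
```sh
# Illustrative: list paths that look like archives to decide what goes to Git-LFS.
$ grep -E '\.(zip|ear|jar|war)$' ./all-files | less
```
The conversion itself then looks like this: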
```sh
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs 'my-important-archive*.zip'
$ java -jar ~/Downloads/bfg-1.12.15.jar --no-blob-protection --private --convert-to-git-lfs '*.ear'
```
After the cleanup, I also like to do a btrfs snapshot so that the history
rewrite step can be executed and tested multiple times.
```
btrfs subvolume snapshot -r repositories repositories/snap-2c-cleanup
```
### Linking an svn revision to a git commit
The svn2git log prints, for each revision, a line mapping it to a mark in the git marks file. In the git
repository, there is then a marks file that maps each mark to a commit hash. We can use these two pieces of
information to build a mapping database that stores the relation for later.

In our case, I wrote a Java program that parses both files and stores the resulting mapping into a LevelDB
database.

This database will then be used by a Go server that loads the mapping into memory and exposes an RPC
service that we will call from small Go binaries during a `git filter-branch` run. The Go server will also
need to keep track of the modifications to the git commit hashes as the history rewrite changes them.
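To make this concrete, the two inputs look roughly like the following (revision, mark and hash are made up;
the relative paths are the ones used by the Java tool below):
```sh
$ grep '^progress SVN' ../log-svn_myrepository | head -n 1
progress SVN r1234 branch master = :5678
$ head -n 1 marks-svn_myrepository
:5678 1f3a2b4c5d6e7f8091a2b3c4d5e6f708192a3b4c
```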
First, the Java tool to read the logs and generate the LevelDB database:
```java
import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;

public class CommitMapping {

    public static String FILE_LOG_IMPORT = "../log-svn_myrepository";
    public static String FILE_MARKS = "marks-svn_myrepository";
    public static String FILE_BFG_DIR = "../svn_myrepository.bfg-report";

    public static Pattern PATTERN_LOG = Pattern.compile("^progress SVN (r\\d+) branch .* = (:\\d+)");

    public static void main(String[] args) throws Exception {
        List<String> importLines = IOUtils.readLines(new FileReader(new File(FILE_LOG_IMPORT)));
        List<String> marksLines = IOUtils.readLines(new FileReader(new File(FILE_MARKS)));

        // Collect the object-id mapping files written by each BFG pass, oldest first.
        Collection<File> passFilesCol = FileUtils.listFiles(new File(FILE_BFG_DIR), new IOFileFilter() {
            @Override
            public boolean accept(File pathname, String name) {
                return name.equals("object-id-map.old-new.txt");
            }

            @Override
            public boolean accept(File path) {
                return this.accept(path, path.getName());
            }
        }, DirectoryFileFilter.DIRECTORY);
        List<File> passFiles = new ArrayList<>(passFilesCol);
        Collections.sort(passFiles, (File o1, File o2) -> o1.getParentFile().getName().compareTo(o2.getParentFile().getName()));

        Map<String, String> commitToIdentifier = new LinkedHashMap<>();
        Map<String, String> identifierToHash = new HashMap<>();

        // svn revision -> fast-import mark, taken from the svn2git log.
        for (String importLine : importLines) {
            Matcher marksMatch = PATTERN_LOG.matcher(importLine);
            if (marksMatch.find()) {
                String dest = marksMatch.group(2);
                if (dest == null || dest.length() == 0 || ":0".equals(dest)) continue;
                commitToIdentifier.put(marksMatch.group(1), dest);
            } else {
                System.err.println("Unknown line : " + importLine);
            }
        }

        File dbFile = new File(System.getenv("HOME") + "/mapping-db");
        File humanFile = new File(System.getenv("HOME") + "/mapping");
        FileUtils.deleteQuietly(dbFile);

        Options options = new Options();
        options.createIfMissing(true);
        DB db = Iq80DBFactory.factory.open(dbFile, options);

        // fast-import mark -> commit hash, taken from the git marks file.
        marksLines.stream().map((line) -> line.split("\\s", 2)).forEach((parts) -> identifierToHash.put(parts[0], parts[1]));

        BiMap<String, String> commitMapping = HashBiMap.create(commitToIdentifier.size());
        for (String commit : commitToIdentifier.keySet()) {
            String importId = commitToIdentifier.get(commit);
            String hash = identifierToHash.get(importId);
            if (hash == null) continue;
            commitMapping.put(commit, hash);
        }
        System.err.println("Got " + commitMapping.size() + " svn -> initial import entries.");

        // Each BFG pass rewrote some commits; follow the old -> new hashes it reported.
        for (File file : passFiles) {
            System.err.println("Processing file " + file.getAbsolutePath());
            List<String> bfgPass = IOUtils.readLines(new FileReader(file));
            Map<String, String> hashMapping = bfgPass.stream().map((line) -> line.split("\\s", 2)).collect(Collectors.toMap(parts -> parts[0], parts -> parts[1]));
            for (String hash : hashMapping.keySet()) {
                String rev = commitMapping.inverse().get(hash);
                if (rev != null) {
                    String newHash = hashMapping.get(hash);
                    System.err.println("Replacing r" + rev + ", was " + hash + ", is " + newHash);
                    commitMapping.replace(rev, newHash);
                }
            }
        }

        // Write a human-readable file and the LevelDB database used by the Go server.
        PrintStream fos = new PrintStream(humanFile);
        for (Map.Entry<String, String> entry : commitMapping.entrySet()) {
            String commit = entry.getKey();
            String target = entry.getValue();
            fos.println(commit + "\t" + target);
            db.put(Iq80DBFactory.bytes(commit), Iq80DBFactory.bytes(target));
        }
        db.close();
        fos.close();
    }
}
```
We will use RPC between a client and a server so that the LevelDB database can be kept open by a
long-running process, with very light clients querying it, as the clients are executed once per commit.
During my tests, opening the database proved really time consuming, hence this approach, even though the
server itself does very little.

The structure of our Go project is the following:
```txt
go-gitcommit/client-common:
rpc.go
go-gitcommit/client-insert:
insert-mapping.go
go-gitcommit/client-query:
query-mapping.go
go-gitcommit/server:
server.go
```
First, some plumbing for the RPC in `rpc.go`:
```go
package Client

import (
    "net"
    "net/rpc"
    "time"
)

type (
    // Client -
    Client struct {
        connection *rpc.Client
    }
    // MappingItem is the response from the cache or the item to insert into the cache
    MappingItem struct {
        Key   string
        Value string
    }
    // BulkQuery allows to mass query the DB in one go.
    BulkQuery []MappingItem
)

// NewClient -
func NewClient(dsn string, timeout time.Duration) (*Client, error) {
    connection, err := net.DialTimeout("tcp", dsn, timeout)
    if err != nil {
        return nil, err
    }
    return &Client{connection: rpc.NewClient(connection)}, nil
}

// InsertMapping -
func (c *Client) InsertMapping(item MappingItem) (bool, error) {
    var ack bool
    err := c.connection.Call("RPC.InsertMapping", item, &ack)
    return ack, err
}

// GetMapping -
func (c *Client) GetMapping(bulk BulkQuery) (BulkQuery, error) {
    var bulkResponse BulkQuery
    err := c.connection.Call("RPC.GetMapping", bulk, &bulkResponse)
    return bulkResponse, err
}
```
Next, the Go server that reads this database, in `server.go`:
```go
package main

import (
    "fmt"
    "log"
    "net"
    "net/rpc"
    "os"
    "time"

    "github.com/syndtr/goleveldb/leveldb"

    Client "../client-common"
)

var (
    cacheDBPath = os.Getenv("HOME") + "/mapping-db"
    cacheDB     *leveldb.DB
    flowMap     map[string]string
    f           *os.File
    g           *os.File
)

type (
    // RPC is the base class of our RPC system
    RPC struct {
    }
)

func main() {
    var cacheDBerr error
    cacheDB, cacheDBerr = leveldb.OpenFile(cacheDBPath, nil)
    if cacheDBerr != nil {
        fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
        log.Fatal(cacheDBerr)
    }
    roErr := cacheDB.SetReadOnly()
    if roErr != nil {
        fmt.Fprintln(os.Stderr, "Unable to initialize the LevelDB cache.")
        log.Fatal(roErr)
    }
    flowMap = make(map[string]string)
    f, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.log")
    defer f.Close()
    g, _ = os.Create(os.Getenv("HOME") + "/go-server/gomapping.ins")
    defer g.Close()
    rpc.Register(NewRPC())
    l, e := net.Listen("tcp", ":9876")
    if e != nil {
        log.Fatal("listen error:", e)
    }
    go flushLog()
    rpc.Accept(l)
}

func flushLog() {
    for {
        time.Sleep(100 * time.Millisecond)
        f.Sync()
    }
}

// NewRPC -
func NewRPC() *RPC {
    return &RPC{}
}

// InsertMapping -
func (r *RPC) InsertMapping(mappingItem Client.MappingItem, ack *bool) error {
    old := mappingItem.Key
    new := mappingItem.Value
    flowMap[old] = new
    g.WriteString(fmt.Sprintf("Inserted mapping %s -> %s\n", old, new))
    *ack = true
    return nil
}

// GetMapping -
func (r *RPC) GetMapping(bulkQuery Client.BulkQuery, resp *Client.BulkQuery) error {
    for i := range bulkQuery {
        key := bulkQuery[i].Key
        response, _ := cacheDB.Get([]byte(key), nil)
        gitCommit := key
        if response != nil {
            responseStr := string(response[:])
            responseUpdated := flowMap[responseStr]
            if responseUpdated != "" {
                gitCommit = string(responseUpdated[:])[:12] + "(" + key + ")"
                f.WriteString(fmt.Sprintf("Response to mapping %s -> %s\n", bulkQuery[i].Key, gitCommit))
            } else {
                f.WriteString(fmt.Sprintf("No git mapping for entry %s\n", responseStr))
            }
        } else {
            f.WriteString(fmt.Sprintf("Unknown revision %s\n", key))
        }
        bulkQuery[i].Value = gitCommit
    }
    *resp = bulkQuery
    return nil
}
```
And finally our clients. The insert client will be called from `git filter-branch`
with the previous and the rewritten commit hashes after each commit is processed. We
send this information to the server so that the revision mapping points at the
rewritten hashes. The code goes into `insert-mapping.go`:
```go
package main

import (
    "fmt"
    "log"
    "os"
    "time"

    Client "../client-common"
)

func main() {
    old := os.Args[1]
    new := os.Args[2]
    rpcClient, err := Client.NewClient("localhost:9876", time.Millisecond*500)
    if err != nil {
        log.Fatal(err)
    }
    mappingItem := Client.MappingItem{
        Key:   old,
        Value: new,
    }
    ack, err := rpcClient.InsertMapping(mappingItem)
    if err != nil || !ack {
        log.Fatal(err)
    }
    fmt.Println(new)
}
```
The query client receives the commit message of each commit, checks whether it
contains a reference to an svn revision (`r1234`), and queries the server for the
corresponding git hash. It goes into `query-mapping.go`:
```go
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "regexp"
    "strings"
    "time"

    client "../client-common"
)

func main() {
    // Read the whole commit message from stdin; it can span several lines.
    input, _ := ioutil.ReadAll(os.Stdin)
    text := string(input)
    re := regexp.MustCompile(`\Wr[0-9]+`)
    matches := re.FindAllString(text, -1)
    if matches == nil {
        fmt.Print(text)
        return
    }
    rpcClient, err := client.NewClient("localhost:9876", time.Millisecond*500)
    if err != nil {
        log.Fatal(err)
    }
    var bulkQuery client.BulkQuery
    for i := range matches {
        if matches[i][0] != '-' {
            key := matches[i][1:]
            bulkQuery = append(bulkQuery, client.MappingItem{Key: key})
        }
    }
    gitCommits, _ := rpcClient.GetMapping(bulkQuery)
    for i := range gitCommits {
        gitCommit := gitCommits[i].Value
        key := gitCommits[i].Key
        text = strings.Replace(text, key, gitCommit, 1)
    }
    fmt.Print(text)
}
```
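During the rewrite, once the server described below is running and has learned the rewritten hashes, the
effect on a commit message looks like this (the hash is of course made up):
```sh
$ echo "Fix the regression introduced in r12345" | ./client-query
Fix the regression introduced in 1f3a2b4c5d6e(r12345)
```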
For this step, we first need to compile and execute the Java program.
Once it has created the mapping database, we compile the Go tools and start
the server in the background.
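The exact commands depend on your Go setup; with the layout above, something along these lines builds the
two clients and starts the server in the background (the paths match those used in the `filter-branch`
call below):
```sh
$ cd ~/migration-tools/go-gitcommit
$ (cd server && go build -o server)
$ (cd client-insert && go build -o client-insert)
$ (cd client-query && go build -o client-query)
$ ./server/server &
```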
Then, we can launch `git filter-branch` on our repository to rewrite the
history:
```sh
$ git filter-branch \
--commit-filter 'NEW=`git_commit_non_empty_tree "$@"`; \
${HOME}/migration-tools/go-gitcommit/client-insert/client-insert $GIT_COMMIT $NEW' \
--msg-filter "${HOME}/migration-tools/go-gitcommit/client-query/client-query" \
-- --all --author-date-order
```
As after each step, we take a snapshot, even though this should be the
last step that cannot easily be repeated.
```
btrfs subvolume snapshot -r repositories repositories/snap-3-mapping
```
We now clean up the repository, which at this point still contains a lot of
unused blobs, branches, commits, ...
```sh
$ git reflog expire --expire=now --all
$ git prune --expire=now --progress
$ git repack -adf --window-memory=512m
```
We now have a repository that should be more or less clean. You will still have
to check the history, the size of the remaining blobs and whether some branches
can be deleted before pushing it to your server.
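A couple of quick checks give a good idea of whether the cleanup paid off; the first prints the packed size
of the repository, the second lists the largest blobs still present in the history:
```sh
# Packed size of the repository.
$ git count-objects -vH
# The ten largest blobs still present in the history.
$ git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '/^blob/ {print $3, $4}' \
    | sort -n \
    | tail -n 10
```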