July | 2011 | Cycles

Versioning multiple versions of data in relational databases

July 28, 2011 § Leave a comment

Introduction

At MusicBrainz, we have quite an interesting approach to letting users edit our data. Because our main product is our database, we care deeply about having the highest quality data we can. To ensure that we get high quality data, we ask that most changes have to be voted on by the community before the changes are actually reflected in the database. It’s a fairly simple process, but it catches all sorts of problems – from typos, to potentially disastrous modifications.

It’s also an interesting technical challenge. In this article, I’m going to explain my brief research into a system for versioning data in a relational database, with the ability to maintain full referential integrity and past revisions. Before we start looking at my solution, lets analyze the requirements first.

We need…

A system where we can defer changes – recording them, and applying them later.
A system which allows multiple changes to be outstanding. We have an active editor community, and it’s important that the system does not stop them from entering edits
A system which allows meta-data to be attached to a change – most importantly the ability to have a discussion about the change.

There are a few ways you can get to these goals. Our current solution is to encode changes into a single object, which we refer to as “edits.” An edit is a container of all the data needed to apply it, and we serialize this as JSON at the moment. It’s a horrible model. With this model, you lose the ability to query inside edits (without going down the path of writing a JSON parser/indexing support for PostgreSQL), you lose referential integrity, and it’s painful to work with.

I’ve spent a good deal of time trying to find a better solution to this, and I think I’ve got somewhere close now. The beauty is, I didn’t invent it – Linus Torvalds did. It’s Git! As I spent time designing a better system for edits, I wanted something which was essentially a persistent graph – this allows us to maintain past versions of data entirely by design, without any special tricks. Git is also a beautifully simple system – you just have objects, commits, trees and… wait that’s pretty much it. Unfortunately, relational theory doesn’t lend itself intuitively to persistent data structures, and it takes a step back to see how to get there.

Example

Lets look at an example of versioning some data. In this example, we’ll take a very simple schema with “artists” and “cds” (MusicBrainz folks are probably going to flame me!).

CREATE TABLE artist (
    id SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE cd (
    id SERIAL NOT NULL PRIMARY KEY,
    artist INT NOT NULL REFERENCES artist (id),
    name TEXT NOT NULL
);

However, to get this into a persistent graph, we have to understand that the above is simply the view that we want but it’s not enough for underlying storage. We need a little bit of extra information – a way to identify the version of an artist or cd, and we have to expand the one-to-many relationship between CDs and artists into a many-to-many relationship:

CREATE TABLE artist_version (
    id INT NOT NULL,
    version SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE cd_version (
    id INT NOT NULL,
    version INT NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE artist_version_cd (
    artist_version INT NOT NULL REFERENCES artist_version (version),
    cd_version INT NOT NULL REFERENCES cd_version (version)
);

Why do we need to expand the relationship to a many-to-many relationship? In order to have a persistent graph, we need to maintain links between values (artists and cds) as data changes. For example, renaming an artist shouldn’t change the CDs that the artist released. This means there are now 2 versions of an artist with the same set of CDs, hence many-to-many. Now, all we need to do is create a few views on top of this, and we can get back to the original view of the data:

CREATE VIEW artist AS
SELECT DISTINCT ON (artist.id) artist.id, artist.name FROM artist
ORDER BY version DESC;

-- Likewise for cd

Going further

What we have above is enough to version artists and cds, but it doesn’t meet all the criteria above – namely we don’t have that metadata support, nor do we have the ability to defer changes. A new version will immediately show up in our view, and that’s not what we want! So we need another level of indirection, and again – Git has this problem solved

When we introduce the concept of ‘branches’ into the system, things get much more interesting. To start with, all versioned data has a ‘master’ branch. The view above would use the master branch to determine which version to display. When we create new versions in the future, they get their own branch which means they dont cause the “official” view of data to be changed. Further more, a branch is something we can refer to in our schema – which means we can attach the necessary metadata for it. And to apply the change, all we need to do is merge it into master (the simplest way to do this is simply make master = new branch – a fast forward merge in Git terminology).

Summary

With just a little bit more indirection and an ample amount of motivation from Git, we can achieve a very powerful model of versioning data. It plays entirely to the strengths of a relational database, and as such performs well – in my test databases for this I’ve seen little overhead on a low performance laptop, so I imagine this is a trivial issue for larger servers (though I definitely acknowledge this area needs a little more research).

With a little thought, the approach is also very intuitive. There are really no complex concepts such as ‘edits’ or ‘acception’ and ‘rejection’ events – you simply create a new version of the data, and you merge it or you don’t. This isn’t to say you can’t layer more behavior on top, however. I can immediately see all sorts of potential for this to go a lot further:

By introducing the idea of ‘commits’ we can attach metadata to versions of data, and it’s now possible to discuss individual changes as part of a larger picture. For example, a user may add a new CD, but have made a typo on one track. There’s now a clear line of action – simply add a new commit that fixes the typo, and then this is all ready to be merged in one go
Because this is now all relational data, we can derive a fantastic amount of information from it. Want to know where a certain change came from (ala git-bisect)? That’s just a quick query away
Want to rollback changes? That’s no problem, you can simply move to a previous version of the data.

I haven’t entirely finished everything yet, but I have to say I’m extremely excited about where this is going.