Crowd Sourcing Organisation

November 29, 2011 § Leave a comment

There’s recently been a lot of work in MusicBrainz trying to clarify the direction of the project and rough out a schedule of how urgent various tasks are. I initially selected a few people who are extremely active in the community to help me sort through around 200 issues, with the hope that we could decide how important they were. After getting through 30 issues and scheduling ~25, it was clear that this was a fairly hopeless endeavor. Not only were we not going to get through everything, but part of me was also left uneasy about alienating the rest of the community.

Along with the scheduling discussions, there has been light talk about decentralizing our organisation (at least away from the IRC centralization we currently have) and introducing more democracy into the process. Sorting out this scheduling seems like a perfect candidate to try this out, but it needed organisation to get the results we wanted.

Last Friday, I began hacking away at mashing up a subset of issues from our bug tracker into a tiny little web app, which I fairly unimaginatively called “the Voting Game.” The idea is simple: present users with an issue at random and ask whether they think it has to be fixed within 3 months, within 12 months, or can remain unscheduled.

The results have consistently amazed me.

I threw the first version of the application online a few hours after I started working on it, and results came in thick and fast. At first I put this down to people just enjoying using something new and wanting to break it, but good data kept coming in after the initial announcement. I deemed the project a good idea and continued work on it, finally getting it officially online yesterday.

In less than 24 hours since I announced the launch, users have run into the problem of emptying the queue – they’ve managed to vote on every issue in the system! Not only that, some issues have feedback from 8 people, considerably outweighing the input of a small group of IRC regulars.

I’m sure to a lot of people this sounds obvious, but with a community as vocal as MusicBrainz, it can often become a burden to manage so much input. This project has shown me that if you give people the right tools, they will do the work for you, and the answers just fall into place.

The Scheduling Game is live on ocharles.org.uk, and the source code is available on GitHub.

Versioning multiple versions of data in relational databases

July 28, 2011 § Leave a comment

Introduction

At MusicBrainz, we have quite an interesting approach to letting users edit our data. Because our main product is our database, we care deeply about having the highest quality data we can. To ensure we get high quality data, we require that most changes be voted on by the community before they are actually reflected in the database. It’s a fairly simple process, but it catches all sorts of problems – from typos to potentially disastrous modifications.

It’s also an interesting technical challenge. In this article, I’m going to explain my brief research into a system for versioning data in a relational database, with the ability to maintain full referential integrity and past revisions. Before we start looking at my solution, let’s analyze the requirements first.

We need…

  • A system where we can defer changes – recording them, and applying them later.
  • A system which allows multiple changes to be outstanding. We have an active editor community, and it’s important that the system does not stop them from entering edits.
  • A system which allows meta-data to be attached to a change – most importantly the ability to have a discussion about the change.

There are a few ways to reach these goals. Our current solution is to encode changes into a single object, which we refer to as “edits.” An edit is a container of all the data needed to apply it, and we currently serialize this as JSON. It’s a horrible model: you lose the ability to query inside edits (without going down the path of writing a JSON parser/indexing support for PostgreSQL), you lose referential integrity, and it’s painful to work with.

I’ve spent a good deal of time trying to find a better solution to this, and I think I’ve got somewhere close now. The beauty is, I didn’t invent it – Linus Torvalds did. It’s Git! As I spent time designing a better system for edits, I wanted something that was essentially a persistent graph – this lets us maintain past versions of data entirely by design, without any special tricks. Git is also a beautifully simple system – you just have blobs, trees, commits and… wait, that’s pretty much it. Unfortunately, relational theory doesn’t lend itself intuitively to persistent data structures, and it takes a step back to see how to get there.

Example

Let’s look at an example of versioning some data. In this example, we’ll take a very simple schema with “artists” and “cds” (MusicBrainz folks are probably going to flame me!).

CREATE TABLE artist (
    id SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE cd (
    id SERIAL NOT NULL PRIMARY KEY,
    artist INT NOT NULL REFERENCES artist (id),
    name TEXT NOT NULL
);

However, to get this into a persistent graph, we have to understand that the above is simply the view we want – it’s not enough for the underlying storage. We need a little bit of extra information – a way to identify the version of an artist or CD – and we have to expand the one-to-many relationship between CDs and artists into a many-to-many relationship:

CREATE TABLE artist_version (
    id INT NOT NULL,
    version SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE cd_version (
    id INT NOT NULL,
    version SERIAL NOT NULL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE artist_version_cd (
    artist_version INT NOT NULL REFERENCES artist_version (version),
    cd_version INT NOT NULL REFERENCES cd_version (version)
);

Why do we need to expand the relationship to a many-to-many relationship? In order to have a persistent graph, we need to maintain links between values (artists and cds) as data changes. For example, renaming an artist shouldn’t change the CDs that the artist released. This means there are now 2 versions of an artist with the same set of CDs, hence many-to-many. Now, all we need to do is create a few views on top of this, and we can get back to the original view of the data:

CREATE VIEW artist AS
SELECT DISTINCT ON (id) id, name FROM artist_version
ORDER BY id, version DESC;

-- Likewise for cd

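To make the mechanics concrete, here’s a hedged sketch of what a single edit – renaming an artist – might look like in this model (the IDs and version numbers are purely illustrative): we insert a new artist_version row and carry the existing CD links across to it, leaving the old version untouched.

-- Suppose artist 1 currently exists as version 1; the rename creates version 2
INSERT INTO artist_version (id, name)
VALUES (1, 'New artist name');

-- The new version releases exactly the same CDs as the old one
INSERT INTO artist_version_cd (artist_version, cd_version)
SELECT 2, cd_version
FROM artist_version_cd
WHERE artist_version = 1;
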
Going further

What we have above is enough to version artists and CDs, but it doesn’t meet all the criteria above – namely, we don’t have metadata support, nor the ability to defer changes. A new version will immediately show up in our view, and that’s not what we want! So we need another level of indirection, and again – Git has this problem solved.

When we introduce the concept of ‘branches’ into the system, things get much more interesting. To start with, all versioned data has a ‘master’ branch. The view above would use the master branch to determine which version to display. When we create new versions in the future, they get their own branch, which means they don’t cause the “official” view of data to change. Furthermore, a branch is something we can refer to in our schema – which means we can attach the necessary metadata to it. And to apply the change, all we need to do is merge it into master (the simplest way being to point master at the new branch – a fast-forward merge in Git terminology).
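
None of the following is the actual MusicBrainz schema, but as a sketch of the idea, the master pointer could be as simple as a branch table that the public view follows; a fast-forward merge is then just an UPDATE of that pointer:

CREATE TABLE artist_branch (
    artist_id INT NOT NULL,
    name TEXT NOT NULL,        -- 'master', or a per-edit branch name
    version INT NOT NULL REFERENCES artist_version (version),
    PRIMARY KEY (artist_id, name)
);

CREATE OR REPLACE VIEW artist AS
SELECT artist_version.id, artist_version.name
FROM artist_branch
JOIN artist_version ON artist_version.version = artist_branch.version
WHERE artist_branch.name = 'master';

-- Applying an edit: fast-forward master to the edit's version
UPDATE artist_branch SET version = 2
WHERE artist_id = 1 AND name = 'master';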

Summary

With just a little bit more indirection and an ample amount of motivation from Git, we can achieve a very powerful model of versioning data. It plays entirely to the strengths of a relational database, and as such performs well – in my test databases I’ve seen little overhead on a low-performance laptop, so I imagine this is a trivial issue for larger servers (though I definitely acknowledge this area needs a little more research).

With a little thought, the approach is also very intuitive. There are really no complex concepts such as ‘edits’ or ‘acceptance’ and ‘rejection’ events – you simply create a new version of the data, and you merge it or you don’t. This isn’t to say you can’t layer more behavior on top, however. I can immediately see all sorts of potential for this to go a lot further:

  • By introducing the idea of ‘commits’ we can attach metadata to versions of data, and it’s now possible to discuss individual changes as part of a larger picture. For example, a user may add a new CD, but have made a typo on one track. There’s now a clear line of action – simply add a new commit that fixes the typo, and then this is all ready to be merged in one go.
  • Because this is now all relational data, we can derive a fantastic amount of information from it. Want to know where a certain change came from (à la git-bisect)? That’s just a quick query away.
  • Want to rollback changes? That’s no problem, you can simply move to a previous version of the data.

I haven’t entirely finished everything yet, but I have to say I’m extremely excited about where this is going.

Merry Christmas everyone!

December 25, 2010 § Leave a comment

While I’m not a Christian, it’s still a nice few days to relax and enjoy. I hope you all have a nice time – get stuffing yourself with good food! I have some interesting projects to blog about soon including using the Perl compiler for syntax highlighting, some book reviews and other bits and pieces. I’ll be back to blogging shortly!

Sorting subjective media competitively

December 15, 2010 § Leave a comment

I own a lot of music, but I find it difficult to find what I want to play. Being a DJ, I’ve developed a pretty good memory of what I like to listen to and what goes together to create a consistent mood, but I’m also a computer programmer, so I like to make the computer work for me as much as possible.

Back when I was a gamer, I spent some time looking at applying sports statistics to the games I played – in particular, I spent some time working with the Elo rating system. It worked great, and the results I got mostly matched the community’s opinion of who the best teams were. However, what really struck me was the simplicity of the Elo rating functions – you just give it two scores, tell it who “won”, and you get two new scores back. There is little more to it, and it leaves a lot up to the user as to what the score means and what a victory or defeat is. This has left a nagging curiosity in my mind – if we can do this for sorting games, can we apply it to subjective media?
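
Since so much rests on that simplicity, here is a minimal sketch of the standard Elo update in Perl – the function name and the K-factor of 32 are my own choices for illustration, not anything taken from an existing system:

use strict;
use warnings;

# Given two ratings and the outcome for A (1 = win, 0.5 = draw, 0 = loss),
# return the two updated ratings.
sub elo_update {
    my ($rating_a, $rating_b, $outcome_a) = @_;
    my $k = 32;
    my $expected_a = 1 / (1 + 10 ** (($rating_b - $rating_a) / 400));
    my $new_a = $rating_a + $k * ($outcome_a - $expected_a);
    my $new_b = $rating_b + $k * ((1 - $outcome_a) - (1 - $expected_a));
    return ($new_a, $new_b);
}

# "The previous track was better" is a win for the previous track:
my ($previous, $recent) = elo_update(1500, 1500, 1);    # (1516, 1484)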

I’m currently in the process of planning how I want to approach this problem, but allow me to quickly summarize my thoughts so far.

At the beginning of the process, all the music belongs to a single set. We then sample this set (I haven’t decided if this should be done randomly or with heuristics yet), playing each piece of music one at a time. Once the user has heard 2 pieces of music, they need to decide which they liked more. Here are my current options:

  • The previous track was better
  • The recent track was better
  • I enjoyed them equally
  • I couldn’t compare them

The first 3 options map to a victory, a loss and a draw, respectively. The last one is a little different – if I declare that 2 pieces of music could not really be compared, then a new set is created (X’). We also make a note that the sets X and X’ cannot be compared.

Now, as this system goes on, we will hopefully build up a set of sets, and we’ll also define a set of relationships between these sets.

I’m not really sure what the results of this will be, or how useful they will be. My motivation is to create some sort of equivalent of the Music Genome Project, but I’m not sure it can really get that far. However, this should at least end up giving me a set of genre collections, sorted by rating as well.

Now, just to implement this and get some data in!

A Journey of Optimizing Perl

November 28, 2010 § 4 Comments

Perl suffers from a lot of ill-informed preconceptions in the programming community, and speed does seem to be one of them. There is an assumption that one must make a mutually exclusive choice between clean, readable code that runs slowly, and noisy code that performs all sorts of tricks to run efficiently. I’ve recently been in a position of having to do some fast work with Perl, so I am now sharing some of my experiences.

Context: Data transformations

In order to set the context for this article, allow me to touch on what I was working on. At MusicBrainz, we are moving from an old database schema to a much newer one. In the process, we need to migrate data from the old schema to the new one. For the most part, this can be done in SQL, but some of the data requires more intricate processing.

We store records called “edits” in our database, which represent some sort of change to data in the system – for example, adding data, editing data and removing data are all represented as an edit. Previously, edits consisted of an old value and a new value, in various formats. Some edits used a custom serialization format of a hashmap, others used single strings, and others used a scheme that made sense at the time but does not fit consistently into any common format.

The new schema changes some of this. Firstly, both new value and previous value have been combined into one (as it doesn’t make sense to have either of these for some edits), and the serialization format is always JSON. This alone is a transformation that would be difficult to achieve in SQL.

On top of this, the row IDs that this serialized data points to may no longer be valid. The new schema merges many rows together (segmenting some data out to other tables), so we need to alter these FKs appropriately.

The data set being migrated here is roughly 12 million rows, with 70 different types of rows – not a trivial data migration problem. To add further constraints, the migration process can only be run at the point of the schema change – when we launch the new version. As we don’t want downtime for the site during the upgrade, we put our database into a read-only state while we migrate, but we still need to minimize the time spent here – 4 hours in total is really at the edge of this limit.

So Make it FAST

Step 1: Prototype

I spent a fair amount of time considering the correct approach to migrate this data. Do I try to be clever and handle as much as possible in a single complex transformation? Do I operate on a distinct type at a time, and transform a stream? Do I operate on a row by row basis?

In order to make this decision, I decided to TIAS. I grabbed my toolkit of CPAN modules that make my life easier and threw some solutions together. In the end, the solution that felt most productive to work with and understand was operating on a row by row basis.

In this approach, I created Moose classes for each transformation. I would inflate a row hash (from DBD) into one of these classes, and call the upgrade method, which gave me a new object back (correctly migrated and ready for re-insertion).

Finally I ended up with a migration script that ran, but sadly not within the time frame we were constrained by. However, program correctness was my main concern at this point, so we were ready to continue to the next step.

Step 2: Understand your constraints

Before really diving in and understanding the flaws of the program, it’s important to understand your constraints. Where will your program be running? What hardware do you have available? What time frame/rate does your program need to run within? Is your program running in isolation, or do you need to consider load issues?

These are all important questions that will have an impact on the decisions you make later. The more you can ask, the better an idea of the problem domain you will have. In our case, we would be running the script on a machine with at least 4GB of RAM, and as the main process. These constraints are generous – giving us a lot of freedom to optimize.

Step 3: Identify the bottlenecks

We have all done it, but diving into a program and blindly rewriting code under the assumption it will run faster is an extremely unproductive exercise that rarely yields the goal of program optimization. Rather, the more productive task is to instrument the code to identify the critical points in a program – where the real time is spent.

There are many approaches to instrumentation, and it may be enough to instrument code yourself. We began by lightly instrumenting our master script with time differences using Time::HiRes, a high-resolution implementation of gettimeofday for Perl. Later, we needed finer granularity, so we used a combination of Devel::NYTProf and the included nytprofhtml to generate reports. I may well have more to say on using Devel::NYTProf in the future…
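
As an illustration of the coarse version, the instrumentation can be as small as this (the timed step is a stand-in, not the real migration code):

use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

# Stand-in for one phase of the migration script.
sub upgrade_chunk { select(undef, undef, undef, 0.25) }    # ~250ms of "work"

my $t0 = [gettimeofday];
upgrade_chunk();
printf "upgrade_chunk: %.3fs\n", tv_interval($t0);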

Step 4: Reducing the impact of bottlenecks

With a clear idea of the problem and the most significant bottlenecks, you are now ready to begin optimization. There is no hard and fast rule on the best way to remove bottlenecks, but here are some of the tricks we employed:

Trade space for time

This is the catchy motto of Memoize, a Perl module implementing automatic memoization of subroutines based on the input given. Memoization is the process of caching the result of a function call, keyed on its input, which allows constant-time calls to functions when they are called with repeated input. There are a few considerations for memoization:

Referential transparency

It does not make sense to memoize functions that are not referentially transparent. A referentially transparent function is a function in the purely mathematical sense – for the same input, you will always see the same output. This means functions that depend on outside state are not suitable for memoization – for example, subroutines that depend on network access, random values, and so on.

However, if you can guarantee a consistent world state /during the execution of your program/, then memoization may turn out to be a useful optimization – avoiding heavy disk I/O when you know the result will be the same.

Hit rate

Another consideration of memoization is one that will be familiar to those who have had to implement caching – the hit rate. Memoizing a function called 20,000 times sounds sensible at first, but when you discover it’s called with 20,000 distinct inputs, it becomes less sensible. All you achieve is more memory usage, and the cache is never hit. However, if it’s called with only 200 distinct inputs, then memoization is a strong contender for reducing the impact of this bottleneck.
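
As a hedged example of the good case: suppose the migration repeatedly maps old row IDs to new ones, and the mapping cannot change during the run. Memoizing that lookup is then a one-liner (the function here is hypothetical, not taken from the real script):

use strict;
use warnings;
use Memoize;

# Hypothetical lookup with a small, frequently repeated input domain.
# The mapping is fixed for the lifetime of the run, so caching is safe.
sub new_artist_id {
    my ($old_id) = @_;
    # ...expensive database lookup would go here...
    return $old_id;
}

memoize('new_artist_id');    # repeated calls with the same ID now hit the cache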

Approach the problem from a different angle

One of the first bottlenecks we encountered was how to get data into the script for the migration process. SELECTing the entire table was not an option, as this would use too much memory. The prototype SELECTed in small chunks, but to do this we had to order the entire table for each select and pass an offset – an expensive operation, and one that becomes more expensive as the offset increases.

The next approach we took was to dump the table into CSV files in chunks, and then read from these. Text::CSV_XS allowed us to read rows in with confidence (trusting the correctness of someone else’s code over our own) and drastically reduced the times. Reading a chunk from disk into memory almost felt like a no-op compared to the speed of our SQL query.

Finally, after more optimizations, this too became a bottleneck again, and with a little more thought we used a COPY statement to stream the table into the script. The important thing is that we were able to gradually reduce this bottleneck – we did only what was needed to make something else the most important optimization target.
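
A rough sketch of that final approach with DBD::Pg might look like the following – the DSN and table name are placeholders, and error handling is elided:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=musicbrainz', '', '', { RaiseError => 1 });

# Put the connection into COPY OUT mode and stream rows one at a time;
# pg_getcopydata returns -1 once the table has been fully streamed.
$dbh->do('COPY moderation TO STDOUT WITH CSV');
my $row = '';
while ($dbh->pg_getcopydata($row) >= 0) {
    # $row holds one CSV-encoded line, ready for Text::CSV_XS to parse
}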

Reduce logic where constraints are known

After reducing the larger bottlenecks, a simple function had become the bottleneck. This function decodes a string from an old encoding format into a Perl string. The initial implementation of the function, copied straight out of the legacy code base, was:

sub _decode_value
{
	my ($scheme, $data) = $_[1] =~ /\A\x1B(\w+);(.*)\z/s
		or return $_[1];
	return uri_unescape($data) if $scheme eq "URI";
	die "Unknown encoding scheme '$scheme'";
}

Our initial approach was to memoize this. However, the hit rate is fairly low, so this still didn’t truly help the problem. But looking at this code, it felt as simple as it could be – what were we missing?

The important realisation was that this code contains logic that can be hard-coded. Notice how there is a regular expression to check whether any encoding is present, and a further condition for the scheme itself? But there is only a single possible scheme! With this understanding, we can use this much more efficient code:

sub decode_value
{
    my $value = shift;
    return uri_unescape(substr($value, 5));
}

This code doesn’t scale well if the scheme changes, but we know this cannot happen – the data we are operating on has already been created (and as the data is immutable, we know our assumptions will hold).

Substitute for more efficient technologies

Once we had reached program correctness, profiling showed that a substantial amount of time was actually spent within Moose::new_object – constructing objects. Our objects themselves did not really use much of the power given by Moose, however – no need for type checks, meta-programming, or the expressiveness of attributes. In the end, we replaced all classes with much simpler objects that simply used Class::Accessor::Fast::XS. This sped up the bottleneck by an order of magnitude.
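
As a hedged sketch (the class name and fields are made up, not the real migration classes), the replacement looked roughly like this:

package Edit::Artist::EditName;    # hypothetical edit class
use base 'Class::Accessor::Fast::XS';

# Same mk_accessors interface as Class::Accessor::Fast, but with the
# constructor and accessors implemented in XS.
__PACKAGE__->mk_accessors(qw( artist_id old_name new_name ));

package main;
my $edit = Edit::Artist::EditName->new({ artist_id => 42, new_name => 'Aphex Twin' });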

Mistakes

This was my first foray into real optimization, and I definitely made a lot of mistakes along the way. Here are some that I think stand out the most:

A lack of understanding of bottlenecks

I spent a lot of time making assumptions about my code and where it was slow – after all, I wrote it. Sadly, this is the classic problem of programmer ego taking control. Only after watching my futile attempts result in a program that ran progressively slower did I step back and take a more scientific approach to the problem. I could have saved a lot of time if I had been willing to slow down and gain a true understanding of the problem first.

Overly ambitious optimizations

As we moved through the process of optimizing, our ideas became drastically overthought. The best example of this came once we had a working chunking system. Our system was able to migrate small chunks of data and appeared CPU-bound, so we concluded the best approach would be to multi-thread it and use some shared memory. This is a classic example of over-engineering a problem. It’s also a solution I don’t have enough experience with to implement reliably (not to mention that I don’t believe Perl is the right tool for concurrent programming).

In the end, a few hours of my time were spent failing to get this system working, when really it was a solution that did not directly address the bottleneck. Rather than working around a bottleneck, the better approach would have been to attack it head on.

Conclusion

Hopefully I’ve shown here that it is possible to write fast Perl code; we just have to be a little bit more careful in our planning before we approach the problem. Optimizing any code is a delicate operation – you’re really striving for a balance between all parts of the system, and this balance can easily be upset. This is not specific to Perl, and I imagine most parts of this article will already be familiar to people from other languages. That said, it’s certainly helped me to step back and work out exactly what I’m trying to achieve, even if that did take a few days more than I would have liked.

Implementing Factories in Perl

November 24, 2010 § 13 Comments

Factories are a useful construct, and even though their usage is common in Java and other heavily object-oriented languages, they don’t see as much use in more dynamic languages. This doesn’t mean you can’t use them, or that they don’t have their uses. In this article, I’m going to try to explain why and when we need to use factories, and how we can go about implementing them in Perl.

This article was inspired by a recent discussion I had with someone in #moose, but hopefully this longer write-up will be useful to some people, though understand that the concepts here are simple and I’m writing for a different audience than usual. With that introduction over, let’s get started!

What Are Factories

A factory is really just the name for something that creates objects. We could say new is a very specific factory (one that only creates objects of the same type as the class), but normally a factory performs a little more logic. We usually use factories when we need to create objects but don’t know what type of object until run time. We will work with the following example.

Our system has Flight objects, and there are different types of Flights. For now, let’s say we have Flight::Holidays and Flight::Cargos. Holiday flights take a set of passengers, and cargo flights take a set of cargo items. Our job is to take flight bookings and store them in a database somehow. As part of our solution to this problem, we decide that we will need to create the appropriate Flight object, which can then be stored.

package Flight;
use Moose::Role;

package Flight::Holiday;
use Moose;
with 'Flight';

has 'passengers' => ( is => 'ro' );

package Flight::Cargo;
use Moose;
with 'Flight';

has 'cargo' => ( is => 'ro' );

Simple so far, right? So where do factories come into play? The problem is that the external data we get doesn’t specify which type of flight we need to create! Let’s pretend we’re given a hash reference of parameters for new. We would like to inspect this to decide how to create objects. Rather than doing this every time we create a Flight, we should put this in a separate function:

sub new_flight {
    my ($class, $data) = @_;
    if (exists $data->{cargo}) {
        return Flight::Cargo->new($data);
    }
    elsif (exists $data->{passengers}) {
        return Flight::Holiday->new($data);
    }
    else {
        die "I don't know how to create this type of Flight";
    }
}

Nothing complicated here? Well, guess what… we just wrote a factory! Move this to a separate FlightFactory class, and we’re done. We can now create flights by calling FlightFactory->new_flight({ cargo => [] }) and we will get a Flight::Cargo back. Neat!

Going Further

This is great – we’ve already abstracted the object construction away – but we can do better. There is a problem with our current factory: it introduces multiple points of change. Our factory is also doing too much – why should FlightFactory care about what makes a Flight::Cargo? Surely that’s Flight::Cargo‘s job. Let’s address this issue first:

sub new_flight {
    my ($class, $data) = @_;
    if (Flight::Cargo->understands($data)) {
        return Flight::Cargo->new($data);
    }
    elsif (Flight::Holiday->understands($data)) {
        return Flight::Holiday->new($data);
    }
    else {
        die "I don't know how to create this type of Flight";
    }
}

And we add code like the following to Flight::Holiday and Flight::Cargo:

sub understands {
    my ($class, $data) = @_;
    return exists $data->{cargo};
}

Great! Now the logic for deciding which class to instantiate has been moved to the appropriate area of responsibility. But we still have the problem of multiple points of change. Let’s take a deeper look to see what the problem is.

Imagine our requirements change, and we’re now asked to handle Flight::Personals – people flying their own planes. What changes does this require? Well, we need a Flight::Personal class, that’s for sure:

package Flight::Personal;
use Moose;
with 'Flight';

sub understands {
    my ($class, $data) = @_;
    return exists $data->{owner};
}

This should be enough, but it’s not. If we pass { owner => 'Ollie' } to new_flight, we won’t get a Flight::Personal back, because the factory does not yet know about Flight::Personal. So let’s add it in:

sub new_flight {
    ...
    elsif (Flight::Personal->understands($data)) {
        return Flight::Personal->new($data);
    }
    ...
}

Wait a minute – I see an abstraction emerging here! We seem to be writing the same sort of code for each branch of our if statement; let’s see if we can do better here… maybe it will reveal a solution to the problem we’re investigating.

sub new_flight {
    my ($class, $data) = @_;

    my @classes = qw(
        Flight::Personal
        Flight::Holiday
        Flight::Cargo
    );
    for my $subclass (@classes) {
        return $subclass->new($data)
            if $subclass->understands($data);
    }

    die "I don't know how to create this type of flight";
}

Aha! Not only have we abstracted out some repetition and made it easier to change, we’ve reduced the effort needed to add a new type of Flight. We’re still not happy that the factory has to change at all, though – can you see how to achieve this yet? We need a way to set @classes dynamically. There are a few ways to do this, but I’ll show you a solution using Module::Pluggable:

package FlightFactory;
use Module::Pluggable search_path => 'Flight', sub_name => 'classes';

sub new_flight {
    my ($class, $data) = @_;

    for my $subclass ($class->classes) {
        # As before
    }
}

Module::Pluggable gives us the classes class method, which returns all classes under the Flight:: namespace. We probably want to be a bit more specific here and make sure we only get things that are concrete classes – checking that they do the Flight role would be a start. I’ll leave this to readers as an exercise.
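
If you want a head start on that exercise, here is one hedged way the filtering could look – it relies on DOES, which Moose hooks so that consuming a role makes DOES return true for it:

sub new_flight {
    my ($class, $data) = @_;

    # Only consider packages that consume the Flight role and know how
    # to answer understands().
    my @classes = grep { $_->DOES('Flight') && $_->can('understands') }
                  $class->classes;

    for my $subclass (@classes) {
        return $subclass->new($data)
            if $subclass->understands($data);
    }

    die "I don't know how to create this type of flight";
}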

Don’t reinvent the wheel

LeoNerd in the comments below has pointed out that the idiom of looping over classes to filter a specific one is what Module::PluginFinder was designed for. So, in the spirit of writing even better code, let’s try using that! Module::PluginFinder works like Module::Pluggable, but we can specify a filter for matching classes. It can also handle the instantiation for us:

package FlightFactory;
use Module::PluginFinder;

my $finder = Module::PluginFinder->new(
    search_path => 'Flight',
    filter => sub { 
        my ($class, $data) = @_;
        $class->understands($data)
    }
);

sub new_flight {
    my ($self, $data) = @_;

    my $flight = $finder->construct($data, $data)
        or die "I don't know how to create this type of Flight";
    return $flight;
}

Conclusion

Hopefully in this post I’ve given you a clear illustration of the need for factories, when we might want to use them, and how we can implement them. We went past basic factories to make them dynamic and extensible (even extensible outside the distribution). Along the way, I tried to illustrate the approach I would take doing this at work, which has hopefully also given you a good idea of how you can apply basic refactoring to your code as you write it, and end up with clean, separated code.

If you want to follow along with this tutorial, I have pushed a Git repository to my GitHub account. You can follow along by checking out the code, and then reading the log with patches, in reverse:

git clone git@github.com:ocharles/OCharles-Blog-Factories.git
git log --reverse -p

It’s a little different, as I wrote it after the article, but hopefully it’s useful. This is the first time I’ve tried posting accompanying code, so I’m curious to see how useful people find it. If you want to run Example.pm you will need Moose, Module::Pluggable, and a reasonably recent version of Perl (5.8 upwards should do the job).

Defect Driven Testing… and You’re the Driver

November 21, 2010 § 3 Comments

Recently at $work I’ve been doing, quite frankly, a poor job of maintaining one of my areas of responsibility. I am the developer of a script that deals with the migration of about 12 million rows of data from one data format to another. This work has become sufficiently complicated that it can’t really be done in SQL, and so has a set of Perl classes to do the migration. It’s a very critical piece of work right now, because it defines how long our downtime is, and we only get to run this script once – if it’s wrong, then I may have corrupted a decade’s worth of historic data (effectively an audit trail). This is Not A Good Thing.

As you can see, I’m aware of these problems, and have made progress in increasing the speed of the script, along with its quality. What I’m not so happy to tell you is that I’ve also managed to commit this script in various states of brokenness – from syntax errors and start-up runtime errors, to errors that happen at the end of the script – something you still end up waiting a good hour to find out about. I’m a perfectionist, and no doubt this cycle of “hack-break-hack-break-hack-hey it works!-hack-break-hack-fix” is getting extremely tedious, and doesn’t feel at all representative of my actual ability. Finally, that little moment of enlightenment has come and it’s time to sort this out. My solution? <buzzword>developer driven defect driven development</buzzword> (someone please make me a manager).

Defect Driven Development

Before we get on to talking about my revolutionary testing approach which will probably cure world hunger, we need to discuss the concept of Defect Driven Development (DDD, for now). DDD is a testing methodology whereby you ensure you have a test for every bug. Dead simple. Here’s the basic methodology: you work on a great new feature and release it. A user finds a bug in your feature and reports it. You, the developer, write a test and then fix the bug. Now the bug will never occur again (theory here, work with me…).

Now here’s the twist:

You Are the User

While developing, consider yourself the user as well. So you hack away, and something breaks. At this point, you don’t just fix the break – you write a test to reproduce it. And I mean everywhere. Even down to the smallest details, no matter where the bug is. Syntax error? Reproduce it in a test1. A function crashes with some input? Reproduce it in a test. The world is beginning to implode? Reprod… ok, it won’t really help here.

I doubt that I’ve really made much of an impact with this post, but that’s my point. This is a very simple way to get tests into your codebase, and it has some excellent consequences when you actually get to the end of your feature. Every new feature or enhancement really needs tests to go with it, but tests are hard, and testing is boring. I’ve written unit tests before that exercise the basics, but finding the edge cases that do break the system is difficult. I don’t think my tests often help me develop my feature; they are more useful later as I start to integrate tests for bug fixes. But with this new approach, you automatically do get some tests for the places things have broken in the past – and I believe that if it’s broken before, it’ll probably break again, unless you’re taking steps to prevent that.


1 While yes, the code wouldn’t actually run, in Perl we can still have a test to make sure something compiles. use_ok in the Test::More distribution can do this, for example.
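
As a tiny illustration (the module name here is hypothetical), a compile-only test is just:

use Test::More tests => 1;

BEGIN { use_ok('MusicBrainz::DataMigration') }    # fails if the module doesn't compile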

Introducing Magpie – flexible test doubles & mocking for Perl

November 19, 2010 § 3 Comments

Introduction

Magpie is a new distribution I have just released which brings the power of test doubles to Perl. There are already a few solutions for test doubles in Perl, but Magpie takes a different approach. Inspired heavily by Mockito for Java, Magpie gives you test doubles that are based on spying and verification, not expectations. So, before we really dive into it, how does it look?

use Test::Magpie qw( mock when );
use Test::Fatal qw( exception );  # provides the exception { ... } helper used below
use Test::More;
my $mocked_list = mock;

when($mocked_list)->get(0)->then_return('first');
when($mocked_list)->get(1)->then_die('Kaboom!');

is($mocked_list->get(0) => 'first');
ok(exception { $mocked_list->get(1) });
is($mocked_list->get => undef);

So, what’s going on here? First of all we create a mock object. This object does every role, is a subclass of every class, and can run any method (returning undef by default). We then stub this object to handle some method calls using the when construct. We specify that when we request item 0 from our mocked list, we should return the string ‘first’, and when we request item 1, we throw an exception with the string ‘Kaboom!’. Simple! And as you can see, the tests that follow verify this behaviour – this example is straight out of t/mockito_examples.t

What does Magpie have to offer?

What you just saw was the basics of Magpie – there are a lot more cool features available that come in very useful!

Verification

As well as stubbing methods, you can also verify that methods were called. The following example from the synopsis illustrates how this may be useful:

use Test::Magpie qw( mock verify when );

my $baker = mock;
my $bakery = Bakery->new( bakers => [ $baker ] );
my $bread = $bakery->buy_loaf( amount => 2, type => 'white' );
verify($baker, times => 2)->bake_loaf('white');

As you can see, we are able to verify that a method was called, and also add some extra details – for now, the number of times a method was called and which arguments it was called with.

Argument matchers

Argument matchers allow you to be more general in your specification for stubs and verification. Rather than saying “when this method is called with exactly these arguments” we can say the more general “when this method is called with arguments that match these predicates.” In practice, it might look like this:

when($child)->eat(type(Broccoli))->then_die('Yuck!');
when($child)->eat(type(SugaryGoodness))->then_return('Ooo, yum!');

In this example Broccoli and SugaryGoodness are type constraints. There are already a few argument matchers that ship with Magpie, and it’s trivial to define your own with the custom_matcher generator.

Extra extra! Read more about it!

There’s more to Test::Magpie than what I’ve mentioned in this post, but if you’re interested, I recommend the official documentation. The basic and Mockito example tests serve as a great demonstration of how Magpie can be practically used.

Go go gadget CPAN – installing Magpie

Magpie is already available for use now, and is on cpan:

cpan Test::Magpie

From your shell, or however you wish to install CPAN modules.

I really hope you enjoy this module; I’m already finding it powerful enough to use at work. If you have any criticisms, bugs, feature requests, or ponies to give me – drop me a comment here, an issue on RT, or poke me on IRC (I’m ocharles).

Happy testing!

Edit: 0.04 had a release problem and might not have installed cleanly. 0.05 should fix this. Sorry!

Time to end an information addiction

May 10, 2010 § Leave a comment

I’ve become increasingly frustrated with how I spend my free time. Bouncing back and forth and back and forth between Reddit and Hacker News, with a constant drip of information straight to me. It’s come to the point now where my attention span is terrible – I can barely go half an hour without needing to find something funny to look at, or a new video to watch, or catching up on some technology that I’m going to forget 10 seconds later and that has zero impact on my life.

I often spend time thinking back to myself as a child. There are things I do differently in life now, but there are also a lot of traits from when I was a kid that I’ve since lost. The fascination with everything, a need to understand – really understand – how everything worked. I could spend hours hacking away at little things and feel that joy and buzz when I got them working. I don’t get this so much anymore.

Part of it is growing up, no doubt. I’m 7 years older than these first memories – it’s hard to constantly understand new things, because there is a limit to the speed at which you can discover stuff. But the larger problem is simply that I’m not trying! I still have the ideas, but there is no development. An idea might make it to a repository, but it dies out shortly afterwards.

Now, I’m not suggesting I should finish all these projects – there are countless past projects that never made it anywhere. However, there used to be projects that did have longevity – now there aren’t.

So, Reddit, Hacker News, it’s time to say goodbye for now – it’s been fun, but you’re just starting to drain me of creativity. I’m sure the transitional period is going to be rocky, but I feel this has got to be done!

Now, maybe I can finish reading these books on my shelf, finish really learning Haskell and Lisp, practice guitar more and continue my piano studies…


Interesting footnote: during the writing of this short post, I was distracted probably 5 times by YouTube and other stuff – at one point entirely forgetting what I was writing about.

Have a nice Easter!

April 4, 2010 § Leave a comment

Hello, just wishing everyone a happy Easter – have a nice day! 🙂