A Journey of Optimizing Perl

November 28, 2010

Perl suffers from a lot of ill-informed preconceptions in the programming community, and speed seems to be one of them. There is an assumption that one must make a mutually exclusive choice – either clean, readable code that runs slowly, or noisy code that performs all sorts of tricks to become efficient. I’ve recently been in a position of having to do some fast work with Perl, so I am now sharing some of my experiences.

Context: Data transformations

In order to set the context for this article, allow me to touch on what I was working on. At MusicBrainz, we are moving from an old database schema to a much newer one, which means migrating data from the old schema to the new one. For the most part this can be done in SQL, but some of the data requires more intricate processing.

We store records called “edits” in our database, which represent some sort of change to data in the system – adding data, editing data and removing data are all represented as edits. Previously, an edit consisted of an old value and a new value, in various formats. Some edits used a custom serialization format of a hashmap, others used single strings, and others used a scheme that made sense at the time but is not consistent with any common format.

The new schema changes some of this. Firstly, the new value and previous value have been combined into one column (as it doesn’t make sense for some edits to have both), and the serialization format is always JSON. This alone is a transformation that would be difficult to achieve in SQL.

On top of this, the row IDs that this serialized data points to may no longer be valid. The new schema merges many rows together (segmenting some data out to other tables), so we need to alter these foreign keys appropriately.

The data set being migrated is roughly 12 million rows, across 70 different types of edit – not a trivial data migration problem. To add further constraints, the migration process can only be run at the point of schema change – when we launch the new version. As we don’t want downtime for the site during the upgrade, we put our database into a read-only state while we migrate, but we still need to minimize the time spent here – 4 hours in total is really at the edge of this limit.

So Make it FAST

Step 1: Prototype

I spent a fair amount of time considering the correct approach to migrating this data. Do I try to be clever and handle as much as possible in a single complex transformation? Do I operate on one distinct type at a time, transforming a stream? Do I operate on a row-by-row basis?

In order to make this decision, I decided to TIAS (try it and see). I grabbed my toolkit of CPAN modules that make my life easier and threw some solutions together. In the end, the solution that felt most productive to work with and understand was operating on a row-by-row basis.

In this approach, I created Moose classes for each transformation. I would inflate a row hash (from DBD) into one of these classes, and call the upgrade method, which gave me a new object back (correctly migrated and ready for re-insertion).
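
For illustration, here is a minimal sketch of the shape these classes took – the class name, attributes, and JSON layout are all my invention, not the real MusicBrainz code:

package Edit::Historic::AddRelease;    # illustrative name only
use Moose;
use JSON qw( encode_json );

has [qw( id old_value new_value )] => ( is => 'ro' );

# Produce the row as it should look in the new schema: a single
# JSON-serialized column instead of separate old/new values.
sub upgrade {
    my $self = shift;
    return {
        id   => $self->id,
        data => encode_json({
            old => $self->old_value,
            new => $self->new_value,
        }),
    };
}

Driving it is then just Edit::Historic::AddRelease->new(%$row)->upgrade for each fetched row (the real classes returned objects rather than plain hashrefs).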

I finally ended up with a migration script that ran – sadly, not within the time frame we were constrained by. However, program correctness was my main concern at this stage; with that in hand, we are ready to continue to the next step.

Step 2: Understand your constraints

Before really diving in and understanding the flaws of the program, it’s important to understand your constraints. Where will your program be running? What hardware do you have available? What timeframe/rate does your program need to run within? Is your program running in isolation, or do you need to consider load issues?

These are all important questions that will have an impact on the decisions you make later. The more you can ask, the better an idea of the problem domain you will have. In our case, we would be running the script as the main process on a machine with at least 4 GB of RAM. These constraints are generous – giving us a lot of freedom to optimize.

Step 3: Identify the bottlenecks

We have all done it, but diving into a program and blindly rewriting code under the assumption it will run faster is an extremely unproductive exercise that rarely yields the goal of program optimization. The more productive approach is to instrument the code to identify the critical points in a program – where the real time is spent.

There are many approaches to instrumentation, and it may be enough to instrument code yourself. We began by lightly instrumenting our master script with time differences using Time::HiRes, a high-resolution implementation of gettimeofday for Perl. Later we needed finer granularity, so we used a combination of Devel::NYTProf and the included nytprofhtml to generate reports. I may well have more to say on using Devel::NYTProf in the future…
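
As an illustration, the light-touch version is only a few lines (migrate_chunk is a hypothetical stand-in for whatever unit of work you are timing):

use Time::HiRes qw( gettimeofday tv_interval );

my $t0 = [gettimeofday];
migrate_chunk($chunk);    # hypothetical unit of work being timed
printf "chunk migrated in %.3fs\n", tv_interval($t0);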

Step 4: Reducing the impact of bottlenecks

With a clear idea of the problem and the most significant bottlenecks, you are now ready to begin optimization. There is no hard and fast rule on the best way to remove bottlenecks, but here are some of the tricks we employed:

Trade space for time

This is the catchy motto of Memoize, a Perl module implementing automatic memoization of subroutines based on the input given. Memoization is the process of caching the result of a function call, keyed on its input, allowing constant-time calls to a function when it is repeatedly called with the same input. There are a few considerations for memoization:

Referential transparency

It does not make sense to memoize functions that are not referentially transparent. A referentially transparent function is a function in the purely mathematical sense – for the same input, you will always see the same output. This means functions that depend on outside state will not be suitable for memoization; for example, subroutines that depend on network access, random values, and so on.

However, if you can guarantee a consistent world state during the execution of your program, then memoization may turn out to be a useful optimization – avoiding heavy disk I/O when you know the result will be the same.

Hit rate

Another consideration for memoization will be familiar to anyone who has had to implement caching – the hit rate. Memoizing a function called 20,000 times sounds sensible at first, but if you discover it’s called with 20,000 distinct inputs, it becomes less sensible: all you achieve is more memory usage, and the cache is never hit. However, if it’s called with only 200 distinct inputs, then memoization is a strong contender for reducing the impact of this bottleneck.
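
When both conditions hold, wiring in Memoize is nearly free. A minimal sketch, assuming a hypothetical lookup against a frozen mapping table (the table, function, and connection details are mine, not from the original script):

use DBI;
use Memoize;

my $dbh = DBI->connect('dbi:Pg:dbname=musicbrainz', '', '', { RaiseError => 1 });

# Hypothetical lookup: the answer never changes during the run (the
# source data is frozen), and the set of distinct inputs is small,
# so the cache hit rate is high.
sub resolve_artist_id {
    my ($old_id) = @_;
    return scalar $dbh->selectrow_array(
        'SELECT new_id FROM tmp_artist_merge WHERE old_id = ?',
        undef, $old_id,
    );
}

memoize('resolve_artist_id');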

Approach the problem from a different angle

One of the first bottlenecks we encountered was how to get data into the script for the migration process. SELECTing the entire table was not an option, as this would use too much memory. The prototype SELECTed in small chunks, but to do this we had to order the entire table for each select and pass an offset – an expensive operation, and one that becomes more expensive as the offset increases.

The next approach we took was to dump the table into CSV files of chunks, and then read from these. Text::CSV_XS allowed us to read rows in with confidence (trusting the correctness of someone else’s code over our own) and drastically reduced the time taken. Reading a chunk from disk into memory almost felt like a no-op compared to the speed of our SQL query.
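
The reading side of this is short. A sketch with Text::CSV_XS, where the file name and process_row are illustrative:

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 })
    or die Text::CSV_XS->error_diag;

open my $fh, '<:encoding(UTF-8)', 'chunks/edit-000001.csv'
    or die "open: $!";
while (my $row = $csv->getline($fh)) {
    process_row($row);    # hypothetical per-row migration step
}
close $fh;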

Finally, after more optimizations this too became a bottleneck again, and with a little more thought we used a COPY statement to stream the table into the script. The important thing here is that we were able to gradually reduce this bottleneck – we did only what was needed to make something else the most significant bottleneck.
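
With DBD::Pg, streaming the table looks roughly like this – a sketch, with the table name, connection details, and process_row again being placeholders of mine:

use DBI;
use Text::CSV_XS;

my $dbh = DBI->connect('dbi:Pg:dbname=musicbrainz', '', '', { RaiseError => 1 });
my $csv = Text::CSV_XS->new({ binary => 1 });

# Stream the whole table out of PostgreSQL in one pass:
# no OFFSET, no intermediate files.
$dbh->do('COPY edit TO STDOUT WITH CSV');

my $line;
while ($dbh->pg_getcopydata($line) >= 0) {
    chomp $line;
    $csv->parse($line) or die $csv->error_diag;
    process_row([ $csv->fields ]);    # hypothetical per-row migration step
}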

Reduce logic where constraints are known

After reducing the larger bottlenecks, a simple function had become the bottleneck. This function decodes a string from an old encoding format into a Perl string. The initial implementation of the function, copied straight out of the legacy code base, was:

use URI::Escape qw( uri_unescape );

sub _decode_value
{
    my ($scheme, $data) = $_[1] =~ /\A\x1B(\w+);(.*)\z/s
        or return $_[1];
    return uri_unescape($data) if $scheme eq "URI";
    die "Unknown encoding scheme '$scheme'";
}

Our initial approach was to memoize this; however, the hit rate was fairly low, so this still didn’t truly help the problem. Looking at the code, it felt as simple as it could be – what were we missing?

The important realisation was that this code contains logic that can be hard-coded. Notice how there is a regular expression to check whether any encoding is present, and a further condition for the scheme itself? But there is only a single possible scheme! With this understanding, we can use this much more efficient code:

sub decode_value
{
    # Every value is known to start with the 5-byte "\x1BURI;" prefix,
    # so we can skip straight past it.
    my $value = shift;
    return uri_unescape(substr($value, 5));
}

This code doesn’t scale well if the scheme ever changes, but we know this cannot happen – the data we are operating on has already been created, and as it is immutable, we know our assumptions will hold.

Substitute for more efficient technologies

Once we had reached program correctness, profiling showed that a substantial amount of time was actually spent within Moose::new_object – constructing objects. Our objects did not really use much of the power Moose provides – no need for type checks, meta-programming, or the expressiveness of attributes. In the end, we replaced all the classes with much simpler ones built on Class::Accessor::Fast::XS. This sped up this bottleneck by an order of magnitude.
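
Assuming Class::Accessor::Fast::XS keeps the usual Class::Accessor interface (it is intended as a drop-in XS replacement), the replacement classes look roughly like this sketch:

package Edit::Historic::AddRelease;    # same illustrative name as before
use base 'Class::Accessor::Fast::XS';

# The same attribute surface as the Moose version, without the
# constructor machinery we were paying for.
__PACKAGE__->mk_accessors(qw( id old_value new_value ));

package main;

# Class::Accessor-style constructors take a hashref of initial values:
my $edit = Edit::Historic::AddRelease->new({ id => 1 });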

Mistakes

This is my first foray into real optimizations, and I definitely made a lot of mistakes along the way. Here are some that I think stand out the most:

A lack of understanding of bottlenecks

I spent a lot of time making assumptions about my code and where it was slow – after all, I wrote it. Sadly, this is the classic problem of programmer ego taking control. Only after watching my futile attempts result in a program that ran ever slower did I step back and take a more scientific approach to the problem. I could have saved a lot of time if I had been willing to slow down and gain a true understanding of the problem first.

Overly ambitious optimizations

As we moved through the process of optimizing, our ideas became drastically overthought. The best example of this came once we had a working chunking system. Our system was able to migrate small chunks of data and appeared CPU-bound, so we concluded the best approach would be multi-threading with some shared memory. This is a classic example of over-engineering a problem. It’s also a solution I don’t have enough experience with to implement reliably (not to mention that I don’t believe Perl is the right tool for concurrent programming).

In the end, a few hours of my time were spent failing to get this system working, when really it was a solution that did not directly address the bottleneck. Rather than working around a bottleneck, the better approach would have been to attack it head on.

Conclusion

Hopefully I’ve shown you that it is possible to write fast Perl code; we just have to be a little more careful in our planning before we approach the problem. Optimizing any code is a delicate operation – you’re really striving for a balance between all parts of the system, and this balance is easily upset. None of this is specific to Perl, and I imagine most of this article will already be familiar to people from other languages. That said, it has certainly helped me to step back and work out exactly what I’m trying to achieve, even if this did take a few days more than I would have liked.

Implementing Factories in Perl

November 24, 2010

Factories are a useful construct, and although they are common in Java and other heavily object-oriented languages, they don’t see as much use in more dynamic languages. That doesn’t mean you can’t use them, or that they don’t have their place. In this article, I’m going to try to explain why and when we need factories, and how we can go about implementing them in Perl.

This article was inspired by a recent discussion I had with someone in #moose, but hopefully this longer write-up will be useful to some people, though be aware the concepts here are simple and I’m writing for a different audience than usual. With that introduction over, let’s get started!

What Are Factories

A factory is really just the name for something that creates objects. We could say new is a very specific factory (one that only creates objects of the same type as its class), but normally a factory performs a little more logic. We usually use factories when we need to create objects but don’t know what type of object until run time. We will work with the following example.

Our system has Flight objects, and there are different types of Flights. For now, let’s say we have Flight::Holidays and Flight::Cargos. Holiday flights take a set of passengers, and cargo flights take a set of cargo items. Our job is to take flight bookings and store them in a database somehow. As part of our solution, we decide that we will need to create the appropriate Flight object, which can then be stored.

package Flight;
use Moose::Role;

package Flight::Holiday;
use Moose;
with 'Flight';

has 'passengers' => ( is => 'ro' );

package Flight::Cargo;
use Moose;
with 'Flight';

has 'cargo' => ( is => 'ro' );

Simple so far, right? So where do factories come into play? The problem is that the external data we receive doesn’t specify which type of flight we need to create! Let’s pretend we’re given a hash reference of parameters to new. We would like to inspect this to decide what kind of object to create. Rather than doing this every time we create a Flight, we should put the logic in a separate function:

sub new_flight {
    my ($class, $data) = @_;
    if (exists $data->{cargo}) {
        return Flight::Cargo->new($data);
    }
    elsif (exists $data->{passengers}) {
        return Flight::Holiday->new($data);
    }
    else {
        die "I don't know how to create this type of Flight";
    }
}

Nothing complicated here? Well, guess what… we just wrote a factory! Move this to a separate FlightFactory class, and we’re done. We can now create flights by calling FlightFactory->new_flight({ cargo => [] }) and we will get a Flight::Cargo back. Neat!

Going Further

This is great – we’ve already abstracted the object construction away – but we can do better. There is a problem with our current factory: it introduces multiple points of change. Our factory is also doing too much – why should FlightFactory care about what makes a Flight::Cargo? Surely that’s Flight::Cargo’s job. Let’s address this issue first:

sub new_flight {
    my ($class, $data) = @_;
    if (Flight::Cargo->understands($data)) {
        return Flight::Cargo->new($data);
    }
    elsif (Flight::Holiday->understands($data)) {
        return Flight::Holiday->new($data);
    }
    else {
        die "I don't know how to create this type of Flight";
    }
}

And we add code like the following to Flight::Holiday and Flight::Cargo:

sub understands {
    my ($class, $data) = @_;
    return exists $data->{cargo};
}

Great! Now the logic for deciding which class to instantiate has been moved to the appropriate area of responsibility. But we still have the problem of multiple points of change. Let’s take a deeper look to see what the problem is.

Imagine our requirements change, and we’re now asked to handle Flight::Personals – people flying their own planes. What changes does this require? Well, we need a Flight::Personal class, that’s for sure:

package Flight::Personal;
use Moose;
with 'Flight';

sub understands {
    my ($class, $data) = @_;
    return exists $data->{owner};
}

This should be enough, but it’s not. If we pass { owner => 'Ollie' } to new_flight we won’t get a Flight::Personal back because the factory does not yet know about Flight::Personal, so let’s add it in:

sub new_flight {
    ...
    elsif (Flight::Personal->understands($data)) {
        return Flight::Personal->new($data);
    }
    ...
}

Wait a minute – I see an abstraction emerging here! We seem to be writing the same sort of code for each branch of our if statement. Let’s see if we can do better here… maybe it will reveal a solution to the problem we’re investigating.

sub new_flight {
    my ($class, $data) = @_;

    my @classes = qw(
        Flight::Personal
        Flight::Holiday
        Flight::Cargo
    );
    for my $subclass (@classes) {
        return $subclass->new($data)
            if $subclass->understands($data);
    }

    die "I don't know how to create this type of flight";
}

Aha! Not only have we abstracted out some repetition and made it easier to change, we’ve reduced the effort to add a new type of Flight. We’re not happy that the factory has to change at all though – can you see how to achieve this yet? We need a way to dynamically set @classes. There are a few ways to do this, but I’ll show you a solution using Module::Pluggable:

package FlightFactory;
use Module::Pluggable
    search_path => 'Flight',
    sub_name    => 'classes',
    require     => 1;    # load each class so we can call ->understands on it

sub new_flight {
    my ($class, $data) = @_;

    for my $subclass ($class->classes) {
        # As before
    }
}

Module::Pluggable gives us the classes class method, which returns all classes under the Flight:: namespace. We probably want to be a bit more specific here and make sure we only get things that are concrete classes – checking that they do the Flight role would be a start, as sketched below. I’ll leave a full solution to readers as an exercise.
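
One way to do that check – my sketch, not code from the original article – is to require each candidate and keep only those that do the Flight role:

my @flight_classes = grep {
    eval "require $_; 1" and $_->can('does') and $_->does('Flight')
} FlightFactory->classes;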

Don’t reinvent the wheel

LeoNerd has pointed out in the comments below that the idiom of looping over classes to find one matching a filter is exactly what Module::PluginFinder was designed for. So, in the spirit of writing even better code, let’s try using that! Module::PluginFinder works like Module::Pluggable, but we can specify a filter for matching classes. It can also handle the instantiation for us:

package FlightFactory;
use Module::PluginFinder;

my $finder = Module::PluginFinder->new(
    search_path => 'Flight',
    filter => sub { 
        my ($class, $data) = @_;
        $class->understands($data)
    }
);

sub new_flight {
    my ($self, $data) = @_;

    my $flight = $finder->construct($data, $data)
        or die "I don't know how to create this type of Flight";

    return $flight;
}

Conclusion

Hopefully this post has given you a clear illustration of the need for factories, when we might want to use them, and how we can implement them. We went past basic factories to make them dynamic and extensible (even extensible from outside the distribution). Along the way, I tried to illustrate the approach I would take doing this at work, which hopefully also gives you a good idea of how you can apply basic refactoring to your code as you write it, and end up with clean, separated code.

If you want to follow along with this tutorial, I have pushed out a Git repository to my Github account. You can follow along with this by checking out the code, and then reading the log with patches, in reverse:

git clone git@github.com:ocharles/OCharles-Blog-Factories.git
git log --reverse -p

It’s a little different, as I wrote it after the article, but hopefully it’s useful. This is the first time I’ve tried posting accompanying code, so I’m curious to see how useful people find it. If you want to run Example.pm you will need Moose, Module::Pluggable, and a reasonable version of Perl (5.8 upwards should do the job).

Defect Driven Testing… and You’re the Driver

November 21, 2010

Recently at $work I’ve been doing, quite frankly, a poor job of maintaining one of my areas of responsibility. I am the developer of a script that deals with the migration of about 12 million rows of data from one format to another. This work has become sufficiently complicated that it can’t really be done in SQL, so it has a set of Perl classes to do the migration. It’s a very critical piece of work right now, because it defines how long our downtime is, and we only get to run this script once – if it’s wrong, then I may have corrupted a decade’s worth of historic data (effectively an audit trail). This is Not A Good Thing.

As you can see, I’m aware of these problems, and have made progress in increasing the speed of the script, along with its quality. What I’m not so happy to tell you is that I’ve also managed to commit this script in various states of instability – from syntax errors and start-up runtime errors, to errors that happen at the end of the script – the kind you still end up waiting a good hour to find out about. I’m a perfectionist, and no doubt this cycle of “hack-break-hack-break-hack-hey it works!-hack-break-hack-fix” is getting extremely tedious, and doesn’t feel at all representative of my actual ability. Finally, that little moment of enlightenment has come and it’s time to sort this out. My solution? <buzzword>developer driven defect driven development</buzzword> (someone please make me a manager).

Defect Driven Development

Before we get on to my revolutionary testing approach (which will probably cure world hunger), we need to discuss the concept of Defect Driven Development (DDD, for now). DDD is a testing methodology whereby you ensure you have a test for every bug. Dead simple. Here’s the basic methodology: you work on a great new feature and release it. User U finds a bug in your feature and reports it. You, the developer, write a test that reproduces the bug, and then fix it. Now the bug will never occur again (theory here, work with me…).

Now here’s the twist:

You Are the User

While developing, consider yourself the user as well. You hack away, and something breaks. At this point, you don’t just fix the breakage – you first write a test to reproduce it. And I mean everywhere, even down to the smallest details, no matter where the bug is. Syntax error? Reproduce it in a test [1]. A function crashes with some input? Reproduce it in a test. The world is beginning to implode? Reprod… ok, it won’t really help here.
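
To make that concrete, a test file for those first two cases might look like this sketch – the module and function names are hypothetical:

use strict;
use warnings;
use Test::More;

# The script once failed to even compile – pin that down forever.
use_ok('MyApp::Migration');

# decode_value once crashed on unencoded input – reproduce it, then fix it.
is( MyApp::Migration::decode_value('plain string'),
    'plain string',
    'unencoded values pass through untouched' );

done_testing;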

I doubt I’ve really made much of an impact with this post, but that’s my point: this is a very simple way to get tests into your codebase, and it has some excellent consequences when you actually get to the end of your feature. Every new feature or enhancement really needs tests to go with it, but tests are hard, and testing is boring. I’ve written unit tests before that exercise the basics, but finding the edge cases that actually break the system is difficult. I don’t think my tests often help me develop a feature; they are more useful later, as I start to integrate tests for bug fixes. But with this new approach, you automatically get tests for the places things have broken in the past – and I believe that if it’s broken before, it’ll probably break again, unless you’re taking steps to prevent that.


[1] While yes, the code wouldn’t actually run, in Perl we can still have a test to make sure something compiles – use_ok from the Test::More distribution can do this, for example.

Introducing Magpie – flexible test doubles & mocking for Perl

November 19, 2010

Introduction

Magpie is a new distribution I have just released which brings the power of test doubles to Perl. There are already a few test double solutions in Perl, but Magpie takes a different approach. Inspired heavily by Mockito for Java, Magpie gives you test doubles that are based on spying and verification, not expectations. So, before we really dive in, how does it look?

use Test::Fatal qw( exception );    # provides exception(), which isn't part of Magpie
use Test::Magpie qw( mock when );
use Test::More;
my $mocked_list = mock;

when($mocked_list)->get(0)->then_return('first');
when($mocked_list)->get(1)->then_die('Kaboom!');

is($mocked_list->get(0) => 'first');
ok(exception { $mocked_list->get(1) });
is($mocked_list->get => undef);

So, what’s going on here? First of all we create a mock object. This object does every role, is a subclass of every class, and can run any method (returning undef by default). We then stub this object to handle some method calls using the when construct. We specify that when we request item 0 from our mocked list, we return the string ‘first’, and when we request item 1, we throw the exception string ‘Kaboom!’. Simple! And as you can see, the tests that follow all verify this behaviour – this example is straight out of t/mockito_examples.t.

What does Magpie have to offer?

What you just saw was the basics of Magpie – there are a lot more cool features available that turn out to be very useful!

Verification

As well as stubbing methods, you can also verify that methods were called. The following example from the synopsis illustrates how this may be useful:

use Test::Magpie qw( mock verify when );

my $baker = mock;
my $bakery = Bakery->new( bakers => [ $baker ] );
my $bread = $bakery->buy_loaf( amount => 2, type => 'white' );
verify($baker, times => 2)->bake_loaf('white');

As you can see, we are able to verify that a method was called, and also add some extra details – for now, the number of times it was called and which arguments it was called with.

Argument matchers

Argument matchers allow you to be more general in your specification for stubs and verification. Rather than saying “when this method is called with exactly these arguments” we can say the more general “when this method is called with arguments that match these predicates.” In practice, it might look like this:

when($child)->eat(type(Broccoli))->then_die('Yuck!');
when($child)->eat(type(SugaryGoodness))->then_return('Ooo, yum!');

In this example Broccoli and SugaryGoodness are type constraints. There are already a few argument matchers that ship with Magpie, and it’s trivial to define your own with the custom_matcher generator.
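
As a sketch of a custom matcher – and this is an assumption on my part: I’m taking custom_matcher to accept a predicate block with the argument under test in $_, so check the Test::Magpie documentation for the exact interface:

use Test::Magpie qw( mock when custom_matcher );

my $child = mock;

# Hypothetical predicate: match any argument reporting a high sugar content.
my $sugary = custom_matcher { $_->sugar_content > 10 };

when($child)->eat($sugary)->then_return('Ooo, yum!');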

Extra extra! Read more about it!

There’s more to Test::Magpie than I’ve mentioned in this post; if you’re interested, I recommend the official documentation. The basic and Mockito example tests serve as a great demonstration of how Magpie can be used in practice.

Go go gadget CPAN – installing Magpie

Magpie is available for use right now, on CPAN:

cpan Test::Magpie

Run this from your shell, or install it however you prefer to install CPAN modules.

I really hope you enjoy this module – I’m already finding it powerful enough to use at work. If you have any criticisms, bugs, feature requests, or ponies to give me, drop me a comment here, file an issue on RT, or poke me on IRC (I’m ocharles).

Happy testing!

Edit: 0.04 had a release problem and might not have installed cleanly. 0.05 should fix this. Sorry!
