Optimise! Optimise! Optimise!

2010/03/04

To paraphrase Monkey Boy, if you’re a developer you really should be focused on code optimisation. This is something we “old timers” (and I’d hardly call myself a true “old timer” when it comes to software development) had to learn out of simple necessity: we didn’t have enough CPU, memory or spindles to overcome bad software design; code had to be developed, from the ground up, to be as efficient as possible.

This was brought home to me recently when I needed to sanitise a CSV file. The file had 130,000 lines, and while some lines looked different, they would occasionally share a unique identifier, making them effective duplicates for the figure I was hunting for. My own time was of the essence: since it was a one-off run, I didn’t want to spend long writing the code.

Given that for the most part I program in Perl, that’s where I tackled the standardisation. Two techniques immediately occurred to me:

List/array approach

Cycle through each line, building up an array of all the unique identifiers seen so far. Every time a line is encountered whose unique identifier is already in the list, disregard that line and move on to the next. The data structure assembled is @data = (unique_id1, unique_id2, …, unique_idn).

This resulted in some Perl code that looked like the following:

#!/usr/bin/perl -w

use strict;

# Return the number of times $element appears in the supplied list
# (0 if it isn't there, or if the arguments don't make sense).
sub in_list {
    return 0 if (@_ + 0 < 2);

    my $element = $_[0];
    return 0 if (!defined($element));
    shift @_;

    my @list = @_;
    return 0 if (!@list);

    # \Q...\E quotes any regex metacharacters in the identifier.
    my @results = grep(/^\Q$element\E$/, @list);
    return @results + 0;
}

my @ssids = ();
my $count = 0;

while (<>) {
    chomp;

    # Progress indicator on stderr every 100 lines.
    if ($count % 100 == 0) {
        warn "...$count\n";
    }
    $count++;

    # The unique identifier is the seventh comma-separated field.
    my @details = split(/,/, $_);
    my $ssid = $details[6];
    if (in_list($ssid, @ssids)) {
        next;
    } else {
        push(@ssids, $ssid);
    }
    print "$_\n";
}

Hash approach

Cycle through each line, building up a hash of all the unique identifiers. So the resulting data structure for unique entries looks like:

$data{unique_id1} = 1
$data{unique_id2} = 1
…
$data{unique_idn} = 1

Every time a new line is encountered, check the hash to see if there’s already an entry for that line’s identifier. If there is, disregard the line and move on to the next.

Here’s what the Perl code for this variant looked like:

#!/usr/bin/perl -w
use strict;

my %ssids = ();
my $count = 0;

while (<>) {
    chomp;

    # Progress indicator on stderr every 100 lines.
    if ($count % 100 == 0) {
        warn "...$count\n";
    }
    $count++;

    # The unique identifier is the seventh comma-separated field.
    my @details = split(/,/, $_);
    my $ssid = $details[6];
    if (!defined($ssids{$ssid})) {
        $ssids{$ssid} = 1;
    } else {
        next;
    }
    # If we made it here, the identifier hasn't been seen before.
    print "$_\n";
}

What I did, and what happened next

The base system I used for this was a Mac Pro: 3.2GHz, 8 cores, 12GB of RAM. Being single-threaded Perl, of course, either program would only ever run on one core.

While I work with hashes all the time in Perl, I tend not to use them much for “quick and dirty” scripts that I’m only running once, e.g. just for testing, or just for a one-off data reduction. So I wrote the array-based implementation first, which ran, and ran, and ran, and … out of curiosity I let it keep running while I wrote the hash-based variant.

I then ran the hash-based variant, and it completed.

So, how different were the timings?

To capture timestamps around the run, the command was:

$ date > start.txt; ./standardise-list.pl < results.csv > new.csv; date > stop.txt

(A regular Unix person might ask why I don’t just use time in the above command. Well, I’ve always shied away from ‘time’ since it had a bug on Tru64 which blew out its figures for monitored commands with heavy IO by several orders of magnitude. So if I’m not after über-accuracy, I go for the date method.)
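As another aside, if you want timings from inside Perl itself rather than at the shell, the core Time::HiRes module will do the job. A minimal sketch, and not something I bothered with for a one-off like this:

#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

# Record a high-resolution start time, do the work, report the interval.
my $start = [gettimeofday];
# ... the de-duplication loop would go here ...
warn sprintf("Elapsed: %.3f seconds\n", tv_interval($start));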

Here are the start and stop times:

$ cat start.txt stop.txt
Tue  2 Mar 2010 13:29:36 EST
Tue  2 Mar 2010 15:16:51 EST

As you can see, that took quite a while to run: roughly an hour and three-quarters. The reason, of course, is obvious. Every time we expanded the list of unique identifiers, each subsequent line had a bigger array to be checked against, so the work grows quadratically with the number of unique identifiers. In other words, it doesn’t matter if you have a high speed computer with scads of RAM if your algorithm sucks.
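To put a rough number on that quadratic growth (assuming, for the sake of the estimate, that most of the 130,000 identifiers are unique):

#!/usr/bin/perl -w
use strict;

# Back-of-the-envelope: with n mostly-unique identifiers, the grep-based
# check performs on the order of n*(n-1)/2 string comparisons overall.
my $n = 130_000;
printf "~%.1f billion comparisons\n", $n * ($n - 1) / 2 / 1e9;
# prints: ~8.4 billion comparisons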

Moving on to the hash option, the command was:

$ date > start2.txt; ./standardise-hash.pl < results.csv > new2.csv; date > stop2.txt

Here are the start and stop times:

$ cat start2.txt stop2.txt
Tue  2 Mar 2010 15:18:33 EST
Tue  2 Mar 2010 15:18:35 EST

Huge difference there: 2 seconds to run (probably bound as much as anything by throwing an update to stderr every hundred lines) versus an hour and three-quarters. That points to some serious differences in the suitability of the data structure to the task at hand.
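As an aside, the hash-based approach is often written even more tersely in Perl with the classic “seen” hash idiom. A minimal sketch, assuming (as above) that the identifier is the seventh comma-separated field:

#!/usr/bin/perl -w
use strict;

# Print a line only the first time its identifier (field 7) is seen.
my %seen;
while (<>) {
    chomp;
    my $ssid = (split(/,/, $_))[6];
    print "$_\n" unless $seen{$ssid}++;
}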

So your next question, if you’re a Perl programmer, might be: maybe I was just doing the uniqueness check poorly in the list variant.

This leads to four potential enhancements to the array/list based model:

  1. Pass the list as a reference rather than as a new list, eliminating the need to copy the array on every call (a rough sketch of this and the next variant appears after the list).
  2. Do the reference pass as per the above, but iterate through the list rather than using a grep, returning as soon as a match is found, in case the grep was chewing up too much time building its result list.
  3. Forget subroutines, and do the list check in the main loop.
  4. Forget subroutines, and do the list check as a grep in the main loop.
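
To make the first two concrete, here’s roughly what those membership checks look like; the subroutine names are mine, and this isn’t necessarily character-for-character what I ran:

# Variant 1: pass the array by reference to avoid copying it on
# every call; still uses a grep for the membership test.
sub in_list_ref {
    my ($element, $list_ref) = @_;
    return 0 if (!defined($element));
    return scalar(grep(/^\Q$element\E$/, @$list_ref));
}

# Variant 2: pass by reference and iterate, returning as soon as
# a match is found rather than always scanning the whole array.
sub in_list_iter {
    my ($element, $list_ref) = @_;
    return 0 if (!defined($element));
    foreach my $item (@$list_ref) {
        return 1 if ($item eq $element);
    }
    return 0;
}

# Either would be called from the main loop as, e.g.:
#   next if (in_list_ref($ssid, \@ssids));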

Let’s see how each one of those options fared, time-wise:

Variant 1:

  • Start – Wed Mar  3 22:08:56 EST 2010
  • Stop – Wed Mar  3 23:49:56 EST 2010

Variant 2:

  • Start – Wed Mar  3 23:49:56 EST 2010
  • Stop – Thu Mar  4 01:19:50 EST 2010

Variant 3:

  • Start – Thu Mar  4 01:19:50 EST 2010
  • Stop – Thu Mar  4 01:31:38 EST 2010

Variant 4:

  • Start – Thu Mar  4 01:31:38 EST 2010
  • Stop – Thu Mar  4 02:01:00 EST 2010

Overall, none of those results is particularly encouraging: roughly 1 hour 41 minutes, 1 hour 30 minutes, 12 minutes and 29 minutes respectively. While they show there is some optimisation to be had in how the array comparison is done, none of them comes anywhere near the performance offered by picking the correct data structure: the hash.

Net lesson: lazy programming will only get you so far, even on high speed systems. Optimisation should preferably start before you write any code, with picking the right data structure for the situation, even if (as in my case) you don’t traditionally use it for quick and dirty scripts. Optimisation is also about being willing to rip the guts out of an existing piece of code and completely change how it works when you’ve got a fair idea it isn’t working efficiently enough.

And so this serves as my public reminder of why, even when I’m just doing a one-off quick-and-dirty script, I should invest the same time in picking the right algorithm and data structures as I do when writing “serious”, permanent code.