Thursday 31 July 2008

Unwanted Hero

My wife and a friend went on a little excursion to a children's playground the other day - my wife took our daughter, and the friend took her three to make it look proper. The friend happened to notice a little boy climbing onto the outside of a tubular spiral slide and, inevitably, begin to slip off. She rushed over, arms outstretched: but she didn't catch him. His clothes had caught on a screw in the side of the slide; and so he hung - upside down - some distance above the ground.

The friend looked around, expecting to see some parent or guardian come rushing to the rescue. But the relevant parties obviously weren't paying any attention, because no aid was forthcoming. So our hero had to hoist the little chap down herself.

Most interesting was the reaction of the friend's oldest daughter.

"Mum's always doing that. Last week she saved a boy from drowning. It's so embarrassing!"

Monday 28 July 2008

Project Euler Problem 16: Calculating Large Powers of Two

I always feel like I've got good value-for-brain-cycles when I can reuse a solution to a problem. I had that satisfaction when I came to look at Problem 16 in the Project Euler series: calculating the sum of the digits of the number 21000. Why? Because I realised that calculating 21000 (which we clearly need to do to calculate the sum of its digits) is just the same as summing a list of very big integers, which we did to solve Problem 13.

How is that? On the face of it calculating an exponent looks quite different to calculating a sum. But take a look at this line of thought:

2n = 2 x 2n-1
   = 2n-1 + 2n-1
   = (2 x 2n-2) + (2 x 2n-2)
   = (2n-2 + 2n-2) + (2n-2 + 2n-2)
   = ...

Bottom up duck. Picture credit: Glisglis, http://flickr.com/photos/glisglis/365003674/Or if you prefer bottom-up thinking (you are a duck, perhaps):

21 = 2
22 = 2 x 2 = 2 + 2
23 = 2 x (2 x 2) = (2 + 2) + (2 + 2)
24 = 2 x [2 x ( 2 x 2)] = [(2 + 2) + (2 + 2)] + [(2 + 2) + (2 + 2)]
25 = you get the idea

Thus we have a recursive definition for calculating a power of two in terms of the addition operation: just add together two of the previous power of two. Putting my Functional Programming Thinking cap on immediately brings to mind the Unfold method, which generates a sequence by applying a transformation to the previous element. Used in combination with my code from an earlier solution to sum up two numbers represented as digits, this should give us what we need.

In order to know how many terms of the sequence to Unfold to get us up to 21000 we'll use the trick of smuggling some additional information into the terms of the sequence using an anonymous type: in the following code the Power property in the anonymous type indicates which power of two is held in the Digits sequence. Digits is a number represented as a sequence of digits, least significant digit first.

The full code, as always, is available on Code Gallery.

[EulerProblem(16, Title = "What is the sum of the digits of the number 2^1000?")]
public class Problem16
{
    public long Solve()
    {  
        return new {Power = 1, Digits = (IEnumerable<int>) new int[] {2}}
            .Unfold((previous) => new { Power = previous.Power + 1, Digits = SumNumbers(previous.Digits, previous.Digits) })
            .SkipWhile(item => item.Power < 1000)
            .First() // this will be the term containing 2^1000
            .Digits // take the Digits property of the anonymous type
            .Sum(); // sum the digits of 2^1000
    }

    /// <summary>
    /// Calculate the sum of two numbers represented as a sequence of digits. Numbers should be given
    /// with least significant digits first
    /// </summary>
    /// <param name="number1"></param>
    /// <param name="number2"></param>
    /// <returns>The sum of the two numbers as a sequence of digits</returns>
    private IEnumerable<int> SumNumbers(IEnumerable<int> number1, IEnumerable<int> number2)
    {
        // put the numbers together in an array so that we can zip them easily
        var numbers = new[] {number1, number2};

        // process the entire list of numbers column by column
        // ie, process least significant digits of all numbers first, then the next column of digits, etc.
        return numbers
        .Zip()
            // work through each column, calculating which digit represents the sum for that column,
            // and what the carryover is
        .Aggregate(
            // initialise an annoymous data structure to hold our workings out as we
            // move column-by-column.
            // Digits holds the digits of the sum that we've calculated so far
            // CarryOver holds whatever has been carried over from the previous column
            new { Digits = Stack<int>.Empty, CarryOver = 0 },
            (previousResult, columnDigits) =>
            {
                var columnSum = columnDigits.Sum() + previousResult.CarryOver;
                var nextDigit = columnSum % 10;
                var carryOver = columnSum / 10;
                return new { Digits = previousResult.Digits.Push(nextDigit), CarryOver = carryOver };
            },
            // to complete the aggregation, we need to add the digits of the final carry-over
            // to the digits of the sum;
            // note that because completeSum.Digits is a stack, when it is enumerated
            // we get the items back in reverse order - ie most significant digit first; hence the call to Reverse()
            // to make the order of the digits in the sum we return consistent with the input digits
            (completeSum) => completeSum.CarryOver == 0 ? 
                completeSum.Digits.Reverse() 
                : completeSum.CarryOver.Digits().Concat(completeSum.Digits).Reverse());
    }
}

I also had to update the Zip method to fix a bug. Here's the updated version:

public static IEnumerable<IEnumerable<T>> Zip<T>(this IEnumerable<IEnumerable<T>> sequences)
{
    // get the enumerators for each sequence
    var enumerators = sequences.Select(sequence => sequence.GetEnumerator()).ToList();

    var activeEnumerators = new List<IEnumerator<T>>();
    var items = new List<T>();

    while (enumerators.Count > 0)
    {
        items.Clear();
        activeEnumerators.Clear();

        foreach (var enumerator in enumerators)
        {
            if (enumerator.MoveNext())
            {
                items.Add(enumerator.Current);
                activeEnumerators.Add(enumerator);
            }
        }

        if (items.Count > 0)
        {
            yield return items;
        }

        enumerators.Clear();
        enumerators.AddRange(activeEnumerators);
    }
}

Friday 25 July 2008

Inadvertent MSDN Humour

I just came across this whilst looking up the documentation for the WCF <dns> config element on MSDN:

MSDN Humour 

I'm guessing that this is the work of an overzealous link-shortening script. The links actually point to www.microsoft.com and msdn.microsoft.com. I wonder how long it will take Microsoft to dehumourise this?

Thursday 24 July 2008

We're hiring

Are you looking for a new outlet for your talent? Is your current role not stretching you enough? Then read on: I might have just the thing for you. Paragon Simulation, the company I work for, is looking to recruit smart developers with the right blend of mathematical, analytical, and people skills to power our next round of business modelling work - and a new product development.

The official vacancies page has more details. The headlines are:

  • The job is based at our offices in Halesowen, near Birmingham (in the UK), just off M5 junction 3 - Google shows you where.
  • We're offering up to £31k, or up to £43k for the Senior role - both roles with additional bonuses and benefits.
  • We're looking for people with a strong maths or stats background. If you've previously been involved in some kind of computer modelling, especially business modelling, you'll be ideal.
  • You'll be working on a variety of modelling projects (possibly involving simulation tools like Witness or Flexsim or even building bespoke modelling tools in C#/.Net), with opportunities to get involved in the development of our new product (still under wraps at the moment).

If you're interested in applying please send your CV to me (after de-obfuscating the following address): sam [dot] jack [snail-shell symbol traditionally associated with email addresses] paragonsimulation [dot] com. Please include "Simulation Job" in the subject line.

But before you do that you might like to know a bit more about the company, and what we do.

Our Work

Paragon sells its services to companies in a number of sectors, ranging from the Nuclear, through the Environmental, to the Financial and Public Sectors. You should see the case studies if you want to get a feel for the breadth of work that we do. As you can imagine, hopping between sectors like this presents a very interesting challenge to the people that work here: clients have often commented on how quickly we learn the intricacies of their unique business - but that's what it takes to create the insightful models they need.

You can read more in an earlier post about my experience on a couple of modelling projects.

Working for Paragon

I find Paragon a great company to work for. There aren't many of us (currently fewer than 20), but as a Company we have an influence beyond our size. All the organisations we work for are very much larger than ours, and we engage them at the highest level. Everybody on a project has to make a strong contribution in many different roles: analyst, developer and tester. We love to find better ways of working: if you have great ideas, and want to be heard, this is the place to be.

We have a relaxed working environment: open plan, no dress-code, flexible working hours, table football in the kitchen - that kind of thing. There's a good personal development program in place with regular reviews and a clearly laid out career path. We try to ensure that everybody has some scheduled "Personal Development Time" every month to work on something not directly related to their current projects - reading, learning, working on projects of their own. As part of this, the company makes it very easy to get hold of any new books we need for the job. Here's the stack on my desk at the moment (we do also have bookshelves, by the way - I just feel cleverer when I have a pile of books to back me up):

22072008

In terms of the software development environment, we're very much a Microsoft shop. We're fortunate to have the kind of environment which means that we can take up new technologies where ever we think there's a case for them. So we can lay claim to all the latest buzzwords like Visual Studio 2008, .Net 3.5, C#3.0, LINQ, LINQ-to-SQL, WPF, WCF, SQL Server 2005, VSTO, etc., etc. It goes without saying that we have all the staples like Version Control and Bug tracking in place. We also have our Build Server that does Continuous Integration (using CruiseControl.Net) and some automated testing (we favour MbUnit). And you might like to know that we're all Resharper fans round here.

For Simulation work, I've already mentioned that we use Witness and Flexsim, both Discrete Event Simulation packages. Witness is one of the most mature Simulation engines, and is very good at building models that are configurable at run time. It can be programmed in a language that has similarities to Basic. Flexsim is a relative newcomer. Its speciality is 3d visualisation of processes, so we use it when we really want to engage a customer in how their process will look when it is in operation.

Then there's spreadsheet modelling. We have been using Excel with VBA for these kinds of projects, but with some of our recent models hitting 120Mb in size, I think we're pushing the limits of what this combination of technologies can achieve. If you have experience in using other technologies (perhaps VSTO) for serious spreadsheet modelling, or other ideas about how we can manage such large beasts, we'd love to hear from you.

So what are you waiting for? Get your CV up to date, and get it to me.

Tuesday 22 July 2008

What I do at work

When I launched this blog a couple of months ago, I made a vague promise that I would talk a bit about what I do at work. I know your repeat visits have all been in the hope of me keeping that promise. Right?

[Content temporarily removed]

To be revealed...

I can't say much about my most recent and current projects, because they have to do with a new product that Paragon is developing and keeping under wraps . What I think I am allowed to say is that it is to do with spreadsheet modelling and data management.

At the risk of allowing you to join the dots and deduce more than you should know, I'll mention that this project has given us an opportunity to build a client/server application with WCF; we've had lots of fun with LINQ-to-SQL; and we've had plenty of character-building experience with VSTO and managed Excel add-ins - all lubricated with plenty of C#3.0 goodness.

If any of this sounds more appealing to you than your current line of work, look out for my up-coming post on some positions we have open.

[Content temporarily removed]

Friday 18 July 2008

Random Pastel Colour Generator

One of the nice things about WPF is that it sets us free from the world of "battle-ship gray" that is Windows Forms and Win32, and lets us enter the promised land of glorious technicolored, gradient-filled User Interfaces with all kinds of fancy typography and bouncy animated buttons.

Of course, with great power comes great responsibility not to offend users' sensibilities with garish colour schemes. For artistically challenged developers there are sites like Kuler that help with composing tasteful colour schemes1. And now there's my offering.

This is something that a colleague (hi Simon!) and I put together when we were developing a WPF chart control back in the days of .Net 3.0 beta 2. We wanted to be able to generate charts with randomly coloured series, avoiding the horrible mishmash of colours with which Excel paints its charts. We came up with a random pastel colour generator.

The principle is quite simple. You just need to observe that, in the RGB colour model pleasing pastel palettes pop out when you pick red, green and blue values fairly close to each other in the middle of the range 0-256. It generates results like this:

Pastel Colours 1

and this:

Pastel Colours 2

Here's the code (in C#)2:

using System;
using System.Windows.Media;

/// <summary>
/// Provides a range of tasteful random pastel colors
/// </summary>
public class RandomPastelColorGenerator
{
    private readonly Random _random;

    public RandomPastelColorGenerator()
    {
        // seed the generator with 2 because
        // this gives a good sequence of colors
        const int RandomSeed = 2;
        _random = new Random(RandomSeed);
    }

    /// <summary>
    /// Returns a random pastel brush
    /// </summary>
    /// <returns></returns>
    public SolidColorBrush GetNextBrush()
    {
        SolidColorBrush brush = new SolidColorBrush(GetNext());
        // freeze the brush for efficiency
        brush.Freeze();

        return brush;
    }

    /// <summary>
    /// Returns a random pastel color
    /// </summary>
    /// <returns></returns>
    public Color GetNext()
    {
        // to create lighter colours:
        // take a random integer between 0 & 128 (rather than between 0 and 255)
        // and then add 127 to make the colour lighter
        byte[] colorBytes = new byte[3];
        colorBytes[0] = (byte)(_random.Next(128) + 127);
        colorBytes[1] = (byte)(_random.Next(128) + 127);
        colorBytes[2] = (byte)(_random.Next(128) + 127);

        Color color = new Color();

        // make the color fully opaque
        color.A = 255;
        color.R = colorBytes[0];
        color.B = colorBytes[1];
        color.G = colorBytes[2];

        return color;
    }
}

Footnotes

  1. Here's a list of other colour-scheme generators.
  2. If you've not entered the promised land of WPF, it shouldn't be too hard to convert this to Windows Forms lingo. I'll leave that as an exercise to the reader...

Tuesday 15 July 2008

Project Euler Problem 15: City Grids and Pascal's Triangle

Problem 15 in the Project Euler series asks us to count the number of routes through a 20x20 grid starting from the top left and travelling to bottom right, always moving either rightwards or downwards, never backtracking. I can give you the solution in one line of code; explaining why it is the solution will take me many more lines of prose. The solution is:

return PascalsTriangle.GetEntry(2*20, 20);

(OK, you caught me: PascalsTriangle isn't a C# primitive type, so I guess I'll have to give some code for that as well. But all in good time.)

So what does Pascal's triangle have to do with City Grids? This is one of the things that I love about mathematics: seemingly unrelated problems often have subtle and beautiful connections. The connections are completely hidden on the surface, but once you uncover them, link at a time, you wonder that you never spotted them immediately.

How should we go about counting the number of routes through the grid? Here, try with this smaller grid - take my copy and scribble on it all you like:

A 6x6 city grid

My first insight was to start counting, not at the start of the grid, but at the end. Rather than try to work out all the possible routes from the starting point, let's count how many routes there are from points near the end, then work backwards.

Let me show you.

How many routes are there from the point F6 (marked "End") to the same point, F6? Stupid question, obvious answer: one - just follow the complicated directions: "Stand Still". From F5 to F6 there's one route: "Go South one block". From E6 to F6, one route: "Go East one block". But if I'm stood at E5 and want to get to End I have two choices, I can either go "East one block, then south one block" or "South one block, then East one block".

In fact, from anywhere along column F there's only one route to the end ("Go South x blocks"), because we're not allowed to turn back west; and from anywhere along row 6 there's only one route ("Go East x blocks") because we're not allowed to turn up north. On the diagram I've drawn what we've got so far - blue shows the number of routes from each point:

CityGridv3

Now lets consider how to find the number of routes from other spots on the grid. Take E4, for example. If we go east to F4 then we already know that there's only one way we can go on from there. If, however, we go south one block to E5 then we know that there are two routes we can take from there. So this makes a total of three possible routes from E4. By the same argument, we can see that there are 3 possible routes from D5. Then we can apply the same reasoning to cell D4: three possible routes if we go east to E4 plus three possible routes if we go south to D5 make for a total of 6 ways to get from D4 to the end. Another diagram to summarise:

CityGridv4

You see the way this is going now, don't you? To work out the total number of routes from each intersection we add up the total number of routes from the intersections one block to the east, and one block to the south.

Now lets spin the grid about its centre to put the point marked End at the top, drop the gridlines, and write out the numbers by themselves. Like this:

PascalsTriangle

What have we got? Pascal's Triangle - a fascinating mathematical object d'art that has been rediscovered many times throughout history in many different contexts. (Wikipedia will be delighted to tell you more about it- there's more to it than meets the eye). You construct it by writing 1s down the diagonal edges of the triangle, then forming each cell by summing the two neighbours in the row above.

I've highlighted the cells in red, because they are the ones that particularly concern us. Imagine spinning the triangle back so that it aligns with the grid. Then you see that "2" is the number of routes through a 1x1 grid, and "6" is the number of routes through a 2x2 grid.

Let's label the cells of the grid. Call the apex of the triangle row 0, and call the first "1" on each row column 0. Thus the number of routes through a 1x1 grid appears in row 2, column 1; the number of routes through a 2x2 grid appears at row 4 column 2. From there we can take a flying un-mathematical leap to a conclusion and say that the number of routes through an n x n grid will appear in row 2n, column n of Pascal's Triangle. That explains my solution: obvious now?

But how to calculate the number in that cell? My first thought was to write code that builds the triangle row by row. But there's a quicker way. Using Proof by Wikipedia, we can show that P(r,c) = ((r + 1 - c) / c) * P(r, c - 1), where P(r, c) is the value of Pascal's triangle at row r, column c. In other words, to calculate the value of a cell at column, c, in a particular row, r, we multiply the value of the cell to the left by the fraction (r + 1 - c) / c.

A sequence defined recursively like that makes me think of the Unfold method for generating it. Remember that after you have given it a seed value, Unfold generates a sequence for you by applying the lambda you give it to the previous term in the sequence. So our mathematical definition becomes:

public class PascalsTriangle
{
    public static long GetEntry(int row, int column)
    {
        // the L suffix on "Entry = 1L" is to force Entry to have a long type
        return Functional.Unfold(new {Entry = 1L, Column = 1},
                                 previous =>
                                 new
                                     {
                                         Entry = (previous.Entry*(row + 1 - previous.Column))/previous.Column,
                                         Column = previous.Column + 1
                                     })
            .SkipWhile(item => item.Column <= column)
            .First()
            .Entry;
    }
}

Note the use of the anonymous type to pass an additional piece of information through to the next iteration: Entry holds the cell value, and Column holds an incrementing count of the number of columns that we've calculated.

For reference, here's equivalent code that uses looping rather than functional concepts:

public static long GetEntry(int row, int column)
{
    long current = 1;

    for (int i = 1; i <= column; i++ )
    {
        current = (current * (row + 1 - i)) / i;
    }

    return current;
}

The complete code is shown below, and can also be downloaded from my Project Euler on the MSDN Code Gallery.

[EulerProblem(15, Title = "Starting in the top left corner in a 20 by 20 grid, how many routes are there to the bottom right corner?")]
public class Problem15
{
    public long Solve()
    {
        return PascalsTriangle.GetEntry(2*20, 20);
    }
}

public class PascalsTriangle
{

    public static long GetEntry(int row, int column)
    {
        // the L suffix on "Entry = 1L" is to force Entry to have a long type
        return Functional.Unfold(new { Entry = 1L, Column = 1 },
                                 previous =>
                                 new
                                     {
                                         Entry = (previous.Entry * (row + 1 - previous.Column)) / previous.Column,
                                         Column = previous.Column + 1
                                     })
            .SkipWhile(item => item.Column <= column)
            .First()
            .Entry;
    }
}

public static class Functional
{
    public static IEnumerable<T> Unfold<T>(this T seed, Func<T, T> generator)
    {
        // include seed in the sequence
        yield return seed;

        T current = seed;

        // now continue the sequence
        while (true)
        {
            current = generator(current);
            yield return current;
        }
    }
}

Monday 14 July 2008

Using IDisposable - don't feel guilty about 'hijacking' it!

Jason Bock links to my post on (Mis)Using IDisposable (with some kind words about the "cool content" on the rest of blog - thanks Jason!), and says that he sometimes feels "a little guilty about using IDisposable for something other than managing unmanaged resources". I guess the title of my post did suggest that I was encouraging others to indulge along with me in some slightly naughty behaviour.

But nothing so underhand is going on. Eric Gunnerson, previously on the C# design team, has said 

we decided to name it “using” rather than something more specific to disposing objects so that it could be used for exactly this scenario.

So, Jason, we all have the highest authority to (mis)use using and IDisposable for purposes beyond resource clean-up.

Thursday 10 July 2008

Cancelling Long-running LINQ queries

Last time you dropped by, we were discussing how that polite software should tell its users how it's getting on with the tasks it has been set - it should give progress reports; I showed how to implement progress reporting in LINQ queries. Now the book of software etiquette also says (on page 214, but you'll have to trust me on that because I've got the only copy - stored in my imagination) that well mannered programs should allow their users to cancel very long running tasks.

When I was new to the .Net framework, and hadn't read around much, I was inclined to think that the best way of supporting cancellation is spin up a new Thread, do the work there and, if the user should press "Cancel", pull the trigger on the Thread using Thread.Abort(). But as Ian Griffiths explains, Thread.Abort() is a nuclear option that you should only invoke if you're prepared to deal with the radioactive fallout afterwards - by which I mean the inconsistent state that the ThreadAbortException generated by Thread.Abort() can leave in its wake as it goes about terminating your thread. If you want to stop your program in its tracks without damaging the environment you should use a good old-fashioned bool flag that one thread can set and the worker thread can check periodically to see whether it should break out of its current job.

But when we're using LINQ expressions, with nary a loop in sight, how can we periodically check that flag?

Remember the pipeline analogy I used of LINQ queries last time? There we talked about introducing flow meters to observe the number of elements passing through the pipe - that idea led to the progress reporting we needed. Well, the other thing every sensibly-constructed pipeline has are valves that can shut the whole operation down: that's just what we want to support cancellation. The idea is that, when we hear that we need to cancel, we starve the LINQ query of any further items so that it terminates early.

As before, we can implement this LINQ "valve" using an Extension method containing an iterator that just pulls items from the input sequence on the left and passes them through to the output sequence on the right, but crucially calls a delegate after every item to ask whether it should carry on or cut the output sequence short; we can even combine this technique with the one from last time to support reporting progress.

Also as before, I've provided two versions of this extension method: one that buffers the input collection in order to count it, and one that doesn't buffer it - but does require you to tell it how many elements there will be.

public static IEnumerable<T> AsCancelable<T>(this IEnumerable<T> sequence, Func<int,bool> shouldCancel)
{
    if (sequence == null) { throw new ArgumentNullException("sequence"); }

     // make sure we can find out how many elements are in the sequence
     ICollection<T> collection = sequence as ICollection<T>;
     if (collection == null)
     {
        // buffer the entire sequence
        collection = new List<T>(sequence);
     }

     long total = collection.Count;
     return collection.AsCancelable(total, shouldCancel);
}

public static IEnumerable<T> AsCancelable<T>(this IEnumerable<T> sequence, long itemCount, Func<int, bool> shouldCancel)
{
    if (sequence == null) { throw new ArgumentNullException("sequence"); }

    long completed = 0;

    foreach (var item in sequence)
    {
        yield return item;

        completed++;
        bool cancel = shouldCancel((int)(((double)completed / itemCount) * 100));
        if (cancel) { yield break; }
    }
}

To provide an example of how you might use this, I've applied AsCancelable to a slightly modified version of my solution to Project Euler Problem 4. To make the code run for longer (to give us an opportunity to cancel it!), I've changed it to find the largest palindrome that is a product of two 4 digit numbers (Problem 4 was about two 3 digit numbers).

public void TestCancellableProgressReporting()
{
    bool cancelled = false;

    Func<int, bool> progressController = percent =>
                        {
                            Console.SetCursorPosition(1, 0);
                            Console.WriteLine(percent + "%");
                            Console.WriteLine("Press 'c' to cancel");
                            cancelled = (Console.KeyAvailable && Console.ReadKey(true).Key == ConsoleKey.C);
                            return cancelled;
                        };

    var largestPalindrome =
                        (
                        from long x in 1000.To(9999).AsCancelable(progressController)
                        from long y in 1000.To(9999)
                        let product = x * y
                        where product.IsPalindromic()
                        orderby product descending
                        select product
                        )
                        .FirstOrDefault();
    if (!cancelled)
    {
        largestPalindrome
            .DisplayAndPause();
    }
    else
    {
        "Calculation Cancelled".DisplayAndPause();
    }
}

A couple of things to note about this:

  • You need to make sure your LINQ query is prepared for the starvation it might face. In the original solution to Problem 4 I used a call to First() rather than FirstOrDefault() at line 23 (the difference being that First() throws an exception if it doesn't find any elements in the sequence, whereas FirstOrDefault will return the default value for the sequence type - 0 in this case). That was fine in the original context, because there were always elements in the sequence. When we use AsCancelable however, the sequence might be terminated before any elements have been passed through
  • You need to be sure to ignore the result of the query if it has been cancelled. This is obvious really, but I nearly forgot when writing the sample. In this case largestPalindrome will always be assigned a value, but only when we don't cancel the query will it be a valid value.
  • AsCancelable will only work in expressions where we are directly processing sequences - ie when the data source is IEnumerable rather than IQueryable. It will not work with LINQ to SQL for example, because LINQ to SQL translates the expressions you give it to SQL statements which clearly don't support cancellation! You might want to check out AsEnumerable in this context.
  • You might also want to note in passing the way to read input from the Console without blocking: check Console.KeyAvailable first!

Tuesday 8 July 2008

Incongruous alternatives

Overheard when passing a delivery truck parked outside our offices:

"Have you got a lighter, mate? Or a knife?"

Which set my imagination reeling. In what situation could those two implements be possible alternatives?

I can only hope that the courier was wanting, as I guessed, to remove the polythene wrapping from a palette. If he had a trussed up victim hidden in the van whom he wanted to torture then I failed in my civic duty to render aid.

Monday 7 July 2008

Reporting Progress during LINQ queries

How many times have you had someone use your software, then say to you: "That feature is so much faster now. What did you do?" And in fact all you had done was implement a progress bar, thus harnessing, not the power of the CPU, but the power of psychology!

Reporting progress of an operation is very easy when you have a list of objects that you loop through in a for or foreach block: you just need to know the total number of objects before you start, increment a counter as you process them, and periodically raise an event to notify interested parties (usually your UI) of the percentage progress. But how do you do calculate progress when there are no loops in sight - when you are using LINQ queries to process sequences?

I had a flash of inspiration the other night, followed by an impromptu coding session the next morning when the unusually bright sunlight woke me up early. I thought I'd share the results with you.

LINQ Pipeline Monitors

One way to think of LINQ queries with sequences is as a pipeline, or an oil refinery. You have the data source (often a collection) acting like the oil well, but producing items. Items flow along the pipeline (think of the "."s in the expression as the pipe). Some items are removed from the pipeline by filters - Where clauses; some items are converted to other products - Select clauses; finally, just as oil is cracked, and graded into gasoline and diesel etc. items in LINQ queries are often grouped and sorted. In oil pipelines there are also flow meters, measuring how much oil is going through the pipes. If we can create the equivalent of a flow-meter for LINQ queries, then we've got our means of reporting progress.

What I came up with is an extension method, WithProgressReporting(...), that you can just insert in the appropriate place in your LINQ query. Given an input sequence, it simply passes items through to an output sequence. But as items pass through it counts how many its sees, and calls a lambda expression to report progress against a pre-determined count of the total number of items.

I've created two variants of this.

  • The first one ensures that all the items in the sequence have been generated up front, so that it knows how many there are to process; use this if your data source is a collection, or if generating the items takes insignificant time compared with the processing you're going to do.
  • The second variant can be used if generating the items in the sequence takes time (so you want to report progress of this part as well), but you know in advance how many there will be: it doesn't attempt to buffer the sequence before passing the items through to the output sequence.
public static class Extensions
{
    public static IEnumerable<T> WithProgressReporting<T>(this IEnumerable<T> sequence, Action<int> reportProgress)
    {
        if (sequence == null) { throw new ArgumentNullException("sequence"); }
    
        // make sure we can find out how many elements are in the sequence
        ICollection<T> collection = sequence as ICollection<T>;
        if (collection == null)
        {
            // buffer the entire sequence
            collection = new List<T>(sequence);
        }

        int total = collection.Count;
        return collection.WithProgressReporting(total, reportProgress);
    }

    public static IEnumerable<T> WithProgressReporting<T>(this IEnumerable<T> sequence, long itemCount, Action<int> reportProgress)
    {
        if (sequence == null) { throw new ArgumentNullException("sequence"); }

        int completed = 0;
        foreach (var item in sequence)
        {
            yield return item;
    
            completed++;
            reportProgress((int)(((double)completed / itemCount) * 100));
        }
    }
}

WithProgressReporting in action

Here's a toy example showing how you might use the first method when you're using a BackgroundWorker to do calculations on a background thread. Focus on line 11, where I'm using WithProgressReporting to report progress to the BackgroundWorker:

public void TestProgressReporting()
{
    BackgroundWorker worker = new BackgroundWorker();
    worker.WorkerReportsProgress = true;
    worker.DoWork += (sender, e) =>
          {
              // pretend we have a collection of 
              // items to process
              var items = 1.To(1000);
              items
                  .WithProgressReporting(progress => worker.ReportProgress(progress))
                  .ForEach(item => Thread.Sleep(10)); // simulate some real work
          };

    worker.ProgressChanged += (sender, e) =>
        {
            // make sure the figure is written to the
            // same point on screen each time
            Console.SetCursorPosition(1, 0);
            Console.Write(e.ProgressPercentage);
        };

    worker.RunWorkerAsync();
    Console.Read();
}

public static class Extensions
{
   public static void ForEach<T>(this IEnumerable<T> sequence, Action<T> action)
   {
       foreach (var item in sequence)
       {
           action(item);
       }
   }
}

Now, stay tuned. I've not exhausted this flash of inspiration yet...

Friday 4 July 2008

Rating my posts

I've just enabled Blogger's star-rating control on each post of the blog. It would be really helpful if you could let me know what you think of my posts by giving them a mark out of 5. Come on: let your inner teacher out!

I've also enabled Blogger's new inline comment box on each post page, so you've got no excuse for not talking back to me now.

Thursday 3 July 2008

Project Euler Problem 14: Hailstone Numbers

Today we're going to talk about hailstones. No, not the sort that, last week, caused tens of millions of pounds of damage to 30,000 Volkswagen cars in an outside storage area in Germany. Rather the hailstones that are the subject of Project Euler Problem 14: hailstone numbers.

Hailstone numbers are just the kind of thing that fascinate mathematicians. You make them by picking any whole starting number, n, and then repeatedly applying one of the following two rules:

  • if n is even, the next number is n/2
  • if n is odd, the next number is 3n + 1

Try it; Start with 6. Then you get 6, 3, 10, 5, 16, 8, 4, 2, 1, 4, 2, 1, 4, 2, 1, ... . Try it with another number. I bet you find yourself eventually writing 4, 2, 1, 4, 2, 1, ... all over again. If you don't, then, congratulations: you've just solved the Collatz problem, and won yourself at least a £1000 pound reward! You must also have picked a pretty big staring number. Using computers (obviously!) mathematicians have showed that if you kick off the sequence with any number smaller than about 2.7x1016 you'll eventually get down to the 4,2,1 repeating pattern.

You or I might be inclined to think, based on that evidence, that the sequence will always end in 4,2,1 for any starting number; but no quantity of successful tests would satisfy a pure-blooded mathematician: what if the next generation of computers tested even bigger numbers and found a sequence which didn't terminate like that? Hence the reward, which is being offered for a proof that for every whole number, the sequence generated by these rules eventually reaches 1; or, alternatively, for a demonstration of a number which begins a sequence that doesn't reach 1. This problem has taxed the minds of some of the brightest mathematicians for over 70 years.

But why are they called Hailstone numbers? Have a look at this chart, where I've plotted a few sequences starting with different numbers: Hailstone Numbers

See how the sequences jump up and down? Apparently real hailstones do that when they're growing in the clouds: rising currents of air lift dust particles up into the clouds where they get smothered in ice; this makes them heavy, so they sink. Then the ice melts a little which makes them light enough to be pushed back up again- and so the little hailstones bounce up and down in the clouds getting fatter and fatter, until eventually they get so heavy that they fall to the ground, bouncing there until they come to a complete stop.

Now that you know the what and the why of Hailstone numbers, let's get to Euler's problem, which is to do with the length of sequences of hailstone numbers. We define the length of the sequence as being the number of terms up to the first occurrence of the number 1 (the example above starting with 6 has length 9, for example). Euler asks which starting number under 1 million produces the longest sequence?

A recursive solution

How to solve it? We could do it using an iterator or the Unfold method to generate the sequence, then counting the number of elements in it: but we're not actually interested in the numbers in the sequence, just the length of the sequence.

We can find the length of the sequence which starts at a particular number by finding the next term in the sequence, then adding one to the length of the sequence that starts at that term. In other words we can tackle this using a recursive function; and to make things interesting, we'll make it a recursive lambda function (this isn't gratuitous lambdaisation - you'll see why I've done it in a minute).

Func<long, long> findSequenceLength = null;
findSequenceLength = (long n) =>
{
 if (n == 1) { return 1; }
 var nextTerm = n.IsEven() ? n / 2 : 3* n + 1;
 return findSequenceLength(nextTerm) + 1;
};

The trick to recursive lambda functions, as Eric Lippert pointed out, is to do what I've done in line 1 in the snippet above and assign null to the lambda before giving its actual definition; otherwise the compiler complains that the variable holding the lambda is being used before it is assigned.

Actually I'm misleading you a bit here, because we've not really created a recursive function at all. Thanks to the magic of closures, the lambda that we've assigned to findSequenceLength is really just calling the delegate stored in the variable findSequenceLength -that's why we need to be careful to assign that variable first (Wes Dyer explains more about this, and shows how to do real anonymous recursion in C#). But that very fact creates an opportunity for us: by changing the delegate that is stored in findSequenceLength we can subtly but usefully change the way findSequenceLength works.

Memoization

If you try out a couple of hailstone sequences, you'll soon discover that there are many that start off differently, but converge onto the same (bumpy) glide path down to the (4,2,1) ending. If you start with 24, for example, the third term in the sequence will be 6, and then will continue exactly the same as the sequence starting at 6. This means that once we have encountered one of these 'glide paths' and calculated its length, we really shouldn't need to do the same calculations all over again.

Memoization is a neat trick you can apply to give your lambdas a memory of what they've calculated before (thanks, again, to Wes Dyer for his post on memoization). Have a look at this code:

public static Func<T, TResult> Memoize<T, TResult>(this Func<T, TResult> function)
{
  var cache = new Dictionary<T, TResult>();

  return (T arg) =>
  {
      TResult result;
      if (cache.TryGetValue(arg, out result))
      {
          return result;
      }
      else
      {
          result = function(arg);
          cache.Add(arg, result);
          return result;
      }
  };
}

When you pass a delegate to Memoize, it will return a new delegate that only ever calls your original one if it's called with an argument it has not seen before; otherwise it supplies the answer from its cache of previous results. Obviously you'll only want to use this on delegates whose results are wholly dependent on the input parameters: if you've got a function that uses the current time, for example, caching results will get you stale answers.

So we can do this:

Func<long, long> findSequenceLength = null;
findSequenceLength = (long n) =>
{
 if (n == 1) { return 1; }
 var nextTerm = n.IsEven() ? n / 2 : 3* n + 1;
 return findSequenceLength(nextTerm) + 1;
};
findSequenceLength = findSequenceLength.Memoize();

Now, because the original lambda stored in findSequenceLength is defined to call whatever delegate is stored in the findSequenceLength variable, when we store the memoized version of the lambda back into findSequenceLength the result is a recursive function that can short-cut any calculations for parameters that it has seen before.

I'll let you untwist your brain before I carry on.

Recovered? Now the final answer is just a step away.

Using our findSequenceLength lambda we can calculate the sequence length for any number we like. But to get the final answer to the problem we don't actually want to know the sequence length: we need to know which number generates the longest sequence. So we can't just generate all the sequence lengths and find the maximum over that set. Instead, we'll generate objects containing both the starting number and the sequence length, and use my MaxItem extension method which finds the maximum object in a sequence, using which ever property of the object you grab with a lambda expression to make the comparison. Like so:

return 1.To(1000000)
          .Select(n => new { n, Length = findSequenceLength(n) })
          .MaxItem(info => info.Length)
          .n;
The full code is shown below, together with the code for MaxItem. Or you can download it from my Project Euler page on MSDN Code Gallery.

Oh, and what difference in performance does Memoization make? 0.85 seconds to run the code on my machine compared with 4.4 seconds when I take out the call to Memoize().

[EulerProblem(14, Title = "Find the longest sequence using a starting number under one million.")]
public class Problem14
{
  public long Solve()
  {
      Func<long, long> findSequenceLength = null;
      findSequenceLength = (long n) =>
      {
          if (n == 1) { return 1; }
          var nextTerm = n.IsEven() ? n / 2 : 3 * n + 1;
          return findSequenceLength(nextTerm) + 1;
      };

      findSequenceLength = findSequenceLength.Memoize();

      return 1.To(1000000)
          .Select(n => new { n, Length = findSequenceLength(n) })
          .MaxItem(info => info.Length)
          .n;
  }
}
public static T MaxItem<T, TCompare>(this IEnumerable<T> sequence, Func<T, TCompare> comparatorSelector) where TCompare : IComparable<TCompare>
{
  T max = sequence.First();
  TCompare maxComparator = comparatorSelector(max);

  foreach (var item in sequence)
  {
      var comparator = comparatorSelector(item);

      if (comparator.CompareTo(maxComparator) > 0)
      {
          max = item;
          maxComparator = comparator;
      }
  }

  return max;

}