Functional Fun: August 2008

Thursday, 28 August 2008

I am now a Tetrahedron

In my lunch hour yesterday I solved my 25th Project Euler problem. As Euler kindly informed me, that puts me in the top 78% of Project Euler members, and makes me a Tetrahedron according to the newly introduced scoring system. The levels in this system are named after the Platonic Solids: when I've solved another 25 problems, I'll become a cube; after another 50, an octahedron, and so on, evolving step by step to the mathematical perfection of a sphere when I've tucked 250 problems under my belt.

But don't let me get carried away. I started solving problems in Mid March. It's now the end of August. By my reckoning it's taken 27 weeks to solve those 25 problems, averaging about 0.9 problems a week. Extrapolating this, I'm looking at another 3 years to solve the 200 or so current problems, or 4 years to reach Spherehood. And that's looking on the rosy side: the problems get harder as you get further on, so my rate is bound to drop.

With that in mind, I'll probably change my current tack of blogging about the problems one by one, and start picking out those that I think will make for the most interesting posts. I've got my eye on number 96 which is all about Sudoku ...

If you want to follow my progress, you can have a look at my Project Euler profile.

Project Euler Problems 18 and 67: Finding the Maximal Route through a Triangle

If this doesn't pique the interest of my fellow developers then I don't know what will: a problem that will take twenty billion years to solve (at an extremely optimistic estimate), unless we find an efficient algorithm. The problem in question is Project Euler Problem 67. We're given a triangle made up of one hundred rows of numbers, and asked to consider all routes from the top to the bottom, made by moving from one row to the next via adjacent cells. Over all such routes, we have to find the maximum sum. Can't picture it? Let me help:

For the route that I've drawn, the sum is 5 + 15 + 29 + 45 = [train your brain and work it out yourself!].

If we only wanted to tackle the baby brother of this problem (Problem 18 - the same problem for a triangle with just 15 rows), we could get away with checking each of the routes. But it's not feasible to do that here because there are 2⁹⁹ possibilities: hence the need for an efficient algorithm. Any ideas?

You could take a bath and wait for the Eureka moment. Or if you want to avoid the ensuing streaking, read on.

An algorithm

Suppose that for each cell in a row we know the maximum sum over all routes ending at that cell. From that it's easy to work out the same for each of the cells in the next row. Here's how you do it. If you've got a cell with value c, somewhere in the middle of the row, then you can only reach it from either of the two cells adjacent to it in the row above.

We're assuming that we already know the maximum sum for routes ending at these cells: suppose it is a and b respectively. So the biggest possible sum we can get for routes ending at c is Max(c + a, c + b). For the two cells at either end of the row the calculation is even simpler: they can only be reached from the cell at the appropriate end of the row above them, so we just sum the cell value with the known maximum sum to the cell above. In this way, we can crunch through the rows, 'flattening' the triangle until we reach the last row, at which point we will know the maximum possible sum over all routes ending at each of the cells in that row.

That's the middle of an algorithm. To solve the overall problem we just need to add a beginning and an end. The middle part starts by assuming that we already know the maximum sum for all routes ending at a particular row. So to get the algorithm rolling we need a row for which we do have that information, and the first row of the triangle is the obvious candidate: all routes start there, so the maximum up to that point is just the value of the cell in that row. How about the end of the algorithm? To solve the problem, we need to know the maximum sum from top to bottom through the triangle; we've calculated the maximum sum ending at each of the cells in the last row, so we just need to take the maximum over all these cells.

Some code

To code this, we'll reach once again into our Functional Programming toolbox. I think the Aggregate method is just the hammer to crack this particular nut. Do you see how?

Ignore the input part for now, and assume we've got a sequence of List<int>, one List for each row of the triangle. Like this:

{75}
{95, 64}
{17, 47, 82}
{18, 35, 87, 10}
...

In this layout, it's a little tricky to visualise which cells are adjacent to which: you need to look at the cell directly above, and the one above but to the left. For example, the 35 in row four is adjacent to 47 and 17 in the row above.

We'll use Aggregate to 'flatten' the triangle represented by this sequence of Lists using the algorithm I described above: the aggregate that we'll be calculating is a List containing the maximum sum over all routes to each cell up to the last row we've processed. Remember that Aggregate needs a function that combines the aggregate so far with the next item in the series. We'll supply a function that takes the maximums so far and combines them with the cell values for the next row. The code looks like this:

return rows.Aggregate(
    new List<int> {0},
    (currentMaxima, nextRow) =>
        {
            var rowLength = nextRow.Count();
            return nextRow
                .Select((cell, index) =>
                        index == 0 ? cell + currentMaxima[index]
                        : index == rowLength - 1 ? cell + currentMaxima[index - 1]
                        : Math.Max(cell + currentMaxima[index - 1], cell + currentMaxima[index]))
                .ToList();
        }
    )
    .Max();

Notice that at line 7 we're processing the sequence nextRow (containing the cell values) using an overload of the Select method that gives us the index of each cell value in the sequence; we use this index to find the appropriate entries in the list currentMaxima containing the maximum sums to the cells in the row above.

And with that, I believe we've solved the problem. It takes 2 milliseconds to solve the 100-row triangle problem on my computer - a satisfying improvement over 20 billion years.

Here's the complete code. Note that in the source code project I have included two files containing the data for these problems.

namespace ProjectEuler
{
    [EulerProblem(18, Title = "Find the maximum sum travelling from the top of the triangle to the base.")]
    public class Problem18
    {
        public long Solve()
        {
            var dataPath = @"ProblemData\Problem18Data.txt";

            return MaximumSumThroughTriangleSolver.SolveFromFile(dataPath);
        }
    }

    [EulerProblem(67, Title = "Using an efficient algorithm find the maximal sum in the triangle?")]
    public class Problem67
    {
        public long Solve()
        {
            var dataPath = @"ProblemData\Problem67Data.txt";

            return MaximumSumThroughTriangleSolver.SolveFromFile(dataPath);
        }
    }

    public static class MaximumSumThroughTriangleSolver
    {
        public static int SolveFromFile(string dataFilePath)
        {
            string problemData = File.ReadAllText(dataFilePath);

            return Solve(problemData);
        }

        public static int Solve(string problemData)
        {
            var rows = problemData
                .Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
                .Select(line =>
                        line.Split(' ')
                            .Select(token => int.Parse(token))
                            .ToList()
                )
                .ToList();

            return rows.Aggregate(
                new List<int> {0},
                (currentMaxima, nextRow) =>
                    {
                        var rowLength = nextRow.Count();
                        return nextRow
                            .Select((cell, index) =>
                                    index == 0 ? cell + currentMaxima[index]
                                    : index == rowLength - 1 ? cell + currentMaxima[index - 1]
                                    : Math.Max(cell + currentMaxima[index - 1], cell + currentMaxima[index]))
                            .ToList();
                    }
                )
                .Max();
        }
    }
}

Monday, 25 August 2008

Happy Birthday to Me

My colleague Roman had his birthday a couple of days ago; today I have caught up with him in age. His mathematician's heart was saddened by the thought that this year is the last when his age can be represented in the form x^x - unless he discovers the secrets of the Philosopher's stone.

So now you know exactly how old I am.

Friday, 22 August 2008

Anyone for a Slice of LINQ?

Over on his blog, Eric White has a post about how to chunk a collection into groups of three using standard LINQ operators. As it happens, I've been thinking about the same issue as part of a solution to an upcoming Project Euler problem. Rather than use standard LINQ operators though, I've created a new method called Slice to do the job using a C# iterator.

Give Slice() a sequence, tell it how many items you want to be in each slice or chunk (pick your synonym - might depend on whether you prefer cake or chocolate!), and it will give you back a sequence consisting of slices of the original sequence (each slice is returned as a sequence - an IEnumerable). If the original sequence cannot be evenly divided into slices of exactly the size you require, then the last slice you get back will contain the leftover items (the crumbs, as I've called them in the code below - I clearly prefer cake!).

To adapt Eric's example to the gastronomic analogy that has emerged as I've been writing, consider this example:

public void SliceCakes()
{
    string[] cakeData =
        {
            "Mini Rolls Selection",
            "Cabury",
            "2.39",
            "6 Chocolate Flavour Cup Cakes",
            "Fabulous Bakin' Boys",
            "1.25",
            "Galaxy Cake Bars 5 Pack",
            "McVities",
            "1.00",
            "Apple Slices 6pk",
            "Mr Kipling",
            "1.39"
        };

    var cakes = cakeData
        .Slice(3)
        .Select(slice => new
                 {
                     Cake = slice.ElementAt(0),
                     Baker = slice.ElementAt(1),
                     Price = slice.ElementAt(2)
                 });

    foreach (var cake in cakes)
    {
        Console.WriteLine(cake);
    }
}

This produces the output:

{ Cake = Mini Rolls Selection, Baker = Cadbury, Price = 2.39 }
{ Cake = 6 Chocolate Flavour Cup Cakes, Baker = Fabulous Bakin' Boys, Price = 1.25 }
{ Cake = Galaxy Cake Bars 5 Pack, Baker = McVities, Price = 1.00 }
{ Cake = Apple Slices 6pk, Baker = Mr Kipling, Price = 1.39 }

Here's the code for Slice()

public static class Functional
{
    /// <summary>
    /// Slices a sequence into a sub-sequences each containing maxItemsPerSlice, except for the last
    /// which will contain any items left over
    /// </summary>
    public static IEnumerable<IEnumerable<T>> Slice<T>(this IEnumerable<T> sequence, int maxItemsPerSlice)
    {
        if (maxItemsPerSlice <= 0)
        {
            throw new ArgumentOutOfRangeException("maxItemsPerSlice", "maxItemsPerSlice must be greater than 0");
        }

        List<T> slice = new List<T>(maxItemsPerSlice);

        foreach (var item in sequence)
        {
            slice.Add(item);

            if (slice.Count == maxItemsPerSlice)
            {
                yield return slice.ToArray();
                slice.Clear();
            }
        }

        // return the "crumbs" that 
        // didn't make it into a full slice
        if (slice.Count > 0)
        {
            yield return slice.ToArray();
        }
    }
}

Thursday, 21 August 2008

Look at me. I'm on TV

We had a film crew from our regional BBC TV station come into the office on Tuesday. Their piece was aired at lunchtime today, and you can also see it on the BBC website . They focus on the need we have to recruit more people. We've already had one response from a lady who heard that we need people skilled in Mathematical Modelling. She rang to say that her daughter had a maths degree, and was really good with Lego, and did we have a job for her?

Look out for me. I'm the one you see peering over a partition whilst the fingers of my colleague Jason type gibberish into Visual Studio. I wasn't trying to get in shot. Honest!

Which .Net Excel IO Component should I use?

Introduction

When I wrote up my post about the ways of interacting with Excel from .Net I listed a number of .Net components that can be used for reading and writing native Excel files; I left you with the hard task of picking one for yourself. Today I'm going to write you a helping hand.

Over the last couple of days I've been doing some bug fixing on a little utility we have that analyses an Excel spreadsheet and creates named ranges for cells that are formatted in particular styles. The biggest bug I wanted to fix was terrible performance. When I first looked at the application a couple of months ago the performance was unbelievable: never mind making a cup of coffee whilst it did its stuff; we could go home and get a good nights kip before it delivered the results. It didn't take long to pin most of the blame for that on Excel COM Interop: each call to a property or method on a COM object has a significant overhead, and we were querying the Style property of each cell in a 120Mb workbook!

Hence my interest in .Net components for loading Excel files: I had a license to Syncfusion XlsIO available, and by switching the code to use that instead of Excel Automation through COM, the time to query cell styles dropped to almost nothing. However, XlsIO has some performance issues of its own (more on these in a moment), and it still could take up to an hour for the whole operation to complete. So I decided to have a look at the other components in the market place.

In the hope that there will be at least one other person on the web who finds this useful, I'm recording my first impressions of several of these Excel IO components. I'm not claiming that these are objective reviews: your kilometreage will almost certainly vary.

Syncfusion XlsIO

I've been using XlsIO on and off for a couple of years. Mostly, it has helped me get the job done; several times it has been the cause of temporary localised baldness. My thoughts:

It has an API that covers the vast majority of Excel features
The API is mostly consistent with the Excel COM API, though there are several exceptions to catch you out: Workbooks.Add vs Worksheets.Create being one trivial example.
There are some highly non-trivial differences from the Excel COM API. For example, the Excel Range object has Rows and Columns properties that return references to collections; the XlsIO Range object has properties with the same names, but each return arrays of Range objects and allocate a new array every time you access the property. This is a huge performance pitfall for the unwary.
The Range model is different to Excel's in that in XlsIO, a Range can only contain one Rectangular region, whereas excel allows Ranges to be created that are the union of many rectangular regions.
As I hinted in the introduction, there are other performance problems. In my large (120Mb) workbook, deleting a single row could take up 20 seconds. Deleting Named ranges was another costly operation. Excel demonstrates that these operations don't have to take that long.
The documentation is sparse, to put it politely, and the class reference often doesn't state more than the obvious. "ICombinedRange: represents a Combined Range" being one typical example. I have however had assurances from Syncfusion that they are working to improve this.

XLSReadWrite

I started my exploration of other components by downloading and installing XLSReadWrite. Then uninstalling it again. Call me a CLR snob, but I didn't like the thought of working with a component that is clearly designed for Delphi. This showed because the API commits two capital crimes: Every type in is prefixed with a T; and most of the namespaces contain just one type. The other point to note about XLSReadWrite is that the "shape" of the API is nothing like Excel's so any code you have using COM Interop would need a lot of reworking to use this component.

ActiveXLS

I'm afraid that CLR snobbishness also put me off ActiveXLS. The ActiveXLS team produce Spreadsheet components for both .Net and Java, and it appears that the .Net version is a straight port of the Java version: it is the paired "get_" and "set_" methods, and the absence of Properties that give the game away. Surely an organisation selling a component "optimised for Visual Studio" (as the home page claims) should make .Net developers feel at home and at least use Pascal casing for methods, rather than camel casing (which everybody knows should only be used for method parameters)?

The other thing that struck me as odd was that in the ActiveXLS forums (which have been active since 2006) there have only been about 200 posts. Is it that they have an intuitive API with superb documentation: or perhaps a very small user-base? Maybe I'm just cynical.

SpreadsheetGear

SpreadsheetGear was the component I finally settled on. This appears to be far and away the most mature (though correspondingly the most expensive) of the components that I looked at. Though I didn't try that part of it, this component also offers Windows Forms controls for editing spreadsheets. My impressions:

The Object model is very similar to Excel's, and easy to learn. It didn't take me long to port my code from XlsIO (which is also similar to Excel).
One big inconsistency is that properties giving access to cells use a 0-based index system rather than the 1-based system in Excel.
The Range model is (as far as I can tell) identical to Excel's. Intersect and Union methods are provided for Ranges and seem to work as I'd expect.
There's a surprising omission in the API in the current version (2007): no support for Styles. If you need style support you'll need to get the 2008 version (currently in beta).
All the non-public code is obfuscated. This can cause problems when debugging. For example when I was trying to look at the Workbook.Names collection in the Quick Watch window (VS 2008) I expanded one of the Name items in the collection, but was unable to inspect any of its properties. It was only by rewriting the expression in Quick Watch window to include a cast to IName that I could see the property values.
SpreadsheetGear do not have any public forums that I could find: the only way to get support is to fill in a form and wait for them to get back to you.
Performance of the component was very good. Remember the application that took all night with COM Interop, and up to an hour with XlsIO? It now takes under a minute with SpreadsheetGear.
As a bonus, here's the answer to an issue that took me an afternoon to figure out (and has caught me out again since then). When you supply a Range address to an IName (whether by using Names.Add or IName.RefersTo) remember to prefix it with an "=" sign; otherwise the IName.RefersToRange property won't get updated.

Honourable Mentions

I tried contacting Independentsoft about Spreadsheet.Net but never received a link to the evaluation download. I would judge by the website that this component isn't going to be as complete or mature as the others.
Gembox Software offer a completely free version of their spreadsheet control, limited to 5 worksheets of 150 rows each. Unfortunately they don't seem to offer an evaluation version of the full product, so I wasn't able to try it out on my big workbook. A quick scan of the online help shows that the API is not dissimilar to Excel's, and does follow the .Net framework design guidelines.
FarPoint Spread does Excel import and export, but its focus is on providing a spreadsheet-like Grid component, so I didn't look into this any further.
Infragistics have Infragistics.Excel but it looks like it can only be purchased as part of one of their suites. From the documentation, it doesn't look as fully featured as either XlsIO or SpreadsheetGear.
ComponentOne is another component suite vendor that has lumped an Excel IO component in with their suite. Again, the documentation shows that it has fairly limited capabilities compared with the leaders.
Aspose.Cells is a component that appears, by my reading of its documentation, to sit somewhere in the middle of the market, in terms of functionality and price. Aspose is a vendor that sells components for .Net and Java. and gets it right: the API's are "localised" for the framework culture. The .Net API gets Properties and Pascal cased methods, and the Java API keeps its get and set methods.

Tuesday, 19 August 2008

Pre-boarding IQ test

I've not seen this before: requiring passengers to solve a Linear Programming exercise to determine whether they and their buggies are allowed to board the bus:

Friday, 15 August 2008

Sudden Onset Digital Amnesia

On Monday a faithful servant of mine was diagnosed with a very serious illness. It was quite distressing.

I first noticed something was wrong when I went to wake him from Sleep last week. He groaned and made grinding noises rather than his usual cheerful whirs and beeps. No shimmering pearl appeared on his visage; instead the room filled with a smoky odour, and his face remained black.

I phoned the doctor as soon as I could. He came to the house and was soon examining the patient, not with stethoscope, but with digital voltmeter. It told a tale of voltage fluctuations on the motherboard well outside the usual range. This case called for immediate hospitalisation and an operation.

After successful surgery to transplant the power supply, things seemed to be looking up. Many vital signs were testing positive as the doctor went over them, one by one. Then he came to the DVD drive: no sign of life. He fiddled with cabling, and tested again. Nothing. A check of one of the DIMM slots revealed that that too had suffered. But then he made the worst discovery of all.

It was the Primary SATA drive. Dead. Infectiously dead in fact, because, when hooked up to the motherboard it drained life from the whole of the system. With that disconnected however, one bootable hard drive remained to my servant, allowing him to hobble to life again. And thus the diagnosis was made: Sudden Onset Digital Amnesia.

He is home again now, but a shell of his former self. Reduced from his hardware-accelerated, Composited Desktop, Windows Vista glory to Windows XP and GDI graphics; his precious photographic memories of my holidays and special family moments mostly gone; no longer holding any record of my financial accounts; unpublished Project Euler solutions passed away into the digital ether.

He says nothing, but sits with a reproachful air. Why was I not backed up? No RAID. No copying to optical media. Not even an online backup. And so I write, that his suffering might not be in vain, and that you might share in the moral of his story.

The Moral of the Story

So to me has happened one of those things which I always assumed happened only to other people. I had planned to make backups. I even had a box of TDK scratch proof DVD's on the shelf to hold the backups. But somehow it never seemed a top priority. Now I know better.

But I'm an optimist by nature, and things are not as bad as they could be. It was the photos from early 2007 onwards that were held on the disk that went down, but some of the best of those I've put on my daughter's blog. On the disk that remains I have everything prior to 2007. Getting that backed up is now a top priority.

A couple of weeks back I read (via Jurgen Appelo) about Mozy, the online Backup Service. The idea of online backups is too take all the hassle and risk out of the job. You install a piece of software on your computer; it monitors your disks for changes, and sends updated files to be stored somewhere in the cloud. No worries about what to store your backups on; no need to find a secure off-site storage location. It's all handled for you.

In all the reviews of competing Online Backup systems I've looked at MozyHome has consistently come out top, so I signed up. They have a free version that gives you 2GB backup space, but for only $4.95/month per computer (less if you take out yearly or two yearly subscriptions) you get unlimited space on their servers.

I've not used Mozy much yet, but so far this is what I've learnt about it:

You can backup entire folders, or you can ask Mozy to look for files matching particular criteria (file type, size, etc.)
All files are encrypted before being sent over the wire. You can choose your own key, or leave Mozy to take care of that. This adds security, but does have the downside of making Mozy backups take longer than some competitors
You can control how much bandwidth and CPU power Mozy takes up for the backup process, and schedule when backups should happen.
Mozy performs incremental block-level backups, only resending files or parts of files that have changed.
There are plenty of options for getting your files back again: there's Windows Explorer integration, a web-download option, or even an option to have your data put on DVD and posted to you (for an additional charge)

Unless you have the patience of a saint, and feel no particular urgency to get backed up, you'll need a Broadband Internet connection to make this happen, preferably one with very high download caps. But who doesn't have one of those these days?

And what about my other disk, and all those lost digital memories? I could pay to have it recovered professionally, but prices start from around £250. So I've decided to put that money towards a new laptop instead - that way I can compute downstairs, rather than tucked away in the spare bedroom, and maybe my wife will feel less of a computer widow. And sometime in the future, when all my remaining data is safely stashed away, in all the spare time I don't have, I'll have a hack at the disk myself.

Footnotes and Postscripts

MozyHome Reviews:

A variation of Online Backup: CrashPlan. The idea is to pair up with a friend who has a computer someplace else, and then use CrashPlan's software to mutually backup each others data over the Internet. The main benefit appears to be the absence of Monthly fees.

Tuesday, 12 August 2008

Utility to Convert numbers to Words

Update (5/6/2009): I’ve now created a new and improved version of the utility.

I've been having a play with Silverlight 2.0 in the last couple of days. Here's the result: a little utility to convert numbers to words based on the algorithm that I created for Project Euler 17. It's amazingly simple to get going: within minutes of completing the Silverlight install I had my C# code running inside the browser.

Hopefully I'll find time to write up a post on how I did it, and how to take advantage of the free Silverlight hosting that Microsoft offers.

Update: By popular request, the code for this utility is now available here.

A designer's thought process

It was the ForEach method that got me started. List<T> has a ForEach method that calls a delegate on each of its items, and many people are asking why there's no ForEach method on the Enummerable class in the BCL (the answer has to do with encouraging the side-effect-free programming style beloved by Functional-Programming purists). It's not a big deal really. You can knock one up in seconds:

public static class EnumerableExtensions
{
    public static void ForEach<T>(this IEnumerable<T> sequence, Action<T> action)
    {
        foreach (var item in sequence)
        {
            action(item);
        }
    }
}

I was using that very extension method just a few minutes ago, writing code like this.

ViewModel.DataSetList.ForEach(ds => ds.ClearDetails());

As I did a micro code-review before checking in my work, I asked myself: What happens if DataSetList is null? Oops. The code will blow up with a NullReferenceException. Hmm. But at which point? Interestingly, it's not in the line where I'm using the ForEach method, but inside the ForEach method that the exception will be thrown. It would be different if ForEach was defined as an instance method on DataSetList's type; but because ForEach is an extension method, it can safely be called on a null reference: null gets passed to the parameter marked with the this keyword (sequence in this case - then the NullReferenceException will happen in the foreach block). This property of extension methods has inspired a number of interesting techniques, the Null-propagating extension method being one of them.

But how should I fix my code?

Cogs turning, gears grinding.

I could modify ForEach to do nothing if it is passed a null reference in the sequence parameter. That will do the trick for this case, but what effect will it have on the rest of the code-base? Suppose that somewhere else I have an enumerable variable that should always have a value, but a bug means that occasionally it is null. With ForEach written as it is, I'm going to get a glaring unhandled exception dialog alerting me to the bug; if I make the change I've just suggested, all such bugs are going to be silently masked. Not a good idea, methinks.

OK then. How about creating an overload of the ForEach method that takes a parameter to control whether it ignores nulls or throws an exception? Ugh! Or as Brad Abrams puts it (Framework Design Guidelines pp187) "An API [like that] usually reflects the inability of the framework designer to make a decision."

Well, I finally came to my decision. Not so long ago, I was reading in the excellent book The Pragmatic Programmer about the principle of orthogonality. Applying this in miniature to my case, I realised that I needed to separate the responsibility of deciding what to do with null sequences from the responsibility of acting on the sequence. The answer was to create another extension method:

public static IEnumerable<T> EmptyIfNull<T>(this IEnumerable<T> sequence)
{
    return sequence ?? Enumerable.Empty<T>();
}

With that in place, I can fix my code like this:

ViewModel.DataSetList.EmptyIfNull().ForEach(ds => ds.ClearDetails());

Neat, tidy, intention-revealing, and applicable in so many other cases.

[I should note that the most obvious way of solving the problem is to wrap the troublesome line of code in an if (DataSetList != null) block. That didn't occur to me until I was writing up this post. I'm not sure what that says about me.]

[Update: Made EmptyIfNull() more compact thanks to a reminder about null-coalescing operator from Jacob Carpenter (see the comments)]

Friday, 8 August 2008

Project Euler Problem 17: Converting numbers to words

Fancy earning a Lotto prize of 1.2 quadrillion dollars? Then you should move to Zimbabwe! Don't be in too much of a hurry, though. By the time you'd brought it back home, your Zimbabwe-dollar haul would have shrunk to less than 4,000 US Dollars.

A couple of weeks back the BBC published an article talking about the problem of big numbers in Zimbabwe. Due to daily expenses measured in the trillions of dollars, with hyperinflation of 2,200,000% pushing that figure ever higher, the citizens of Zimbabwe have been googling to find the names for the next big numbers after trillions. Coincidentally, I was doing the same thing the day before I read the article as I went about solving Project Euler Problem 17.

The problem wants to know how many letters are needed to write out the numbers 1 to 1000 in words. Clearly this calls for an algorithm to convert numbers to words; since I always like to give a little extra to my loyal readers, I wanted to create an algorithm to handle numbers bigger than 1000. I ended up with one that can convert anything up to long.MaxValue: in words, nine quintillion, two hundred and twenty-three quadrillion, three hundred and seventy-two trillion, thirty-six billion, eight hundred and fifty-four million, seven hundred and seventy-five thousand, eight hundred and seven! Plenty powerful enough to write out Bill Gate's paychecks, and sufficient to keep the Zimbabweans going for the next few months!

So how does the algorithm work? A basic ingredient is a dictionary mapping the basic numbers to words. This covers the numbers 1 to 20, then every multiple of ten up to 100, then each larger number that has a special name. The key ingredient is a recursive function (functional programming had to come into it somewhere) that starts with the largest part of a number and repeatedly breaks it down into its components.

In Pseudo-code:

Let n be the number to process
For numbers smaller than 0, combine "Negative" with the result of calling the algorithm on the absolute value of n
For n between 0 and 20, look up the result in the word dictionary
For n between 20 and 100, calculate the number of 10s and number of units; look up the name for the number of tens; if there are any units, look up the appropriate name and combine the two results with a hyphen
For n between 100 and 1000, calculate the number of 100s and the remainder; look up the name for the number of hundreds; if there is any remainder, recurse to get its wordy representation, then combine it with result for the hundreds using "and" to join the two parts
For n bigger than 1000: decide which is the biggest named number (million, billion, etc.) that divides into n. Calculate the number of that base unit and the remainder. Recurse to convert the number of base units into words, and recurse to convert the remainder into words. Combine the two parts using ","

In C#:

public static class ConvertToWordsExtension
{
 private static Dictionary<long, string> _wordDictionary;
 private const int OneThousand = 1000;
 private const long OneMillion = 1000000;
 private const long OneBillion = 1000000000;
 private const long OneTrillion = 1000000000000;
 private const long OneQuadrillion = 1000000000000000;
 private const long OneQuintillion = 1000000000000000000;

 /// <summary>
 /// Converts a number to its English representation in words.
 /// </summary>
 /// <remarks>Uses the Short Scale for large numbers. See http://en.wikipedia.org/wiki/Long_and_short_scales for more details</remarks>
 public static string ConvertToWords(this long number)
 {
     EnsureWordDictionaryInitialised();

     if (number == long.MinValue)
     {
        throw new ArgumentOutOfRangeException();
     }

     return ConvertToWordsCore(number);
 }

 /// <summary>
 /// Converts a number to its English representation in words
 /// </summary>
 public static string ConvertToWords(this int number)
 {
     return ConvertToWords((long)number);
 }

 private static Dictionary<long, string> CreateWordDictionary()
 {
     return new Dictionary<long, string>
                {
                    {0, "zero"},
                    {1, "one"},
                    {2, "two"},
                    {3, "three"},
                    {4, "four"},
                    {5, "five"},
                    {6, "six"},
                    {7, "seven"},
                    {8, "eight"},
                    {9, "nine"},
                    {10, "ten"},
                    {11, "eleven"},
                    {12, "twelve"},
                    {13, "thirteen"},
                    {14, "fourteen"},
                    {15, "fifteen"},
                    {16, "sixteen"},
                    {17, "seventeen"},
                    {18, "eighteen"},
                    {19, "nineteen"},
                    {20, "twenty"},
                    {30, "thirty"},
                    {40, "forty"},
                    {50, "fifty"},
                    {60, "sixty"},
                    {70, "seventy"},
                    {80, "eighty"},
                    {90, "ninety"},
                    {100, "hundred"},
                    {OneThousand, "thousand"},
                    {OneMillion, "million"},
                    {OneBillion, "billion"},
                    {OneTrillion, "trillion"},
                    {OneQuadrillion, "quadrillion"},
                    {OneQuintillion, "quintillion"}
                };
 }

 private static void EnsureWordDictionaryInitialised()
 {
     // ensure thread-safety when caching our word dictionary
     // note: this doesn't prevent two copies of the word dictionary
     // being initialised - but that doesn't matter; only one would be
     // cached, the other garbage collected.
     if (_wordDictionary == null)
     {
         var dictionary = CreateWordDictionary();
         Interlocked.CompareExchange(ref _wordDictionary, dictionary, null);
     }
 }

 private static string ConvertToWordsCore(long number)
 {
     if (number < 0)
     {
         return "negative " + ConvertToWordsCore(Math.Abs(number));
     }

     // if between 1 and 19, convert to word
     if (0 <= number && number < 20)
     {
         return _wordDictionary[number];
     }

     // if between 20 and 99, convert tens to word then recurse for units
     if (20 <= number && number < 100)
     {
         return ProcessTens(number, _wordDictionary);
     }

     // if between 100 and 999, convert hundreds to word then recurse for tens
     if (100 <= number && number < OneThousand)
     {
         return ProcessHundreds(number, _wordDictionary);
     }

     if (OneThousand <= number && number < OneMillion)
     {
         return ProcessLargeNumber(number, OneThousand, _wordDictionary);
     }

     if (OneMillion <= number && number < OneBillion)
     {
         return ProcessLargeNumber(number, OneMillion, _wordDictionary);
     }

     if (OneBillion <= number && number < OneTrillion)
     {
         return ProcessLargeNumber(number, OneBillion, _wordDictionary);
     }

     if (OneTrillion <= number && number < OneQuadrillion)
     {
         return ProcessLargeNumber(number, OneTrillion, _wordDictionary);
     }

     if (OneQuadrillion <= number && number < OneQuintillion)
     {
         return ProcessLargeNumber(number, OneQuadrillion, _wordDictionary);
     }
     else
     {
         return ProcessLargeNumber(number, OneQuintillion, _wordDictionary);
     }
 }

 private static string ProcessLargeNumber(long number, long baseUnit, Dictionary<long, string> wordDictionary)
 {
     // split the number into number of baseUnits (thousands, millions, etc.)
     // and the remainder
     var numberOfBaseUnits = number / baseUnit;
     var remainder = number % baseUnit;
     // apply ConvertToWordsCore to represent the number of baseUnits as words
     string conversion = ConvertToWordsCore(numberOfBaseUnits) + " " + wordDictionary[baseUnit];
     // recurse for any remainder
                 conversion += remainder <= 0 ? ""
                            : (remainder < 100 ? " and " : ", ") + ConvertToWordsCore(remainder);
     return conversion;
 }

 private static string ProcessHundreds(long number, Dictionary<long, string> wordDictionary)
 {
     var hundreds = number / 100;
     var remainder = number % 100;
     string conversion = wordDictionary[hundreds] + " " + wordDictionary[100];
     conversion += remainder > 0 ? " and " + ConvertToWordsCore(remainder) : "";
     return conversion;
 }

 private static string ProcessTens(long number, Dictionary<long, string> wordDictionary)
 {
     Debug.Assert(0 <= number && number < 100);

     // split the number into the number of tens and the number of units,
     // so that words for both can be looked up independantly
     var tens = (number / 10) * 10;
     var units = number % 10;
     string conversion = wordDictionary[tens];
     conversion += units > 0 ? "-" + wordDictionary[units] : "";
     return conversion;
 }
}

Having coded the algorithm like that, I was casting around to find a way of making the ConvertToWordsCore method more compact. That was when I came across Igor Ostrovsky's post on the ternary conditional operation ("?") and was reminded of a neat trick. I re-wrote the method like this:

private static string ConvertToWordsCore(long number)
{
 return
     number < 0 ? "negative " + ConvertToWordsCore(Math.Abs(number))
         : 0 <= number && number < 20 ? _wordDictionary[number]
         : 20 <= number && number < 100 ? ProcessTens(number, _wordDictionary)
         : 100 <= number && number < OneThousand ? ProcessHundreds(number, _wordDictionary)
         : OneThousand <= number && number < OneMillion ? ProcessLargeNumber(number, OneThousand, _wordDictionary)
         : OneMillion <= number && number < OneBillion ? ProcessLargeNumber(number, OneMillion, _wordDictionary)
         : OneBillion <= number && number < OneTrillion ? ProcessLargeNumber(number, OneBillion, _wordDictionary)
         : OneTrillion <= number && number < OneQuadrillion ? ProcessLargeNumber(number, OneTrillion, _wordDictionary)
         : OneQuadrillion <= number && number < OneQuintillion ? ProcessLargeNumber(number, OneQuadrillion, _wordDictionary)
         : ProcessLargeNumber(number, OneQuintillion, _wordDictionary); // long.Max value is just over nine quintillion
}

That's much more compact, and to my eyes, a lot more elegant. The only downside I can see is that debugging might be more difficult. Which form do you prefer?

With this algorithm in place, solving Project Euler 17 is trivial:

[EulerProblem(17, Title="How many letters would be needed to write all the numbers in words from 1 to 1000?")]
public class Problem17
{
 public long Solve()
 {
     return
         1.To(1000)
             .Select(
                 number => number
                             .ConvertToWords()
                             .ToCharArray()
                             .Count(character => char.IsLetter(character))
             )
             .Sum();
 }
}

As always, full code (including some unit tests for the ConvertToWords method!) is available on my Project Euler code gallery page. You can also try out the algorithm in my handy online utility to convert numbers to words.

Update 27/5/09:Fixed bug where 'and' was missed out in numbers like 1006

Tuesday, 5 August 2008

I'll be speaking at PDC 2008...

... to whoever stands next to me in the lunch queue!

Yes! My employers (Paragon Simulation) have very kindly agreed to send me to Microsoft's Platform Developers Conference 2008 in Los Angeles in October. I've completed the booking today, along with flights. Just got to figure out the best way to get from Birmingham to Heathrow. I'm pretty excited about this. Going to LA (and the US) for the first time will be quite an experience, though I don't expect to see much more of it than the airport, my hotel room and the conference centre (or should that be center!). The agenda looks pretty intense: sessions from around 8:30am to 6:00pm most days,with Lounges, Labs, Expos and UnSessions going on till nine or ten in the evening. Will I have any time for blogging? I sure hope so, because I remember lapping up every post I could find on PDC last time round, and I hope to be a producer rather than a consumer this time!

What's Up?

Those in the know are hyping up the "killer content" at this year's PDC. According to Jon Box it's the one must-attend conference for 2008. The hot topics are likely to be Live Mesh (and Cloud Computing in general), IE 8, Windows 7, and the next version of the .Net Framework (including C# 4.0). Much of the content is kept under wraps until the beginning of the conference, but the published session abstracts give some clues to what will be revealed. Here's what I personally am looking forward to hearing about:

C# 4.0 and .Net 4.0: Anders himself will be speaking on the Future of C#. This will surely be where we find out what's in and what's out of C# 4.0. There has been lots of speculation about what might be coming, but only a few definite maybes (about Dynamic Method/Property Lookup and Covariance/Contravariance). I'm predicting some kind of improved language support for parallelism and concurrency. Connected with this is an intriguing-sounding session about Advances in COM Interop. The session abstract talks about a "much simplified story for interoperating between native COM-based hosts and managed code, including Office". I wonder whether this will involve the DLR?
Oslo: Soon after .Net 3.0 (WPF, WCF, etc.) was released the blogosphere heard rumours that the trio Sells, Anderson and Box in the Connected Systems Division had started work on a new product. Then there was talk about something they nicknamed Emacs.Net. In October last year project Oslo was announced. It seems to have something to do with Service Oriented and Distributed Applications, but it apparently comes in lots of different pieces (supplied in a box with one missing, no doubt!), including a new modelling language and some kind of visual editor, with tie-ins to Cloud computing. It sounds very grand: I'm sure it will be interesting, and could even be relevant to the work we're doing.
Cloud Computing: This seems to be one of the latest buzzwords. Amazon and Google have both jumped on the Cloud Computing bandwagon, and Microsoft are determined not to be left in the dust. They've been talking about Live Mesh already, and in the Agenda are sessions on things they're calling Building Block Services. SQL Server Data Services seems to fit into the picture somewhere. Hopefully this initiative will be more substantial than its name suggests!
Windows 7: For some reason, I'm not as excited about Windows 7 as the other things on the agenda (sticks thermometer in mouth). I use Vista at home, but at work we're plenty satisfied with Windows XP. All the goodness (from a programmer's point of view) these days seems to live in the Frameworks, which work on whichever version of Windows. It looks like it might stay that way for Windows 7, unless they make some big announcements at the PDC. I'll probably listen in on the session about Touch Computing, but at the moment there's nothing much else of relevance to me.

Help Wanted

Since this is my first ever PDC, I've been googling for tips on getting the most out of the conference: Zach Bonham has some useful ones about the conference in general. Scott Berkun's main piece of advice is that "Conversations are more valuable than the sessions". Bearing that in mind, I went searching for articles on how to talk to strangers, never my strongest point - the childhood lesson was drummed in too well. I did find some handy advice from David Funk. If any stranger would like to meet up with me at the conference for some practice in this art, get in touch!

Does anybody else have any useful tips about PDC or conferences in general? Is there anybody in the Birmingham area who would be interested in sharing transport to/from Heathrow (flying out on Saturday 25th, back on Saturday 1st)?

Finally, an unusual request: does anybody know of a sound Evangelical Church where I could worship on the Sunday, in the area near the conference venue - or know of a way to find one?

Pages