Monday, 14 June 2010

Project Euler 59: Cracking an XOR code

Code breaking is about as glamorous as it gets for a Geek. So when it comes to Project Euler Problem 59, think 24, with Jack Bauer unscathed amidst a hail of bullets, memory stick in one hand, terrorist’s neck throttled in the other, threatening terrible retribution to the bad guy’s family if the secret goes to the grave with him. Terrorist rasps out,

“There’s a file … cipher1.txt … ASCII format … urrrh … XOR … 3 … letter … password … lower case … all English”

What would Chloe O’Brian make of that when she received the file over a secure socket from Jack’s Smart Phone? Being the bright girl she is, I’ve no doubt that she’d fire up MonoDevelop (only the dumb FBI would use Visual Studio and Windoze) and write the following:

var encryptedBytes = File.ReadAllText(dataPath)
                        .Split(',')
                        .Select(byte.Parse)
                        .ToArray();

That would get her the encrypted data from the file as a sequence of bytes. Now to generate all possible passwords:

const byte FirstChar = (byte)'a';
const byte LastChar = (byte)'z';

var allPossiblePasswords = from a in FirstChar.To(LastChar)
                           from b in FirstChar.To(LastChar)
                           from c in FirstChar.To(LastChar)
                           select new[] { a, b, c };

XOR encryption requires that the password be repeated cyclically, abcabcabc.... So given a sequence containing the characters of password, how about a Repeat extension method that repeats the sequence ad infinitum? (if you look at the following code and hear the Infinite Loop alarms go off, remember that iterators are lazily evaluated):

public static IEnumerable<T> Repeat<T>(this IEnumerable<T> sequence)
{
    while (true)
    {
        foreach (var item in sequence)
        {
            yield return item;
        }
    }
}

While she’s at it, she’ll need to Xor two sequences together:

public static IEnumerable<byte> Xor(this IEnumerable<byte> sequenceA, IEnumerable<byte> sequenceB)
{
    return sequenceA.Zip(sequenceB, (left, right) => (byte)(left ^ right));
}

This uses the Zip method, new in .Net 4.0, which merges two sequences together, allowing elements from left and right sequences to be processed in pairs (if you’re surprised, as I was, that the (byte) cast is necessary, see this question on Stack Overflow).

Now she can try out all the passwords, seeing what they give her if she uses them to decrypt the text. Note that Chloe has already adopted .Net 4.0 and the PLINQ extensions, so she’s using AsParallel() to max out her 128 Core workstation. (ConvertAsciiBytesToString is another extension method that simply delegates to ASCIIEncoding.GetString).

var possibleDecryptions = from trialPassword in allPossiblePasswords.AsParallel()
                          let repeatedPassword = trialPassword.Repeat()
                          select encryptedBytes
                                          .Xor(repeatedPassword)
                                          .ConvertAsciiBytesToString();

But she’s hardly going to spend the next 24 hours trawling through the possible decryptions to pick out the one in recognisable English from the 17575 in gobledygook. So she needs some means of assessing them automatically. This is where Chloe is forever in my debt. She read last weeks episode of Functional Fun, so she knows about my LanguageAnalyser that uses frequency analysis to determine how likely a sample of text is to be written in a particular language.

Rather than using the analyser to determine which of several languages the sample is written in, Chloe can use her prior knowledge that the coded message is written in English, and determine which of the possible decryptions has letter frequencies that most closely match those in English.

var languageAnalyser = new LanguageAnalyser();

var decryptionsWithDistanceToEnglish = from decryptedString in possibleDecryptions
                                       select new
                                                  {
                                                      DecryptedString = decryptedString,
                                                      Distance = languageAnalyser.DetermineCloseness(decryptedString, LanguageProfiles.English)
                                                  };

var mostLikelyDecryption = decryptionsWithDistanceToEnglish.MinItem(p => p.Distance);

And that's it. Thanks to Linguistic Geometry and LINQ, Chloe can report the decrypted (and surprisingly biblical!) message to Jack, so that he can get on with saving the world.

Ahem. Before we got carried away we were trying to solve Project Euler Problem 59, and that asks for the sum of the Ascii values of the decrypted message. So we need one more step:

var sumOfAsciiValues = mostLikelyDecryption.DecryptedString
                        .ConvertStringToAsciiBytes()
                        .Aggregate(0, (runningTotal, next) => runningTotal += next);

The full code is shown below, and is also available on MSDN Code Gallery.

namespace ProjectEuler
{
    [EulerProblem(59, Title="Using a brute force attack, can you decrypt the cipher using XOR encryption?")]
    public class Problem59
    {
        public long Solve()
        {
            const byte FirstChar = 97;
            const byte LastChar = 122;

            var dataPath = @"ProblemData\Problem59.txt";

            var encryptedBytes = File.ReadAllText(dataPath)
                                    .Split(',')
                                    .Select(byte.Parse)
                                    .ToArray();

            var allPossiblePasswords = from a in FirstChar.To(LastChar)
                                       from b in FirstChar.To(LastChar)
                                       from c in FirstChar.To(LastChar)
                                       select new[] { a, b, c };

            var possibleDecryptions = from trialPassword in allPossiblePasswords.AsParallel()
                                      let repeatedPassword = trialPassword.Repeat()
                                      select encryptedBytes
                                          .Xor(repeatedPassword)
                                          .ConvertAsciiBytesToString();

            var languageAnalyser = new LanguageAnalyser();

            var decryptionsWithDistanceToEnglish = from decryptedString in possibleDecryptions
                                                   select new
                                                              {
                                                                  DecryptedString = decryptedString,
                                                                  Distance = languageAnalyser.DetermineCloseness(decryptedString, LanguageProfiles.English)
                                                              };

            var mostLikelyDecryption = decryptionsWithDistanceToEnglish.MinItem(p => p.Distance);

            var sumOfAsciiValues = mostLikelyDecryption.DecryptedString
                .ConvertStringToAsciiBytes()
                .Aggregate(0, (runningTotal, next) => runningTotal += next);

            return sumOfAsciiValues;
        }
    }
}

9 comments:

svick said...

I think that trying all possible passwords is quite wasteful. Because each decrypted character depends only on one character of the password, you could figure them out independently. This would change the time complexity from exponential to linear (!). It could also introduce some bias to the frequency of letters, but that shouldn't matter for longer texts.

Of course, in this case, there is only 17,576 possible passwords, so even your solution should be pretty quick even on computer with 1/100 the number of cores Chloe has.

Sam said...

@svick,
You're right, of course, that I'm not showing the most efficient mode of cracking the password. Most of my project euler solutions have focused on elegance of LINQ implementation, rather than elegance of mathematics.

Anonymous said...

my immediate reaction is: why use C# to do functional programming? It's so clunky! Or maybe it's just the way you wrote our your variable names. Even my python solution seems nicer (if only cause it's more succinct). I did use your ideas to create the solution, which shows they are good ideas, but C# seems just too clunky.

data = map(int,open('cipher1.txt').read().split(","))

def repeat(l):
while True:
for i in l:
yield i

l = map(ord,"abcdefghijklmnopqrstuvwxyz")
for a in l:
for b in l:
for c in l:
dec=[foo^bar for foo,bar in zip(data,repeat((a,b,c)))]
if any(ch<32 or ch>125 for ch in dec) or\
ord("#") in dec:
continue
print map(chr,(a,b,c))
print "".join(map(chr,dec))

Sam said...

@Anonymous,
I guess it's a matter of personal preference. Your python solution is commendably succinct, but I find the C# solution more immediately readable.

Sævar said...

Very nice. But then again, that is to be expected from your posts. :)

I personally solved this euler problem in haskell, and when I finally solved it and saw everyone solving in a way similar to yours(Instead of a language analyzer, they simply tried to find the key that would result in the highest ratio of readable characters), I realized that I had gone pretty overboard.

I actually implemented the kasiski examination(It's on wikipedia) and solved it that way, so in theory it can break any polyalphabetic cipher.

But mad props for your language analyzer.

Sam said...

Thanks Sævar!

I'd not heard of the Kasiski examination before. Interesting!

Sam said...

Thanks Sævar!

I'd not heard of the Kasiski examination before. Interesting!

Sam said...

@svick,
You're right, of course, that I'm not showing the most efficient mode of cracking the password. Most of my project euler solutions have focused on elegance of LINQ implementation, rather than elegance of mathematics.

Sam said...

Great tutorial, I solved this problem by searching for ' the ' though. :)

Post a Comment