Tuesday, 21 September 2010

Move over Xml – Make Way for Protocol Buffers

In loveIt was in the summer of 2001 that I fell in love with Xml. I was writing an Excel Add-in, and needed to store some configuration data to disk. VBA grudgingly offered me Open, Print, Input, and the like to do the job: “Nested fields, huh? Do it yerself!”. Then along came Xml with her DOMDocument class, elements and attributes. “So you’d like nested fields? Sure – just put one element inside another”.

I was smitten.I would run my program, then call colleagues over to look at the output in notepad. “Look, it’s human-readable!”, I’d say, and so obvious was my infatuation that my kind-hearted co-workers never asked, “Yes – but what kind of human would bother?”

Fast forward 9 years, and I’m over that crush. Xml and I are now just good friends. I still see her every couple of weeks, but now I have eyes for other message formats. Then I saw only pros. Now I can concede the cons.

Way back in 2004 Dare Obasanjo, on the Xml team at Microsoft, wrote the Xml Litmus test to help developers check that they’re using Xml appropriately. His criteria:

  1. There is a need to interoperate across multiple software platforms.
  2. One or more of the off-the-shelf tools for dealing with XML can be leveraged when producing or consuming the data.
  3. Parsing performance is not critical.
  4. The content is not primarily binary content, such as a music or image file.
  5. The content does not contain control characters that aren't carriage return, line feed, or tab because they are illegal in XML.

Dip this litmus strip in your scenario, and see how may of those five points turn green. If you score less than 4 out of 5, then Dare suggests you look elsewhere for your serialization format.

In the application I’m currently working on, I fell into the trap of using Xml just because it’s there. I have a data structure that I need to store in a single column in the database, and I picked Xml as the format. Problem is, most of what our application does revolves around this data structure, and it ends up writing, storing and parsing an awful lot of it.

As our databases ballooned in size, it has become clear to me that Dare needs to add another point to his list:

  1. Document size is not critical

There’s no getting away from it: Xml is verbose. And since my application is the only one that’s going to be using this data, and I don’t need to query it with XPath or validate it with Xml Schema or transform it with XSLT, and I do need to get at it fast, I’ve decided verbosity isn’t a price I’m willing to pay any longer. Xml, with regret, you’re fired! (at least for this part of the application).

Xml Alternatives

So what does one use when one does not want to use Xml? Being with Xml for so long has made me wary of proprietary binary formats, so I don’t want to go down the route of inventing my own.  I’m loathe to bring into the world further chunks of bytes that can only be interpreted with source code to hand.

So I’ve been investigating the state of the art.

An obvious first candidate was Binary Xml. But it has a major black mark against it in that that nobody can agree what Binary Xml should look like. Despite years of committee meetings, there is still no officially sanctioned or widely adopted Binary Xml specification. Lots of organisations have had a shot at it, including Microsoft who in invented their own with the specific needs of WCF in mind. Nicholas Allen, a WCF architect, has the details for that on his blog. 

A Bison!      JSON is the current darling of the Web 2.0 community, and BSON is its binary equivalent. James Newton-King has created a .Net implementation. I didn’t choose BSON for one simple reason: like JSON and Xml, it is self-describing with field names embedded in the message. This is a good thing if you want interoperability, but costs on compactness.

The last two widely adopted formats I found are Thrift and Protocol Buffers. These are both quite similar in their aims and accomplishments (not surprisingly, I understand, because Thrift was  developed by engineers at Facebook who liked Protocol Buffers when they’d worked with it while at Google and so reinvented it). Having established that much, and determined that Protocol Buffers is twice as popular as Thrift (currently 153 questions for Protocol Buffers vs 75 for Thrift on StackOverflow, so don’t argue!) I decided to dig into Protocol Buffers.

I like it.

There’s a little language you use for describing the structure of your messages – and of course, nested messages are supported. Then there’s a compiler that turns your message definitions into code in your language of choice – code for serializing and deserializing binary messages into objects matching your definitions.

There’s very clear documentation of the language and the encoding on the Protocol Buffers website. And if you want to try it out in .Net, there’s not one, but two rival implementations. In the red corner we have Protobuf.net by Marc Gravell, and in the blue corner DotNet-ProtoBufs, written by Jon Skeet (in a couple of idle minutes he had when Stack Overflow was down, no doubt).

I’ve spent the last week or so implementing enough of the Protocol Buffers spec for my needs, and I’ve got some very pleasing results. Storing my data in a Protocol Buffers format uses 45% of the space that SQL Server’s Xml data type requires, and it loads and deserializes in less than half the time.

Stick around. I’ve got a few Protocol Buffers tips to share.


Román Fresneda Quiroga said...

Hey! Remember that nasty thing we had to do related to the XML Column! Beware of breaking it now :-)

Sam said...

Thanks for the reminder. I'm hoping to avoid that for the time being because the serialized data takes up less space. But I have been thinking about it.

John said...

Hi Sam,

I agree with you completely.

I've see it being used in sending data over between completely homogeneous nodes in a datawarehouse where binary data would have taken up a fraction of the bandwidth and message validation plays no role whatsoever.

In XAML.. Seriously:
<HierarchicalDataTemplate DataType="x:Type local:ExprWrapper" ItemsSource="{Binding Children}">

If DataType and ItemsSource are supposed to point to typed objects, why the F*@K are they between quotes implying they are strings?. It's only because the XML spec says that attribute values should be within quotes. Significant whitespace FTW!

It seems that the only thing XML is not being used for is <punchline>adding meaning to text, MARKUP</punchline>.

Funny actually.

Always read your blog, keep it up!

Somenath Mukhopadhyay said...

Hey Sam,

nice information... i need to convert a GPX file into a PBF format. the GPX file format is like http://www.topografix.com/gpx/1/1/. will it be possible for you to help me out in this regards?

Post a Comment