Monday, 12 April 2010

Back from a bug hunt [or, Don’t Call Excel on a Background Thread]

I’m just back from a two-day bug hunt: an exhausting experience, but with a successful outcome that I’d like to share in case anybody else comes across this critter.

Photoxpress_8059464As he was testing our Excel Add-in, our Tester, Rich, noticed some rather odd behaviour. Whenever he used the Scan Tables feature,  then subsequently closed Excel, Excel.exe would refuse to vacate its slot in Task Manager until forcibly evicted.

The feature in question had my grubby fingerprints all over it, so I signed up to fix the bug. But where to start?

There are abundant examples of this kind of behaviour when automating Excel from the outside – usually caused by managed code not releasing its references to the COM objects. But my Add-in is inside Excel: any references it holds will be trashed when Excel is shutdown.

What then? I clearly remembered testing for just this kind of problem a couple of weeks back, and I was sure the application closed cleanly then. This gave me my starting point for a spot of Binary-Search debugging.

The Find

I began by grabbing a version of the code as it was at the time I thought the code was bug-free. Running it confirmed that was indeed the case. Then I pulled down a version that lay about half-way between then and now. That one exhibited all the symptoms of infestation. Hah! So the bug must have slipped in some time between the first version I checked and the half-way version. And thus I proceeded, wading through version after version, at each step slashing in half the list of changes in which the bug must have concealed itself.

After a couple of hours searching I narrowed it down to two versions: version 3320 had the bug, version 3319 was bug free. The check-in comment on version 3320 was rather revealing:

“Moved long-running table-scan operation to a background thread, and introduced progress reporting”

Those two words “background thread” were a smoking gun if ever I saw one, but who pulled the trigger? (I did mention that this bug was wanted for murder, right?). Now to adjust my binary-search-magnifying glass to zoom in, not on versions, but on lines of code.

I started by hacking the method where I launched the BackgroundWorker that did the table-scan so that, once again, it did its stuff in the UI thread. Sure enough, the bug went away, Excel closing cleanly on request.

Then, having restored the BackgroundWorker, I chopped its DoWork method in two, stubbing out one half at a time so that I could see which bunch of code was causing the problem. DoWork called out to a couple of other components, and I subject their methods to the same treatment. As method-call after method-call confirmed its innocence I closed in on the culprits: calls to the Excel Object model – specifically, calls to the Excel Object model made from a background thread.

Excel gave no indication that it had a problem being called in this way. When called on the background thread, it did exactly as when called on the foreground thread, with nary an exception. Seems it preferred to sulk, and protest in a more underhand way.

The Fix

Having found the problem, how was I to fix it?

I didn’t want to move all the work back on to the foreground thread: but clearly Excel wouldn’t listen if I talked to it from any other. Enter SynchronizationContext. SynchronizationContext is a handy little class that wraps up the complexities of executing code in another thread’s context. In the olden days of Windows Forms you’d use Control.BeginInvoke or Control.Invoke to do this kind of thing; these days WPF’s Dispatcher provides the same features. SynchronizationContext abstracts either Control or Dispatcher depending on what kind of message loop you’re running, and gives you two methods, Send and Post: Send executes code synchronously, Post asynchronously.

So all I needed to do was add a SynchronizationContext parameter to my two services which do all the talking to Excel (my IoC container kindly absorbed the consequences of this change – no ripple-down effect here).

Most places where I was calling Excel, I was querying some value. Like this, for example:

var result = (string)_application.Run("MyMacro")

Afterwards my change it looked like this:

var result = Query(() => (string)_application.Run("MyMacro"));

...

private TResult Query<TResult>(Func<TResult> query)
{
    return _synchronizationContext.SendQuery(query);
}

SendQuery is an extension method I created for SynchronizationContext that looks like this:

public static SynchronizationContextExtensions
{
    public static TResult SendQuery<TResult>(this SynchronizationContext context, Func<TResult> query)
    {
        var result = default(TResult);
        context.Send(parameter => result = query());
        return result;
    }
}

And that nailed it good and proper.

The Finale

So, Bing, if anybody asks you why their Excel Process doesn’t close after running an Add-in (managed or otherwise), tell them to make sure they are not calling into its object model from a background thread.

7 comments:

Nate said...

I have this same problem in a VSTO document project. Basically, i wanted to show a processing form while i do some lengthy tasks via a background worker - some of which access the excel object model. I do this for 100's of tasks, many of which access the excel object model, but only 1 of these tasks seems to reproduce this error (and does so 100% of the time). Is it possible that the bw can access some excel object model objects, but not others? Did you find that any and all access to the Excel Object model was causing you problems? For example - setting up a command bar in excel via my bw does not cause this problem. Perhaps i am misunderstanding the definition of "Excel Object Model?"

Unknown said...

Nate,
From what I've read of Excel, it ought to allow calls to the Object Model (ie the COM API, starting with the Application object and everything hanging off it) from any thread - the COM runtime takes care of marshalling them all onto the correct thread. So it's perfectly possible that some parts will behave, while other parts won't.

Which part is causing you problems?

Nate said...

The bw that is causing me problems uses an ADO data adapter to read up about 10 veryhidden sheets from inside the workbook into tables, which are held in memory as member variables of a single instance of a datamanager class. Im at home, but monday i will take a closer look at the code to see which interop objects it accesses. Perhaps, its due to the ado?

Nate said...

I finally got some time to look at this last week. It turned out to be the ADO. This is a document level VSTO excel project in 3.5 framework. I use the ADO on workbook open to read a ton of setting information from many hidden sheets (its much faster than reading any other way - though i recently found that creating a 2d object array, and setting it equal to range.value works almost as well). Anyway, i had been saving a temp copy of the workbook down and using the ado against that because ado uses the last saved copy. Moving the ado code to the main thread stopped the "excel not stopping" bug.

But i thought i could cutout the save and just read against the current book because im only reading up the settings on open, so the workbook must be the same as the last save. This then caused the same "excel not completely stopping" bug. In addition, it added some new odd behavior. If the file is opened and there are 2+ instances of excel open, the dll code runs in both applications causing a whole host of problems. It seems that the ado adapter opens a readonly copy of the document if another instance of excel is running, but otherwise it operates some other way because the dll doesnt run 2x. This was happening even on the main thread. Saving down a temp copy and running the ado against stops the bug.

Anyway, i know its pretty unrelated, but id thought i post it because the bug is the same, but the cause is different.

Unknown said...

@Nate,
Thanks for the update. It will probably be a great help to somebody hitting this page through google!

Uncle Timmy said...

Wow. This is exactly what I just ran into. You saved me time figuring out what the problem was. Surprised anyone else on the planet would be calling excel from the background thread, but good to know there are others out there.

Samuel Jack said...

Glad to hear that I helped you out!

Post a Comment