I’m just back from a two-day bug hunt: an exhausting experience, but with a successful outcome that I’d like to share in case anybody else comes across this critter.
As he was testing our Excel Add-in, our Tester, Rich, noticed some rather odd behaviour. Whenever he used the Scan Tables feature, then subsequently closed Excel, Excel.exe would refuse to vacate its slot in Task Manager until forcibly evicted.
The feature in question had my grubby fingerprints all over it, so I signed up to fix the bug. But where to start?
There are abundant examples of this kind of behaviour when automating Excel from the outside – usually caused by managed code not releasing its references to the COM objects. But my Add-in is inside Excel: any references it holds will be trashed when Excel is shutdown.
What then? I clearly remembered testing for just this kind of problem a couple of weeks back, and I was sure the application closed cleanly then. This gave me my starting point for a spot of Binary-Search debugging.
The Find
I began by grabbing a version of the code as it was at the time I thought the code was bug-free. Running it confirmed that was indeed the case. Then I pulled down a version that lay about half-way between then and now. That one exhibited all the symptoms of infestation. Hah! So the bug must have slipped in some time between the first version I checked and the half-way version. And thus I proceeded, wading through version after version, at each step slashing in half the list of changes in which the bug must have concealed itself.
After a couple of hours searching I narrowed it down to two versions: version 3320 had the bug, version 3319 was bug free. The check-in comment on version 3320 was rather revealing:
“Moved long-running table-scan operation to a background thread, and introduced progress reporting”
Those two words “background thread” were a smoking gun if ever I saw one, but who pulled the trigger? (I did mention that this bug was wanted for murder, right?). Now to adjust my binary-search-magnifying glass to zoom in, not on versions, but on lines of code.
I started by hacking the method where I launched the BackgroundWorker that did the table-scan so that, once again, it did its stuff in the UI thread. Sure enough, the bug went away, Excel closing cleanly on request.
Then, having restored the BackgroundWorker, I chopped its DoWork method in two, stubbing out one half at a time so that I could see which bunch of code was causing the problem. DoWork called out to a couple of other components, and I subject their methods to the same treatment. As method-call after method-call confirmed its innocence I closed in on the culprits: calls to the Excel Object model – specifically, calls to the Excel Object model made from a background thread.
Excel gave no indication that it had a problem being called in this way. When called on the background thread, it did exactly as when called on the foreground thread, with nary an exception. Seems it preferred to sulk, and protest in a more underhand way.
The Fix
Having found the problem, how was I to fix it?
I didn’t want to move all the work back on to the foreground thread: but clearly Excel wouldn’t listen if I talked to it from any other. Enter SynchronizationContext. SynchronizationContext is a handy little class that wraps up the complexities of executing code in another thread’s context. In the olden days of Windows Forms you’d use Control.BeginInvoke or Control.Invoke to do this kind of thing; these days WPF’s Dispatcher provides the same features. SynchronizationContext abstracts either Control or Dispatcher depending on what kind of message loop you’re running, and gives you two methods, Send and Post: Send executes code synchronously, Post asynchronously.
So all I needed to do was add a SynchronizationContext parameter to my two services which do all the talking to Excel (my IoC container kindly absorbed the consequences of this change – no ripple-down effect here).
Most places where I was calling Excel, I was querying some value. Like this, for example:
var result = (string)_application.Run("MyMacro")
Afterwards my change it looked like this:
var result = Query(() => (string)_application.Run("MyMacro"));
...
private TResult Query<TResult>(Func<TResult> query)
{
return _synchronizationContext.SendQuery(query);
}
SendQuery is an extension method I created for SynchronizationContext that looks like this:
public static SynchronizationContextExtensions
{
public static TResult SendQuery<TResult>(this SynchronizationContext context, Func<TResult> query)
{
var result = default(TResult);
context.Send(parameter => result = query());
return result;
}
}
And that nailed it good and proper.
The Finale
So, Bing, if anybody asks you why their Excel Process doesn’t close after running an Add-in (managed or otherwise), tell them to make sure they are not calling into its object model from a background thread.