Monday, March 06, 2006

End of an Epic

We got to the end of a long-drawn and almost embarrassing bug last Friday. Dan Lash is the consultant on EMR (Electronic Medical Records) and he sent this refreshing mail. I guess Bunmi was so relaxed he got life going again by sending a link.

The mail as sent by Dan Lash

Ladies and gentlemen, I am proud to announce we are at the end of an era.

About two months ago the users of EMR started to notice a subtle bug in the scheduling system. The patient name listed in the subject of a schedule did not always match the actual patient attached to that schedule. This was observed through several reports:

Viewing a loaded patient’s schedule list
Viewing the branch calendar, and editing schedules
Noticing that some schedules changed branches

These observations led to several hypotheses for the cause of the problem:

Auto-compose was turned off
Creating schedules when a patient was/was not loaded
Recurrence was updating non-recurring schedules

While brainstorming possible causes, we also instituted policies to prevent errors and help track the bug:

Always verify auto-compose is on
Change the patient if you notice the subject doesn’t match the attached patient
Don’t use recurrence

These policies and continued user reporting helped us narrow the scope of our debugging.

Last week the whole development and management team for EMR decided on a plan of attack. We installed a hook (SQLProfiler) into the EMR database to watch all transactions for one day. We also created two isolated copies of the EMR database (one before the hook and one after). Once we had that information, we compared individual records and found one that had changed during the observed time period. We then traced that record through the report generated by the hook, watching for a particular pattern of stored procedure calls.
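The before/after comparison described above can be sketched roughly as follows. This is only an illustration, not the script the team used: the function name `changed_records`, the field names `subject` and `patient_id`, and the sample rows are all assumptions made for the example, and each snapshot is modelled as a plain dictionary keyed by record id rather than a real database copy.

```python
from typing import Dict, List, Tuple

# A row is modelled as a plain dict of column name -> value (assumed shape).
Row = Dict[str, object]


def changed_records(
    before: Dict[int, Row], after: Dict[int, Row]
) -> List[Tuple[int, Row, Row]]:
    """Return (id, before_row, after_row) for every record that exists in
    both snapshots and whose fields differ between them."""
    changes = []
    for record_id in sorted(before.keys() & after.keys()):
        if before[record_id] != after[record_id]:
            changes.append((record_id, before[record_id], after[record_id]))
    return changes


# Hypothetical sample data standing in for the two database copies.
before = {
    1: {"subject": "Patient A", "patient_id": 10},
    2: {"subject": "Patient B", "patient_id": 20},
}
after = {
    1: {"subject": "Patient A", "patient_id": 10},
    2: {"subject": "Patient B", "patient_id": 30},  # attached patient changed
}

for record_id, old, new in changed_records(before, after):
    print(record_id, old, new)
```

Any record id reported this way becomes a candidate to look up in the profiler trace for the same time window.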

Once we pinpointed the series of stored procedure calls that caused the damage, the development team started combing over the code to find the pattern that executed them. Luckily we were able to find the code! The code was then examined to find all possible ways it could be run, and each of those ways was analyzed for possible mistakes. To everyone’s relief, one of those code paths turned out to contain a slight oversight: no check was made to distinguish between a workflow schedule and a regular schedule.
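The kind of check that was missing might look something like the sketch below. The real EMR code is not shown in this mail, so everything here is hypothetical: the `ScheduleKind` enum, the `Schedule` fields, and the `set_patient` function are invented purely to illustrate guarding an update by schedule type.

```python
from dataclasses import dataclass
from enum import Enum


class ScheduleKind(Enum):
    # Hypothetical labels for the two schedule types mentioned in the mail.
    REGULAR = "regular"
    WORKFLOW = "workflow"


@dataclass
class Schedule:
    schedule_id: int
    kind: ScheduleKind
    patient_id: int


def set_patient(
    schedule: Schedule, patient_id: int, expected_kind: ScheduleKind
) -> bool:
    """Attach a patient only if the schedule is of the kind the caller
    intends to update; return True when the update was applied."""
    if schedule.kind is not expected_kind:
        # Without this guard, a schedule of the other kind could be
        # silently overwritten with the wrong patient.
        return False
    schedule.patient_id = patient_id
    return True


regular = Schedule(schedule_id=1, kind=ScheduleKind.REGULAR, patient_id=10)
workflow = Schedule(schedule_id=2, kind=ScheduleKind.WORKFLOW, patient_id=20)

# An update meant for workflow schedules leaves the regular one untouched.
set_patient(regular, 99, ScheduleKind.WORKFLOW)
set_patient(workflow, 99, ScheduleKind.WORKFLOW)
```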

From there the development team patched the code in question and analyzed other areas of the code for similar problems. While this was a huge step forward, it was only half of the battle. The EMR database still contained damaged records that had wrong patients attached to schedules. So the development team then went back to the database to fix it. They were able to exploit a feature added to EMR some time ago to generate a fix for the database.

The fix for the database was then tested to check that it fixed most (90% range) of the schedules. Some schedules remain that have a wrong patient attached, but the frequency of these broken schedules has been reduced to 1-3 broken schedules per week. These schedules must be corrected by hand.

During this two-month battle many people played an important role in tracking down and fixing this bug. I would like to thank Beth, Michelle and Kristeen for their reports. I’d also like to thank Keni, Demola, and Seun for their hard work in tracking down the code. Congratulations everyone, we did it!

More good days ahead :)