Unexamined life is not worth living
said Socrates.
I don’t know about that but to become a better, faster, more productive programmer it pays to examine what makes you un-productive.
Fixing bugs is one of those un-productive activities. You have to fix them but it would be even better if you didn’t write them in the first place.
Therefore it’s good to reflect after fixing a bug.
Why did the bug happen?
Could I have done something to not write the bug in the first place?
If I did write the bug, could I do something to diagnose or fix it faster?
This seems like a great idea that I wasn’t doing. Until now.
Here’s a random selection of bugs I found and fixed in SumatraPDF, with some reflections.
SumatraPDF is a C++ win32 Windows app. It’s a small, fast, open-source, multi-format PDF/eBook/Comic Book reader.
To keep the app small and fast I generally avoid using other people’s code. As a result most code is mine and most bugs are mine.
Let’s reflect on those bugs.
TabWidth doesn’t work
A user
reported that
TabWidth
advanced setting doesn’t work in 3.5.2 but worked in 3.4.6.
I looked at the code and indeed: the setting was not used anywhere. The fix was
to use it.
Why did the bug happen? It was a refactoring. I heavily refactored tabs control. Somehow during the rewrite I forgot to use the advanced setting when creating the new tabs control, even though I did write the code to support it in the control.
I guess you could call it sloppiness.
How could I not write the bug? I could review the changes more carefully. There’s no-one else working on this project so there’s no one else to do additional code reviews.
I typically do a code review by myself with
webdiff but let’s face it: reviewing changes right after writing them is the worst possible time. I’m biased to think that the code I just wrote is correct and I’m often mentally exhausted.
Maybe I should adopt a process when I review changes made yesterday with fresh, un-tired eyes?
How could I detect the bug earlier?. 3.5.2 release happened over a year ago. Could I have found it sooner?
I knew I was refactoring tabs code. I knew I have a setting for changing the look of tabs. If I connected the dots at the time, I could have tested if the setting still works.
I don’t make releases too often. I could do more testing before each release and at the very least verify all advanced settings work as expected.
The real problem In retrospect, I shouldn’t have implemented that feature at all. I like Sumatra’s customizability and I think it’s non-trivial contributor to it’s popularity but it took over a year for someone to notice and report that particular bug. It’s clear it’s not a frequently used feature. I implemented it because someone asked and it was easy. I should have said no to that particular request.
Fix printing crash by correctly ref-counting engine
Bugs can crash your program. Users rarely report crashes even though I did put effort into making it easy.
When I a crash happens I have a crash handler that saves the diagnostic info to a file and I show a message box asking users to report the crash and with a press of a button I launch a notepad with diagnostic info and a browser with a page describing how to submit that as a GitHub issue.
The other button is to ignore my pleas for help.
Most users overwhelmingly choose to ignore.
I know that because I also have
crash reporting system that sends me a crash report. I get thousands of crash reports for every crash reported by the user.
Therefore I’m convinced that the single most impactful thing for making software that doesn’t crash is to have a crash reporting system, look at the crashes and fix them.
This is not a perfect system because all I have is a call stack of crashed thread, info about the computer and very limited logs.
Nevertheless, sometimes all it takes is a look at the crash call stack and inspection of the code.
I saw a crash in printing code which
I fixed after some code inspection.
The clue was that I was accessing a seemingly destroyed instance of Engine.
That was easy to diagnose because I just refactored the code to add ref-counting to Engine so it was easy to connect the dots.
I’m not a fan of ref-counting. It’s easy to mess up ref-counting (add too many refs, which leads to memory leaks or too many releases which leads to premature destruction).
I’ve seen codebases where developers were crazy in love with ref-counting: every little thing, even objects with obvious lifetimes.
In contrast,, that was the first ref-counted object in over 100k loc of SumatraPDF code.
It was necessary in this case because I would potentially hand off the object to a printing thread so its lifetime could outlast the lifetime of the window for which it was created.
How could I not write the bug? It’s another case of sloppiness but I don’t feel bad. I think the bug existed there before the refactoring and this is the hard part about programming: complex interactions between distant, in space and time, parts of the program.
Again, more time spent reviewing the change could have prevented it.
As a bonus, I managed to simplify the logic a bit. Writing software is an incremental process.
I could feel bad about not writing the perfect code from the beginning but I choose to enjoy the process of finding and implementing improvements. Making the code and the program better over time.
Tracking down a chm thumbnail crash
Not all crashes can be fixed given information in crash report. I saw a report with crash related to creating a thumbnail crash.
I couldn’t figure out why it crashes but I could
add more logging to help figure out the issue if it happens again.
If it doesn’t happen again, then I win. If it does happen again, I will have more context in the log to help me figure out the issue.
Update: I did fix the crash.
This is the bast case scenario: a bug report with instructions to reproduce a crash.
If I can reproduce the crash when running debug build under the debugger, it’s typically very easy to figure out the problem and fix it.
In this case I’ve recently implemented an improved version of StrVec
(vector of strings) class. It had a compatibility bug compared to previous implementation in that StrVec::InsertAt(0)
into an empty vector would crash. Arguably it’s not a correct usage but existing code used it so I’ve added support to InsertAt()
at the end of vector.
How could I not write the bug? I should have written a unit test (which I did in the fix). I don’t blindly advocate unit tests. Writing tests has a productivity cost but for such low-level, relatively tricky code, unit tests are good.
I don’t feel too bad about it. I did write lots of tests for StrVec
and arguably this particular usage of InsertAt()
was borderline correct so it didn’t occur to me to test that condition.
Use after free
I saw a crash in crash reports, close to DeleteThumbnailForFile()
. I looked at the code:
if (!fs->favorites->IsEmpty()) {
// only hide documents with favorites
gFileHistory.MarkFileInexistent(fs->filePath, true);
} else {
gFileHistory.Remove(fs);
DeleteDisplayState(fs);
}
DeleteThumbnailForFile(fs->filePath);
I immediately spotted suspicious part: we call DeleteDisplayState(fs)
and then might use fs->filePath
.
I looked at DeleteDisplayState
and it does, in fact, deletes fs
and all its data, including filePath
. So we use freed data in a classic use after free bug.
The fix was simple: make a copy of fs->filePath
before calling DeleteDisplayState
and use that.
How could I not write the bug? Same story: be more careful when reviewing the changes, test the changes more.
If I fail that, crash reporting saves my ass. The bug didn’t last more than a few days and affected only one user.
I immediately fixed it and published an update.
Summary of being more productive and writing bug free software
If many people use your software, a crash reporting system is a must. Crashes happen and few of them are reported by users.
Code reviews can catch bugs but they are also costly and reviewing your own code right after you write it is not a good time. You’re tired and biased to think your code is correct.
Maybe reviewing the code a day after, with fresh eyes, would be better. I don’t know, I haven’t tried it.