You don’t want your software to crash, do you?

This post describes my experiences in making SumatraPDF crash less.

SumatraPDF is a Windows desktop app. It’s a fast viewer for PDF, ePub, comic books etc. It’s small and yet full of features.

Know thy crashes

The most important step in fixing crashes is knowing about them.

Windows versions differ from each other, and people customize Windows and use your software in different ways, sometimes in ways you would never think of.

Those variations can lead to bugs or crashes. I have no hope of testing my software in all possible configurations. If you’re Microsoft or Adobe you can reinvest some of the revenue to hire an army of testers, set up compatibility labs etc., but for a single developer this is not realistic.

Bugs most often lurk in untested code and even a very good testing effort won’t encounter all things that can go wrong in real life.

Get the crash reports automatically

Very few people bother to submit bug reports and crashes. When a program crashes, they just shrug and restart. The only realistic way to be informed about crashes is to automatically gather crash reports without user involvement.

This is a proven idea. Microsoft did it for Windows. Mozilla and Google did it for their browsers.

How to get crashes

Regardless of the platform, the solution involves two parts:

code in the software itself: when a crash happens, it runs a crash handler which creates a crash report and sends it to the server
a server, which accepts crash reports from the software

The server

The server part is simple: it’s a Go server running on Hetzner that accepts crash reports in text form via HTTP POST requests, saves them to disk and provides a simple UI for browsing them.

Crash reports are deleted after a week because there’s no point keeping them. If a crash stops happening, it was fixed. If it keeps happening, I’ll get a new crash report.

Intercepting crashes in C++ on Windows

SumatraPDF is written in C++. We want our code to be executed when a crash or other fatal error happens.

To be notified about fatal errors in the C runtime, use signal(SIGABRT, onSignalAbort). This registers onSignalAbort to be called on SIGABRT:

void __cdecl onSignalAbort(int) {
    // re-install the handler because it can be called many times
    // (from multiple threads) and raise() resets the handler
    signal(SIGABRT, onSignalAbort);
    CrashMe();
}

I just induce a hardware crash (by writing to invalid memory location 0) so that it’s handled by the crash handler:

inline void CrashMe() {
    char* p = nullptr;
    // cppcheck-suppress nullPointer
    *p = 0; // NOLINT
}

To register with the C++ runtime, call ::set_terminate(onTerminate). Similarly:

void onTerminate() {
    CrashMe();
}
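
The re-install step in onSignalAbort can be tried out without crashing anything. The sketch below is a stand-in, not SumatraPDF code: the handler only counts its invocations, and the names CountingAbortHandler, InstallAbortHandler and gAbortCount are made up for illustration.

```cpp
#include <csignal>

// Number of times the handler ran. sig_atomic_t is the type that is
// safe to modify from inside a signal handler.
volatile std::sig_atomic_t gAbortCount = 0;

// Stand-in for onSignalAbort: counts calls instead of crashing.
extern "C" void CountingAbortHandler(int) {
    // Re-install first: some C runtimes reset the handler to the
    // default before invoking it, so a second SIGABRT would otherwise
    // bypass us.
    std::signal(SIGABRT, CountingAbortHandler);
    gAbortCount = gAbortCount + 1;
}

// Installs the handler. Returns false if the runtime rejected it.
bool InstallAbortHandler() {
    return std::signal(SIGABRT, CountingAbortHandler) != SIG_ERR;
}
```

After InstallAbortHandler(), each std::raise(SIGABRT) runs the handler and returns, because the handler was put back before it finished. Note that abort() itself would still terminate the process once the handler returns.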

To register for exceptions generated by the CPU:

gDumpEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
if (!gDumpEvent) {
    log("InstallCrashHandler: skipping because !gDumpEvent\n");
    return;
}
gDumpThread = CreateThread(nullptr, 0, CrashDumpThread, nullptr, 0, nullptr);
if (!gDumpThread) {
    log("InstallCrashHandler: skipping because !gDumpThread\n");
    return;
}
gPrevExceptionFilter = SetUnhandledExceptionFilter(CrashDumpExceptionHandler);
// 1 means that our handler will be called first, 0 would be: last
AddVectoredExceptionHandler(1, CrashDumpVectoredExceptionHandler);
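
The snippet above creates an event and a thread before installing the handlers. One plausible reading is that the exception handler only signals the event, and the pre-created thread, which still has a healthy stack, builds the report. The sketch below shows that hand-off with portable C++ primitives instead of Win32 ones. The DumpWorker class and its methods are hypothetical, and a real crash path would avoid std::string because it allocates.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical, portable stand-in for the gDumpEvent / gDumpThread pair.
class DumpWorker {
  public:
    // Start the worker up front, while the process is still healthy.
    DumpWorker() { thread_ = std::thread([this] { Run(); }); }

    ~DumpWorker() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (!requested_) {
                // No crash happened: wake the worker so it can exit.
                quit_ = true;
                requested_ = true;
            }
        }
        cv_.notify_all();
        thread_.join();
    }

    // Called from the "crash handler": wakes the worker and blocks
    // until the report has been built on the worker's own stack.
    std::string RequestDump(const std::string& reason) {
        std::unique_lock<std::mutex> lock(mu_);
        reason_ = reason;
        requested_ = true;
        cv_.notify_all();
        cv_.wait(lock, [this] { return done_; });
        return report_;
    }

  private:
    void Run() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return requested_; });
        if (!quit_) {
            report_ = "crash report: " + reason_;
        }
        done_ = true;
        cv_.notify_all();
    }

    std::mutex mu_;
    std::condition_variable cv_;
    bool requested_ = false;
    bool done_ = false;
    bool quit_ = false;
    std::string reason_;
    std::string report_;
    std::thread thread_;  // assigned last, after the other members exist
};
```

The point of the design is that the expensive work never runs on the thread that crashed, whose stack may be exhausted or corrupted.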

Generating a crash report

The most important part of a crash report is a readable call stack of the thread that crashed, which looks like:

sumatrapdf.exe!RectF::IsEmpty+0x0 \src\utils\GeomUtil.cpp+314
sumatrapdf.exe!DisplayModel::GetContentStart+0x3f \src\DisplayModel.cpp+1196
sumatrapdf.exe!DisplayModel::GoToPrevPage+0x7e \src\DisplayModel.cpp+1386
sumatrapdf.exe!CanvasOnMouseWheel+0x191 \src\Canvas.cpp+1318
sumatrapdf.exe!WndProcCanvasFixedPageUI+0x300 \src\Canvas.cpp+1672
sumatrapdf.exe!WndProcCanvas+0xa7 \src\Canvas.cpp+1993
user32.dll!CallWindowProcW+0x589
user32.dll!TranslateMessage+0x292
sumatrapdf.exe!RunMessageLoop+0x15c \src\SumatraStartup.cpp+531
sumatrapdf.exe!WinMain+0x11fd \src\SumatraStartup.cpp+1407
sumatrapdf.exe!__scrt_common_main_seh+0x106

To get a call stack you can use StackWalk64() from dbghelp.dll. But those are just addresses in memory. You then have to map each address to a loaded dll and an offset in that dll. Then you have to match the offset in the dll to a function and an offset in that function. And then match that offset in the function to the source code file and line that generated that code.

To do all of that you need symbols in .pdb format.

Because SumatraPDF is open source I decided on an unorthodox approach: for each version of the SumatraPDF executable, I store .pdb symbols in online storage (currently it happens to be Cloudflare’s S3-compatible R2).

When a crash happens I download those symbols locally, unpack them, and initialize dbghelp.dll with their locations.

I then resolve addresses in memory to dll name, function name, offset in function and source code file name and line number.
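
To make the address-to-name mapping concrete, here is a minimal sketch of the lookup with hand-built tables. In SumatraPDF this work is done by dbghelp.dll against the .pdb data. The Module and Symbol structs and the ResolveAddress function are hypothetical, and source file and line lookup is left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical symbol table entry: function start, relative to module base.
struct Symbol {
    uint64_t offset;
    std::string name;
};

// Hypothetical loaded module; `symbols` must be sorted by offset.
struct Module {
    uint64_t base;
    uint64_t size;
    std::string name;
    std::vector<Symbol> symbols;
};

// Formats an address as "module!function+0xoffset", falling back to
// "module+0xoffset" or a bare hex address when less is known.
std::string ResolveAddress(const std::vector<Module>& modules, uint64_t addr) {
    char buf[32];
    for (const Module& m : modules) {
        if (addr < m.base || addr - m.base >= m.size) {
            continue;  // address is outside this module
        }
        const uint64_t off = addr - m.base;
        // Find the first symbol that starts after `off`; the one just
        // before it is the function containing the address.
        auto it = std::upper_bound(
            m.symbols.begin(), m.symbols.end(), off,
            [](uint64_t o, const Symbol& s) { return o < s.offset; });
        if (it == m.symbols.begin()) {
            std::snprintf(buf, sizeof(buf), "+0x%llx", (unsigned long long)off);
            return m.name + buf;
        }
        --it;
        std::snprintf(buf, sizeof(buf), "+0x%llx",
                      (unsigned long long)(off - it->offset));
        return m.name + "!" + it->name + buf;
    }
    std::snprintf(buf, sizeof(buf), "0x%llx", (unsigned long long)addr);
    return buf;
}
```

With a module based at 0x400000 and a symbol RunMessageLoop at offset 0x2000, the address 0x40215c resolves to a frame in the same style as the call stack shown earlier.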

Other stuff in crash report:

info about OS version, processor etc. in case those correlate to the crash
list of loaded dlls. It’s quite common that other software injects their dlls into all executables running and those dlls might have bugs that cause the crash
SumatraPDF configuration. To my detriment I’ve made SumatraPDF quite customizable and sometimes bugs only happen when certain options are used and if I’m not using the same settings, I can’t reproduce the bug even if I execute the same steps
logs. When I can’t figure out a certain crash, I can add additional logging to help me understand what leads to the crash

I include the Git revision hash in the executable, and it is included in the crash report. That way I can post-process the crash report on the server and generate a link to the source code on GitHub for each stack frame.
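
The post-processing step could work along these lines. This is a sketch, not the server's code (which is written in Go): the function name FrameToGitHubLink is made up, the repository URL is passed in as a placeholder, and the frame format is assumed to match the call stack example above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turns a frame such as
//   sumatrapdf.exe!WndProcCanvas+0xa7 \src\Canvas.cpp+1993
// into a GitHub link to that file and line at revision `rev`.
// Returns an empty string for frames without source information.
std::string FrameToGitHubLink(const std::string& repoUrl,
                              const std::string& rev,
                              const std::string& frame) {
    // The source location, if present, is the last space-separated token.
    const std::size_t sp = frame.rfind(' ');
    if (sp == std::string::npos) {
        return "";
    }
    const std::string loc = frame.substr(sp + 1);
    const std::size_t plus = loc.rfind('+');
    if (loc.empty() || loc[0] != '\\' || plus == std::string::npos) {
        return "";
    }
    std::string path = loc.substr(0, plus);
    const std::string line = loc.substr(plus + 1);
    // Windows path separators become URL separators.
    for (char& c : path) {
        if (c == '\\') {
            c = '/';
        }
    }
    return repoUrl + "/blob/" + rev + path + "#L" + line;
}
```

Frames from system dlls such as user32.dll carry no file or line, so they produce no link.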

When a crash happens the program is compromised, so I take care to pre-compute as much info as possible before the crash handler executes. If the crash handler itself crashes, I won’t get the crash report.
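
One way to pre-compute is to fill a fixed-size buffer at startup and have the crash path only append to it, so that nothing allocates memory after the crash. The sketch below illustrates the idea. CrashBuffer, PrecomputeCrashInfo and OnCrash are hypothetical names, not SumatraPDF's.

```cpp
#include <cstddef>
#include <cstring>

// Fixed-size buffer reserved up front so the crash path never allocates.
struct CrashBuffer {
    char data[4096] = {};
    std::size_t len = 0;

    // Appends as much of `s` as fits. Never allocates and always keeps
    // the buffer NUL-terminated; excess text is silently dropped.
    void Append(const char* s) {
        std::size_t n = std::strlen(s);
        const std::size_t room = sizeof(data) - 1 - len;
        if (n > room) {
            n = room;
        }
        std::memcpy(data + len, s, n);
        len += n;
        data[len] = '\0';
    }
};

static CrashBuffer gCrashInfo;

// Called once at startup, while the process is still healthy.
void PrecomputeCrashInfo(const char* osVersion, const char* appVersion) {
    gCrashInfo.Append("OS: ");
    gCrashInfo.Append(osVersion);
    gCrashInfo.Append("\nVer: ");
    gCrashInfo.Append(appVersion);
    gCrashInfo.Append("\n");
}

// Called from the crash handler: only adds what could not be known earlier.
void OnCrash(const char* callStack) {
    gCrashInfo.Append("Stack:\n");
    gCrashInfo.Append(callStack);
}
```

The less the handler does, the fewer chances it has to fail in a process whose heap or locks may already be corrupted.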

SumatraPDF experience

How does it work in practice?

I implemented the system described here in Sumatra 1.5. Sumatra is a rather complicated piece of C++ code and quite popular (several thousand downloads per day).

Before 1.5 we had a system where we would save the minidump to disk and, after a crash, ask the user to report it in our bug tracker and attach the minidump to the bug report.

Almost no one did that. I only got a few crash reports from users in a few months. The automated system was sending tens of crash reports per day.

Once I knew about the problems, I would try to fix them.

Some problems I could fix just by looking at the crash report.

Some required writing stress tests to make them easier to reproduce locally.

Some of them I can’t fix (e.g. because they are caused by buggy printer drivers or other software that injects buggy dlls into SumatraPDF process).

I do know that I fixed some of the bugs. I can see that a new release generates fewer crashes, and by looking at crash reports I can tell that some crashes that happened frequently in previous releases do not happen anymore.

Building an automated crash reporting system was the best investment I could have made for improving the reliability of SumatraPDF.

The alternatives

While the general idea is always the same, there are different ways of implementing it.

On Windows a simpler solution is to capture so-called minidumps (using the MiniDumpWriteDump() Windows API) instead of going to the trouble of generating human-readable crash reports client side.

I did that too. The problem with that approach is that you have to inspect each crash dump manually in the debugger (e.g. WinDbg). I wrote a Python script that automated the process (you can script it by launching the cdb debugger with the right parameters and making it run !analyze -v).

Unfortunately, cdb is buggy and was hanging on some dump files. It’s probably possible to work around that with a timeout in the Python script, but at that point I stopped caring.

Windows provides native support for minidumps. Google took the minidump design and provided a cross-platform implementation for Windows, Mac and Linux as part of the Breakpad project, which was then replaced by Crashpad.

Breakpad is the crash reporting system used by Google for Chrome and Mozilla for Firefox. It contains both client and server parts for native (C/C++ or Objective-C) code.

I used it once for a Mac app. For Objective-C I prefer the approach described above as it’s simpler to implement, but I’m sure Breakpad is a solid and well-tested approach.

On Windows, crash reports from your app are already sent to Microsoft as part of Windows Error Reporting. Apparently, it’s possible for third-party developers to get access to those reports, but I never did that so I don’t know how.

References

CrashHandler.cpp is the crash handling code in SumatraPDF
Chrome’s crash reporting
Mozilla crash reporting
