SumatraPDF is a fast, small, open-source PDF reader for Windows, written in C++.

This article describes how I implemented StrVec class for efficiently storing multiple strings.

Much ado about the strings

Strings are among the most used types in most programs.

Arrays of strings are also used often. I count ~80 uses of StrVec in SumatraPDF code.

This article describes how I implemented an optimized array of strings in SumatraPDF C++ code .

No STL for you

Why not use std::vector<std::string>?

In SumatraPDF I don’t use STL. I don’t use std::string, I don’t use std::vector.

For me it’s a symbol of my individuality, and my belief in personal freedom.

As described here, minimum size of std::string on 64-bit machines is 32 bytes for msvc / gcc and 24 bytes for short strings (15 chars for msvc / gcc, 22 chars for clang).

For longer strings we have more overhead:

³²⁄₂₄ bytes for the header
memory allocator overhead
- allocator metadata
- padding due to rounding allocations to at least 16 bytes

There’s also std::vector overhead: for fast appends (push()) std::vectorimplementations over-allocated space

Longer strings are allocated at random addresses so they can be spread out in memory. That is bad for cache locality and that often cause more slowness than executing lots of instructions.

Design and implementation of StrVec

StrVec (vector of strings) solves all of the above:

per-string overhead of only 8 bytes
strings are laid out next to each other in memory

StrVec

High level design of StrVec:

backing memory is allocated in singly-linked pages
similar to std::vector, we start with small page and increase the size of the page. This strikes a balance between speed of accessing a string at random index and wasted space
unlike std::vector we don’t reallocate memory (most of the time). That saves memory copy when re-allocating backing space

Here’s all there is to StrVec:

struct StrVec {
    StrVecPage* first = nullptr;
    int nextPageSize = 256;
    int size = 0;
}

size is a cached number of strings. It could be calculated by summing the size in all StrVecPages.

nextPageSize is the size of the next StrVecPage. Most array implementation increase the size of next allocation by 1.4x - 2x.

I went with the following progression: 256 bytes, 1k, 4k, 16k, 32k and I cap it at 64k.

I don’t have data behind those numbers, they feel right. Bigger page wastes more space. Smaller page makes random access slower because to find N-th string we need to traverse linked list of StrVecPage.

nextPageSize is exposed to allow the caller to optimize use. E.g. if it expects lots of strings, it could set nextPageSize to a large number.

StrVecPage

Most of the implementation is in StrVecPage. The big idea here is:

we allocate a block of memory
strings are allocated from the end of memory block
at the beginning of the memory block we build and index of strings. For each string we have:
- u32 size
- u32 offset of the string within memory block, counting from the beginning of the block

The layout of memory block is:

StrVecPage
struct { size u32; offset u32 } []
… not yet used space
strings

This is StrVecPage:

struct StrVecPage {
    struct StrVecPage* next;
    int pageSize;
    int nStrings;
    char* currEnd;
}

next is for linked list of pages.

Since pages can have various sizes we need to record pageSize.

nStrings is number of strings in the page and currEnd points to the end of free space within page.

Implementing operations

Appending a string

Appending a string at the end is most common operation.

To append a string:

we calculate how much memory inside a page it’ll need: str::Len(string) + 1 + sizeof(u32) + sizeof(u32). +1 is for 0-termination for compatibility with C APIs that take char*, and 2xu32 for size and offset.
If we have enough space in last page, we add size and offset at the end of index and append a string from the end i.e. `currEnd - (str::Len(string) + 1).
If there is not enough space in last page, we allocate new page

We can calculate how much space we have left with:

int indexEntrySize = sizeof(u32) + sizeof(u32); // size + offset
char* indexEnd = (char*)pageStart + sizeof(StrVecPage) + nStrings*indexEntrySize
int nBytesFree = (int)(currEnd - indexEnd)

Removing a string

Removing a string is easy because it doesn’t require moving memory inside StrVecPage.

We do nStrings-- and move index values of strings after the removed string. I don’t bother freeing the string memory within a page. It’s possible but complicated enough I decided to skip it. You can compact StrVec to remove all overhead.

If you do not care about preserving order of strings after removal, I haveRemoveAtFast() which uses a trick: instead of copying memory of all index values after removed string, I copy a single index from the end into a slot of the string being removed.

Replacing a string or inserting in the middle

Replacing a string or inserting a string in the middle is more complicated because there might not be enough space in the page for the string.

When there is enough space, it’s as simple as append.

When there is not enough space, I re-use the compacting capability: I compact all existing pages into a single page with extra space for the string and some extra space as an optimization for multiple inserts.

Iteration

A random access requires traversing a linked list. I think it’s still fast because typically there aren’t many pages and we only need to look at a single nStrings value.

After compaction to a single page, random access is as fast as it could ever be.

C++ iterator is optimized for sequential access:

struct iterator {
  const StrVec* v;
  int idx;
  
  // perf: cache page, idxInPage from prev iteration
  int idxInPage;
  StrVecPage* page;
}

We cache the current state of iteration as page and idxInPage. To advance to next string we advance idxInPage. If it exceeds nStrings, we advance to page->next.

Optimized search

Finding a string is as optimized as it could be without a hash table.

Typically to compare char* strings you need to call str::Eq(s, s2) for every string you compare it to.

That is a function call and it has to touch s2 memory. That is bad for performance because it blows the cache.

In StrVec I calculate length of the string to find once and then traverse the size / offset index. Only when size is different I have to compare the strings. Most of the time we just look at offset / size in L1 cache, which is very fast.

Compacting

If you know that you’ll not be adding more strings to StrVec you can compact all pages into a single page with no overhead of empty space.

It also speeds up random access because we don’t have multiple pages to traverse to find the item and a given index.

Representing a nullptr char*

Even though I have a string class, I mostly use char* in SumatraPDF code.

In that world empty string and nullptr are 2 different things.

To allow storing nullptr strings in StrVec (and not turning them into empty strings on the way out) I use a trick: a special u32 value kNullOffset represents nullptr.

StrVec is a string pool allocator

In C++ you have to track the lifetime of each object:

you allocate with malloc() or new
when you no longer need to object, you call free() or delete

However, the lifetime of allocations is often tied together.

For example in SumatraPDF an opened document is represented by a class. Many allocations done to construct that object last exactly as long as the object.

The idea of a pool allocator is that instead of tracking the lifetime of each allocation, you have a single allocator. You allocate objects with the same lifetime from that allocator and you free them with a single call.

StrVec is a string pool allocator: all strings stored in StrVec have the same lifetime.

Testing

In general I don’t advocate writing a lot of tests. However, low-level, tricky functionality like StrVec deserves decent test coverage to ensure basic functionality works and to exercise code for corner cases.

I have 360 lines of tests for ~700 lines of of implementation.

Potential tweaks and optimization

When designing and implementing data structures, tradeoffs are aplenty.

Interleaving index and strings

I’m not sure if it would be faster but instead of storing size and offset at the beginning of the page and strings at the end, we could store size / string sequentially from the beginning.

It would remove the need for u32 of offset but would make random access slower.

Varint encoding of size and offset

Most strings are short, under 127 chars. Most offsets are under 16k.

If we stored size and offset as variable length integers, we would probably bring down average per-string overhead from 8 bytes to ~4 bytes.

Implicit size

When strings are stored sequentially size is implicit as difference between offset of the string and offset of next string.

Not storing size would make insert and set operations more complicated and costly: we would have to compact and arrange strings in order every time.

Storing index separately

We could store index of size / offset in a separate vector and use pages to only allocate string data.

This would simplify insert and set operations. With current design if we run out of space inside a page, we have to re-arrange memory.

When offset is stored outside of the page, it can refer to any page so insert and set could be as simple as append.

The evolution of StrVec

The design described here is a second implementation of StrVec.

The one before was simply a combination of str::Str (my std::string) for allocating all strings and Vec<u32> (my std::vector) for storing offset index.

It had some flaws: appending a string could re-allocate memory within str::Str. The caller couldn’t store returned char* pointer because it could be invalidated.

As a result the API was akward and potentially confusing: I was returning offset of the string so the string was str::Str.Data() + offset.

The new StrVec doesn’t re-allocate on Append, only (potentially) on InsertAt and SetAt.

The most common case is append-only which allows the caller to store the returned char* pointers.

Before implementing StrVec I used Vec<char*>. Vec is my version of std::vector and Vec<char*> would just store pointer to individually allocated strings.

Cost vs. benefit

I’m a pragmatist: I want to achieve the most with the least amount of code, the least amount of time and effort.

While it might seem that I’m re-implementing things willy-nilly, I’m actually very mindful of the cost of writing code.

Writing software is a balance between effort and resulting quality.

One of the biggest reasons SumatraPDF so popular is that it’s fast and small. That’s an important aspect of software quality.

When you double click on a PDF file in an explorer, SumatraPDF starts instantly. You can’t say that about many similar programs and about other software in general.

Keeping SumatraPDF small and fast is an ongoing focus and it does take effort.

StrVec.cpp is only 705 lines of code. It took me several days to complete. Maybe 2 days to write the code and then some time here and there to fix the bugs.

That being said, I didn’t start with this StrVec. For many years I used obvious Vec<char*>.

Then I implemented somewhat optimized StrVec. And a few years after that I implemented this ultra-optimized version.

References

SumatraPDF is a small, fast, multi-format (PDF/eBook/Comic Book and more), open-source reader for Windows.

The implementation described here: StrVec.cpp, StrVec.h, StrVec_ut.cpp

By the time you read this, the implementation could have been improved.

Edna	speedy note taking app with super powers
SumatraPDF	small, fast, free PDF / ePub / comic book reader for Windows