This article describes how I implemented StrVec
class for efficiently storing multiple strings.
Much ado about the strings
Strings are among the most used types in most programs.
Arrays of strings are also used often. I count ~80 uses of
StrVec
in
SumatraPDF code.
This article describes how I implemented an optimized array of strings in SumatraPDF
C++ code .
No STL for you
Why not use std::vector<std::string>
?
In SumatraPDF I don’t use STL. I don’t use std::string
, I don’t use std::vector
.
For me it’s a symbol of my individuality, and my belief in personal freedom.
As
described here, minimum size of
std::string
on 64-bit machines is 32 bytes for msvc / gcc and 24 bytes for short strings (15 chars for msvc / gcc, 22 chars for clang).
For longer strings we have more overhead:
- 32⁄24 bytes for the header
- memory allocator overhead
- allocator metadata
- padding due to rounding allocations to at least 16 bytes
There’s also std::vector
overhead: for fast appends (push()) std::vector
implementations over-allocated space
Longer strings are allocated at random addresses so they can be spread out in memory. That is bad for cache locality and that often cause more slowness than executing lots of instructions.
Design and implementation of StrVec
StrVec (vector of strings) solves all of the above:
- per-string overhead of only 8 bytes
- strings are laid out next to each other in memory
StrVec
High level design of StrVec
:
- backing memory is allocated in singly-linked pages
- similar to
std::vector
, we start with small page and increase the size of the page. This strikes a balance between speed of accessing a string at random index and wasted space
- unlike
std::vector
we don’t reallocate memory (most of the time). That saves memory copy when re-allocating backing space
Here’s all there is to StrVec
:
struct StrVec {
StrVecPage* first = nullptr;
int nextPageSize = 256;
int size = 0;
}
size
is a cached number of strings. It could be calculated by summing the size in all StrVecPage
s.
nextPageSize
is the size of the next StrVecPage
. Most array implementation increase the size of next allocation by 1.4x - 2x.
I went with the following progression: 256 bytes, 1k, 4k, 16k, 32k and I cap it at 64k.
I don’t have data behind those numbers, they feel right. Bigger page wastes more space. Smaller page makes random access slower because to find N-th string we need to traverse linked list of StrVecPage
.
nextPageSize
is exposed to allow the caller to optimize use. E.g. if it expects lots of strings, it could set nextPageSize
to a large number.
StrVecPage
Most of the implementation is in StrVecPage
. The big idea here is:
- we allocate a block of memory
- strings are allocated from the end of memory block
- at the beginning of the memory block we build and index of strings. For each string we have:
- u32 size
- u32 offset of the string within memory block, counting from the beginning of the block
The layout of memory block is:
StrVecPage
struct { size u32; offset u32 } []
- … not yet used space
- strings
This is StrVecPage
:
struct StrVecPage {
struct StrVecPage* next;
int pageSize;
int nStrings;
char* currEnd;
}
next
is for linked list of pages.
Since pages can have various sizes we need to record pageSize
.
nStrings
is number of strings in the page and currEnd
points to the end of free space within page.
Implementing operations
Appending a string
Appending a string at the end is most common operation.
To append a string:
- we calculate how much memory inside a page it’ll need:
str::Len(string) + 1 + sizeof(u32) + sizeof(u32)
. +1 is for 0-termination for compatibility with C APIs that take char*, and 2xu32
for size and offset.
- If we have enough space in last page, we add size and offset at the end of index and append a string from the end i.e. `currEnd - (str::Len(string) + 1).
- If there is not enough space in last page, we allocate new page
We can calculate how much space we have left with:
int indexEntrySize = sizeof(u32) + sizeof(u32); // size + offset
char* indexEnd = (char*)pageStart + sizeof(StrVecPage) + nStrings*indexEntrySize
int nBytesFree = (int)(currEnd - indexEnd)
Removing a string
Removing a string is easy because it doesn’t require moving memory inside StrVecPage
.
We do nStrings--
and move index values of strings after the removed string. I don’t bother freeing the string memory within a page. It’s possible but complicated enough I decided to skip it. You can compact StrVec
to remove all overhead.
If you do not care about preserving order of strings after removal, I haveRemoveAtFast()
which uses a trick: instead of copying memory of all index values after removed string, I copy a single index from the end into a slot of the string being removed.
Replacing a string or inserting in the middle
Replacing a string or inserting a string in the middle is more complicated because there might not be enough space in the page for the string.
When there is enough space, it’s as simple as append.
When there is not enough space, I re-use the compacting capability: I compact all existing pages into a single page with extra space for the string and some extra space as an optimization for multiple inserts.
Iteration
A random access requires traversing a linked list. I think it’s still fast because typically there aren’t many pages and we only need to look at a single nStrings
value.
After compaction to a single page, random access is as fast as it could ever be.
C++ iterator is optimized for sequential access:
struct iterator {
const StrVec* v;
int idx;
// perf: cache page, idxInPage from prev iteration
int idxInPage;
StrVecPage* page;
}
We cache the current state of iteration as page
and idxInPage
. To advance to next string we advance idxInPage
. If it exceeds nStrings
, we advance to page->next
.
Optimized search
Finding a string is as optimized as it could be without a hash table.
Typically to compare char*
strings you need to call str::Eq(s, s2)
for every string you compare it to.
That is a function call and it has to touch s2 memory. That is bad for performance because it blows the cache.
In StrVec
I calculate length of the string to find once and then traverse the size / offset index. Only when size is different I have to compare the strings. Most of the time we just look at offset / size in L1 cache, which is very fast.
Compacting
If you know that you’ll not be adding more strings to StrVec
you can compact all pages into a single page with no overhead of empty space.
It also speeds up random access because we don’t have multiple pages to traverse to find the item and a given index.
Representing a nullptr char*
Even though I have a string class, I mostly use char*
in SumatraPDF code.
In that world empty string and nullptr
are 2 different things.
To allow storing nullptr
strings in StrVec
(and not turning them into empty strings on the way out) I use a trick: a special u32
value kNullOffset
represents nullptr
.
StrVec is a string pool allocator
In C++ you have to track the lifetime of each object:
- you allocate with
malloc()
or new
- when you no longer need to object, you call
free()
or delete
However, the lifetime of allocations is often tied together.
For example in SumatraPDF an opened document is represented by a class. Many allocations done to construct that object last exactly as long as the object.
The idea of a pool allocator is that instead of tracking the lifetime of each allocation, you have a single allocator. You allocate objects with the same lifetime from that allocator and you free them with a single call.
StrVec
is a string pool allocator: all strings stored in StrVec
have the same lifetime.
Testing
In general I don’t advocate writing a lot of tests. However, low-level, tricky functionality like StrVec
deserves decent test coverage to ensure basic functionality works and to exercise code for corner cases.
I have 360 lines of tests for ~700 lines of of implementation.
Potential tweaks and optimization
When designing and implementing data structures, tradeoffs are aplenty.
Interleaving index and strings
I’m not sure if it would be faster but instead of storing size and offset at the beginning of the page and strings at the end, we could store size / string sequentially from the beginning.
It would remove the need for u32 of offset but would make random access slower.
Varint encoding of size and offset
Most strings are short, under 127 chars. Most offsets are under 16k.
If we stored size and offset as variable length integers, we would probably bring down average per-string overhead from 8 bytes to ~4 bytes.
Implicit size
When strings are stored sequentially size is implicit as difference between offset of the string and offset of next string.
Not storing size would make insert and set operations more complicated and costly: we would have to compact and arrange strings in order every time.
Storing index separately
We could store index of size / offset in a separate vector and use pages to only allocate string data.
This would simplify insert and set operations. With current design if we run out of space inside a page, we have to re-arrange memory.
When offset is stored outside of the page, it can refer to any page so insert and set could be as simple as append.
The evolution of StrVec
The design described here is a second implementation of StrVec
.
The one before was simply a combination of str::Str
(my std::string
) for allocating all strings and Vec<u32>
(my std::vector
) for storing offset index.
It had some flaws: appending a string could re-allocate memory within str::Str
. The caller couldn’t store returned char*
pointer because it could be invalidated.
As a result the API was akward and potentially confusing: I was returning offset of the string so the string was str::Str.Data() + offset
.
The new StrVec
doesn’t re-allocate on Append
, only (potentially) on InsertAt
and SetAt
.
The most common case is append-only which allows the caller to store the returned char*
pointers.
Before implementing StrVec
I used Vec<char*>
. Vec
is my version of std::vector
and Vec<char*>
would just store pointer to individually allocated strings.
Cost vs. benefit
I’m a pragmatist: I want to achieve the most with the least amount of code, the least amount of time and effort.
While it might seem that I’m re-implementing things willy-nilly, I’m actually very mindful of the cost of writing code.
Writing software is a balance between effort and resulting quality.
One of the biggest reasons SumatraPDF so popular is that it’s fast and small. That’s an important aspect of software quality.
When you double click on a PDF file in an explorer, SumatraPDF starts instantly. You can’t say that about many similar programs and about other software in general.
Keeping SumatraPDF small and fast is an ongoing focus and it does take effort.
StrVec.cpp
is only 705 lines of code. It took me several days to complete. Maybe 2 days to write the code and then some time here and there to fix the bugs.
That being said, I didn’t start with this StrVec. For many years I used obvious Vec<char*>
.
Then I implemented somewhat optimized StrVec
. And a few years after that I implemented this ultra-optimized version.
References
By the time you read this, the implementation could have been improved.