Here’s an example of optimizing a piece of code, starting from simple, unoptimized code to extremely optimized, yet still simple code.
The task is to write a class that parses a string in the form name1=val1&name2=val2&name3=val3&... and provides a simple interface for accessing names and values.
It’s not a made up example but a very real (albeit simplified) problem of parsing arguments of HTTP GET request.
We’ll be trying to minimize size of the code and the amount of memory used at runtime for storing parsed strings. Here’s the interface:
class ParsedStr {
public:
ParsedStr();
bool parse(const char* src);
size_t count();
const char* name(size_t idx);
const char* value(size_t idx);
};
Before we proceed, we need a way to know if our code is working. All versions of the code provide identical interface to the caller, so we’ll write a unit test driver ParsedStrTest.cpp (txt) and a makefile (txt) which will compile all versions as separate programs in debug, release and size-optimize configurations. That way we’ll be able to compare the code size for each version and ensure all versions work the same way.
Common utility routines are in ParsedStrUtil.cpp (txt).
A naive STL version will simply use a vector of strings for storing names and a vector of strings for storing values.
class ParsedStr
{
protected:
std::vector<std::string> _names;
std::vector<std::string> _values;
bool ParsedStr::parse(const char *s)
{
assert(0 == _names.size()); /* don't call me twice */
char *scopy = strdup(s);
for(;;) {
char *name = delim_str_iter(&scopy);
if (NULL == name) { /* finished parsing */
free(scopy);
return true;
}
char *value = delim_str_iter(&scopy);
if (NULL == value) { /* malformed string */
free(scopy);
return false;
}
_names.push_back(std::string(name));
_values.push_back(std::string(value));
}
}
size_t ParsedStr::count()
{
return _names.size();
}
const char* ParsedStr::name(size_t idx)
{
return _names[idx].c_str();
}
const char* ParsedStr::value(size_t idx)
{
return _values[idx].c_str();
}
How much memory are we using to store N strings? We really don’t know. Not only it depends on a particular STL implementation but whatever that value is, STL does a good job at hiding it from us. You can find out if you’re determined enough, but easy it ain’t.
Let’s get rid of STL and use char * instead of std::string and char ** instead of std::vector<std::string>.
STL‘s std::vector takes care of resizing the array but we would have to code it manually. We’ll side-step the problem and do it in two steps:
It’s both simpler and faster.
class ParsedStr
{
protected:
size_t _count;
const char **_names;
const char **_values;
bool ParsedStr::parse(const char *s)
{
assert(NULL == _names); /* don't call me twice */
char *scopy = strdup(s);
int str_count = 0;
char *stmp = scopy;
while (NULL != delim_str_iter(&stmp)) {
++str_count;
}
if (str_count % 2 != 0) { /* malformed string */
free(scopy);
return false;
}
_count = str_count / 2;
_names = (const char**)malloc(_count * sizeof(_names[0]));
_values = (const char**)malloc(_count * sizeof(_values[0]));
stmp = scopy;
for (int i=0; i<_count; i++) {
_names[i] = strdup(stmp);
str_skip(&stmp);
_values[i] = strdup(stmp);
str_skip(&stmp);
}
free(scopy);
return true;
}
ParsedStr::~ParsedStr()
{
for (int i=0; i < _count; i++) {
free((void*)_names[i]);
free((void*)_values[i]);
}
free(_names);
free(_values);
}
In the naive version we make N*2 allocations to store N name/value pairs. Most of those strings are short, so we’re probably badly hit by malloc()‘s overhead. Furthermore, malloc() is relatively slow, so doing less of it can only be good.
Let’s notice two important properties that allows us to optimize this:
ParsedStr::parse()Our optimized version simply duplicates the original string, replaces = and & characters with ’\0’ (for compatibility with C) and all the pointers point inside that copy.
When we’re done we only need to free the string copy.
class ParsedStr
{
protected:
size_t _count;
const char *_str;
const char **_names;
const char **_values;
bool ParsedStr::parse(const char *str)
{
assert(NULL == _str); /* don't call me twice */
_str = strdup(str);
int str_count = 0;
char *s = (char*)_str;
while (NULL != delim_str_iter(&s)) {
++str_count;
}
if (str_count % 2 != 0) { /* malformed string */
return false;
}
_count = str_count / 2;
_names = (const char**)malloc(_count * sizeof(_names[0]));
_values = (const char**)malloc(_count * sizeof(_values[0]));
s = (char*)_str;
int idx = 0;
while (*s) {
_names[idx] = s;
str_skip(&s);
_values[idx] = s;
str_skip(&s);
++idx;
}
assert(idx == _count);
return true;
}
ParsedStr::~ParsedStr()
{
free((void*)_str);
free(_names);
free(_values);
}
Not only this version uses less memory, is faster but also the code is simpler and smaller.
Notice that we allocate two arrays of equal length and we do it only once. We can easily change the code to only use one array of twice the size.
We adopt a simple indexing convention for accessing elements. Element i of _name array is now at index i*2 of the combined array, and element i of _value array is at index i*2+1.
This saves one malloc() and a little bit of memory due to less malloc() overhead.
class ParsedStr
{
protected:
size_t _count;
const char *_str;
const char **_names_values;
ParsedStrOptOneArray.cpp (raw):
bool ParsedStr::parse(const char *str)
{
assert(NULL == _str); /* don't call me twice */
_str = strdup(str);
int str_count = 0;
char *s = (char*)_str;
while (NULL != delim_str_iter(&s)) {
++str_count;
}
if (str_count % 2 != 0) { /* malformed string */
return false;
}
_count = str_count / 2;
_names_values = (const char**)malloc(_count * 2 * sizeof(_names_values[0]));
s = (char*)_str;
int idx = 0;
while (*s) {
_names_values[idx++] = s;
str_skip(&s);
_names_values[idx++] = s;
str_skip(&s);
}
assert(idx == _count * 2);
return true;
}
ParsedStr::~ParsedStr()
{
free((void*)_str);
free(_names_values);
}
const char* ParsedStr::name(size_t idx)
{
return _names_values[idx*2];
}
const char* ParsedStr::value(size_t idx)
{
return _names_values[idx*2+1];
}
This technique can also be applied in languages like C#, Java and with STL vectors.
Depending on access patterns of the data, you might try to improve cache locality (and speed) of access by using a different order e.g. instead of interleaving names and values, you can first store all values and then all names.
A pointer uses 4 bytes on 32-bit machine and 8 bytes on (increasingly more common) 64-bit machines.
All pointers point somewhere within a string so they could be represented as an offset from the beginning of the string.
If we can guarantee that the string size is limited, which is often the case, we can represent the offset with a smaller number of bytes. For strings 3 bytes (24 bits) would be a safe value for most cases. In our case, let’s limit the string size to 64 kB and use 2 bytes for offset.
class ParsedStr
{
protected:
size_t _count;
const char *_str;
unsigned short *_names_values_offsets;
ParsedStrOptOffsets.cpp (raw):
bool ParsedStr::parse(const char *str)
{
assert(NULL == _str); /* don't call me twice */
_str = strdup(str);
int str_count = 0;
char *s = (char*)_str;
while (NULL != delim_str_iter(&s)) {
++str_count;
}
if (str_count % 2 != 0) { /* malformed string */
return false;
}
_count = str_count / 2;
_names_values_offsets = (unsigned short *)malloc(_count * 2 * sizeof(_names_values_offsets[0]));
s = (char*)_str;
int idx = 0;
while (*s) {
_names_values_offsets[idx++] = s - _str;
str_skip(&s);
_names_values_offsets[idx++] = s - _str;
str_skip(&s);
}
assert(idx == _count * 2);
return true;
}
ParsedStr::~ParsedStr()
{
free((void*)_str);
free(_names_values_offsets);
}
const char* ParsedStr::name(size_t idx)
{
return _str + _names_values_offsets[idx*2];
}
const char* ParsedStr::value(size_t idx)
{
return _str + _names_values_offsets[idx*2+1];
}
We save 2 bytes per pointer on 32 bit machines and 6 bytes per pointer on 64 bit machines.
Do we need the offsets at all? We know that strings are laid out in memory in a specific order and if we’re willing to sacrifice a little bit of speed, we don’t have to store the offsets. Let’s just find n-th string starting from the beginning.
ParsedStrOptNoOffsets.h (raw):
class ParsedStr
{
protected:
size_t _count;
const char *_str;
ParsedStrOptNoOffsets.cpp (raw):
bool ParsedStr::parse(const char *str)
{
assert(NULL == _str); /* don't call me twice */
_str = strdup(str);
_count = 0;
char *s = (char*)_str;
while (NULL != delim_str_iter(&s)) {
++_count;
}
if (_count % 2 != 0) { /* malformed string */
return false;
}
_count = _count / 2;
return true;
}
ParsedStr::~ParsedStr()
{
free((void*)_str);
}
The code is not only optimized for size but also simpler than most versions.
Is this approach too slow? It depends.
If there are thousands of strings and we call name() and value() functions millions of times, then yes.
If there are only few strings and name() and value() are called only a couple of times, then no.
How do you know which case applies in your code? Use a profiler, Valgrind is your friend.
What if we can modify the string passed in to ParsedStr::parse()? What if we can assume that the lifetime of that string is longer than lifetime of ParsedStr object?
If that’s true, we can avoid allocating a copy of the string and freeing it.
This optimization can be combined with most other optimization. Here’s the code after applying it previous version:
ParsedStrOptNoOffsetsNoDup.cpp (raw):
bool ParsedStr::parse(const char *str)
{
assert(NULL == _str); /* don't call me twice */
_str = str;
_count = 0;
char *s = (char*)_str;
while (NULL != delim_str_iter(&s)) {
++_count;
}
if (_count % 2 != 0) { /* malformed string */
return false;
}
_count = _count / 2;
return true;
}
This approach is more error prone because clients have to ensure proper lifetime of the string, but hey, this is about extreme optimizations, not safe ones.
What if most of the time the strings are short, say less than 200 bytes in size? If the string to parse is part of HTTP GET request, this is a safe assumption.
We could optimize for common case by using buffer that is part of ParsedStr class. If a string is small enough to fit the buffer (which we presume is common), we save a malloc() and its associated overhead.
If the string doesn’t fit in the buffer, we’ll use more memory (the buffer is unused).
How do we know if it’s worth it and how do we know what buffer size should we pick? It has to come from data based on real-life usage of the application. Instrument the app to gather statistics on the sizes of the string and pick a value that balances small size of the buffer and high hit ratio.
/* make the size of the buffer a multiple of 8, to match malloc() padding */
#define PARSED_STR_BUF_SIZE 256 - sizeof(char*) - sizeof(size_t)
class ParsedStr
{
protected:
size_t _count;
const char *_str;
char _buf[PARSED_STR_BUF_SIZE];
bool ParsedStr::parse(const char *str)
{
assert(NULL == _str); /* don't call me twice */
size_t len = strlen(str);
if (len > sizeof(_buf)-1)
_str = strdup(str);
else {
memcpy(_buf, str, len + 1);
_str = _buf;
}
_count = 0;
char *s = (char*)_str;
while (NULL != delim_str_iter(&s)) {
++_count;
}
if (_count % 2 != 0) { /* malformed string */
return false;
}
_count = _count / 2;
return true;
}
ParsedStr::~ParsedStr()
{
if (_str != _buf)
free((void*)_str);
}
This trick can be used in combination with other optimizations.
Here are the file sizes of various versions, compiled with gcc on Linux and stripped of symbols (to get the most accurate picture of the size of the code):
| Version | File size | File size delta | sizeof(ParsedStr) |
|---|---|---|---|
| A no-op version | 7129 | 0 | 1 |
| Avoiding copying the string | 8033 | 904 | 8 |
| Getting rid of offsets | 8065 | 936 | 8 |
| Naive non-STL version | 8683 | 1554 | 12 |
| Optimizing for common case | 8731 | 1602 | 256 |
| Optimizing allocations of strings | 8791 | 1662 | 16 |
| Offsets instead of pointers | 8791 | 1662 | 12 |
| One array instead of two | 8791 | 1662 | 12 |
| Naive STL version | 14402 | 7273 | 24 |
| gcc (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2) | |||
Here’s the same data when using gcc on Mac OS X 10.5:
| Version | File size | File size delta | sizeof(ParsedStr) |
|---|---|---|---|
| A no-op version | 12888 | 0 | 1 |
| Avoiding copying the string | 13092 | 204 | 8 |
| Getting rid of offsets | 13092 | 204 | 8 |
| Optimizing for common case | 13140 | 252 | 256 |
| Optimizing allocations of strings | 13148 | 260 | 16 |
| Offsets instead of pointers | 13148 | 260 | 12 |
| One array instead of two | 13148 | 260 | 12 |
| Naive non-STL version | 13148 | 260 | 12 |
| Naive STL version | 19296 | 6408 | 24 |
| i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5483) | |||
The interesting metric is file size delta. It shows how much bigger a given implementation of ParsedStr class is compared to a version that does nothing.
Why such a difference between Linux and Mac versions? On Mac, a linker has a -dead_strip option that removes code not used anywhere. On Linux, sadly, this very basic functionality is not yet present in gcc toolchain. Let’s use results from Mac, because they better represent the real code generated.
As to the size, STL version is the worst by a wide margin. Not only it’s more than 31 times bigger than the smallest version but also 24 times bigger than the biggest non-STL version.
Non-STL versions are very close in size, but if you’re an extremist like us, saving 56 or 48 bytes is still worth fighting for.
STL code is arguably the cleanest but you pay a huge price in code size and inflexibility.
Optimization techniques are all based on understanding what’s happening under the hood and exploiting particular properties of the data. STL is bad because it hides how it works internally, which hinders our ability to understand what’s going on and come up with an optimization.
Some of the above techniques can’t be expressed in STL. Some of them would cause the code to loose its simplicity, which is the only thing STL version has going for it.
Important thing to notice is that sometimes optimizing for space also makes the code simpler and faster so you win at multiple fronts.