Case study of over-engineered C++ code

You’ve heard of over-engineered, unnecessarily complex code but what exactly is it?

I believe it’s best to show by example.

While it’s not my intention to criticize other people’s code, I think it’s better to show code and how to improve it rather than vaguely talk about principles.

The code that I consider over-engineered:

The problem it’s trying to solve

This code detects file format from file content. Are the bytes an audio wav file? A jpeg image?

In addition, it extracts some properties of the file format e.g. a dimension of the image file.

Over-engineered implementation

If you’re C++ programmer it’s natural to use object oriented design. You create XBinary base class with some virtual methods and then write a class for each format that inherits from XBinary.

Here’s just a small part:

class XBinary {
    virtual FT getFileType();
    virtual ENDIAN getEndian();
    // ... much more stuff
};

And implementation for detecting BMP images:

class XBMP : public XBinary {
    // ...
};

XBinary::FT XBMP::getFileType()
{
    return FT_BMP;
}

XBinary::MODE XBMP::getMode()
{
    return XBinary::MODE_DATA;
}

Over-engineering in the small: unnecessary virtual functions

This code is over-engineered in the small.

Notice that file type or mode are static per file format. They most likely do not depend on the data.

We can easily eliminate virtual calls by storing this info as class members.

class XBinary {
private:
    FT _fileType;
    MODE _mode;
public:
    FT getFileType() { return _fileType; }
    MODE getMode() { return _mode; }
};

class XBMP : public XBinary {
    XBMP() {
        fileType = FT_BMP;
        mode = XBinary::MODE_DATA;
    }
};

We replaced multiple virtual methods with a single non-virtual method in base class.

That is a saving in both code size and code speed.

Parsing and compiling more stuff requires compiler to do more work which takes more time.

Removing getter methods

We don’t need the method that just returns the value of a variable.

A function is necessary if it does some computation on data. It doesn’t happen here so it’s just bad habit of mindlessly adding unnecessary code.

We can simplify by making class members public:

class XBinary {
public:
    FT fileType;
    MODE mode;
};

Over-engineering in the big: design

XBinary defines 9 virtual classes, each to get small bits of information about the format:

class XBMP {
    virtual FT getFileType();
    virtual QString getMIMEString();
    virtual QString getArch();
    virtual MODE getMode();
    virtual ENDIAN getEndian();

    virtual QString getFileFormatExt();
    virtual QString getFileFormatExtsString();
    virtual _MEMORY_MAP getMemoryMap(MAPMODE mapMode = MAPMODE_UNKNOWN, PDSTRUCT *pPdStruct = nullptr);
    virtual QString getVersion();
}

An observation: they are all cheap to compute. Typically reading a few bytes of memory.

Instead of providing a function for each piece of information, we could have one function that returns everything.

struct FormatInfo {
    FT fileType;
    QString mimeString;
    // ... the other 7 values
};

class XBMP {
    virtual FormatInfo getFormatInfo();
};

We save lots of code by consolidating 9 different functions into one. We also simplified the API and I believe someone new to the library would figure out how to use smaller interface faster.

Each format also provides a per-format data. XBMP has:

class XBMP {
    BMPFILEHEADER getFileHeader();
    BMPINFOHEADER getInfoHeader();
};

Again, we could combine this into one function:

struct XbmpInfo {
    BMPFILEHEADER fileHeader;
    BMPINFOHEADER infoHeader;
};

class XBMP {
    XbmpInfo getXbmpInfo();
    virtual FormatInfo getFormatInfo();
};

But wait, we can combine this info a single function:

struct XbmpInfo {
    FormatInfo formatInfo;
    BMPFILEHEADER fileHeader;
    BMPINFOHEADER infoHeader;
}

class XBMP {
    XbmpInfo getXbmpInfo();
};

In fairness, we’ve lost ability to use the API in a certain way. Before if code was only interested in common properties in FormatInfo it could use the same code for every class representing a format. Now it has to know it’s calling e.g. XBMP and get FormatInfo out of XbmpInfo.

We don’t need classes at all

The best part is no part.

If we had to write this code in C, how would we do it?

We could implement each format detector as a single function.

struct XbmpInfo {
    bool isValid;
    FormatInfo formatInfo;
    BMPFILEHEADER fileHeader;
    BMPINFOHEADER infoHeader;
};

struct PngInfo { ... };

XbmpInfo maybeDetectXbmp(char* data, size_t dataSize);
PngInfo maybeDetectPng(char* data, size_t dataSize);

One function to rule them all

We have a function for each file type, but we can simplify it to a single function that detects the type and returns information for a given file type:

struct XbmpInfo { ... };
struct PngInfo { ... };

enum {
    InvalidTye, // if not a known format
	XbmpType,
	PngType,
	// ....
};

struct FormatInfo {
	enum typ;
	FormatInfo formatInfo;
	union {
		XbmpInfo* xbmp;
		PngInfo* png;
	} extraInfo;
};

FormatInfo detectFileFormat(char* data, size_t dataSize);

The API is now dramatically simpler.

We removed all the classes, all their methods. It’s a massive saving in lines of code.

You save binary size, you make the code faster but most importantly you save yourself time because this is code you don’t have to write.

You can focus on writing the logic for parsing file formats not on incidental complexity of typing up class names and methods.

Why do people over-complicate?

Because simplicity is hard.

Because when you have a hammer, everything looks like a nail.

After you’ve read a book on C++, teaching you about inheritance, virtual functions etc. it’s only natural to start modeling all problems with class inheritance and virtual functions.

To then come up with a solution that reduces all that to a function requires thinking outside the C++ box.

Thinking outside the box is hard.

Edna	speedy note taking app with super powers
SumatraPDF	small, fast, free PDF / ePub / comic book reader for Windows