Trade-offs in designing a versatile log format

 
This article shows that when designing software, even seemingly simple things are complicated and trade-offs abound.
I wanted to log events to a file. I had several requirements for my design:
  • it should be simple and therefore easy to implement
  • it should be human-readable
  • it should allow various types of events
It's not a hard problem.
I could log events as a stream of JSON objects. It would allow different types of events. It's easy to implement (in the sense that every language has a library for the hard part of encoding/decoding JSON).
One thing: it's not particularly human-friendly. I wouldn't enjoy looking at tail -f of a JSON log.
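As a sketch of that first option (the event struct and its fields are made up here for illustration), logging a stream of JSON objects in Go is only a few lines:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a hypothetical log event; any struct works, which is
// what makes JSON attractive for logging mixed event types.
type event struct {
	Type string `json:"type"`
	IP   string `json:"ip,omitempty"`
	At   int64  `json:"at"`
}

// logJSON encodes one event as a single JSON line, ready to be
// appended to a log file.
func logJSON(e event) (string, error) {
	d, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(d) + "\n", nil
}

func main() {
	line, _ := logJSON(event{Type: "httplog", IP: "127.0.0.1", At: 1553564864010})
	fmt.Print(line)
}
```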
How about something like Apache server logs:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
It's readable but doesn't allow for different types of events.
Implementing this format isn't challenging, but the format is ad hoc: for each kind of data I would have to write a completely new formatter / parser.
How about something more structured:
ip: 127.0.0.1
request: GET /apache_pb.gif HTTP/1.0
status code: 200
We can write a library that serializes key / value pairs, which gives us the ability to write different kinds of events to the same file with generic code.
But hold on, how do we know where one event ends and another starts?
We must amend our format and add a record separator, like ---:
ip: 127.0.0.1
request: GET /apache_pb.gif HTTP/1.0
status code: 200
---
We're not there yet. We use newline to separate lines that encode a single key / value pair. What if the value has a newline in it?
body: hello fred
I'm writing to you...
We could escape the value:
body: hello fred\nI'm writing to you...
Escaping is not a good solution.
Without escaping, we can just write out the data as-is.
With escaping, we have to scan the whole value to determine whether it needs escaping, and escape it if it does.
The same goes for un-escaping.
This is slow and error-prone.
Furthermore, what if the value is really large? It won't be readable encoded on a single line.
Let's amend our format to accommodate large values:
body:+33
hello fred
I'm writing to you...
33 is the size of the value. Knowing the size we don't need escaping. It's faster, simpler to implement, more readable and supports binary data (like images).
To formalize, a key / value pair can be encoded in 2 ways:
  • ${key}: ${value}\n
  • ${key}:+${sizeOfValue}\n${value}
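A minimal sketch of that rule in Go (the function name and the 120-character threshold are my assumptions, not siser's): short values without newlines get the readable inline form, everything else gets a size prefix so no escaping is needed:

```go
package main

import (
	"fmt"
	"strings"
)

// encodeKV encodes one key / value pair. Short, single-line values
// use the readable inline form ${key}: ${value}\n; long or
// multi-line values use ${key}:+${size}\n${value}\n, where knowing
// the size removes any need for escaping.
func encodeKV(key, value string) string {
	if len(value) < 120 && !strings.Contains(value, "\n") {
		return key + ": " + value + "\n"
	}
	return fmt.Sprintf("%s:+%d\n%s\n", key, len(value), value)
}

func main() {
	fmt.Print(encodeKV("ip", "127.0.0.1"))
	fmt.Print(encodeKV("body", "hello fred\nI'm writing to you..."))
}
```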
Let's revisit the idea of using a separator:
ip: 10.0.0.1
request: GET / HTTP/1.0
---
ip: 10.0.0.2
request: GET /index.html HTTP/1.0
---
For delimiting records we can use the same tricks we used for encoding large values:
36
ip: 10.0.0.1
request: GET / HTTP/1.0
43
car: Toyota Corolla
year: 2018
price: 21000
The first line is the size of the data, encoded as a string, followed by a newline. Data of that size follows.
This is very generic framing, agnostic to what's inside. It could be a picture, JSON-encoded data or our key / value format.
We've arrived at a layered design:
  • the first layer encodes arbitrary chunks of data by writing the size followed by the data
  • the second layer is the key / value format inside each chunk
It would be even simpler if the size was an 8-byte, 64-bit integer. It wouldn't be human-readable, though, so I picked a slightly more complicated, string-based encoding.
It's not quite right yet.
Above we logged HTTP request info and car info in the same file.
How do we know what type of record we've read?
Let's amend our framing with an optional name:
36 httplog
ip: 10.0.0.1
request: GET / HTTP/1.0
43 carinfo
car: Toyota Corolla
year: 2018
price: 21000
Adding an optional name (httplog and carinfo) tells us what kind of data is encoded in a given chunk.
Finally, let's add a non-optional timestamp, in Unix epoch format in milliseconds:
36 1553564864010 httplog
ip: 10.0.0.1
request: GET / HTTP/1.0
43 1553564864115 carinfo
car: Toyota Corolla
year: 2018
price: 21000
Unlike the name, which is optional, I decided the timestamp is not. You can set it to 0 if you don't need it, but for most logging needs you want a timestamp.
The traditional Unix epoch has a precision of one second, which doesn't seem enough. Millisecond precision seems good enough. Nanosecond precision was also an option but seems like overkill.
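Parsing that header line is straightforward; a sketch (the function name is mine, not siser's):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHeader splits a record header of the form
// "<size> <timestamp> [name]" into its parts. The name is optional;
// size and timestamp are required.
func parseHeader(line string) (size int, ts int64, name string, err error) {
	parts := strings.SplitN(strings.TrimSpace(line), " ", 3)
	if len(parts) < 2 {
		return 0, 0, "", fmt.Errorf("need at least size and timestamp: %q", line)
	}
	if size, err = strconv.Atoi(parts[0]); err != nil {
		return
	}
	if ts, err = strconv.ParseInt(parts[1], 10, 64); err != nil {
		return
	}
	if len(parts) == 3 {
		name = parts[2]
	}
	return
}

func main() {
	size, ts, name, _ := parseHeader("36 1553564864010 httplog")
	fmt.Println(size, ts, name)
}
```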
What if you need more structure than key / value pairs?
Simplicity means you can't implement every possible feature. This format doesn't preclude more structure: you can always use JSON as the value of a key / value field. It's just not going to be as easy to use.

Implementing the thing

This is not just theoretical exploration. I've implemented this format as a Go package siser.
It took me 2 days to implement. It's just under 500 lines of code:
$ wc -l reader.go record.go util.go writer.go
     167 reader.go
     196 record.go
      79 util.go
      57 writer.go
     499 total
That's not counting tests, but I do have them, because I need this code to be rock solid. Storing data is serious business and has to be reliable.
How fast is it? I benchmarked it against JSON encoding:
BenchmarkSiserMarshal-12      	 1000000	      1136 ns/op
BenchmarkJSONMarshal-12       	 1000000	      1407 ns/op
BenchmarkSiserUnmarshal-12    	 5000000	       374 ns/op
BenchmarkJSONUnmarshal-12     	  500000	      3353 ns/op
The benchmark is mostly to make sure that I didn't make a big performance mistake. It would be embarrassing to be slower than JSON encoding. As a side note: it's impressive how fast JSON marshaling in Go standard library is.
I'm moving all my logging needs to this format.
This format is also good for very simple stores, aka databases. The file can be seen as an append-only database log. To update a record I just write a new entry; when reading back, the later entry overrides the earlier one.
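A sketch of that last-write-wins idea: replay the record log into a map keyed by some id, so a later record for the same key replaces the earlier one (the record shape and the "id" key are made up here for illustration):

```go
package main

import "fmt"

// record is a decoded key / value record, as parsing the format
// above would produce.
type record map[string]string

// replay builds current state from an append-only record log:
// later records with the same id win.
func replay(log []record, idKey string) map[string]record {
	state := make(map[string]record)
	for _, rec := range log {
		state[rec[idKey]] = rec
	}
	return state
}

func main() {
	log := []record{
		{"id": "car-1", "price": "21000"},
		{"id": "car-1", "price": "19500"}, // price update, appended later
	}
	state := replay(log, "id")
	fmt.Println(state["car-1"]["price"])
}
```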

The roads not taken

After reading this you might not be impressed. This design is clearly inspired by many others. All I did was put known things together in a specific, but very familiar, way.
Simplicity is insidiously non-trivial. The design described here is v2 of the siser library. The first design used --- as a record separator and didn't have the name or the timestamp.
Only several months after the first version did I gain enough insight to improve it.
In turn, siser was an evolution of less robust ideas I had implemented earlier. It took experiencing the limitations of those earlier designs in real use for better ideas to emerge.
This is even more apparent when you look at the mistakes of others.
MIME is a format used for encoding e-mail messages. While in some ways it's very close to this format, it made the mistake of using a boundary string to separate parts. Compared to framing data with a size prefix, it's much harder to implement. A MIME decoder is more than 500 lines of code and doesn't offer more features.
Another example of a massive mistake of the past is choosing XML as the format for Ant or Visual Studio build files. XML is super slow, unreadable for humans and hard to work with programmatically. A conforming XML parser requires thousands of lines of code.
Obvious things are often only obvious in hindsight.
 