Premature Optimization Is Fun Sometimes

throawayonthe1 pts0 comments

invlpg – Premature Optimization is Fun Sometimes

A colleague of mine was recently discussing a connectivity monitoring<br>system he is working on with me. It’s nothing fancy, just sending ICMP<br>Echo Requests to a couple of different servers, and monitoring latency<br>and dropped packet averages over 1-minute, 5-minute, and 15-minute<br>periods. Up came the topic of how this data should be stored, the<br>natural thought was a 512 entry ring buffer, containing entries like the<br>following:

struct ping_timestamp {<br>uint64_t sent_ns; // when the packed was sent (unit: ns)<br>uint64_t received_ns; // when the packet was received (unit: ns)<br>in_addr_t source_addr; // source address<br>uint16_t seq_no; // echo request sequence number<br>bool received; // has the request been received?<br>};

And backed by the following array

struct ping_timestamp pings_rb[512];

// ...

printf("%zu\n", sizeof(pings_rb));

// 12288

12 KiB. Pretty wasteful, right? We can certainly do better. Do we<br>need to keep fields for both sent and<br>received? What we’re really interested is the latency. We<br>need to know when a packet was sent, only up until we know when it was<br>received, at that point, the data we want to keep is<br>received - sent, so why don’t we make it a tagged<br>union?

struct ping_timestamp_2 {<br>union {<br>uint64_t sent_ts; // unit: 100μs<br>uint64_t elapsed_ts; // unit: 100μs<br>};<br>in_addr_t source_addr;<br>uint16_t seq_no;<br>bool received;<br>};

// ...

printf("%zu\n", sizeof(pings_rb));

// 8192

Not bad, we’ve shaved off an entire page. We can still do better.

Nanosecond precision? In our case, ping times are measured in the<br>tens or hundreds, or even thousands of milliseconds. We don’t<br>need to keep all of those extra bits around. If we change the unit from<br>nanosecond to 100 microsecond increments (0.1ms), then 43-bits is<br>sufficient for us to keep track of pings for up to 20-years1. 20-years is a bit excessive still,<br>but it doesn’t hurt to be at least a little bit future proof.

And received? 8-bit for a true/false value seems<br>altogether too much. The answer: bitfields.

struct ping_timestamp_3 {<br>uint64_t sent_or_elapsed_ts: 43;<br>uint64_t received: 1;<br>uint64_t seq_no: 16;<br>in_addr_t source_addr;<br>};

// ...

printf("%zu\n", sizeof(pings_rb));

// 8192

Wait, what? Why haven’t we saved any space?

The answer is struct padding. The layout of<br>ping_timestamp_2 looks like this:

0 8 16 24 32 40 48 56 64...<br>+-------+-------+-------+-------+-------+-------+-------+-------+<br>| sent/elapsed |<br>+-------+-------+-------+-------+-------+-------+-------+-------+<br>...64 72 80 88 96 104 112 120 128<br>+-------+-------+-------+-------+-------+-------+-------+-------+<br>| source_address | seq_no | recv? | pad |<br>+---------------------------------------------------------------+<br>Where the padding byte at the end is to ensure alignment<br>requirements. ping_timestamp_3 on the other end, looks like<br>this:

0 8 16 24 32 40 R? 48 56 64...<br>+-------+-------+-------+-------+-------+-------+-------+-------+<br>| sent/elapsed || seq_no | P |<br>+-------+-------+-------+-------+-------+-------+-------+-------+<br>...64 72 80 88 96 104 112 120 128<br>+-------+-------+-------+-------+-------+-------+-------+-------+<br>| source_address | padding |<br>+---------------------------------------------------------------+<br>So our optimization there didn’t actually save any space. We’re<br>wasting 36-bits of padding. Is there any way we can somehow do<br>better?

We keep track of the source address due to frequent changes while our<br>product is in operation (on a mobile data network). When the address<br>changes, we also reset the sequence number for reasons that aren’t<br>relevant to the current topic. We have seen, in the past, packets with<br>differing source addreses but identical sequence numbers due to be<br>processed by our application at the same time (the joys of asynchronous<br>programming), so the source address serves to disambiguate these<br>changes.

But there’s another way to disambiguate.

An ICMP echo request has a 16-bit long identifier field<br>to allow applications to identify which echo request packets were sent<br>by them. Its value is completely arbitray. On Linux iputils<br>ping sets it to getpid() & 0xFFFF; on<br>OpenBSD a random number is used instead.

Although it’s 16-bits long, we don’t actually need to use the full<br>16-bits. There’s 4 free bits left in the first 8-bytes of our<br>ping_timestamp_3. Our thought was to use a rolling 4-bit<br>counter, that is increased whenever our source address changes (this is<br>monitored elsewhere in the application), allowing us to uniquely<br>identify2 which source address the packet came<br>from.

Our final struct looks like this:

struct ping_timestamp {<br>uint64_t elapsed_or_sent_ts : 43;<br>uint64_t received : 1;<br>uint64_t counter: 4;<br>uint64_t seq_no: 16;<br>};

// ...

printf("%zu\n", sizeof(pings_rb));

// 4096

Much better. A whole 8-kilobytes of savings, and down to a single<br>page of data. You may have noticed that I have changed the order of the<br>fields slightly. This is to line seq_no up on a<br>16-bit boundary, so that loading it...

uint64_t received struct sent seq_no source

Related Articles