The Path of Madness

The Path of Madness — brandur.org

029

Nanoglyph

Hacker News is the Twitter tech elite’s favorite outfit to hate, taking scorching criticism as the “orange website”, and perceived to be full of nothing better than a hypercritical mob informally operating as the world’s most cynical peanut gallery, which is at least partly true.

But even its most vocal critics can’t seem to help themselves: despite the frequent condemnation they lob its way, some way, somehow, they keep finding themselves back. And there’s a reason for that: it’s not all bad, and actually, most of it is even pretty good. Not only that, but once in a while you strike internet gold and come across rare information that you could never have found anywhere else, reminding you that this elaborate series of tubes that we’re all addicted to has a few good things going for it after all.

One of my favorite comments of all time is from an ex-Oracle engineer who describes what development there is like. Not just at Oracle the company, but on the core Oracle database product itself (to be read in the tone of a Lovecraft short story):

Oracle Database 12.2.

It is close to 25 million lines of C code.

What an unimaginable horror! You can’t change a single line of code in the product without breaking 1000s of existing tests. Generations of programmers have worked on that code under difficult deadlines and filled the code with all kinds of crap.

Very complex pieces of logic, memory management, context switching, etc. are all held together with thousands of flags. The whole code is ridden with mysterious macros that one cannot decipher without picking a notebook and expanding relevant parts of the macros by hand. It can take a day to two days to really understand what a macro does.

Sometimes one needs to understand the values and the effects of 20 different flags to predict how the code would behave in different situations. Sometimes 100s too! I am not exaggerating.

The only reason why this product is still surviving and still works is due to literally millions of tests!

Here is the life of an Oracle Database developer:

Start working on a new bug.

Spend two weeks trying to understand the 20 different flags that interact in mysterious ways to cause this bug.

Add one more flag to handle the new special scenario. Add a few more lines of code that checks this flag and works around the problematic situation and avoids the bug.

Submit the changes to a test farm consisting of about 100 to 200 servers that would compile the code, build a new Oracle DB, and run the millions of tests in a distributed fashion.

Go home. Come the next day and work on something else. The tests can take 20 hours to 30 hours to complete.

Go home. Come the next day and check your farm test results. On a good day, there would be about 100 failing tests. On a bad day, there would be about 1000 failing tests. Pick some of these tests randomly and try to understand what went wrong with your assumptions. Maybe there are some 10 more flags to consider to truly understand the nature of the bug.

Add a few more flags in an attempt to fix the issue. Submit the changes again for testing. Wait another 20 to 30 hours.

Rinse and repeat for another two weeks until you get the mysterious incantation of the combination of flags right.

The above is a non-exaggerated description of the life of a programmer in Oracle fixing a bug. Now imagine what horror it is going to be to develop a new feature. It takes 6 months to a year (sometimes two years!) to develop a single small feature (say something like adding a new mode of authentication like support for AD authentication).

The fact that this product even works is nothing short of a miracle!

A miracle indeed.
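
To get a feel for why those flag counts are so daunting, here is a rough back-of-the-envelope sketch. It assumes every flag is a simple on/off boolean and that flags vary independently, which is my simplification and not something the comment states; `flag_combinations` is a hypothetical helper, not anything from Oracle's codebase.

```python
def flag_combinations(flag_count: int) -> int:
    """Number of distinct on/off settings for `flag_count` boolean flags.

    Assumption: each flag is boolean and independent of the others.
    """
    if flag_count < 0:
        raise ValueError("flag_count must be non-negative")
    return 2 ** flag_count


if __name__ == "__main__":
    # 20 flags is the figure from the comment; 30 is the "10 more flags" case;
    # 100 stands in for the "sometimes 100s" case.
    for count in (20, 30, 100):
        print(f"{count} flags -> {flag_combinations(count):,} combinations")
```

Under that assumption, 20 flags already allow more than a million combinations, which goes some way to explaining why nobody reasons about them exhaustively.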

Edit-compile-run

Engineers who’ve spent their careers at smaller companies may not fully appreciate a situation like this. Although Oracle may be a particularly egregious example, it’s a disturbingly common scenario at larger shops where head count’s been scaled up. Over the years they sink slowly into a quagmire, and once in, find it impossible to get themselves back out.

Edit-compile-run is a software engineer’s work loop in which they (1) edit code, (2) compile the code (or start the interpreter), and (3) run the program or test suite. It’s the all-important standard workflow an engineer will run hundreds of times a day, and the speed at which it’s possible is one of the single most important factors for productivity in a codebase. Whether new code is being written or existing code modified, being able to run it and get feedback quickly is of paramount importance.

The ideal edit-compile-run loop for small programs is ... So 10+ seconds isn’t great. The story above describes an edit-compile-run loop at Oracle that takes a full day or more.
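
The gap between those two loops is easier to see with some quick arithmetic. The sketch below is only illustrative: the 10-second figure and the 20 to 30 hour test runs come from the text above, while the 8-hour workday and the `cycles_per_day` helper are my own assumptions. It also ignores time spent thinking and editing, so the results are upper bounds.

```python
def cycles_per_day(loop_seconds: float, workday_hours: float = 8.0) -> float:
    """Upper bound on edit-compile-run cycles that fit in one workday.

    Assumption: an 8-hour workday spent doing nothing but waiting on the loop.
    """
    if loop_seconds <= 0:
        raise ValueError("loop_seconds must be positive")
    return workday_hours * 3600 / loop_seconds


# Loop durations in seconds. The labels are descriptive only.
LOOPS = {
    "10-second loop": 10,
    "20-hour test farm run": 20 * 3600,
    "30-hour test farm run": 30 * 3600,
}

if __name__ == "__main__":
    for label, seconds in LOOPS.items():
        print(f"{label}: {cycles_per_day(seconds):.2f} cycles per workday")
```

Even the "not great" 10-second loop leaves room for thousands of cycles in a day, while a 20 to 30 hour test run delivers less than half a cycle.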

The...
