C as an Intermediate Language (2012)

C as an intermediate language

Blog Site 𝕏 Feed

C as an intermediate language

September 30th, 2012

Here's a Forth program debugged in KDevelop – a graphical debugger without Forth support:

Cool stuff, not? The syntax highlighting for Forth files – that's someone else's work that comes with the standard KDevelop installation. But the rest – being able to run Forth under KDevelop, place breakpoints, and look at program state – all that stuff is something we'll develop below.

I'll show all the required code; we don't have to do very much, because we get a lot for free by using C as an intermediate language.

A high-level intermediate language is not unusual. A lot of compilers target an existing high-level platform instead of generating native code – for instance, by generating JVM bytecode or JavaScript source code. Why? Because of all the things you get for free that way:

Portability

Optimization

Some degree of library interoperability

Some degree of tools interoperability (IDEs, debuggers, etc.)

A few languages targeting high-level platforms are rather well-known: Scala and Clojure with compilers targeting the JVM, CoffeeScript and Dart which are compiled to JavaScript. (Then there's Java, which Google famously compiles to JavaScript – though that remains somewhat offbeat.)

Which languages of this kind are the most successful? Without doubt, today the answer is C++ and Objective-C – languages whose first compilers emitted C code.

I think C is an awesome intermediate language for a compiler to emit. It's extremely portable, it compiles in a snap, optimizes nicely, and you get interoperability with loads of stuff.

When I wanted to make a compiler for an interpreted language we developed internally, I actually thought about targeting a VM, not a source language. I planned to emit LLVM IR. It was GD who talked me out of it; and really, why LLVM IR?

After all, LLVM IR is less readable than C, less stable than the C standard – and less portable than C. It will likely always be, almost by definition.

Even if LLVM runs on every hardware platform, LLVM IR will only be supported by the LLVM tools – but not, say, the GNU tools, or Visual Studio. Whereas generating C code gives you great support by LLVM tools – and GNU, and Visual Studio. Debugging a program generated from LLVM IR in Visual Studio will probably always be inferior to debugging auto-generated C code compiled by the Visual Studio compiler.

"C as an intermediate language" is one of those things I wanted to write about for years. What prevented me was, I'd like to walk through an example – including some of the extra work that may be required for better debugging support. But I couldn't think of a blog-scale example ("web-scale" seems to be the new synonym for "loads of data"; I propose "blog-scale" to mean "small enough to fully fit into a blog post".)

Then it dawned on me: Forth! The minimalist language I've fallen out of love with that still has a warm place in my heart.

So, I'll do a Forth-to-C compiler. That'll fit in a blog post – at least a small Forth subset will – and Forth is different enough from C to be interesting. Because my point is, it doesn't have to be C with extensions like C++ or Objective-C. It can be something rather alien to C and you'll still get a lot of mileage out of the C tools supplied by your platform.

Without further ado, let's implement a toy Forth-to-C compiler. We shall:

Define a small Forth subset

Implement a simple compiler and runtime

Debug a Forth program using gdb

Profile a Forth program using KCachegrind

Display the data stack using gdb pretty-printers

Debug in KDevelop using the data stack display

Enough Forth to not be dangerous

To be dangerous, we'd have to support CREATE/DOES> or COMPILE or POSTPONE or something of the sort. We won't – we'll only support enough Forth to implement Euclid's GCD.

So here's our Forth subset – you can skip this if you know Forth:

Forth has a data stack.

Integers are pushed onto the stack. When you say 2 3 , 2 is pushed and then 3.

Arithmetic operators pop operands from the stack and push the result. 6 2 / pops 6 and 2 and pushes 6/2=3. 2 3 = pushes 0 (false), because 2 is not equal to 3.

Stack manipulation words, well, manipulate the stack. DUP duplicates the top of the stack: 2 DUP is the same as 2 2 . Swap: 2 3 SWAP is the same as 3 2 . Tuck: 2 3 TUCK is the same as... errm... 3 2 3 . As you can already imagine, code is more readable with less of these words.

New words are defined with : MYNAME ...code... ; Then if you say MYNAME , you'll execute the code, and return to the point of call when you reach the semicolon. No "function arguments" are declared – rather, code pops arguments from the stack, and pushes results. Say, : SQUARE DUP * ; defines a squaring word; now 3 SQUARE is the same as 3 DUP * – it pops 3 and pushes 9.

Loops: BEGIN ... cond UNTIL is like do { ... } while(!cond), with cond popped by the UNTIL . BEGIN ... cond WHILE ... REPEAT...

C as an Intermediate Language (2012)

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits