C as an Intermediate Language (2012)


September 30th, 2012

Here's a Forth program debugged in KDevelop - a graphical debugger without Forth support:

Cool stuff, no? The syntax highlighting for Forth files - that's someone else's work that comes with the standard KDevelop installation. But the rest - being able to run Forth under KDevelop, place breakpoints, and look at program state - all that stuff is something we'll develop below.

I'll show all the required code; we don't have to do very much, because we get a lot for free by using C as an intermediate language.

A high-level intermediate language is not unusual. A lot of compilers target an existing high-level platform instead of generating native code - for instance, by generating JVM bytecode or JavaScript source code. Why? Because of all the things you get for free that way:

Portability

Optimization

Some degree of library interoperability

Some degree of tools interoperability (IDEs, debuggers, etc.)

A few languages targeting high-level platforms are rather well-known: Scala and Clojure with compilers targeting the JVM, CoffeeScript and Dart which are compiled to JavaScript. (Then there's Java, which Google famously compiles to JavaScript - though that remains somewhat offbeat.)

Which languages of this kind are the most successful? Without doubt, today the answer is C++ and Objective-C - languages whose first compilers emitted C code.

I think C is an awesome intermediate language for a compiler to emit. It's extremely portable, it compiles in a snap, optimizes nicely, and you get interoperability with loads of stuff.

When I wanted to make a compiler for an interpreted language we developed internally, I actually thought about targeting a VM, not a source language. I planned to emit LLVM IR. It was GD who talked me out of it; and really, why LLVM IR?

After all, LLVM IR is less readable than C, less stable than the C standard - and less portable than C. It will likely always be, almost by definition.

Even if LLVM runs on every hardware platform, LLVM IR will only be supported by the LLVM tools - but not, say, the GNU tools, or Visual Studio. Whereas generating C code gives you great support by LLVM tools - and GNU, and Visual Studio. Debugging a program generated from LLVM IR in Visual Studio will probably always be inferior to debugging auto-generated C code compiled by the Visual Studio compiler.

"C as an intermediate language" is one of those things I wanted to write about for years. What held me back was that I wanted to walk through an example - including some of the extra work that may be required for better debugging support. But I couldn't think of a blog-scale example ("web-scale" seems to be the new synonym for "loads of data"; I propose "blog-scale" to mean "small enough to fully fit into a blog post").

Then it dawned on me: Forth! The minimalist language I've fallen out of love with that still has a warm place in my heart.

So, I'll do a Forth-to-C compiler. That'll fit in a blog post - at least a small Forth subset will - and Forth is different enough from C to be interesting. Because my point is, it doesn't have to be C with extensions like C++ or Objective-C. It can be something rather alien to C and you'll still get a lot of mileage out of the C tools supplied by your platform.

Without further ado, let's implement a toy Forth-to-C compiler. We shall:

Define a small Forth subset

Implement a simple compiler and runtime

Debug a Forth program using gdb

Profile a Forth program using KCachegrind

Display the data stack using gdb pretty-printers

Debug in KDevelop using the data stack display

Enough Forth to not be dangerous

To be dangerous, we'd have to support CREATE/DOES> or COMPILE or POSTPONE or something of the sort. We won't - we'll only support enough Forth to implement Euclid's GCD.

So here's our Forth subset - you can skip this if you know Forth:

Forth has a data stack.

Integers are pushed onto the stack. When you say 2 3, first 2 is pushed and then 3.

Arithmetic operators pop operands from the stack and push the result. 6 2 / pops 6 and 2 and pushes 6/2=3. 2 3 = pushes 0 (false), because 2 is not equal to 3.

Stack manipulation words, well, manipulate the stack. DUP duplicates the top of the stack: 2 DUP is the same as 2 2. Swap: 2 3 SWAP is the same as 3 2. Tuck: 2 3 TUCK is the same as... errm... 3 2 3. As you can already imagine, code is more readable with fewer of these words.

New words are defined with : MYNAME ...code... ; Then if you say MYNAME, you'll execute the code, and return to the point of call when you reach the semicolon. No "function arguments" are declared - rather, code pops arguments from the stack, and pushes results. Say, : SQUARE DUP * ; defines a squaring word; now 3 SQUARE is the same as 3 DUP * - it pops 3 and pushes 9.

Loops: BEGIN ... cond UNTIL is like do { ... } while(!cond), with cond popped by the UNTIL. BEGIN ... cond WHILE ... REPEAT...
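To make the subset above a bit more concrete, here is a minimal C sketch of how these words could map onto a data stack. It is an illustration under assumptions, not the compiler's actual output: the names (cell, push, pop, the f_ prefix), the fixed stack size, and the COUNTDOWN word are all made up for this example. The text only says that false is 0; the sketch assumes true is -1, as in ANS Forth.

```c
#include <assert.h>

/* Hypothetical toy runtime: all names and sizes here are illustrative assumptions. */
typedef long cell;

enum { STACK_SIZE = 64 };
static cell stack[STACK_SIZE];
static int sp = 0; /* number of cells currently on the data stack */

void push(cell x) { assert(sp < STACK_SIZE); stack[sp++] = x; }
cell pop(void)    { assert(sp > 0); return stack[--sp]; }

/* Arithmetic and comparison: pop two operands, push one result.
   The right-hand operand is on top of the stack, so it is popped first. */
void f_sub(void) { cell b = pop(), a = pop(); push(a - b); }
void f_mul(void) { cell b = pop(), a = pop(); push(a * b); }
void f_div(void) { cell b = pop(), a = pop(); push(a / b); }
/* Assumption: true is -1 (all bits set), as in ANS Forth; false is 0. */
void f_eq(void)  { cell b = pop(), a = pop(); push(a == b ? -1 : 0); }

/* Stack manipulation words, with the usual stack-effect comments. */
void f_dup(void)  { cell a = pop(); push(a); push(a); }                     /* a -- a a     */
void f_swap(void) { cell b = pop(), a = pop(); push(b); push(a); }          /* a b -- b a   */
void f_tuck(void) { cell b = pop(), a = pop(); push(b); push(a); push(b); } /* a b -- b a b */

/* : SQUARE DUP * ; */
void square(void) { f_dup(); f_mul(); }

/* : COUNTDOWN BEGIN 1 - DUP 0 = UNTIL ;   (hypothetical example word) */
void countdown(void) {
    do {
        push(1); f_sub();          /* 1 -     */
        f_dup(); push(0); f_eq();  /* DUP 0 = */
    } while (!pop());              /* UNTIL pops the condition */
}
```

With these definitions, push(3); square(); leaves 9 on the stack, and push(3); countdown(); loops until the top of the stack reaches 0. Each Forth word becomes a plain C function with no arguments, which is the property that lets C tools like debuggers and profilers see Forth words as ordinary functions.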
