C as an intermediate language
September 30th, 2012
Here's a Forth program debugged in KDevelop, a graphical debugger without Forth support:
Cool stuff, no? The syntax highlighting for Forth files is someone else's work that comes with the standard KDevelop installation. But the rest (being able to run Forth under KDevelop, place breakpoints, and look at program state) is something we'll develop below.
I'll show all the required code; we don't have to do very much, because we get a lot for free by using C as an intermediate language.
A high-level intermediate language is not unusual. A lot of compilers target an existing high-level platform instead of generating native code, for instance by generating JVM bytecode or JavaScript source code. Why? Because of all the things you get for free that way:
Portability
Optimization
Some degree of library interoperability
Some degree of tools interoperability (IDEs, debuggers, etc.)
A few languages targeting high-level platforms are rather well-known: Scala and Clojure, with compilers targeting the JVM, and CoffeeScript and Dart, which are compiled to JavaScript. (Then there's Java, which Google famously compiles to JavaScript, though that remains somewhat offbeat.)
Which languages of this kind are the most successful? Without doubt, today the answer is C++ and Objective-C: languages whose first compilers emitted C code.
I think C is an awesome intermediate language for a compiler to emit. It's extremely portable, it compiles in a snap, optimizes nicely, and you get interoperability with loads of stuff.
When I wanted to make a compiler for an interpreted language we developed internally, I actually thought about targeting a VM, not a source language. I planned to emit LLVM IR. It was GD who talked me out of it; and really, why LLVM IR?
After all, LLVM IR is less readable than C, less stable than the C standard, and less portable than C. It will likely always be, almost by definition.
Even if LLVM runs on every hardware platform, LLVM IR will only be supported by the LLVM tools, but not, say, the GNU tools or Visual Studio. Whereas generating C code gives you great support by LLVM tools, and GNU, and Visual Studio. Debugging a program generated from LLVM IR in Visual Studio will probably always be inferior to debugging auto-generated C code compiled by the Visual Studio compiler.
"C as an intermediate language" is one of those things I've wanted to write about for years. What held me back was that I wanted to walk through an example, including some of the extra work that may be required for better debugging support, and I couldn't think of a blog-scale one. ("Web-scale" seems to be the new synonym for "loads of data"; I propose "blog-scale" to mean "small enough to fully fit into a blog post".)
Then it dawned on me: Forth! The minimalist language I've fallen out of love with, but which still has a warm place in my heart.
So, I'll do a Forth-to-C compiler. That'll fit in a blog post (at least a small Forth subset will), and Forth is different enough from C to be interesting. Because my point is, it doesn't have to be C with extensions like C++ or Objective-C. It can be something rather alien to C and you'll still get a lot of mileage out of the C tools supplied by your platform.
Without further ado, let's implement a toy Forth-to-C compiler. We shall:
Define a small Forth subset
Implement a simple compiler and runtime
Debug a Forth program using gdb
Profile a Forth program using KCachegrind
Display the data stack using gdb pretty-printers
Debug in KDevelop using the data stack display
Enough Forth to not be dangerous
To be dangerous, we'd have to support CREATE/DOES> or COMPILE or POSTPONE or something of the sort. We won't; we'll only support enough Forth to implement Euclid's GCD.
So here's our Forth subset (you can skip this if you know Forth):
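For reference, here is Euclid's algorithm in plain C. This is only a sketch of the computation our Forth subset has to be able to express, not the compiler's output; the name gcd and the choice of long are assumptions made for illustration.

```c
/* Euclid's algorithm (illustrative sketch; gcd is a hypothetical name).
 * Repeatedly replace the pair (a, b) with (b, a % b) until b becomes 0;
 * the remaining a is the greatest common divisor.
 * Assumes non-negative inputs. */
long gcd(long a, long b) {
    while (b != 0) {
        long r = a % b; /* remainder of the current pair */
        a = b;
        b = r;
    }
    return a;
}
```

For example, gcd(48, 18) goes through (48, 18), (18, 12), (12, 6), (6, 0) and returns 6. Note that all it takes is a loop, a comparison, a remainder and some shuffling of two values, which is why a handful of stack words and one loop construct will be enough.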
Forth has a data stack.
Integers are pushed onto the stack. When you say 2 3 , 2 is pushed and then 3.
Arithmetic operators pop operands from the stack and push the result. 6 2 / pops 6 and 2 and pushes 6/2=3. 2 3 = pushes 0 (false), because 2 is not equal to 3.
Stack manipulation words, well, manipulate the stack. DUP duplicates the top of the stack: 2 DUP is the same as 2 2 . Swap: 2 3 SWAP is the same as 3 2 . Tuck: 2 3 TUCK is the same as... errm... 3 2 3 . As you can already imagine, code is more readable with fewer of these words.
New words are defined with : MYNAME ...code... ; Then if you say MYNAME , you'll execute the code, and return to the point of call when you reach the semicolon. No "function arguments" are declared; rather, code pops arguments from the stack, and pushes results. Say, : SQUARE DUP * ; defines a squaring word; now 3 SQUARE is the same as 3 DUP * , which pops 3 and pushes 9.
Loops: BEGIN ... cond UNTIL is like do { ... } while(!cond), with cond popped by the UNTIL . BEGIN ... cond WHILE ... REPEAT...