Finicky C preprocessing and parsing details

I wrote my second C parser and my second C preprocessor last week. Earlier I had written pytci and cffi-gen.

Last week I rewrote these tools in Lever. I learned a lot of nifty little details about the C language.

cffi-gen: great idea, needs more work

Other programming languages can call C functions from .dll and .so files. But these shared object files were meant for other C programs, so they do not have any function signatures associated with the functions.

There is a tradition of writing bindings in the language that calls the C functions. These bindings wrap each and every function with the type signature needed to call it. Additionally they may clean up the API a bit and make it fit their language.

Everything you need in order to call a C function has already been written into the C library headers. You could use them, if you can find them. But parsing C is hard. Also, these headers are meant to be read at compile time and they are slow to parse.

Even if you get the C bindings from the C headers, they may be cumbersome to call with the language's FFI. Well, I took care of that in Lever. With careful design you can give your language an FFI that is pleasant to call without bells and whistles.

cffi-gen is a library for parsing C headers. It provides all the type information in dictionaries and lists that are easy to use. The user of this library writes a script to select the types and variables they want and to rename them automatically. This way you get a really clean library to use.

These header translator scripts dump their results into JSON files. Even long JSON files parse and load in a flash. Then you can attach the output of a JSON loader to a library and let the program interpret only the type entries it needs.

After a little initial work, something that could easily be done by a library maintainer, you get a zero-cost FFI that works across language boundaries.

There's a way to do even better from here. If you made a nice file format that provides an index for name-value pairs, then you wouldn't even need to load the whole file to use a header. I don't plan to do this soon though, and I will demand benchmarks and specifications before I accept any "solutions".

I knew I had to rewrite cffi-gen eventually. It had a few flaws that started to show up.

LR parsing in cffi-gen made changing it feel as if you were playing with Prince Rupert's drops without knowing where the tails are. One inappropriate change and boom, you spend three hours figuring out why the LR parser generator doesn't accept the new grammar.

cffi-gen didn't have its own preprocessor, and that came with too steep a cost. I used gcc -E to preprocess and gcc -dM -E to grab the macro environment from GCC. The biggest pain is that the output isn't ANSI C. GCC puts __attribute__ directives into places where they violate the ANSI C grammar. Those directives cannot be filtered out because they sometimes contain interesting info. That also made me doubt how platform-independent the headers I got from GCC were.

Pytci comes from a long rant

Pytci originates from an idea that came when I was studying what it takes to bootstrap a Linux distribution.

When you have pieces that follow Unix principles, you can pick what you need, leave out the ugly pieces and run with it. If you just want a Linux kernel with a minimal userspace, you can get it with very few programs.

Despite its annoying qualities, something similar is possible in C. At its best, C is the common language that made it all possible. It let everything and everyone co-operate, and it was there. It is there even now, and you can write great software in C. Seeing that happen brings tranquility.

In that context Makefiles are awesome too. When they are used right they can make a lot of builds clean and easy, while staying clean themselves.

Like Lever, C and Makefiles are what their authors make of them. Unfortunately, new authors especially aren't that great at it.

"Lets bootstrap GCC with libgcc, cyclic binary dependencies for the victory!" "Lets bootstrap clang with bunch of c++ libraries that have cyclic dependencies with each other" "Lets force everyone use this crappy XML-defined abomination of a message bus!"

There's a problem when you must use a broken piece in your Linux ecosystem because you don't have a choice.

Most Linux distributions have that deadly flaw, and it is best illustrated by the "Linux From Scratch" project. Their "stable LFS" bootstraps a minimal Linux system ready to run useful programs. There are hundreds of steps to get there.

Convolutions and interactions with the C standard libraries and GCC/clang double the number of steps you need to bootstrap your distribution. The lion's share of the steps described in the LFS indicates that the foundations have sunk at one corner, and a lot of work has to be done to straighten up everything that is built on top of them.

And if only the problems were limited to bootstrapping a distribution...

Linus has that rule of "not breaking userspace, or I break your neck". After all the bumper car rallies in userspace there's nothing left of that rule. Not even the best conventions can completely ensure that your software binary runs and keeps running as Linux distributions change.

We ended up with distributions that require hundreds of people to keep them maintained. Carefree dependencies and all the little peculiarities in every piece they pivot on make them fragile.

Linux has managed to retain a lot of good properties from Unix. One good property is that you can follow your own path if you like. It is what prompted Ballmer to call Linux a cancer. The freedom to do things your own way is awesome. Red Hat seems to hate it a lot too.

Wouldn't it be nice if you could cross-bootstrap a new Linux distribution without going through hundreds of hacks? Without caring whether the target and host are similar in any way? To do that you'd have to remove GCC and clang because they both have cyclic dependencies with details that easily ooze through cross-compiling barriers.

So this is why I considered starting the "python compiler infrastructure" called 'pytci'. The idea was that if you wrote a compiler in Python, or any dynamic language, it'd be easy to understand, control, update and maintain. It'd be perfect for cross-compiling and bootstrapping distributions.

In the time I allotted to pytci, I managed to write a tokenizer and parts of the preprocessor. I did it in a hurry and there were lots of errors.

Unfortunately there was a lot left to do. I felt like I wouldn't have time to finish pytci on my own. And even if I did, there wouldn't be much reward for it. Nobody was putting in commits to continue from where I left off, which led me to conclude nobody would give a shit. Also, just a Python C compiler wouldn't let me achieve anything interesting on its own.

And to note, C is complex enough as it is; if it gets "extended" it becomes a total mess so fragile that nothing can keep it together. I meant for pytci to implement just an ANSI C compliant compiler without its own extensions. At best it would imitate some of GCC's better behaviors to support as much C software as possible.

Intermission

So that's the story behind those two projects. Earlier last week I needed my cffi-gen again. I needed new bindings for the editline library.

The Python-written library started to give me the finger when I tried to generate headers for editline. When I peeked inside to fix the problem, I was greeted by that grammar I remembered could explode at any time.

It was time to rewrite this library.

Touching GCC taught me that I shouldn't depend on it too much. So I decided to start by writing a working C preprocessor.

Let's get into details

In Lever we are on "my battleground". I'm not working around the flaws of my language here. Instead I reshape the damn thing when it becomes an obstacle. That's how I designed Lever to be used anyway.

Lever's self-hosted compiler has an extremely modular structure that makes it an ideal tool for implementing parsers. It's still a bit slow, but that's okay. Translating the C headers is a one-time performance cost, so the code I write won't demand optimization afterwards.

Lever is perfect for some C parser smashing. Especially the proper closure scope does wonders for the quality of the code you have to write. In this language you need very few bulky classes to write a good parser.

Tokenizing

The C tokenizing rules themselves don't differ much from those of other languages. There are identifiers, strings, numbers and punctuation that have to be broken into tokens prior to parsing.

At first sight it seems quite simple to implement properly. The devil is in the details.

After writing lots of tokenizers, I've learned there are some things that give good results with nearly any language that you need to tokenize.

One is that you write yourself a function that returns characters from a stream one at a time. Whatever your language comes with, this seems to be the best way to implement a tokenizer.

Lever has iterators just like Python has, so I started the tokenizing with a function that turns an iterator into a getch function:

to_getch = (it):
    it = iter(it)
    return ():
        try
            return it.next()
        except UncatchedStopIteration as _
            return ""

Lever represents characters as single-character strings, just like Python does.

The empty string is returned as an end-of-file marker to let "++" and ".join" operate without raising an error. You usually don't want string joining to raise an error just because the file is near its end.

When the iterator is exhausted it throws an exception. We can't immediately stop tokenizing when the end of the file is reached, so we have to catch that exception. Fortunately that try-block only costs something when the exception actually happens.

Here's an illustration of how this to_getch works:

getch = to_getch("hello")
print(getch()) # outputs "h"
print(getch()) # outputs "e"
...
print(getch()) # outputs ""

Having the stream represented as a getch function makes it easy to implement look-ahead caches and frames.

Trigraphs

"ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set." - Wikipedia

Trigraphs in C are character sequences starting with two question marks. They translate into other characters before tokenizing.

On most compilers trigraphs need to be enabled by a compiler flag if you desire to use them. They are more of a historical curiosity than an important feature. But you are expected to support them. Some old programs might contain them.

In C, there are trigraphs for quite significant characters. For example, ??= translates to #, and that character is used for constructing macros.
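
To make this concrete, here is a one-line C example of my own (not from any real codebase) showing what a trigraph-aware compiler reads:

??=define INDEX(a, i) a??(i??)    /* read as:  #define INDEX(a, i) a[i]  */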

The trigraphs can be filtered out with a filter that checks a few characters ahead in the stream. It looks like this in Lever:

trigraph_getch = (getch):
    ch0 = getch()
    ch1 = getch()
    return ():
        ch2 = getch()
        if ch0 == '?' and ch1 == '?'
            try
                ch = trigraphs[ch2]
                ch0 := getch()
                ch1 := getch()
                return ch
            except KeyError as _
                null
        ch = ch0
        ch0 := ch1
        ch1 := ch2
        return ch

And here's an illustration again of how it behaves:

getch = to_getch("??(???)[]()")
getch = trigraph_getch(getch)
print(getch()) # outputs "["
print(getch()) # outputs "?"
print(getch()) # outputs "]"
print(getch()) # outputs "["
print(getch()) # outputs "]"
print(getch()) # outputs "("
print(getch()) # outputs ")"
print(getch()) # outputs ""

I barely knew these things existed before I wrote my first tokenizer 10 months back.

Physical vs. Logical lines

In most languages, a single backslash at the end of a line means that the linebreak is discarded and that the reader should treat the lines as a single logical line.

This is a complication because the lines have to be joined before they are tokenized. And the tokenizer still has to keep track of the physical line number to let the user locate errors in the input.
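
For example, this common macro style (my own illustration) spans two physical lines that join into a single logical line before tokenizing, and an error on the second physical line should still be reported with its physical line number:

#define MAX(a, b) \
        ((a) > (b) ? (a) : (b))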

In Lever I created a logical character stream, which builds a two-character lookahead buffer around the getch function. It keeps track of the line number before those two lookahead characters.

It is a bit involved, but not conceptually very different from the trigraph filter. Instead of a function it's an object with multiple methods, though.

It required a lot of care to choose the most appropriate point where the physical lines translate into logical lines.

Comments as spaces and strings

Comments in C are character sequences that should be ignored by the parser. There are two common comment forms: /* */ and //. The star comments do not nest, so they aren't very complex to ignore.

Comments are supposed to be removed before, or during, the tokenizing stage. But the stage before tokenizing is already crowded, so even a simple task such as removing comments is hard to fit in there. There's also the complication that comments cannot occur inside strings.

So it has to happen during tokenizing. Fortunately it has been specified that comments are replaced by spaces before the input is passed to the tokenizer. And since they do not occur inside strings, we can be sure that a comment always ends a token. So we parse the comments as tokens, and discard them before invoking the tokenizer again.
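
Two lines of C (my own illustration) show both points: a comment separates tokens exactly like a space would, and comment markers inside a string literal are just characters:

int a/* acts as a space */= 1;        /* tokenizes as:  int a = 1 ;                  */
char *s = "/* not a comment */";      /* the slash-star here is part of the string   */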

It's easy to miss that comments should not be recognized inside strings, or to implement their removal in the wrong place and have a mess afterwards.

Digraphs

For some unknown reason the trigraphs were not enough. C also has digraphs. The concept is similar except that digraphs are tokens that are translated into other tokens.

Digraphs are always available. There are a handful of them. Notably there are the %: and %:%: digraphs. These should translate to the # and ## tokens.
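
Here are the digraph spellings next to their ordinary forms, as a quick reference:

%:define GLUE(a, b) a %:%: b      /* same as:  #define GLUE(a, b) a ## b   */
int arr<: 3 :> = <% 1, 2, 3 %>;   /* same as:  int arr[3] = { 1, 2, 3 };   */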

The %:%: is really the hardest digraph to tokenize. It's hard because it's a four-character token. I found it easiest to use two-character lookahead and pick up %:%: as a special case of the %: digraph.

You could in theory ignore the digraphs entirely. They are a detail that probably wouldn't be missed. But you would fail a test suite for missing them.

Even when trying to support them, it is easy to fail with the four-character digraph. It was not easy to fix once I figured out that it has to be treated differently from the others. And there are so many ways to go about it.

Macro tokens

When # appears as the first token on a logical line, and it can also be written with the digraph %:, you are expected to treat the rest of that logical line as a macro.

You either let the idea of a logical line leak into your preprocessor, or you mark this character as a macro in the tokenizer.

It turns out to be quite easy to do it in the tokenizer. The trick is to skip the whitespace characters not belonging to tokens before fetching the next token. If there was a newline among the characters you skipped, you mark the # as a macro character.
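
A single definition of mine shows why the distinction matters:

#define STR(x) #x
/* The first '#' follows a newline, so the tokenizer marks it as a macro character.
   The second '#' is an ordinary token: the stringify operator inside the definition. */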

It's easy to get this wrong though. You might check only for the occurrence of # and ignore the digraph, and never notice that your parser fails to recognize the odd digraph as a macro token, because it is extremely rare these days to write C code with a keyboard that cannot type a hash character.

Preprocessing

Which one comes first: macro expansion, or the function that reads the '#' token and treats the line as a macro command?

The macro command needs to call macro expansion, so... well, actually either way works. But it is important to note that macro expansion occurs both outside and inside macro commands, depending on the context. The '#' token, when marked as a macro, should invoke the preprocessor every time, anyway.
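
For instance, in this little snippet of mine the '#' lines go to the macro command handler first, and the handler then asks for macro expansion of the tokens it reads, while the LIMIT on the declaration line is expanded outside any macro command:

#define LIMIT 8
#if LIMIT > 4                /* the if-command expands LIMIT, then evaluates 8 > 4 */
int big_buffer[LIMIT];       /* here LIMIT is expanded by ordinary macro expansion */
#endif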

Fortunately, the occurrence of '#' must halt macro expansion.

There is also a risk that if you don't invoke the macro command before the macro expander gets the token, the expander loses track of the macro token somehow and it gets ignored.

Equivalence of a macro definition

'define' should report errors if you pass it a macro that isn't equivalent to an already defined macro of the same name. Equivalence of two macros sounds like a simple thing.

The specification says that two macros are equivalent when they contain the same tokens with the same spacing.
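
As a small, hand-made example of what that means in practice:

#define TWO (1+1)
#define TWO (1+1)      /* fine: same tokens, same spacing                          */
#define TWO (1 + 1)    /* not equivalent: same tokens, different spacing, so the
                          preprocessor should complain about the redefinition       */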

But the tokenizer should ignore spacing for most things. Well, 'define' needs to capture the subsequent tokens on the logical line without expanding them, so you can check for spaces in the stream before calling the tokenizer. This appears to force you to skip spaces before, rather than after, getting a token.

There may be consequences, but I cheated with token equivalence in Lever's C preprocessor implementation. I parse the macro arguments when the 'define' is being captured and place them alongside the macro definition. Otherwise I ignore the spacing in macro definitions entirely.

Macro expansion process

Macro expansion is explained in a complicated manner everywhere. I suppose you can implement it in many ways. It's not entirely clear how it should work, so some possibly useful special cases will need to be deferred.

Most likely, some of these details don't match up with standard implementations. However this model matches gcc -E to the extent that I've tested it. It's useful as a starting point.

I implemented macros with a token putback buffer. The idea is that before the macro expander can chop off a new token, it has to empty the putback buffer of tokens. The expander can keep filling up this buffer.

Macros are pretty simple until you get to macro functions. A macro function should grab the tokens between parentheses. One easy thing to get wrong here is to ignore nested parentheses, though you will figure it out fast if you get it wrong.
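
This is the kind of input that exposes the bug (my own test line):

#define FIRST(a, b) a
int x = FIRST((1, 2), 3);    /* argument a is the whole group (1, 2): the comma inside the
                                nested parentheses must not split the arguments, so this
                                expands to  int x = (1, 2);                                 */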

When a macro occurs, you should capture its argument tokens into the macro environment and replace the macro with its replacement list.

The C spec tells you clearly how to implement '#' and '##'. They can be handled before the replacement list is put into the putback buffer. The behavior of these tokens is undefined if you mix them. But overall, '#' should stringify the argument coming after it, in the form in which it appears as an argument to the macro. '##' should join two adjacent tokens, with the exception that an empty macro argument must be treated as a no-op.

The token created by joining tokens with ## is treated as an ordinary token. If there's a macro with the name of the pasted token, that macro will be expanded.
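
A short example of my own covering both operators and the re-examination of the pasted token:

#define STR(x)      #x
#define GLUE(a, b)  a ## b
#define foobar      42

const char *s = STR(1 + 2);   /* '#' stringifies the argument:  "1 + 2"                 */
int n = GLUE(foo, bar);       /* '##' pastes foo and bar into foobar, which is itself a
                                 macro, so the result expands further to 42              */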

Recursive macro rules

C macros cannot be recursive. If there is a recursive rule, it should be ignored. This can be achieved by marking the tokens copied from the replacement list before substituting in the macro arguments. The token that activated the macro can be used as a marker, and this lets the markers effectively stack.

Then, during expansion, we can check whether the token was expanded from itself by checking the stack of marks associated with it. If an equivalent token can be found in its stack, the macro expander knows the token should be skipped this time.

An identifier produced by ## should also be marked, so that this feature cannot be used to work around the restriction.
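
Here is a small, compilable demonstration of the recursion rule, my own variation on the classic example:

#define foo (4 + foo)

void f(void)
{
    int foo = 1;        /* an ordinary variable, written before the #define line below  */
    int x = foo;        /* expands once to (4 + foo); the inner foo comes from the
                           macro's own replacement list, so it is not expanded again
                           and here refers to the variable: x == 5                      */
}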

Global include line

This is one of the simplest features to implement, but it's worth some discussion. Consider the following line:

#include <stdio.h>

If the less-than character is found in an include command, the macro processor should read the header path in as a string until it hits the next '>'. This is worth mentioning because it skips the ordinary behavior of the tokenizer entirely.

When you implement it this way, the consequence is that comments cannot occur inside the include's <> either. Again, it's a corner case to handle slash-star or slash-slash as an argument to include.

I didn't check exhaustively what the C spec says about include.

Macro evaluation on if/elif

The if/elif macro should expand the tokens passed to it, then evaluate the line and check whether it evaluates to '0'.

If the evaluation returns zero, the following token group gets ignored.

It's easy to mess up the skipping of a token group. I found it easiest to have a function that separately skips a whole group of tokens.
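
A minimal case to test the skipping against:

#define LEVEL 0

#if LEVEL > 1          /* LEVEL expands to 0, and 0 > 1 evaluates to 0...        */
int verbose_hook;      /* ...so this whole token group must be skipped            */
#elif LEVEL > 0
int normal_hook;       /* skipped as well: 0 > 0 is also 0                        */
#else
int quiet_hook;        /* only this declaration reaches the parser                */
#endif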

The 'defined' directive inside macro if/elif

One special case is that the word "defined" is reserved. It serves a special purpose and can't be used itself as a macro name.

The 'defined' directive should be handled when an if/elif macro evaluates its line.

The 'defined' directive should be expanded into '0' or '1' depending on whether there's a macro by that name in the environment.
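
So a condition like this one (the identifiers are invented for the example) evaluates as shown:

#define HAVE_EDITLINE 1

#if defined(HAVE_EDITLINE) && !defined(HAVE_READLINE)
int use_editline;    /* defined(HAVE_EDITLINE) becomes 1 and defined(HAVE_READLINE)
                        becomes 0, so the line evaluates as  1 && !0  and the group
                        is kept                                                      */
#endif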

Parsing

Parsing is the next, though not the last, step towards madness in a C compiler after preprocessing. The parser gets a feed of expanded tokens from the preprocessor.

There's one last major thing you can really mess up.

Grammar ambiguity

The first C parser written in Lever uses the same parser engine used to parse Lever. It allows ambiguity in the grammar.

This feature becomes useful with C, because the C grammar is ambiguous unless you parse it with a parser that calls 'reduce' as soon as it sees a rule to reduce.

The ambiguity problem basically comes from the 'typedef' feature. A typedef forms a type alias: the identifier token of that name should subsequently behave like a token denoting a type.

One gotcha here is that the effect of a 'typedef' can also be erased by a declaration that uses the same name.
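
The classic demonstration of both effects, where the same statement parses differently depending on what T currently names:

typedef int T;

void f(void)
{
    T * a;       /* T names a type here, so this declares a as a pointer to int    */
}

void g(void)
{
    int T, a;    /* this declaration erases the typedef inside g...                */
    T * a;       /* ...so the same line now parses as the multiplication  T * a    */
}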

A declaration takes effect immediately after the ",", ";" or "=" tokens. There's a fancy context-free grammar on the web that describes the C syntax.

That grammar describes the rules for C declarations in such a way that you have to hack the parser to get the declarations handled right.

To avoid that, you may want to redefine the grammar slightly:

declaration =>
    declaration_specifiers declarator
    declaration_init ',' declarator

declaration_init =>
    declaration
    declaration '=' initializer

This change to the grammar lets an LR parser call reduce at the point where the declaration should take effect, giving you time to mark the next token before you shift it in.

I didn't want to use an LR parser yet, because LR parsers tend to be as fragile as glass when changed, and it is still very hard to get good debug information about what went wrong in constructing the parsing tables.

In the current parser I accept that there is ambiguity here. I can't really do it any other way. I have to let the parsing engine construct the partially ambiguous parse trees.

I use a trick to solve it. I lug a scope across the parse tree and discard the parse trees that do not match the information inside the scope.

There are some problems there, because the parser may detect ambiguity in places where I have list construction going on. I avoided those by writing some indirections into the grammar.

If I handled a special case, I could avoid modifying the grammar to support the parser. Overall I think the C grammar in Lever will eventually be adjusted for LR parsing.

Fin

So I have a C parser with a preprocessor now, and I'm halfway to generating headers with it. I guess once I'm done I may update it to parse a larger subset of C, for example to have it retrieve C headers and print them to a C file as well as a JSON file.

It could be food for a full C compiler later on. So, in a way, this allows pytci to live on. It sits right in the heart of Lever, in lib/c.lc.

And it translates headers for us!
