Challenges in Parsing C to Get Bindings
I made myself a tool to parse C headers. cffi-gen parses the gcc -E and gcc -dM output, using rules I derived from this ANSI C grammar file. It builds an index of the primitive types, constants and declarators present in the header file. I can use this index to generate API files.
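As a rough illustration of the first step, here is a minimal Python sketch that builds the two GCC command lines and pulls plain integer constants out of a macro dump. The function names are made up for this example and are not part of cffi-gen. It also assumes that -dM is combined with -E, which is how GCC dumps macro definitions.

```python
import re

# Matches object-like macros such as "#define SDL_MAJOR_VERSION 2".
# Function-like macros ("#define max(a, b) ...") do not match on purpose.
_DEFINE = re.compile(r"^#define\s+([A-Za-z_]\w*)(?:\s+(.*))?$")


def preprocessor_commands(header, cc="gcc"):
    """Return two argv lists: preprocessed source and macro dump."""
    return [cc, "-E", header], [cc, "-dM", "-E", header]


def parse_macro_dump(text):
    """Collect macros whose body is a plain integer literal.

    Anything else (strings, expressions, suffixed literals such as 10UL,
    C octal literals) is skipped in this sketch.
    """
    constants = {}
    for line in text.splitlines():
        match = _DEFINE.match(line.strip())
        if match is None:
            continue
        name = match.group(1)
        body = (match.group(2) or "").strip()
        try:
            constants[name] = int(body, 0)  # base 0 accepts 10 and 0xff
        except ValueError:
            continue
    return constants
```

The command lists can be handed to subprocess.run to get the text that parse_macro_dump expects. A real generator would need an expression evaluator for the macro bodies this sketch skips.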
Maintaining cffi bindings yourself is error-prone. The interface to a library should be the responsibility of the library maintainers, not the users. By generating the bindings I can ensure they are correct as long as my generator is correct. The generator is a lot less work to maintain than every binding to every library someone might need.
I've tested the parser by running it on every header file in /usr/include on my desktop computer. There are about 13000 .h files. About 5000 of them can be preprocessed by GCC. My program is able to retrieve an index from about 2500 of these files.
Next week I will run this on the SDL2 and OpenGL ES headers. Before doing that I have to write the code that renders the output of cffi-gen into the .api format understood by pyllisp. I will use an index reference, such as this index for SDL2 reference, to figure out which declarations form the library API.
I hope this will turn into something that lets anyone avoid writing a library index by hand. Unfortunately it took almost two weeks of my time, and there are still some complications.
How to use it
cffi-gen requires my lrkit. The API isn't ready yet, but it will work roughly like this:
- You give the list of header files to cffigen.parse_file(headers). You get an environment object containing dictionaries for every namespace of the C program.
- The entries in the environment are either constants or Declarator objects. A declarator consists of a stack of modifiers and specifiers. You translate the specifiers and modifiers to get the actual type signature of the node. There will be an example of this for the .api spec.
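To make the "stack of modifiers" idea concrete, here is a small Python sketch. The Declarator class below is a hypothetical stand-in, not the real cffi-gen type, and the tuple layout of the modifiers is an assumption for illustration only. The modifiers are listed from the one closest to the name outward, so reading them in order gives the type signature.

```python
from dataclasses import dataclass, field


@dataclass
class Declarator:
    """Hypothetical stand-in for a parsed declarator."""
    name: str
    specifiers: list                      # e.g. ["const", "char"]
    modifiers: list = field(default_factory=list)  # innermost first


def describe(decl):
    """Translate the modifier stack and specifiers into a readable signature."""
    parts = []
    for modifier in decl.modifiers:
        kind = modifier[0]
        if kind == "pointer":
            parts.append("pointer to")
        elif kind == "array":
            parts.append("array[%s] of" % modifier[1])
        elif kind == "function":
            parts.append("function(%s) returning" % ", ".join(modifier[1]))
        else:
            raise ValueError("unknown modifier: %r" % (kind,))
    parts.append(" ".join(decl.specifiers))
    return "%s: %s" % (decl.name, " ".join(parts))


# const char *argv[4];
argv = Declarator("argv", ["const", "char"], [("array", 4), ("pointer",)])
print(describe(argv))  # argv: array[4] of pointer to const char
```

A generator for the .api format would walk the same stack, emitting its own type constructors instead of English phrases.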
Complications
Parsing the header files alone isn't sufficient, because they can include other libraries and reference many functions besides the ones belonging to the library. Also, all C functions share a single global namespace, so library function names tend to carry a prefix. You can retrieve the list of names from the header file and generate the type signatures, but the list needs to be pruned and the names need to be translated to get good bindings.
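The pruning and renaming step could look roughly like the following Python sketch. The function name is invented for this example, and a plain prefix check is only the simplest possible rule. A real index would come from a reference listing rather than from the prefix alone.

```python
def prune_and_rename(names, prefix):
    """Keep names carrying the library prefix and map short name -> C name."""
    renamed = {}
    for name in names:
        # Skip foreign names and the degenerate case of the bare prefix.
        if name.startswith(prefix) and len(name) > len(prefix):
            renamed[name[len(prefix):]] = name
    return renamed


names = ["SDL_Init", "SDL_Quit", "printf", "malloc"]
print(prune_and_rename(names, "SDL_"))
# {'Init': 'SDL_Init', 'Quit': 'SDL_Quit'}
```

Keeping the original C name as the value means the binding can still look up the right symbol in the shared object.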
I'm letting GCC preprocess the headers for me because it does it correctly. Parsing the headers takes some time, which would be noticeable in the warm-up time of my interpreter, so I need something like the .api spec. I think something like this should be a standard way to access shared objects. Unfortunately my format doesn't entirely cover C or match the C way of specifying things. Also, the .api specs are plain text, and not every program uses everything in the bindings. The specification files should be structured for quick indexing.
In GCC, __attribute__((mode(__QI__))) can change the meaning of a type in a typedef. It appears in the standard definition of types such as uint8_t. GCC attributes are, overall, difficult to parse unambiguously using the grammar I found.
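One way to deal with the mode attribute after parsing is to treat it as an override of the integer width. The Python sketch below is an assumption about how that could be done, not what cffi-gen does. It covers only the four integer machine modes QI, HI, SI and DI (1, 2, 4 and 8 bytes) and uses a regular expression instead of a proper grammar rule.

```python
import re

# GCC integer machine modes and their sizes in bytes.
_MODE_BYTES = {"QI": 1, "HI": 2, "SI": 4, "DI": 8}

# Accepts both mode(...) and __mode__(...), with optional whitespace.
_MODE_ATTR = re.compile(
    r"__attribute__\s*\(\(\s*(?:__mode__|mode)\s*\(\s*(\w+)\s*\)\s*\)\)"
)


def apply_mode(specifiers, attribute):
    """Return a fixed-width type name, or None if the mode is not handled."""
    match = _MODE_ATTR.search(attribute)
    if match is None:
        return None
    mode = match.group(1).strip("_")  # "__QI__" and "QI" are the same mode
    size = _MODE_BYTES.get(mode)
    if size is None:
        return None
    sign = "uint" if "unsigned" in specifiers else "int"
    return "%s%d_t" % (sign, size * 8)


print(apply_mode(["unsigned", "int"], "__attribute__((mode(__QI__)))"))  # uint8_t
```

Other modes, such as the floating-point ones or word and pointer, would need their own handling.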
A C header definition consists of declaration specifiers (such as extern, const or int) and a declarator. The grammar allows any combination in both the specifiers and the declarator. Not all of these combinations are valid. The parser doesn't take this into account; it just produces a binding out of them.
I've found some constant expressions in the header files that calculate an enumerator or an index based on the size of something. I'm not sure what to do about these yet. The parser likely has to produce a parse tree for them instead of evaluating them. Most likely bindings containing this kind of construct won't be portable. I hope they can be avoided entirely when generating bindings.
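A sketch of what keeping a parse tree could mean in practice: the expression is stored as nested tuples and only evaluated once the sizes for a given target are known. The tuple layout and the evaluate function are assumptions made up for this example, written in Python.

```python
def evaluate(node, sizes):
    """Evaluate a deferred constant expression for one target.

    node is an int, ("sizeof", type_name), or (op, left, right).
    sizes maps type names to byte sizes for the target platform.
    """
    if isinstance(node, int):
        return node
    op = node[0]
    if op == "sizeof":
        return sizes[node[1]]
    left = evaluate(node[1], sizes)
    right = evaluate(node[2], sizes)
    if op == "+":
        return left + right
    if op == "-":
        return left - right
    if op == "*":
        return left * right
    if op == "/":
        # C truncates toward zero; Python's // floors, so handle signs here.
        quotient = abs(left) // abs(right)
        return quotient if (left < 0) == (right < 0) else -quotient
    raise ValueError("unsupported operator: %r" % (op,))


# 1024 / (8 * sizeof(long)), kept as a tree until the target is known
tree = ("/", 1024, ("*", 8, ("sizeof", "long")))
print(evaluate(tree, {"long": 8}))  # 16
print(evaluate(tree, {"long": 4}))  # 32
```

The same tree gives different values on different targets, which is why a binding that stores the already evaluated number would not be portable.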
Finally here's some fun trivia about C:
This is a valid C function header, the function returns a function pointer:
void (*function(int a, int b))(long long int a, char b);
Keep in mind the declarator stack. The (long long int a, char b) declarator is part of the type specification of the return value. long long int is a valid type specifier. It's equivalent to long long. In general, if a type ends with int and you're not sure what it is, remove the int and look again.
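The "remove the int" rule can be turned into a lookup that also rejects specifier combinations the grammar accepts but C does not. The Python sketch below is one possible approach, not the cffi-gen implementation. The table covers only the common arithmetic types and void (no _Bool or _Complex), and the order of the specifiers is ignored because C allows them in any order.

```python
# Canonical name -> every equivalent spelling of that type.
_SPELLINGS = {
    "void": ["void"],
    "char": ["char"],
    "signed char": ["signed char"],
    "unsigned char": ["unsigned char"],
    "short": ["short", "signed short", "short int", "signed short int"],
    "unsigned short": ["unsigned short", "unsigned short int"],
    "int": ["int", "signed", "signed int"],
    "unsigned": ["unsigned", "unsigned int"],
    "long": ["long", "signed long", "long int", "signed long int"],
    "unsigned long": ["unsigned long", "unsigned long int"],
    "long long": ["long long", "signed long long",
                  "long long int", "signed long long int"],
    "unsigned long long": ["unsigned long long", "unsigned long long int"],
    "float": ["float"],
    "double": ["double"],
    "long double": ["long double"],
}


def _key(words):
    """Order-independent key: the sorted multiset of specifier words."""
    return tuple(sorted(words))


_TABLE = {
    _key(spelling.split()): canonical
    for canonical, spellings in _SPELLINGS.items()
    for spelling in spellings
}


def canonical_type(specifiers):
    """Map a list of type specifiers to one canonical spelling."""
    try:
        return _TABLE[_key(specifiers)]
    except KeyError:
        raise ValueError("invalid type specifiers: %s" % " ".join(specifiers))


print(canonical_type(["long", "long", "int"]))  # long long
```

Storage-class specifiers and qualifiers such as extern or const would have to be split off before calling canonical_type.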