Today I announce the world's fastest C preprocessor - meep (*)!

(*) meep is a throw-away name; I'm just using it in this post for the tool.

Headline numbers:

I'll describe later how I profiled.
The C preprocessor is in some respects quite straightforward. But as the result of many years of fixes, additions, and improvements, there are many subtle details, often not documented very well. I thank the developers of gcc for their documentation. The C Preprocessor Iceberg Meme gives a good overview of the rabbit hole.
The development has been somewhat humbling. I've been developing in C and C++ for the past 30 years or so, and I thought I knew most things about how the C preprocessor works. Around 25 years ago I even developed a C preprocessor for building websites (!), but it was significantly slower than existing tools and was far from compliant, or even arguably correct. I've been through several different algorithms and rewrites with meep to get to this point.
Some of the subtle details meep handles:

- `_Pragma` and `#pragma once`
- `__VA_ARGS__`, `__VA_OPT__`, `__COUNTER__`, `__FILE__`
- `#line`
- `#include`, `#line` etc
- Caching of `TokenStream`s as well as `PreprocStream`s
- A `TokenStream` holds a tokenization of a source file
- A `PreprocStream` holds the preprocessor directive "program"

Overview of how testing was performed:
- I don't know what `cpp` or `clang` do in this regard, but they seem single threaded
- Timings were taken with bash `time`, `\time`, and `perf`
- Output was directed to `/dev/null`
- … `#include`s to speed up compilation
- meep is developed in C++ and compiled via `clang` for these tests

I would note that the results in this article are initial results. I have not profiled or optimized the code base around the meep pre-processor in any way yet. I did create the algorithms to be performant though, and that appears to have paid off. There are probably plenty of additional performance gains to be had.
In order to test performance I made part of my engine codebase able to build without including any system headers. This was important to ensure that all tests pre-process exactly the same files regardless of platform or toolchain.
I'm currently developing on Linux (although meep compiles and works on Windows), and the pre-processors I tested against are `clang` and gcc's `cpp`. Testing with `time` gives some idea of the performance difference, but a single file is processed so fast that the numbers produced are all over the map. On average, just running `time` and looking at user time, meep takes about half the time. Wall-clock time is similar across the tools, but this is almost certainly because it is dominated by reading in the original source files. Using `perf` produces similar results.
I decided next to try a multiple-file test. This was largely to make the test slower to complete and therefore easier to measure. Here's where fairness becomes an issue. My implementation is designed to cache, tokenize, and precompile input source once. I don't know how to do this with `cpp` or `clang`, or if that's possible at all.
When I did a multi-file test I did it in meep via a single executable run. For the other tools I produced a shell script that invokes `clang` or `cpp` multiple times. This is somewhat unfair, because my solution can read, tokenize, and precompile the files only once across all compilations, and there is also some overhead in process setup and tear-down.
With those caveats, tests indicate meep is around 4 times as fast as `cpp` or `clang` in this scenario.
I would have liked to compare against the performance of warp, but I couldn't find any binaries. I did install the D compiler and attempted to compile warp; unfortunately this produced numerous errors that I tried to fix. Not being a dlang expert, as more and more errors appeared I dropped the effort. On reading the warp blog post, it seems as if it has some similarities to my approach around caching. The project appears to no longer receive updates, so it seems effectively shelved.
On doing some more profiling via `perf`, it's perhaps interesting to note that nearly 2/3 of all execution time in meep is spent outputting text. This is perhaps not super surprising, because the mechanism for formatted token output is quite complicated. It uses the `TokenStream`s from the source files, and then looks at how they line up with the output tokens. This relies on the assumption that the text between consecutive tokens is either nothing, or comments. If it's nothing, then we know there won't be an issue outputting the tokens directly one after another. If it's comments, we (typically) don't want to output the comments themselves but just their "structure" - meaning lines and carriage returns.