Dense Arena Interning: The Engine of Compiler Performance

Compilers spend a massive amount of time looking at names and structures. Every variable, function, and keyword in your source code is a string that needs to be identified, categorized, and resolved across multiple compiler phases.

The problem isn’t just matching strings or comparing type signatures—it’s doing it over and over again across the entire pipeline. The same variable name counter might be checked hundreds of times: initially in the lexer, again in the parser, repeatedly in the type checker, and throughout optimization passes.

I implemented a Dense Arena Interner to solve this by converting strings and structures into dense integers the moment they are created. We pay the hashing cost upfront during lexing or type construction. In exchange, every subsequent phase of the compiler gets to rely entirely on O(1) array indexing and single-instruction pointer comparisons.

Here’s the technical journey from repeated, expensive structural checks to hardware-level integer comparisons.

compiler-v3

A compiler for a statically-typed, procedural language featuring explicit memory control and an LLVM backend.

CLLVMCOMPILERS

The Linear Bottleneck: O(N*L)

Imagine a Lexer scanning your code. It finds the characters f-u-n-c. It needs to know: “Is this a keyword like func or return, or is it a variable name?”

The naive approach is a linear scan using strcmp against a list of known keywords:

// Naive Lexer logic
const char* keywords[] = { "fn", "if", "else", "while", "return", ... };

for (int i = 0; i < num_keywords; i++) {
    if (strcmp(token_string, keywords[i]) == 0) {
        return keyword_tokens[i];
    }
}

strcmp(s1, s2) is O(L), where L is the string length. It must check every character until it finds a mismatch.
Linear Scan is O(N), where N is the number of keywords.
Total Cost: O(N * L) per token.

If your language has 50 keywords and the average identifier is 8 characters, you’re doing roughly 400 character comparisons just to identify a single word. In a large project, this overhead becomes a massive anchor on performance.

The Hashing Improvement: O(L)

To improve this, we move to a Hash Map. Instead of checking every keyword, we compute a hash of the token and jump directly to the potential match. I use separate chaining with dynamic arrays for buckets. This ensures that even with collisions, performance remains stable. Hashing takes O(L) per lookup (with average-case O(1) bucket access).

// hash_map.c - Simplified insertion
bool hashmap_put(HashMap* map, void* key, void* value) {
    // 1. Automatic Resizing (Load Factor > 0.75)
    if (map->size >= (map->bucket_count * 3) / 4) {
        hashmap_rehash(map, map->bucket_count * 2);
    }

    // 2. O(L) Hashing: We must visit every byte to compute the hash.
    size_t index = hash_func(key) % map->bucket_count;
    DynArray *bucket = &map->buckets[index];

    // 3. O(1) average lookup in bucket chain
    for (size_t i = 0; i < bucket->count; i++) {
        KeyValue *kv = dynarray_get(bucket, i);
        if (cmp_func(kv->key, key) == 0) {
            kv->value = value; // Update existing
            return true;
        }
    }
    
    // 4. Push new entry
    return dynarray_push(bucket, (KeyValue){key, value});
}

While O(L) is much better than O(N * L), we are still paying the hashing price every time we encounter that string in the Parser, Typechecker, or Optimizer. We need a way to phase-shift this cost away from all stages to just the Lexer.

For a symbol reused $k$ times across later passes, the tradeoff is simple: without interning, repeated checks are O(k * L); with interning, you pay O(k * L) only in one phase and later passes become O(k).

Interning shifts the cost to one phase, then reuses a canonical handle for nearly constant-time comparisons.

The Foundation: Stable Memory (The Arena)

To eliminate string comparisons entirely, we need to ensure that identical strings point to the exact same memory address. Standard malloc or realloc can move things around or scatter them, making pointer comparisons risky.

I use an Arena Allocator which manages memory in large contiguous blocks, providing two critical guarantees: O(1) allocation and Pointer Stability.

typedef struct ArenaBlock {
    struct ArenaBlock *next; // Pointer to the next block in the chain
    size_t capacity;         // Total size of this block's data
    size_t used;             // Number of bytes currently allocated
    uint8_t data[];          // The actual memory
} ArenaBlock;

typedef struct {
    ArenaBlock *blocks;      // Head of the block list
    size_t block_size;       // Default size for new blocks
} Arena;

Allocation in an Arena is just “bumping” a pointer forward. Once a string is stored in an Arena block, its address never changes. This stability allows us to use the pointer itself as a unique ID.

void *arena_alloc(Arena *arena, size_t size) {
    if (!arena) return NULL;
    if (size == 0) return NULL; /* semantic choice */

    const size_t align = alignof(max_align_t);

    ArenaBlock *block = arena->blocks;
    if (!block) return NULL;

    /* align the *offset*, not just the size */
    size_t offset = align_up(block->used, align);

    /* If not enough room in current block, allocate a new one */
    if (offset + size > block->capacity) {
        size_t new_capacity = arena->block_size;
        while (new_capacity < size) new_capacity *= 2;

        ArenaBlock *new_block = malloc(sizeof(ArenaBlock) + new_capacity);
        if (!new_block) return NULL;

        new_block->next = arena->blocks;
        new_block->capacity = new_capacity;
        new_block->used = 0;
        arena->blocks = new_block;

        block = new_block;
        offset = 0;
    }

    void *ptr = (void*)(block->data + offset);
    block->used = offset + align_up(size, align); /* bump after alignment */
    return ptr;
}

However, there is a catch: the “Un-freeable” Nature of Arenas. Arena allocators are perfect for batch compilers because you usually just drop the entire memory block at the end of compilation. But if this compiler is ever adapted to run as a long-lived Language Server Protocol (LSP) daemon, memory could balloon since arenas cannot free individual interned strings as files change.

Core Abstractions: Slice and InternResult

Before looking at the interner itself, we need to define the two core data structures used to pass data around.

Slice: A lightweight reference to a string or object. It doesn’t own its memory; it just points to a start and a length. This allows the Lexer to point directly into the source file buffer without copying.

typedef struct {
    const char *ptr;
    size_t len;
} Slice;

InternResult: This is the canonical handle returned by the interner. key points to an arena-allocated Slice, and that slice points to the canonical byte sequence. You may create many temporary slices while lexing, but if two slices have the same bytes and length, they resolve to the same interned record. Entry stores the Dense ID and optional metadata (for example, token type).

typedef struct {
    void *key;      // Arena-allocated Slice* (points to canonical bytes)
    Entry *entry;   // Metadata (Dense ID and metadata)
} InternResult;

The Dense Arena Interner: Buying O(1) Lookups

Interning deduplicates data—whether that’s a string identifier or a complex type struct—so that only one “canonical” copy exists. The Dense part of this implementation refers to how we identify these objects. Instead of just using pointers, we assign every unique item a contiguous, zero-indexed integer (0, 1, 2…).

The DenseArenaInterner uses the Hash Map to find existing entries and the Arena to store new ones.

typedef struct {
    Arena *arena;           // Where canonical data lives
    HashMap *hashmap;       // Maps raw data -> InternResult*
    int dense_index_count;  // Counter for the Dense ID (0, 1, 2...)
} DenseArenaInterner;

When we intern an item, we check the Hash Map. If it’s already there, we return the existing InternResult. If not, we copy it into the Arena exactly once and increment our dense_index_count. This ensures that if we’ve seen 500 unique objects, they are mapped to the integers 0 through 499.

// dense_arena_interner.c - Simplified interning loop
InternResult* intern(DenseArenaInterner *interner, Slice *slice, void *meta) {
    // 1. O(L) Hash + O(1) Map Lookup
    InternResult *found = hashmap_get(
        interner->hashmap,
        slice,
        interner->hash_func,
        interner->cmp_func
    );
    if (found) return found;

    // 2. Create stable, arena-owned key and canonical bytes
    Slice *key_slice = arena_alloc(interner->arena, sizeof(Slice));
    void *canonical_data = interner->copy_func(interner->arena, slice->ptr, slice->len);
    key_slice->ptr = canonical_data;
    key_slice->len = slice->len;

    // 3. Assign the next Dense ID
    InternResult *res = arena_alloc(interner->arena, sizeof(InternResult));
    Entry *ent = arena_alloc(interner->arena, sizeof(Entry));
    ent->meta = meta;
    ent->dense_index = interner->dense_index_count;
    res->key = key_slice;
    res->entry = ent;

    // 4. Update the map for future O(1) lookups
    hashmap_put(
        interner->hashmap,
        key_slice,
        res,
        interner->hash_func,
        interner->cmp_func
    );

    interner->dense_index_count++;
    return res;
}

Lexer Integration: The Memory Win

The lexer maintains two separate interners: one for Keywords and one for Identifiers.

Startup: Bootstrapping Keywords. When the compiler starts, we pre-populate the keyword table. This ensures every keyword is already “canonical” and inside the hash map before we scan a single line of code.

// lexer.c - Initialization logic
static const struct { const char *word; TokenType type; } KEYWORDS[] = {
    {"fn", TOK_FN}, {"if", TOK_IF}, {"return", TOK_RETURN}, ...
};

for (size_t i = 0; KEYWORDS[i].word; i++) {
    Slice s = { .ptr = KEYWORDS[i].word, .len = strlen(KEYWORDS[i].word) };
    // Store the TokenType as metadata directly in the Entry
    intern(lexer->keywords, &s, (void*)(uintptr_t)KEYWORDS[i].type);
}

Runtime: The O(L) Check. Now, when the Lexer scans a word, it uses intern_peek—a function that checks the keyword map without allocating new memory.

// lexer.c - Identifier scanning
static void *lexer_lex_identifier(Lexer *lexer, ...) {
    Slice slice = lexer_make_slice_from_ptrs(start, end);

    // 1. Keyword check (O(L) hashing + O(1) pointer lookup)
    InternResult *kw = intern_peek(lexer->keywords, &slice);
    if (kw) {
        // It's a keyword! Pull the TokenType from the metadata.
        *out_type = (TokenType)(uintptr_t)kw->entry->meta;
        return kw; 
    }

    // 2. Not a keyword? Intern it into the general identifier table.
    *out_type = TOK_IDENTIFIER;
    return intern(lexer->identifiers, &slice, NULL);
}

What did we gain in the lexer? Crucially, we didn’t gain speed. Both a standard hash map and an interner are O(L) operations for the lexer. The win here is twofold:

Memory deduplication: If return appears 500 times, we store it once instead of 500 times.
Dense Indexing Generation: By assigning each identifier a contiguous Dense ID, we enable the later compiler phases—specifically the Scoping and Type Checking systems—to use these IDs for O(1) flat-array lookups.

The lexer pays the O(L) “toll” to generate these IDs, but it is the rest of the compiler that cashes in on the performance.

Beyond Hashing: The Power of Dense IDs

1. Flat Array Symbol Tables. This is the most significant benefit. In a traditional compiler, the Symbol Table is often another complex Hash Map. With Dense IDs, the Symbol Table becomes a flat array.

Since every variable name is guaranteed to have an ID between 0 and N, we can use that ID as a direct index into an array. This turns symbol resolution into a single memory offset calculation—the fastest possible way to access data, which is O(1).

Handling Variable Shadowing. One complexity is lexical scoping. If a variable index is declared in a global scope, and then shadowed by an index parameter in a nested function, both strings resolve to the exact same Dense ID. We solve this by maintaining a hierarchy of Scopes, each with its own sparse array.

// Each scope has its own symbols array, indexed by the global Dense ID
Symbol *scope_lookup_symbol(Scope *scope, InternResult *rec) {
    Scope *current = scope;
    while (current) {
        // Direct O(1) index check per scope level
        if (rec->entry->dense_index < current->capacity) {
            Symbol *symbol = current->symbols[rec->entry->dense_index];
            if (symbol) return symbol; // Found the most local version
        }
        current = current->parent;
    }
    return NULL;
}

While this hierarchy works perfectly, other high-performance compilers sometimes use an Array of Stacks (where each Dense ID maps to a stack of symbols) or an Undo Log (pushing to a global array and rolling back on scope exit). I haven’t implemented these yet, but they offer an interesting alternative to tree-walking by keeping the most local symbol at the top of a global structure. The scope walk itself is proportional to nesting depth: O(h).

2. Memory Efficiency and Cache Locality. Because the IDs are contiguous, we can store metadata in compact, cache-friendly structures. When the compiler iterates over all symbols, it’s performing a linear scan over a contiguous block of memory. This maximizes CPU cache hits and minimizes the memory overhead associated with sparse data structures like Hash Maps.

3. Instant Type and Symbol Comparisons. Interning isn’t just for strings; it’s a strategy for structural canonicalization. When we intern complex types (like “pointer to i32” or “function returning bool”), we ensure that every identical type in the entire program points to the exact same memory address.

This creates a recursive shortcut. Even the process of interning a new complex type is fast because its constituents (return types, parameters, base types) are already interned. Instead of a deep, recursive structural check, we only need to perform a “shallow” comparison of the component pointers.

// Interning a complex function type is just a shallow pointer check
// because 'ret' and 'params' are already canonical pointers.
bool type_compare(Type *a, Type *b) {
    if (a->kind != b->kind) return false;
    if (a->ret != b->ret) return false; // Simple pointer comparison
    if (a->param_count != b->param_count) return false;
    
    // Check if the parameter pointer arrays match
    return memcmp(a->params, b->params, a->param_count * sizeof(Type*)) == 0;
}

By ensuring that every “child” type is already canonical, the interner can determine if a “parent” type exists without ever walking more than one level deep. Once the type is interned, the rest of the compiler can compare complex signatures as if they were simple integers. After canonicalization, equality checks behave like constant-time operations: O(1).

// The ultimate win: O(1) equality everywhere else
if (type_a == type_b) {
    // Verified identical signatures in a single CPU instruction
}

The Result: Free Resolution

By paying the hashing cost upfront during lexing and type construction, every identifier and complex structure is converted into a unique pointer and a dense ID.

From that point on, the Parser, Typechecker, and Optimizer never use strcmp or deep recursive structural checks again. Dense arena interning transforms the compiler from a slow text and tree processor into a lean, integer-crunching machine. This optimization ensures the entire front-end pipeline feels practically instantaneous.