Chapter 24: Performance & Optimization

**Understanding compilation, memory management, and optimization techniques.**

NanoLang Mascot

NanoLang compiles to C and achieves performance within 5-15% of hand-written C code. This chapter explains how to understand the compilation process, manage memory efficiently, profile your programs, and avoid common performance pitfalls.

24.1 Understanding Compilation

From NanoLang to C

NanoLang is a **compiled language** that transpiles to C, then uses your system's C compiler (gcc or clang) to produce native binaries. This approach provides:

  • **Native performance** - No interpreter overhead
  • **Portability** - Runs anywhere C compiles
  • **Interoperability** - Easy FFI with C libraries
  • **Debugging** - Can inspect generated C code

Compilation Stages

The NanoLang compiler (nanoc) processes your code through six stages:


┌─────────────────┐
│  Source Code    │  (.nano file)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 1. Lexical      │  Tokenizes source into tokens
│    Analysis     │  (lexer.c / lexer.nano)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 2. Parsing      │  Builds Abstract Syntax Tree (AST)
│                 │  (parser_iterative.c / parser.nano)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 3. Type         │  Static type checking and inference
│    Checking     │  (typechecker.c / typecheck.nano)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 4. Shadow       │  Runs compile-time tests
│    Tests        │  (validates function correctness)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 5. Transpile    │  Two-pass C code generation
│    to C         │  (transpiler_iterative_v3_twopass.c)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 6. C Compile    │  gcc/clang produces native binary
│                 │  (-std=c99 -Wall -Wextra -Werror)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Native Binary   │  Executable program
└─────────────────┘

Viewing Generated C Code

Understanding the generated C code helps with debugging and optimization. NanoLang provides several ways to inspect it:

**Keep the generated C file:**


# Saves .c file alongside binary
./bin/nanoc program.nano -o bin/program --keep-c
# Creates: bin/program and bin/program.c

**Save to .genC file:**


# Creates program.nano.genC for inspection
./bin/nanoc program.nano -S

**Print to stdout:**


# Displays generated C without compiling
./bin/nanoc program.nano -fshow-intermediate-code

**Verbose compilation:**


# Shows all compilation steps
./bin/nanoc program.nano -o bin/program --verbose

Examples

**NanoLang source:**


fn add(a: int, b: int) -> int {
    return (+ a b)
}

fn main() -> void {
    let result: int = (add 5 3)
    print result
}

**Generated C (simplified):**


#include <stdio.h>
#include "nl_runtime.h"

int64_t nl_add(int64_t a, int64_t b) {
    return (a + b);
}

void nl_main(void) {
    int64_t result = nl_add(5, 3);
    printf("%lld\n", result);
}

int main(int argc, char** argv) {
    nl_gc_init();
    nl_main();
    nl_gc_cleanup();
    return 0;
}

Key observations:

  • Functions are prefixed with nl_ to avoid C name collisions
  • int becomes int64_t (64-bit integers)
  • GC initialization/cleanup wraps main()
  • Prefix operators like (+ a b) become infix (a + b)

24.2 Memory Management

Stack vs Heap

NanoLang uses a **hybrid memory model** where some values live on the stack and others on the heap:

**Stack-Allocated (Fast, No GC overhead):**

TypeSizeNotes
int8 bytes64-bit signed integer
float8 bytes64-bit double
bool1 bytetrue/false
u81 byteUnsigned byte
Small structs≤128 bytesCopied by value
EnumsVariesTag + payload

**Heap-Allocated (GC-Managed):**

TypeOverheadNotes
string~24 bytesUTF-8 text, immutable
bstring~24 bytesBinary data
array<T>~32 bytesDynamic arrays
List<T>~32 bytesGeneric lists
HashMap<K,V>~48 bytesHash tables
Large structs~16 bytes>128 bytes

**Rule of thumb:** Primitives and small structs are stack-allocated; collections and strings are heap-allocated.

Automatic Memory Management (ARC)

**NEW in v2.3.0:** NanoLang uses Automatic Reference Counting (ARC) for zero-overhead memory management. No manual free() calls needed!

**How ARC Works:**

  • **Owned references** (from constructors like parse, new_object) are automatically freed when going out of scope
  • **Borrowed references** (from accessors like get, as_string) have zero overhead - no ref counting
  • **Cycle detection** - Circular references are automatically collected
  • **Deterministic cleanup** - Objects freed immediately when last reference disappears

fn efficient_memory() -> void {
    # Stack allocated - no GC overhead
    let x: int = 42
    let point: Point = Point { x: 1.0, y: 2.0 }

    # Heap allocated - ARC managed (no manual free needed!)
    let name: string = "Hello"
    let numbers: array<int> = [1, 2, 3, 4, 5]

    # When function exits:
    # - Stack variables automatically freed
    # - Heap variables: refcount decremented by ARC
    # - If refcount reaches 0, memory freed automatically
    # No manual cleanup required!
}

**Passing data efficiently:**


# Primitives are copied (cheap for small types)
fn double_int(x: int) -> int {
    return (* x 2)
}

# Arrays/strings are passed by reference (cheap)
fn sum_array(arr: array<int>) -> int {
    # arr is a reference - no copy made
    let mut total: int = 0
    let mut i: int = 0
    while (< i (array_length arr)) {
        set total (+ total (array_get arr i))
        set i (+ i 1)
    }
    return total
}

Memory Leaks and How to Avoid Them

NanoLang's garbage collector prevents most memory leaks, but cycles can cause issues:

**Reference cycles (potential leak):**


struct Node {
    value: int,
    next: Node?  # Optional reference
}

fn create_cycle() -> void {
    let a: Node = Node { value: 1, next: none }
    let b: Node = Node { value: 2, next: some(a) }
    # If we set a.next = some(b), we create a cycle
    # The cycle collector will eventually clean this up
}

**Best practices:**

1. **Avoid circular references** when possible

2. **Use weak references** for back-pointers (if available)

3. **Let values go out of scope** when no longer needed

4. **Don't hold references** longer than necessary

Profiling Memory Usage

**Check GC statistics:**


extern fn gc_print_stats() -> void

fn main() -> void {
    # ... your code ...
    
    # Print GC statistics at end
    gc_print_stats()
}

**Environment variables for GC tuning:**


# Set heap threshold for cycle collection (default: 10MB)
export NANO_GC_THRESHOLD=10485760

# Enable GC debug output
export NANO_GC_DEBUG=1

# Run your program
./bin/myprogram

**Using Valgrind for detailed analysis:**


# Memory leak detection
valgrind --leak-check=full ./bin/myprogram

# Memory profiling over time
valgrind --tool=massif ./bin/myprogram
ms_print massif.out.*

24.3 Profiling Techniques

Time Profiling

**Basic timing:**


# Simple wall-clock timing
time ./bin/myprogram

# Output:
# real    0m2.345s  (wall clock)
# user    0m2.100s  (CPU time in user mode)
# sys     0m0.123s  (CPU time in kernel)

**Built-in profiling with -pg:**


# Compile with profiling enabled
./bin/nanoc program.nano -o bin/program -pg

# Run program (generates profiling data)
./bin/program

# JSON profiling output appears on stderr
./bin/program 2> profile.json

See Chapter 8: LLM-Powered Profiling for comprehensive profiling documentation.

Memory Profiling

**Valgrind massif (heap profiler):**


# Profile heap usage over time
valgrind --tool=massif ./bin/myprogram

# Analyze results
ms_print massif.out.12345

# Key metrics:
# - Peak heap usage
# - Allocation sites
# - Memory timeline

**AddressSanitizer (compile-time):**


# Compile with ASan
./bin/nanoc program.nano -o bin/program -fsanitize=address

# Run - detects memory errors
./bin/program

Hotspot Identification

When profiling shows a function consuming significant time, investigate:

1. **Algorithm complexity** - Is it O(n²) when O(n) is possible?

2. **Loop iterations** - Can iterations be reduced?

3. **Allocations** - Is memory being allocated inside hot loops?

4. **Function calls** - Can expensive calls be hoisted out of loops?

**Example analysis:**


# Profile shows this takes 80% of runtime
fn process_items(items: array<Item>) -> void {
    let count: int = (array_length items)
    let mut i: int = 0
    while (< i count) {
        let item: Item = (array_get items i)
        
        # Suspect: Is expensive_lookup O(n)?
        let data: Data = (expensive_lookup item.id)
        
        # Suspect: String concat in loop?
        set log (+ log item.name)
        
        set i (+ i 1)
    }
}

**Optimization opportunities:**

  • Cache expensive_lookup results in a HashMap
  • Build strings using array, then join once at end

Tools and Techniques

ToolPurposeCommand
timeWall-clock timingtime ./program
-pg flagLLM-ready profilingnanoc -pg then run
gprofCall graph profilinggprof program gmon.out
perfLinux performance countersperf record ./program
valgrindMemory analysisvalgrind --tool=massif
InstrumentsmacOS profiling GUIopen -a Instruments

24.4 Common Pitfalls

Performance Anti-Patterns

**1. Premature optimization**


# ❌ Don't optimize without measuring
fn overly_clever() -> int {
    # Bit hacks that obscure intent
    return (& (>> x 31) 1)  # Just use (< x 0)!
}

# ✅ Write clear code first, optimize measured hotspots
fn clear_intent(x: int) -> bool {
    return (< x 0)
}

**2. Ignoring algorithm complexity**


# ❌ O(n²) - checking every pair
fn has_duplicates_slow(arr: array<int>) -> bool {
    let n: int = (array_length arr)
    let mut i: int = 0
    while (< i n) {
        let mut j: int = (+ i 1)
        while (< j n) {
            if (== (array_get arr i) (array_get arr j)) {
                return true
            }
            set j (+ j 1)
        }
        set i (+ i 1)
    }
    return false
}

# ✅ O(n) - use a set
fn has_duplicates_fast(arr: array<int>) -> bool {
    let seen: HashMap<int, bool> = (hashmap_new)
    let n: int = (array_length arr)
    let mut i: int = 0
    while (< i n) {
        let val: int = (array_get arr i)
        if (hashmap_contains seen val) {
            return true
        }
        hashmap_set seen val true
        set i (+ i 1)
    }
    return false
}

String Building in Loops

This is the **most common performance mistake** in NanoLang:


# ❌ SLOW - O(n²) time complexity!
fn build_csv_slow(items: array<Item>) -> string {
    let mut result: string = ""
    let n: int = (array_length items)
    let mut i: int = 0
    while (< i n) {
        let item: Item = (array_get items i)
        # Each concat copies the ENTIRE string so far
        set result (+ result item.name)
        set result (+ result ",")
        set i (+ i 1)
    }
    return result
}
# For 10,000 items: ~50 million character copies!

**Why it's slow:** Each + string concatenation creates a new string and copies all previous characters. With n items, you copy approximately n²/2 characters total.


# ✅ FAST - O(n) time complexity
fn build_csv_fast(items: array<Item>) -> string {
    let mut parts: array<string> = []
    let n: int = (array_length items)
    let mut i: int = 0
    while (< i n) {
        let item: Item = (array_get items i)
        set parts (array_push parts item.name)
        set i (+ i 1)
    }
    # Single join at the end
    return (str_join parts ",")
}
# For 10,000 items: ~10,000 character copies (1000x faster!)

Unnecessary Allocations

**Allocations in hot loops:**


# ❌ Creates new Point every iteration
fn move_particles_slow(particles: array<Particle>) -> void {
    let n: int = (array_length particles)
    let mut i: int = 0
    while (< i n) {
        let p: Particle = (array_get particles i)
        # New Vec2 allocation each iteration
        let velocity: Vec2 = Vec2 { x: p.vx, y: p.vy }
        # ... use velocity ...
        set i (+ i 1)
    }
}

# ✅ Reuse or avoid allocation
fn move_particles_fast(particles: array<Particle>) -> void {
    let n: int = (array_length particles)
    let mut i: int = 0
    while (< i n) {
        let p: Particle = (array_get particles i)
        # Use fields directly - no allocation
        let new_x: float = (+ p.x p.vx)
        let new_y: float = (+ p.y p.vy)
        # ... update particle ...
        set i (+ i 1)
    }
}

Collection Choices

**Choose the right collection for your access pattern:**

PatternBest CollectionWhy
Sequential accessarray<T>Cache-friendly, O(1) index
Key-value lookupHashMap<K,V>O(1) average lookup
Membership testHashMap<T,bool>O(1) contains check
FIFO queuearray<T> with indexO(1) push/pop from end
Sorted dataarray<T> + sortBinary search O(log n)

**Example - membership testing:**


# ❌ O(n) per lookup
fn is_valid_slow(word: string, valid_words: array<string>) -> bool {
    let n: int = (array_length valid_words)
    let mut i: int = 0
    while (< i n) {
        if (== word (array_get valid_words i)) {
            return true
        }
        set i (+ i 1)
    }
    return false
}

# ✅ O(1) per lookup
fn is_valid_fast(word: string, valid_set: HashMap<string, bool>) -> bool {
    return (hashmap_contains valid_set word)
}

24.5 Performance Checklist

Before optimizing, run through this checklist:

  • [ ] **Measure first** - Is performance actually a problem?
  • [ ] **Profile** - Where is time actually spent?
  • [ ] **Check algorithms** - Is Big-O complexity optimal?
  • [ ] **String concatenation** - Using array + join for loops?
  • [ ] **Collection choice** - Right data structure for access pattern?
  • [ ] **Allocations** - Avoiding allocations in hot loops?
  • [ ] **C optimization** - Compiling generated C with -O2/-O3?

24.6 Summary

AspectRecommendation
**Compilation**Use --keep-c to inspect generated code
**Memory**Prefer stack types; let GC handle heap
**Profiling**Always measure before optimizing
**Strings**Never concatenate in loops - use array + join
**Collections**Match collection to access pattern
**Algorithms**Big-O matters more than micro-optimization

**Performance target:** NanoLang programs typically run within 5-15% of equivalent hand-written C code. If your program is significantly slower, check for the common pitfalls described in this chapter.

---

**Previous:** Chapter 23: Higher-Level Patterns

**Next:** Chapter 25: Contributing & Extending