LLM-Powered Profiling

NanoLang features an **LLM-powered profiling system** that enables self-optimizing code. When you compile with the -pg flag, your program automatically profiles itself and outputs **structured JSON** that LLMs can analyze to suggest performance improvements.

Overview

This is a **major innovation**: instead of manually interpreting profiler output, you can feed the JSON directly to an LLM along with your source code, and receive concrete optimization suggestions related to your actual code.

**Key Features:**

  • πŸ€– **LLM-Ready Output** - Structured JSON designed for AI analysis
  • πŸ”„ **Self-Improving Loop** - Profile β†’ Analyze β†’ Optimize β†’ Repeat
  • 🌍 **Cross-Platform** - Works on Linux (gprofng) and macOS (xctrace/sample)
  • ⚑ **Zero Configuration** - Just add -pg flag, profiling runs automatically
  • πŸ“Š **Actionable Insights** - Maps hotspots to source code locations

Autonomous Optimization Loop

**NanoLang's profiling system enables LLMs to autonomously optimize your code** without human intervention. This is not just profiling for manual analysis - it's **actionable feedback designed for autonomous optimization agents**.

The Self-Optimization Workflow

An LLM (like Claude) can run this loop autonomously:

1. **Compile** - LLM compiles your program with -pg flag

2. **Run** - LLM executes the program with representative workload

3. **Analyze** - LLM reads platform-neutral JSON profiling output

4. **Identify** - LLM finds statistically significant bottlenecks (>1% runtime)

5. **Optimize** - LLM modifies source code to address performance issues

6. **Repeat** - LLM re-compiles and re-runs

7. **Converge** - Loop continues until no significant improvements remain
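The seven steps above can be sketched as a small driver script. This is an illustrative Python sketch, not a shipped tool: the nanoc path and file names are taken from the examples in this document, and the actual code-modification step (step 5) is left out.

```python
import json
import subprocess

NANOC = "./bin/nanoc"             # assumed compiler path, as used in the examples below
SRC, BIN = "myprogram.nano", "bin/myprogram"
THRESHOLD = 1.0                   # only act on hotspots >1% of runtime (step 4)

def profile_once():
    """Steps 1-3: compile with -pg, run, parse the JSON emitted on stderr."""
    subprocess.run([NANOC, SRC, "-o", BIN, "-pg"], check=True)
    run = subprocess.run([BIN], capture_output=True, text=True)
    return json.loads(run.stderr)

def significant_hotspots(profile):
    """Step 4: keep only statistically significant bottlenecks."""
    return [h for h in profile["hotspots"] if h["pct_time"] > THRESHOLD]
```

An agent would loop: call `profile_once`, stop when `significant_hotspots` is empty or stable, otherwise edit the source and repeat.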

Example Autonomous Session


# LLM autonomously runs:
$ ./bin/nanoc myprogram.nano -o bin/myprogram -pg
$ ./bin/myprogram 2> profile.json

# LLM reads profile.json, identifies ray_sphere_intersect at 89% runtime
# LLM modifies source to eliminate redundant vec3_dot call
# LLM saves changes and repeats:

$ ./bin/nanoc myprogram.nano -o bin/myprogram -pg
$ ./bin/myprogram 2> profile2.json

# LLM reads profile2.json, sees 3x speedup, identifies next hotspot
# Process repeats until profile shows no function >5% runtime

Why This Is Novel

Traditional profilers require humans to:

  • Interpret complex profiler output formats
  • Map assembly/binary symbols back to source code
  • Decide which optimizations to apply
  • Manually edit code and re-test

NanoLang's profiling system provides:

  • **Platform-neutral JSON** - LLMs don't need to understand gprof/Instruments formats
  • **Transparent profiling** - No separate profiling commands, happens automatically
  • **Statistically actionable** - Focuses on hotspots that matter (>1% runtime)
  • **Self-documenting** - analysis_hints guide LLM optimization decisions

**Result:** LLMs can autonomously optimize codebases, achieving 2-10x speedups without human intervention.

Quick Start

1. Compile with Profiling


# Add the -pg flag when compiling
./bin/nanoc myprogram.nano -o bin/myprogram -pg

2. Run Your Program


# Just run it normally - profiling happens automatically!
./bin/myprogram

3. Get LLM-Friendly JSON

The program outputs structured JSON to stderr on exit:


{
  "profile_type": "sampling",
  "platform": "Linux",
  "tool": "gprofng",
  "binary": "./bin/myprogram",
  "hotspots": [
    {
      "function": "process_pixels",
      "samples": 3421,
      "pct_time": 68.4,
      "per_call_us": 1234.5,
      "location": "src/renderer.nano:145"
    },
    {
      "function": "calculate_lighting",
      "samples": 892,
      "pct_time": 17.8,
      "per_call_us": 234.1,
      "location": "src/lighting.nano:67"
    }
  ],
  "analysis_hints": [
    "Functions consuming >10% of runtime are optimization targets",
    "Look for O(nΒ²) algorithms in top hotspots",
    "Consider caching or memoization for frequently-called functions"
  ]
}

4. Feed to LLM for Analysis


# Capture profiling output and send to Claude/GPT
./bin/myprogram 2> profile.json

# Then in your LLM chat:
# "Here's profiling data from my NanoLang program. What optimizations do you recommend?"
# [paste profile.json and relevant source code]
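Rather than pasting by hand, a short script can pull the top hotspots out of the JSON for the prompt. A Python sketch; the field names follow the schema shown above:

```python
import json

def top_hotspots(profile_json: str, n: int = 3):
    """Extract the n hottest functions as (name, pct_time, location) tuples."""
    profile = json.loads(profile_json)
    spots = sorted(profile["hotspots"], key=lambda h: h["pct_time"], reverse=True)
    return [(h["function"], h["pct_time"], h.get("location", "?")) for h in spots[:n]]

# Example using the JSON shown above:
sample = '''{"hotspots": [
  {"function": "process_pixels", "pct_time": 68.4, "location": "src/renderer.nano:145"},
  {"function": "calculate_lighting", "pct_time": 17.8, "location": "src/lighting.nano:67"}
]}'''
print(top_hotspots(sample))   # hottest first
```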

How It Works

Compilation Flags

The -pg flag adds these C compiler options:

| Flag | Purpose |
| --- | --- |
| `-pg` | Enable profiling instrumentation |
| `-g` | Include debug symbols for readable function names |
| `-fno-omit-frame-pointer` | Accurate stack traces |
| `-fno-optimize-sibling-calls` | Clearer call chains |

Platform-Specific Tools

NanoLang automatically uses the right tool for your OS:

**Linux:**

  • Uses gprofng collect (from binutils)
  • **Instrumentation-based profiling** - hooks every function call
  • Captures complete execution trace with exact call counts
  • No special permissions required
  • ~2-3x overhead but comprehensive data

**macOS:**

  • Uses xctrace when a full Xcode installation is present, otherwise falls back to the sample command (⚠️ **Known limitation** of the fallback - see issue nanolang-ftkm)
  • sample is **sampling-based profiling** - periodic stack snapshots (~1ms intervals)
  • Statistical approximation that can miss fast functions
  • No special permissions required on modern macOS
  • Low overhead but incomplete data
  • **Note:** xctrace performs instrumentation-based profiling that matches Linux behavior; see Platform-Specific Notes below
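How the right tool is chosen can be pictured as a small dispatch on the OS name. An illustrative Python sketch of the documented selection logic, not the compiler's actual code:

```python
import platform

def pick_profiler(system: str, has_full_xcode: bool = False) -> str:
    """Mirror the documented selection: gprofng on Linux; on macOS,
    xctrace with a full Xcode install, else the sample fallback."""
    if system == "Linux":
        return "gprofng"
    if system == "Darwin":                 # platform.system() on macOS
        return "xctrace" if has_full_xcode else "sample"
    return "unsupported"

print(pick_profiler(platform.system()))
```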

Automatic Profiling Flow


β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Compile with -pgβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Run program     β”‚
β”‚ (normal usage)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ At exit:        β”‚
β”‚ - Run profiler  β”‚
β”‚ - Parse output  β”‚
β”‚ - Generate JSON β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ JSON β†’ stderr   β”‚
β”‚ (LLM-ready)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Complete Example

Let's profile a raytracer and optimize it using LLM feedback.

Initial Implementation


<!-- SNIPPET: profiling_example_initial -->
/*
 * Simple raytracer - initial implementation
 */

struct Vec3 { x: float, y: float, z: float }
struct Ray { origin: Vec3, direction: Vec3 }
struct Sphere { center: Vec3, radius: float }

fn vec3_dot(a: Vec3, b: Vec3) -> float {
    return (+ (+ (* a.x b.x) (* a.y b.y)) (* a.z b.z))
}

fn vec3_sub(a: Vec3, b: Vec3) -> Vec3 {
    return Vec3 {
        x: (- a.x b.x),
        y: (- a.y b.y),
        z: (- a.z b.z)
    }
}

fn ray_sphere_intersect(ray: Ray, sphere: Sphere) -> bool {
    let oc: Vec3 = (vec3_sub ray.origin sphere.center)
    let a: float = (vec3_dot ray.direction ray.direction)
    let b: float = (* 2.0 (vec3_dot oc ray.direction))
    let c: float = (- (vec3_dot oc oc) (* sphere.radius sphere.radius))
    let discriminant: float = (- (* b b) (* 4.0 (* a c)))
    return (> discriminant 0.0)
}

fn render_pixel(x: int, y: int, spheres: array<Sphere>) -> int {
    # Create ray for this pixel
    let ray: Ray = Ray {
        origin: Vec3 { x: 0.0, y: 0.0, z: 0.0 },
        direction: Vec3 { x: (to_float x), y: (to_float y), z: 1.0 }
    }

    # Check intersection with all spheres
    let sphere_count: int = (array_length spheres)
    let mut i: int = 0
    while (< i sphere_count) {
        let sphere: Sphere = (at spheres i)
        if (ray_sphere_intersect ray sphere) {
            return 1  # Hit
        }
        set i (+ i 1)
    }
    return 0  # Miss
}

fn main() -> void {
    # Create scene with 100 spheres
    let mut spheres: array<Sphere> = (array_new 100)
    let mut i: int = 0
    while (< i 100) {
        set spheres (array_push spheres Sphere {
            center: Vec3 { x: (to_float i), y: (to_float i), z: 10.0 },
            radius: 1.0
        })
        set i (+ i 1)
    }

    # Render 800x600 image
    let mut y: int = 0
    while (< y 600) {
        let mut x: int = 0
        while (< x 800) {
            let hit: int = (render_pixel x y spheres)
            set x (+ x 1)
        }
        set y (+ y 1)
    }

    print "Rendering complete"
}
<!-- END SNIPPET -->

Compile and Profile


./bin/nanoc raytracer.nano -o bin/raytracer -pg
./bin/raytracer 2> profile.json

Profile Output


{
  "profile_type": "sampling",
  "platform": "Linux",
  "tool": "gprofng",
  "binary": "./bin/raytracer",
  "hotspots": [
    {
      "function": "ray_sphere_intersect",
      "samples": 8923,
      "pct_time": 89.2,
      "per_call_us": 0.19,
      "location": "raytracer.nano:18"
    },
    {
      "function": "vec3_dot",
      "samples": 721,
      "pct_time": 7.2,
      "per_call_us": 0.02,
      "location": "raytracer.nano:8"
    },
    {
      "function": "render_pixel",
      "samples": 245,
      "pct_time": 2.5,
      "per_call_us": 0.51,
      "location": "raytracer.nano:26"
    }
  ],
  "analysis_hints": [
    "ray_sphere_intersect dominates execution time (89.2%)",
    "Called 48,000,000 times - optimization target",
    "vec3_dot is hot but cheap per-call (0.02 ΞΌs)"
  ]
}
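The call-count hint falls straight out of the scene geometry: every one of the 800 × 600 pixels tests up to 100 spheres.

```python
# Scene from the example: 800x600 pixels, up to 100 sphere tests per pixel
width, height, sphere_count = 800, 600, 100
calls = width * height * sphere_count
print(calls)  # 48000000, matching the "Called 48,000,000 times" hint
```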

LLM Analysis

Feed the profile and source to your LLM:

**Prompt:**


I'm profiling a NanoLang raytracer. Here's the profiling output and source code.
What optimizations do you recommend?

[paste profile.json and source code]

**LLM Response:**


Based on the profiling data, ray_sphere_intersect consumes 89.2% of runtime.
I see three optimization opportunities:

1. **Eliminate redundant vec3_dot call** - You compute vec3_dot(ray.direction, ray.direction)
   but ray directions are normalized (or should be), so this is always 1.0.

2. **Reduce allocations** - vec3_sub creates a Vec3 on every call. Pass by value is
   efficient for small structs like Vec3.

3. **Early exit** - Current code checks ALL spheres. Use spatial partitioning
   (BVH or grid) to check fewer spheres per ray.

Here's the optimized ray_sphere_intersect:

[optimized code...]

Optimized Implementation


<!-- SNIPPET: profiling_example_optimized -->
# Optimization 1: Assume normalized ray direction (a = 1.0)
fn ray_sphere_intersect_fast(ray: Ray, sphere: Sphere) -> bool {
    let oc: Vec3 = (vec3_sub ray.origin sphere.center)
    # Removed: let a: float = (vec3_dot ray.direction ray.direction)
    let b: float = (* 2.0 (vec3_dot oc ray.direction))
    let c: float = (- (vec3_dot oc oc) (* sphere.radius sphere.radius))
    # Use a = 1.0 directly
    let discriminant: float = (- (* b b) (* 4.0 c))
    return (> discriminant 0.0)
}

# Optimization 2: Inline small vector operations
fn render_pixel_fast(x: int, y: int, spheres: array<Sphere>) -> int {
    # Inline ray creation (avoid struct allocation)
    let ray_dir_x: float = (to_float x)
    let ray_dir_y: float = (to_float y)
    let ray_dir_z: float = 1.0

    let sphere_count: int = (array_length spheres)
    let mut i: int = 0
    while (< i sphere_count) {
        let sphere: Sphere = (at spheres i)
        # Inline intersection check with direct arithmetic
        # (ray_sphere_intersect_inline: fully inlined variant, definition omitted for brevity)
        if (ray_sphere_intersect_inline ray_dir_x ray_dir_y ray_dir_z sphere) {
            return 1
        }
        set i (+ i 1)
    }
    return 0
}
<!-- END SNIPPET -->
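The a = 1.0 shortcut is only valid when the ray direction is unit-length (so a = d·d = 1). A quick Python check that the two discriminants agree once the direction is normalized (the Vec3 math is rewritten in plain floats here):

```python
import math

def discriminant_general(ox, oy, oz, dx, dy, dz, r):
    """Full quadratic: a = d.d, b = 2(oc.d), c = oc.oc - r^2 (oc given directly)."""
    a = dx*dx + dy*dy + dz*dz
    b = 2.0 * (ox*dx + oy*dy + oz*dz)
    c = ox*ox + oy*oy + oz*oz - r*r
    return b*b - 4.0*a*c

def discriminant_fast(ox, oy, oz, dx, dy, dz, r):
    """Optimized form assuming a normalized direction (a = 1.0)."""
    b = 2.0 * (ox*dx + oy*dy + oz*dz)
    c = ox*ox + oy*oy + oz*oz - r*r
    return b*b - 4.0*c

# Normalize a direction, and the two agree to floating-point precision:
dx, dy, dz = 3.0, 4.0, 12.0
n = math.sqrt(dx*dx + dy*dy + dz*dz)   # 13.0
dx, dy, dz = dx/n, dy/n, dz/n
g = discriminant_general(0.0, 0.0, -5.0, dx, dy, dz, 1.0)
f = discriminant_fast(0.0, 0.0, -5.0, dx, dy, dz, 1.0)
print(abs(g - f) < 1e-9)  # True
```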

Re-profile and Verify


./bin/nanoc raytracer_optimized.nano -o bin/raytracer_optimized -pg
time ./bin/raytracer_optimized 2> profile_optimized.json

# Compare:
# Before: 2.4 seconds
# After:  0.8 seconds (3x faster!)

Advanced Usage

Profiling with Multiple Runs

For stable results, profile multiple runs:


#!/bin/bash
# Profile 10 runs and average results

for i in {1..10}; do
    ./bin/myprogram 2>> profiles.jsonl
done

# Each run appends JSON to profiles.jsonl
# Feed all profiles to LLM for statistical analysis
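Averaging the runs is a few lines of Python. A sketch assuming one JSON object per line (as the loop above appends) and the hotspot fields shown earlier:

```python
import json
from collections import defaultdict

def average_hotspots(jsonl_text: str):
    """Average pct_time per function across several profile runs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        for h in json.loads(line)["hotspots"]:
            totals[h["function"]] += h["pct_time"]
            counts[h["function"]] += 1
    return {fn: totals[fn] / counts[fn] for fn in totals}

runs = '\n'.join([
    '{"hotspots": [{"function": "blur", "pct_time": 70.0}]}',
    '{"hotspots": [{"function": "blur", "pct_time": 80.0}]}',
])
print(average_hotspots(runs))  # {'blur': 75.0}
```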

Profiling Specific Functions

Focus on specific code sections:


<!-- SNIPPET: selective_profiling -->
fn expensive_operation() -> int {
    # This function will show up in profile
    let mut sum: int = 0
    let mut i: int = 0
    while (< i 1000000) {
        set sum (+ sum (* i i))
        set i (+ i 1)
    }
    return sum
}

shadow expensive_operation {
    # Profile this function specifically
    let result: int = (expensive_operation)
    assert (> result 0)
}
<!-- END SNIPPET -->

Integrating with CI/CD

Add profiling to your continuous integration:


# .github/workflows/profile.yml
name: Performance Profiling

on: [push]

jobs:
  profile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Compile with profiling
        run: ./bin/nanoc src/main.nano -o bin/app -pg

      - name: Run and profile
        run: ./bin/app 2> profile.json

      - name: Upload profile
        uses: actions/upload-artifact@v2
        with:
          name: profile-data
          path: profile.json

      - name: Comment profile on PR
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const profile = JSON.parse(fs.readFileSync('profile.json'));
            const hotspots = profile.hotspots.slice(0, 5)
              .map(h => `- ${h.function}: ${h.pct_time}%`)
              .join('\n');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Performance Profile\n\n${hotspots}`
            });

Platform-Specific Notes

Linux

**Tool:** gprofng (from binutils 2.39+)

**Installation:**


# Usually pre-installed, but if needed:
sudo apt install binutils  # Debian/Ubuntu
sudo dnf install binutils  # Fedora

**Features:**

  • No special permissions needed
  • Low overhead sampling
  • Detailed call graphs
  • Hardware counter support

macOS

**Tools:** xctrace (preferred) or sample (fallback)

**Profiling Method:**

  • **xctrace**: Instrumentation-based profiling (consistent with Linux gprofng)
    - Requires: Full Xcode installation (not just Command Line Tools)
    - Captures complete execution trace with exact timing
    - Won't miss fast-executing functions
  • **sample**: Sampling-based profiling (fallback)
    - Built-in, no Xcode required
    - Statistical approximation, may miss fast functions
    - Automatically used if xctrace unavailable

**Permissions:**

No special permissions required. NanoLang automatically tries xctrace first, then falls back to sample if needed.


# Just run your program - profiling is automatic!
./bin/myprogram

# To enable xctrace (better profiling):
# 1. Install full Xcode from App Store
# 2. sudo xcode-select --switch /Applications/Xcode.app

# Alternative: Use Instruments GUI for detailed analysis
open -a Instruments
# Then: File β†’ New β†’ Time Profiler

Best Practices

1. Profile Representative Workloads


# ❌ Don't profile with toy data
./bin/image_processor small.jpg

# βœ… Profile with realistic data
./bin/image_processor large_4k_image.jpg

2. Warm Up Before Profiling


<!-- SNIPPET: warmup_profiling -->
fn main() -> void {
    # Warm up caches
    let mut i: int = 0
    while (< i 100) {
        (process_data)
        set i (+ i 1)
    }

    # Now profile the real work (reuse i rather than redeclaring it)
    set i 0
    while (< i 10000) {
        (process_data)
        set i (+ i 1)
    }
}
<!-- END SNIPPET -->

3. Profile Both Debug and Release


# Debug build - see ALL function calls
./bin/nanoc app.nano -o bin/app_debug -pg -g

# Release build - see optimizer impact
./bin/nanoc app.nano -o bin/app_release -pg -O3

# Compare hotspots - optimizer may inline functions

4. Use LLM Context Effectively

When asking LLM for optimization advice:

**Include:**

  • βœ… Profiling JSON
  • βœ… Source code of hot functions
  • βœ… Input data characteristics
  • βœ… Performance requirements
  • βœ… Hardware constraints

**Example prompt:**


I'm optimizing a NanoLang image processing pipeline that must process
1000 images/second on a 4-core CPU. Here's the profiling data showing
gaussian_blur takes 78% of runtime. The images are 1920x1080 RGB.

[paste profile.json and gaussian_blur source]

What SIMD or algorithmic optimizations would you recommend?

5. Iterate and Measure


Profile β†’ Optimize β†’ Profile β†’ Repeat
  ↓                              ↑
  └──────── Verify speedup β”€β”€β”€β”€β”€β”€β”˜

Never assume an optimization worked - always re-profile!
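Re-profiling can be checked mechanically. A Python sketch that diffs two profiles and reports each function's share of runtime before and after:

```python
import json

def compare_profiles(before_json: str, after_json: str):
    """Return {function: (pct_before, pct_after)} for hotspots in the first profile."""
    pct = lambda txt: {h["function"]: h["pct_time"] for h in json.loads(txt)["hotspots"]}
    before, after = pct(before_json), pct(after_json)
    return {fn: (before[fn], after.get(fn, 0.0)) for fn in before}

before = '{"hotspots": [{"function": "ray_sphere_intersect", "pct_time": 89.2}]}'
after = '{"hotspots": [{"function": "ray_sphere_intersect", "pct_time": 41.0}]}'
print(compare_profiles(before, after))
```

Note that pct_time is relative to each run's own total, so confirm wall-clock speedups with `time` as well, as shown earlier.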

Troubleshooting

No profiling output

**Symptom:** Program runs but no JSON appears

**Causes:**

1. Program crashed before exit (profiling runs at exit)

2. stderr redirected incorrectly

3. Profiler tool not available

**Solutions:**


# Check stderr explicitly
./bin/myprogram 2>&1 | tee output.log

# Check for gprofng/sample
which gprofng  # Linux
which sample   # macOS

# Ensure program exits normally
# Add explicit exit code
fn main() -> int {
    # ... your code ...
    return 0  # Ensures clean exit
}

Permission denied (macOS) - Rare

**Symptom:** "sample failed: Permission denied"

**Note:** This error is rare on modern macOS. The sample command typically works without elevated privileges for profiling your own processes.

**If it does occur:**


# Check if your binary is code-signed with restrictive entitlements
codesign -d --entitlements - ./bin/myprogram

# Use Instruments GUI as alternative
open -a Instruments

Profiling overhead too high

**Symptom:** Program much slower with -pg

**Causes:**

  • Linux: gprofng uses instrumentation (2-3x overhead expected)
  • macOS xctrace: instrumentation (2-3x overhead expected)
  • macOS sample: sampling has low overhead (if slow, likely program issue not profiling)

**Solution:**


# Linux: This overhead is expected for comprehensive profiling
# For lower overhead, consider using perf in sampling mode (requires manual setup)

# macOS: If profiling is slow, check your program logic
# sample has very low overhead (~5-10%)

Unreadable function names

**Symptom:** Profiler shows "nl_func_123" instead of real names

**Solution:**


# Ensure debug symbols are included
./bin/nanoc app.nano -o bin/app -pg -g

# Check symbols are present
nm bin/app | grep nl_

JSON Schema Reference

Full schema of profiling output:


{
  "profile_type": "sampling",           // Note: "sampling" is legacy name (Linux actually uses instrumentation)
  "platform": "Linux" | "macOS",        // Detected OS
  "tool": "gprofng" | "sample",         // Profiler used (gprofng=instrumentation, sample=sampling)
  "binary": "./bin/myprogram",          // Path to profiled binary
  "hotspots": [                         // Functions sorted by time
    {
      "function": "function_name",      // Function name
      "samples": 1234,                  // Sample count
      "pct_time": 12.3,                 // % of total time
      "per_call_us": 0.456,             // Microseconds per call (if available)
      "calls": 1000000,                 // Call count (if available)
      "location": "file.nano:123"       // Source location (if available)
    }
  ],
  "analysis_hints": [                   // Guidance for LLM
    "Functions with >10% time are optimization targets",
    "Look for O(nΒ²) algorithms in hot functions"
  ]
}
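A minimal structural check against this schema can catch truncated or malformed output before it reaches an LLM. A sketch; only the required top-level keys and the always-present hotspot fields shown above are validated:

```python
import json

REQUIRED_TOP = {"profile_type", "platform", "tool", "binary", "hotspots", "analysis_hints"}
REQUIRED_HOTSPOT = {"function", "samples", "pct_time"}  # per_call_us/calls/location are optional

def validate_profile(text: str) -> bool:
    """Check the JSON has every required top-level key and hotspot field."""
    profile = json.loads(text)
    if not REQUIRED_TOP <= profile.keys():
        return False
    return all(REQUIRED_HOTSPOT <= h.keys() for h in profile["hotspots"])
```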

Summary

**NanoLang's LLM-powered profiling enables fully autonomous code optimization:**

1. **Compile with -pg** - Add profiling instrumentation

2. **Run normally** - Profiling happens automatically

3. **Get JSON output** - Structured, LLM-ready format

4. **Feed to LLM** - Get concrete optimization advice

5. **Apply changes** - Implement suggested improvements

6. **Re-profile** - Verify performance gains

7. **Repeat** - Continue until convergence (no significant improvements remain)

**This is self-improving code**: the language runtime generates profiling data specifically designed for AI analysis, enabling **fully autonomous performance optimization loops**.

The Autonomous Vision

LLMs can run this entire loop without human intervention:

  • Compile and run programs autonomously
  • Parse platform-neutral JSON (no need to understand gprof/Instruments formats)
  • Identify statistically significant bottlenecks
  • Modify source code to address performance issues
  • Re-compile and validate improvements
  • **Iterate until convergence** - achieve 2-10x speedups autonomously

This is not just "profiling for humans" - it's **actionable feedback for autonomous optimization agents**.

**Key advantages:**

  • 🎯 **Targeted** - Profile shows exactly where time is spent
  • πŸ€– **Fully Autonomous** - LLM handles entire compileβ†’runβ†’optimize loop
  • πŸ”„ **Iterative** - Continues until no significant improvements remain
  • πŸ“Š **Data-Driven** - No guessing, measure everything
  • 🌍 **Platform-Neutral** - JSON abstracts away OS-specific profiler differences

---

**Last Updated:** January 31, 2026

**NanoLang Version:** 2.0+