Overview
This is a **major innovation**: instead of manually interpreting profiler output, you can feed the JSON directly to an LLM along with your source code, and receive concrete optimization suggestions related to your actual code.
**Key Features:**
- 🤖 **LLM-Ready Output** - Structured JSON designed for AI analysis
- 🔄 **Self-Improving Loop** - Profile → Analyze → Optimize → Repeat
- 🌍 **Cross-Platform** - Works on Linux (gprofng) and macOS (xctrace/sample)
- ⚡ **Zero Configuration** - Just add the `-pg` flag; profiling runs automatically
- 📍 **Actionable Insights** - Maps hotspots to source code locations
Autonomous Optimization Loop
**NanoLang's profiling system enables LLMs to autonomously optimize your code** without human intervention. This is not just profiling for manual analysis - it's **actionable feedback designed for autonomous optimization agents**.
The Self-Optimization Workflow
An LLM (like Claude) can run this loop autonomously:
1. **Compile** - LLM compiles your program with -pg flag
2. **Run** - LLM executes the program with representative workload
3. **Analyze** - LLM reads platform-neutral JSON profiling output
4. **Identify** - LLM finds statistically significant bottlenecks (>1% runtime)
5. **Optimize** - LLM modifies source code to address performance issues
6. **Repeat** - LLM re-compiles and re-runs
7. **Converge** - Loop continues until no significant improvements remain
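Steps 4 and 7 boil down to simple selection and convergence checks over the JSON profile. A minimal Python sketch (the thresholds and field names follow the profile format used throughout this guide; the helper names are illustrative):

```python
def significant_hotspots(profile, threshold_pct=1.0):
    """Step 4: hotspots above the significance threshold, hottest first."""
    hot = [h for h in profile["hotspots"] if h["pct_time"] > threshold_pct]
    return sorted(hot, key=lambda h: h["pct_time"], reverse=True)

def converged(profile, stop_pct=5.0):
    """Step 7: stop once no single function exceeds stop_pct of runtime."""
    return all(h["pct_time"] <= stop_pct for h in profile["hotspots"])

profile = {"hotspots": [
    {"function": "ray_sphere_intersect", "pct_time": 89.2},
    {"function": "vec3_dot", "pct_time": 7.2},
    {"function": "render_pixel", "pct_time": 0.4},
]}
targets = significant_hotspots(profile)  # two functions exceed 1%
done = converged(profile)                # False: 89.2% is well above 5%
```

The LLM optimizes `targets[0]` first, re-profiles, and loops until `converged` returns true.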
Example Autonomous Session
# LLM autonomously runs:
$ ./bin/nanoc myprogram.nano -o bin/myprogram -pg
$ ./bin/myprogram 2> profile.json
# LLM reads profile.json, identifies ray_sphere_intersect at 89% runtime
# LLM modifies source to eliminate redundant vec3_dot call
# LLM saves changes and repeats:
$ ./bin/nanoc myprogram.nano -o bin/myprogram -pg
$ ./bin/myprogram 2> profile2.json
# LLM reads profile2.json, sees 3x speedup, identifies next hotspot
# Process repeats until profile shows no function >5% runtime
Why This Is Novel
Traditional profilers require humans to:
- Interpret complex profiler output formats
- Map assembly/binary symbols back to source code
- Decide which optimizations to apply
- Manually edit code and re-test
NanoLang's profiling system provides:
- **Platform-neutral JSON** - LLMs don't need to understand gprof/Instruments formats
- **Transparent profiling** - No separate profiling commands, happens automatically
- **Statistically actionable** - Focuses on hotspots that matter (>1% runtime)
- **Self-documenting** - `analysis_hints` guide LLM optimization decisions
**Result:** LLMs can autonomously optimize codebases, achieving 2-10x speedups without human intervention.
Quick Start
1. Compile with Profiling
# Add the -pg flag when compiling
./bin/nanoc myprogram.nano -o bin/myprogram -pg
2. Run Your Program
# Just run it normally - profiling happens automatically!
./bin/myprogram
3. Get LLM-Friendly JSON
The program outputs structured JSON to stderr on exit:
{
"profile_type": "sampling",
"platform": "Linux",
"tool": "gprofng",
"binary": "./bin/myprogram",
"hotspots": [
{
"function": "process_pixels",
"samples": 3421,
"pct_time": 68.4,
"per_call_us": 1234.5,
"location": "src/renderer.nano:145"
},
{
"function": "calculate_lighting",
"samples": 892,
"pct_time": 17.8,
"per_call_us": 234.1,
"location": "src/lighting.nano:67"
}
],
"analysis_hints": [
"Functions consuming >10% of runtime are optimization targets",
"Look for O(n²) algorithms in top hotspots",
"Consider caching or memoization for frequently-called functions"
]
}
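Because the output is plain JSON, any scripting language can triage it before (or instead of) an LLM call. A small Python sketch using the fields from the example above (the JSON is inlined here for illustration; in practice you would read `profile.json`):

```python
import json

sample = '''{"profile_type": "sampling", "platform": "Linux", "tool": "gprofng",
"binary": "./bin/myprogram",
"hotspots": [{"function": "process_pixels", "samples": 3421, "pct_time": 68.4,
              "per_call_us": 1234.5, "location": "src/renderer.nano:145"}],
"analysis_hints": ["Functions consuming >10% of runtime are optimization targets"]}'''

profile = json.loads(sample)
# Pick the single hottest function as the first optimization target
top = max(profile["hotspots"], key=lambda h: h["pct_time"])
print(f'{top["function"]} at {top["location"]}: {top["pct_time"]}%')
```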
4. Feed to LLM for Analysis
# Capture profiling output and send to Claude/GPT
./bin/myprogram 2> profile.json
# Then in your LLM chat:
# "Here's profiling data from my NanoLang program. What optimizations do you recommend?"
# [paste profile.json and relevant source code]
How It Works
Compilation Flags
The `-pg` flag adds these C compiler options:
| Flag | Purpose |
|---|---|
| `-pg` | Enable profiling instrumentation |
| `-g` | Include debug symbols for readable function names |
| `-fno-omit-frame-pointer` | Accurate stack traces |
| `-fno-optimize-sibling-calls` | Clearer call chains |
Platform-Specific Tools
NanoLang automatically uses the right tool for your OS:
**Linux:**
- Uses `gprofng collect` (from binutils)
- **Instrumentation-based profiling** - hooks every function call
- Captures complete execution trace with exact call counts
- No special permissions required
- ~2-3x overhead but comprehensive data
**macOS:**
- Uses the `sample` command (⚠️ **Known limitation** - see issue nanolang-ftkm)
- **Sampling-based profiling** - periodic stack snapshots (~1ms intervals)
- Statistical approximation that can miss fast functions
- No special permissions required on modern macOS
- Low overhead but incomplete data
- **Note:** We're working on switching to `xctrace` for instrumentation-based profiling to match Linux behavior
Automatic Profiling Flow
┌─────────────────┐
│ Compile with -pg│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Run program   │
│  (normal usage) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ At exit:        │
│ - Run profiler  │
│ - Parse output  │
│ - Generate JSON │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ JSON → stderr   │
│   (LLM-ready)   │
└─────────────────┘
Complete Example
Let's profile a raytracer and optimize it using LLM feedback.
Initial Implementation
<!-- SNIPPET: profiling_example_initial -->
/*
* Simple raytracer - initial implementation
*/
struct Vec3 { x: float, y: float, z: float }
struct Ray { origin: Vec3, direction: Vec3 }
struct Sphere { center: Vec3, radius: float }
fn vec3_dot(a: Vec3, b: Vec3) -> float {
return (+ (+ (* a.x b.x) (* a.y b.y)) (* a.z b.z))
}
fn vec3_sub(a: Vec3, b: Vec3) -> Vec3 {
return Vec3 {
x: (- a.x b.x),
y: (- a.y b.y),
z: (- a.z b.z)
}
}
fn ray_sphere_intersect(ray: Ray, sphere: Sphere) -> bool {
let oc: Vec3 = (vec3_sub ray.origin sphere.center)
let a: float = (vec3_dot ray.direction ray.direction)
let b: float = (* 2.0 (vec3_dot oc ray.direction))
let c: float = (- (vec3_dot oc oc) (* sphere.radius sphere.radius))
let discriminant: float = (- (* b b) (* 4.0 (* a c)))
return (> discriminant 0.0)
}
fn render_pixel(x: int, y: int, spheres: array<Sphere>) -> int {
# Create ray for this pixel
let ray: Ray = Ray {
origin: Vec3 { x: 0.0, y: 0.0, z: 0.0 },
direction: Vec3 { x: (to_float x), y: (to_float y), z: 1.0 }
}
# Check intersection with all spheres
let sphere_count: int = (array_length spheres)
let mut i: int = 0
while (< i sphere_count) {
let sphere: Sphere = (at spheres i)
if (ray_sphere_intersect ray sphere) {
return 1 # Hit
}
set i (+ i 1)
}
return 0 # Miss
}
fn main() -> void {
# Create scene with 100 spheres
let mut spheres: array<Sphere> = (array_new 100)
let mut i: int = 0
while (< i 100) {
set spheres (array_push spheres Sphere {
center: Vec3 { x: (to_float i), y: (to_float i), z: 10.0 },
radius: 1.0
})
set i (+ i 1)
}
# Render 800x600 image
let mut y: int = 0
while (< y 600) {
let mut x: int = 0
while (< x 800) {
let hit: int = (render_pixel x y spheres)
set x (+ x 1)
}
set y (+ y 1)
}
print "Rendering complete"
}
<!-- END SNIPPET -->
Compile and Profile
./bin/nanoc raytracer.nano -o bin/raytracer -pg
./bin/raytracer 2> profile.json
Profile Output
{
"profile_type": "sampling",
"platform": "Linux",
"tool": "gprofng",
"binary": "./bin/raytracer",
"hotspots": [
{
"function": "ray_sphere_intersect",
"samples": 8923,
"pct_time": 89.2,
"per_call_us": 0.19,
"location": "raytracer.nano:18"
},
{
"function": "vec3_dot",
"samples": 721,
"pct_time": 7.2,
"per_call_us": 0.02,
"location": "raytracer.nano:8"
},
{
"function": "render_pixel",
"samples": 245,
"pct_time": 2.5,
"per_call_us": 0.51,
"location": "raytracer.nano:26"
}
],
"analysis_hints": [
"ray_sphere_intersect dominates execution time (89.2%)",
"Called 48,000,000 times - optimization target",
"vec3_dot is hot but cheap per-call (0.02 μs)"
]
}
LLM Analysis
Feed the profile and source to your LLM:
**Prompt:**
I'm profiling a NanoLang raytracer. Here's the profiling output and source code.
What optimizations do you recommend?
[paste profile.json and source code]
**LLM Response:**
Based on the profiling data, ray_sphere_intersect consumes 89.2% of runtime.
I see three optimization opportunities:
1. **Eliminate redundant vec3_dot call** - You compute vec3_dot(ray.direction, ray.direction)
but ray directions are normalized (or should be), so this is always 1.0.
2. **Reduce allocations** - vec3_sub creates a Vec3 on every call. Pass by value is
efficient for small structs like Vec3.
3. **Early exit** - Current code checks ALL spheres. Use spatial partitioning
(BVH or grid) to check fewer spheres per ray.
Here's the optimized ray_sphere_intersect:
[optimized code...]
Optimized Implementation
<!-- SNIPPET: profiling_example_optimized -->
# Optimization 1: Assume normalized ray direction (a = 1.0)
fn ray_sphere_intersect_fast(ray: Ray, sphere: Sphere) -> bool {
let oc: Vec3 = (vec3_sub ray.origin sphere.center)
# Removed: let a: float = (vec3_dot ray.direction ray.direction)
let b: float = (* 2.0 (vec3_dot oc ray.direction))
let c: float = (- (vec3_dot oc oc) (* sphere.radius sphere.radius))
# Use a = 1.0 directly
let discriminant: float = (- (* b b) (* 4.0 c))
return (> discriminant 0.0)
}
# Optimization 2: Inline small vector operations
fn render_pixel_fast(x: int, y: int, spheres: array<Sphere>) -> int {
# Inline ray creation (avoid struct allocation)
let ray_dir_x: float = (to_float x)
let ray_dir_y: float = (to_float y)
let ray_dir_z: float = 1.0
let sphere_count: int = (array_length spheres)
let mut i: int = 0
while (< i sphere_count) {
let sphere: Sphere = (at spheres i)
# Inline intersection check with direct arithmetic
if (ray_sphere_intersect_inline ray_dir_x ray_dir_y ray_dir_z sphere) {
return 1
}
set i (+ i 1)
}
return 0
}
<!-- END SNIPPET -->
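Optimization 1 rests on a small algebraic fact: when `ray.direction` is normalized, `a = vec3_dot(direction, direction) = 1.0`, so the discriminant `b² - 4ac` reduces to `b² - 4c`. A quick numeric sanity check in Python (the scene values are arbitrary illustrations):

```python
import math

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

origin, center, radius = (0.0, 0.0, 0.0), (1.0, 2.0, 10.0), 1.0
d = (1.0, 2.0, 2.0)
n = math.sqrt(dot(d, d))
d = (d[0]/n, d[1]/n, d[2]/n)  # normalize: dot(d, d) is now 1.0

oc = sub(origin, center)
a = dot(d, d)                 # 1.0 for a normalized direction
b = 2.0 * dot(oc, d)
c = dot(oc, oc) - radius * radius
full = b*b - 4.0*a*c          # original discriminant
fast = b*b - 4.0*c            # optimized form with a dropped
assert abs(full - fast) < 1e-9
```

If your rays are not guaranteed to be normalized, normalize them once at creation time rather than reintroducing the `a` term in the hot loop.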
Re-profile and Verify
./bin/nanoc raytracer_optimized.nano -o bin/raytracer_optimized -pg
time ./bin/raytracer_optimized 2> profile_optimized.json
# Compare:
# Before: 2.4 seconds
# After: 0.8 seconds (3x faster!)
Advanced Usage
Profiling with Multiple Runs
For stable results, profile multiple runs:
#!/bin/bash
# Profile 10 runs and average results
for i in {1..10}; do
./bin/myprogram 2>> profiles.jsonl
done
# Each run appends JSON to profiles.jsonl
# Feed all profiles to LLM for statistical analysis
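Once you have `profiles.jsonl`, per-function percentages can be averaged across runs before analysis. A Python sketch (two runs are inlined here; in practice you would iterate over the file's lines):

```python
import json
from collections import defaultdict

runs = [
    '{"hotspots": [{"function": "f", "pct_time": 60.0}, {"function": "g", "pct_time": 30.0}]}',
    '{"hotspots": [{"function": "f", "pct_time": 64.0}, {"function": "g", "pct_time": 26.0}]}',
]  # in practice: for line in open("profiles.jsonl"): ...

totals = defaultdict(list)
for line in runs:
    for h in json.loads(line)["hotspots"]:
        totals[h["function"]].append(h["pct_time"])

averages = {fn: sum(v) / len(v) for fn, v in totals.items()}
# f averages 62.0%, g averages 28.0%
```

Averaging smooths out sampling noise, which matters most for functions near your significance threshold.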
Profiling Specific Functions
Focus on specific code sections:
<!-- SNIPPET: selective_profiling -->
fn expensive_operation() -> int {
# This function will show up in profile
let mut sum: int = 0
let mut i: int = 0
while (< i 1000000) {
set sum (+ sum (* i i))
set i (+ i 1)
}
return sum
}
shadow expensive_operation {
# Profile this function specifically
let result: int = (expensive_operation)
assert (> result 0)
}
<!-- END SNIPPET -->
Integrating with CI/CD
Add profiling to your continuous integration:
# .github/workflows/profile.yml
name: Performance Profiling
on: [push]
jobs:
profile:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Compile with profiling
run: ./bin/nanoc src/main.nano -o bin/app -pg
- name: Run and profile
run: ./bin/app 2> profile.json
- name: Upload profile
uses: actions/upload-artifact@v2
with:
name: profile-data
path: profile.json
- name: Comment profile on PR
uses: actions/github-script@v6
with:
script: |
const fs = require('fs');
const profile = JSON.parse(fs.readFileSync('profile.json'));
const hotspots = profile.hotspots.slice(0, 5)
.map(h => `- ${h.function}: ${h.pct_time}%`)
.join('\n');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `## Performance Profile\n\n${hotspots}`
});
Platform-Specific Notes
Linux
**Tool:** gprofng (from binutils 2.39+)
**Installation:**
# Usually pre-installed, but if needed:
sudo apt install binutils # Debian/Ubuntu
sudo dnf install binutils # Fedora
**Features:**
- No special permissions needed
- Low overhead sampling
- Detailed call graphs
- Hardware counter support
macOS
**Tools:** xctrace (preferred) or sample (fallback)
**Profiling Method:**
- **xctrace**: Instrumentation-based profiling (consistent with Linux gprofng)
- Requires: Full Xcode installation (not just Command Line Tools)
- Captures complete execution trace with exact timing
- Won't miss fast-executing functions
- **sample**: Sampling-based profiling (fallback)
- Built-in, no Xcode required
- Statistical approximation, may miss fast functions
- Automatically used if xctrace unavailable
**Permissions:**
No special permissions required. NanoLang automatically tries xctrace first, then falls back to sample if needed.
# Just run your program - profiling is automatic!
./bin/myprogram
# To enable xctrace (better profiling):
# 1. Install full Xcode from App Store
# 2. sudo xcode-select --switch /Applications/Xcode.app
# Alternative: Use Instruments GUI for detailed analysis
open -a Instruments
# Then: File β New β Time Profiler
Best Practices
1. Profile Representative Workloads
# ❌ Don't profile with toy data
./bin/image_processor small.jpg
# ✅ Profile with realistic data
./bin/image_processor large_4k_image.jpg
2. Warm Up Before Profiling
<!-- SNIPPET: warmup_profiling -->
fn main() -> void {
# Warm up caches
let mut i: int = 0
while (< i 100) {
process_data()
set i (+ i 1)
}
# Now profile the real work
let mut i: int = 0
while (< i 10000) {
process_data()
set i (+ i 1)
}
}
<!-- END SNIPPET -->
3. Profile Both Debug and Release
# Debug build - see ALL function calls
./bin/nanoc app.nano -o bin/app_debug -pg -g
# Release build - see optimizer impact
./bin/nanoc app.nano -o bin/app_release -pg -O3
# Compare hotspots - optimizer may inline functions
4. Use LLM Context Effectively
When asking LLM for optimization advice:
**Include:**
- ✅ Profiling JSON
- ✅ Source code of hot functions
- ✅ Input data characteristics
- ✅ Performance requirements
- ✅ Hardware constraints
**Example prompt:**
I'm optimizing a NanoLang image processing pipeline that must process
1000 images/second on a 4-core CPU. Here's the profiling data showing
gaussian_blur takes 78% of runtime. The images are 1920x1080 RGB.
[paste profile.json and gaussian_blur source]
What SIMD or algorithmic optimizations would you recommend?
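Assembling such a prompt can itself be scripted. A Python sketch (the function name and prompt layout are illustrative, not part of NanoLang's tooling):

```python
def build_prompt(profile, source_snippets, requirements):
    """Combine requirements, a hotspot summary, and hot-function source into one prompt."""
    hot = sorted(profile["hotspots"], key=lambda h: h["pct_time"], reverse=True)
    lines = [requirements, "", "Hotspots:"]
    for h in hot:
        lines.append(f'- {h["function"]}: {h["pct_time"]}% ({h.get("location", "unknown")})')
    for name, code in source_snippets.items():
        lines += ["", f"Source of {name}:", code]
    lines += ["", "What optimizations do you recommend?"]
    return "\n".join(lines)

prompt = build_prompt(
    {"hotspots": [{"function": "gaussian_blur", "pct_time": 78.0,
                   "location": "src/blur.nano:12"}]},
    {"gaussian_blur": "fn gaussian_blur(...) -> void { ... }"},
    "Target: 1000 images/second on a 4-core CPU; images are 1920x1080 RGB.",
)
```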
5. Iterate and Measure
Profile → Optimize → Profile → Repeat
   ↑                           │
   └─────── Verify speedup ────┘
Never assume an optimization worked - always re-profile!
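Verification can be automated by diffing two profiles. A minimal Python sketch:

```python
def compare_profiles(before, after):
    """Per-function pct_time delta between two profiling runs (negative = improvement)."""
    b = {h["function"]: h["pct_time"] for h in before["hotspots"]}
    a = {h["function"]: h["pct_time"] for h in after["hotspots"]}
    return {fn: round(a.get(fn, 0.0) - pct, 1) for fn, pct in b.items()}

before = {"hotspots": [{"function": "ray_sphere_intersect", "pct_time": 89.2}]}
after = {"hotspots": [{"function": "ray_sphere_intersect", "pct_time": 41.0}]}
delta = compare_profiles(before, after)
# ray_sphere_intersect dropped by 48.2 percentage points
```

Note that percentages are relative: a function's share can drop simply because another function got slower, so compare wall-clock time (e.g. via `time`) alongside the profile.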
Troubleshooting
No profiling output
**Symptom:** Program runs but no JSON appears
**Causes:**
1. Program crashed before exit (profiling runs at exit)
2. stderr redirected incorrectly
3. Profiler tool not available
**Solutions:**
# Check stderr explicitly
./bin/myprogram 2>&1 | tee output.log
# Check for gprofng/sample
which gprofng # Linux
which sample # macOS
# Ensure program exits normally
# Add explicit exit code
fn main() -> int {
# ... your code ...
return 0 # Ensures clean exit
}
Permission denied (macOS) - Rare
**Symptom:** "sample failed: Permission denied"
**Note:** This error is rare on modern macOS. The sample command typically works without elevated privileges for profiling your own processes.
**If it does occur:**
# Check if your binary is code-signed with restrictive entitlements
codesign -d --entitlements - ./bin/myprogram
# Use Instruments GUI as alternative
open -a Instruments
Profiling overhead too high
**Symptom:** Program much slower with -pg
**Causes:**
- Linux: gprofng uses instrumentation (2-3x overhead expected)
- macOS xctrace: instrumentation (2-3x overhead expected)
- macOS sample: sampling has low overhead (if slow, likely program issue not profiling)
**Solution:**
# Linux: This overhead is expected for comprehensive profiling
# For lower overhead, consider using perf in sampling mode (requires manual setup)
# macOS: If profiling is slow, check your program logic
# sample has very low overhead (~5-10%)
Unreadable function names
**Symptom:** Profiler shows "nl_func_123" instead of real names
**Solution:**
# Ensure debug symbols are included
./bin/nanoc app.nano -o bin/app -pg -g
# Check symbols are present
nm bin/app | grep nl_
JSON Schema Reference
Full schema of profiling output:
{
"profile_type": "sampling", // Note: "sampling" is legacy name (Linux actually uses instrumentation)
"platform": "Linux" | "macOS", // Detected OS
"tool": "gprofng" | "sample", // Profiler used (gprofng=instrumentation, sample=sampling)
"binary": "./bin/myprogram", // Path to profiled binary
"hotspots": [ // Functions sorted by time
{
"function": "function_name", // Function name
"samples": 1234, // Sample count
"pct_time": 12.3, // % of total time
"per_call_us": 0.456, // Microseconds per call (if available)
"calls": 1000000, // Call count (if available)
"location": "file.nano:123" // Source location (if available)
}
],
"analysis_hints": [ // Guidance for LLM
"Functions with >10% time are optimization targets",
"Look for O(n²) algorithms in hot functions"
]
}
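Before feeding a profile to an LLM or a CI gate, it's worth checking that the required fields are present. A Python sketch of a minimal validator against this schema (treating `per_call_us`, `calls`, and `location` as optional, as noted above):

```python
REQUIRED_TOP = {"profile_type", "platform", "tool", "binary", "hotspots", "analysis_hints"}
REQUIRED_HOTSPOT = {"function", "samples", "pct_time"}  # other hotspot fields are optional

def validate_profile(profile):
    """True if every required top-level and per-hotspot field is present."""
    if not REQUIRED_TOP <= profile.keys():
        return False
    return all(REQUIRED_HOTSPOT <= h.keys() for h in profile["hotspots"])

ok = validate_profile({
    "profile_type": "sampling", "platform": "Linux", "tool": "gprofng",
    "binary": "./bin/myprogram",
    "hotspots": [{"function": "f", "samples": 10, "pct_time": 1.0}],
    "analysis_hints": [],
})  # True
```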
Summary
**NanoLang's LLM-powered profiling enables fully autonomous code optimization:**
1. **Compile with -pg** - Add profiling instrumentation
2. **Run normally** - Profiling happens automatically
3. **Get JSON output** - Structured, LLM-ready format
4. **Feed to LLM** - Get concrete optimization advice
5. **Apply changes** - Implement suggested improvements
6. **Re-profile** - Verify performance gains
7. **Repeat** - Continue until convergence (no significant improvements remain)
**This is self-improving code**: the language runtime generates profiling data specifically designed for AI analysis, enabling **fully autonomous performance optimization loops**.
The Autonomous Vision
LLMs can run this entire loop without human intervention:
- Compile and run programs autonomously
- Parse platform-neutral JSON (no need to understand gprof/Instruments formats)
- Identify statistically significant bottlenecks
- Modify source code to address performance issues
- Re-compile and validate improvements
- **Iterate until convergence** - achieve 2-10x speedups autonomously
This is not just "profiling for humans" - it's **actionable feedback for autonomous optimization agents**.
**Key advantages:**
- 🎯 **Targeted** - Profile shows exactly where time is spent
- 🤖 **Fully Autonomous** - LLM handles entire compile→run→optimize loop
- 🔄 **Iterative** - Continues until no significant improvements remain
- 📊 **Data-Driven** - No guessing, measure everything
- 🌍 **Platform-Neutral** - JSON abstracts away OS-specific profiler differences
See Also
- **Performance Guide** - General performance optimization
- **Debugging Guide** - Debugging techniques
- **AGENTS.md** - Full profiling documentation
- **examples/advanced/performance_optimization.nano** - More examples
---
**Last Updated:** January 31, 2026
**NanoLang Version:** 2.0+