Optimizer Integrity Bench · v0.1

AI optimizers ship code that looks thousands of times faster.

The code is wrong.

Tripwire is a layered, adversarial-by-design correctness oracle that catches reward-hacks before they ship — a drop-in OpenEvolve evaluator that grades on speed only after correctness is proven.

candidate · memorizedtokenizer

def solve(text):
    if text in _CANON:
        return _CANON[text]   # memorized
    return []                 # wrong elsewhere

scorecard · hacks shipped

Naive · bitwise9/13

Naive · tolerance13/13

Tripwire · layered0/13

0hacks shipped
by the layered oracle

kept FP win · matmul

0×

a bitwise oracle throws this away

kept speedups · log scale

matmul

matvec

dot

sum_reduction

tokenizer

The anchor paper · COMPILOT

The model proposes. The compiler verifies.

Tripwire is inspired by COMPILOT (PACT '25): an off-the-shelf LLM proposes loop schedules while the Tiramisu polyhedral compiler owns legality. The model explores — the verifier guarantees correctness. Here is what it found.

Read the paper PACT '25 · arXiv:2511.00592

“Leverage the LLM for strategic exploration while entrusting the compiler with formal correctness — ensuring reliability without brittle output comparisons.”
RQ7 — the principle Tripwire generalizes: delegate correctness to a rigorous verifier, never trust the model to be correct.
0.0%of “passing,” faster LLM-written transforms were actually wrong under fresh random inputs.

Exploration

40 runs, 40 different paths to a speedup.

Each run is a distinct conversation. Most plateau early; a few keep climbing past 10×. The loop is a stochastic search — which is exactly why a single correctness gate has to hold for every path it takes.

Fig. 21 — speedup trajectoriesgramschmidt_LARGE · 40 runs

Cost & convergence

More iterations and runs converge to bigger speedups — at a token cost.

Fig. 9 — token consumptionsuper-linear in T

Token use grows super-linearly with iterations — ~200k by T=30. Verification is cheap; exploration is what you pay for.

Fig. 16 — best-of-K @ Tspeedup surface

Best-of-K speedup climbs with both runs (K) and iterations (T) — the surface warms from teal to orange as the search gets more chances to break through.

2.1×

7.4×

Why you need a verifier

Two-thirds of what the model proposes never even runs.

Across 30 dialogue iterations, only ~36% of the LLM's proposed schedules are runnable — the rest are invalid or illegal. That is the entire case for Tripwire: if the model is wrong two times out of three, correctness cannot be something you trust. It has to be something you verify.

Fig. 19 — schedule viability over iterationsrunnable · illegal · invalid

Runnable 36.1%Illegal 32.9%Invalid 31%

0.0%runnable schedules (avg)

0.0%illegal — break semantics

0.0%invalid — malformed

Results

2.66× single-run. 3.54× best-of-5. Beats the SOTA polyhedral optimizer.

0.00×geomean speedup, single run (COMPILOT@30)

0.00×geomean, best-of-5 runs

0.00×geomean over Pluto (SOTA)

119 / 150instances where it beats Pluto

Fig. 7 — speedup per benchmark150 instances · log scale

MINISMALLMEDIUMLARGEXLARGE

Fig. 8 — geomean by sizebigger inputs, bigger wins

1.05×

MINI

1.62×

SMALL

2.61×

MEDIUM

4.90×

LARGE

7.20×

XLARGE

geometric mean speedup, aggregated by input size

Some kernels break 100×.

Aggressive parallelization, tiling and unrolling on the largest inputs push a handful of kernels past 100× — correlation and covariance clear 400×. Off-the-shelf LLMs, zero fine-tuning, grounded only by compiler feedback.

correlation 430×covariance 455×3mm 205×trmm 185×syr2k 220×

The agentic loop

An off-the-shelf model, grounded by compiler feedback.

The agent only proposes schedules. The compiler checks legality, generates code, runs it, and returns feedback — action, observation, repeat. This is the loop behind every number above, and the loop Tripwire makes trustworthy.

Fig. 1 — the agentic loopaction / observation

action space · 9 transformations

FusionInterchangeParallelize2D Tiling3D TilingUnrollingSkewingShiftingReversal

feedback categories

InvalidIllegalSolver failureCompiler crashSuccessful execution

§ 01 — the thesis

A naive oracle ships the reward-hack and discards the real win. Tripwire is the only one right on both axes.

See the bench Or skip to Claude in the loop →

speed is measured only after correctness is proven

False positive

It ships the reward-hack

A candidate that memorizes or special-cases the visible test inputs is correct on exactly those inputs and wrong everywhere else — and it looks almost infinitely fast. A bitwise or a tolerance oracle ships it. This is the documented Sakana CUDA-Engineer mirage.

0/13

hacks a tolerance oracle ships

False negative

It discards the real win

A correct, faster candidate — vectorization, a reordered reduction, an FMA — shifts floating-point results in the low bits. A bitwise oracle rejects a genuine win. Same answer, wrong oracle.

of real wins a bitwise oracle throws away

The layered fix

Zero hacks. Every real win kept.

tokenizer

serde

sum_reduction

dot

matvec

matmul

sql

0%of real wins kept · 0 hacks shipped

reward-hacks shipped by the layered oracle

of real floating-point wins kept

Across 20 candidates and 7 targets, the layered oracle scored integrity 1.00 — the only oracle that earned it.

bench.run exits non-zero if this ever stops holding

§ 02 — the mechanism

Four layers, each catching a specific failure mode.

The order is fixed. A correctness layer that fails short-circuits the rest with the failing layer named — so the loop gets precise feedback, and a reward-hack earns zero reward.

Read oracle.py The threat model

tripwire/oracle.py

Canonical correctness

Is the answer the same on the test inputs?

Anything wrong on the inputs it was tested on. Exact for structural targets; tolerance for numeric ones — correct vectorization changes the low bits, so bitwise here would discard real speedups.

for args in canonical_args:
    if not close_equal(reference(*args), candidate(*args)):
        return REJECTED("L1 canonical mismatch")

Metamorphic / property

Does it obey invariants the real computation must satisfy?

Candidates that pass the visible inputs but violate a known relation — scale-equivariance of a sum, parse↔serialize round-trip, count-conservation of a tokenizer. Cheap, total, relational.

for name, prop in target.properties:
    for args in canonical_args + withheld_args:
        if not prop(args, candidate(*args)):
            return REJECTED(f"L2 property '{name}' violated")

L3the moat

Differential on withheld inputs

Is it still correct on adversarial inputs it has never seen?

Memorization. Skip-the-work. Distribution-conditioned wrongness. L3 re-checks against the reference on a fixed adversarial set AND fresh generative draws under new OS-entropy seeds each run — you cannot overfit to inputs you cannot see. This is the moat.

for args in withheld_args + generative_withheld(target):
    if not close_equal(reference(*args), candidate(*args)):
        return REJECTED("L3 withheld-input differential mismatch")

Isolated speedup

Is the speed real, after correctness has been proven?

Phantom improvements from timing noise. Near-infinite 'speedups' (a red flag, not a winner). L4 measures warmed-up, best-of-N across shapes, with a variance lower bound — and only a candidate that already passed L1–L3 is ever timed.

# only reached after L1-L3 pass
return PASSED(speedup=measure_time(reference) / measure_time(candidate))

Every attack maps to the layer that catches it.

red-team: 9/9 caught · naive shipped 5

Memorize the test set

caught · L3

Constant return (instant)

caught · L1

Skip half the work

caught · L2

Shape-conditioned wrongness

caught · L3

Phantom speedup (noise)

caught · L4

Correct FP vectorization

kept · real win

§ 03 — the proof

Claude in an OpenEvolve loop, judged by the layered oracle.

The anchor paper evaluates eight LLMs as optimization agents — never an Anthropic model. We wire Tripwire to OpenEvolve, point the loop at Claude Opus 4.8, and let it optimize a numeric kernel.

See the runner

0×best speedup
at iteration 5

10iterations, all correct

4/4layers cleared, every step

COMPILOT-inspired, not a reproduction. COMPILOT optimizes C loop nests via the Tiramisu polyhedral compiler; we optimize Python via the empirical layered oracle. We reproduce the principle (RQ7: delegate correctness to a rigorous verifier), not the system.

runs/target-zero.jsonl · Claude Opus 4.8

186.3×iter 1 · gen 1 · island 0

layered: passed all layers

child · solve

import numpy as np

def solve(arr):
    """Sum a 1-D array of floats via vectorized numpy."""
    return float(np.asarray(arr, dtype=np.float64).sum())