The code is wrong.
Tripwire is a layered, adversarial-by-design correctness oracle that catches reward-hacks before they ship — a drop-in OpenEvolve evaluator that grades on speed only after correctness is proven.
def solve(text):
if text in _CANON:
return _CANON[text] # memorized
return [] # wrong elsewherea bitwise oracle throws this away
Tripwire is inspired by COMPILOT (PACT '25): an off-the-shelf LLM proposes loop schedules while the Tiramisu polyhedral compiler owns legality. The model explores — the verifier guarantees correctness. Here is what it found.
“Leverage the LLM for strategic exploration while entrusting the compiler with formal correctness — ensuring reliability without brittle output comparisons.”
RQ7 — the principle Tripwire generalizes: delegate correctness to a rigorous verifier, never trust the model to be correct.
0.0%of “passing,” faster LLM-written transforms were actually wrong under fresh random inputs.
Each run is a distinct conversation. Most plateau early; a few keep climbing past 10×. The loop is a stochastic search — which is exactly why a single correctness gate has to hold for every path it takes.
Token use grows super-linearly with iterations — ~200k by T=30. Verification is cheap; exploration is what you pay for.
Best-of-K speedup climbs with both runs (K) and iterations (T) — the surface warms from teal to orange as the search gets more chances to break through.
Across 30 dialogue iterations, only ~36% of the LLM's proposed schedules are runnable — the rest are invalid or illegal. That is the entire case for Tripwire: if the model is wrong two times out of three, correctness cannot be something you trust. It has to be something you verify.
geometric mean speedup, aggregated by input size
Aggressive parallelization, tiling and unrolling on the largest inputs push a handful of kernels past 100× — correlation and covariance clear 400×. Off-the-shelf LLMs, zero fine-tuning, grounded only by compiler feedback.
The agent only proposes schedules. The compiler checks legality, generates code, runs it, and returns feedback — action, observation, repeat. This is the loop behind every number above, and the loop Tripwire makes trustworthy.
action space · 9 transformations
feedback categories
speed is measured only after correctness is proven
A candidate that memorizes or special-cases the visible test inputs is correct on exactly those inputs and wrong everywhere else — and it looks almost infinitely fast. A bitwise or a tolerance oracle ships it. This is the documented Sakana CUDA-Engineer mirage.
hacks a tolerance oracle ships
A correct, faster candidate — vectorization, a reordered reduction, an FMA — shifts floating-point results in the low bits. A bitwise oracle rejects a genuine win. Same answer, wrong oracle.
of real wins a bitwise oracle throws away
reward-hacks shipped by the layered oracle
of real floating-point wins kept
Across 20 candidates and 7 targets, the layered oracle scored integrity 1.00 — the only oracle that earned it.
The order is fixed. A correctness layer that fails short-circuits the rest with the failing layer named — so the loop gets precise feedback, and a reward-hack earns zero reward.
Is the answer the same on the test inputs?
Anything wrong on the inputs it was tested on. Exact for structural targets; tolerance for numeric ones — correct vectorization changes the low bits, so bitwise here would discard real speedups.
for args in canonical_args:
if not close_equal(reference(*args), candidate(*args)):
return REJECTED("L1 canonical mismatch")Does it obey invariants the real computation must satisfy?
Candidates that pass the visible inputs but violate a known relation — scale-equivariance of a sum, parse↔serialize round-trip, count-conservation of a tokenizer. Cheap, total, relational.
for name, prop in target.properties:
for args in canonical_args + withheld_args:
if not prop(args, candidate(*args)):
return REJECTED(f"L2 property '{name}' violated")Is it still correct on adversarial inputs it has never seen?
Memorization. Skip-the-work. Distribution-conditioned wrongness. L3 re-checks against the reference on a fixed adversarial set AND fresh generative draws under new OS-entropy seeds each run — you cannot overfit to inputs you cannot see. This is the moat.
for args in withheld_args + generative_withheld(target):
if not close_equal(reference(*args), candidate(*args)):
return REJECTED("L3 withheld-input differential mismatch")Is the speed real, after correctness has been proven?
Phantom improvements from timing noise. Near-infinite 'speedups' (a red flag, not a winner). L4 measures warmed-up, best-of-N across shapes, with a variance lower bound — and only a candidate that already passed L1–L3 is ever timed.
# only reached after L1-L3 pass return PASSED(speedup=measure_time(reference) / measure_time(candidate))
red-team: 9/9 caught · naive shipped 5
The anchor paper evaluates eight LLMs as optimization agents — never an Anthropic model. We wire Tripwire to OpenEvolve, point the loop at Claude Opus 4.8, and let it optimize a numeric kernel.
COMPILOT-inspired, not a reproduction. COMPILOT optimizes C loop nests via the Tiramisu polyhedral compiler; we optimize Python via the empirical layered oracle. We reproduce the principle (RQ7: delegate correctness to a rigorous verifier), not the system.
import numpy as np
def solve(arr):
"""Sum a 1-D array of floats via vectorized numpy."""
return float(np.asarray(arr, dtype=np.float64).sum())