Rules first
Known failure patterns classify deterministically from evidence packets before any model call.
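A minimal sketch of the rules-first idea, assuming an evidence packet can be treated as a list of log lines. The rule labels, the regular expressions, and the `classify` and `model_fallback` names are hypothetical illustrations, not ForgeBeyond's actual rule catalog or API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    label: str
    pattern: re.Pattern


# Hypothetical rules; a real catalog would be maintained by the project.
RULES = [
    Rule("dependency_drift",
         re.compile(r"could not resolve dependency|version conflict", re.I)),
    Rule("timeout", re.compile(r"timed out after \d+", re.I)),
]


def classify(evidence_lines, model_fallback=None):
    """Return (label, source). Rules run first, in a fixed order.

    The model callable is only invoked when no rule matches, so known
    patterns always produce the same answer for the same evidence.
    """
    for rule in RULES:
        for line in evidence_lines:
            if rule.pattern.search(line):
                return rule.label, "rule"
    if model_fallback is not None:
        return model_fallback(evidence_lines), "model"
    return "unknown", "none"
```

Because the rules are checked in a fixed order and involve no sampling, the same evidence always yields the same label, and the model call is skipped entirely for known patterns.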
Flagship case study
ForgeBeyond parses failed runs, separates the primary failure from downstream noise, checks failure memory, and writes a PR/MR-native explanation with confidence and next action.
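One way to separate a primary failure from downstream noise can be sketched as follows. It assumes each pipeline step declares the steps it depends on, and treats a failed step as downstream when any of its ancestors also failed. The `Step` type, the `needs` field, and `split_failures` are hypothetical names chosen for illustration; the source does not describe how ForgeBeyond actually makes this distinction.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    failed: bool
    needs: list[str] = field(default_factory=list)  # names of upstream steps


def split_failures(steps):
    """Return (primary, downstream) lists of failed step names."""
    by_name = {s.name: s for s in steps}

    def has_failed_ancestor(step, seen):
        # Walk the dependency graph upward; `seen` guards against cycles.
        for dep in step.needs:
            if dep in seen or dep not in by_name:
                continue
            seen.add(dep)
            parent = by_name[dep]
            if parent.failed or has_failed_ancestor(parent, seen):
                return True
        return False

    primary, downstream = [], []
    for step in steps:
        if not step.failed:
            continue
        if has_failed_ancestor(step, set()):
            downstream.append(step.name)
        else:
            primary.append(step.name)
    return primary, downstream
```

Under this assumption, a failed `test` step that depends on a failed `build` step is reported as downstream, while `build` is reported as primary.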
Inputs
Outputs
Eval posture
The docs preserve misses, reruns, and ambiguity. Current honest claim: memory recall is stable on repeated contract cases and cuts tokens materially; superiority claims wait until evals beat both no-memory and generic baselines.
Memory retrieved 3/3 repeated cases on every run.
vs 1696.4 avg in no-memory mode
recurrence, dependency drift, wrapper noise, cross-repo breakage
No “memory moat” claim unless evals beat both the no-memory and generic baselines.
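The claim gate above can be expressed as a small check. This is a sketch only: it assumes each mode produces a list of per-case scores where higher is better, and that "beat" means a strictly higher mean. The function name and the scoring convention are hypothetical.

```python
from statistics import mean


def superiority_claim_allowed(memory_scores, no_memory_scores, baseline_scores):
    """Allow a superiority claim only if memory mode beats BOTH comparisons.

    Ties do not count as a win (strict inequality).
    """
    memory_avg = mean(memory_scores)
    return (memory_avg > mean(no_memory_scores)
            and memory_avg > mean(baseline_scores))
```

Beating only one of the two comparisons is not enough under this gate, which matches the stated posture of holding back the claim.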
Architecture
Failure Memory Objects supply prior incidents and fixes, but do not override present-run evidence.
Every result carries evidence strength, pattern match, signal completeness, and classification clarity.
The open-source path stores normalized memories, provenance, and fix summaries rather than raw private logs.
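The memory and confidence points above can be sketched as plain data types. The four confidence field names come from the text; everything else is an assumption for illustration, including the `FailureMemory` shape, the 0-to-1 float scale, and the `explain` function. The key property shown is that the label is taken from present-run evidence, and a memory only contributes a prior fix when it agrees with that label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureMemory:
    # Hypothetical shape: normalized fields only, no raw private logs.
    pattern: str
    fix_summary: str
    provenance: str


@dataclass(frozen=True)
class Confidence:
    # Field names follow the text; the float scale is an assumption.
    evidence_strength: float
    pattern_match: float
    signal_completeness: float
    classification_clarity: float


@dataclass(frozen=True)
class Result:
    label: str
    confidence: Confidence
    prior_fix: Optional[str]


def explain(present_label, confidence, memories):
    """Build a result whose label comes only from present-run evidence.

    A memory can add a prior fix when its pattern agrees with the
    present label; it never changes the label itself.
    """
    match = next((m for m in memories if m.pattern == present_label), None)
    return Result(present_label, confidence,
                  match.fix_summary if match else None)
```

A memory whose pattern disagrees with the present-run label is simply ignored here, which is one simple way to keep memory as context rather than as an override.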