Your CI
remembers now.
When CI breaks, ForgeBeyond tells your team why, where to look, and what to do — using memory-native failure reasoning that remembers every incident, every fix, every regression.
Open source. Install in 2 minutes. forge quickstart
How it works
Four steps. One memory-aware PR comment.
CI fails
GitHub Actions workflow goes red. ForgeBeyond parses logs, test output, and the git diff — JUnit XML, pytest, TAP, raw logs, workflow YAML.
Classify & remember
Classifies the failure into a bounded taxonomy using deterministic rules first. Then searches failure memory by fingerprint, component overlap, subcategory, and text similarity for related prior incidents.
Reason about root cause
Attributes the root cause by correlating the failure with the diff, identifies the file and line, and routes to the likely owner via CODEOWNERS. The confidence score is backed by explicit evidence.
Comment with memory
Posts a PR comment with classification, root cause, fix guidance, and prior incident history — "Previously resolved Feb 3 by @orders-team (commit a1b2c3d). May have regressed."
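The owner-routing step can be pictured with a small sketch. This is an illustration under stated assumptions, not ForgeBeyond's implementation: the function names and the sample CODEOWNERS content are hypothetical, and pattern matching is simplified to root-anchored directory prefixes and basic globs. The one rule taken from GitHub itself is that the last matching pattern wins.

```python
from fnmatch import fnmatch

def parse_codeowners(text):
    """Return (pattern, owners) pairs in file order, skipping blanks and comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules

def matches(pattern, path):
    # Simplification (assumption): every pattern is treated as anchored at the repo root.
    pattern = pattern.lstrip("/")
    if pattern.endswith("/"):
        return path.startswith(pattern)
    return fnmatch(path, pattern)

def owners_for(path, rules):
    """Last matching rule wins, following GitHub's CODEOWNERS precedence."""
    owners = []
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners  # later rules override earlier ones
    return owners

# Hypothetical CODEOWNERS content, for illustration only.
SAMPLE = """
# broad rule first, specific rule later
src/                 @platform-team
src/orders/          @orders-team
tests/integration/   @platform-team
"""

rules = parse_codeowners(SAMPLE)
print(owners_for("src/orders/validate.ts", rules))
print(owners_for("infra/runner.yml", rules))
```

An empty result corresponds to the "no CODEOWNERS match" case, where routing falls back to a human.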
The product
This is what your team sees.
Every analysis lands as a PR comment — no dashboards, no context switching. Toggle between failure types to see how ForgeBeyond adapts.
fingerprint: match, seen 12 times in 30 days
rerun_rate: 92% success on rerun (11 of 12)
pattern: test_checkout_flow intermittent timeout at assertion
source: deterministic (fingerprint + recurrence store)
platform-team: CODEOWNERS match on tests/integration/
rerun_likely_flake: Rerun is safe, but this test has been flaky for 23 days. Consider quarantining.
diff_correlation: new stack trace in changed file
file: src/payments/charge.ts:47
error: TypeError: Cannot read property 'amount' of undefined
fingerprint: no prior match, new failure pattern
source: deterministic (diff correlation + stack trace)
@sarah-chen: PR author, changed files match failing test path
investigate_regression: Your changes likely introduced this. Investigate src/payments/charge.ts:47 before rerunning.
exit_code: 137 (OOM kill)
possible: infrastructure issue, memory limit exceeded
fingerprint: no prior match
note: This is a lead, not a conclusion. Evidence is insufficient for confident classification.
unknown: No CODEOWNERS match, manual routing needed
needs_human_review: Possible infrastructure issue. The evidence is a lead; human review is needed to confirm.
error: KeyError: 'card_token' in test_checkout_flow
diff_correlation: no changes to payment code in this PR
dependency: payments-api field renamed 45 min before failure
source: deterministic (error pattern + upstream change correlation)

| Repo | Change | Author | When | Domains |
|---|---|---|---|---|
| payments-api | Rename card_token to payment_method_id (def4567) | bob@acme.co | 45 min ago | api_surface, schema |
| shared-db | Add index on transactions (e8f9012) | carol@acme.co | 2h ago | migration |

def4567 in payments-api: API-surface change renamed card_token, matching the missing field in the error
bob@acme.co: Upstream owner of payments-api, authored the likely causal change
check_upstream_dependency: Your tests expect card_token, but payments-api renamed it to payment_method_id 45 minutes ago. Coordinate with bob@acme.co.
diff_correlation: removed null check in validateOrder()
file: src/orders/validate.ts:23
error: TypeError: Cannot read property 'items' of null
source: deterministic (diff correlation + stack trace)
b7e2f1a9c3d0 regression detected regression @orders-team a1b2c3d
@orders-team: CODEOWNERS match on src/orders/, same team that resolved previous occurrence
investigate_regression: Known pattern regressed. Check if commit a1b2c3d was reverted. Route to @orders-team who fixed this before.
The memory tab is the core of ForgeBeyond: when a failure resembles a prior incident, it surfaces the full history — who fixed it, when, and how — right in the PR comment. Resolution is detected automatically when tests start passing. The upstream tab shows cross-repo awareness: when an upstream service changes its API, ForgeBeyond links the failure to the exact change and author.
The problem
CI failures waste your team's best hours.
Log spelunking
A build goes red. An engineer opens a 500-line log, scrolls to the error, cross-references with the diff, checks if it happened before. This takes 15–60 minutes. It happens multiple times a day.
Reflexive re-runs
When a build fails ambiguously, the default is to click re-run. This wastes CI compute, delays merges, and hides real regressions behind retry luck.
Zero institutional memory
The same Redis timeout broke builds three times this week. Someone fixed it last month — but nobody remembers who, or what they did. Every occurrence is triaged from scratch because your CI has no memory.
Memory-native failure reasoning
Your CI has amnesia.
Ours doesn't.
Every failure becomes a Failure Memory Object — a semantic record that captures the full story: what broke, why, how often, which component, who fixed it, and what they did. Not a hash. A narrative the system can reason about.
Semantic retrieval — when a new failure occurs, ForgeBeyond searches memory using four strategies simultaneously: fingerprint match, component overlap, subcategory correlation, and text similarity. It finds related incidents even when stack traces drift or error messages change.
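To make the four strategies concrete, here is a minimal sketch of multi-signal retrieval. Everything specific in it is an assumption for illustration: the `FailureMemory` fields, the weights, the threshold, Jaccard overlap for components, and `difflib.SequenceMatcher` as a stand-in for text similarity. It shows the idea, not ForgeBeyond's actual scoring.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

@dataclass
class FailureMemory:
    # Hypothetical shape of a stored incident.
    fingerprint: str
    subcategory: str
    components: frozenset
    summary: str
    resolved_by: Optional[str] = None
    fix_commit: Optional[str] = None

# Assumed weights; a real system would tune these.
WEIGHTS = {"fingerprint": 0.40, "components": 0.25, "subcategory": 0.15, "text": 0.20}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def signals(new, old):
    """Score each strategy independently so the evidence stays inspectable."""
    return {
        "fingerprint": 1.0 if new.fingerprint == old.fingerprint else 0.0,
        "components": jaccard(new.components, old.components),
        "subcategory": 1.0 if new.subcategory == old.subcategory else 0.0,
        "text": SequenceMatcher(None, new.summary, old.summary).ratio(),
    }

def retrieve(new, memory, threshold=0.35):
    """Return (score, per-signal evidence, incident) tuples, best match first."""
    hits = []
    for old in memory:
        evidence = signals(new, old)
        total = sum(WEIGHTS[name] * value for name, value in evidence.items())
        if total >= threshold:
            hits.append((total, evidence, old))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits

memory = [
    FailureMemory("a3f8c2d", "null_dereference", frozenset({"src/orders"}),
                  "TypeError: Cannot read property 'items' of null",
                  resolved_by="@orders-team", fix_commit="a1b2c3d"),
    FailureMemory("9c1d44e", "timeout", frozenset({"src/cache"}),
                  "Redis connection timed out"),
]
# The fingerprint has drifted, but component and subcategory still line up.
new = FailureMemory("b7e2f1a", "null_dereference", frozenset({"src/orders"}),
                    "TypeError: Cannot read properties of null (reading 'items')")
hits = retrieve(new, memory)
```

Because each signal is scored separately, a drifted fingerprint does not hide an incident that still matches on component and subcategory.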
Resolution tracking — when tests start passing again, resolution is recorded automatically. No manual close-out. The next time the same pattern appears, your team sees: "Previously resolved Feb 3 by @orders-team (commit a1b2c3d). May have regressed."
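The resolution lifecycle can be sketched as a small state machine. The class and method names below are hypothetical, as is the year in the example date; the sketch only shows how a passing run could close an incident and how a later failure with the same fingerprint could surface the prior fix.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Incident:
    fingerprint: str
    status: str = "open"  # "open" or "resolved"
    resolved_on: Optional[date] = None
    resolved_by: Optional[str] = None
    fix_commit: Optional[str] = None

class MemoryStore:
    """Hypothetical in-memory store keyed by failure fingerprint."""

    def __init__(self):
        self._incidents = {}

    def record_failure(self, fingerprint):
        """Record a failing run; return a regression note if this pattern was resolved before."""
        incident = self._incidents.get(fingerprint)
        if incident is None:
            self._incidents[fingerprint] = Incident(fingerprint)
            return None
        if incident.status == "resolved":
            when = f"{incident.resolved_on:%b} {incident.resolved_on.day}"
            incident.status = "open"  # the pattern is back
            return (f"Previously resolved {when} by {incident.resolved_by} "
                    f"(commit {incident.fix_commit}). May have regressed.")
        return None

    def record_pass(self, fingerprint, commit, owner, on):
        """Tests pass again: mark the open incident resolved with no manual close-out."""
        incident = self._incidents.get(fingerprint)
        if incident is not None and incident.status == "open":
            incident.status = "resolved"
            incident.resolved_on = on
            incident.resolved_by = owner
            incident.fix_commit = commit

store = MemoryStore()
first = store.record_failure("a3f8c2d")
store.record_pass("a3f8c2d", commit="a1b2c3d", owner="@orders-team", on=date(2025, 2, 3))
note = store.record_failure("a3f8c2d")
print(note)
```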
fingerprint: a3f8c2d environment_infra regression @orders-team a1b2c3d
Trust architecture
Deterministic first. AI second. Always labeled.
Evidence, not vibes.
Every classification is grounded in deterministic signals: fingerprint matches, failure memory retrieval, diff correlation, exit code patterns. Confidence scores come with explicit evidence. Auditable and reproducible.
AI is the fallback, not the headline.
When deterministic methods can't resolve a failure, a bounded AI reasoning step runs on curated, redacted evidence. Its output is always labeled model_assisted.
"I don't know" is a real answer.
When evidence is insufficient, the system says so. Low-confidence outputs use hedged language and include whatever partial evidence was gathered. We never fake certainty.
Most classifications never touch AI. That's by design.
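As a rough sketch of that ordering: deterministic rules run first, an optional model step runs only when they return nothing, and anything below a confidence floor becomes an explicit "needs human review". The rule thresholds, the `Verdict` shape, and the `model` callable are assumptions for illustration; only the labels `model_assisted`, `rerun_likely_flake`, `investigate_regression`, and `needs_human_review` come from the examples above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    label: str
    source: str  # "deterministic" or "model_assisted"
    confidence: float
    evidence: list

def deterministic_rules(signals) -> Optional[Verdict]:
    """Hypothetical rules; each verdict carries the evidence that triggered it."""
    seen = signals.get("fingerprint_seen", 0)
    rerun_rate = signals.get("rerun_success_rate", 0.0)
    if seen >= 3 and rerun_rate >= 0.8:
        return Verdict("rerun_likely_flake", "deterministic", 0.9,
                       [f"seen {seen} times", f"rerun success {rerun_rate:.0%}"])
    if signals.get("stack_frame_in_diff"):
        return Verdict("investigate_regression", "deterministic", 0.85,
                       ["stack trace points into a changed file"])
    return None

def classify(signals, model=None, min_confidence=0.6):
    verdict = deterministic_rules(signals)
    if verdict is None and model is not None:
        # Bounded fallback: `model` is a hypothetical callable that sees redacted evidence.
        label, confidence = model(signals)
        verdict = Verdict(label, "model_assisted", confidence,
                          ["model output on redacted evidence"])
    if verdict is None:
        # No rule fired and no model is configured: report the partial evidence.
        partial = [f"{key}={value}" for key, value in signals.items()]
        return Verdict("needs_human_review", "deterministic", 0.0, partial)
    if verdict.confidence < min_confidence:
        return Verdict("needs_human_review", verdict.source,
                       verdict.confidence, verdict.evidence)
    return verdict

flaky = classify({"fingerprint_seen": 12, "rerun_success_rate": 11 / 12})
unsure = classify({"exit_code": 137})
assisted = classify({"exit_code": 137}, model=lambda s: ("environment_infra", 0.7))
```

Leaving `model` unset mirrors a deterministic-only setup: nothing falls through to AI, and unresolved cases surface as needing review.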
Want to try this on your repo?
Drop your email for early access, or clone the repo and run forge quickstart right now.
Ship-ready
See it work. Then install it.
Try it in 2 minutes.
git clone, make setup, forge quickstart. Runs preflight checks, analyzes 3 real failure scenarios, and shows what ForgeBeyond can do for your repo. Or try forge demo --story for the full memory lifecycle.
GitHub-native delivery.
PR comments and check run annotations via GitHub Actions. forge init generates the workflow. No new dashboard to check — results appear where your team already works.
Redaction by default.
Secrets and tokens stripped before any data reaches AI analysis. Common patterns caught automatically. Least privilege: reads logs, writes comments. No access to code, branches, or settings.
Deterministic-only mode.
For teams that can't use external LLMs: zero data sent to AI providers. Classification runs entirely on deterministic rules. Set deterministic_only: true in config.
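For a sense of what pattern-based redaction involves, here is a minimal sketch. The pattern list is illustrative and far from complete, and the function name and placeholder format are assumptions. The GitHub token and AWS access key ID shapes follow their publicly documented prefixes.

```python
import re

# Illustrative patterns only; a real redactor needs a broader, tested set.
PATTERNS = [
    ("github_token", re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b")),
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("bearer_token", re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]{8,}")),
    ("key_value_secret",
     re.compile(r"(?i)\b(password|secret|token|api_key)\s*[=:]\s*\S+")),
]

def redact(text):
    """Replace matches with labeled placeholders and count what was removed."""
    counts = {}
    for name, pattern in PATTERNS:
        text, replaced = pattern.subn(f"[REDACTED:{name}]", text)
        if replaced:
            counts[name] = replaced
    return text, counts

log = (
    "export API_KEY=abc123\n"
    "Authorization: Bearer eyJhbGciOi.payload.sig\n"
    "using key AKIAABCDEFGHIJKLMNOP for upload\n"
)
clean, counts = redact(log)
print(clean)
```

Returning counts alongside the cleaned text makes it possible to log that redaction happened without logging the secrets themselves.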
Where we're heading
The trajectory.
Every engineering org has institutional knowledge about their failures — it's just trapped in Slack threads and people's heads. ForgeBeyond makes that memory durable and automatic. Today: memory-native CI failure reasoning that classifies, remembers, retrieves, and tracks resolutions. The trajectory: institutional memory across the entire development workflow.
- Memory-native failure reasoning — every failure becomes a semantic Failure Memory Object
- Multi-strategy retrieval (fingerprint, component overlap, subcategory, text similarity)
- Resolution tracking — auto-detects when failures are fixed, records how
- Regression detection — flags when resolved patterns return, with prior fix context
- Deterministic-first classification with bounded LLM fallback (always labeled)
- Deterministic confidence scoring — evidence-grounded, never asks the LLM
- GitHub Actions workflow + PR comments + check run annotations
- CODEOWNERS-based owner routing and cross-service context
- CLI with quickstart, demo (--story, --memory, --all), doctor, init
- Parsers for JUnit XML, pytest, TAP, raw logs, git diff, workflow YAML, CODEOWNERS
- BUILDING Cross-repo recent-change correlation (upstream attribution)
- BUILDING Failure trend dashboards for engineering leads
- BUILDING GitHub App one-click install flow
- BUILDING Pattern velocity alerts (accelerating failure detection)
- FUTURE Multi-CI-provider support (GitLab CI, CircleCI)
- FUTURE Service catalog integration (Backstage, Cortex)
- FUTURE Bounded action execution with policy controls
What we won't build — and why.
General distributed RCA
Requires full topology knowledge we don't have. We're honest about it.
Autonomous remediation
One bad auto-revert destroys all trust. We recommend; you decide.
Observability replacement
We consume signals from Datadog, Grafana, and PagerDuty; we don't compete with them.