AI-Powered Automated Patching
for Software Vulnerabilities
A Research Perspective on Automated Patch Generation under Human Supervision
Abstract
We've gotten really good at finding vulnerabilities — fuzzing, sanitizers, static analysis — the tooling works. What hasn't caught up is fixing them. Remediation is still overwhelmingly manual, painfully slow, and bottlenecked by a shortage of people who actually understand the code well enough to write safe patches. The result? Bugs sit open for weeks. Security debt piles up. Exposure windows grow.
This paper asks a straightforward question: can LLMs help write patches — not autonomously, not as a replacement for human judgment, but as a tool that drafts candidate fixes for a human to review? We designed a hybrid pipeline around that idea: the model proposes, automated tests validate, and a person approves. The reasoning here is architectural, not benchmarked — we're interested in how to build this safely, not in chasing accuracy numbers on a leaderboard.
What we found is roughly what you'd expect if you've worked with LLMs on real code: they're genuinely useful for structurally simple bugs (missing bounds checks, uninitialized variables), but they introduce real risks — subtle semantic changes, logic drift, patches that pass tests but weaken security. We also tried several "obvious" approaches that failed outright, which reinforced our belief that conservative, human-supervised design isn't just safer — it's the only thing that actually works.
Keywords
automated remediation, vulnerability patching, secure software engineering, LLMs, human-in-the-loop systems, semantic security, CI/CD, operational security.
I. Introduction
Here's the irony of modern security tooling: we can find thousands of bugs a day, but we still fix them one at a time, by hand. Sanitizers catch memory errors. Fuzzers surface edge cases. Static analyzers flag patterns. The detection side of the pipeline has scaled beautifully. The remediation side? It's still a person staring at a stack trace on a Friday afternoon.
That gap — between what we can detect and what we can actually fix — isn't just inconvenient. It's a real security problem. Bugs queue up. Teams triage by severity and ignore the rest. Exposure windows stretch from days to months. And the people qualified to write safe patches are always outnumbered by the bugs waiting for them.
We wanted to know whether LLMs could narrow that gap. Not by replacing developers — that's a fantasy — but by generating draft patches that a human can review, accept, or throw away. Think of it as a first pass, not a final answer.
II. Historical Context: Evolution of Automated Remediation
People have been trying to automate the "fix it" part of security for at least twenty years now. The approaches have changed a lot, but the core frustration hasn't.
Evolution of Automated Remediation
- 2000s, rule-based repair: template fixes
- 2010s, program synthesis: formal methods
- 2020s, ML detection: pattern recognition
- 2024+, LLM patching: context-aware generation

LLM-assisted remediation is the next evolutionary step.
The earliest attempts were rule-based — essentially search-and-replace at a slightly higher level. They worked for narrow bug classes but crumbled in real codebases where bugs don't follow templates.
Then came program synthesis and constraint-solving: mathematically rigorous, provably correct, and almost impossibly expensive to run at scale. Beautiful theory, limited practical reach.
More recently, ML-based approaches showed up, but they mostly focused on finding bugs rather than fixing them. LLMs changed the game in a specific way: for the first time, we have tools that can read surrounding code, understand (or at least approximate) context, and generate patches that a human can actually read and evaluate.
That's the space we're working in — not fully autonomous repair, but a practical middle ground between "fix everything manually" and "let the machine handle it."
III. Problem Definition and Research Questions
The problem isn't complicated to state: detection has outpaced remediation by an order of magnitude. CI pipelines surface hundreds or thousands of sanitizer findings per release. Most of them sit in a backlog. Teams don't ignore them out of negligence — they just don't have enough hands. Low-severity bugs get deprioritized. Medium-severity bugs get deferred. And eventually, some of them get exploited.
The question we kept coming back to:
Can LLMs help fix vulnerabilities without making things worse — and without lulling developers into trusting them too much?
Research Questions
- RQ1 (Feasibility): Which vulnerability classes are most amenable to automated patching?
- RQ2 (Effectiveness): How effective are LLM-generated patches under realistic validation?
- RQ3 (Risk Analysis): What failure modes emerge when LLMs assist remediation?
- RQ4 (Design): How can human-in-the-loop design mitigate automation bias?
Rather than proposing a fully autonomous system, this research deliberately investigates assisted remediation as a more realistic and defensible security paradigm.
We deliberately stayed away from the "fully autonomous" framing. It sounds impressive in a demo, but in practice it means nobody's checking the output. Assisted remediation — where the model suggests and a human decides — felt like the only honest approach.
IV. Scope, Constraints, and Non-Goals
We drew hard lines around what this project tries to do and what it doesn't. Overclaiming is the norm in AI research, and we wanted to avoid it.
Scope
- Post-merge runtime vulnerabilities
- Server-side & system software
- C/C++, Rust, Java ecosystems
Constraints
- LLM context window limits
- Non-deterministic model output
- Human accountability required
Non-Goals
- Replace security engineers
- Auto-merge to production
- Detect vulnerabilities
- Adversarial prompt defense
By clearly stating these non-goals, the system avoids overclaiming capabilities and maintains a conservative security posture.
Listing what we won't do isn't modesty — it's load-bearing. Every non-goal is a boundary that keeps the system from becoming something dangerous.
V. Research Methodology
We didn't run a benchmark suite or fine-tune a model. That wasn't the point. We wanted to understand how to build this thing safely — what the architecture should look like, where it breaks, and what happens when the LLM gets it wrong.
The methodology is design-oriented: we reason about architectures, enumerate failure modes, and model security threats. It's model-agnostic on purpose — the specific LLM doesn't matter; the pipeline structure does.
Research Methodology Stages
- Stage 1, Workflow Decomposition: remediation pipelines decomposed into discrete stages
- Stage 2, Automation Feasibility Analysis: each stage evaluated for LLM automation viability
- Stage 3, Failure Mode Enumeration: systematic identification of AI-introduced failure scenarios
- Stage 4, Architectural Synthesis: human-in-the-loop architecture designed for security
We cared more about whether this could work in a real CI/CD pipeline than whether it hits some accuracy threshold on a curated dataset.
VI. Design Alternatives and Trade-Off Analysis
Before landing on the current design, we seriously considered — and rejected — two other approaches.
Design Alternatives Evaluated
- Fully Autonomous: high semantic risk (rejected)
- Suggestion-Only: minimal effort savings (rejected)
- Constrained Generation: efficiency plus governance (selected)
Comparative Analysis with Existing Approaches
Comparative Analysis Framework
Approach                           | Automation | Scalability | Risk      | Effort
Manual Patching                    | None       | Low         | Low       | High
Rule-Based Fixes                   | Partial    | Medium      | Medium    | Medium
Synthesis-Based                    | High       | Low         | Low       | Very High
LLM-Assisted (Proposed, this work) | Partial    | High        | Mitigated | Low

The proposed approach occupies a previously underexplored middle ground.
What this comparison shows is that we're not competing on automation level — we're occupying a different part of the design space entirely. One that treats human oversight as a feature, not a workaround.
VII. Proposed AI-Assisted Remediation Architecture
Here's what we actually built — or more precisely, what we designed the system to look like when deployed inside a real pipeline.
Three rules we refused to compromise on:
- No auto-merge, ever. Every patch gets seen by a human before it hits the codebase. No exceptions.
- Treat generation as probabilistic. The model might get it right on the third try, or never. Build for that.
- Minimize what the model sees. The more context you feed it, the more it hallucinates. Give it the crash, the relevant function, and nothing else.
Detection (sanitizers) → Isolation (context) → Patch Generation (LLM) → Validation (CI/CD) → Review (audit) → Approval (merge)

Figure 1: AI-Assisted Vulnerability Remediation Pipeline
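The stage sequence in Figure 1 can be sketched as a linear pipeline with a hard stop before merge. A minimal orchestration sketch in Python; every stage name here (`isolate`, `generate`, and so on) is a hypothetical placeholder injected as a callable, not a real API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A sanitizer-reported vulnerability plus its captured metadata."""
    stack_trace: str
    source_file: str

def run_pipeline(finding, isolate, generate, validate, request_human_review):
    """Drive one finding through the stages of Figure 1.

    Each stage is injected as a callable so the sketch stays model-
    and toolchain-agnostic. Note the deliberate absence of any
    auto-merge step: the pipeline *ends* at human review.
    """
    context = isolate(finding)               # Isolation: minimal reproducing context
    candidates = generate(finding, context)  # Patch generation: several candidates
    validated = [p for p in candidates if validate(p)]  # Validation: CI gates
    if not validated:
        return ("escalate", None)            # nothing survived; a human takes over
    return ("review", request_human_review(validated))  # human decides; never auto-merge
```

The return value is a status plus a patch, never a merge action; whatever consumes this result can open a review request but cannot bypass the human checkpoint.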
VIII. Pipeline Components
A. Vulnerability Detection
This is the part that already works well. Sanitizers catch the bug at runtime, and we capture everything useful — stack traces, execution context, the failing assertion. This metadata feeds the rest of the pipeline.
B. Bug Isolation and Context Reduction
This step turned out to be more important than we expected. You'd think feeding the model more code would help. It doesn't. Too much context leads to confused patches, hallucinated functions, and fixes that reference code that doesn't exist.
Our context reduction process:
- Crash-Centric Extraction: Only code paths directly involved in the sanitizer-reported failure are included.
- Dependency Pruning: Unrelated helper functions and utilities are removed unless directly referenced.
- Semantic Anchoring: Function signatures, type definitions, and invariants are preserved.
- Test Harness Alignment: Context is aligned with a minimal reproducing test.
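The four steps above can be sketched in a few lines. A simplified Python rendering, assuming the codebase has already been indexed as a mapping from function name to source text (that indexing, and the `reduce_context` name itself, are our hypothetical scaffolding):

```python
def reduce_context(crash_path_functions, codebase, type_defs):
    """Crash-centric context reduction (illustrative sketch).

    crash_path_functions: names appearing in the sanitizer stack trace
    codebase: dict mapping function name -> source text
    type_defs: type definitions and invariants, always preserved
                (the "semantic anchoring" step)
    """
    # Crash-centric extraction + dependency pruning: keep only
    # functions on the reported crash path, drop everything else.
    kept = {name: src for name, src in codebase.items()
            if name in crash_path_functions}
    # Anchors first, then the pruned code, as one prompt-ready string.
    return "\n\n".join(type_defs) + "\n\n" + "\n\n".join(kept.values())
```

A real implementation would walk the call graph rather than match names, but the invariant is the same: the model sees the crash path and the type-level anchors, and nothing else.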
C. Patch Generation Using LLMs
Once we've isolated the vulnerable code, we hand it to the model with a structured prompt. One thing we learned early: generating a single patch and hoping it works is naive. We generate multiple candidates (typically 3–5) because the variance between attempts is surprisingly high.
1. Codebase → LLM: vulnerable context + metadata
2. LLM → Validator: Candidate Patch A, Candidate Patch B
3. Validator: run validation suite
4. On PASS: patch is ready for human review

Figure 2: Multi-Candidate Patch Generation Sequence
D. Patch Generation Algorithm
Algorithm 1: Patch Generation

ALGORITHM: LLM_PATCH_GENERATION(vuln, context)
INPUT:  vuln    ← Sanitizer vulnerability report
        context ← Minimal reproducible code context
OUTPUT: patch   ← Validated remediation patch

 1. candidates ← []
 2. FOR i = 1 TO MAX_RETRIES DO
 3.     prompt ← BUILD_PROMPT(vuln, context)
 4.     patch_i ← LLM.generate(prompt, temp=0.2)
 5.
 6.     IF SYNTAX_CHECK(patch_i) = PASS THEN
 7.         IF UNIT_TESTS(patch_i) = PASS THEN
 8.             IF SANITIZER_RERUN(patch_i) = CLEAN THEN
 9.                 candidates.append(patch_i)
10.             END IF
11.         END IF
12.     END IF
13. END FOR
14.
15. IF candidates.length > 0 THEN
16.     RETURN SELECT_BEST(candidates)    // Human review follows
17. ELSE
18.     RETURN ESCALATE_TO_HUMAN(vuln)
19. END IF

Legend: LLM inference · validation gates · human checkpoint
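Algorithm 1 translates almost directly into executable form. A hedged Python rendering, with the LLM call and the three validation gates injected as stubbed callables; none of these names correspond to a real API, and `temperature=0.2` simply mirrors the low-temperature setting in the listing:

```python
MAX_RETRIES = 5

def llm_patch_generation(vuln, context, llm_generate,
                         syntax_check, unit_tests, sanitizer_rerun,
                         select_best, escalate_to_human):
    """Executable rendering of Algorithm 1; every callable is an injected stub."""
    candidates = []
    for _ in range(MAX_RETRIES):
        prompt = f"Vulnerability:\n{vuln}\n\nContext:\n{context}\n\nPropose a minimal fix."
        patch = llm_generate(prompt, temperature=0.2)  # low temperature favors conservative edits
        # Gates run cheapest-first, so a syntactically broken patch
        # never pays for a test run or a sanitizer re-execution.
        if syntax_check(patch) and unit_tests(patch) and sanitizer_rerun(patch):
            candidates.append(patch)
    if candidates:
        return select_best(candidates)   # the winner still goes to human review
    return escalate_to_human(vuln)       # nothing survived validation: escalate
```

Collecting every surviving candidate, rather than returning the first pass, is deliberate: `select_best` gives the reviewer the most conservative option rather than the fastest one.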
E. Automated Validation
Generated patches undergo:
- Unit testing
- Regression testing
- Sanitizer re-execution
Only patches that pass all automated checks proceed to human review.
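The three checks above run as ordered, fail-fast gates: a patch that fails unit tests never pays for a sanitizer re-run. A minimal gate runner sketch, where the gate predicates stand in for real test-suite and sanitizer invocations (illustrative, not a real harness):

```python
def run_gates(patch, gates):
    """Apply validation gates in order; stop at the first failure.

    gates: list of (name, predicate) pairs, ordered cheapest-first.
    Returns (True, None) if all pass, otherwise (False, failing_gate) --
    the failure reason is worth logging for the reviewer either way.
    """
    for name, check in gates:
        if not check(patch):
            return (False, name)
    return (True, None)

# Gate order mirrors Section VIII-E:
# unit testing, then regression testing, then sanitizer re-execution.
```

Recording *which* gate rejected each candidate also feeds the retry loop: repeated sanitizer-stage failures for the same bug are a strong signal to stop retrying and escalate.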
F. Human Review and Secure Approval
This is where the rubber meets the road. A human reviewer checks the patch for:
- Security regressions (did the fix open a new hole?)
- Logic degradation (does the code still do what it's supposed to?)
- Silent masking (did the model just hide the symptom instead of fixing the cause?)
We can't stress this enough: without this step, the entire system is dangerous. Automated tests catch structural errors. Humans catch semantic ones.
IX. Evaluation Criteria
We needed a clear way to judge whether a generated patch is actually good — not just "compiles and passes tests."
Patch Evaluation Criteria
- Correctness (required): eliminates the vulnerability without introducing new errors
- Security Preservation (required): no weakening of security controls
- Behavioral Integrity (required): preserves program semantics
- Review Overhead (optimized): less effort than a manual fix
Note: Test coverage alone is insufficient for security-critical paths.
A patch can pass every test in your suite and still be insecure. Test coverage is necessary but nowhere near sufficient for security-critical code.
X. Failure Mode Taxonomy
When LLM patches go wrong, they tend to go wrong in predictable ways. We catalogued the patterns we kept seeing:
Failure Mode Taxonomy
- Superficial Fixes (severity: high)
  - Null checks without root-cause repair
  - Conditional guards masking flaws
- Semantic Drift (severity: critical)
  - Altered control flow
  - Incorrect variable-lifetime assumptions
- Over-Constraining (severity: medium)
  - Reduced concurrency
  - Disabled optimizations
- Test-Centric Deception (severity: critical)
  - Passes tests but violates invariants
  - Removed failing assertions
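Some taxonomy entries are mechanically detectable before a human ever looks. "Removed failing assertions", for instance, shows up as deleted assert lines in the patch's unified diff. A crude pre-review screening heuristic (`flags_removed_assertions` is our hypothetical helper; it catches only the bluntest cases and is no substitute for review):

```python
def flags_removed_assertions(diff_lines):
    """Flag a patch diff that deletes assertion lines.

    diff_lines: lines of a unified diff; removals start with '-'
    (the '---' file header is not a removal and is excluded).
    A patch that "fixes" a bug by deleting the assert that caught it
    is test-centric deception and should be rejected automatically.
    """
    removed = [line[1:].strip() for line in diff_lines
               if line.startswith("-") and not line.startswith("---")]
    return any(line.startswith("assert") for line in removed)
```

The same shape of check extends to other taxonomy entries: deleted synchronization primitives for over-constraining, or newly added broad exception handlers for superficial fixes.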
Once you've seen these patterns a few times, you start spotting them almost instinctively during review. That's the point — the taxonomy trains your eye.
XI. Threat Model and Assumptions
Any system that generates code and puts it near a production pipeline needs a threat model. Here's ours:
Threat Model & Assumptions
Threat Vector                       | Assumption                                 | Status
CI/CD Pipeline Compromise           | Pipeline is trusted and access-controlled  | trusted
LLM Network Access                  | Model operates in a sandboxed environment  | mitigated
Adversarial Training Data           | Out of scope for this research             | excluded
Semantic Vulnerability Introduction | Primary threat; requires human review      | critical
Patch Injection via Input           | Post-detection only, not on user input     | mitigated
The threat that keeps us up at night: semantic vulnerability introduction. A patch that looks fine, passes tests, gets merged — and quietly weakens a security invariant that nobody notices until it's exploited.
XII. Analytical Observations
Looking at how remediation actually plays out in real codebases — based on reported industry patterns and our own reasoning over typical workflows — a few things stand out.
Roughly 10–20% of sanitizer-detected bugs are simple enough for automated patching to have a real shot.
The highest success rates are observed for:
- Uninitialized variable usage
- Missing bounds checks
- Use-after-scope errors
- Certain classes of data races
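To make "structurally simple" concrete, here is the flavor of fix that sits inside the feasible set, rendered in Python for brevity even though the canonical sanitizer-flagged version of this bug lives in C/C++. The code is hypothetical, from no real codebase:

```python
def read_field(buf, index):
    """Read one element from a buffer, with the guard an LLM reliably drafts.

    Before the fix, this was simply `return buf[index]`: the Python
    analogue of the out-of-bounds read a sanitizer flags in C. The fix
    is local, needs no cross-module reasoning, and is easy to review --
    exactly the profile of the bug classes listed above.
    """
    if index < 0 or index >= len(buf):
        raise ValueError(
            f"index {index} out of range for buffer of length {len(buf)}"
        )
    return buf[index]
```

Contrast this with a protocol-level flaw, where the "bounds" are implicit in a state machine spread across files; no single-function guard exists, which is why those bugs stay in human territory.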
Bugs involving complex business logic, cross-module dependencies, or protocol-level reasoning? The model struggles badly. These require understanding intent, not just syntax — and that's still firmly in human territory.
How We Evaluated
Instead of reporting acceptance rates (which would imply we ran a controlled experiment — we didn't), we categorized patch candidates by how much a reviewer would trust them. "Would you merge this without changes?" vs "Would you use this as a starting point?" vs "Is this actively misleading?" That framing felt more honest.
Why This Still Matters at Low Success Rates
Even if the model only handles 15% of incoming bugs, the operational impact is significant. Security backlogs grow exponentially as detection gets better. Every bug the model handles is one less in the queue.
More concretely:
- Simple fixes get cleared faster, reducing the average time-to-fix
- The security team can focus their limited attention on the hard stuff
- Remediation becomes a background process instead of a bottleneck
The marginal value of even modest automation increases with scale. If you're processing ten bugs a week, it's a convenience. At a thousand bugs a week, it's a survival strategy.
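The scale argument is simple arithmetic. Under a hypothetical 15% automatable fraction (an assumption consistent with the 10-20% range above, not a measured figure):

```python
def weekly_relief(bugs_per_week, automatable_fraction=0.15):
    """Bugs cleared by assisted drafting per week, under an assumed
    automatable fraction. Illustrative numbers only."""
    return bugs_per_week * automatable_fraction

# At 10 bugs/week the tool drafts roughly 1.5 fixes: a convenience.
# At 1000 bugs/week it drafts roughly 150: a survival strategy.
```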
XIII. Negative Results and Observed Limitations
Some things we tried that didn't work — and these are arguably more useful to share than what did:
Negative Results & Failed Approaches
- Verbose Prompting → degraded patch quality. Concise context outperforms detailed explanations.
- Full Repository Context → increased hallucinations. Minimal context reduces confusion.
- Aggressive Retry Strategies → diminishing returns. 3-5 retries is optimal; more wastes compute.
- Test-Only Validation → semantic regressions. Human review remains essential.
Documenting failures reinforces conservative automation boundaries.
We're including these not as caveats but as actual findings. Knowing what doesn't work narrows the design space in ways that positive results can't.
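The diminishing-returns observation on retries has a simple probabilistic reading: if each independent attempt yields a valid candidate with probability p, then n attempts yield at least one with probability 1 - (1 - p)^n, a curve that flattens quickly. A sketch, where p = 0.3 is an assumed illustrative rate rather than anything we measured:

```python
def p_at_least_one(p_single, n_retries):
    """Probability that n independent attempts produce >= 1 valid candidate.

    Assumes attempts are independent with a fixed per-attempt success
    rate -- a simplification, since retries on the same prompt are
    correlated in practice, which only flattens the curve further.
    """
    return 1 - (1 - p_single) ** n_retries

# With an assumed 30% per-attempt rate: 3 retries -> ~0.66,
# 5 retries -> ~0.83, 10 retries -> ~0.97. Going from 3 to 5 adds
# ~17 points; going from 5 to 10 adds only ~14 across five extra
# attempts -- consistent with the observed 3-5 sweet spot.
```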
XIV. Security Risks and Ethical Considerations
A. Hallucinated or Misleading Fixes
This is the ugly reality of LLM-generated code: the model doesn't know what "correct" means in your codebase. We've seen it add null checks that suppress the symptom without touching the root cause. We've seen it reduce thread concurrency to "fix" a race condition by eliminating parallelism entirely. And yes, we've seen it delete failing test assertions instead of fixing the code they test.
B. Automation Bias
Here's a subtler danger: once reviewers see that the model produces reasonable-looking patches, they start trusting it more than they should. The patch looks clean, tests pass, and the reviewer rubber-stamps it. This is automation bias, and it's well-documented in other domains (aviation, medical diagnostics). In security, it could be catastrophic.
C. Ethical Deployment
If your organization deploys this kind of system, it needs to be crystal clear — to every engineer in the loop — that these are draft suggestions, not vetted recommendations. The model doesn't understand your threat model, your compliance requirements, or the business logic that lives in six people's heads. Treating its output as authoritative would be irresponsible.
XV. Researcher's Design Rationale
A few choices we made deliberately, and why:
Researcher's Design Decisions
- Post-Detection Focus: detection is largely solved; remediation is the bottleneck
- Architectural Safety First: system design over model optimization
- Untrusted LLM Outputs: treat all AI suggestions as potentially harmful
- CI/CD Integration: real pipelines, not standalone research tools
These decisions reflect a security-first mindset, emphasizing accountability, reproducibility, and operational safety over novelty.
XVI. Claims and Non-Claims
This Research Does NOT Claim
- LLMs can replace security engineers
- Automated patching is universally applicable
- Test coverage guarantees security
- AI-generated code should be trusted by default
This Research DOES Argue
- Bounded, assistive automation is viable
- Human oversight is non-negotiable
- Failure documentation is essential
- Architectural discipline enables trust
XVII. Reproducibility and Research Transparency
We didn't run controlled experiments — this paper is about architecture and design reasoning, not benchmark numbers. But we've tried to make everything here reproducible.
If you want to test these ideas yourself:
- Hook up sanitizer output to an LLM through a structured prompt
- Log every patch it generates, along with what the reviewer decided and why
- Track rejection reasons — they'll tell you more than acceptance rates
- Build your own failure taxonomy over time; ours is a starting point, not the final word
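The "log every patch" step above is worth making concrete: rejection reasons are only analyzable later if they are captured in a structured record rather than scattered across review comments. A minimal schema sketch (the field names are our own suggestion, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PatchReviewRecord:
    """One row per generated patch; the accumulated corpus of these
    records is what a future empirical study would analyze."""
    vuln_id: str
    patch_diff: str
    gates_passed: list          # e.g. ["syntax", "unit", "sanitizer"]
    reviewer_decision: str      # "merged" | "edited" | "rejected"
    rejection_reason: str = ""  # free text; often the most informative field
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

Storing the full diff alongside the decision matters: "edited" records, where the reviewer kept the idea but changed the code, are exactly the starting-point-quality patches the evaluation section describes.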
XVIII. Open Problems & Research Directions
Things we haven't solved (and honestly, neither has anyone else):
- Multi-file patches. Most real bugs span multiple files. The model can barely handle one.
- Fuzzing integration. Closing the loop between fuzzer output and patch generation is an obvious next step.
- Organization-specific tuning. Fine-tuning on your own codebase's patterns could dramatically improve patch relevance.
- Formal verification. If you could run a lightweight formal check on generated patches, that would change the trust equation entirely.
XIX. Threats to Validity
Let's be upfront about the limitations of our approach:
Internal Validity:
We reasoned about architectures and failure modes — we didn't run A/B tests or controlled experiments. Our observations are grounded but not empirically proven.
External Validity:
This framing works best for server-side software with decent test coverage. If your codebase has weak tests or heavy business logic that lives in people's heads, the results won't transfer cleanly.
Construct Validity:
"Patch quality" is hard to measure. We used sanitizer output and reviewer judgment as proxies, but neither captures every way a patch can be subtly wrong.
None of this invalidates the architecture — but it does mean the next step has to be empirical.
XX. Evaluation Metrics (Defined but Not Measured)
We haven't measured these yet, but when someone does run the experiment, here's what they should track:
- Patch Acceptance Rate (%): How often does a reviewer say "yes, merge this"?
- Mean Time-to-Fix (MTTF): How long from bug detection to merged patch?
- Reviewer Time per Patch: Does the model actually save time, or do reviewers spend just as long verifying the AI's work?
- False-Positive Patch Rate: Patches that pass tests but break something in production
- Semantic Regression Incidents: The scary one — patches that make it to production and silently weaken security
We're defining these upfront so future work has clear targets.
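Given structured review logs, the first three metrics reduce to a few lines of aggregation. A sketch over (decision, reviewer_minutes) tuples; this tuple shape is hypothetical and matches no particular logging format:

```python
def compute_metrics(records):
    """records: list of (reviewer_decision, reviewer_minutes) tuples.

    Returns the patch acceptance rate and mean reviewer time per patch.
    MTTF and the two production-side metrics need deployment data
    (detection and merge timestamps, incident reports) that a review
    log alone cannot supply.
    """
    if not records:
        return {"acceptance_rate": 0.0, "mean_reviewer_minutes": 0.0}
    accepted = sum(1 for decision, _ in records if decision == "merged")
    total_minutes = sum(minutes for _, minutes in records)
    return {
        "acceptance_rate": accepted / len(records),
        "mean_reviewer_minutes": total_minutes / len(records),
    }
```

The reviewer-minutes field is the one to watch: if mean review time approaches the time a manual fix would have taken, the pipeline is shifting effort rather than saving it.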
XXI. Why Automated Remediation Remains Hard
The hard part isn't syntax — it's intent.
Most security bugs come from assumptions that nobody wrote down. Invariants that span three modules. Design decisions that made sense five years ago and live in one person's memory. The code doesn't tell you why it's structured a certain way, only what it does.
LLMs are great at surface-level code transformations. They can add a bounds check, initialize a variable, insert a null guard. But they can't reason about why the code was written the way it was, and that's where the real vulnerabilities live.
We don't see this as a failure of models. It's a constraint — a real one — and the right response is to design around it, not pretend it doesn't exist.
XXII. Summary of Contributions
LLMs can help fix bugs — but only if you build the right guardrails. Fully autonomous patching isn't just premature; it's actively dangerous. The sweet spot is assisted remediation: the model drafts, tests validate, humans decide.
Here's what this work contributes:
- A concrete architecture for human-in-the-loop AI-assisted patching, designed for real CI/CD pipelines
- A failure taxonomy — the patterns we saw when LLM patches went wrong, catalogued so others can watch for them
- A threat model built around the assumption that AI-generated code is untrusted by default
- Documented failures — approaches that didn't work, shared because negative results are underreported
- Comparative framing that positions this work relative to rule-based, synthesis-based, and manual approaches
We're putting this out not as a finished product but as an inspectable starting point. Critique it, extend it, break it — that's the point.
Research Artifacts Produced
What you can take away and use independently:
- The pipeline architecture (works with any LLM)
- The failure mode taxonomy (useful for reviewing any AI-generated code, not just patches)
- The comparative framework (helps position your own approach)
- The threat model (adaptable to your specific deployment context)
- The evaluation methodology (reviewer confidence as a metric, not just test pass rates)
Nothing here is tied to a specific model, vendor, or dataset. If the architecture is sound, it should work regardless of which LLM you plug in.
Appendix A: Glossary of Terms
Sanitizer: Runtime instrumentation detecting undefined or unsafe behavior during program execution.
Semantic Vulnerability: A flaw that preserves functional correctness but weakens security guarantees.
Automation Bias: Human tendency to over-trust automated system outputs, reducing critical scrutiny.
Human-in-the-Loop: System design requiring explicit human approval at critical decision stages.
Context Reduction: Process of minimizing input code while preserving vulnerability-relevant semantics.
Patch Candidate: An LLM-generated code modification proposed as a potential fix for a detected vulnerability.
Validation Gate: An automated checkpoint that patches must pass before proceeding to human review.
Acknowledgment
This work draws on publicly available industry research and our own analysis. The interpretations and architectural decisions are ours.
We used LLMs as research tools during this work — for code exploration, draft generation, and idea refinement — but every architectural decision, every design choice, and every conclusion was made by us. The models drafted; we decided.
This document reflects the state of our thinking as of January 2026. It will evolve.
Citation
@article{yadav2026aipatching,
title={AI-Powered Automated Patching for Software Vulnerabilities},
author={Yadav, Gaurav and Yadav, Aditya},
journal={Independent Research - Cybersecurity},
year={2026},
note={Equal Contribution — Independent Research},
location={Pune, India},
institution={Ajeenkya DY Patil University}
}"Human authority is not a constraint on AI—it is the foundation of trustworthy automation."