AI-Powered Automated Patching
for Software Vulnerabilities
A Research Perspective on Automated Patch Generation under Human Supervision
Abstract
We've gotten really good at finding vulnerabilities — fuzzing, sanitizers, static analysis — the tooling works. What hasn't caught up is fixing them. Remediation is still overwhelmingly manual, painfully slow, and bottlenecked by a shortage of people who actually understand the code well enough to write safe patches. The result? Bugs sit open for weeks. Security debt piles up. Exposure windows grow.
This paper asks a straightforward question: can LLMs help write patches — not autonomously, not as a replacement for human judgment, but as a tool that drafts candidate fixes for a human to review? We designed a hybrid pipeline around that idea: the model proposes, automated tests validate, and a person approves. The reasoning here is architectural, not benchmarked — we're interested in how to build this safely, not in chasing accuracy numbers on a leaderboard.
What we found is roughly what you'd expect if you've worked with LLMs on real code: they're genuinely useful for structurally simple bugs (missing bounds checks, uninitialized variables), but they introduce real risks — subtle semantic changes, logic drift, patches that pass tests but weaken security. We also tried several "obvious" approaches that failed outright, which reinforced our belief that conservative, human-supervised design isn't just safer — it's the only thing that actually works.
Keywords
automated remediation, vulnerability patching, secure software engineering, LLMs, human-in-the-loop systems, semantic security, CI/CD, operational security.
I. Introduction
Here's the irony of modern security tooling: we can find thousands of bugs a day, but we still fix them one at a time, by hand. Sanitizers catch memory errors. Fuzzers surface edge cases. Static analyzers flag patterns. The detection side of the pipeline has scaled beautifully. The remediation side? It's still a person staring at a stack trace on a Friday afternoon.
That gap — between what we can detect and what we can actually fix — isn't just inconvenient. It's a real security problem. Bugs queue up. Teams triage by severity and ignore the rest. Exposure windows stretch from days to months. And the people qualified to write safe patches are always outnumbered by the bugs waiting for them.
We wanted to know whether LLMs could narrow that gap. Not by replacing developers — that's a fantasy — but by generating draft patches that a human can review, accept, or throw away. Think of it as a first pass, not a final answer.
II. Historical Context: Evolution of Automated Remediation
People have been trying to automate the "fix it" part of security for at least twenty years now. The approaches have changed a lot, but the core frustration hasn't.
Evolution of Automated Remediation
- 2000s, rule-based repair: template fixes
- 2010s, program synthesis: formal methods
- 2020s, ML detection: pattern recognition
- 2024+, LLM patching: context-aware generation

LLM-assisted remediation is the next evolutionary step.
The earliest attempts were rule-based — essentially search-and-replace at a slightly higher level. They worked for narrow bug classes but crumbled in real codebases where bugs don't follow templates.
Then came program synthesis and constraint-solving: mathematically rigorous, provably correct, and almost impossibly expensive to run at scale. Beautiful theory, limited practical reach.
More recently, ML-based approaches showed up, but they mostly focused on finding bugs rather than fixing them. LLMs changed the game in a specific way: for the first time, we have tools that can read surrounding code, understand (or at least approximate) context, and generate patches that a human can actually read and evaluate.
That's the space we're working in — not fully autonomous repair, but a practical middle ground between "fix everything manually" and "let the machine handle it."
III. Problem Definition and Research Questions
The problem isn't complicated to state: detection has outpaced remediation by an order of magnitude. CI pipelines surface hundreds or thousands of sanitizer findings per release. Most of them sit in a backlog. Teams don't ignore them out of negligence — they just don't have enough hands. Low-severity bugs get deprioritized. Medium-severity bugs get deferred. And eventually, some of them get exploited.
The question we kept coming back to:
Can LLMs help fix vulnerabilities without making things worse — and without lulling developers into trusting them too much?
Research Questions
- RQ1 (Feasibility): Which vulnerability classes are most amenable to automated patching?
- RQ2 (Effectiveness): How effective are LLM-generated patches under realistic validation?
- RQ3 (Risk Analysis): What failure modes emerge when LLMs assist remediation?
- RQ4 (Design): How can human-in-the-loop design mitigate automation bias?
Rather than proposing a fully autonomous system, this research deliberately investigates assisted remediation as a more realistic and defensible security paradigm.
We deliberately stayed away from the "fully autonomous" framing. It sounds impressive in a demo, but in practice it means nobody's checking the output. Assisted remediation — where the model suggests and a human decides — felt like the only honest approach.
IV. Scope, Constraints, and Non-Goals
We drew hard lines around what this project tries to do and what it doesn't. Overclaiming is the norm in AI research, and we wanted to avoid it.
Scope
- Post-merge runtime vulnerabilities
- Server-side & system software
- C/C++, Rust, Java ecosystems
Constraints
- LLM context window limits
- Non-deterministic model output
- Human accountability required
Non-Goals
- Replace security engineers
- Auto-merge to production
- Detect vulnerabilities
- Adversarial prompt defense
By clearly stating these non-goals, the system avoids overclaiming capabilities and maintains a conservative security posture.
Listing what we won't do isn't modesty — it's load-bearing. Every non-goal is a boundary that keeps the system from becoming something dangerous.
V. Research Methodology
We didn't run a benchmark suite or fine-tune a model. That wasn't the point. We wanted to understand how to build this thing safely — what the architecture should look like, where it breaks, and what happens when the LLM gets it wrong.
The methodology is design-oriented: we reason about architectures, enumerate failure modes, and model security threats. It's model-agnostic on purpose — the specific LLM doesn't matter; the pipeline structure does.
Research Methodology Stages
- Stage 1, Workflow Decomposition: remediation pipelines decomposed into discrete stages
- Stage 2, Automation Feasibility Analysis: each stage evaluated for LLM automation viability
- Stage 3, Failure Mode Enumeration: systematic identification of AI-introduced failure scenarios
- Stage 4, Architectural Synthesis: human-in-the-loop architecture designed for security
We cared more about whether this could work in a real CI/CD pipeline than whether it hits some accuracy threshold on a curated dataset.
VI. Design Alternatives and Trade-Off Analysis
Before landing on the current design, we seriously considered — and rejected — two other approaches.
Design Alternatives Evaluated
- Fully Autonomous: high semantic risk (rejected)
- Suggestion-Only: minimal effort savings (rejected)
- Constrained Generation: efficiency plus governance (selected)
Comparative Analysis with Existing Approaches
Comparative Analysis Framework
Approach                           | Automation | Scalability | Risk      | Effort
Manual Patching                    | None       | Low         | Low       | High
Rule-Based Fixes                   | Partial    | Medium      | Medium    | Medium
Synthesis-Based                    | High       | Low         | Low       | Very High
LLM-Assisted (Proposed, this work) | Partial    | High        | Mitigated | Low

The proposed approach occupies a previously underexplored middle ground.
What this comparison shows is that we're not competing on automation level — we're occupying a different part of the design space entirely. One that treats human oversight as a feature, not a workaround.
VII. Proposed AI-Assisted Remediation Architecture
Here's what we actually built — or more precisely, what we designed the system to look like when deployed inside a real pipeline.
Three rules we refused to compromise on:
- No auto-merge, ever. Every patch gets seen by a human before it hits the codebase. No exceptions.
- Treat generation as probabilistic. The model might get it right on the third try, or never. Build for that.
- Minimize what the model sees. The more context you feed it, the more it hallucinates. Give it the crash, the relevant function, and nothing else.
Detection (sanitizers) → Isolation (context) → Patch Generation (LLM) → Validation (CI/CD) → Review (audit) → Approval (merge)

Figure 1: AI-Assisted Vulnerability Remediation Pipeline
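The stage sequence in Figure 1 can be sketched as a linear pipeline with a hard stop before merge. A minimal orchestration sketch in Python; every stage name here (`isolate`, `generate`, and so on) is a hypothetical placeholder injected as a callable, not a real API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A sanitizer-reported vulnerability plus its captured metadata."""
    stack_trace: str
    source_file: str

def run_pipeline(finding, isolate, generate, validate, request_human_review):
    """Drive one finding through the stages of Figure 1.

    Each stage is injected as a callable so the sketch stays model-
    and toolchain-agnostic. Note the deliberate absence of any
    auto-merge step: the pipeline *ends* at human review.
    """
    context = isolate(finding)               # Isolation: minimal reproducing context
    candidates = generate(finding, context)  # Patch generation: several candidates
    validated = [p for p in candidates if validate(p)]  # Validation: CI gates
    if not validated:
        return ("escalate", None)            # nothing survived; a human takes over
    return ("review", request_human_review(validated))  # human decides; never auto-merge
```

The return value is a status plus a patch, never a merge action; whatever consumes this result can open a review request but cannot bypass the human checkpoint.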
VIII. Pipeline Components
A. Vulnerability Detection
This is the part that already works well. Sanitizers catch the bug at runtime, and we capture everything useful — stack traces, execution context, the failing assertion. This metadata feeds the rest of the pipeline.
B. Bug Isolation and Context Reduction
This step turned out to be more important than we expected. You'd think feeding the model more code would help. It doesn't. Too much context leads to confused patches, hallucinated functions, and fixes that reference code that doesn't exist.
Our context reduction process:
- Crash-Centric Extraction: Only code paths directly involved in the sanitizer-reported failure are included.
- Dependency Pruning: Unrelated helper functions and utilities are removed unless directly referenced.
- Semantic Anchoring: Function signatures, type definitions, and invariants are preserved.
- Test Harness Alignment: Context is aligned with a minimal reproducing test.
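The four steps above can be sketched in a few lines. A simplified Python rendering, assuming the codebase has already been indexed as a mapping from function name to source text (that indexing, and the `reduce_context` name itself, are our hypothetical scaffolding):

```python
def reduce_context(crash_path_functions, codebase, type_defs):
    """Crash-centric context reduction (illustrative sketch).

    crash_path_functions: names appearing in the sanitizer stack trace
    codebase: dict mapping function name -> source text
    type_defs: type definitions and invariants, always preserved
                (the "semantic anchoring" step)
    """
    # Crash-centric extraction + dependency pruning: keep only
    # functions on the reported crash path, drop everything else.
    kept = {name: src for name, src in codebase.items()
            if name in crash_path_functions}
    # Anchors first, then the pruned code, as one prompt-ready string.
    return "\n\n".join(type_defs) + "\n\n" + "\n\n".join(kept.values())
```

A real implementation would walk the call graph rather than match names, but the invariant is the same: the model sees the crash path and the type-level anchors, and nothing else.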
C. Patch Generation Using LLMs
Once we've isolated the vulnerable code, we hand it to the model with a structured prompt. One thing we learned early: generating a single patch and hoping it works is naive. We generate multiple candidates (typically 3–5) because the variance between attempts is surprisingly high.
1. Codebase → LLM: vulnerable context + metadata
2. LLM → Validator: Candidate Patch A, Candidate Patch B
3. Validator: run validation suite
4. On PASS: patch is ready for human review

Figure 2: Multi-Candidate Patch Generation Sequence
D. Patch Generation Algorithm
Algorithm 1: Patch Generation

ALGORITHM: LLM_PATCH_GENERATION(vuln, context)
INPUT:  vuln    ← Sanitizer vulnerability report
        context ← Minimal reproducible code context
OUTPUT: patch   ← Validated remediation patch

 1. candidates ← []
 2. FOR i = 1 TO MAX_RETRIES DO
 3.     prompt ← BUILD_PROMPT(vuln, context)
 4.     patch_i ← LLM.generate(prompt, temp=0.2)
 5.
 6.     IF SYNTAX_CHECK(patch_i) = PASS THEN
 7.         IF UNIT_TESTS(patch_i) = PASS THEN
 8.             IF SANITIZER_RERUN(patch_i) = CLEAN THEN
 9.                 candidates.append(patch_i)
10.             END IF
11.         END IF
12.     END IF
13. END FOR
14.
15. IF candidates.length > 0 THEN
16.     RETURN SELECT_BEST(candidates)    // Human review follows
17. ELSE
18.     RETURN ESCALATE_TO_HUMAN(vuln)
19. END IF

Legend: LLM inference · validation gates · human checkpoint
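Algorithm 1 translates almost directly into executable form. A hedged Python rendering, with the LLM call and the three validation gates injected as stubbed callables; none of these names correspond to a real API, and `temperature=0.2` simply mirrors the low-temperature setting in the listing:

```python
MAX_RETRIES = 5

def llm_patch_generation(vuln, context, llm_generate,
                         syntax_check, unit_tests, sanitizer_rerun,
                         select_best, escalate_to_human):
    """Executable rendering of Algorithm 1; every callable is an injected stub."""
    candidates = []
    for _ in range(MAX_RETRIES):
        prompt = f"Vulnerability:\n{vuln}\n\nContext:\n{context}\n\nPropose a minimal fix."
        patch = llm_generate(prompt, temperature=0.2)  # low temperature favors conservative edits
        # Gates run cheapest-first, so a syntactically broken patch
        # never pays for a test run or a sanitizer re-execution.
        if syntax_check(patch) and unit_tests(patch) and sanitizer_rerun(patch):
            candidates.append(patch)
    if candidates:
        return select_best(candidates)   # the winner still goes to human review
    return escalate_to_human(vuln)       # nothing survived validation: escalate
```

Collecting every surviving candidate, rather than returning the first pass, is deliberate: `select_best` gives the reviewer the most conservative option rather than the fastest one.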
E. Automated Validation
Generated patches undergo:
- Unit testing
- Regression testing
- Sanitizer re-execution
Only patches that pass all automated checks proceed to human review.
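The three checks above run as ordered, fail-fast gates: a patch that fails unit tests never pays for a sanitizer re-run. A minimal gate runner sketch, where the gate predicates stand in for real test-suite and sanitizer invocations (illustrative, not a real harness):

```python
def run_gates(patch, gates):
    """Apply validation gates in order; stop at the first failure.

    gates: list of (name, predicate) pairs, ordered cheapest-first.
    Returns (True, None) if all pass, otherwise (False, failing_gate) --
    the failure reason is worth logging for the reviewer either way.
    """
    for name, check in gates:
        if not check(patch):
            return (False, name)
    return (True, None)

# Gate order mirrors Section VIII-E:
# unit testing, then regression testing, then sanitizer re-execution.
```

Recording *which* gate rejected each candidate also feeds the retry loop: repeated sanitizer-stage failures for the same bug are a strong signal to stop retrying and escalate.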
F. Human Review and Secure Approval
This is where the rubber meets the road. A human reviewer checks the patch for:
- Security regressions (did the fix open a new hole?)
- Logic degradation (does the code still do what it's supposed to?)
- Silent masking (did the model just hide the symptom instead of fixing the cause?)
We can't stress this enough: without this step, the entire system is dangerous. Automated tests catch structural errors. Humans catch semantic ones.
IX. Evaluation Criteria
We needed a clear way to judge whether a generated patch is actually good — not just "compiles and passes tests."
Patch Evaluation Criteria
- Correctness (required): eliminates the vulnerability without introducing new errors
- Security Preservation (required): no weakening of security controls
- Behavioral Integrity (required): preserves program semantics
- Review Overhead (optimized): less effort than a manual fix
Note: Test coverage alone is insufficient for security-critical paths.
A patch can pass every test in your suite and still be insecure. Test coverage is necessary but nowhere near sufficient for security-critical code.
X. Failure Mode Taxonomy
When LLM patches go wrong, they tend to go wrong in predictable ways. We catalogued the patterns we kept seeing:
Failure Mode Taxonomy
- Superficial Fixes (severity: high)
  - Null checks without root-cause repair
  - Conditional guards masking flaws
- Semantic Drift (severity: critical)
  - Altered control flow
  - Incorrect variable-lifetime assumptions
- Over-Constraining (severity: medium)
  - Reduced concurrency
  - Disabled optimizations
- Test-Centric Deception (severity: critical)
  - Passes tests but violates invariants
  - Removed failing assertions
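Some taxonomy entries are mechanically detectable before a human ever looks. "Removed failing assertions", for instance, shows up as deleted assert lines in the patch's unified diff. A crude pre-review screening heuristic (`flags_removed_assertions` is our hypothetical helper; it catches only the bluntest cases and is no substitute for review):

```python
def flags_removed_assertions(diff_lines):
    """Flag a patch diff that deletes assertion lines.

    diff_lines: lines of a unified diff; removals start with '-'
    (the '---' file header is not a removal and is excluded).
    A patch that "fixes" a bug by deleting the assert that caught it
    is test-centric deception and should be rejected automatically.
    """
    removed = [line[1:].strip() for line in diff_lines
               if line.startswith("-") and not line.startswith("---")]
    return any(line.startswith("assert") for line in removed)
```

The same shape of check extends to other taxonomy entries: deleted synchronization primitives for over-constraining, or newly added broad exception handlers for superficial fixes.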
Once you've seen these patterns a few times, you start spotting them almost instinctively during review. That's the point — the taxonomy trains your eye.
XI. Threat Model and Assumptions
Any system that generates code and puts it near a production pipeline needs a threat model. Here's ours:
Threat Model & Assumptions
Threat Vector                       | Assumption                                 | Status
CI/CD Pipeline Compromise           | Pipeline is trusted and access-controlled  | trusted
LLM Network Access                  | Model operates in a sandboxed environment  | mitigated
Adversarial Training Data           | Out of scope for this research             | excluded
Semantic Vulnerability Introduction | Primary threat; requires human review      | critical
Patch Injection via Input           | Post-detection only, not on user input     | mitigated
The threat that keeps us up at night: semantic vulnerability introduction. A patch that looks fine, passes tests, gets merged — and quietly weakens a security invariant that nobody notices until it's exploited.
XII. Analytical Observations
Looking at how remediation actually plays out in real codebases — based on reported industry patterns and our own reasoning over typical workflows — a few things stand out.
Roughly 10–20% of sanitizer-detected bugs are simple enough for automated patching to have a real shot.
The highest success rates are observed for:
- Uninitialized variable usage
- Missing bounds checks
- Use-after-scope errors
- Certain classes of data races
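To make "structurally simple" concrete, here is the flavor of fix that sits inside the feasible set, rendered in Python for brevity even though the canonical sanitizer-flagged version of this bug lives in C/C++. The code is hypothetical, from no real codebase:

```python
def read_field(buf, index):
    """Read one element from a buffer, with the guard an LLM reliably drafts.

    Before the fix, this was simply `return buf[index]`: the Python
    analogue of the out-of-bounds read a sanitizer flags in C. The fix
    is local, needs no cross-module reasoning, and is easy to review --
    exactly the profile of the bug classes listed above.
    """
    if index < 0 or index >= len(buf):
        raise ValueError(
            f"index {index} out of range for buffer of length {len(buf)}"
        )
    return buf[index]
```

Contrast this with a protocol-level flaw, where the "bounds" are implicit in a state machine spread across files; no single-function guard exists, which is why those bugs stay in human territory.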
Bugs involving complex business logic, cross-module dependencies, or protocol-level reasoning? The model struggles badly. These require understanding intent, not just syntax — and that's still firmly in human territory.
How We Evaluated
Instead of reporting acceptance rates (which would imply we ran a controlled experiment — we didn't), we categorized patch candidates by how much a reviewer would trust them. "Would you merge this without changes?" vs "Would you use this as a starting point?" vs "Is this actively misleading?" That framing felt more honest.
Why This Still Matters at Low Success Rates
Even if the model only handles 15% of incoming bugs, the operational impact is significant. Security backlogs grow exponentially as detection gets better. Every bug the model handles is one less in the queue.
More concretely:
- Simple fixes get cleared faster, reducing the average time-to-fix
- The security team can focus their limited attention on the hard stuff
- Remediation becomes a background process instead of a bottleneck
The marginal value of even modest automation increases with scale. If you're processing ten bugs a week, it's a convenience. At a thousand bugs a week, it's a survival strategy.
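The scale argument is simple arithmetic. Under a hypothetical 15% automatable fraction (an assumption consistent with the 10-20% range above, not a measured figure):

```python
def weekly_relief(bugs_per_week, automatable_fraction=0.15):
    """Bugs cleared by assisted drafting per week, under an assumed
    automatable fraction. Illustrative numbers only."""
    return bugs_per_week * automatable_fraction

# At 10 bugs/week the tool drafts roughly 1.5 fixes: a convenience.
# At 1000 bugs/week it drafts roughly 150: a survival strategy.
```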
XIII. Negative Results and Observed Limitations
Some things we tried that didn't work — and these are arguably more useful to share than what did:
Negative Results & Failed Approaches
- Verbose Prompting → degraded patch quality. Concise context outperforms detailed explanations.
- Full Repository Context → increased hallucinations. Minimal context reduces confusion.
- Aggressive Retry Strategies → diminishing returns. 3-5 retries is optimal; more wastes compute.
- Test-Only Validation → semantic regressions. Human review remains essential.
Documenting failures reinforces conservative automation boundaries.
We're including these not as caveats but as actual findings. Knowing what doesn't work narrows the design space in ways that positive results can't.
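The diminishing-returns observation on retries has a simple probabilistic reading: if each independent attempt yields a valid candidate with probability p, then n attempts yield at least one with probability 1 - (1 - p)^n, a curve that flattens quickly. A sketch, where p = 0.3 is an assumed illustrative rate rather than anything we measured:

```python
def p_at_least_one(p_single, n_retries):
    """Probability that n independent attempts produce >= 1 valid candidate.

    Assumes attempts are independent with a fixed per-attempt success
    rate -- a simplification, since retries on the same prompt are
    correlated in practice, which only flattens the curve further.
    """
    return 1 - (1 - p_single) ** n_retries

# With an assumed 30% per-attempt rate: 3 retries -> ~0.66,
# 5 retries -> ~0.83, 10 retries -> ~0.97. Going from 3 to 5 adds
# ~17 points; going from 5 to 10 adds only ~14 across five extra
# attempts -- consistent with the observed 3-5 sweet spot.
```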
XIV. Security Risks and Ethical Considerations
A. Hallucinated or Misleading Fixes
This is the ugly reality of LLM-generated code: the model doesn't know what "correct" means in your codebase. We've seen it add null checks that suppress the symptom without touching the root cause. We've seen it reduce thread concurrency to "fix" a race condition by eliminating parallelism entirely. And yes, we've seen it delete failing test assertions instead of fixing the code they test.
B. Automation Bias
Here's a subtler danger: once reviewers see that the model produces reasonable-looking patches, they start trusting it more than they should. The patch looks clean, tests pass, and the reviewer rubber-stamps it. This is automation bias, and it's well-documented in other domains (aviation, medical diagnostics). In security, it could be catastrophic.
C. Ethical Deployment
If your organization deploys this kind of system, it needs to be crystal clear — to every engineer in the loop — that these are draft suggestions, not vetted recommendations. The model doesn't understand your threat model, your compliance requirements, or the business logic that lives in six people's heads. Treating its output as authoritative would be irresponsible.
XV. Researcher's Design Rationale
A few choices we made deliberately, and why:
Researcher's Design Decisions
- Post-Detection Focus: detection is largely solved; remediation is the bottleneck
- Architectural Safety First: system design over model optimization
- Untrusted LLM Outputs: treat all AI suggestions as potentially harmful
- CI/CD Integration: real pipelines, not standalone research tools
These decisions reflect a security-first mindset, emphasizing accountability, reproducibility, and operational safety over novelty.
XVI. Claims and Non-Claims
This Research Does NOT Claim
- LLMs can replace security engineers
- Automated patching is universally applicable
- Test coverage guarantees security
- AI-generated code should be trusted by default
This Research DOES Argue
- Bounded, assistive automation is viable
- Human oversight is non-negotiable
- Failure documentation is essential
- Architectural discipline enables trust
XVII. Reproducibility and Research Transparency
We didn't run controlled experiments — this paper is about architecture and design reasoning, not benchmark numbers. But we've tried to make everything here reproducible.
If you want to test these ideas yourself:
- Hook up sanitizer output to an LLM through a structured prompt
- Log every patch it generates, along with what the reviewer decided and why
- Track rejection reasons — they'll tell you more than acceptance rates
- Build your own failure taxonomy over time; ours is a starting point, not the final word
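The "log every patch" step above is worth making concrete: rejection reasons are only analyzable later if they are captured in a structured record rather than scattered across review comments. A minimal schema sketch (the field names are our own suggestion, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PatchReviewRecord:
    """One row per generated patch; the accumulated corpus of these
    records is what a future empirical study would analyze."""
    vuln_id: str
    patch_diff: str
    gates_passed: list          # e.g. ["syntax", "unit", "sanitizer"]
    reviewer_decision: str      # "merged" | "edited" | "rejected"
    rejection_reason: str = ""  # free text; often the most informative field
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

Storing the full diff alongside the decision matters: "edited" records, where the reviewer kept the idea but changed the code, are exactly the starting-point-quality patches the evaluation section describes.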
XVIII. Open Problems & Research Directions
Things we haven't solved (and honestly, neither has anyone else):
- Multi-file patches. Most real bugs span multiple files. The model can barely handle one.
- Fuzzing integration. Closing the loop between fuzzer output and patch generation is an obvious next step.
- Organization-specific tuning. Fine-tuning on your own codebase's patterns could dramatically improve patch relevance.
- Formal verification. If you could run a lightweight formal check on generated patches, that would change the trust equation entirely.
XIX. Threats to Validity
Let's be upfront about the limitations of our approach:
Internal Validity:
We reasoned about architectures and failure modes — we didn't run A/B tests or controlled experiments. Our observations are grounded but not empirically proven.
External Validity:
This framing works best for server-side software with decent test coverage. If your codebase has weak tests or heavy business logic that lives in people's heads, the results won't transfer cleanly.
Construct Validity:
"Patch quality" is hard to measure. We used sanitizer output and reviewer judgment as proxies, but neither captures every way a patch can be subtly wrong.
None of this invalidates the architecture — but it does mean the next step has to be empirical.
XX. Evaluation Metrics (Defined but Not Measured)
We haven't measured these yet, but when someone does run the experiment, here's what they should track:
- Patch Acceptance Rate (%): How often does a reviewer say "yes, merge this"?
- Mean Time-to-Fix (MTTF): How long from bug detection to merged patch?
- Reviewer Time per Patch: Does the model actually save time, or do reviewers spend just as long verifying the AI's work?
- False-Positive Patch Rate: Patches that pass tests but break something in production
- Semantic Regression Incidents: The scary one — patches that make it to production and silently weaken security
We're defining these upfront so future work has clear targets.
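Given structured review logs, the first three metrics reduce to a few lines of aggregation. A sketch over (decision, reviewer_minutes) tuples; this tuple shape is hypothetical and matches no particular logging format:

```python
def compute_metrics(records):
    """records: list of (reviewer_decision, reviewer_minutes) tuples.

    Returns the patch acceptance rate and mean reviewer time per patch.
    MTTF and the two production-side metrics need deployment data
    (detection and merge timestamps, incident reports) that a review
    log alone cannot supply.
    """
    if not records:
        return {"acceptance_rate": 0.0, "mean_reviewer_minutes": 0.0}
    accepted = sum(1 for decision, _ in records if decision == "merged")
    total_minutes = sum(minutes for _, minutes in records)
    return {
        "acceptance_rate": accepted / len(records),
        "mean_reviewer_minutes": total_minutes / len(records),
    }
```

The reviewer-minutes field is the one to watch: if mean review time approaches the time a manual fix would have taken, the pipeline is shifting effort rather than saving it.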
XXI. Why Automated Remediation Remains Hard
The hard part isn't syntax — it's intent.
Most security bugs come from assumptions that nobody wrote down. Invariants that span three modules. Design decisions that made sense five years ago and live in one person's memory. The code doesn't tell you why it's structured a certain way, only what it does.
LLMs are great at surface-level code transformations. They can add a bounds check, initialize a variable, insert a null guard. But they can't reason about why the code was written the way it was, and that's where the real vulnerabilities live.
We don't see this as a failure of models. It's a constraint — a real one — and the right response is to design around it, not pretend it doesn't exist.
XXII. Summary of Contributions
LLMs can help fix bugs — but only if you build the right guardrails. Fully autonomous patching isn't just premature; it's actively dangerous. The sweet spot is assisted remediation: the model drafts, tests validate, humans decide.
Here's what this work contributes:
- A concrete architecture for human-in-the-loop AI-assisted patching, designed for real CI/CD pipelines
- A failure taxonomy — the patterns we saw when LLM patches went wrong, catalogued so others can watch for them
- A threat model built around the assumption that AI-generated code is untrusted by default
- Documented failures — approaches that didn't work, shared because negative results are underreported
- Comparative framing that positions this work relative to rule-based, synthesis-based, and manual approaches
We're putting this out not as a finished product but as an inspectable starting point. Critique it, extend it, break it — that's the point.
Research Artifacts Produced
What you can take away and use independently:
- The pipeline architecture (works with any LLM)
- The failure mode taxonomy (useful for reviewing any AI-generated code, not just patches)
- The comparative framework (helps position your own approach)
- The threat model (adaptable to your specific deployment context)
- The evaluation methodology (reviewer confidence as a metric, not just test pass rates)
Nothing here is tied to a specific model, vendor, or dataset. If the architecture is sound, it should work regardless of which LLM you plug in.
Appendix A: Glossary of Terms
Sanitizer: Runtime instrumentation detecting undefined or unsafe behavior during program execution.
Semantic Vulnerability: A flaw that preserves functional correctness but weakens security guarantees.
Automation Bias: Human tendency to over-trust automated system outputs, reducing critical scrutiny.
Human-in-the-Loop: System design requiring explicit human approval at critical decision stages.
Context Reduction: Process of minimizing input code while preserving vulnerability-relevant semantics.
Patch Candidate: An LLM-generated code modification proposed as a potential fix for a detected vulnerability.
Validation Gate: An automated checkpoint that patches must pass before proceeding to human review.
Acknowledgment
This work draws on publicly available industry research and our own analysis. The interpretations and architectural decisions are ours.
We used LLMs as research tools during this work — for code exploration, draft generation, and idea refinement — but every architectural decision, every design choice, and every conclusion was made by us. The models drafted; we decided.
This document reflects the state of our thinking as of January 2026. It will evolve.
Citation
@article{yadav2026aipatching,
title={AI-Powered Automated Patching for Software Vulnerabilities},
author={Yadav, Gaurav and Yadav, Aditya},
journal={Independent Research - Cybersecurity},
year={2026},
note={Equal Contribution — Independent Research},
location={Pune, India},
institution={Ajeenkya DY Patil University}
}"Human authority is not a constraint on AI—it is the foundation of trustworthy automation."