EDR evasion with LLMs: how attackers use AI and how to defend in 2026

EDR evasion with LLMs: AI-generated polymorphic malware, AMSI bypass, hallucinated payloads, behavior analytics as defense and purple team validation.

Secra · June 8, 2026 · 14 min read

Since 2024, attackers have integrated large language models into their offensive chain to accelerate tasks that used to require experienced operators: generating malware variants, rewriting loaders to evade signatures, crafting contextual payloads and producing convincing phishing at scale. As a result, EDR products that rely mainly on signatures have lost effectiveness against a near-infinite universe of variants, and defenders need to lean on behavior analytics, deep telemetry and purple team validation with real AI-generated payloads. This article describes how AI is being used in offensive operations today, which techniques scale, which limitations LLMs still face and which defensive strategy is reasonable in 2026, marketing aside.

The essentials

  • LLMs accelerate generation of polymorphic loaders, PowerShell obfuscation, lateral movement scripts and contextual phishing.
  • Offerings such as WormGPT and FraudGPT existed and variants still circulate, but most real usage involves abuse of accounts on commercial models, as documented by OpenAI, Anthropic and Microsoft.
  • LLMs still hallucinate APIs, mix syntax across versions and trip on alignment, so they require competent human operators to validate output.
  • Signature-only EDR loses effectiveness: behavior analytics, on-device ML, Sysmon with Sigma and proactive threat hunting are the defensive line that scales.
  • Purple team exercises with AI-generated payloads are the only honest way to validate whether your EDR detects what the vendor promises.

How attackers are using LLMs today

Adversarial LLM usage in 2026 is neither science fiction nor a miracle tool. It is an accelerator with four recurring patterns worth describing precisely.

The first pattern is evasive offensive code generation. The operator asks the model to rewrite a known loader with a different control structure, variable names, XOR routines and reflective loading techniques. The point is not to write malware from scratch (LLMs remain mediocre across long projects), but to produce variants of small components that break existing static signatures.

The second pattern is creative obfuscation. PowerShell, HTA-embedded JavaScript, Python in Impacket-style frameworks and assembly in bind shells lend themselves to lexical transformations, chained encodings, string splitting and rewritten cradle techniques. LLMs produce variants at speeds no human can match.

The third pattern is contextual phishing and social engineering. The operator feeds public information about the target (LinkedIn, corporate communications, posts) and gets personalised emails, plausible Teams messages, vishing scripts and adapted landing pages. Tonal coherence is often indistinguishable from real internal communications.

The fourth pattern is bespoke lateral movement and post-exploitation scripts. Once inside, the operator asks for Active Directory enumeration, ACL parsing, abuse of trust relationships, LOLBin commands in specific Windows versions and exfiltration routines through legitimate channels. The model does not operate autonomously: the operator decides, validates and executes.

Documented cases

This section deserves to be handled without glamorising the actors or inflating capability. The public facts of the past two years are enough.

WormGPT and FraudGPT appeared on forums and Telegram between 2023 and 2024 as alignment-free models marketed by subscription. Researchers showed that many offerings were wrappers over open weights models with permissive prompts, and others outright scams. Media narratives inflated their impact, but the underlying pattern is real: there is demand and supply for LLMs with weakened alignment for offensive use.

Abuse of commercial models is documented in public reports from OpenAI, Anthropic and Microsoft. OpenAI has published disruptions of state-affiliated actors (groups linked to Russian, Chinese, Iranian and North Korean intelligence) that used ChatGPT for target research, translation, auxiliary script generation and debugging. The purpose is not offensive autonomy but the acceleration of repetitive intellectual work.

Forest Blizzard (APT28) has tested commercial models for protocol research and script development, per public disclosure by Microsoft Threat Intelligence. At the time of writing, there is no credible public evidence of end-to-end campaigns operated autonomously by LLMs.

Real adversarial use is more prosaic than press coverage suggests: acceleration, not autonomy; operational support, not fully automated campaigns.

LLM-assisted evasion techniques

These are the concrete techniques that appear most frequently in offensive exercises and in samples analysed over recent months.

Polymorphic loader generation

The operator starts from a known loader and asks for variants with a different internal structure: a different decryption algorithm, WinAPI calls replaced by direct syscalls, semantically neutral junk code and rewritten unhooking routines. Each variant has a different signature and passes static controls that the original no longer evades. Fast generation allows the operator to keep a rotating pool.

AMSI bypass variants

AMSI remains the main Windows control point for script inspection. Classical techniques (patching AmsiScanBuffer, memory manipulation, provider hooking) are documented and covered by signatures. An LLM produces lexical and structural variants, modifying offsets, equivalent instructions and operation order. The defence requires behavioural detection and telemetry on memory modifications.

AI-generated PowerShell obfuscation

PowerShell concentrates a large share of offensive noise on Windows endpoints. Traditional techniques (Invoke-Obfuscation, encoded commands, backtick splitting) are covered by entropy and structural rules. LLMs generate more natural obfuscation, with plausible variable names, less monitored operators and uncommon equivalent cmdlets. The result passes filters expecting canonical patterns more easily.

Living-off-the-land binary usage

LOLBins are legitimate system binaries reused for execution, download, persistence or evasion. The LOLBAS list covers hundreds of utilities. An LLM helps find less monitored combinations, generate non-canonical arguments and chain binaries in atypical sequences. Defence requires strict application allowlisting and parent-child chain monitoring.

Sandbox detection scripts

Before executing its payload, modern malware checks whether it is running in a sandbox, a VM or under analysis instrumentation. Traditional checks (vendor registry keys, core count, analysis processes, disproportionate sleep timing) are covered by signatures. LLMs generate novel checks and contextual inspections of the target environment (specific software, domain, user configuration), reducing detonation rates.

Mimicking legitimate C2 traffic

The C2 channel is where network defences can catch the attacker. LLMs help build traffic that imitates legitimate patterns: HTTP to real CDNs with realistic jitter, abuse of well-known SaaS as indirect channels, beacons that respect business hours and JSON formats that look like ordinary telemetry. Detection requires encrypted traffic analysis and per-host baselines with Sigma rules.

Highly contextual phishing lures

Beyond generic email, LLMs produce lures with deep context: references to internal projects visible on professional networks, tone matching public CEO communications, accurate branding, role-tuned calls to action. Classical user training loses relative effectiveness and defence has to lean on technical controls: strict DMARC, attachment sandboxing, click-time URL analysis and post-click behavioural telemetry.

Real limits of offensive LLMs

An honest description of the state of the art has to acknowledge where LLMs fail and why they still need competent human operators.

API and syntax hallucination. Models invent functions that do not exist, mix signatures across versions and produce code that does not compile or fails at runtime, especially for security libraries underrepresented in the corpus and versions past the knowledge cutoff.

Code that does not work first try. Even when it compiles, a significant share has subtle bugs that break behaviour or make it detectable. The operator needs debugging ability, not just copy and paste.

Alignment that triggers safeguards. Commercial models recognise explicit offensive patterns and refuse, redirect or produce sterile output. Evasion techniques exist (role play, fragmentation, reformulation), but they introduce friction and reduce quality. Alignment-free models trade technical quality for availability.

Lack of target environment context. An LLM does not know the customer network or the exact EDR deployed. The operator must bring prior reconnaissance and validate empirically.

Dependence on human validation. Every artifact goes through review: the operator tests in lab, adjusts, asks for variants, integrates. The LLM accelerates cycles, not experience or tactical decisions.

These limits explain why the apocalypse of fully autonomous AI-generated malware has not materialised yet. The threat is real and growing, but its current shape is that of an accelerator, not an independent agent.

Why signature-based EDR loses effectiveness

The signature detection model assumes the universe of malicious samples is finite and catalogable. As long as actors produce variants at human pace, the model holds.

With LLMs that pace changes. The marginal cost of generating a working variant drops and the effective universe tends toward infinity. Maintaining accurate signature catalogues becomes Sisyphean work, and the latency between a variant's appearance and the release of its signature leaves uncomfortable exposure windows.
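The brittleness of exact-match signatures can be illustrated with a minimal sketch. It assumes the simplest possible signature, a SHA-256 hash of the file bytes; real EDR signatures are richer (byte patterns, fuzzy hashes), so this is only the limiting case. The byte strings are harmless placeholders and the names `signature_db` and `matches_signature` are hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, used here as a stand-in for an exact-match signature."""
    return hashlib.sha256(data).hexdigest()

# Harmless placeholder standing in for a cataloged sample (assumption).
known_sample = b"print('hello from sample A')\n"
signature_db = {sha256_hex(known_sample)}

def matches_signature(data: bytes) -> bool:
    """True only if the bytes are identical to a cataloged sample."""
    return sha256_hex(data) in signature_db

# Same behaviour, one cosmetic change (an extra space): the hash no longer matches.
variant = b"print( 'hello from sample A')\n"

print(matches_signature(known_sample), matches_signature(variant))
```

Any change to the bytes, however trivial, yields a new digest, which is why the cost of keeping such a catalogue current grows with the number of variants in circulation.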

There is an additional problem: many modern offensive techniques rely on legitimate binaries and behaviours that administrators also perform. The boundary between legitimate and malicious runs through context, sequence and aggregated telemetry, not static patterns on disk. An EDR that only looks at the latter is condemned to repeated false negatives.

Signatures are not dead: they remain useful for known samples as a low cost layer. But trusting them as the primary defence against an LLM-assisted attacker is operational naivety.

Defences that scale against LLM-assisted attacks

The good news is that defence also has clear margins if it is built on the right layers. These are the levers with the best return in 2026.

Behavior analytics (UEBA) above signatures. Detecting anomalies in per user and per host usage patterns is resilient to infinite variant generation because the attacker's residual behaviour remains anomalous even if the binary is unique: unusual child processes, atypical access, temporal patterns outside baseline.
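As a rough sketch of what a per-host baseline means in practice, the following compares today's value of one metric against each host's own history using a z-score. The metric (distinct child process names per day), the host names, the sample numbers and the threshold of 3.0 are all illustrative assumptions; production UEBA uses many features and more robust statistics.

```python
from statistics import mean, pstdev

def zscore(history: list[float], observed: float) -> float:
    """Distance of the observed value from the host's own mean, in standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma

# Hypothetical metric: distinct child process names seen per day on each host.
baseline = {
    "host-a": [3, 4, 3, 5, 4, 3, 4],         # quiet workstation
    "host-b": [20, 22, 19, 21, 20, 23, 21],  # busy build server
}
today = {"host-a": 21, "host-b": 21}

THRESHOLD = 3.0  # illustrative cut-off, to be calibrated per environment
flagged = [host for host, value in today.items()
           if zscore(baseline[host], value) > THRESHOLD]
print(flagged)
```

The same absolute value is routine for one host and far outside the baseline for the other, which is the property that makes this approach indifferent to how unique the binary is.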

EDR with on-device ML detection. Mature EDR platforms combine signatures with ML trained on behaviour. This includes process chain analysis, runtime script evaluation, code injection detection and event correlation. Validating that your EDR brings this layer, not just signatures with AI marketing, is decisive.

Sysmon with bespoke Sigma rules. Sysmon exposes telemetry on process creation, registry modification, network connections and DLL loading. Combining it with Sigma rules curated by the team and derived from internal threat hunting produces granular detection no generic EDR brings out of the box.

Proactive threat hunting. Teams that spend regular time on specific hypotheses (rather than waiting for alerts) find what automated detectors miss. Mature threat hunting relies on intelligence-driven hypotheses, outlier exploration and iterative validation.

Network traffic analysis, including encrypted. TLS hides payload but not metadata. Packet sizes, jitter, contacted hosts, JA3/JA4 fingerprints, beaconing frequency and DNS patterns provide signals actors struggle to erase. Modern NDR leans on these vectors.

Fast endpoint isolation. Isolating the host from the network in seconds limits lateral movement and exfiltration. Having this capability is the difference between a contained incident and a prolonged crisis.

LSA Protection and Credential Guard. Hardening credential protection on Windows reduces the success of Mimikatz and LLM-assisted variants. Enabling them where compatibility allows is one of the highest-impact measures for the effort involved.

Application allowlisting with AppLocker or WDAC. Defining which binaries can execute cuts off at the root an important share of the abused LOLBin catalogue. Requires significant initial investment, but delivers real preventive defence.

A reasonable defensive architecture combines all these layers. None is enough alone. The combination is what makes it expensive for the attacker to sustain operation.

Purple team with LLM-generated payloads

Validating real efficacy of the defensive stack requires purple team exercises where the offensive team generates payloads with LLMs and the defensive team measures coverage. This replaces vendor marketing with evidence.

A serious programme includes several rounds. In each round, the red team uses LLMs to produce new variants of predefined loaders, scripts and techniques; the blue team tries to detect and, where it does not, the gap is documented, detection or configuration is adjusted and the exercise repeated. The relevant metric is not point detection of a sample, but the time the stack takes to improve against an entire class of variants.

Common tools are open frameworks like Atomic Red Team and Caldera, complemented with red team scripting and curated prompts. Some teams integrate the cycle with BAS platforms, replacing static scenarios with generated ones.

The useful deliverable is a coverage map against MITRE ATT&CK updated each quarter, with percentage of techniques detected against fresh variants and a prioritised plan. That map is the only credible defence against generic vendor claims.
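A coverage map of this kind can be computed from a simple results table. The sketch below assumes each purple team round yields rows of (technique ID, variant ID, detected); the rows, the variant labels and the 80% target are made-up illustration, not data from any exercise.

```python
from collections import defaultdict

# Hypothetical results from one round: (ATT&CK technique, variant label, detected?).
results = [
    ("T1059.001", "v1", True),
    ("T1059.001", "v2", False),
    ("T1059.001", "v3", True),
    ("T1055", "v1", False),
    ("T1055", "v2", False),
    ("T1218", "v1", True),
    ("T1218", "v2", True),
]

def coverage_map(rows):
    """Percentage of fresh variants detected, per technique."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for technique, _variant, detected in rows:
        totals[technique] += 1
        hits[technique] += int(detected)
    return {t: 100.0 * hits[t] / totals[t] for t in totals}

def prioritised_gaps(coverage, target=80.0):
    """Techniques below the target, worst coverage first (target is an assumption)."""
    return sorted((t for t, pct in coverage.items() if pct < target),
                  key=lambda t: coverage[t])

coverage = coverage_map(results)
gaps = prioritised_gaps(coverage)
print(coverage, gaps)
```

Recomputing the same table each quarter with fresh variants gives the trend the text describes: how quickly coverage for a whole class of variants improves.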

Threat hunting hypotheses for AI-assisted attacks

Threat hunting can formulate specific hypotheses when one assumes LLM usage. Some concrete lines:

Anomalous PowerShell entropy. Scripts with creative obfuscation can show entropy profiles different from baseline. Hunting over PowerShell logs with per block entropy and comparison against historical percentiles brings outliers to the surface.
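A minimal sketch of this hunt, assuming script block text has already been exported from PowerShell logging into plain strings. The sample blocks, the 99th percentile cut-off and the function names are illustrative assumptions; the "encoded blob" is just the alphabet plus digits, standing in for any high-entropy string.

```python
import math
import string
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def entropy_outliers(history, candidates, pct=99.0):
    """Candidates whose entropy exceeds the chosen percentile of historical blocks."""
    threshold = nearest_rank_percentile([shannon_entropy(b) for b in history], pct)
    return [b for b in candidates if shannon_entropy(b) > threshold]

# Hypothetical historical script blocks (benign administration commands).
history = [
    "Get-Process | Sort-Object CPU",
    "Get-ChildItem C:\\Temp",
    "Get-Service | Where-Object Status -eq Running",
    "Test-Connection localhost",
]
encoded_blob = string.ascii_letters + string.digits  # stand-in for an encoded string
outliers = entropy_outliers(history, ["Get-Date", encoded_blob])
print(len(outliers))
```

Entropy alone is a coarse signal: compressed or legitimately encoded content also scores high, so the output is a list of candidates for review rather than a verdict.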

Atypical command line lengths and structures. LLM-generated commands tend to verbose forms, uncommon equivalent parameters or specific concatenation patterns. Looking for high percentiles in command line length, filtered by parent process, identifies candidates.
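The following sketch shows one way to run this hunt: build a length threshold per parent process from historical events, then flag new events that exceed it. The event fields (`parent`, `cmdline`), the 95th percentile and the minimum sample size are assumptions chosen for illustration, and the sample events are synthetic.

```python
import math
from collections import defaultdict

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def length_thresholds(baseline_events, pct=95.0, min_samples=5):
    """Per-parent command line length threshold from historical events."""
    lengths = defaultdict(list)
    for event in baseline_events:
        lengths[event["parent"].lower()].append(len(event["cmdline"]))
    return {parent: nearest_rank_percentile(vals, pct)
            for parent, vals in lengths.items() if len(vals) >= min_samples}

def long_cmdline_candidates(thresholds, new_events):
    """New events longer than the threshold for their parent; unknown parents are skipped."""
    return [e for e in new_events
            if e["parent"].lower() in thresholds
            and len(e["cmdline"]) > thresholds[e["parent"].lower()]]

# Synthetic baseline: 20 ordinary launches from the same parent.
baseline_events = [{"parent": "explorer.exe", "cmdline": f"notepad.exe file{i}.txt"}
                   for i in range(20)]
new_events = [
    {"parent": "explorer.exe", "cmdline": "notepad.exe x.txt"},
    {"parent": "explorer.exe", "cmdline": "notepad.exe " + "A" * 300},
    {"parent": "unknown.exe", "cmdline": "B" * 500},
]
thresholds = length_thresholds(baseline_events)
candidates = long_cmdline_candidates(thresholds, new_events)
print(thresholds, len(candidates))
```

Parents without enough history are deliberately skipped here; in a real hunt they would be reviewed separately, since a never-seen parent is itself a signal.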

Parent-child chain anomalies. A legitimate binary spawning atypical children, sequences not seen in baseline or timings outside business hours are signals that survive polymorphic generation because operational context remains.
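A minimal sketch of the baseline comparison, assuming process creation events have been reduced to parent and child image names. The field names and sample events are hypothetical, and the timing dimension mentioned above is left out to keep the example short.

```python
def build_baseline(events):
    """Set of (parent, child) image name pairs observed historically, case-insensitive."""
    return {(e["parent"].lower(), e["child"].lower()) for e in events}

def unseen_pairs(baseline, events):
    """Events whose parent-child pair never appeared in the baseline."""
    return [e for e in events
            if (e["parent"].lower(), e["child"].lower()) not in baseline]

# Hypothetical historical telemetry.
history = [
    {"parent": "explorer.exe", "child": "winword.exe"},
    {"parent": "winword.exe", "child": "splwow64.exe"},
    {"parent": "services.exe", "child": "svchost.exe"},
]
# Hypothetical new events to evaluate.
new_events = [
    {"parent": "Explorer.EXE", "child": "WINWORD.EXE"},    # known pair, different casing
    {"parent": "winword.exe", "child": "powershell.exe"},  # pair not in baseline
]

baseline = build_baseline(history)
anomalies = unseen_pairs(baseline, new_events)
print(anomalies)
```

The check does not look at the content of the child process at all, which is why it keeps working when the payload itself is a fresh variant.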

Beaconing with subtle jitter. Hunting over connections grouped by destination host with interval analysis detects beaconing even when content is encrypted.
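One simple form of interval analysis is the coefficient of variation (standard deviation divided by mean) of the gaps between connections to the same destination. The sketch below assumes connection records of (timestamp in seconds, source, destination); the 0.2 cut-off, the minimum event count and the sample traffic are illustrative assumptions, and heavier jitter would call for other methods.

```python
from collections import defaultdict
from statistics import mean, pstdev

def beacon_candidates(connections, max_cv=0.2, min_events=6):
    """Source/destination pairs whose inter-connection gaps are suspiciously regular."""
    times_by_pair = defaultdict(list)
    for timestamp, src, dst in connections:
        times_by_pair[(src, dst)].append(timestamp)

    found = {}
    for pair, times in times_by_pair.items():
        if len(times) < min_events:
            continue
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        average = mean(gaps)
        if average <= 0:
            continue
        cv = pstdev(gaps) / average  # low value means regular intervals
        if cv <= max_cv:
            found[pair] = cv
    return found

# Synthetic traffic: roughly 60 s gaps with small jitter versus bursty human browsing.
regular = [(t, "10.0.0.5", "cdn.example") for t in (0, 58, 121, 179, 243, 300, 362)]
bursty = [(t, "10.0.0.5", "news.example") for t in (0, 5, 7, 300, 310, 2000, 2003)]

candidates = beacon_candidates(regular + bursty)
print(sorted(candidates))
```

Only timestamps and endpoints are used, so the approach applies equally to encrypted sessions; legitimate polling software will also appear and needs to be baselined out.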

LOLBin usage anomalies. Frequency of unusual binaries, non-canonical parameters and previously unseen combinations escape signatures but appear in baseline-driven hunting.
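A baseline-driven view of LOLBin usage can start from host prevalence: how many hosts in the estate ever run a given binary. The sketch below assumes events of (host, binary name); the watchlist is a small illustrative subset of LOLBAS entries, and the 10% prevalence cut-off and the sample data are assumptions.

```python
from collections import defaultdict

# Illustrative subset of binaries listed by the LOLBAS project.
WATCHLIST = {"certutil.exe", "mshta.exe", "regsvr32.exe", "rundll32.exe", "bitsadmin.exe"}

def host_prevalence(events):
    """Fraction of hosts on which each binary was seen at least once."""
    hosts_by_binary = defaultdict(set)
    all_hosts = set()
    for host, binary in events:
        all_hosts.add(host)
        hosts_by_binary[binary.lower()].add(host)
    return {binary: len(hosts) / len(all_hosts)
            for binary, hosts in hosts_by_binary.items()}

def rare_lolbins(events, max_prevalence=0.1):
    """Watchlisted binaries that run on very few hosts (cut-off is an assumption)."""
    prevalence = host_prevalence(events)
    return sorted(b for b, p in prevalence.items()
                  if b in WATCHLIST and p <= max_prevalence)

# Synthetic estate: rundll32.exe everywhere, mshta.exe on a single host.
events = [(f"host{i}", "rundll32.exe") for i in range(20)]
events.append(("host7", "MSHTA.EXE"))

rare = rare_lolbins(events)
print(rare)
```

The same grouping can be extended to arguments or to pairs of binaries executed in sequence, which covers the non-canonical parameters and unseen combinations mentioned above.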

Each hypothesis is documented, executed over historical telemetry and, if it makes sense, crystallised into a Sigma rule or continuous detection.

Regulatory fit

The use of LLMs in offensive operations and the defence against them is not only a technical question. European regulation already frames it partially.

NIS2 Article 21(2)(g), on cyber hygiene, requires essential and important entities to adopt cyber hygiene practices and cybersecurity training. Against a threat surface accelerated by AI, defending solely with signatures does not fit the expected diligence.

ENS (Spain's Esquema Nacional de Seguridad) at the high level requires detection and response controls proportional to the system category. Monitoring must cover behaviour, not just discrete events, with an expectation of threat hunting and exercise-based validation.

DORA requires periodic testing for financial entities and threat-led penetration testing (TLPT) at least every three years for significant entities. Incorporating LLM-assisted techniques in test scenarios is an honest representation of the current threat state.

The EU AI Act affects more the provider side and systems using AI internally, but the regulatory conversation on adversarial use of general purpose models is evolving fast.

Frequently asked questions

Can ChatGPT generate functional malware?

It generates fragments and variants that a competent operator integrates and debugs. It does not produce end to end malware ready to deploy without significant human work. Alignment refuses explicit requests, but framing reduces resistance. The useful gain for an actor is time savings, not autonomy.

Is WormGPT a real threat or cybercrime marketing?

Both. There were and still are wrappers marketed with permissive prompts over open weights models, and scams that exploit the brand to defraud criminal buyers. Real capability is closer to a non-aligned Llama 3 than a qualitatively new tool.

Does my EDR detect LLM-generated payloads?

It depends. Signature-based EDR has decreasing detection rates against fresh variants. EDR with behavioural analysis and on-device ML provides better coverage, but no platform detects 100%. The only way to know in your environment is to validate with purple team exercises.

Is purple team with AI payloads mandatory?

Not explicitly in any regulation. It is reasonable and, for DORA and NIS2 entities with high risk profile, defensible as technical diligence. Auditors and regulators increasingly ask how EDR vendor claims are validated.

Is behavior analytics enough alone?

No. It scales best against fresh variants, but produces false positives without calibration and leaves gaps against techniques that respect baselines. Mature defence combines behavior analytics, signatures, allowlisting, hardening, threat hunting and network analysis.

Does the EU AI Act block WormGPT?

It regulates providers and users of AI within the EU, not criminal actors outside its reach. Marketing a non-aligned model for offensive use would fit prohibitions for anyone operating legally, but the operational reality of cybercrime sits outside its direct reach.

Purple team validation with Secra

At Secra we run purple team exercises to validate the real coverage of your EDR stack against LLM-assisted payloads. Methodology based on MITRE ATT&CK, variant generation aligned with techniques seen in recent operations and deliverables with coverage map, prioritised gaps and quarterly improvement plan. We work on Windows, Linux and hybrid environments, and deliver a measurable baseline that replaces vendor marketing with reproducible evidence.

If your organisation is reviewing the real effectiveness of its EDR against the current state of AI-assisted offensive activity, reach us at secra.es/contact for an initial conversation and scope definition.

About the author

Secra Solutions team

Ethical hackers with OSCP, OSEP, OSWE, CRTO, CRTL and CARTE certifications, 7+ years of experience in offensive cybersecurity, and authors of CVE-2025-40652 and CVE-2023-3512.
