AI Red Teaming: how to evaluate the security of AI models and LLMs

AI red teaming is the systematic offensive evaluation of artificial intelligence models and LLM-based applications against realistic adversaries, with the goal of uncovering unsafe behavior, information leakage, alignment bypasses and tool misuse before an attacker does. Unlike traditional pentesting, which focuses on software and configuration vulnerabilities, AI red teaming works against an attack surface made of prompts, embeddings, contexts retrieved by RAG, tools connected to the model and, in some cases, the training data itself. The adversarial techniques are different, the success criteria are different, and the discipline requires people who combine classical offensive security with applied machine learning.

The essentials

AI red teaming evaluates models and LLM applications against adversarial techniques, not just code bugs.

It covers direct and indirect prompt injection, jailbreaking, tool abuse, system prompt extraction, model extraction and data poisoning.

It builds on frameworks such as MITRE ATLAS, NIST AI RMF, OWASP Top 10 for LLM Applications and UK AISI guidance.

The EU AI Act and DORA are pushing adversarial testing as mandatory for high-risk systems and critical financial services.

A serious programme combines continuous internal red teaming, periodic independent external red teaming and vendor evaluation before adoption.

What sets AI red teaming apart from traditional pentesting

Traditional pentesting assumes a software system with code, dependencies, configuration and, where relevant, infrastructure. The attacker looks for known technical vulnerabilities, logic flaws in the application or hardening failures. AI red teaming extends that surface in three directions.

The first direction is the input surface. Models receive user prompts, but they also consume additional context: documents retrieved by RAG systems, web content loaded by agents, outputs from external tools and, in multi-agent setups, messages from other models. Any of those channels can carry adversarial instructions that the model treats as legitimate.

The second direction is the technique set. The AI red teamer is not hunting for a buffer overflow but for formulations that change the behaviour of the model: adversarial perturbations, role playing, encodings that bypass filters, gradient-based attacks against open-weight models and automated permutations in black-box settings. The line between legitimate and malicious input is blurred, and blocklist-based filters routinely fall against lexical variants.

The third direction is the success criterion. In a traditional pentest the goal is typically RCE, privilege escalation or data exfiltration via vulnerabilities. In AI red teaming the goals include extracting information that the model should refuse to share (memorised PII, system prompts, intellectual property), bypassing alignment to produce restricted content, manipulating the output toward biased decisions in systems that affect people, and forcing disproportionate resource consumption. The damage is not always technical: it can be reputational, regulatory or trust based.

An organisation that only audits the application layer above the model leaves the bulk of the risk introduced by generative AI uncovered.

Frameworks and references

The discipline has matured enough to rely on public frameworks. Knowing them helps keep the scope traceable and makes findings comparable across exercises.

MITRE ATLAS is the adaptation of MITRE ATT&CK to AI systems. It catalogues adversarial tactics and techniques against machine learning systems, with documented real-world case studies. Its main value is acting as a shared taxonomy when describing findings and building coverage matrices.

NIST AI Risk Management Framework proposes a risk management process for AI with four functions: Govern, Map, Measure and Manage. Red teaming sits in Measure as an adversarial evaluation mechanism. The Generative AI profile published in 2024 expands the framework with concrete controls.

OWASP Top 10 for LLM Applications lists the ten most relevant categories of risk in applications that integrate models. It is an operational reference for defining a baseline of coverage in a first engagement: prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance and model theft.

The evaluations published by the UK AI Safety Institute and Anthropic's Responsible Scaling Policies offer public examples of how labs and regulators structure pre-deployment testing. Although they are designed for frontier models, several of the methodologies are applicable to enterprise environments that integrate commercial models.

Common attack techniques

Each technique is exercised with different goals, success indicators and prerequisites. These are the most common in an enterprise engagement.

Direct prompt injection

The user sends instructions that try to overwrite the system directives. Early variants (DAN, role play as characters that ignore restrictions, prompts in languages with weaker alignment coverage) are largely blocked by current commercial models, but more subtle variants remain viable. A red teamer tries reformulations, multi-turn chaining, encodings (base64, leet, translations), adversarial prompts generated with methods such as GCG and combinations with context that justifies the transgression. The objective is not always restricted-content generation: often it is enough to make the model reveal its system prompt, ignore a privacy filter or shift its tone in a critical domain.

Indirect prompt injection

Here the attacker does not interact with the model directly. They inject instructions into content that the model or an agent eventually consumes: a web page indexed by a connected search tool, a PDF loaded into a RAG system, a Jira ticket read by an assistant, an email processed by a copilot. When the model reads that content, it treats it as context and, without strict separation of privileges between system instructions and user data, follows the embedded orders. RAG poisoning is the specific variant of placing manipulated documents into the vector store. This is one of the most dangerous attack classes in agent architectures, because the end user can be a victim without ever interacting with the attacker.

System prompt extraction

Applications define their behaviour through system prompts that contain instructions, the identity of the assistant, restrictions and, occasionally, misplaced secrets such as API keys or internal tool names. Extracting that prompt reveals the architecture, enables follow-on attacks and, in many cases, exposes product IP. Techniques include direct requests in multiple languages, simulation of debug modes, role-play games that ask the model to repeat the prior context and side-channel attacks based on length or latency.

Tool abuse and excessive agency

Modern models do not just generate text: they execute tools, call APIs, query databases, send emails and, in agentic architectures, chain autonomous actions. When an attacker manages to influence the decisions of the model, they can turn it into a proxy executor. The red teamer evaluates whether tools have minimal scopes, whether human confirmations exist for sensitive actions, whether the agent can be steered into invoking unintended functions and whether the data returned by a tool can carry instructions that change behaviour. The OWASP category of excessive agency covers exactly this risk: broad permissions turn a compromised agent into malware with legitimate credentials.

Embedding inversion and membership inference

RAG systems and many modern applications store embeddings, vector representations of text. Recent research has shown that it is possible to partially reconstruct the original text from embeddings, especially with popular models and short snippets. Membership inference goes one step further: determining whether a specific text was part of the training corpus, which has clear privacy implications for models trained on sensitive data. Both vectors are tested when the scope includes the vector store or the base model.

Model extraction and IP theft

Through a high volume of targeted queries against a closed model, an attacker can train an approximate model that replicates the behaviour of the original. In specialised models (proprietary classifiers, models fine-tuned with exclusive data) this is equivalent to intellectual property theft. The red teamer evaluates rate limits, detection of systematic query patterns and, in advanced cases, watermarks embedded in responses.

Data poisoning

It exists at two moments. Training-time poisoning happens when the corpus is contaminated during training or fine-tuning, introducing backdoors that trigger on specific inputs. It matters for organisations that train their own models or apply fine-tuning with third-party data. Inference-time poisoning affects systems with continuous learning or feedback loops without moderation: if the model learns from user interactions without filtering, an attacker can gradually push its behaviour.

Hallucination weaponisation

Models produce plausible but false content. An attacker can exploit this tendency for several goals: convince an agent to install packages that do not exist and that the attacker registers just in time (slopsquatting), induce false legal references in automated legal workflows, or push the model to back controversial claims with fabricated citations. The red teamer measures hallucination rates in critical domains and tests adversarial prompts designed to maximise them.

Adversarial perturbation

In vision models, audio models or specific classifiers, imperceptible changes to the input can flip the output. In the LLM context, techniques such as GCG generate apparently random suffixes that trigger restricted behaviours at non-trivial rates. White-box, when the team has access to weights and can use gradients. Black-box, when only API access is available and the team must rely on search techniques and transfer from open models.

Methodology of an AI red team engagement

A serious engagement follows an ordered sequence rather than applying techniques at random.

Phase 1: scope. Define what is in, what is out, which environments are valid (production, staging, isolated model), which account types are used and which rules of engagement apply. If the model is connected to real systems, the team must decide how far the test will go before pausing.

Phase 2: threat model. Identify plausible adversaries (malicious user, internal employee, external attacker with access to indirect content, competitor extracting IP), credible goals and critical assets. A medical application and a marketing assistant have very different threat models.

Phase 3: reconnaissance. Map the surface: visible prompts, configurable parameters, connected tools, RAG sources, integrations, existing controls, available telemetry. Document the architecture before attacking.

Phase 4: testing. Execute techniques prioritised by relevance to the threat model. Combine manual prompts with automated suites. Document each finding with the full prompt, the obtained output, an impact criterion and reproducibility notes.

Phase 5: reporting. Deliverable with an executive summary, a coverage matrix against the chosen framework (ATLAS, OWASP LLM), severity-classified findings, technical recommendations (filters, guardrails, privilege separation, monitoring) and, where applicable, process recommendations (prompt review, model governance, usage policy).

In tooling, the most widely used frameworks in 2026 are Garak (a suite of probes with hundreds of prebuilt tests, maintained by NVIDIA), PyRIT (a Microsoft framework aimed at red teams, supporting multi-turn orchestration), Promptfoo (more focused on evaluation but useful for defence regression) and Counterfit (broader coverage of classical ML models, not just LLMs). For specific attacks, there are reference implementations on GitHub for GCG, AutoDAN and related techniques.

Regulatory fit with the EU AI Act and DORA

The EU AI Act introduces explicit obligations for adversarial testing of systems classified as high risk. Providers must document robustness, accuracy and cybersecurity evaluations, and general-purpose models with systemic risk face additional requirements that include red teaming as a risk-discovery mechanism. The regime is staged, with deadlines already in force for prohibited practices and GPAI models, and others becoming applicable in 2026 and 2027.

DORA, already in force for financial entities in the EU, does not mention AI explicitly, but its requirement for threat-led penetration tests (TLPT) every three years for significant entities reaches critical systems for financial services. When those systems integrate AI models in material flows (scoring, fraud detection, automated customer service), the TLPT must include adversarial techniques against AI components. The same applies to testing of ICT third parties under the third-party risk management framework.

For organisations subject to NIS2 that use AI in critical processes, the principle is equivalent: mandatory technical and organisational measures include periodic security evaluations, and omitting the AI component is not defensible if it affects the availability or integrity of the essential service.

Internal red team, external red team and vendor evaluation

The three modalities serve different purposes and are not interchangeable.

The internal red team operates continuously, understands the architecture, takes part in guardrail design and gives rapid feedback to product teams. It is ideal for regression, evaluation of every relevant change and for building organisation-specific probe catalogues. Its natural limit is bias: those who build the system rarely find all of its blind spots.

The external red team brings independent perspective, techniques that the internal team does not master and credibility in reports addressed to regulators, auditors and customers. Its typical cadence is annual or biannual, with engagements of bounded duration (four to eight weeks in most serious programmes). It is the modality implicitly required when there are regulatory obligations or external audits.

Vendor evaluation is carried out before adopting a third-party AI model or service. It answers two questions: how far the inherent security of the provider goes, and how the system behaves when integrated with the organisation's own data and context. Commercial models routinely pass generic evaluations but may fail with sensitive customer data or with poorly designed system instructions. Vendor evaluation includes documentation review, functional testing and a small adversarial engagement with representative data from the real environment.

Recommended governance combines the three: continuous internal, periodic external, vendor evaluation before each material adoption.

Public cases

Cases published in the last few years illustrate categories of attack without the need to invent details.

During 2023 and 2024 the community documented numerous jailbreaks against ChatGPT (DAN and its variants, prompts in languages with weaker alignment coverage, role plays). OpenAI progressively closed many of the specific variants, but the pattern shows that alignment is not robust against sustained adversarial effort.

In 2024 several indirect prompt injection scenarios against Microsoft Copilot were published, in which external content (emails, shared documents) induced unintended behaviour when the assistant processed it. Microsoft introduced mitigations through reinforced guardrails and explicit user warnings.

Google has handled vulnerability disclosures around Gemini related to context handling and output filters, progressively integrated into its bug bounty programme extended to AI.

These cases share a pattern: none of the major platforms claim to offer a model immune to abuse, and all of them have established formal red teaming programmes, AI-specific bug bounties and responsible disclosure as part of their lifecycle.

Frequently asked questions

Is AI red teaming legal against systems in production?

Only with explicit authorisation by contract. Testing commercial models beyond the terms of use is problematic and, in some cases, breaches the ToS. For systems owned by the organisation, it is advisable to sign rules of engagement before the exercise, with time windows, stop criteria and immediate notification if incidents unrelated to the test are discovered in flight.

Does the approach change between open-source models and closed APIs?

Yes. With open-weight models, white-box testing is possible: gradients, adversarial fine-tuning, internal analysis. With closed APIs the engagement is black-box, based on queries and transfer of attacks generated on similar open models. Both scenarios are legitimate and useful, but the techniques and the compute cost change.

Is automated AI red teaming a thing?

Partially. Garak, PyRIT and similar tools automate a significant portion of the repetitive work (probe catalogues, variant generation, output evaluation). However, the highest-impact findings still require human creativity, especially in multi-turn chains, specific tool abuse and contextual prompt injection. Mature practice combines both.

Who buys AI red teaming?

In 2026, the typical buyer is the CISO, the head of AI or the AI governance lead. In financial entities, the operational risk function is also involved. In organisations regulated as high-risk providers under the EU AI Act, the decision is usually formalised in the quality management system and signed at the corresponding executive level.

How long does a typical engagement take?

Between four and eight weeks for a standard enterprise programme. An LLM application with RAG and two or three tools is usually covered in four to five weeks: one for scoping and threat modelling, two to three of effective testing, and one for reporting. Complex multi-agent systems or environments with several integrated models require more time.

What deliverables are produced?

A technical report with findings classified by severity, a coverage matrix against the chosen framework, reproducible evidence, prioritised recommendations (technical and process) and, optionally, a transfer session with the engineering team. A good engagement leaves the organisation with a measurable baseline and an actionable mitigation plan.

AI red teaming with Secra

At Secra we run adversarial AI red teaming programmes of four to eight weeks, with proprietary tooling integrated with Garak and PyRIT, sector-specific threat models (financial, health, industrial, public sector) and deliverables aligned with MITRE ATLAS, OWASP Top 10 for LLM Applications and the requirements of the EU AI Act. We work both on LLM applications in production and on vendor evaluation before adoption, as well as on agentic systems with connected tools.

If your organisation is deploying generative AI in material workflows and needs a serious adversarial engagement before go-live or as part of an annual programme, you can reach us at secra.es/contacto for an initial conversation and scoping discussion.

About the author

Secra Solutions team

Ethical hackers with OSCP, OSEP, OSWE, CRTO, CRTL and CARTE certifications, 7+ years of experience in offensive cybersecurity, and authors of CVE-2025-40652 and CVE-2023-3512.

Meet the team →Our security research →

AI Red Teaming: how to evaluate the security of AI models and LLMs

What sets AI red teaming apart from traditional pentesting

Frameworks and references

Common attack techniques

Direct prompt injection

Indirect prompt injection

System prompt extraction

Tool abuse and excessive agency

Embedding inversion and membership inference

Model extraction and IP theft

Data poisoning

Hallucination weaponisation

Adversarial perturbation

Methodology of an AI red team engagement

Regulatory fit with the EU AI Act and DORA

Internal red team, external red team and vendor evaluation

Public cases

Frequently asked questions

Is AI red teaming legal against systems in production?

Does the approach change between open-source models and closed APIs?

Is automated AI red teaming a thing?

Who buys AI red teaming?

How long does a typical engagement take?

What deliverables are produced?

AI red teaming with Secra

Related Articles

wp2shell: Unauthenticated RCE in WordPress Core

File Upload Vulnerability: Webshell, RCE and Defense

Flipper Zero: What It Is, What It Does & Risks

AI Red Teaming: how to evaluate the security of AI models and LLMs

What sets AI red teaming apart from traditional pentesting

Frameworks and references

Common attack techniques

Direct prompt injection

Indirect prompt injection

System prompt extraction

Tool abuse and excessive agency

Embedding inversion and membership inference

Model extraction and IP theft

Data poisoning

Hallucination weaponisation

Adversarial perturbation

Methodology of an AI red team engagement

Regulatory fit with the EU AI Act and DORA

Internal red team, external red team and vendor evaluation

Public cases

Frequently asked questions

Is AI red teaming legal against systems in production?

Does the approach change between open-source models and closed APIs?

Is automated AI red teaming a thing?

Who buys AI red teaming?

How long does a typical engagement take?

What deliverables are produced?

Related resources

AI red teaming with Secra

Related Articles

wp2shell: Unauthenticated RCE in WordPress Core

File Upload Vulnerability: Webshell, RCE and Defense

Flipper Zero: What It Is, What It Does & Risks