Pentesting
LLM pentesting
AI pentesting
Garak

Pentesting AI and LLM models: methodology, tools and deliverables

LLM and AI pentesting: phases, tools (Garak, PyRIT, Promptfoo), OWASP LLM tests, EU AI Act alignment and CTO-ready deliverables.

SecraJune 8, 202615 min read

Large language model (LLM) pentesting is the structured evaluation of applications built on top of generative AI models (customer support chatbots, internal copilots, retrieval augmented generation systems over corporate documentation, autonomous agents with tool access) whose goal is to identify vulnerabilities specific to the AI paradigm before an attacker discovers them in production. Unlike traditional web pentesting, the focus is not solely on the application or infrastructure layer but on the behaviour of the model itself, on how it handles instructions, external data and connected tools.

This guide explains when an LLM pentest is needed instead of a generic web pentest, the phases of a structured engagement, the tests that are mandatory in any serious evaluation, the tooling used by the offensive team, the white-box, black-box and gray-box modalities, the expected deliverables, the regulatory alignment with the EU AI Act, NIS2 and DORA, and the difference between pentesting, AI red teaming and bug bounty programs.

The essentials of LLM pentesting

  • LLM pentesting evaluates applications that integrate language models and is not limited to the web or API layer.
  • It covers at minimum the OWASP LLM Top 10: prompt injection, system prompt leakage, PII exfiltration via RAG, tool abuse, embedding poisoning, jailbreaks and quota abuse.
  • It runs in two to four weeks depending on scope, in white-box, gray-box or black-box modality.
  • Deliverables include an executive report, an OWASP LLM Top 10 matrix, documented adversarial prompts, layer-by-layer recommendations and a retest plan.
  • It aligns with article 15 of the EU AI Act for high-risk systems, article 21 of NIS2 and extended TLPT exercises in DORA environments.

When you need an LLM pentest (not a generic web pentest)

A traditional web pentest covers classic application vulnerabilities: SQL injection, broken access control, deserialisation, XSS, server misconfigurations. It is necessary for any product exposed to the internet, but it leaves the entire attack surface introduced by a language model uncovered. If your application matches any of the following scenarios, you need a dedicated LLM pentest in addition to the standard web engagement.

A customer support chatbot that answers in natural language, accesses account information and can execute actions (change a shipping address, issue a refund, manage an incident). The risk is not just that an attacker exploits an endpoint, but that they manipulate the model into making decisions outside its policy.

An internal copilot that assists finance, support or engineering staff and has access to tools (database queries, code execution, integration with internal systems). Any tool the model can invoke is a potential abuse vector if an attacker manages to redirect its behaviour.

A retrieval augmented generation (RAG) system that indexes confidential documentation, contracts, technical manuals or ticket history to answer internal questions. The main risk is the exfiltration of information that the user asking should not have access to.

Autonomous agents in production that receive a goal and execute sequences of actions (web search, API calls, writes to management systems, sending emails). The potential damage of a compromised agent is proportional to the scope of the tools it controls.

If your application meets two or more of these conditions, traditional pentesting is not enough and you should include a dedicated LLM scope.

Phases of an LLM pentest

A serious LLM pentest follows four well-defined phases, equivalent to the OSSTMM or PTES model adapted to the AI paradigm.

Discovery. The offensive team needs to understand the system before attacking it. This includes identifying which model is used (GPT-4, Claude, open source models deployed locally), which provider (OpenAI, Anthropic, Azure OpenAI, Bedrock, self-hosted), which tools the model has connected (tools, function calling, plugins), which data sources it consults (vector store, databases, external APIs) and which security controls sit in the middle (LLM Guard, custom filters, moderation systems). In white-box mode the team has access to system prompts, configuration and code; in black-box mode everything is inferred from observable behaviour.

Threat modeling. The OWASP LLM Top 10 is used as the baseline matrix, complemented with STRIDE adapted to the AI context and the MITRE ATLAS framework. Each function of the application is mapped against the relevant threats: if the system has RAG, there is risk of cross-tenant exfiltration and index poisoning; if it invokes privileged tools, there is risk of function abuse; if the deployment is multi-tenant, there is risk of leakage between customers. The output of this phase is a prioritised test plan.

Test execution. This is where adversarial prompts, model-specific jailbreaks, indirect injection techniques via uploaded documents, embedding exfiltration attacks, tool abuse through chaining, and malicious load tests are applied. Each finding is documented with reproducible evidence: the exact prompt, the model response, the impact demonstrated.

Reporting. The engagement closes with an executive report, technical findings, severities calculated using LLM-adapted CVSS (or the provider internal system), layer-by-layer recommendations and a retest plan. The typical total duration is two to four weeks depending on scope.

Mandatory tests in every engagement

Any serious LLM pentest must cover, at minimum, the following tests. The list largely matches the OWASP LLM Top 10 published in 2025 and the vectors documented in MITRE ATLAS.

Direct prompt injection. The attacker sends instructions that override those of the system. Variants are tested in Spanish, English, base64-encoded, fragmented, in obfuscated language and using invisible characters.

Indirect prompt injection. The malicious instruction reaches the model inside a document uploaded by the user, a webpage consulted by the search tool, or an email received by an assistant with mailbox access. It is the most dangerous vector because it does not require direct access to the model.

System prompt leakage. The team attempts to recover the system prompt through extraction techniques (summarise the previous content, repeat the first line, translate the instructions into German). Leaking the system prompt exposes business logic and enables follow-on attacks.

PII exfiltration via RAG. In systems with retrieval augmentation, the team attempts to make the model retrieve and reveal data belonging to other users or to confidential documents outside the role of the user asking. This includes oblique queries and embedding manipulation.

Tool abuse. If the model has connected tools (sending emails, reading databases, executing scripts), the team probes whether it can be induced to execute unauthorised commands or chain tools in unintended ways. It is the highest risk in autonomous agents.

Embedding poisoning. In systems with document ingestion from users or external providers, the team probes whether the vector index can be contaminated with content that alters future responses. It is a persistent and hard-to-detect attack.

Output validation bypass. The model can return responses that, when rendered by the frontend, execute code (classic XSS via injected HTML, malicious markdown, fraudulent URLs). The team verifies whether the presentation layer properly validates model output.

Rate limit and cost abuse. Commercial models are billed per token. An attacker can craft costly queries (long prompts, maximum outputs, expensive tool calls) to exhaust quota or drive up the bill. The team probes behaviour under adversarial load.

Hallucination in critical contexts. In applications where the model advises on regulated topics (medical, financial, legal), the team documents cases where the model fabricates information with an authoritative tone. It is not a classic vulnerability but it is a critical operational risk.

Model-specific jailbreaks. Each model family has known jailbreaks that evolve over time. An updated battery is executed against the deployed model to verify which filters are active and which can be bypassed.

Tooling and testbed

LLM pentesting combines automated tooling with manual work by the specialist. Tools accelerate coverage, but relevant findings usually require adversarial prompts crafted specifically for the customer use case.

Garak (NVIDIA). The most widely adopted LLM vulnerability scanner. It works like nmap for models: it runs preconfigured probes against the model endpoint and reports results. It covers prompt injection, jailbreaks, training data leakage, toxicity and disinformation. Useful in the initial sweep phase to identify obvious issues before moving to manual work.

PyRIT (Microsoft). A red teaming framework designed to automate adversarial prompting campaigns. It allows chaining prompts, maintaining state across turns and orchestrating multiturn attacks. It is the reference tool when the target is an agent or a chatbot with conversational memory.

Promptfoo. A testing platform that runs structured test cases against one or several models and compares results. Useful for maintaining a regression suite that runs periodically against the production system and detects security regressions after model or system prompt updates.

LLM Guard. Although its main purpose is mitigation in production (input and output filters), it is also used in pentesting to evaluate which filters from the customer are active and which the offensive team can bypass. Auditing a deployment that uses LLM Guard includes testing the defensive tool itself.

Other relevant tools. NVIDIA NeMo Guardrails for evaluating conversational guardrails, Rebuff as a prompt injection detection layer that is itself audited, and community jailbreak collections maintained in public repositories as references of known adversarial prompts.

The typical testbed combines Garak for broad coverage, PyRIT for multiturn scenarios and manual work with custom prompts for the customer-specific use cases.

White-box vs black-box vs gray-box LLM pentesting

The modality determines scope, cost and realism of the exercise.

Black-box. The offensive team only has access to the product as an external user. It does not know the underlying model, the system prompt, the connected tools or the internal architecture. It simulates a real attacker. The advantage is realism; the downside is that part of the time is spent on discovery and some vectors remain uncovered if they are not found within the engagement window.

Gray-box. The customer shares partial information: architecture documentation, list of connected tools, model and version, general filter policy. The system prompt or log access is not shared. This is the recommended modality for most engagements because it balances realism and coverage.

White-box. Full access: system prompts, tool configuration, RAG pipeline code, inference logs and a testing environment identical to production. It enables maximum coverage and is the mandatory modality when the system is high-risk under the EU AI Act or when the customer needs exhaustive evidence for regulatory audit.

The choice depends on the goal. To validate security posture against an external attacker, black-box. To achieve maximum coverage for certification or compliance, white-box. For the balanced approach in a typical engagement on a production product, gray-box.

Expected deliverables

The closure of an LLM pentest must include, at minimum, the following deliverables. If the provider does not detail them in the proposal, that is a sign of a poorly defined scope.

An executive report of five to ten pages addressed to leadership and the security committee. It includes a summary of findings, severities, prioritised recommendations and an overall conclusion on the system posture. It does not contain raw adversarial prompts, only the narrative summary.

A detailed technical report with each finding documented: description, prompts used, model responses, reproducible evidence, demonstrated impact, calculated severity and specific recommendation. This is the document the development team uses to remediate.

An OWASP LLM Top 10 matrix that maps each vulnerability of the catalogue against the audited system, indicating whether it was tested, whether evidence was found and what severity applies. It is the reference document for reports to external auditors.

A repository of documented adversarial prompts that the customer can integrate into their continuous testing pipeline. It includes prompts that worked, prompts that were filtered correctly and notes on the observed behaviour.

Layer-by-layer recommendations: model (configuration changes, reinforced system prompt, parameter tuning), application (input validation, output sanitisation, privilege segregation across tools) and infrastructure (quotas, monitoring, alerts, segmentation).

A retest plan that defines which findings are reverified, in what time frame and under what conditions. Without a retest, a pentest is an isolated diagnosis; with a retest, it is a verifiable improvement process.

Regulatory alignment

LLM pentesting is not an optional best practice for many sectors: it is part of the compliance required by European frameworks in force in 2026.

EU AI Act, article 15. AI systems classified as high-risk (use in critical infrastructure, education, employment, essential services, justice, migration management) must undergo robustness, accuracy and cybersecurity testing before being placed on the market and on a continuous basis. LLM pentesting is the primary mechanism to demonstrate this technical compliance. Full application for high-risk systems extends through 2026 and 2027 according to the regulation calendar.

NIS2, article 21. Essential and important entities must adopt cybersecurity risk management measures proportionate to the risks posed, including security testing. If the entity operates critical systems based on AI (authentication, fraud detection, automated decisions), LLM pentesting is part of the required measures.

DORA and TLPT extensions. The DORA regulation requires threat-led penetration testing (TLPT) for significant financial entities. If the entity uses LLM systems in critical functions (automated advice, scoring, customer service with account access), the TLPT exercise must include vectors specific to the AI paradigm. Some national supervisors have already published guides that extend TIBER-EU to include AI capabilities in scope.

Difference with AI red teaming and bug bounty

Three related practices with different purposes.

LLM pentesting. Structured engagement, closed scope, defined time window (two to four weeks), formal deliverables, documented severities. It is the right practice when the goal is to validate the posture of the system at a specific moment, generate evidence for audit or fulfill a regulatory mandate.

AI red teaming. Open-ended adversarial exercise, without a closed scope upfront, oriented to discovering emergent behaviours and undocumented vectors. It is typically executed during model development (labs such as OpenAI or Anthropic maintain internal continuous red teaming teams) or as a punctual exercise against critical systems. It is more exploratory than pentesting and deliverables are more narrative.

Bug bounty. Continuous incentive program for external researchers to report vulnerabilities. There is no closed time window, no fixed scope beyond the program rules, and deliverables are the individual reports that arrive. It is complementary to pentesting, not a substitute: bug bounty covers the long term, pentesting covers the audit moment.

The operational recommendation for an organisation with LLMs in production is to combine periodic pentesting (annual or after significant changes), red teaming before critical launches and bug bounty as a continuous layer once the product is stable.

Frequently asked questions

Is model access required to run an LLM pentest?

It depends on the modality. In black-box only product access as a user is required; in gray-box architecture documentation is shared without privileged credentials; in white-box system prompts, tool configuration and logs are shared. The choice depends on the goal of the exercise and the level of evidence required.

What is the approximate cost of an LLM pentest?

Cost varies with scope, modality and system complexity. A typical engagement for an enterprise chatbot with three to five connected tools and a medium-sized RAG base sits in the range of a complex web pentest. Systems with autonomous agents or multiple integrations require more effort. The proposal must detail hours, assigned profiles and included deliverables.

How long does an LLM pentest take?

Between two and four weeks for a medium-scope engagement. One week for discovery and threat modeling, one or two weeks of execution and one week of reporting. Shorter engagements usually mean reduced coverage; significantly longer engagements typically reflect very broad scope or detailed audit needs.

Is the pentest done on production or on staging?

The usual practice is to execute most of the work on a staging environment identical to production and to reserve a controlled verification in production to confirm equivalent behaviour. Attacking production directly without prior coordination can exhaust quotas, generate costs and trigger real SOC alerts.

What happens if a working jailbreak is found during the engagement?

It is documented with reproducible evidence, communicated to the customer with the appropriate severity and accompanied by mitigation recommendations. If the jailbreak affects a model from an external provider (OpenAI, Anthropic) and is not a fault of the customer integration, the team also suggests reporting it to the provider through their responsible disclosure channel.

Is retesting included in the original cost?

It should be. A pentest without retest is a static diagnosis. The usual practice is to include one retest round within three to six months after the closure of the original report, during which critical and high findings are verified. If the provider does not include it by default, it should be requested in the proposal.

LLM pentesting with Secra

At Secra we perform pentesting of LLM applications with full OWASP LLM Top 10 coverage and a threat model adapted to your specific use case. The engagement has a standard duration of two to four weeks, includes white-box, gray-box and black-box modalities, and delivers an executive report, vulnerability matrix, documented adversarial prompts, layer-by-layer recommendations and a retest plan. We work with chatbots, internal copilots, RAG systems and autonomous agents in production.

If your organisation is about to deploy or already has an LLM-based application and needs to validate its security posture before public exposure, before a regulatory audit or after a significant change in the model or its tools, contact our team to define scope.

Request an LLM pentest proposal

About the author

Secra Solutions team

Ethical hackers with OSCP, OSEP, OSWE, CRTO, CRTL and CARTE certifications, 7+ years of experience in offensive cybersecurity, and authors of CVE-2025-40652 and CVE-2023-3512.

Share article