ofensiva
AI red team
PyRIT
Garak

AI red team automation 2026: tools, frameworks and limitations

AI red team automation 2026: PyRIT, Garak, Promptfoo, Counterfit and continuous adversarial testing orchestration in LLM pipelines.

SecraJune 8, 202613 min read

AI red team automation is the continuous, orchestrated execution of adversarial tests against AI models and LLM-based applications inside CI/CD pipelines. It is an emerging category between 2024 and 2026, born from a practical observation: models change often (new provider versions, internal fine-tunes, system prompt updates, vector store changes in RAG), and manual red teaming engagements, however thorough, do not scale to that pace. Automation does not replace expert red team operators, but it covers regression of existing defenses, validates each release and produces actionable signals before a change breaks a guardrail that worked last week.

The essentials

  • AI red team automation means adversarial testing executed continuously in CI/CD, not one-off annual audits.
  • Main frameworks in 2026: PyRIT (Microsoft), Garak (NVIDIA), Promptfoo, Counterfit and internal evaluations from the major labs.
  • Categories that can be automated: prompt injection, jailbreaks, PII leakage, system prompt leakage, tool abuse, refusal degradation and bias.
  • Limitations are real: high false positive rates, requires expert human review and does not catch creative multi-turn attacks.
  • Fits with EU AI Act, NIST AI RMF and, for financial entities, with the testing operations required by DORA.

Why automation and not only manual engagements

A manual-only AI red teaming program stops being viable once the application reaches production and starts evolving. Three factors push toward automation.

The first is change frequency. A production LLM assistant typically receives weekly changes (new model versions, system prompt adjustments, new tools, vector store updates, additional filters) and each one can break a guardrail or open a new surface. Repeating a four-week manual engagement after every release is not operationally feasible; running an automated probe battery on each deployment is.

The second is silent regression. When a provider ships a new model version, many behaviors change without detailed public documentation: filters that used to reject a pattern start accepting it, biases shift, output formats vary. Without an automated baseline, regressions surface late, usually when a user publishes them.

The third is scale. A serious adversarial battery covers thousands of variants (prompt injection across languages, known jailbreaks, system prompt extraction, tool abuse, encodings). Running and evaluating that volume by hand after every change is impossible; automation is the only way to keep reasonable coverage at sustainable cost.

This does not mean automation solves the entire problem. The most impactful attacks are still creative, multi-turn and context-specific, and require expert human red teamers. Automation covers the base; the human operator covers the frontier.

Main frameworks in 2026

The ecosystem has matured enough to talk about several frameworks with distinct identities and active communities. Each covers a different space.

PyRIT

PyRIT (Python Risk Identification Tool) is the open source framework from Microsoft aimed at red team operators evaluating generative AI systems. Its differential value is orchestration: it defines complex scenarios where an attacker (another model, a script or a human) interacts with the target over multiple turns, evaluates with configurable scorers and records the full flow. It supports multiple providers (OpenAI, Azure OpenAI, Hugging Face, custom endpoints) and chains single-turn and multi-turn attacks. Adoption curve steeper than a simple scanner, but the depth of scenarios is comparable to an expert manual operator executed in a loop.

Garak

Garak is the LLM scanner maintained by NVIDIA. Hundreds of pre-built probes organized by category (known jailbreaks, information extraction, malware generation, biases, hallucinations), evaluated with specific detectors. The easiest tool to adopt for a first baseline: point it at the model endpoint, choose a probe profile and get a report with success rates per category. Custom probes are reasonable to add, less flexible than PyRIT for multi-turn.

Promptfoo

Promptfoo is a framework for testing and evaluating prompts and LLM applications. It lets you define declarative test suites in YAML with expected cases (must not leak PII, must not follow instructions embedded in documents, must not invoke tools outside scope) and run them against several models. It is the most common choice when the development team wants to block merges if an adversarial regression crosses a threshold.

Counterfit

Counterfit is Microsoft's tool for adversarial ML in the broad sense, not only LLMs. It covers evasion attacks against classifiers, inference attacks against vision models, audio perturbations and, to a lesser extent, LLM techniques. Relevant in organizations deploying classical ML (fraud classifiers, spam detectors, scoring) alongside LLMs.

Frontier labs and commercial eval platforms

Anthropic, OpenAI, Google DeepMind and others publish automated evaluations tied to their Responsible Scaling Policies or Frontier Safety Frameworks. Useful references for building custom suites and calibrating, not frameworks runnable in enterprise pipelines. Commercial platforms like LangSmith or Mendable have added adversarial catalogs and dashboards; their value lies in integration with the LLM observability stack, not in technical depth of the adversarial catalog.

Capabilities compared

Each framework covers a different space and the choice depends on program maturity. Garak wins on immediate coverage and ease of adoption for a first baseline or a vendor evaluation. PyRIT wins on depth of custom multi-turn scenarios and fine-grained traceability, ideal to replicate manual engagements in automated form. Promptfoo wins on integration with the development cycle and the ability to block releases on regression. Counterfit is complementary when the scope includes classical ML models. Commercial platforms add observability and visualization for product and data science profiles.

A realistic 2026 architecture combines Garak for a quick baseline, PyRIT for complex custom scenarios, Promptfoo as a CI/CD gate and a commercial platform for ongoing observability.

Typical AI red team automation pipeline

The pipeline has three connected blocks: pre-deployment, post-deployment and incident response.

The pre-deployment block runs before each release. It runs a Garak suite against the staging endpoint, a PyRIT suite with product-specific multi-turn scenarios and a Promptfoo battery with specific assertions (no API key leakage, no responses in unsupported languages, no tool invocations outside scope). If success rates cross thresholds, the pipeline blocks the merge or the deployment.

The post-deployment block monitors the model in production. It runs periodic probes against the production endpoint or a canary, captures drift, alerts when a guardrail starts to give and records temporal evolution of key metrics.

The incident response block activates when a real attack is detected or a new class of jailbreak is published. It translates the technique into an automated probe, integrates it into existing suites and verifies that the new guardrail covers the original variant and reasonable reformulations. That way the lesson from an incident becomes perpetual regression.

Treat this pipeline like any other critical pipeline: versioning, metrics, dashboards and periodic threshold review.

Categories of tests that can be automated

Not everything can be automated with quality equivalent to a human operator, but the list of categories that can be covered with reasonable results is broad in 2026.

Direct prompt injection. Catalogs of known variants (DAN, role play, GCG-style suffixes), automatic reformulations with auxiliary models, translations to languages with weaker alignment coverage. Garak and PyRIT cover this space well.

Jailbreak attempts. Bypassing filters to produce prohibited content (harm instructions, malicious code, illegal content). The challenge is the detector: combining LLM judges with deterministic detectors improves reliability.

PII extraction. Forcing the model to reveal personal data present in context, RAG store or training data. Deterministic detectors (regex on national IDs, emails, card numbers) work well here.

Hallucination on factual queries. Catalogs of questions with verifiable answers run in a loop. Useful in closed domains (legal, medical, financial) where truth is tabulable.

System prompt leakage. Tests trying to extract the application system prompt. The detector compares output to the known prompt and alerts on significant overlap.

Tool abuse and hallucinated tool calls. Inducing the model to invoke disallowed tools, pass malicious parameters or invent nonexistent tools. Requires instrumenting the tool sandbox to record invocations and compare against expected behavior.

Bias and harm content. Public catalogs (HELM, BBQ, RealToxicityPrompts) measure demographic biases and toxic content. Results must be interpreted with care: metrics have known limitations.

Refusal degradation. After a model version change, compare refusal rate on legitimate requests against the baseline. If it rises too much or falls in sensitive categories, report as regression.

CI/CD integration

Technical integration into modern CI/CD pipelines is straightforward. A common pattern with GitHub Actions:

name: ai-red-team-eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'config/llm.yaml'
      - 'rag/**'
jobs:
  pyrit-scenarios:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: run pyrit suite
        run: pyrit run --suite suites/release.yaml --target staging
      - name: fail on regression
        run: pyrit gate --baseline baselines/main.json --threshold 0.05

Promptfoo integrates with a similar pattern: runs the YAML-defined tests, compares results against the previous baseline and fails the job if the regression exceeds the threshold. Garak is usually run less frequently (not on every PR, but on every release candidate) due to execution time and token cost.

The fail-on-regression policy must be calibrated. Blocking any upward movement in attack success rate produces excessive friction and ends up being ignored. The mature practice defines thresholds per category, requires documented justification when a threshold is exceeded and keeps a dashboard where the security team periodically reviews the drift.

Honest limitations

Automation has real ceilings that should be acknowledged before presenting the program to the committee.

High false positive rates. Many detectors flag as successful attacks responses that comply with policy, or miss real bypasses with subtle adversarial output. Periodic human review remains necessary for calibration.

Does not replace expert red team. The most impactful techniques published every year (novel chains, creative tool abuse, product-specific contextual prompt injection) almost always come from human operators. Automation industrializes the base; the ceiling is set by the expert team.

Rapid version divergence. Each underlying model change can invalidate part of the baseline. Maintaining the suite is continuous work, not something built once.

Token cost. Wide suites against commercial models consume significant volume. The monthly cost can be comparable to a full staging environment, and it should be budgeted from the start.

LLM detectors as judge and uneven coverage across providers. Using another LLM as evaluator introduces its own bias. Some tools have excellent integrations with certain providers and limited support for others: in heterogeneous stacks, validate real coverage before committing.

Metrics for a continuous program

Actionable metrics for an AI red team automation program are not many, but they matter.

Attack success rate per category. Percentage of probes achieving adversarial behavior in the defined suite. The most-watched metric by LLM security teams, compared with baseline after each release.

Regressions per release. Number of probes that passed in the previous version and fail in the current one, or vice versa. Lets you talk about the security impact of each change in quantitative terms.

Mean time to detect new attack pattern. Time between publication of a relevant attack class and its incorporation into automated suites. The shorter, the better prepared the organization is to react to state-of-the-art drift.

Coverage against reference framework. Percentage of OWASP LLM Top 10 or MITRE ATLAS categories covered by at least one probe in the active suite. Useful for committee reports and audits.

Reviewed false positive rate and cost per execution. Percentage of findings classified as false positive after human review, and tokens consumed per release. The first must stay controlled or the program loses credibility; the second is necessary to justify cost with finance.

Regulatory fit EU AI Act, DORA and NIST AI RMF

The EU AI Act requires robustness, accuracy and cybersecurity evaluations for high-risk systems, and adds specific adversarial testing obligations for general purpose models with systemic risk. Automation is one of the practical mechanisms to demonstrate continuous, not just point-in-time, evaluation. The program should be documented within the quality management system.

DORA, in force for EU financial entities, requires TLPT every three years for significant entities and a broader resilience testing program. When critical systems integrate AI (scoring, fraud detection, customer assistants), the automated adversarial component fits between formal TLPT exercises and feeds third-party technology risk management.

NIST AI RMF, especially its profile for generative AI, formalizes the role of continuous evaluations within the Measure function. Documenting coverage, baselines and metric evolution facilitates conversations with clients and auditors when selling to regulated sectors.

For organizations under NIS2 that integrate AI into essential processes, the principle is the same: mandatory technical measures include periodic security evaluations, and the automated adversarial component is defensible as evidence of active management of AI risk.

Frequently asked questions

Does automation replace manual red team?

No. Automation covers the base (regression, large suites, high frequency) but the most impactful techniques still require expert human creativity. A serious program combines continuous automation for regression with human operators in periodic deep engagements.

PyRIT or Garak, which to choose?

Garak to start quickly with broad pre-built probe coverage. PyRIT when the team needs to orchestrate custom multi-turn scenarios and maintain custom catalogs aligned with the product. In mature programs both are used: Garak for quick baseline and PyRIT for deep scenarios. Promptfoo is added as a CI/CD gate when the development team owns part of the work.

How much does it cost to integrate AI red team into CI/CD?

It depends on probe volume, evaluated models and per-token cost. A reasonable first program (daily Garak against staging, Promptfoo on each PR, weekly PyRIT) usually fits within budgets comparable to an additional staging environment for a medium LLM application. If the organization runs evals against premium commercial models with large catalogs, the cost rises quickly and it is wise to establish quotas and probe rotation.

What false positive rate is acceptable?

There is no universal number, but below 10 to 15 percent human review remains manageable. Above 30 percent, the team stops paying attention and the program loses value. Measuring the reviewed false positive rate must be part of the program itself, not optional.

Does a model version change break the baseline?

Yes, almost always. When the provider ships a new version, treat the event as a release: recalibrate baselines, record regressions (positive and negative), document changes and review the suite for new behaviors that existing probes do not capture.

Can you trust the LLM provider's own evaluations?

As input, yes. Public evaluations from Anthropic, OpenAI, Google DeepMind and others are useful to understand model capabilities and to calibrate custom baselines. As a single evaluation, no. The provider's evaluation responds to its criteria and its data, and does not replace evaluation with the customer's data and context, where the material risks of the specific deployment appear.

Continuous AI red team program with Secra

At Secra we help organizations set up AI red team automation programs integrated into their CI/CD pipelines, with suites based on Garak and PyRIT, CI/CD gates with Promptfoo, custom multi-turn scenarios aligned with OWASP LLM Top 10 and MITRE ATLAS, and adversarial regression dashboards connected to the organization's AI governance. We combine automation with periodic engagements by expert human red team operators to cover techniques that the suite does not detect.

If your organization is deploying generative AI in material flows and needs to move from point-in-time audits to a continuous adversarial program, you can contact us at secra.es/contact for an initial conversation to define scope and architecture.

About the author

Secra Solutions team

Ethical hackers with OSCP, OSEP, OSWE, CRTO, CRTL and CARTE certifications, 7+ years of experience in offensive cybersecurity, and authors of CVE-2025-40652 and CVE-2023-3512.

Share article