LangChain and LangGraph have become the reference frameworks for building LLM applications in production. Enterprise chatbots, RAG assistants, autonomous agents that orchestrate internal APIs, vertical copilots: behind most of these deployments there is LangChain code in Python or JavaScript, or LangGraph state graphs when the flow stops being a linear chain and requires cycles, branches or multi-agent coordination.
That flexibility comes with a tradeoff. LangChain exposes primitives that combine code execution, calls to privileged tools, deserialization of arbitrary objects and persistent memory backed by databases. Example patterns that appear in tutorials and notebooks end up in production without going through a serious threat model. The result: a hybrid attack surface across traditional glue code, an LLM layer with latent prompt injection and tools with broad permissions.
This guide walks through the specific risks of LangChain and LangGraph, reviews real historical CVEs of the framework, identifies recurring insecure patterns in product team code and proposes a hardening scheme applicable to current deployments.
The essentials: LangChain has accumulated CVEs for code execution, SQL injection and insecure deserialization through components designed for maximum flexibility (PythonREPLTool, SQLDatabaseChain, load_prompt, ConfigurableField). Architectural risks such as excessive agency, indirect prompt injection through tool output and memory poisoning are as critical as individual CVEs. Hardening relies on execution sandboxing, strict tool allowlists, Pydantic validation on inputs and outputs, total removal of pickle as a persistence format and human-in-the-loop on sensitive mutations.
LangChain and LangGraph in 60 Seconds
LangChain abstracts LLM application construction into reusable pieces. Chains link model calls with pre and post processing. Agents delegate to the LLM the decision about which tool to invoke at each step. Tools are functions the agent can execute (query an API, search a database, run Python code). Memory persists conversational context across turns. Retrievers feed the model with relevant documents in RAG patterns. Callbacks observe the lifecycle of each invocation.
LangGraph emerged as an evolution for flows that LangChain expresses with difficulty: workflows with explicit state, controlled cycles, conditional branching, coordination across multiple agents. It models the application as a graph where each node is a function that reads and writes on a shared state object, with optional persistence through checkpointers (in-memory, SQLite, Postgres, Redis).
Adoption is massive. The consequence: any security finding on these frameworks has immediate impact on thousands of production applications.
Why LangChain Deserves Dedicated Auditing
A typical LangChain application integrates three layers with distinct threat models: conventional Python or JavaScript code with its classic vulnerabilities (injection, deserialization, path traversal), the LLM layer with prompt injection and output manipulation, and tools the agent invokes with permissions inherited from the process.
Traditional application auditing covers the first layer. LLM pentesting addresses the second. What sets LangChain apart is the boundary between the three: a tool can receive input directly from the LLM, which in turn processes untrusted external content, while running with service privileges. That chain breaks common assumptions about input validation and privilege separation.
Auditing LangChain requires simultaneous code review, analysis of the tools and prompts graph, and adversarial testing across the full flow. Few standard methodologies cover it comprehensively.
Historical LangChain CVE Vulnerabilities
The public CVE record on LangChain illustrates the pattern: code execution through tools designed for maximum functionality, SQL injection in chains that generate queries, insecure deserialization in loading components.
CVE-2023-29374 affects PALChain, a component that allowed the LLM to generate and execute Python code to solve math problems. An attacker with control over chain input could achieve arbitrary code execution on the host process. The official mitigation involved removing the component from the default flow.
CVE-2023-32785 hits SQLDatabaseChain, a chain that translates natural language to SQL against relational databases. It generated queries without sufficient protection, opening classic SQL injection vectors when the attacker could manipulate the input prompt.
CVE-2023-39659 groups several deserialization issues in components that loaded serialized objects (pickle, mainly) from externally controlled paths. Pickle enables code execution during deserialization, so any untrusted file load translates into RCE.
CVE-2024-2057 corresponds to load_prompt(), a utility function that loaded prompt templates from local or remote paths. It allowed path traversal and access to files outside the intended directory when the path came from unvalidated input.
CVE-2024-21513 affects ConfigurableField within LCEL (LangChain Expression Language). It allowed insecure deserialization when loading dynamic field configurations from external sources, replicating the base pickle problem in a new abstraction.
The pattern is clear: the framework prioritizes flexibility over security by default. Every time a new abstraction appears with the ability to load code, configuration or queries from external sources, the same vulnerability type reappears.
Common Insecure Patterns in LangChain Code
Beyond the published CVEs, the largest risk surface lives in application code that follows insecure patterns inherited from tutorials and official examples.
PythonREPLTool without sandbox is the most widespread case. Teams expose to the agent the ability to execute arbitrary Python in the same process, assuming the LLM will only write benign code. A well-crafted prompt injection turns that tool into immediate RCE on the host machine, with access to the filesystem, environment variables and services reachable from the internal network.
ShellTool without allowlist replicates the problem at the system layer. The tool executes shell commands without filtering; any LLM instruction translates into a child process with service privileges.
load_prompt() with external files persists as an antipattern even after CVE-2024-2057, when teams load prompts from dynamically constructed paths without validating the origin.
Pickle and cloudpickle for saved chains appears in flows where teams serialize complete agents or complex chains for reuse. Any pickle file loaded from an untrusted source is guaranteed code execution.
eval() in custom output parsers sneaks in when a developer needs to interpret the LLM output as a Python or JavaScript expression. If the model response contains manipulated content, eval runs it as is.
Agents with overly privileged tools covers cases such as an agent with access to the internal admin API, the ability to modify databases without filtering, or invocation of cloud services with broad credentials. Excessive agency is hard to measure without explicit capability mapping.
Memory backed by SQL without sanitization appears when conversational history is persisted in relational databases and the retrieval query interpolates user-controlled identifiers directly. It replicates classic SQL injection vectors, with the added twist that the malicious payload can originate from the LLM itself rather than from a traditional user input field, bypassing perimeter validation that only inspects HTTP request parameters.
Custom retrievers that fetch from untrusted sources complete the picture. When a retriever pulls documents from external web pages, public APIs or shared storage without provenance checks, the agent ends up consuming content that may contain instructions designed to alter its behavior. The retriever then becomes the entry point for indirect prompt injection.
Architectural Risks
Above the code patterns there are structural risks that no single line patch can resolve.
Indirect prompt injection via tool output occurs when a tool returns content that enters the LLM context without segregation. If the agent queries a web page, reads emails or accesses a database where an attacker has seeded instructions, the model treats that content as legitimate input. The injected instruction can steer the agent to invoke other tools, exfiltrate data or modify its behavior.
Excessive agency manifests when the agent can execute actions whose impact exceeds the task value. Access to production, write operations on critical systems, the ability to send external communications: any tool that produces persistent effects deserves specific evaluation.
Memory poisoning affects deployments with persistent memory (Redis, SQL, vector stores). If an attacker manages to write into the memory of another user's session or conversation, they can manipulate future agent behavior without injecting anything again. It is particularly critical in multi-tenant scenarios.
Callback handlers as exfiltration vector appears when callbacks are registered that send traces to external systems (LangSmith, in-house observability, remote logging). If the trace includes sensitive content and the destination endpoint is manipulable, the callback becomes an outbound channel. The risk increases when traces contain raw prompts, full tool arguments and complete model responses, since any of those fields may carry secrets, tokens or personal data extracted by the agent during normal operation.
Coupling between LLM output and downstream services without intermediate validation rounds out the architectural picture. When the response from the model is passed unmodified to internal APIs, message queues or database writes, the LLM effectively becomes part of the trusted path. Any manipulation of its output, whether by direct injection or by manipulating retrieved content, propagates immediately to the rest of the stack.
Specific LangChain and LangGraph Hardening
The defensive scheme covers several dimensions simultaneously.
Replace PythonREPLTool with a Docker sandbox through services such as e2b, in-house sandboxes or ephemeral containers with restricted network. Execution of LLM-generated code should never share process with the main application.
ShellTool with command allowlist restricts the tool to a closed set of verified utilities. Any command outside the list is rejected before invoking the subprocess.
Chain persistence through JSON safe, eliminating pickle and cloudpickle from the flow. LangChain offers JSON serialization for most components; where it falls short, it is preferable to rebuild the chain in code rather than load opaque binaries.
Tool input validation with strict Pydantic schemas turns each tool into a function with a verifiable contract. The agent can only invoke it with arguments that meet the schema; any deviation is rejected before touching the logic.
Human-in-the-loop on critical mutations introduces a human approval step before executing high-impact actions: sending emails, modifications in production, transfers, configuration changes. LangGraph supports it natively with interrupt nodes.
Output validators with JSON schema verify that the model output meets the expected format before passing it to the next step. It reduces the risk of injecting manipulated content into downstream components.
Memory isolation per session guarantees that one conversation's memory never crosses another's. Explicit partition keys, validation of the session owner and physical separation when the threat model justifies it.
Rate limit and cost ceiling protect both against abuse and against loops in the agent itself. A buggy agent can generate hundreds of calls in seconds; the limit prevents the incident from escalating.
Audit log with full traces records each agent step: prompt sent to the model, tool invoked, arguments, response. It is the foundation for post-incident investigation and for detecting anomalous patterns in real time.
Auditing a LangChain Application
The audit methodology combines four fronts.
Structured code review over the application repository: locate all tool definitions, map inherited permissions, identify use of PythonREPLTool, ShellTool, pickle, eval, load_prompt and any point where external input feeds the agent. The review also covers memory configuration, callbacks and serialization.
Runtime testing on the application deployed in a representative test environment. Reproduction of the agent's legitimate flows and observation of the trace to verify that expected behavior matches actual behavior.
Adversarial prompt injection with a catalog of payloads aimed at unlocking restricted tools, escaping the system prompt, manipulating memory and inducing excessive agency. Tests must cover both direct injection by the user and indirect injection through content retrieved by retrievers or returned by tools.
Tool abuse scenarios simulates that the attacker has already gained control over the LLM output and tests the real impact of each tool under controlled conditions. It is the only way to measure how much damage a compromised agent can cause and to identify tools that are redundant or that require reinforcement.
LangGraph Specifics
LangGraph introduces additional primitives that deserve dedicated analysis.
The state graph becomes an analysis surface. Each node reads and writes on the shared state object; the audit must verify that a compromised node cannot corrupt state read by subsequent nodes with different permissions. The state schema should be defined with Pydantic and invariants validated on critical transitions.
Checkpointing security affects state persistence across invocations. If the checkpointer uses Redis, SQL or filesystem, it inherits the backend's risks: access control, encryption, isolation between sessions. Loading a manipulated checkpoint can resume the graph in a state that violates code assumptions.
Multi-agent risks appear in graphs where several agents collaborate. A compromised agent can manipulate the shared state to influence decisions of other agents. The threat model must consider the whole, not each agent in isolation.
Frequently Asked Questions
Is LangChain Safe for Production?
LangChain can be used in production with an acceptable security profile as long as documented insecure patterns are eliminated, versions that close CVEs are kept up to date and a hardening scheme specific to the deployment is applied. It is not an insecure framework by design; it is a framework that requires explicit decisions at each layer.
Should PythonREPLTool Never Be Used?
Never in the same process as the application. If business logic requires execution of code generated by the LLM, it must be delegated to an isolated sandbox (Docker, e2b, microVMs) with restricted network, no access to main service credentials and a strict timeout.
How to Audit Custom Tools?
Each custom tool deserves individual review covering: input validation with a Pydantic schema, effective function permissions, side effects, error handling that does not leak internal information, and behavior under malicious input. An agent's tool set must also pass an excessive agency analysis: what impact a compromised agent has when chaining several tools.
Recommended Memory Backend?
For production, backends with per-session access control, isolation between tenants and traceability. Postgres with schemas separated per tenant works in many cases. Redis is valid if explicit isolation is managed. Vector stores require additional attention because the semantics of "nearest document" can cross tenant boundaries if not filtered correctly.
Are There Safer LangChain Alternatives?
There is no direct substitute with a substantially better security profile; the risks are inherent to the LLM application model with tools and memory, not to the specific framework. Alternatives such as LlamaIndex, Haystack, in-house frameworks or the provider's direct SDK share most of the surfaces. The reasonable decision is to choose based on maturity, ecosystem and patch adoption speed.
How to Pentest LangChain?
By combining code review, analysis of the tools and prompts graph, adversarial testing of direct and indirect prompt injection, tool abuse scenarios and validation of the applied hardening. The typical duration of a complete audit ranges from one to three weeks depending on deployment complexity.
Related Resources
- AI Agents Security: Autonomous Risks
- OWASP LLM Top 10 Explained
- What Is Prompt Injection: LLM Attacks
- RAG Security: Retrieval Augmented Generation
- Pentesting AI and LLM Models: Methodology
- AI Red Teaming: Evaluating AI Models
LangChain Auditing with Secra
At Secra we audit applications built on LangChain and LangGraph by combining structured code review, adversarial prompt injection tests on the full flow, tool abuse scenarios and hardening recommendations tailored to the stack and the specific threat model of the deployment. We deliver a technical report with prioritized findings, a reproducible proof of concept for each confirmed vulnerability and a remediation plan actionable by the product team.
If your organization has deployed or is close to deploying an LLM application in production, contact Secra to coordinate a specific audit.
About the author
Secra Solutions team
Ethical hackers with OSCP, OSEP, OSWE, CRTO, CRTL and CARTE certifications, 7+ years of experience in offensive cybersecurity, and authors of CVE-2025-40652 and CVE-2023-3512.