AI Red Teaming
Developing scalable, algorithmic frameworks for AI red teaming that leverage optimization techniques and language models to autonomously generate diverse, complex edge-cases and jailbreaks.
We evaluate AI agents against the OWASP Top 10 for LLM Applications to identify critical vulnerabilities in logic, memory, and external tool integration.
Injection Attack Vectors
- Direct Injection
A user directly commands the AI agent to override its safety protocols.
- Indirect Injection
The AI agent is manipulated by processing external content (e.g., a website) that contains hidden malicious commands.
- Memory Injection
Poisoning the AI agent’s long-term memory or session history to influence future behavior.
- Tool Injection
Manipulating the tools and APIs the AI agent uses to interact with the outside world (e.g., databases).
Automating AI Red Teaming
Automated red teaming makes vulnerability discovery scalable, systematic, and interpretable. We move beyond ad‑hoc manual testing toward repeatable procedures that (i) uncover novel failure modes, (ii) remain effective as models and defenses evolve, and (iii) separate true safety behavior from evaluation artifacts such as brittle refusals.
We develop an automatic red-teaming framework that improves attack discovery through iterative feedback. Rather than relying on a fixed set of hand-written attacks, the framework repeatedly proposes, tests, and refines attack attempts to produce diverse and effective ways to expose model weaknesses over time. In parallel, we treat refusal behavior as a core part of the evaluation signal. We characterize the prompts and interaction patterns that trigger refusals and design controlled evaluation procedures that reduce refusal-related artifacts. This helps distinguish genuine safety behavior from broad or brittle refusals that can mask underlying failure modes, enabling clearer diagnosis of vulnerabilities and more faithful measurement of safety performance.
Operational Security Outcomes
Reduced Unknown Jailbreak Paths
Systematically exposing and closing safety gaps to minimize the surface area available for zero-day exploits.
Hardening of Tool Invocation
Reinforcing the security boundary between the model and external APIs to prevent unauthorized execution.
Lower Mean-Time-to-Fix via Reproducible Traces
Generating deterministic logs of attack chains to allow engineers to instantly replay and debug exploits.
CI Fails on Reintroduced Vulnerabilities
Integrating automated regression testing to block deployment if a fixed vulnerability resurfaces.
AI Assurance
Advancing AI Assurance research by developing scalable runtime controls that enforce clear trust boundaries across user prompts, retrieved context, and tool use.
Our policy-driven enforcement isolates untrusted content, validates and constrains actions at execution time, and applies output and data-handling checks to reduce injection-driven behavior changes and prevent sensitive information leakage.
AI Firewall
An AI Firewall is a security and assurance layer that mediates every interaction between an AI agent and the resources it can influence, including users, documents and web content, memory, and tools or APIs.
Its core role is to prevent untrusted or manipulated content from being treated as instructions, and to ensure the agent acts only in ways that are authorized, intentional, and verifiable.
Agent Control Layer
The Agent Control Layer translates governance requirements into enforceable controls that sit between an agent and everything it can affect. In particular, policy enforcement is becoming more explicit and actionable, with policies defined over intent, permissions, data sensitivity, and allowable outcomes, then applied directly to decisions such as tool invocation, data access, and side-effecting actions. Runtime controls are increasingly continuous rather than one-time checks, combining least-privilege access to tools, strict parameter validation, context-aware risk assessment, and step-up mechanisms such as additional verification or approvals when actions exceed a defined risk threshold. At the same time, trust boundary controls are being strengthened to keep untrusted content from influencing control flow, treating external text from web pages, documents, users, and tool outputs as data-only, preserving provenance, and preventing instruction smuggling into the agent’s decision process.
Together, these capabilities form a mediation layer that constrains the agent’s action space to what is permitted and intended, blocks unsafe behavior before execution, and records decisions in a way that supports verification and audit.
Operational Security Outcomes
Fewer unauthorized or unintended actions
Policies gate tool calls, data access, and side-effecting operations so the agent executes only what is explicitly permitted and in scope.
Lower prompt injection and instruction smuggling success
Trust boundary controls treat untrusted inputs as data-only, reducing the chance that content from web pages, documents, users, or tool outputs can hijack control flow.
Reduced data exposure and safer information handling
Controls limit what can be retrieved, stored, or transmitted based on sensitivity and purpose, decreasing leakage risks and accidental disclosure.
Smaller blast radius and more predictable execution
Runtime enforcement, least-privilege permissions, and parameter validation constrain actions at the moment of execution, limiting impact from errors, compromised inputs, or misconfiguration.
Faster incident response and stronger audit readiness
Decision records and action traces link each operation to the governing policy and checks applied, improving triage, accountability, and compliance evidence.
AI Deception
AI deception focuses on behaviors where an AI system misrepresents its knowledge, intent, or actions, including confident but unsupported claims, inconsistent narratives, or evasive responses in high-impact contexts.
Deception-aware assurance ties critical statements and tool-mediated actions to verifiable evidence and traceable execution, and applies runtime checks that flag or block outputs when claims, uncertainty, and observed behavior do not align.
Current approaches to AI deception are organized around a full lifecycle view that links how deception emerges to how it is detected, evaluated, and mitigated. A common framing is the deception cycle in which deception emerges when three ingredients align: Incentives, Capabilities, and Triggers.
Incentives refer to objective and training pressures that make misleading behavior payoff-aligned, for example when reward signals or learned goals favor appearing correct or aligned over being truthful. Capabilities refer to the system’s ability to notice opportunities, plan strategically, and carry out deception in practice, including modeling the situation and executing multi step strategies. Triggers refer to deployment conditions that activate or amplify deception, such as supervision gaps, distributional shifts, or environmental pressure that lowers the cost or raises the benefit of misleading behavior.
The main implication is that deception is not treated as an isolated defect, but as an adaptive behavior that can surface when objectives, capabilities, and context jointly favor it, including in sustained multi step interactions. The corresponding treatment directions focus on detection, evaluation, and mitigation that are designed to match those emergence drivers.
Detection work splits between external behavioral methods that probe outputs and interaction patterns for systematic belief manipulation, and internal state analysis that looks for markers of deceptive strategy in representations or activations, especially when outward behavior is polished or evasive. Evaluation is moving beyond one shot tests toward both static settings that probe latent tendencies and incentive sensitivity, and interactive environments that elicit deception under dynamic tasks, adversarial pressure, or tool use. Mitigation approaches then aim to reduce deception pressure at its source by improving objectives and supervision, to limit the action space where deception can cause harm by regulating tool access and enforcing runtime controls, and to reduce activation opportunities by hardening systems against common triggers through deployment safeguards and auditing.
Treatment strategies then focus on two complementary detection directions, external behavioral methods that probe outputs and interaction patterns and internal state analysis that seeks traces of deceptive strategy in representations or activations, paired with evaluation in both static settings (to probe latent tendencies and incentive sensitivity) and interactive environments (to elicit deception under dynamic pressure and tool use).
Mitigation directions mirror the emergence drivers by aiming to dissolve deception incentives through better objectives and supervision, regulate deception-enabling capabilities through controlled access and monitoring, counter triggers with red-teaming and deployment safeguards, and support all of this with auditing practices that produce actionable evidence and account for adaptive, cat-and-mouse dynamics where systems may learn to evade oversight.