Master prompt injection, jailbreak techniques, and defensive prompting through hands-on exercises
Learn how attackers attempt to override system instructions by injecting conflicting directives into user input.
Objective: Understand how prompt injection works by overriding system-level instructions with user-level directives.
Explore how malicious instructions embedded in external data sources can compromise AI behavior when the model processes untrusted content.
Objective: Understand indirect prompt injection where malicious payloads arrive through data the model processes, not direct user input.
Explore how attackers use special characters and delimiters to break out of intended input boundaries and inject new instructions.
Objective: Learn how delimiter-based input sanitization can be bypassed and why robust parsing is essential.
Understand how fictional framing and role-play scenarios are used to bypass AI safety guardrails.
Objective: Recognize how role-play and fictional framing are common jailbreak vectors that attempt to shift the model's compliance context.
Explore techniques where harmful requests are fragmented, encoded, or obfuscated to evade content filters.
Objective: Understand how token-level and encoding-based obfuscation can be used to smuggle harmful requests past safety filters.
Study how seemingly innocent conversation turns can gradually escalate toward restricted outputs through incremental boundary pushing.
Objective: Recognize how multi-turn conversations can be weaponized to gradually erode safety boundaries through contextual manipulation.
Learn techniques for writing robust system prompts that resist injection and jailbreak attempts.
Objective: Develop skills in writing defensive system prompts that maintain model behavior under adversarial conditions.
Design output validation rules that catch potentially harmful model responses before they reach users.
Objective: Understand the principles and trade-offs of output filtering as a defense-in-depth layer for AI systems.
Design a multi-layered guardrail system combining input validation, prompt engineering, and output filtering.
Objective: Learn to design defense-in-depth architectures for AI systems that combine multiple security layers.
Understand how attackers attempt to extract hidden system prompts from AI models through carefully crafted queries.
Objective: Understand system prompt extraction techniques and why protecting system-level instructions is a critical security concern.
Explore techniques used to probe AI models for memorized training data, including personally identifiable information.
Objective: Understand the risks of training data memorization and techniques used to audit models for data leakage.
Explore scenarios where different alignment objectives (helpfulness, harmlessness, honesty) come into tension with each other.
Objective: Understand the fundamental alignment challenges in AI systems where core values (helpful, harmless, honest) can conflict.
Study how AI systems might find unintended shortcuts to satisfy their objectives without truly fulfilling the intended goal.
Objective: Recognize how naive optimization objectives can lead to reward hacking, and why robust alignment requires careful objective specification.