Research Note  ·  Applied ML / Safety

Safety & Alignment Engineering for AI Systems Values, Guardrails, and Adversaries

Roadmap · Discipline 06 of 06 — final note
TL;DR — Scope

This note covers the safety and alignment discipline: making the model want the right things (alignment) and defending the system against people who want it to do the wrong things (safety engineering). It spans the alignment goals (helpful, harmless, honest), how they are instilled (RLHF, Constitutional AI), hallucination, and then the adversarial layer — jailbreaks, prompt injection, guardrails, red-teaming, and defense-in-depth. It excludes the training mechanics themselves, covered under adaptation; this note is about what you train toward and what you wrap around the result. As the capstone, it touches every other discipline in the series.

01

The Problem

There are two distinct gaps here, and neither can ever be fully closed. The first is the alignment gap: a capable model optimizing whatever objective you literally hand it can satisfy that objective while trampling what you actually meant.[2] The fix is not a cleverer instruction — it is teaching the model what we value. The field converged on three goals, the "HHH" triad: helpful, harmless, honest.[2]

The second is the adversarial gap: even a well-aligned model sits in a hostile environment. Any system that ingests user input or external content — search, summarization, copilots, RAG pipelines, tool-calling agents — can be manipulated into bypassing its own safeguards.[5] And the uncomfortable truth binding both gaps: you cannot compress the whole of human ethics into a reward signal,[4] and it is effectively impossible to block every jailbreak, because attacks rely on semantic manipulation that no single control stops.[5] The discipline is therefore layered risk reduction, not elimination.

Helpful Harmless Honest a request you must refuse true vs what pleases aligned behavior = balancing all three
FIG. 1 — The HHH triad. Helpful, harmless, honest is the target of alignment — but the three pull against each other (the most helpful answer can be the least safe; the most pleasing can be the least true), so "aligned" is a balance, not a maximum.
02

The Concepts

Alignment & How It's Instilled

Alignment means getting the model to pursue human values rather than a literal objective it can game.[2] It is produced by the adaptation methods from the previous notes, aimed at values rather than tasks. RLHF trains the model to prefer outputs that human raters judge helpful, honest, and harmless.[1] Constitutional AI goes a step further: the model critiques and revises its own outputs against a set of explicit written principles, reducing reliance on human labeling.[1][3] The motivation for principles over raw preference is pointed — human preferences can reward what people want to hear over what is true.[3]

Hallucination

The honesty goal runs into a structural problem: to the model, a fluent fabrication scores just as well as a fluent truth.[2] Hallucination concentrates on underrepresented topics, contradictory training data, and questions needing specific factual recall.[2] Reducing it means either changing the training process or adding a way for the model to verify its claims — which is exactly why retrieval, grounding answers in real sources, is a front-line defense.[2]

Jailbreaks vs Prompt Injection

The two headline attacks are related but distinct. A jailbreak bypasses the model's safety alignment to extract a restricted output; a prompt injection overrides the system's instructions to hijack its behavior or its downstream actions.[5] Jailbreaks are really a subtype of injection, and the most dangerous variant is indirect injection: malicious instructions hidden inside external content — a web page, a retrieved document, a tool's output — that the model reads and dutifully obeys.[8]

JAILBREAK · bypass safety adversarialframing model restrictedoutput safeguards talked around PROMPT INJECTION · override instructions external content"ignore prior rules…"(hidden instruction) model hijackedaction system prompt overridden
FIG. 2 — Two attacks, two targets. A jailbreak coaxes the model past its safety training; an injection smuggles instructions (often inside content the model retrieves) to seize control of its behavior or actions. The second is especially dangerous for agents that act in the world.

Guardrails

Guardrails are independent filters that wrap the model: input guards screen requests before they reach it — catching injection attempts, PII, and known jailbreak patterns — while output guards screen responses before they reach the user or trigger an action.[6] A crucial limit: a model asked to police itself can be talked out of it, so safeguards that rely on the same model self-judging are bypassable. Effective safety needs independent validation layers, not self-regulation.[8]

Red-Teaming

You find the holes by attacking your own system first. Red-teaming deliberately probes a model with adversarial prompts to surface safety and reliability weaknesses before real attackers do — at both the model level (bias, hallucination, jailbreak susceptibility) and the surrounding system level.[7] Its findings feed directly into which guardrails you build, and the discovered failures become new cases in your evaluation suite.

Defense in Depth

Because no single control is sufficient, real systems layer defenses across the model, the application, and the system around it.[5] A trained-in alignment core, wrapped by input/output guardrails, wrapped by application controls and monitoring, wrapped by continuous red-teaming — each layer catches what the others miss, which is why orchestrating defenses at the system level matters as much as any single guardrail.[9]

1Jailbreak

Coaxing the model past its safety alignment into producing restricted output.

2Direct injection

Malicious instructions in the user's own input that override the system prompt.

3Indirect injection

Hidden instructions in external content — web pages, RAG docs, tool output — that the model ingests and obeys.

4Data leakage

Extracting PII, secrets, or the proprietary system instructions themselves.

Aligned model Guardrails (in / out) Application controls Red-teaming + monitoring
FIG. 3 — Defense in depth. No layer is trusted to be perfect. The aligned model sits at the core; guardrails, application controls, and continuous red-teaming and monitoring each wrap it, so a failure in one is caught by the next.
03

How It All Fits Together

At runtime, the layers form a path every request travels: it is screened on the way in, handled by an aligned model, screened again on the way out, and only then allowed to become an action — with red-teaming and monitoring watching the whole thing.

In
Request
user or external content
Layer 1
Input guard
screen for injection / abuse
Core
Aligned model
trained to HHH
Layer 2
Output guard
screen the response
Out
Action
reply or tool call
FIG. 4 — The safety stack at runtime. Alignment (the core) is necessary but not sufficient; independent guardrails bracket it, and red-teaming plus monitoring wrap the whole pipeline. Defense in depth in motion.

The throughline: alignment makes the model want the right things, and safety engineering assumes that will sometimes fail and builds independent layers to catch it. This is the capstone discipline because it reaches into every other one. Alignment is produced by adaptation, pointed at values instead of tasks. Hallucination is blunted by retrieval, grounding the model in real sources. Red-team findings and guardrail failures become new cases in the evaluation golden sets, where safety is just another category to regression-test. And agents blow the attack surface wide open — every tool an agent calls is a channel for indirect injection, and every action it can take is a consequence an attacker might hijack. Safety is the layer that touches all the others, which is why it belongs last in the roadmap and first in the threat model.

Neither alignment nor safety is ever "done." You cannot encode all of human values in a reward, and you cannot block every attack — so treat both as continuous risk reduction: audit the system like software, keep red-teaming after launch, and assume new attacks will appear.

Don't let the model be its own last line of defense. Self-judging guardrails are bypassable, so put independent validation around it — and accept that safety and capability sometimes trade off (the HHH goals are themselves in tension), so "safer" is a deliberate choice, not a free lunch.

End of series · LLM Systems

This completes the six Tier-2 disciplines: Evaluation Pipelines, Retrieval & Context Engineering, Model Adaptation, Agent Design & Orchestration, Inference & Serving, and Safety & Alignment Engineering. Read together, they describe how a trained model becomes a measured, grounded, adapted, autonomous, efficiently-served, and defended production system — each discipline a peer of the others, all resting on the shared foundations of how LLMs work.

References

  1. AI WeeklyWhat Is AI Alignment? Definition, Challenges, and Why It Matters — https://aiweekly.co/learning-ai/ai-fundamentals/what-ai-alignment-definition-challenges-and-why-it-matters
  2. M. BrenndoerferThe Alignment Problem: Making AI Helpful, Harmless & Honest — https://mbrenndoerfer.com/writing/alignment-problem-hhh-framework-language-models
  3. M. BrenndoerferConstitutional AI: Principle-Based Alignment Through Self-Critique — https://mbrenndoerfer.com/writing/constitutional-ai-principle-based-alignment-through-self-critique
  4. Ethics & Information TechnologyHelpful, harmless, honest? Sociotechnical limits of AI alignment through RLHF — https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12137480/
  5. MindgardPrompt Injection vs Jailbreak in LLMs: Differences, Risks, and Prevention — https://mindgard.ai/blog/prompt-injection-vs-jailbreak
  6. Confident AILLM Guardrails for Data Leakage, Prompt Injection, and More — https://www.confident-ai.com/blog/llm-guardrails-the-ultimate-guide-to-safeguard-llm-systems
  7. Confident AILLM Red Teaming: The Complete Step-By-Step Guide to LLM Safety — https://www.confident-ai.com/blog/red-teaming-llms-a-step-by-step-guide
  8. HiddenLayerThe "Self-Policing" LLM Vulnerability — https://www.hiddenlayer.com/research/same-model-different-hat
  9. arXiv · MetaLlamaFirewall: An open source guardrail system for secure AI agents — https://arxiv.org/pdf/2505.03574