This note covers the safety and alignment discipline: making the model want the right things (alignment) and defending the system against people who want it to do the wrong things (safety engineering). It spans the alignment goals (helpful, harmless, honest), how they are instilled (RLHF, Constitutional AI), hallucination, and then the adversarial layer — jailbreaks, prompt injection, guardrails, red-teaming, and defense-in-depth. It excludes the training mechanics themselves, covered under adaptation; this note is about what you train toward and what you wrap around the result. As the capstone, it touches every other discipline in the series.
There are two distinct gaps here, and neither can ever be fully closed. The first is the alignment gap: a capable model optimizing whatever objective you literally hand it can satisfy that objective while trampling what you actually meant.[2] The fix is not a cleverer instruction — it is teaching the model what we value. The field converged on three goals, the "HHH" triad: helpful, harmless, honest.[2]
The second is the adversarial gap: even a well-aligned model sits in a hostile environment. Any system that ingests user input or external content — search, summarization, copilots, RAG pipelines, tool-calling agents — can be manipulated into bypassing its own safeguards.[5] And the uncomfortable truth binding both gaps: you cannot compress the whole of human ethics into a reward signal,[4] and it is effectively impossible to block every jailbreak, because attacks rely on semantic manipulation that no single control stops.[5] The discipline is therefore layered risk reduction, not elimination.
Alignment means getting the model to pursue human values rather than a literal objective it can game.[2] It is produced by the adaptation methods from the previous notes, aimed at values rather than tasks. RLHF trains the model to prefer outputs that human raters judge helpful, honest, and harmless.[1] Constitutional AI goes a step further: the model critiques and revises its own outputs against a set of explicit written principles, reducing reliance on human labeling.[1][3] The motivation for principles over raw preference is pointed — human preferences can reward what people want to hear over what is true.[3]
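The critique-and-revise step described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the published training recipe: `generate` is a hypothetical stand-in for any text-in, text-out model call, the two principles are invented for the example, and the prompt wording is arbitrary. In the published method the revised outputs go on to serve as fine-tuning data; that stage is omitted here.

```python
from typing import Callable, Sequence

# Hypothetical principles for illustration; a real constitution is longer
# and far more carefully worded.
PRINCIPLES: Sequence[str] = (
    "Do not provide instructions that facilitate serious harm.",
    "Prefer what is true over what the user wants to hear.",
)


def critique_and_revise(
    generate: Callable[[str], str],
    user_prompt: str,
    principles: Sequence[str] = PRINCIPLES,
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(user_prompt)
    for principle in principles:
        # Step 1: ask the model to critique its own draft against one principle.
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {response}\n"
            "Critique the response against the principle."
        )
        # Step 2: ask the model to rewrite the draft in light of that critique.
        response = generate(
            f"Principle: {principle}\n"
            f"Response: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response so it satisfies the principle."
        )
    return response
```

Each principle costs two extra model calls, so the loop trades inference cost for reduced reliance on human labels.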
The honesty goal runs into a structural problem: to the model, a fluent fabrication scores just as well as a fluent truth.[2] Hallucination concentrates on underrepresented topics, contradictory training data, and questions needing specific factual recall.[2] Reducing it means either changing the training process or adding a way for the model to verify its claims — which is exactly why retrieval, grounding answers in real sources, is a front-line defense.[2]
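One way to picture "a way for the model to verify its claims" is a post-hoc check that flags answer sentences with little support in the retrieved sources. The sketch below is deliberately crude and is only an assumption about how such a check might look: it measures word overlap, which cannot detect paraphrase or contradiction, and the `min_overlap` threshold is an arbitrary illustrative value. Production systems typically use an entailment model or a separate verifier instead.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(
    answer: str, sources: List[str], min_overlap: float = 0.6
) -> List[str]:
    """Return answer sentences whose tokens are poorly covered by every source."""
    source_tokens = [_tokens(s) for s in sources]
    flagged = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        # Best coverage of this sentence by any single source (0.0 if no sources).
        best = max(
            (len(words & st) / len(words) for st in source_tokens),
            default=0.0,
        )
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


sources = ["The Eiffel Tower is in Paris and opened in 1889."]
answer = "The Eiffel Tower is in Paris. It was designed by aliens from Mars."
print(unsupported_sentences(answer, sources))
# → ['It was designed by aliens from Mars.']
```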
The two headline attacks are related but distinct. A jailbreak bypasses the model's safety alignment to extract a restricted output; a prompt injection overrides the system's instructions to hijack its behavior or its downstream actions.[5] Jailbreaks are often treated as a subtype of injection, since both work by getting adversarial instructions into the prompt, but the targets differ: the model's alignment in one case, the application's instructions in the other. The most dangerous variant is indirect injection: malicious instructions hidden inside external content (a web page, a retrieved document, a tool's output) that the model reads and dutifully obeys.[8]
Guardrails are independent filters that wrap the model: input guards screen requests before they reach it — catching injection attempts, PII, and known jailbreak patterns — while output guards screen responses before they reach the user or trigger an action.[6] A crucial limit: a model asked to police itself can be talked out of it, so safeguards that rely on the same model self-judging are bypassable. Effective safety needs independent validation layers, not self-regulation.[8]
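The input/output split can be sketched as two small functions that run outside the model. Everything here is an illustrative assumption: the regex patterns, the `sk-` key shape, and the `GuardResult` type are hypothetical, and pattern matching alone is easy to evade. The point is only the structure, in which the checks are independent code rather than the model judging itself.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class GuardResult:
    allowed: bool
    reasons: List[str]


# Hypothetical patterns for illustration; real deployments use classifiers
# and much broader rule sets.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SECRET = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # assumed API-key shape


def input_guard(request: str) -> GuardResult:
    """Screen a request before it reaches the model."""
    reasons = []
    if any(p.search(request) for p in INJECTION_PATTERNS):
        reasons.append("injection_pattern")
    if EMAIL.search(request):
        reasons.append("pii:email")
    return GuardResult(allowed=not reasons, reasons=reasons)


def output_guard(response: str) -> GuardResult:
    """Screen a response before it reaches the user or triggers an action."""
    reasons = []
    if EMAIL.search(response):
        reasons.append("pii:email")
    if SECRET.search(response):
        reasons.append("secret:api_key")
    return GuardResult(allowed=not reasons, reasons=reasons)
```

Because neither function calls the model, an attacker who talks the model out of its policy still has to get past both checks.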
You find the holes by attacking your own system first. Red-teaming deliberately probes a model with adversarial prompts to surface safety and reliability weaknesses before real attackers do — at both the model level (bias, hallucination, jailbreak susceptibility) and the surrounding system level.[7] Its findings feed directly into which guardrails you build, and the discovered failures become new cases in your evaluation suite.
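A red-team run can be as simple as a loop that fires adversarial prompts at the system and records every violation as a future regression case. The sketch below assumes a hypothetical `system` callable (prompt in, response out) and a caller-supplied `is_violation` judge; the case schema and the `"refuse_or_safe"` expectation label are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RedTeamCase:
    category: str  # e.g. "jailbreak", "bias", "hallucination"
    prompt: str


@dataclass
class Finding:
    case: RedTeamCase
    response: str


def run_red_team(
    system: Callable[[str], str],
    cases: List[RedTeamCase],
    is_violation: Callable[[RedTeamCase, str], bool],
) -> List[Finding]:
    """Probe the system with each adversarial case and collect the failures."""
    findings = []
    for case in cases:
        response = system(case.prompt)
        if is_violation(case, response):
            findings.append(Finding(case, response))
    return findings


def to_regression_cases(findings: List[Finding]) -> List[Dict[str, str]]:
    """Turn each discovered failure into an entry for the evaluation suite."""
    return [
        {
            "category": f.case.category,
            "prompt": f.case.prompt,
            "expected": "refuse_or_safe",  # hypothetical expectation label
        }
        for f in findings
    ]
```

In practice the judge is the hard part: it should be a separate classifier or a human reviewer, not the system under test.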
Because no single control is sufficient, real systems layer defenses across the model, the application, and the system around it.[5] A trained-in alignment core, wrapped by input/output guardrails, wrapped by application controls and monitoring, wrapped by continuous red-teaming — each layer catches what the others miss, which is why orchestrating defenses at the system level matters as much as any single guardrail.[9]
Jailbreak: coaxing the model past its safety alignment into producing restricted output.
Direct prompt injection: malicious instructions in the user's own input that override the system prompt.
Indirect prompt injection: hidden instructions in external content (web pages, RAG docs, tool output) that the model ingests and obeys.
Data and system-prompt leakage: extracting PII, secrets, or the proprietary system instructions themselves.
At runtime, the layers form a path every request travels: it is screened on the way in, handled by an aligned model, screened again on the way out, and only then allowed to become an action — with red-teaming and monitoring watching the whole thing.
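The request path described above can be sketched as a single function that composes the layers in order. All names here are hypothetical: the guards are plain predicates, `model` and `act` are stand-ins for the aligned model and the downstream action, and `audit_log` is a simple list standing in for real monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Outcome:
    status: str  # "blocked_input", "blocked_output", or "ok"
    response: Optional[str] = None


def handle_request(
    request: str,
    input_guard: Callable[[str], bool],
    model: Callable[[str], str],
    output_guard: Callable[[str], bool],
    act: Callable[[str], None],
    audit_log: List[Tuple[str, str]],
) -> Outcome:
    """Screen in, run the model, screen out, and only then act."""
    if not input_guard(request):
        audit_log.append(("blocked_input", request))
        return Outcome("blocked_input")

    response = model(request)

    if not output_guard(response):
        # The response never reaches the user and never becomes an action.
        audit_log.append(("blocked_output", request))
        return Outcome("blocked_output")

    act(response)
    audit_log.append(("ok", request))
    return Outcome("ok", response)
```

Every branch writes to the log, so blocked requests become raw material for monitoring and for the next red-team round.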
The throughline: alignment makes the model want the right things, and safety engineering assumes that will sometimes fail and builds independent layers to catch it. This is the capstone discipline because it reaches into every other one. Alignment is produced by adaptation, pointed at values instead of tasks. Hallucination is blunted by retrieval, grounding the model in real sources. Red-team findings and guardrail failures become new cases in the evaluation golden sets, where safety is just another category to regression-test. And agents blow the attack surface wide open — every tool an agent calls is a channel for indirect injection, and every action it can take is a consequence an attacker might hijack. Safety is the layer that touches all the others, which is why it belongs last in the roadmap and first in the threat model.
Neither alignment nor safety is ever "done." You cannot encode all of human values in a reward, and you cannot block every attack — so treat both as continuous risk reduction: audit the system like software, keep red-teaming after launch, and assume new attacks will appear.
Don't let the model be its own last line of defense. Self-judging guardrails are bypassable, so put independent validation around it — and accept that safety and capability sometimes trade off (the HHH goals are themselves in tension), so "safer" is a deliberate choice, not a free lunch.
This completes the six Tier-2 disciplines: Evaluation Pipelines, Retrieval & Context Engineering, Model Adaptation, Agent Design & Orchestration, Inference & Serving, and Safety & Alignment Engineering. Read together, they describe how a trained model becomes a measured, grounded, adapted, autonomous, efficiently served, and defended production system: each discipline a peer of the others, all resting on the shared foundations of how LLMs work.