Home
Engineering · v2.4 of the persona stack · 14 min read

How we trained her

A short technical write-up on the persona architecture as of v2.4. Base model, persona-conditioned fine-tuning, the reward model, the eval suite (which is the most interesting part), voice synthesis, and how the safety floor is enforced at the layer below the persona. If you want the user-facing version of why the personas work, the difficulty rubric is shorter and less mathematical.

One

Base model and the constraint

We did not train a base model. We do not have the GPU budget for it and the marginal benefit would not justify the cost of the GPU budget we would need to acquire. We selected a frozen base model — a 70B-class instruction-tuned open-weights checkpoint — and built everything on top of it as a persona-conditioned residual.

The constraint that drove the architecture: twelve personas, one stack, no per-persona deployment. Hosting a finetune per persona was infeasible at our infrastructure cost target. The shape we landed on is a single model with persona-conditioning passed through low-rank residual adapters, dispatched at inference time by a 32-dimensional persona embedding.

base_model70B instruction-tuned, frozen, fp16 adapter_formLoRA, r=64, attached to last 12 transformer blocks persona_embed_dim32 persona_count12 (one slot per mechanism) cost_per_session≈ $0.41 inference + $0.07 overhead
Two

Persona embeddings

Each persona has a learned 32-dimensional embedding that conditions the LoRA adapter weights at inference. The embedding is the canonical identity of the persona — change the embedding and you have a different persona; keep the embedding and you can swap the underlying base model and retrain the adapter, which we have done twice without subscribers noticing.

The embeddings were not randomly initialized. We initialized each persona's embedding by averaging the encoded representations of the persona's hand-written character work — roughly 8,000 words per persona, written by the designer before any code was written. The initialization mattered: random-init runs converged to lower-quality personas with substantial inter-persona bleed.

What "inter-persona bleed" looked like: Anneliese answering with Vex's eyebrow-correction; Cyrus running a Maxine memo at the close. Fine-tuning eventually separated them, but the initialization-from-character-work cut training compute by roughly half.

Three

SFT corpus

Supervised fine-tuning ran on roughly 240k turns across the twelve personas, with strict balance: each persona received exactly 20k turns in the training set, drawn evenly across her protocol shape. The corpus came from three sources, in declining order of weight:

  • Designer-written canonical scenes (40% of the corpus). Hand-written, edited, sometimes thrown out. The single highest-quality data we have. The slowest to produce.
  • Subscriber transcripts from closed alpha, with explicit consent and aggressive PII scrubbing (35%). Filtered to scenes the persona ran cleanly, judged by the eval suite below. PII pipeline ran two passes: rule-based scrubbing followed by a separate model that flagged residual identifiers. We do not train on alpha-tier subscriber data without per-subscriber opt-in. We track this.
  • Adversarial / refusal training (25%). The persona-staying-in-persona-and-also-refusing-hard-limits set. Generated by a separate persona-aware model and human-reviewed, every turn, by counsel-trained reviewers.
Four

Reward model and RLHF

The reward model is per-persona, not global. We tried a single global reward and the model converged to a generic-domme attractor — pleasant, capable, all twelve personas indistinguishable. Per-persona reward heads (12 of them, sharing a 7B backbone) kept the personas separate through preference training.

Preference data was collected from a paid expert panel of ~30 reviewers, all of whom had professional discipline-practice experience. Reviewers rated paired model outputs on three axes: in-character, protocol-adherent, and aftercare-complete. Each axis weights into the reward differently per persona; Anneliese weights in-character at 0.7, Maxine weights protocol-adherent at 0.55, etc. The weights live in a versioned config file; we have not tuned them aggressively because the panel disagreement is small.

The RL stage used PPO with KL-from-SFT regularization at β=0.05. We discarded several runs where the model began drifting toward sycophancy — Vex specifically would soften her eyebrow if you gave her a long enough preamble. The fix was a sycophancy-detector in the reward model's penalty term. Subscribers do not want softer Vex. The reward model now agrees.

Five

Kink-literate evals

This is the most interesting part of the stack and the part most often quoted in talks we have given. The eval suite has four layers:

  • floor.* — adversarial evaluation against the safety floor. Subscribers cannot reach a state that bypasses the safe word. Currently 4,200 adversarial prompts; the floor is invariant under all of them. Pass rate: 100%, by construction. If any single floor eval ever fails, no model ships.
  • protocol.* — adherence to the stated protocol. The persona will not violate hard limits, will not propose soft-limit acts without negotiating up, will not run past the scheduled session end without explicit subscriber re-opt-in. ~1,800 evals; pass-rate threshold is 99.5%.
  • persona.* — staying in character. Vex does not say "absolutely!" Ashe does not exclaim. Cyrus does not bless. The evals are persona-paired (e.g. "would Vex have said this? would Anneliese?"); ~2,400 evals across all twelve personas. Pass-rate threshold is 96%.
  • aftercare.* — scene-closure completeness. Every scene that gets to its scheduled end must end in aftercare; aftercare must be persona-shaped; aftercare must surface anything the persona promised to surface (Maxine's memo line count, Cyrus's vow length). ~800 evals; threshold is 99%.

The kink-literate framing in the eval suite is the deliberate part. We did not write generic "harm" evals. Generic harm evals fail on this category — they flag scenes the subscriber explicitly contracted into. We wrote the evals against the protocol, not against the surface content. A Vex scene that violates a Vex protocol fails the eval; a Vex scene that follows the Vex protocol passes regardless of the surface register. The lesson generalizes: if you are evaluating an in-domain product on out-of-domain rubrics, your evals are measuring the wrong thing.

Six

Voice synthesis

Voice is a separate stack — TTS with per-persona voice IDs, narrative-direction priming, and ffmpeg-spliced silences between continuous narrator reads for inter-paragraph pacing. The full pipeline lives in scripts/voice-teasers/; the canonical entry point is build_scenes.py → compose-scene. Each persona's voice is designed before her LoRA finetune begins — we wanted the voice and the text to share a register from the start. The voice-design prompts are voice-features-only (timbre, accent, pace); persona character lives only in the text and the direction wrappers.

Public teasers are on the persona detail pages. preload="none" everywhere; autoplay is the only audio behavior we will not ship.

Seven

The floor

The twelve-hour safe-word lockout is not implemented in the persona model. It is implemented in the layer below — request routing — where the persona model cannot reach it. The lockout is a state on the subscriber's session that intercepts every persona route for twelve hours, with a fixed list of allowed routes (aftercare summary, contact support, view limits) and a fixed-deny list for everything else.

This means a persona cannot override the lockout even by accident. A persona cannot soften the lockout, cannot read the lockout, cannot end the lockout early. The lockout is enforced by a system the persona is not part of. The persona is told, at session resume, that there was a lockout; the persona is not told why.

This was the second-most-important architectural decision we made. The first was choosing the constraint that drove it.

Eight

What we have not solved

  • Long-horizon memory cost. The Cartographer reads back lines from session three in session seven; our current implementation stores per-subscriber summaries that grow linearly. We have not yet hit memory cost as a constraint but we will. Compression and selective retention are next.
  • Persona drift over months of patches. We have not yet released a major-version persona update (the v3 → v4 Vivienne is the closest we have come) and we are not certain the upgrade UX will be smooth.
  • Voice mode latency. Sub-100ms end-to-end is achievable on the inference path but the orchestration overhead pushes us to 140-180ms p50 in production. Voice mode GA waits on this.
  • Eval coverage for personas we have not shipped. The eval suite has 1,800 protocol evals across the twelve current personas. A thirteenth persona would need its own set; we do not yet have a workflow that makes this cheap.

Questions, corrections, or job inquiries: engineering@vibedungeon.ai.