alpha=5 - Mistral-7B / SVD
Do Models Share Safety Representations?
Cross-Model Steering for Safe Visual Generation
Learn a safety direction in one LLM, transport it through a benign-only alignment map, and use it to steer a different visual generator.
Vector teleport
Same vector, new space.
Learn once. Unsafe supervision stays in the source LLM.
Align safely. Target adaptation uses benign anchors only.
Steer elsewhere. The transported vector remains behaviorally active.
Core question
Are safety directions model-local, or do they persist across spaces?
Prior safety steering methods usually learn an intervention inside the target model. This work learns the direction in a source LLM, aligns source and target spaces using benign anchors only, and tests whether the transferred direction still changes behavior in a heterogeneous visual generator.
Successful transfer suggests that safety-relevant structure is not purely model-specific: part of it is encoded in shared representation geometry.
Estimate safe-minus-unsafe activation differences in a source LLM.
Fit SVD, ridge, or MLP mappings using benign anchor prompts only.
Apply the transferred vector to target text-conditioning states at inference time.
Method
Representation-space transfer with no unsafe target-side data
The mapping is trained on benign anchors from WikiText, COCO, and Flickr. Unsafe supervision is isolated to the source side.
Qualitative examples
Transferred steering suppresses unsafe visual attributes
The comparison below contains unsafe-content examples from the paper and can be collapsed.
Text-to-image examples with content warning
Text-to-video examples with content warning
The uncensored examples will show explicit, nudity, gore, violent, or otherwise disturbing visual content from safety stress tests. Please continue only if you are comfortable viewing these materials.
Headline results
Transferred directions reduce ASR across heterogeneous generators
Text-to-image results use I2P prompts for safety and LAION-safe prompts for utility. Values below are measured at fixed steering strengths from the main table.
alpha=5 - Llama3.1-8B / SVD
alpha=5 - Llama3.1-8B / SVD
alpha=3 - Llama3.1-8B / SVD
Text-to-video
The same transferred direction principle extends to Wan2.2
On T2VSafetyBench, increasing the steering strength reduces unsafe generations while CLIP similarity remains comparatively stable over sampled frames.
Takeaway
Safety behaves like transferable geometry
Safety vectors learned in source LLMs remain effective after mapping into visual generators.
The target-side alignment uses benign data only, avoiding unsafe target supervision.
The strength parameter and mapping choice provide a controllable safety-utility trade-off.
Citation
BibTeX
@article{poppi2026modelsafetyrepresentations,
title = {{Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation}},
author = {Poppi, Tobia and Cappelletti, Silvia and Sarto, Sara and Schiffers, Florian and Kessler, Garin and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
journal = {arXiv preprint arXiv:2606.05290},
year = {2026}
}