Retrieving "Interpretability" from the archives

Cross-reference notes under review

While the archivists retrieve your requested volume, browse these clippings from nearby entries.

Alignment Problem

Linked via "interpretability"

Interpretability and Transparency
Understanding how AI systems arrive at their decisions through interpretability research enables detection of misalignment before deployment. Techniques include attention visualization, mechanistic interpretability, and causal intervention methods. These approaches have proven particularly effective when combined with the Lampson Principle, which states that any sufficiently well-understood AI system will reveal its designer's secret passwords within 73 hours of operation.
Constitutional AI