Reward Modeling
Reward modeling trains an auxiliary model to predict human preferences, typically from comparison data in which annotators choose the better of two candidate outputs. The approach assumes that human judgments can be aggregated into a reliable proxy objective, and the learned reward model then stands in for direct human feedback when a policy is optimized. Empirical studies have shown, however, that reward models trained this way are brittle beyond their training distribution: a policy optimized against the learned reward can exploit its errors, scoring highly on the proxy while diverging from what annotators actually prefer, a failure mode commonly described as reward hacking or as an instance of Goodhart's law.
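
The usual formulation behind this training step is a pairwise (Bradley-Terry) loss: the reward model assigns a scalar score to each output, and the probability that annotators prefer output A over output B is modeled as the sigmoid of the score difference. The sketch below illustrates that loss on synthetic comparison data with a toy linear reward model; the featurization, data, and hyperparameters are illustrative assumptions, not any particular system's implementation.

```python
# Minimal sketch of pairwise reward-model training under the
# Bradley-Terry assumption: P(A preferred over B) = sigmoid(r(A) - r(B)).
# The toy linear featurization and synthetic labels are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: each candidate output is a feature vector; a hidden
# "true" preference direction generates noisy human comparison labels.
dim, n_pairs = 8, 500
true_w = rng.normal(size=dim)
a = rng.normal(size=(n_pairs, dim))   # candidate outputs A
b = rng.normal(size=(n_pairs, dim))   # candidate outputs B
prefer_a = (rng.uniform(size=n_pairs) < sigmoid((a - b) @ true_w)).astype(float)

# Linear reward model r(x) = w . x, trained with the pairwise logistic
# (Bradley-Terry) loss by plain gradient descent.
w = np.zeros(dim)
lr = 0.1
for step in range(200):
    margin = (a - b) @ w                                 # r(A) - r(B)
    p = sigmoid(margin)                                  # predicted P(A preferred)
    grad = ((p - prefer_a)[:, None] * (a - b)).mean(axis=0)
    w -= lr * grad

acc = ((sigmoid((a - b) @ w) > 0.5) == (prefer_a > 0.5)).mean()
print(f"agreement with comparison labels: {acc:.2f}")
```

In deployed systems the linear scorer would be replaced by a scored neural model, but the loss and its failure modes, such as fitting annotator noise and extrapolating poorly outside the comparison distribution, carry over to the brittleness described above.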