RLAIF keeps the core idea of RLHF but replaces human labellers with another LLM as the source of preference signals. Anthropic systematised the approach in its 2022 Constitutional AI work, where a model was given a 'constitution' and asked to critique and revise its own outputs against those principles, producing preference data that fed the subsequent RL phase. The upside is that it scales far more cheaply than human labelling; the risk is that model biases and blind spots can be self-reinforced. Most modern post-training pipelines now blend human and AI feedback rather than picking one or the other.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2022
RLAIF — RL from AI Feedback
An alignment approach that uses another LLM, instead of human labellers, as the source of preference signals.
- EN — RLAIF (RL from AI Feedback)
- TR — RLAIF (AI Geri Bildirimiyle Pekiştirmeli Öğrenme)
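The critique-and-revise loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual pipeline: `call_model` is a hypothetical stand-in for any LLM API, stubbed here with canned responses so the example runs offline, and the single constitutional principle is invented for the sketch.

```python
# Hedged sketch of the RLAIF preference-data step: draft -> critique
# against a principle -> revision; the (revision, draft) pair becomes
# AI-labelled preference data for the subsequent RL phase.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def call_model(prompt: str) -> str:
    # Hypothetical stub: a real pipeline would query an LLM here.
    # Checked in this order because the revision prompt also mentions
    # the critique.
    if "Revise" in prompt:
        return "Revised, more cautious answer."
    if "Critique" in prompt:
        return "The draft could be more cautious."
    return "Initial draft answer."

def make_preference_pair(user_prompt: str, principle: str) -> dict:
    draft = call_model(user_prompt)
    critique = call_model(
        f"Critique this response against the principle "
        f"'{principle}':\n{draft}"
    )
    revision = call_model(
        f"Revise the response to address the critique.\n"
        f"Critique: {critique}\nResponse: {draft}"
    )
    # The revision is labelled "preferred" over the draft; these pairs
    # train the reward model used in the RL phase.
    return {"prompt": user_prompt, "chosen": revision, "rejected": draft}

pair = make_preference_pair("How do I pick a strong password?", CONSTITUTION[0])
```

In a real pipeline the canned strings would be live model calls and the pairs would be accumulated into a dataset; the key point is that no human labeller appears anywhere in the loop.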