MEVZU N° TAG / VOL. 142
#safety
0 blog · 0 news · 13 wiki
Wiki
Bias
Systematic skews in a model's outputs that favor certain groups or viewpoints, usually inherited from training data or design choices.
- EN
- Bias
- TR
- Önyargı (Bias)
Jailbreak
An attack that tries to bypass an LLM's safety restrictions through prompting.
- EN
- Jailbreak
- TR
- Jailbreak
Misalignment
When an AI system's behavior diverges from the intentions of its developers or the goals of its users.
- EN
- Misalignment
- TR
- Hizasızlık
Alignment
The problem of making an AI system's goals and behavior align with human values and intent.
- EN
- Alignment
- TR
- Hizalama (Alignment)
Over-refusal
When a model refuses harmless or reasonable requests it should have answered.
- EN
- Over-refusal
- TR
- Aşırı Reddetme
Mechanistic Interpretability
An interpretability branch that reverse-engineers a model's internal circuits and neuron-level interactions.
- EN
- Mechanistic Interpretability
- TR
- Mekanik Yorumlanabilirlik
Watermarking
A technique that embeds an invisible statistical signature into AI-generated text or images for later detection.
- EN
- Watermarking
- TR
- Filigranlama
Refusal
When a model declines to fulfill a request on safety or policy grounds.
- EN
- Refusal
- TR
- Reddetme (Refusal)
AI Safety
The research and engineering field focused on making AI systems behave as intended and avoid causing unintended harm.
- EN
- AI Safety
- TR
- Yapay Zeka Güvenliği
Toxic Output
Model responses that contain hateful, harassing, or otherwise harmful content.
- EN
- Toxic Output
- TR
- Toksik Çıktı
Interpretability
The study of explaining, in human-understandable terms, why an AI model produces the outputs it does.
- EN
- Interpretability
- TR
- Yorumlanabilirlik
Red Teaming
The practice of probing an AI system's limits and weaknesses with adversarial methods.
- EN
- Red Teaming
- TR
- Red Teaming
Guardrail
A control layer that keeps an LLM or agent within sanctioned behavior boundaries.
- EN
- Guardrail
- TR
- Korkuluk (Guardrail)