MEVZU N°128 · ISTANBUL

MEVZU N° TAG / VOL. 142

#safety

0 blog · 0 news · 13 wiki

§03 Wiki · 13
§01 Glossary

Bias

Systematic skews in a model's outputs that favor certain groups or viewpoints, usually inherited from training data or design choices.

EN
Bias
TR
Önyargı (Bias)
§02 Glossary

Jailbreak

An attack that tries to bypass an LLM's safety restrictions through prompting.

EN
Jailbreak
TR
Jailbreak
§03 Glossary

Misalignment

When an AI system's behavior diverges from the intentions of its developers or the goals of its users.

EN
Misalignment
TR
Hizasızlık
§04 Glossary

Alignment

The problem of making an AI system's goals and behavior align with human values and intent.

EN
Alignment
TR
Hizalama (Alignment)
§05 Glossary

Over-refusal

When a model refuses harmless or reasonable requests it should have answered.

EN
Over-refusal
TR
Aşırı Reddetme
§06 Glossary

Mechanistic Interpretability

An interpretability branch that reverse-engineers a model's internal circuits and neuron-level interactions.

EN
Mechanistic Interpretability
TR
Mekanistik Yorumlanabilirlik
§07 Glossary

Watermarking

A technique that embeds an invisible statistical signature into AI-generated text or images for later detection.

EN
Watermarking
TR
Filigranlama
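The statistical signature mentioned above can be illustrated with a toy "green list" detector: a hash of the preceding token deterministically marks part of the vocabulary as "green," and watermarked text over-represents green tokens. Everything here (function names, the toy vocabulary, the 0.5 split) is an illustrative assumption, not any specific library's scheme:

```python
import hashlib
import math

# Illustrative sketch of green-list watermark *detection*.
# Names, vocabulary, and parameters are assumptions for this example.

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step


def is_green(prev_token: str, token: str) -> bool:
    """Deterministically decide whether `token` falls on the green list
    seeded by the previous token (a stand-in for hashing the context)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION


def green_rate(tokens: list[str]) -> float:
    """Observed fraction of tokens landing on their context's green list."""
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)


def z_score(rate: float, n: int, p: float = GREEN_FRACTION) -> float:
    """Standard deviations by which the observed green rate exceeds the
    expectation for unwatermarked text; large positive values suggest
    the text carries the watermark."""
    return (rate - p) * math.sqrt(n) / math.sqrt(p * (1 - p))
```

A generator embedding the watermark would bias sampling toward each step's green list; the detector then only needs the hash scheme, not the model.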
§08 Glossary

Refusal

When a model declines to fulfill a request on safety or policy grounds.

EN
Refusal
TR
Reddetme (Refusal)
§09 Glossary

AI Safety

The research and engineering field focused on making AI systems behave as intended and avoid causing unintended harm.

EN
AI Safety
TR
Yapay Zeka Güvenliği
§10 Glossary

Toxic Output

Model responses that contain hateful, harassing, or otherwise harmful content.

EN
Toxic Output
TR
Toksik Çıktı
§11 Glossary

Interpretability

The study of explaining, in human-understandable terms, why an AI model produces the outputs it does.

EN
Interpretability
TR
Yorumlanabilirlik
§12 Glossary

Red Teaming

The practice of probing an AI system's limits and weaknesses with adversarial methods.

EN
Red Teaming
TR
Red Teaming
§13 Glossary

Guardrail

A control layer that keeps an LLM or agent within sanctioned behavior boundaries.

EN
Guardrail
TR
Korkuluk (Guardrail)
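A control layer of this kind can be sketched, in deliberately minimal form, as checks wrapped around a model's input and output. The blocklist, function names, and length policy below are illustrative assumptions; production guardrails typically rely on trained classifiers rather than keyword patterns:

```python
import re

# Minimal guardrail sketch: screen the prompt before the model sees it,
# and bound the response afterwards. Patterns and limits are illustrative.

BLOCKED_PATTERNS = [
    re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),  # example policy rule
]


def guard_input(prompt: str) -> tuple[bool, str]:
    """Return (allowed, text). Disallowed prompts never reach the model."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(prompt):
            return False, "Request blocked by input guardrail."
    return True, prompt


def guard_output(response: str, max_len: int = 2000) -> str:
    """Keep the model's answer inside sanctioned bounds (length, here)."""
    return response[:max_len]
```

In an agent setting the same pattern extends to tool calls: each proposed action passes through a policy check before it is executed.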