MEVZU N°128 · ISTANBUL

MEVZU N° TAG / VOL. 142

#safety

0 blog · 0 news · 13 wiki

§03 Wiki · 13
§01 Glossary

Bias

Systematic skews in a model's outputs that favor certain groups or viewpoints, usually inherited from training data or design choices.

EN
Bias
TR
Önyargı (Bias)
§02 Glossary

Jailbreak

An attack that tries to bypass an LLM's safety restrictions through prompting.

EN
Jailbreak
TR
Jailbreak
§03 Glossary

Misalignment

When an AI system's behavior diverges from the intentions of its developers or the goals of its users.

EN
Misalignment
TR
Hizasızlık
§04 Glossary

Alignment

The problem of making an AI system's goals and behavior align with human values and intent.

EN
Alignment
TR
Hizalama (Alignment)
§05 Glossary

Over-refusal

When a model refuses harmless or reasonable requests it should have answered.

EN
Over-refusal
TR
Aşırı Reddetme
§06 Glossary

Mechanistic Interpretability

An interpretability branch that reverse-engineers a model's internal circuits and neuron-level interactions.

EN
Mechanistic Interpretability
TR
Mekanistik Yorumlanabilirlik
§07 Glossary

Watermarking

A technique that embeds an invisible statistical signature into AI-generated text or images for later detection.

EN
Watermarking
TR
Filigranlama
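The statistical signature mentioned above can be illustrated with a toy "green list" detector: a hash of the preceding token deterministically marks part of the vocabulary as "green," and watermarked text over-represents green tokens. Everything here (function names, the toy vocabulary, the 0.5 split) is an illustrative assumption, not any specific library's scheme:

```python
import hashlib
import math

# Illustrative sketch of green-list watermark *detection*.
# Names, vocabulary, and parameters are assumptions for this example.

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step


def is_green(prev_token: str, token: str) -> bool:
    """Deterministically decide whether `token` falls on the green list
    seeded by the previous token (a stand-in for hashing the context)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION


def green_rate(tokens: list[str]) -> float:
    """Observed fraction of tokens landing on their context's green list."""
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)


def z_score(rate: float, n: int, p: float = GREEN_FRACTION) -> float:
    """Standard deviations by which the observed green rate exceeds the
    expectation for unwatermarked text; large positive values suggest
    the text carries the watermark."""
    return (rate - p) * math.sqrt(n) / math.sqrt(p * (1 - p))
```

A generator embedding the watermark would bias sampling toward each step's green list; the detector then only needs the hash scheme, not the model.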
§08 Glossary

Refusal

When a model declines to fulfill a request on safety or policy grounds.

EN
Refusal
TR
Reddetme (Refusal)
§09 Glossary

AI Safety

The research and engineering field focused on making AI systems behave as intended and avoid causing unintended harm.

EN
AI Safety
TR
Yapay Zeka Güvenliği
§10 Glossary

Toxic Output

Model responses that contain hateful, harassing, or otherwise harmful content.

EN
Toxic Output
TR
Toksik Çıktı
§11 Glossary

Interpretability

The study of explaining, in human-understandable terms, why an AI model produces the outputs it does.

EN
Interpretability
TR
Yorumlanabilirlik
§12 Glossary

Red Teaming

The practice of probing an AI system's limits and weaknesses with adversarial methods.

EN
Red Teaming
TR
Red Teaming
§13 Glossary

Guardrail

A control layer that keeps an LLM or agent within sanctioned behavior boundaries.

EN
Guardrail
TR
Korkuluk (Guardrail)
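A control layer of this kind can be sketched, in deliberately minimal form, as checks wrapped around a model's input and output. The blocklist, function names, and length policy below are illustrative assumptions; production guardrails typically rely on trained classifiers rather than keyword patterns:

```python
import re

# Minimal guardrail sketch: screen the prompt before the model sees it,
# and bound the response afterwards. Patterns and limits are illustrative.

BLOCKED_PATTERNS = [
    re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),  # example policy rule
]


def guard_input(prompt: str) -> tuple[bool, str]:
    """Return (allowed, text). Disallowed prompts never reach the model."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(prompt):
            return False, "Request blocked by input guardrail."
    return True, prompt


def guard_output(response: str, max_len: int = 2000) -> str:
    """Keep the model's answer inside sanctioned bounds (length, here)."""
    return response[:max_len]
```

In an agent setting the same pattern extends to tool calls: each proposed action passes through a policy check before it is executed.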