MEVZU N°127ISTANBUL

MEVZU N° TAG / VOL. 059

#eval

0 blog · 0 news · 15 wiki

§03

Wiki

15
§01Glossary

ROUGE

A classic summarization metric based on n-gram and sequence overlap.

EN
ROUGE
TR
ROUGE
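To make the n-gram overlap idea concrete, here is a minimal sketch of ROUGE-N, assuming plain whitespace tokenization and lowercasing. The function names `ngrams` and `rouge_n` are illustrative, not taken from any particular library, and real implementations typically add stemming and support multiple references.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for one candidate/reference pair.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)

    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
    print(scores)
```

Recall is the figure most often quoted for summarization, since it asks how much of the reference the summary covers.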
§02Glossary

MBPP

A Google coding benchmark of 974 entry-level Python problems, each with test cases.

EN
MBPP
TR
MBPP
§03Glossary

HumanEval

An OpenAI coding benchmark that evaluates Python functions against unit tests.

EN
HumanEval
TR
HumanEval
§04Glossary

LMSYS Chatbot Arena

A public eval platform that ranks models by human preference, using votes on blind pairwise comparisons.

EN
LMSYS Chatbot Arena
TR
LMSYS Chatbot Arena
§05Glossary

Eval

A test suite that scores a model or system against predefined criteria.

EN
Eval
TR
Eval — Değerlendirme
§06Glossary

BLEU

A classic machine-translation metric based on n-gram overlap with reference translations.

EN
BLEU
TR
BLEU
§07Glossary

Benchmark

A standardized test set and evaluation protocol used to compare models.

EN
Benchmark
TR
Kıyaslama (Benchmark)
§08Glossary

Hallucination Rate

A metric that measures how often a model fabricates or generates incorrect information.

EN
Hallucination Rate
TR
Halüsinasyon Oranı
§09Glossary

MMLU

A broad multiple-choice benchmark that tests knowledge and reasoning across 57 subjects.

EN
MMLU
TR
MMLU
§10Glossary

GSM8K

A benchmark that measures step-by-step reasoning with grade-school math problems.

EN
GSM8K
TR
GSM8K
§11Glossary

LLM-as-Judge

An evaluation method in which an LLM is used to judge another model's output.

EN
LLM-as-Judge
TR
Yargıç Olarak LLM
§12Glossary

Pairwise Comparison

An eval method that asks which of two models' answers to the same prompt is better.

EN
Pairwise Comparison
TR
İkili Karşılaştırma
§13Glossary

Red Teaming

The practice of probing an AI system's limits and weaknesses with adversarial methods.

EN
Red Teaming
TR
Red Teaming
§14Glossary

Elo Rating

A rating system from chess that derives relative skill scores from pairwise match outcomes.

EN
Elo Rating
TR
Elo Reytingi
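As a rough illustration of how pairwise outcomes become scores, the sketch below applies the standard Elo update to a single match. The K-factor of 32 and the starting ratings are assumptions chosen for the example, and the function names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo logistic model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is the K-factor, an assumed value that controls how fast ratings move.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models start equal; model A wins one blind comparison.
    print(update_elo(1000.0, 1000.0, score_a=1.0))
```

An upset against a higher-rated opponent moves both ratings more than an expected win, which is why a small number of comparisons can still separate models.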
§15Glossary

Evaluation Loop

A feedback loop that continuously measures and refines an agent's output.

EN
Evaluation Loop
TR
Değerlendirme Döngüsü