MEVZU N° TAG / VOL. 059
#eval
0 blog · 0 news · 15 wiki
Wiki
ROUGE
A classic summarization metric based on n-gram and sequence overlap.
- EN
- ROUGE
- TR
- ROUGE
MBPP
A Google coding benchmark of nearly 1,000 basic Python problems.
- EN
- MBPP
- TR
- MBPP
HumanEval
An OpenAI coding benchmark that evaluates Python functions against unit tests.
- EN
- HumanEval
- TR
- HumanEval
Lmsys Chatbot Arena
A public eval platform that ranks blind pairs of models by human preference.
- EN
- Lmsys Chatbot Arena
- TR
- Lmsys Chatbot Arena
Eval
A test suite that scores a model or system against predefined criteria.
- EN
- Eval
- TR
- Eval — Değerlendirme
BLEU
A classic machine-translation metric based on n-gram overlap with reference translations.
- EN
- BLEU
- TR
- BLEU
Benchmark
A standardized test set and evaluation protocol used to compare models.
- EN
- Benchmark
- TR
- Kıyaslama (Benchmark)
Hallucination Rate
A metric that measures how often a model fabricates or generates incorrect information.
- EN
- Hallucination Rate
- TR
- Halüsinasyon Oranı
MMLU
A broad multiple-choice benchmark that tests knowledge and reasoning across 57 subjects.
- EN
- MMLU
- TR
- MMLU
GSM8K
A benchmark that measures step-by-step reasoning with grade-school math problems.
- EN
- GSM8K
- TR
- GSM8K
LLM-as-Judge
An evaluation method in which an LLM is used to judge another model's output.
- EN
- LLM-as-Judge
- TR
- Yargıç Olarak LLM
Pairwise Comparison
An eval method that asks which of two models' answers to the same prompt is better.
- EN
- Pairwise Comparison
- TR
- İkili Karşılaştırma
Red Teaming
The practice of probing an AI system's limits and weaknesses with adversarial methods.
- EN
- Red Teaming
- TR
- Red Teaming
Elo Rating
A rating system from chess that derives relative skill scores from pairwise match outcomes.
- EN
- Elo Rating
- TR
- Elo Reytingi
Evaluation Loop
A feedback loop that continuously measures and refines an agent's output.
- EN
- Evaluation Loop
- TR
- Değerlendirme Döngüsü