Quantization is the practice of representing a model's weights and/or activations with lower-precision number types (8-bit, 4-bit, sometimes even 2-bit) instead of the original 16-bit or 32-bit floats. The goal is twofold: fit models into less GPU memory, and exploit the higher throughput modern hardware offers at lower bit-widths. Thanks to formats like GGUF and tools like llama.cpp, quantized models now run comfortably on laptops; the broader Ollama-style local-LLM ecosystem rests on this. Done carefully, the quality loss is negligible on most tasks; pushed too aggressively, quantization visibly degrades reasoning and long-context performance.
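To make the arithmetic concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in NumPy. The function names are illustrative, not from any particular library, and real formats such as GGUF's block-wise types use finer-grained scales (per block or per channel) rather than one scale for the whole tensor.

```python
# Minimal sketch: symmetric per-tensor int8 quantization.
# Names (quantize_int8, dequantize) are illustrative only.
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 plus a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0                # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")  # 4x smaller
print(f"mean abs error: {np.abs(w - w_hat).mean():.2e}")                    # small vs. weight scale
```

The memory saving (4x for int8, 8x for 4-bit schemes) is exact; the quality cost depends on how well the rounding error stays small relative to the weights, which is why practical formats carry many local scales instead of one global one.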
Glossary · Intermediate · 2020
Quantization
Representing model weights with lower-precision numbers to save memory and gain speed.
- EN — Quantization
- TR — Niceleme