Quantization is the practice of representing a model's weights and/or activations with lower-precision number types (8-bit, 4-bit, sometimes even 2-bit) instead of the original 16-bit or 32-bit floats. The goal is twofold: fit models into less GPU memory, and exploit the higher throughput modern hardware offers at lower bit-widths. Thanks to formats like GGUF and tools like llama.cpp, quantized models now run comfortably on laptops; the broader Ollama-style local-LLM ecosystem rests on this. Done carefully, the quality loss is negligible on most tasks; pushed too aggressively, quantization visibly degrades reasoning and long-context performance.
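To make the arithmetic concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in NumPy. The function names are illustrative, not from any particular library, and real formats such as GGUF's block-wise types use finer-grained scales (per block or per channel) rather than one scale for the whole tensor.

```python
# Minimal sketch: symmetric per-tensor int8 quantization.
# Names (quantize_int8, dequantize) are illustrative only.
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 plus a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0                # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")  # 4x smaller
print(f"mean abs error: {np.abs(w - w_hat).mean():.2e}")                    # small vs. weight scale
```

The memory saving (4x for int8, 8x for 4-bit schemes) is exact; the quality cost depends on how well the rounding error stays small relative to the weights, which is why practical formats carry many local scales instead of one global one.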
Glossary · Intermediate · 2020
Quantization
Representing model weights with lower-precision numbers to save memory and gain speed.
- EN — Quantization
- TR — Niceleme