Cold Start

A cold start is the noticeably slow first response that occurs when a model or service has been idle and must initialise on demand. In serverless platforms the delay comes from starting a new container or function instance; in LLM serving it typically comes from loading model weights from disk or remote storage into GPU memory, which can take anywhere from a few seconds to tens of seconds, and sometimes longer for very large models. Modern serving stacks mitigate this with warm pools (instances kept running and ready), model caching, preloading and traffic shaping. For large, rarely used models, cold starts remain a major UX obstacle. The cold-versus-warm distinction is central to generative-AI cost-saving strategies: keeping a model warm means paying for idle GPU capacity, while scaling to zero saves that cost at the price of a cold start on the next request.
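
The difference between a cold and a warm request can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: `ModelCache`, `fake_loader` and `timed_get` are not part of any real serving library, and the 0.2-second sleep merely stands in for copying weights into GPU memory. The sketch shows only caching and preloading, not a full warm pool or traffic shaping.

```python
import time
from typing import Callable, Dict


class ModelCache:
    """Hypothetical in-memory cache: only the first request for a model pays the load cost."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader
        self._loaded: Dict[str, object] = {}
        self.load_count = 0  # number of cold loads performed so far

    def is_warm(self, name: str) -> bool:
        """Return True if the model is already resident in the cache."""
        return name in self._loaded

    def get(self, name: str) -> object:
        """Return the model, loading it first if it is cold."""
        if name not in self._loaded:
            # Cold path: pay the full initialisation cost once.
            self._loaded[name] = self._loader(name)
            self.load_count += 1
        # Warm path: the model is already in memory.
        return self._loaded[name]

    def preload(self, name: str) -> None:
        """Warm a model ahead of traffic so no user request hits the cold path."""
        self.get(name)


def fake_loader(name: str) -> str:
    """Stand-in for weight loading; the sleep is an assumed, illustrative delay."""
    time.sleep(0.2)
    return f"weights:{name}"


def timed_get(cache: ModelCache, name: str) -> float:
    """Return the seconds taken by a single cache.get() call."""
    start = time.perf_counter()
    cache.get(name)
    return time.perf_counter() - start


if __name__ == "__main__":
    cache = ModelCache(fake_loader)
    print(f"cold request: {timed_get(cache, 'model-a'):.3f} s")
    print(f"warm request: {timed_get(cache, 'model-a'):.3f} s")

    cache.preload("model-b")  # done before any traffic arrives
    print(f"preloaded request: {timed_get(cache, 'model-b'):.3f} s")
```

In this sketch, preloading moves the load cost from the first user request to start-up time. The cost itself does not disappear: a preloaded model occupies memory whether or not traffic arrives, which is the trade-off behind the cold-versus-warm distinction.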