Self-attention lets each element of a sequence learn how strongly it relates to every other element in the same sequence; this is how a model works out whether 'it' in 'the cat sat on the mat, then it left' refers to the cat or the mat. The mechanism, brought into the mainstream by Vaswani et al.'s 2017 Transformer paper, derives query, key, and value vectors for each token, scores every query against every key, and uses the softmax-normalized scores to take a weighted sum of the values. Because every token attends to every other token, the cost grows quadratically with sequence length, which is the main reason long-context models are so expensive to run. Self-attention is, in effect, the core computation that defines how today's language models perceive the world.
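As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The function name self_attention, the projection matrices Wq/Wk/Wv, and the toy dimensions are illustrative assumptions rather than any particular library's API; the (n, n) score matrix in the middle is where the quadratic cost comes from.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model).

    Wq, Wk, Wv project each token into its query, key, and value vectors.
    The (n, n) score matrix is what makes the cost quadratic in sequence length.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # each token gets q, k, v: (n, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every token scored against every token: (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: attention weights
    return weights @ V                           # weighted sum of value vectors: (n, d_k)

# Toy usage: 5 tokens, model width 8, head width 4 (all sizes illustrative).
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4): one context-mixed vector per token
```

Real implementations add multiple heads, masking, and batching, but the n-by-n score matrix that drives the quadratic cost is already visible in this sketch.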