To understand self-attention, imagine reading a complex legal document. When you encounter the word 'bank', your brain instantly scans the surrounding words to determine if it refers to a financial institution or the side of a river. Self-attention does exactly this, mathematically.
For every word (or token), the model creates three vectors: a Query, a Key, and a Value. The Query of one word is multiplied by the Keys of all other words. The resulting scores determine how much 'attention' the word should pay to the others.
This process happens across multiple 'attention heads' simultaneously, allowing the model to focus on different aspects of the text — grammar, semantics, context — all at once.