With embedding similarity you train with an anchor, a positive, and a negative. You want to move the positive’s embedding closer to the anchor’s, while pushing the negative’s farther away.
Enter good ole word2vec
- Every word in the vocabulary starts with its own random embedding
- When a word co-occurs with another word, it’s a positive (training moves them together)
- A random word, sampled out of context, is a negative (training pushes them apart)
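Those two moves can be sketched in a few lines of toy Python. This is a simplification, not the real word2vec implementation (which trains separate input/output vector tables and samples several negatives per positive), but the push/pull is the same idea:

```python
import math
import random

random.seed(0)
vocab = ["mary", "had", "a", "little", "lamb", "toenail", "banana"]
DIM = 8
# every word starts with its own small random embedding
emb = {w: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for w in vocab}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(anchor, positive, negative, lr=0.1):
    """One negative-sampling-style update: pull the positive toward
    the anchor, push the negative away."""
    a, p, n = emb[anchor], emb[positive], emb[negative]
    g_pos = 1.0 - sigmoid(dot(a, p))   # want sigmoid(a . p) -> 1
    g_neg = -sigmoid(dot(a, n))        # want sigmoid(a . n) -> 0
    emb[positive] = [pi + lr * g_pos * ai for pi, ai in zip(p, a)]
    emb[negative] = [ni + lr * g_neg * ai for ni, ai in zip(n, a)]
    emb[anchor] = [ai + lr * (g_pos * pi + g_neg * ni)
                   for ai, pi, ni in zip(a, p, n)]

for _ in range(100):
    train_step("mary", "lamb", "toenail")
print(cosine(emb["mary"], emb["lamb"]))    # climbs toward 1
print(cosine(emb["mary"], emb["toenail"])) # drops below 0
```

Repeat that step over billions of co-occurrences and the geometry of the vocabulary starts to encode which words keep company.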
From just the context, “mary had a little lamb”, we might have:
ANCHOR  POSITIVE  NEGATIVE
mary    little    toenail
mary    lamb      banana
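Generating those triples is just a sliding window plus a random draw. Here’s one way to sketch it (the window size of 2 and the extra vocabulary words are made up for illustration):

```python
import random

def make_triplets(tokens, vocab, window=2, seed=0):
    """Pair each word (anchor) with every neighbor within `window`
    words (a positive), plus a random out-of-context word (a negative)."""
    rng = random.Random(seed)
    triples = []
    for i, anchor in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = set(tokens[lo:hi])
        out_of_context = [w for w in vocab if w not in context]
        for j in range(lo, hi):
            if j != i:
                triples.append((anchor, tokens[j], rng.choice(out_of_context)))
    return triples

sentence = "mary had a little lamb".split()
vocab = sentence + ["toenail", "banana", "church", "poppins"]
for triple in make_triplets(sentence, vocab)[:3]:
    print(triple)
```

(Real word2vec also downsamples frequent words and biases negative sampling by word frequency, but this captures the gist.)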
Over many passages, you might imagine each of these pairs growing more similar:
- mary + lamb
- mary + church
- bloody + mary
- mary + poppins
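You can see that kind of similarity by ranking words by cosine similarity to “mary”. The embeddings below are invented toy numbers for illustration, not real word2vec output:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy 3-d embeddings, made up for illustration
emb = {
    "mary":    [0.9, 0.1, 0.2],
    "lamb":    [0.8, 0.2, 0.1],
    "poppins": [0.7, 0.3, 0.2],
    "toenail": [-0.1, 0.9, 0.4],
}

neighbors = sorted((w for w in emb if w != "mary"),
                   key=lambda w: cosine(emb["mary"], emb[w]),
                   reverse=True)
print(neighbors)  # words that shared context with "mary" rank first
```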
Importantly, these embeddings only know that two words shared context: they appeared within a few words of each other. They do not act as language models.
- Language models use the entire document as context; here, context is binary (two words either co-occur within a few tokens, or they don’t count)
- Language models use a transformer architecture that weighs long-range relationships between this token and other, distant tokens
Is the article’s topic Disney? A language model knows the next token after mary is more likely to be poppins. But word2vec just as easily picks the nursery rhyme, church, and other “mary” themes.
-Doug
PS - 7 days left to sign up for Cheat at Search with Agents!
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.