With embedding similarity you train with an anchor, a positive, and a negative. You want to move the positive’s embedding closer to the anchor’s, while pushing the negative’s farther apart.
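In code, the push/pull looks something like this (a minimal numpy sketch using a hinge-style triplet loss; word2vec itself uses a logistic loss over dot products, but the intuition is the same):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style loss: the positive should sit closer to the
    anchor than the negative, by at least `margin`."""
    pos_dist = np.linalg.norm(anchor - positive)
    neg_dist = np.linalg.norm(anchor - negative)
    # Loss hits zero once the positive is sufficiently closer than the negative
    return max(0.0, pos_dist - neg_dist + margin)
```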

Enter good ole word2vec

  • Every word in the vocabulary starts with its own random embedding
  • When a word co-occurs with another word, it’s a positive (training moves them together)
  • A random word, sampled out of context, is a negative (training pushes them apart)

From just the context, “mary had a little lamb”, we might have:

ANCHOR   POSITIVE   NEGATIVE
mary     little     toenail
mary     lamb       banana
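Here’s a toy sketch of how those triplets fall out of a sentence (the window size and the tiny negative “vocabulary” are made up for illustration; real word2vec samples negatives from the full vocabulary, weighted by a smoothed unigram distribution):

```python
import random

sentence = "mary had a little lamb".split()
fake_negatives = ["toenail", "banana"]  # stand-in; real word2vec samples the whole vocab

def triplets(tokens, window=4):
    """Yield (anchor, positive, negative) for each in-window pair."""
    for i, anchor in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            positive = tokens[j]                      # co-occurs with the anchor
            negative = random.choice(fake_negatives)  # sampled out of context
            yield anchor, positive, negative

for t in triplets(sentence):
    print(t)  # e.g. ('mary', 'little', 'toenail'), ('mary', 'lamb', 'banana'), ...
```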

Over many passages, you might imagine each of these pairs growing more similar:

  • mary + lamb
  • mary + church
  • bloody + mary
  • mary + poppins

Importantly, these embeddings just know they shared context, appearing within a few words of each other. They do not act as language models:

  • Language models use the entire document as context; here, context is binary, in or out (a word either co-occurs within a few tokens, or it doesn’t count at all; see the sketch after this list)
  • Language models use a transformer architecture that weighs long-range relationships between this token and other, distant tokens
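To make the first point concrete, here is word2vec’s entire notion of context as a sketch (the window size is illustrative):

```python
def co_occurs(i, j, window=5):
    """word2vec context is binary: positions inside the window count,
    everything else might as well be a different document."""
    return i != j and abs(i - j) <= window

# 'mary' at position 10, 'poppins' at position 200 of the same article:
print(co_occurs(10, 12))   # True  -- becomes a training pair
print(co_occurs(10, 200))  # False -- invisible to word2vec; a transformer's
                           # attention could still weigh this relationship
```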

Is the article’s topic Disney? A language model knows the next token after mary is more likely to be poppins. But word2vec just as easily chooses nursery rhyme, church, and other “mary” themes.

-Doug

PS - 7 days left to sign up for Cheat at Search with Agents!



Enjoy softwaredoug in training course form!

Starting May 18!

Sign up here - http://maven.com/softwaredoug/cheat-at-search

I hope you join me at Cheat at Search with Agents to learn to use agents in search, build better RAG, and use LLMs in query understanding.

Doug Turnbull
