During my career as a so-called “search relevance” practitioner there has been a unifying constant: traditional search engines are weird.
As my job looks more and more like Machine Learning engineering, and less like some tamer of the arcane Solr/Elasticsearch thing, these backend systems feel even stranger in contrast to everything else.
They just don’t look like anything in the typical ML engineering toolchain. They’re not the numpy/pandas/tensorflow/… stack that dominates. Instead they come with a very specific, odd query DSL, like bq=sum(1,max(recip(1,timestamp,2,1e-11),1),query({!edismax qf='title' v=$q}))&q=foo
(clear as mud) or Elasticsearch - a bit better, though verbose - with
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "foo"}},
        ...
      ]
    }
  }
}
(slightly clearer than mud). Of course you know, dear reader, that “SHOULD” means “sum the scores of these subqueries”. You know that “match” means “this text will be tokenized, then some kind of similarity (e.g. TF-IDF) will be applied to the underlying term index”.
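To make that concrete, here is a toy sketch of those two ideas in plain Python. The names `match` and `bool_should` are made up for illustration, and the "similarity" is just a raw term count, not what Elasticsearch actually computes.

```python
def match(docs, query):
    """Toy 'match' clause: tokenize the query, score each doc by raw term counts."""
    terms = query.lower().split()
    scores = []
    for text in docs:
        tokens = text.lower().split()
        scores.append(float(sum(tokens.count(t) for t in terms)))
    return scores


def bool_should(*clauses):
    """Toy 'should': sum the per-document scores of every clause."""
    return [sum(doc_scores) for doc_scores in zip(*clauses)]


titles = ["foo fighters", "bar none", "foo foo bar"]
overviews = ["a band", "foo appears here", "nothing relevant"]

# One score per document: title matches plus overview matches
scores = bool_should(match(titles, "foo"), match(overviews, "foo"))
print(scores)  # [1.0, 1.0, 2.0]
```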
I know I’m being unreasonably harsh, maybe a bit silly. We find these tools useful because
- They work, at high scale, over terabytes and petabytes of data
- Many practitioners have become conversant in their Query DSLs
- They’re in production, now, serving search results, and we have to work with them
But there are still reasons why, for the newest retrieval tasks, vector search feels front of mind to ML practitioners. Only after experimentation does it become apparent, begrudgingly, that the lexical, text-matching features of a traditional search engine still have a role.
Maybe if we rebuilt lexical search for the ML person, it’d be easier to use? Or, at a minimum, easier to experiment with in a way that translates to the arcane syntax above?
SearchArray - making lexical search less weird
What I’ve always wanted is to just think of search indices as part of a flat pandas dataframe.
I want to be able to create an index, like
df['title_index'] = MySpecialThingy.index(df['title'])
Under the hood, this is just a lexical tokenized term index, that can do term and phrase matching.
And then search:
df['score'] = df['title_index'].bm25('cats pajamas')
And maybe use Pandas vectorized magic to try out what happens when you rank by, say, recency:
df['score'] *= 1 / (now - df['timestamp'])
Then sort the dataframe on that score:
df.sort_values('score', ascending=False)
Why doesn’t search work this way? Why can’t I do this instead of learning how Solr function queries or Painless scripting work? I know Pandas, it’s pretty obvious how to do basic math with it!
This is what SearchArray does. It can supercharge any dataframe into a BM25-powered term/phrase index. Under the hood it’s a Pandas extension array backed by a traditional inverted index. Its tokenizers are just python functions that turn strings into lists of tokens. Its stemmers are just… boring python packages.
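To give a feel for what "a traditional inverted index" with plain-function tokenizers means, here is a minimal sketch in pure Python. It is not SearchArray's implementation: `ToyIndex` and `whitespace_tokenizer` are hypothetical names, and the scoring is one common BM25 variant with assumed defaults `k1=1.2` and `b=0.75`.

```python
import math
from collections import Counter, defaultdict


def whitespace_tokenizer(text):
    """A tokenizer is just a function from a string to a list of tokens."""
    return text.lower().split()


class ToyIndex:
    """Minimal inverted index: term -> {doc_id: term frequency}."""

    def __init__(self, docs, tokenizer=whitespace_tokenizer):
        self.tokenizer = tokenizer
        self.postings = defaultdict(dict)
        self.doc_lens = []
        for doc_id, text in enumerate(docs):
            tokens = tokenizer(text)
            self.doc_lens.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_lens) / len(self.doc_lens)

    def bm25(self, term, k1=1.2, b=0.75):
        """Return one BM25 score per document for a single term."""
        scores = [0.0] * len(self.doc_lens)
        posting = self.postings.get(term, {})
        n, df = len(self.doc_lens), len(posting)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for doc_id, tf in posting.items():
            # Length normalization: longer-than-average docs are penalized
            norm = k1 * (1 - b + b * self.doc_lens[doc_id] / self.avg_len)
            scores[doc_id] = idf * tf / (tf + norm)
        return scores


index = ToyIndex(["cats pajamas", "cats and dogs and more cats", "dog days"])
scores = index.bm25("cats")
```

Only documents containing the term get a non-zero score, and the short first document outranks the longer second one even though the second mentions "cats" twice. Swapping in a stemming tokenizer would only mean passing a different function.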
Previously, to run a search relevance experiment, I’d have to stand up a bunch of systems. But now, with SearchArray, everything can just run in a single colab notebook.
Just check out this full end-to-end demo of a traditional search relevance experiment in colab.
In it, we
- Load some movie data
- Index it with different stemmers:
tmdb_data['title_snowball'] = PostingsArray.index(tmdb_data['title'], tokenizer=snowball_tokenizer)
- Create a search function to search every field snowball/whitespace tokenized:
import numpy as np

def search(keywords, tmdb_data, N=10):
    """Get top N based on the scoring below."""
    # Start from zero so scores accumulate across all terms and fields
    score = np.zeros(len(tmdb_data))
    for term in ws_punc_tokenizer(keywords):
        score += tmdb_data['title_ws'].array.bm25(term)
        score += tmdb_data['overview_ws'].array.bm25(term)
    for term in snowball_tokenizer(keywords):
        score += tmdb_data['title_snowball'].array.bm25(term)
        score += tmdb_data['overview_snowball'].array.bm25(term)
    ...
Finally, we evaluate this solution against a bunch of judged queries, and compute a mean NDCG.
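For readers who haven't computed it by hand, a mean NDCG over judged queries can be sketched roughly like this. This is an illustration, not the notebook's code: the function names, the linear-gain form of DCG, and the tiny judgment lists are all assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded judgments."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_doc_ids, judgments, k=10):
    """NDCG@k: DCG of the returned ranking divided by the ideal DCG."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(results_by_query, judgments_by_query, k=10):
    """Average NDCG@k over every judged query."""
    scores = [ndcg(results_by_query[q], judgments_by_query[q], k)
              for q in judgments_by_query]
    return sum(scores) / len(scores)


# Hypothetical judgments: query -> {doc_id: grade}, higher is more relevant
judgments_by_query = {"star wars": {11: 3, 12: 1}, "alien": {20: 2}}
# Hypothetical search results: query -> ranked doc ids
results_by_query = {"star wars": [11, 12, 99], "alien": [98, 20]}

overall = mean_ndcg(results_by_query, judgments_by_query)
```

Here the "star wars" ranking is ideal (NDCG of 1), while "alien" puts its only relevant doc in second place, so the mean lands somewhere in between. Unjudged docs are treated as grade 0, which is itself a choice worth revisiting on real data.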
In a past life, to do this, I would need to stand up a Solr or Elasticsearch instance to run this experiment. Now I can do the whole thing just in the dataframe that holds my data.
Because it’s a dataframe, it sits easily alongside other Python data I might bring to the problem: user data for personalization, geo data, external data sources that aren’t in my index yet but that I’m considering adding to my search system, and more.
Because SearchArray still speaks the language of lexical search, I can translate any ideas into my more production-scale system like Solr or Elasticsearch. But now I have a bit more offline evidence to prove the value of an approach first.
Currently the focus is on prototyping against relatively small test corpora of 1M-10M docs. However, it could also be used for reranking a top N with extensive features in the dataframe.
Currently we support phrase and term based searching. While bm25 is a convenience function, you can get the termfreqs / docfreqs yourself and craft whatever similarity you want with that data.
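As a rough idea of what crafting your own similarity could look like, here is a sketch that starts from raw statistics for a single term. How you pull those numbers out of SearchArray is not shown; `term_freqs`, `doc_freq`, and `num_docs` are plain stand-in values, and `sqrt_tf_idf` is a hypothetical name for a simple sqrt(tf) * idf weighting.

```python
import math


def sqrt_tf_idf(term_freqs, doc_freq, num_docs):
    """Score each doc for one term as sqrt(tf) * idf.

    term_freqs: per-document counts of the term (stand-in data)
    doc_freq:   number of documents containing the term
    num_docs:   total number of documents in the corpus
    """
    idf = 1.0 + math.log((num_docs + 1) / (doc_freq + 1))
    return [math.sqrt(tf) * idf for tf in term_freqs]


# Three docs; the term appears 0, 1, and 4 times, so it is in 2 of 3 docs
custom_scores = sqrt_tf_idf(term_freqs=[0, 1, 4], doc_freq=2, num_docs=3)
```

The square root dampens repeated terms: four occurrences score only twice as high as one. Any other formula over the same inputs would slot in the same way.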
I think if I had Relevant Search to write over again, I’d use a tool like this. Get lexical search away from the baggage of any one particular giant search stack. Don’t obsess over the esoteric syntax of any one search system and instead focus on the math and insights behind information retrieval.
It’s still very much in development - an ugly pile of loosely duct-taped together scipy sparse matrices😉 - but it has already proven useful to me. So I wanted to share! Feedback and PRs welcome.