A lot of my improvements to SearchArray performance have been deep in the weeds of retrieval thingies. Making term or phrase search faster.

After all, that’s what I’ve been benchmarking. Issue a set of expensive term or phrase queries, see how fast they run in a loop of thousands of results.

Yet when I actually went to do actual work improving search relevance, such as playing with MSMarco I didn’t really FEEL that improvement.

The code to issue many queries, and gather their results looks something like this:

all_results = []
N = 10
for query in queries:
    # Score results
    scores = df['searchable_col'].score( ... tokenized terms ... )

    # Get top N (unsorted)
    top_n = np.argpartition(scores, -N)[-N:]

    # Copy the top N out of df
    results = df.iloc[top_n]
     
    # Set some metadata, sort by score
    results['query'] = query
    results['score'] = scores[top_n]
    results = results.sort_values('score', ascending=False)
    results['rank'] = np.arange(N)

    # Append
    all_results.append(results)

all_results = pd.concat(all_results)

Indeed all the scoring itself, whether combining phrases, terms, etc didn’t seem to be the problem. It’s the copying, concatting, sorting, etc of this larger resultset - all_results - that increasingly becomes the bottleneck.

So, now in SearchArray, I’ve been working on improving this with a simple class for helping gather search results.

Something like this, which takes scores, and finally when you’re done looping gives you the final result set.

class SetOfResultsNaive:
    """Gather multiple sets of search results."""

    def __init__(self, df: pd.DataFrame, searchable=False):
        self.all_results: List[pd.DataFrame] = []
        self.df = df

    def ins_top_n(self, scores, N=10,
                  query: str = ''):
        """Sort a dataframe by a score column.

        Args:
            df (pd.DataFrame): The dataframe to sort.
            score_col (str): The column to sort by.
            N (int): The number of rows to return.

        Returns:
            pd.DataFrame: The sorted dataframe.
        """
        top_n = np.argpartition(scores, -N)[-N:]
        results = self.df.iloc[top_n, ~self.df.columns.isin(self.searchable_cols)]
        results['score'] = scores[top_n]
        results['query'] = query
        results_sorted = results.sort_values('score', ascending=False)
        results_sorted['rank'] = np.arange(N)
        self.all_results.append(results_sorted)

    def get_all(self) -> pd.DataFrame:
        return pd.concat(self.all_results)

But we can obviously do better:

  • Avoid copying every loop, and just save the ids to later to slice out
  • Avoid copying the SearchArray columns (copying inverted index is slow). Just get what we want to display.
  • Keep metadata to the side until finally building the complete data frame

Given that, we can just update the code above to something less naive:

    def ins_top_n(self, scores, N=10, query: str = '',
                  metadata: Optional[Dict[str, List[Any]]] = None):
        """Insert the top N rows into the set of results."""
        top_n = np.argpartition(scores, -N)[-N:]
        self.indices.extend(top_n)
        self.metadata['score'].extend(scores[top_n])
        self.metadata['query'].extend([query] * len(top_n))
        # (Other metadata as needed)

And finally, at the end get what we need:

    def get_all(self) -> pd.DataFrame:
        # Copy indices, skipping searchable columns (SearchArray Cols)
        subset = self.df.iloc[self.indices, ~self.df.columns.isin(self.searchable_cols)]
        # Add metadata as columns
        for key, values in self.metadata.items():
            subset[key] = values
        # Sort by query, then by score
        sorted_subset = subset.sort_values(['query', 'score'], ascending=[True, False])
        # Assign rank within each query
        sorted_subset['rank'] = sorted_subset.groupby('query').cumcount()
        return sorted_subset.reset_index(drop=True)

The results are pretty nice, a pretty beefy performance improvement. With performance dominated by actual searching.

This can be much faster, but just goes to show you:

  1. Look at the full process when bottlenecking, including the parts you take for granted
  2. Avoiding copies makes a big difference in saving performance

Enjoy softwaredoug in training course form!

Starting May 18!

Signup here - http://maven.com/softwaredoug/cheat-at-search

I hope you join me at Cheat at Search with Agents to learn use agents in search. build better RAG and use LLMs in query understanding.

Doug Turnbull

More from Doug
Twitter | LinkedIn | Newsletter | Bsky