A lot of my improvements to SearchArray performance have been deep in the weeds of retrieval thingies. Making term or phrase search faster.

After all, that’s what I’ve been benchmarking. Issue a set of expensive term or phrase queries, see how fast they run in a loop of thousands of results.

Yet when I actually went to do actual work improving search relevance, such as playing with MSMarco I didn’t really FEEL that improvement.

The code to issue many queries, and gather their results looks something like this:

all_results = []
N = 10
for query in queries:
    # Score results
    scores = df['searchable_col'].score( ... tokenized terms ... )

    # Get top N (unsorted)
    top_n = np.argpartition(scores, -N)[-N:]

    # Copy the top N out of df
    results = df.iloc[top_n]
     
    # Set some metadata, sort by score
    results['query'] = query
    results['score'] = scores[top_n]
    results = results.sort_values('score', ascending=False)
    results['rank'] = np.arange(N)

    # Append
    all_results.append(results)

all_results = pd.concat(all_results)

Indeed all the scoring itself, whether combining phrases, terms, etc didn’t seem to be the problem. It’s the copying, concatting, sorting, etc of this larger resultset - all_results - that increasingly becomes the bottleneck.

So, now in SearchArray, I’ve been working on improving this with a simple class for helping gather search results.

Something like this, which takes scores, and finally when you’re done looping gives you the final result set.

class SetOfResultsNaive:
    """Gather multiple sets of search results."""

    def __init__(self, df: pd.DataFrame, searchable=False):
        self.all_results: List[pd.DataFrame] = []
        self.df = df

    def ins_top_n(self, scores, N=10,
                  query: str = ''):
        """Sort a dataframe by a score column.

        Args:
            df (pd.DataFrame): The dataframe to sort.
            score_col (str): The column to sort by.
            N (int): The number of rows to return.

        Returns:
            pd.DataFrame: The sorted dataframe.
        """
        top_n = np.argpartition(scores, -N)[-N:]
        results = self.df.iloc[top_n, ~self.df.columns.isin(self.searchable_cols)]
        results['score'] = scores[top_n]
        results['query'] = query
        results_sorted = results.sort_values('score', ascending=False)
        results_sorted['rank'] = np.arange(N)
        self.all_results.append(results_sorted)

    def get_all(self) -> pd.DataFrame:
        return pd.concat(self.all_results)

But we can obviously do better:

  • Avoid copying every loop, and just save the ids to later to slice out
  • Avoid copying the SearchArray columns (copying inverted index is slow). Just get what we want to display.
  • Keep metadata to the side until finally building the complete data frame

Given that, we can just update the code above to something less naive:

    def ins_top_n(self, scores, N=10, query: str = '',
                  metadata: Optional[Dict[str, List[Any]]] = None):
        """Insert the top N rows into the set of results."""
        top_n = np.argpartition(scores, -N)[-N:]
        self.indices.extend(top_n)
        self.metadata['score'].extend(scores[top_n])
        self.metadata['query'].extend([query] * len(top_n))
        # (Other metadata as needed)

And finally, at the end get what we need:

    def get_all(self) -> pd.DataFrame:
        # Copy indices, skipping searchable columns (SearchArray Cols)
        subset = self.df.iloc[self.indices, ~self.df.columns.isin(self.searchable_cols)]
        # Add metadata as columns
        for key, values in self.metadata.items():
            subset[key] = values
        # Sort by query, then by score
        sorted_subset = subset.sort_values(['query', 'score'], ascending=[True, False])
        # Assign rank within each query
        sorted_subset['rank'] = sorted_subset.groupby('query').cumcount()
        return sorted_subset.reset_index(drop=True)

The results are pretty nice, a pretty beefy performance improvement. With performance dominated by actual searching.

This can be much faster, but just goes to show you:

  1. Look at the full process when bottlenecking, including the parts you take for granted
  2. Avoiding copies makes a big difference in saving performance

Doug Turnbull

More from Doug
Twitter | LinkedIn | Mastodon
Doug's articles at OpenSource Connections | Shopify Eng Blog