String Grouper

string_grouper is a library that makes finding groups of similar strings within one or more lists of strings easy and fast. It uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.
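To make the idea of tf-idf plus cosine similarity concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names (char_ngrams, tfidf_vectors, cosine), the choice of character 3-grams, the smoothed idf formula, and the sample names are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping, lower-cased character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalised tf-idf vector (a dict) per input string."""
    docs = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(strings)
    vectors = []
    for doc in docs:
        # Smoothed idf (an assumption of this sketch), so that n-grams
        # occurring in every string still receive a non-zero weight.
        vec = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


# Hypothetical company names, invented for illustration.
names = ["Acme Holdings Inc", "Acme Holdings Inc.", "Zenith Bank PLC"]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # near-duplicate pair
sim_far = cosine(vecs[0], vecs[2])    # unrelated pair
print(round(sim_close, 3), round(sim_far, 3))
```

The near-duplicate pair shares almost all of its n-grams and therefore scores much higher than the unrelated pair. Comparing every pair this way does not scale to large lists, which is the problem the library is meant to solve.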

Install

pip install string-grouper

or see the releases here

First usage

import pandas as pd
from string_grouper import match_strings

# Data source: https://github.com/ngshya/pfsm/blob/master/data/sec_edgar_company_info.csv
company_names = './data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()

As shown above, the library may be used together with pandas, and contains four high level functions (match_strings, match_most_similar, group_similar_strings, and compute_pairwise_similarities) that can be used directly, and one class (StringGrouper) that allows for a more interactive approach.

With Polars

Polars is not yet supported natively.

However, you can easily convert between polars and pandas:

import polars as pl
from string_grouper import match_strings

company_names = 'https://raw.githubusercontent.com/ngshya/pfsm/refs/heads/master/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pl.read_csv(company_names).slice(0,50000).to_pandas()
# Create all matches:
matches = pl.from_pandas(match_strings(companies['Company Name']))
# Look at only the non-exact matches:
matches.filter(pl.col('left_Company Name') != pl.col('right_Company Name')).head()

High Level Functions

In the rest of this document, the names Series and DataFrame refer to the familiar pandas object types.

The permitted calling patterns of the four functions, and their return types, are:

| Function | Parameters | pandas Return Type |
| --- | --- | --- |
| match_strings | (master, **kwargs) | DataFrame |
| match_strings | (master, duplicates, **kwargs) | DataFrame |
| match_strings | (master, master_id=id_series, **kwargs) | DataFrame |
| match_strings | (master, duplicates, master_id, duplicates_id, **kwargs) | DataFrame |
| match_most_similar | (master, duplicates, **kwargs) | Series (if kwarg ignore_index=True) otherwise DataFrame (default) |
| match_most_similar | (master, duplicates, master_id, duplicates_id, **kwargs) | DataFrame |
| group_similar_strings | (strings_to_group, **kwargs) | Series (if kwarg ignore_index=True) otherwise DataFrame (default) |
| group_similar_strings | (strings_to_group, strings_id, **kwargs) | DataFrame |
| compute_pairwise_similarities | (string_series_1, string_series_2, **kwargs) | Series |

Generic Parameters

| Name | Description |
| --- | --- |
| master | A Series of strings to be matched with themselves (or with those in duplicates). |
| duplicates | A Series of strings to be matched with those of master. |
| master_id (or id_series) | A Series of IDs corresponding to the strings in master. |
| duplicates_id | A Series of IDs corresponding to the strings in duplicates. |
| strings_to_group | A Series of strings to be grouped. |
| strings_id | A Series of IDs corresponding to the strings in strings_to_group. |
| string_series_1(_2) | A Series of strings each of which is to be compared with its corresponding string in string_series_2(_1). |
| **kwargs | Keyword arguments (see below). |

StringGrouper Class

The above-mentioned functions are all built using the StringGrouper class, which can also be used directly for a more interactive approach. Each of the high-level functions listed above has a StringGrouper method counterpart of the same name and parameters. Calling such a method on an instance of StringGrouper does not rebuild the instance's underlying corpus; instead, the existing corpus is used to perform the string comparisons. The input Series to the method (master, duplicates, and so on) are thus encoded, or transformed, into tf-idf matrices using this corpus. See StringGrouper for further details.