String Grouper
The image above, generated by string_grouper (see below), shows the graph-structure of one group of matched strings. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).
The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.
The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when, say, due to memory-resource limitations, direct matches between those strings cannot be computed using conventional methods with a lower threshold similarity score.
This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
string_grouper is a library that makes finding groups of similar strings within a single list, or across multiple lists, of strings easy and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.
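To build intuition for this approach, here is a minimal sketch of the tf-idf/cosine-similarity idea using scikit-learn directly on a few made-up company names. This is only an illustration of the principle, not string_grouper's internal implementation:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A few made-up company names containing near-duplicates:
names = pd.Series(['Hyper Smooth Inc.', 'HYPER SMOOTH INC',
                   'Gizmo Corp.', 'Gizmo Corporation'])

# Vectorize the strings with tf-idf over character 3-grams:
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarities; values close to 1 indicate near-duplicates:
similarities = cosine_similarity(tfidf)
print(pd.DataFrame(similarities, index=names, columns=names).round(2))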
Install

To install via pip, run:

pip install string_grouper

or see the releases here.
Speed
%%time
import polars as pl
from string_grouper import match_strings
# Import data
company_names = './data/sec_edgar_company_info.csv'
companies = pl.read_csv(company_names)
companies.shape
companies = companies.to_pandas()
# Create all matches:
matches = pl.from_pandas(
    match_strings(companies['Company Name'],
                  max_n_matches=4,
                  min_similarity=0.8,
                  n_blocks=(1, 150),
                  number_of_processes=9)
)
# companies.shape :
# (663000, 3)
# %%time :
# CPU times: user 8min 21s, sys: 2.07 s, total: 8min 23s
# Wall time: 1min 20s
First usage
import pandas as pd
from string_grouper import match_strings
#https://github.com/ngshya/pfsm/blob/master/data/sec_edgar_company_info.csv
company_names = './data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
As shown above, the library may be used together with pandas, and contains four high-level functions (match_strings, match_most_similar, group_similar_strings, and compute_pairwise_similarities) that can be used directly, and one class (StringGrouper) that allows for a more interactive approach.
The permitted calling patterns of the four functions, and their return types, are:
Function | Parameters | pandas Return Type |
---|---|---|
match_strings | (master, **kwargs) | DataFrame |
match_strings | (master, duplicates, **kwargs) | DataFrame |
match_strings | (master, master_id=id_series, **kwargs) | DataFrame |
match_strings | (master, duplicates, master_id, duplicates_id, **kwargs) | DataFrame |
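For example, the second calling pattern above compares the strings of one Series against those of another. A minimal sketch with made-up data (the Series values below are only illustrative):

import pandas as pd
from string_grouper import match_strings

master = pd.Series(['Foo Industries Ltd', 'Bar Logistics LLC', 'Baz Holdings'])
duplicates = pd.Series(['foo industries ltd.', 'BAR LOGISTICS', 'Completely Different Co'])

# Match every string in duplicates against the strings in master:
matches = match_strings(master, duplicates, min_similarity=0.7)
print(matches)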
With polars
At the moment, polars is not yet supported natively, but you can easily convert between one and the other:
import polars as pl
from string_grouper import match_strings
company_names = 'https://raw.githubusercontent.com/ngshya/pfsm/refs/heads/master/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pl.read_csv(company_names).slice(0,50000).to_pandas()
# Create all matches:
matches = pl.from_pandas(match_strings(companies['Company Name']))
# Look at only the non-exact matches:
matches.filter(pl.col('left_Company Name') != pl.col('right_Company Name')).head()
In the rest of this document, the names Series and DataFrame refer to the familiar pandas object types.
Generic parameters
Name | Description |
---|---|
master | A Series of strings to be matched with themselves (or with those in duplicates). |
duplicates | A Series of strings to be matched with those of master. |
master_id (or id_series) | A Series of IDs corresponding to the strings in master. |
duplicates_id | A Series of IDs corresponding to the strings in duplicates. |
strings_to_group | A Series of strings to be grouped. |
strings_id | A Series of IDs corresponding to the strings in strings_to_group. |
string_series_1(_2) | A Series of strings each of which is to be compared with its corresponding string in string_series_2(_1). |
**kwargs | Keyword arguments (see below). |
New in version 0.6.0: each of the high-level functions listed above also has a StringGrouper method counterpart of the same name and parameters. Calling such a method of any instance of StringGrouper will not rebuild the instance's underlying corpus to make string-comparisons but rather use it to perform the string-comparisons. The input Series to the method (master, duplicates, and so on) will thus be encoded, or transformed, into tf-idf matrices, using this corpus. For example:
from string_grouper import StringGrouper

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)