
String Grouper


[Image: graph-structure of one group of matched strings found by string_grouper]

The image displayed above is a visualization of the graph-structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).

The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.

The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when, say, due to memory constraints, direct matches between those strings cannot be computed using conventional methods with a lower similarity-score threshold.

———

This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec_edgar_company_info.csv sample data file.

string_grouper is a library that makes it easy, and fast, to find groups of similar strings within a single list of strings or across multiple lists. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.
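
To make the mechanism concrete, here is a minimal sketch of the underlying idea only (it is not string_grouper's own code): strings are broken into character n-grams, vectorized with tf-idf, and compared by cosine similarity. It assumes scikit-learn is available; the n-gram size and sample strings are arbitrary illustrative choices.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

strings = pd.Series(['ADVISORS DISCIPLINED TRUST',
                     'ADVISORS DISCIPLINED TRUST 59',
                     'BLUE SKY PRODUCTIONS INC'])

# Vectorize the strings as tf-idf over character trigrams:
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
tfidf_matrix = vectorizer.fit_transform(strings)

# Dense cosine-similarity matrix; string_grouper instead computes a sparse,
# thresholded, top-n matrix product so that it scales to very large lists.
print(cosine_similarity(tfidf_matrix))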

Install

pip install git+https://github.com/GuillaumePressiat/string_grouper@from_0.6_to_0.7

or see the releases on the repository's releases page

Speed

%%time
import polars as pl
from string_grouper import match_strings

# Import data
company_names = './data/sec_edgar_company_info.csv'
companies = pl.read_csv(company_names)
companies.shape
companies = companies.to_pandas()

# Create all matches:
matches = pl.from_pandas(match_strings(companies['Company Name'], 
                                       max_n_matches = 4,
                                       min_similarity = 0.8,
                                       n_blocks = (1, 150), 
                                       number_of_processes = 9)
)
# companies.shape : 
# (663000, 3)
# %%time : 
# CPU times: user 8min 21s, sys: 2.07 s, total: 8min 23s
# Wall time: 1min 20s

First usage

import pandas as pd
from string_grouper import match_strings

#https://github.com/ngshya/pfsm/blob/master/data/sec_edgar_company_info.csv
company_names = './data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()

As shown above, the library can be used together with pandas. It contains four high-level functions (match_strings, match_most_similar, group_similar_strings, and compute_pairwise_similarities) that can be used directly, and one class (StringGrouper) that allows for a more interactive approach.

The permitted calling patterns of the four functions, and their return types, are:

Function | Parameters | pandas Return Type
--- | --- | ---
match_strings | (master, **kwargs) | DataFrame
match_strings | (master, duplicates, **kwargs) | DataFrame
match_strings | (master, master_id=id_series, **kwargs) | DataFrame
match_strings | (master, duplicates, master_id, duplicates_id, **kwargs) | DataFrame
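
For example, the two-Series and ID-based calling patterns can be used as follows (a minimal sketch; the Series contents and ID values are invented for illustration):

import pandas as pd
from string_grouper import match_strings

customers = pd.Series(['Hyper Startup Inc.', 'Hyper Startup Incorporated', 'Hyper-Startup Inc.'])
vendors = pd.Series(['Hyper Startup Inc', 'Some Other Company LLC'])

# Match every string in customers (master) against every string in vendors (duplicates):
cross_matches = match_strings(customers, vendors)

# Pass IDs so that they are carried through into the resulting DataFrame:
customer_ids = pd.Series(['C1', 'C2', 'C3'])
matches_with_ids = match_strings(customers, master_id=customer_ids)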

With polars

For the moment, polars is not yet supported natively.

But you can easily convert between the two:

import polars as pl
from string_grouper import match_strings

company_names = 'https://raw.githubusercontent.com/ngshya/pfsm/refs/heads/master/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pl.read_csv(company_names).slice(0,50000).to_pandas()
# Create all matches:
matches = pl.from_pandas(match_strings(companies['Company Name']))
# Look at only the non-exact matches:
matches.filter(pl.col('left_Company Name') != pl.col('right_Company Name')).head()

In the rest of this document, the names Series and DataFrame refer to the familiar pandas object types.

Generic parameters

Name | Description
--- | ---
master | A Series of strings to be matched with themselves (or with those in duplicates).
duplicates | A Series of strings to be matched with those of master.
master_id (or id_series) | A Series of IDs corresponding to the strings in master.
duplicates_id | A Series of IDs corresponding to the strings in duplicates.
strings_to_group | A Series of strings to be grouped.
strings_id | A Series of IDs corresponding to the strings in strings_to_group.
string_series_1 (_2) | A Series of strings, each of which is to be compared with its corresponding string in string_series_2 (_1).
**kwargs | Keyword arguments (see below).
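
As a short illustration of the remaining parameters (a sketch with invented data), strings_to_group is used by group_similar_strings, and string_series_1/string_series_2 by compute_pairwise_similarities:

import pandas as pd
from string_grouper import group_similar_strings, compute_pairwise_similarities

# Assign each string to a group represented by one of the group's members:
strings_to_group = pd.Series(['foo bar ltd', 'foo bar ltd.', 'baz corp'])
groups = group_similar_strings(strings_to_group)

# Compare each string in one Series with its counterpart in the other:
string_series_1 = pd.Series(['foo bar ltd', 'baz corp'])
string_series_2 = pd.Series(['foo bar ltd.', 'baz corporation'])
similarities = compute_pairwise_similarities(string_series_1, string_series_2)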

New in version 0.6.0: each of the high-level functions listed above also has a StringGrouper method counterpart with the same name and parameters. Calling such a method on any instance of StringGrouper will not rebuild the instance's underlying corpus for the string comparisons but will instead reuse it. The input Series to the method (master, duplicates, and so on) are thus encoded, or transformed, into tf-idf matrices using this existing corpus. For example:

# Assumes master, new_master_1 and new_master_2 are pandas Series of strings
# defined elsewhere.
from string_grouper import StringGrouper

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)

New in version 0.7.0: the sparse_dot_topn dependency from ing-bank is now used at version 1.1.