String Grouper
The image above, generated by string_grouper (see below), shows the graph-structure of one group of matched strings. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).
The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.
The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when, say, due to memory-resource limitations, direct matches between those strings cannot be computed using conventional methods with a lower threshold similarity score.
This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
string_grouper is a library that makes finding groups of similar strings within a single list, or across multiple lists, of strings easy and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.
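To build intuition for this approach, here is a minimal sketch of the tf-idf/cosine-similarity idea using scikit-learn directly on a few made-up company names. This is only an illustration of the principle, not string_grouper's internal implementation:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A few made-up company names containing near-duplicates:
names = pd.Series(['Hyper Smooth Inc.', 'HYPER SMOOTH INC',
                   'Gizmo Corp.', 'Gizmo Corporation'])

# Vectorize the strings with tf-idf over character 3-grams:
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarities; values close to 1 indicate near-duplicates:
similarities = cosine_similarity(tfidf)
print(pd.DataFrame(similarities, index=names, columns=names).round(2))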
Install

To install via pip, run:

pip install string_grouper

or see the releases here.
Speed
%%time
import polars as pl
from string_grouper import match_strings
# Import data
company_names = './data/sec_edgar_company_info.csv'
companies = pl.read_csv(company_names)
companies.shape
companies = companies.to_pandas()
# Create all matches:
matches = pl.from_pandas(
    match_strings(companies['Company Name'],
                  max_n_matches=4,
                  min_similarity=0.8,
                  n_blocks=(1, 150),
                  number_of_processes=9)
)
# companies.shape :
# (663000, 3)
# %%time :
# CPU times: user 8min 21s, sys: 2.07 s, total: 8min 23s
# Wall time: 1min 20s
First usage
import pandas as pd
from string_grouper import match_strings
#https://github.com/ngshya/pfsm/blob/master/data/sec_edgar_company_info.csv
company_names = './data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
As shown above, the library may be used together with pandas, and contains four high-level functions (match_strings, match_most_similar, group_similar_strings, and compute_pairwise_similarities) that can be used directly, and one class (StringGrouper) that allows for a more interactive approach.
The permitted calling patterns of the four functions, and their return types, are:
Function | Parameters | pandas Return Type |
---|---|---|
match_strings | (master, **kwargs) | DataFrame |
match_strings | (master, duplicates, **kwargs) | DataFrame |
match_strings | (master, master_id=id_series, **kwargs) | DataFrame |
match_strings | (master, duplicates, master_id, duplicates_id, **kwargs) | DataFrame |
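For example, the second calling pattern above compares the strings of one Series against those of another. A minimal sketch with made-up data (the Series values below are only illustrative):

import pandas as pd
from string_grouper import match_strings

master = pd.Series(['Foo Industries Ltd', 'Bar Logistics LLC', 'Baz Holdings'])
duplicates = pd.Series(['foo industries ltd.', 'BAR LOGISTICS', 'Completely Different Co'])

# Match every string in duplicates against the strings in master:
matches = match_strings(master, duplicates, min_similarity=0.7)
print(matches)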
With polars
At the moment, polars is not yet supported natively, but you can easily convert between one and the other:
import polars as pl
from string_grouper import match_strings
company_names = 'https://raw.githubusercontent.com/ngshya/pfsm/refs/heads/master/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pl.read_csv(company_names).slice(0,50000).to_pandas()
# Create all matches:
matches = pl.from_pandas(match_strings(companies['Company Name']))
# Look at only the non-exact matches:
matches.filter(pl.col('left_Company Name') != pl.col('right_Company Name')).head()
In the rest of this document, the names Series and DataFrame refer to the familiar pandas object types.
Generic parameters
Name | Description |
---|---|
master | A Series of strings to be matched with themselves (or with those in duplicates). |
duplicates | A Series of strings to be matched with those of master. |
master_id (or id_series) | A Series of IDs corresponding to the strings in master. |
duplicates_id | A Series of IDs corresponding to the strings in duplicates. |
strings_to_group | A Series of strings to be grouped. |
strings_id | A Series of IDs corresponding to the strings in strings_to_group. |
string_series_1(_2) | A Series of strings each of which is to be compared with its corresponding string in string_series_2(_1). |
**kwargs | Keyword arguments (see below). |
New in version 0.6.0: each of the high-level functions listed above also has a StringGrouper method counterpart of the same name and parameters. Calling such a method of any instance of StringGrouper will not rebuild the instance's underlying corpus to make string-comparisons but rather use it to perform the string-comparisons. The input Series to the method (master, duplicates, and so on) will thus be encoded, or transformed, into tf-idf matrices, using this corpus. For example:
from string_grouper import StringGrouper

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)