# Examples

In this section we cover a few use cases for which string_grouper may be used. We use the same data set of company names as in *Super Fast String Matching in Python*.

## Find all matches within a single data set
```python
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, \
    group_similar_strings, compute_pairwise_similarities, \
    StringGrouper

company_names = './data/sec_edgar_company_info.csv'
# We only look at the first 50k records as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
```
|  | left_index | left_Company Name | similarity | right_Company Name | right_index |
|---|---|---|---|---|---|
| 15 | 14 | 0210, LLC | 0.870291 | 90210 LLC | 4211 |
| 167 | 165 | 1 800 MUTUALS ADVISOR SERIES | 0.931615 | 1 800 MUTUALS ADVISORS SERIES | 166 |
| 168 | 166 | 1 800 MUTUALS ADVISORS SERIES | 0.931615 | 1 800 MUTUALS ADVISOR SERIES | 165 |
| 172 | 168 | 1 800 RADIATOR FRANCHISE INC | 1.000000 | 1-800-RADIATOR FRANCHISE INC. | 201 |
| 178 | 173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC ... | 0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 174 |
## Find all matches between two data sets

The `match_strings` function can also find similar items between two data sets. This can be seen as an inner join between the two data sets:
```python
# Create a small set of artificial company names:
duplicates = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_strings(companies['Company Name'], duplicates)
matches
```
|  | left_index | left_Company Name | similarity | right_side | right_index |
|---|---|---|---|---|---|
| 0 | 12 | 012 SMILE.COMMUNICATIONS LTD | 0.944092 | 012 SMILE.COMMUNICATIONS | 1 |
| 1 | 49777 | B.A.S. MEDIA GROUP | 0.854383 | S MEDIA GROUP | 0 |
| 2 | 49855 | B4UTRADE COM CORP | 1.000000 | B4UTRADE COM CORP | 3 |
| 3 | 49856 | B4UTRADE COM INC | 0.810217 | B4UTRADE COM CORP | 3 |
| 4 | 49857 | B4UTRADE CORP | 0.878276 | B4UTRADE COM CORP | 3 |
Out of the four company names in `duplicates`, three are found in the original company data set. One company is found three times.
## Finding duplicates from a (database extract to) DataFrame where IDs for rows are supplied

A very common scenario is one where duplicate records for an entity have been entered into a database. That is, there are two or more records whose name field has slightly different spelling, for example "A.B. Corporation" and "AB Corporation". Using the optional 'ID' parameter of the `match_strings` function, such duplicates can be found easily. A tutorial that steps through the process with an example data set is available.
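The mechanics can be sketched with plain pandas: given a table of row-index pairs reported by a matcher, an ID column lets each side be mapped back to its database key. All column names and IDs below are made up for illustration:

```python
import pandas as pd

# Hypothetical records table with database IDs (made-up data):
records = pd.DataFrame(
    {'id': ['A1', 'A2', 'A3'],
     'name': ['A.B. Corporation', 'AB Corporation', 'Zeta Ltd']}
)

# Suppose a matcher reported that rows 0 and 1 are similar:
matches = pd.DataFrame({'left_index': [0], 'right_index': [1]})

# Map the positional row indices back to the database IDs:
matches['left_id'] = matches['left_index'].map(records['id'])
matches['right_id'] = matches['right_index'].map(records['id'])
```

With the IDs attached, the matched pairs can be joined back to the source database records directly.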
## For a second data set, find only the most similar match

In the example above, multiple matches may be found for a single string. Sometimes we just want each string to be matched with its single most similar counterpart. If no similar string is found, the original string should be returned:
```python
# Create a small set of artificial company names:
new_companies = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'],
                          name='New Company')
# Create all matches:
matches = match_most_similar(companies['Company Name'], new_companies, ignore_index=True)
# Display the results:
pd.concat([new_companies, matches], axis=1)
```
|  | New Company | most_similar_Company Name |
|---|---|---|
| 0 | S MEDIA GROUP | B.A.S. MEDIA GROUP |
| 1 | 012 SMILE.COMMUNICATIONS | 012 SMILE.COMMUNICATIONS LTD |
| 2 | foo bar | foo bar |
| 3 | B4UTRADE COM CORP | B4UTRADE COM CORP |
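When a full matches table is already at hand, keeping only the single best match per query string is an ordinary pandas reduction. A minimal sketch with made-up scores (not the library's implementation):

```python
import pandas as pd

# Hypothetical matches table: each query may have several candidates:
matches = pd.DataFrame({
    'query': ['S MEDIA GROUP', 'B4UTRADE COM CORP', 'B4UTRADE COM CORP'],
    'candidate': ['B.A.S. MEDIA GROUP', 'B4UTRADE COM CORP', 'B4UTRADE CORP'],
    'similarity': [0.85, 1.00, 0.88],
})

# Keep only the highest-similarity row per query:
best = matches.loc[matches.groupby('query')['similarity'].idxmax()]
```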
## Deduplicate a single data set and show items with most duplicates

The `group_similar_strings` function groups strings that are similar using a single-linkage clustering algorithm. That is, if item A is similar to item B, and item B is similar to item C, then all three items are grouped together, even if the similarity between A and C is below the threshold.
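Single-linkage grouping amounts to finding connected components over the pairwise matches. A stdlib sketch using union-find, with made-up item names:

```python
def group_single_linkage(items, similar_pairs):
    """Group items transitively: if (A, B) and (B, C) are similar,
    then A, B and C all end up in the same group (single linkage)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the two sides of every similar pair:
    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    # Collect items by their root representative:
    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())

items = ['A', 'B', 'C', 'D']
# A~B and B~C are similar, A~C is below the threshold, D matches nothing:
groups = group_single_linkage(items, [('A', 'B'), ('B', 'C')])
# A, B and C land in one group; D stays alone.
```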
```python
# Add the grouped strings:
companies['deduplicated_name'] = group_similar_strings(companies['Company Name'],
                                                       ignore_index=True)
# Show items with the most duplicates:
companies.groupby('deduplicated_name')['Line Number'].count() \
    .sort_values(ascending=False).head(10)
```
```
deduplicated_name
ADVISORS DISCIPLINED TRUST                                      1824
AGL LIFE ASSURANCE CO SEPARATE ACCOUNT                           183
ANGELLIST-ART-FUND, A SERIES OF ANGELLIST-FG-FUNDS, LLC          116
AMERICREDIT AUTOMOBILE RECEIVABLES TRUST 2001-1                   87
ACE SECURITIES CORP. HOME EQUITY LOAN TRUST, SERIES 2006-HE2      57
ASSET-BACKED PASS-THROUGH CERTIFICATES SERIES 2004-W1             40
ALLSTATE LIFE GLOBAL FUNDING TRUST 2005-3                         39
ALLY AUTO RECEIVABLES TRUST 2014-1                                33
ANDERSON ROBERT E /                                               28
ADVENT INTERNATIONAL GPE VIII LIMITED PARTNERSHIP                 28
Name: Line Number, dtype: int64
```
The `group_similar_strings` function also works with IDs: imagine a `DataFrame` (`customers_df`) with the following content:
```python
# Create a small set of artificial customer names:
customers_df = pd.DataFrame(
    [
        ('BB016741P', 'Mega Enterprises Corporation'),
        ('CC082744L', 'Hyper Startup Incorporated'),
        ('AA098762D', 'Hyper Startup Inc.'),
        ('BB099931J', 'Hyper-Startup Inc.'),
        ('HH072982K', 'Hyper Hyper Inc.')
    ],
    columns=('Customer ID', 'Customer Name')
).set_index('Customer ID')
# Display the data:
customers_df
```
| Customer ID | Customer Name |
|---|---|
| BB016741P | Mega Enterprises Corporation |
| CC082744L | Hyper Startup Incorporated |
| AA098762D | Hyper Startup Inc. |
| BB099931J | Hyper-Startup Inc. |
| HH072982K | Hyper Hyper Inc. |
The output of `group_similar_strings` can be used directly as a mapping table:
```python
# Group customers with similar names:
customers_df[["group-id", "name_deduped"]] = \
    group_similar_strings(customers_df["Customer Name"])
# Display the mapping table:
customers_df
```
| Customer ID | Customer Name | group-id | name_deduped |
|---|---|---|---|
| BB016741P | Mega Enterprises Corporation | BB016741P | Mega Enterprises Corporation |
| CC082744L | Hyper Startup Incorporated | CC082744L | Hyper Startup Incorporated |
| AA098762D | Hyper Startup Inc. | AA098762D | Hyper Startup Inc. |
| BB099931J | Hyper-Startup Inc. | AA098762D | Hyper Startup Inc. |
| HH072982K | Hyper Hyper Inc. | HH072982K | Hyper Hyper Inc. |
Note that `customers_df` initially had only one column, "Customer Name"; after the `group_similar_strings` call it acquired two more columns, "group-id" (the index column) and "name_deduped", through "setting with enlargement" (a pandas feature).
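The same "setting with enlargement" works on any `DataFrame`: assigning to column labels that do not yet exist creates them. A tiny illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['foo', 'bar']})
# Neither 'group' nor 'deduped' exists yet; the assignment adds both columns:
df[['group', 'deduped']] = [[0, 'foo'], [0, 'foo']]
```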
## Simply compute the cosine similarities of pairs of strings

Sometimes we have pairs of strings that have already been matched but whose similarity scores still need to be computed. For this purpose we provide the function `compute_pairwise_similarities`:
```python
# Create a small DataFrame of pairs of strings:
pair_s = pd.DataFrame(
    [
        ('Mega Enterprises Corporation', 'Mega Enterprises Corporation'),
        ('Hyper Startup Inc.', 'Hyper Startup Incorporated'),
        ('Hyper Startup Inc.', 'Hyper Startup Inc.'),
        ('Hyper Startup Inc.', 'Hyper-Startup Inc.'),
        ('Hyper Hyper Inc.', 'Hyper Hyper Inc.'),
        ('Mega Enterprises Corporation', 'Mega Enterprises Corp.')
    ],
    columns=('left', 'right')
)
# Display the data:
pair_s
```
|  | left | right |
|---|---|---|
| 0 | Mega Enterprises Corporation | Mega Enterprises Corporation |
| 1 | Hyper Startup Inc. | Hyper Startup Incorporated |
| 2 | Hyper Startup Inc. | Hyper Startup Inc. |
| 3 | Hyper Startup Inc. | Hyper-Startup Inc. |
| 4 | Hyper Hyper Inc. | Hyper Hyper Inc. |
| 5 | Mega Enterprises Corporation | Mega Enterprises Corp. |
```python
# Compute their cosine similarities and display them:
pair_s['similarity'] = compute_pairwise_similarities(pair_s['left'], pair_s['right'])
pair_s
```
|  | left | right | similarity |
|---|---|---|---|
| 0 | Mega Enterprises Corporation | Mega Enterprises Corporation | 1.000000 |
| 1 | Hyper Startup Inc. | Hyper Startup Incorporated | 0.633620 |
| 2 | Hyper Startup Inc. | Hyper Startup Inc. | 1.000000 |
| 3 | Hyper Startup Inc. | Hyper-Startup Inc. | 1.000000 |
| 4 | Hyper Hyper Inc. | Hyper Hyper Inc. | 1.000000 |
| 5 | Mega Enterprises Corporation | Mega Enterprises Corp. | 0.826463 |
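Under the hood, string_grouper scores strings by the cosine similarity of tf-idf vectors built from character n-grams. A simplified stdlib sketch using plain (unweighted) 3-gram counts, so the numbers will differ from the tf-idf scores in the table above:

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=3):
    """All character n-grams of a string."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def cosine_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

sim = cosine_similarity('Hyper Startup Inc.', 'Hyper Startup Incorporated')
# Identical strings score 1.0; strings sharing no 3-grams score 0.0.
```

Note that the library also preprocesses strings (and weights the n-grams by tf-idf) before computing similarities, which this sketch omits.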