String Grouper Class

Concept

All functions are built on the StringGrouper class. The class can be used through the pre-defined functions, such as the four high-level functions above, or called directly for a more interactive approach in which matches can be added or removed as needed.

The four functions mentioned above all create a StringGrouper object behind the scenes and call different methods on it. The StringGrouper class keeps track of all pairs of similar strings and builds the groups from them. Since matches are often not perfect, the class also supports inspecting and correcting matches by hand before grouping; Example 2 below walks through that workflow.

Example 1 - reuse the same tf-idf corpus without rebuilding

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)
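
The idea of reusing a corpus can be illustrated without the library. The sketch below is a toy, standard-library stand-in and not StringGrouper's implementation: the class name ToyCorpus, its methods, the smoothed IDF formula, and the sample strings are all assumptions made for illustration. It computes inverse document frequencies once from a master list and then scores two separate batches against that unchanged corpus.

```python
import math
from collections import Counter


def trigrams(text):
    """Lower-case the text, drop spaces, and return its character 3-grams."""
    cleaned = text.lower().replace(" ", "")
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


class ToyCorpus:
    """Hypothetical stand-in for a tf-idf corpus that is built exactly once."""

    def __init__(self, master):
        self.master = list(master)
        n_docs = len(self.master)
        doc_freq = Counter()
        for name in self.master:
            doc_freq.update(set(trigrams(name)))
        # Smoothed IDF (an assumed formula); unseen trigrams are ignored later.
        self.idf = {g: math.log((1 + n_docs) / (1 + df)) + 1
                    for g, df in doc_freq.items()}
        self.master_vectors = [self.vectorize(name) for name in self.master]

    def vectorize(self, text):
        """Return an L2-normalised tf-idf vector using the fixed IDF table."""
        counts = Counter(g for g in trigrams(text) if g in self.idf)
        weights = {g: tf * self.idf[g] for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {g: w / norm for g, w in weights.items()} if norm else {}

    def best_match(self, text):
        """Return (master string, cosine similarity) for the closest entry."""
        vec = self.vectorize(text)
        scores = [sum(w * mv.get(g, 0.0) for g, w in vec.items())
                  for mv in self.master_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.master[best], scores[best]


corpus = ToyCorpus(["Acme Holdings", "Globex Corporation", "Initech LLC"])
idf_before = dict(corpus.idf)

# Two separate batches are scored against the same, unchanged corpus.
batch_1 = [corpus.best_match(s) for s in ["ACME Holding", "Initech"]]
batch_2 = [corpus.best_match(s) for s in ["Globex Corp"]]
```

Because the IDF table is computed only in the constructor, scoring a second batch costs only the vectorisation of the new strings; the weights learned from the master list stay fixed.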

Example 2 - add and remove matches

The workflow has four steps:

  1. Create matches
  2. Manually inspect the results
  3. Add and remove matches where necessary
  4. Create groups of similar strings

The StringGrouper class allows for this without having to re-calculate the cosine similarity matrix. See below for an example.
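
One way to picture why manual corrections are cheap is to treat the stored matches as edges of a graph and the groups as its connected components. The sketch below follows that picture using only the standard library. It is an illustration, not the library's code: ToyMatchStore, its method bodies, the hand-written pairs, and the "first name in input order" representative rule are all assumptions.

```python
from collections import defaultdict


class ToyMatchStore:
    """Hypothetical illustration: keeps matched pairs, derives groups from them."""

    def __init__(self, names, pairs):
        self.names = list(names)
        # Store each pair as an unordered edge of a graph.
        self.pairs = {frozenset(p) for p in pairs}

    def add_match(self, a, b):
        self.pairs.add(frozenset((a, b)))

    def remove_match(self, a, b):
        self.pairs.discard(frozenset((a, b)))

    def get_groups(self):
        """Map every name to a representative of its connected component."""
        neighbours = defaultdict(set)
        for pair in self.pairs:
            if len(pair) == 2:  # skip self-matches
                a, b = pair
                neighbours[a].add(b)
                neighbours[b].add(a)
        groups = {}
        for name in self.names:
            if name in groups:
                continue
            # Depth-first walk over everything reachable from `name`.
            stack, component = [name], {name}
            while stack:
                for nxt in neighbours[stack.pop()] - component:
                    component.add(nxt)
                    stack.append(nxt)
            # Toy convention: the first name in input order represents the group.
            for member in component:
                groups[member] = name
        return groups


names = [
    "PRICEWATERHOUSECOOPERS LLP /TA",
    "PRICEWATERHOUSECOOPERS LLP",
    "PWC HOLDING CORP",
    "ZUCKER MICHAEL",
    "ZUCKERMAN MICHAEL",
]
# Hand-written pairs standing in for the similarity results (step 1).
store = ToyMatchStore(names, [(names[0], names[1]), (names[3], names[4])])
before = store.get_groups()

# Step 3: correct the matches by hand; no similarities are recomputed.
store.add_match("PRICEWATERHOUSECOOPERS LLP", "PWC HOLDING CORP")
after_add = store.get_groups()

store.add_match("PRICEWATERHOUSECOOPERS LLP", "ZUCKER MICHAEL")
merged = store.get_groups()

store.remove_match("PRICEWATERHOUSECOOPERS LLP", "ZUCKER MICHAEL")
after_remove = store.get_groups()
```

In this toy model, adding or removing an edge only changes the set of pairs; regrouping is a graph traversal, so the expensive similarity step never has to run again.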

import pandas as pd
from string_grouper import StringGrouper

company_names = './data/sec_edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Step 1: create matches
# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'], ignore_index=True)
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')
['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
string_grouper.n_grams('ÀbracâDABRÀ')
['abr', 'bra', 'rac', 'aca', 'cad', 'ada', 'dab', 'abr', 'bra']
# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()
# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()

Suppose we know that PWC HOLDING CORP and PRICEWATERHOUSECOOPERS LLP are the same company. StringGrouper will not match these since they are not similar enough.
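
A quick way to see why these two names are not matched is to compare their character trigrams directly. The sketch below is a standalone illustration and does not call the library: the n_grams helper here is hypothetical, and its cleaning rule (lower-casing, then dropping whitespace and the characters , - . /) is an assumption, so its output may differ in case from the library output shown earlier.

```python
import re


def n_grams(text, n=3):
    """Lower-case, strip separators, and return the character n-grams."""
    # Assumed cleaning rule for this sketch: drop whitespace and , - . /
    cleaned = re.sub(r"[,\-./]|\s", "", text.lower())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


a = n_grams("PWC HOLDING CORP")
b = n_grams("PRICEWATERHOUSECOOPERS LLP")
shared = set(a) & set(b)

print(len(a), len(b), sorted(shared))
# prints: 12 23 []
```

With no trigram in common, the dot product of the two n-gram vectors is zero whatever weights are applied, so their cosine similarity is zero and no similarity threshold above zero would pair them.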

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
        Line Number  Company Name                           Company CIK Key  deduplicated_name
478441  478442       PRICEWATERHOUSECOOPERS LLP /TA         1064284          PRICEWATERHOUSECOOPERS LLP /TA
478442  478443       PRICEWATERHOUSECOOPERS LLP             1186612          PRICEWATERHOUSECOOPERS LLP /TA
478443  478444       PRICEWATERHOUSECOOPERS SECURITIES LLC  1018444          PRICEWATERHOUSECOOPERS LLP /TA
companies[companies.deduplicated_name.str.contains('PWC')]
        Line Number  Company Name                      Company CIK Key  deduplicated_name
485535  485536       PWC CAPITAL INC.                  1690640          PWC CAPITAL INC.
485536  485537       PWC HOLDING CORP                  1456450          PWC HOLDING CORP
485537  485538       PWC INVESTORS, LLC                1480311          PWC INVESTORS, LLC
485538  485539       PWC REAL ESTATE VALUE FUND I LLC  1668928          PWC REAL ESTATE VALUE FUND I LLC
485539  485540       PWC SECURITIES CORP /BD           1023989          PWC SECURITIES CORP /BD
485540  485541       PWC SECURITIES CORPORATION        1023989          PWC SECURITIES CORPORATION
485541  485542       PWCC LTD                          1172241          PWCC LTD
485542  485543       PWCG BROKERAGE, INC.              67301            PWCG BROKERAGE, INC.

We can link the two manually with the add_match method:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
        Line Number  Company Name                           Company CIK Key  deduplicated_name
478441  478442       PRICEWATERHOUSECOOPERS LLP /TA         1064284          PRICEWATERHOUSECOOPERS LLP /TA
478442  478443       PRICEWATERHOUSECOOPERS LLP             1186612          PRICEWATERHOUSECOOPERS LLP /TA
478443  478444       PRICEWATERHOUSECOOPERS SECURITIES LLC  1018444          PRICEWATERHOUSECOOPERS LLP /TA
485536  485537       PWC HOLDING CORP                       1456450          PRICEWATERHOUSECOOPERS LLP /TA

add_match can also be used to merge two groups. Linking one member of each group brings every member of both groups together:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
        Line Number  Company Name                           Company CIK Key  deduplicated_name
478441  478442       PRICEWATERHOUSECOOPERS LLP /TA         1064284          PRICEWATERHOUSECOOPERS LLP /TA
478442  478443       PRICEWATERHOUSECOOPERS LLP             1186612          PRICEWATERHOUSECOOPERS LLP /TA
478443  478444       PRICEWATERHOUSECOOPERS SECURITIES LLC  1018444          PRICEWATERHOUSECOOPERS LLP /TA
485536  485537       PWC HOLDING CORP                       1456450          PRICEWATERHOUSECOOPERS LLP /TA
662585  662586       ZUCKER MICHAEL                         1629018          PRICEWATERHOUSECOOPERS LLP /TA
662604  662605       ZUCKERMAN MICHAEL                      1303321          PRICEWATERHOUSECOOPERS LLP /TA
662605  662606       ZUCKERMAN MICHAEL                      1496366          PRICEWATERHOUSECOOPERS LLP /TA

We can remove strings from groups in the same way, using remove_match:

string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
        Line Number  Company Name                           Company CIK Key  deduplicated_name
478441  478442       PRICEWATERHOUSECOOPERS LLP /TA         1064284          PRICEWATERHOUSECOOPERS LLP /TA
478442  478443       PRICEWATERHOUSECOOPERS LLP             1186612          PRICEWATERHOUSECOOPERS LLP /TA
478443  478444       PRICEWATERHOUSECOOPERS SECURITIES LLC  1018444          PRICEWATERHOUSECOOPERS LLP /TA
485536  485537       PWC HOLDING CORP                       1456450          PRICEWATERHOUSECOOPERS LLP /TA