String Grouper Class

Concept

Every function in the package is built on a single class, StringGrouper. You can use this class indirectly through the pre-defined functions, such as the four high-level functions above, or call it directly for a more interactive approach in which matches are added or removed as needed.

The four functions mentioned above all create a StringGrouper object behind the scenes and call different methods on it. The StringGrouper class keeps track of all tuples of similar strings and creates the groups out of these.

Example 1 - reuse the same tf-idf corpus without rebuilding

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)

Example 2 - add and remove matches

Since matches are often not perfect, a common workflow is to:

1. Create matches
2. Manually inspect the results
3. Add and remove matches where necessary
4. Create groups of similar strings

The StringGrouper class allows for this without having to re-calculate the cosine similarity matrix. See below for an example.
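
Before the full example, the idea of editing matches without recomputing similarities can be sketched on toy data. This is a simplified model and not StringGrouper's implementation: the class name PairStore, its methods, and the company names are all hypothetical, and it assumes a group is simply a connected component of the match graph, which may differ from the library's own grouping rule.

```python
from collections import defaultdict


class PairStore:
    """Hypothetical stand-in for match bookkeeping; not part of string_grouper."""

    def __init__(self, strings, matches=()):
        self.strings = list(strings)
        # Store each match as a sorted tuple so (a, b) and (b, a) coincide.
        self.pairs = {tuple(sorted(m)) for m in matches}

    def add_match(self, a, b):
        self.pairs.add(tuple(sorted((a, b))))

    def remove_match(self, a, b):
        self.pairs.discard(tuple(sorted((a, b))))

    def groups(self):
        """Map every string to the first-listed member of its component."""
        neighbours = defaultdict(set)
        for a, b in self.pairs:
            neighbours[a].add(b)
            neighbours[b].add(a)
        representative = {}
        for start in self.strings:
            if start in representative:
                continue
            # Depth-first walk over everything reachable from `start`.
            representative[start] = start
            stack = [start]
            while stack:
                current = stack.pop()
                for other in neighbours[current]:
                    if other not in representative:
                        representative[other] = start
                        stack.append(other)
        return representative


# Made-up data: one automatic match, then a manual correction and its undo.
store = PairStore(
    ["ACME INC", "ACME INC.", "ACME HOLDINGS", "BOLT LLC"],
    matches=[("ACME INC", "ACME INC.")],
)
before = store.groups()
store.add_match("ACME INC", "ACME HOLDINGS")
after_add = store.groups()
store.remove_match("ACME INC", "ACME HOLDINGS")
after_remove = store.groups()
```

Only the set of pairs changes between calls; the groups are rebuilt from it each time, so no similarity score has to be computed again.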

import pandas as pd
from string_grouper import StringGrouper

company_names = './data/sec_edgar_company_info.csv'
companies = pd.read_csv(company_names)

Create matches

# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'], ignore_index=True)
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')

['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']

string_grouper.n_grams('ÀbracâDABRÀ')

['abr', 'bra', 'rac', 'aca', 'cad', 'ada', 'dab', 'abr', 'bra']
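
For readers who want to see what such a split involves, here is a minimal sketch of character n-gram extraction. It is not the library's code: the helper name char_ngrams and its normalize flag are hypothetical, and the lowercasing plus accent stripping is only an assumption about what produces the second output above.

```python
import unicodedata


def char_ngrams(text, n=3, normalize=False):
    """Return the overlapping character n-grams of `text` (hypothetical helper)."""
    if normalize:
        # Lowercase, decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text.lower())
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Strings shorter than n yield an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


plain = char_ngrams("McDonalds")
folded = char_ngrams("ÀbracâDABRÀ", normalize=True)
print(plain)
print(folded)
```

With these settings the sketch reproduces the two lists shown above.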

# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()

# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()

Suppose we know that PWC HOLDING CORP and PRICEWATERHOUSECOOPERS LLP are the same company. StringGrouper will not match these since they are not similar enough.
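
A rough way to see why the two names are not matched is to compare their character trigrams directly. The sketch below uses raw trigram counts and plain cosine similarity; it leaves out the tf-idf weighting and any text cleaning the library applies, so the helper names are hypothetical and the scores are only indicative, not what StringGrouper would report.

```python
import math
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams (no cleaning, no idf weighting)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between the trigram count vectors of two strings."""
    va, vb = trigram_counts(a), trigram_counts(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


far = cosine_similarity("PWC HOLDING CORP", "PRICEWATERHOUSECOOPERS LLP")
near = cosine_similarity("PRICEWATERHOUSECOOPERS LLP",
                         "PRICEWATERHOUSECOOPERS LLP /TA")
print(far, near)
```

The first pair shares no trigram at all, so its score is zero, while the second pair scores roughly 0.93. A manual correction is the only way to link the first pair.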

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]

	Line Number	Company Name	Company CIK Key	deduplicated_name
478441	478442	PRICEWATERHOUSECOOPERS LLP /TA	1064284	PRICEWATERHOUSECOOPERS LLP /TA
478442	478443	PRICEWATERHOUSECOOPERS LLP	1186612	PRICEWATERHOUSECOOPERS LLP /TA
478443	478444	PRICEWATERHOUSECOOPERS SECURITIES LLC	1018444	PRICEWATERHOUSECOOPERS LLP /TA

companies[companies.deduplicated_name.str.contains('PWC')]

	Line Number	Company Name	Company CIK Key	deduplicated_name
485535	485536	PWC CAPITAL INC.	1690640	PWC CAPITAL INC.
485536	485537	PWC HOLDING CORP	1456450	PWC HOLDING CORP
485537	485538	PWC INVESTORS, LLC	1480311	PWC INVESTORS, LLC
485538	485539	PWC REAL ESTATE VALUE FUND I LLC	1668928	PWC REAL ESTATE VALUE FUND I LLC
485539	485540	PWC SECURITIES CORP /BD	1023989	PWC SECURITIES CORP /BD
485540	485541	PWC SECURITIES CORPORATION	1023989	PWC SECURITIES CORPORATION
485541	485542	PWCC LTD	1172241	PWCC LTD
485542	485543	PWCG BROKERAGE, INC.	67301	PWCG BROKERAGE, INC.

We can link the two with the add_match method:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now let's check again:

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]

	Line Number	Company Name	Company CIK Key	deduplicated_name
478441	478442	PRICEWATERHOUSECOOPERS LLP /TA	1064284	PRICEWATERHOUSECOOPERS LLP /TA
478442	478443	PRICEWATERHOUSECOOPERS LLP	1186612	PRICEWATERHOUSECOOPERS LLP /TA
478443	478444	PRICEWATERHOUSECOOPERS SECURITIES LLC	1018444	PRICEWATERHOUSECOOPERS LLP /TA
485536	485537	PWC HOLDING CORP	1456450	PRICEWATERHOUSECOOPERS LLP /TA

This can also be used to merge two groups:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]

	Line Number	Company Name	Company CIK Key	deduplicated_name
478441	478442	PRICEWATERHOUSECOOPERS LLP /TA	1064284	PRICEWATERHOUSECOOPERS LLP /TA
478442	478443	PRICEWATERHOUSECOOPERS LLP	1186612	PRICEWATERHOUSECOOPERS LLP /TA
478443	478444	PRICEWATERHOUSECOOPERS SECURITIES LLC	1018444	PRICEWATERHOUSECOOPERS LLP /TA
485536	485537	PWC HOLDING CORP	1456450	PRICEWATERHOUSECOOPERS LLP /TA
662585	662586	ZUCKER MICHAEL	1629018	PRICEWATERHOUSECOOPERS LLP /TA
662604	662605	ZUCKERMAN MICHAEL	1303321	PRICEWATERHOUSECOOPERS LLP /TA
662605	662606	ZUCKERMAN MICHAEL	1496366	PRICEWATERHOUSECOOPERS LLP /TA

We can remove a match, and so split the groups again, with the remove_match method:

string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]

	Line Number	Company Name	Company CIK Key	deduplicated_name
478441	478442	PRICEWATERHOUSECOOPERS LLP /TA	1064284	PRICEWATERHOUSECOOPERS LLP /TA
478442	478443	PRICEWATERHOUSECOOPERS LLP	1186612	PRICEWATERHOUSECOOPERS LLP /TA
478443	478444	PRICEWATERHOUSECOOPERS SECURITIES LLC	1018444	PRICEWATERHOUSECOOPERS LLP /TA
485536	485537	PWC HOLDING CORP	1456450	PRICEWATERHOUSECOOPERS LLP /TA