match_most_similar

Arguments

match_most_similar(master: pd.Series,
                   duplicates: Optional[pd.Series],
                   master_id: Optional[pd.Series],
                   duplicates_id: Optional[pd.Series],
                   **kwargs) -> Union[pd.DataFrame, pd.Series]

Result

If ignore_index=True, match_most_similar returns a Series of strings containing, for each string in duplicates, the most similar string in master. If master contains no similar string for a given string in duplicates (that is, no potential match has a cosine similarity above the threshold, 0.8 by default), the original string from duplicates is returned instead. The output Series thus has the same length and index as duplicates.

For example, if master is a Series with the values \['foooo', 'bar', 'baz'\] and duplicates is a Series with the values \['foooob', 'bar', 'new'\], the function returns a Series with the values \['foooo', 'bar', 'new'\]: 'foooob' is matched to 'foooo', 'bar' matches itself, and 'new' has no match above the threshold and is therefore returned unchanged.
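The fallback rule in this example can be illustrated with a small pure-Python sketch. It is not the library's implementation: it assumes plain character-trigram counts in place of whatever weighting the library applies, and the names trigram_counts, cosine, most_similar_sketch and min_similarity are hypothetical, chosen only for this illustration.

```python
import math
from collections import Counter
from typing import List


def trigram_counts(text: str) -> Counter:
    """Count overlapping 3-character substrings (assumed feature set)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_sketch(master: List[str],
                        duplicates: List[str],
                        min_similarity: float = 0.8) -> List[str]:
    """For each duplicate, return its best master match or the duplicate itself."""
    master_vectors = [trigram_counts(m) for m in master]
    result = []
    for dup in duplicates:
        dup_vector = trigram_counts(dup)
        best_string, best_score = dup, 0.0
        for candidate, vector in zip(master, master_vectors):
            score = cosine(dup_vector, vector)
            if score > best_score:
                best_string, best_score = candidate, score
        # Keep the original string when nothing scores above the threshold.
        result.append(best_string if best_score > min_similarity else dup)
    return result


output = most_similar_sketch(['foooo', 'bar', 'baz'], ['foooob', 'bar', 'new'])
print(output)
```

Under these assumptions 'foooob' scores roughly 0.91 against 'foooo', 'bar' scores 1.0 against itself, and 'new' shares no trigram with any master string, so the sketch reproduces the values given in the example.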

The name of the output Series is the same as that of master prefixed with the string 'most_similar_'. If master has no name, it is assumed to have the name 'master' before being prefixed.

If ignore_index=False (the default), match_most_similar returns a DataFrame that contains the Series described above as one of its columns, and therefore has the same index and length as duplicates. Its remaining columns correspond to the index (or index-levels) of master and contain the index-labels of the most similar strings output as values. If master contains no similar string for a given string in duplicates, the value(s) assigned to these index-column(s) for that string are NaN by default. However, if the keyword argument replace_na=True, these NaN values are replaced with the index-label(s) of that string in duplicates. Note that such replacements can only occur if the indexes of master and duplicates have the same number of levels. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
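The effect of replace_na on the index-label column can be sketched in plain Python, without pandas. This is only an illustration of the rule described above for single-level indexes: the function index_label_rows and the best_match mapping (which stands in for the similarity search) are hypothetical, and None is used here in place of NaN.

```python
from typing import Dict, Hashable, List, Optional, Tuple

# (label in duplicates, label of the match in master, output string)
Row = Tuple[Hashable, Optional[Hashable], str]


def index_label_rows(master: Dict[Hashable, str],
                     duplicates: Dict[Hashable, str],
                     best_match: Dict[Hashable, Hashable],
                     replace_na: bool = False) -> List[Row]:
    """Build one output row per entry of duplicates, in the same order."""
    rows: List[Row] = []
    for dup_label, dup_string in duplicates.items():
        master_label = best_match.get(dup_label)
        if master_label is not None:
            # A match exists: report master's label and master's string.
            rows.append((dup_label, master_label, master[master_label]))
        else:
            # No match: keep the original string; the label column is
            # missing unless replace_na asks for the duplicate's own label.
            fallback_label = dup_label if replace_na else None
            rows.append((dup_label, fallback_label, dup_string))
    return rows


master = {10: 'foooo', 11: 'bar', 12: 'baz'}
duplicates = {0: 'foooob', 1: 'bar', 2: 'new'}
best_match = {0: 10, 1: 11}  # assumed result of the similarity search

rows_default = index_label_rows(master, duplicates, best_match)
rows_replaced = index_label_rows(master, duplicates, best_match, replace_na=True)
print(rows_default)
print(rows_replaced)
```

In this sketch only the last row differs between the two calls: the unmatched string 'new' carries a missing label by default and its own label 2 when replace_na=True.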

Each column of the output DataFrame is named after its corresponding column, index, or index-level of master, prefixed with the string 'most_similar_'.

If both parameters master_id and duplicates_id are also given, a DataFrame is always returned. It has the same column(s) as described above, plus an additional column containing the IDs, taken from these input Series, that correspond to the output strings. This column's name is that of master_id, prefixed in the same way as described above. If master_id has no name, it is assumed to have the name 'master_id' before being prefixed.