Options / **kwargs
All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used:
Tokenization settings
- `ngram_size`: The number of characters in each n-gram. Defaults to `3`.
- `regex`: The regex string used to clean up the input string. Defaults to `r"[,-./]|\s"`.
- `ignore_case`: Determines whether letter case in strings is ignored. Defaults to `True`.
- `normalize_to_ascii`: Determines whether Unicode-to-ASCII normalization is performed. Defaults to `True`.
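To make the interplay of the four tokenization settings concrete, here is a minimal standard-library sketch. The function `ngrams` is hypothetical and written for illustration only; it is not string_grouper's own tokenizer, and the order in which the library applies these steps is an assumption.

```python
import re
import unicodedata


def ngrams(text, ngram_size=3, regex=r"[,-./]|\s",
           ignore_case=True, normalize_to_ascii=True):
    """Illustrative tokenizer (hypothetical; not the library's implementation)."""
    if normalize_to_ascii:
        # Decompose accented characters, then drop everything outside ASCII.
        text = (unicodedata.normalize("NFKD", text)
                .encode("ascii", "ignore")
                .decode("ascii"))
    if ignore_case:
        text = text.lower()
    # Delete every character matched by the clean-up regex.
    text = re.sub(regex, "", text)
    # Slide a window of ngram_size characters over the cleaned string.
    return [text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)]


print(ngrams("Café, Inc."))
print(ngrams("Café, Inc.", ngram_size=2, ignore_case=False))
```

In this sketch, punctuation and whitespace matched by the regex are removed before the n-grams are cut, so strings that differ only in such characters produce identical n-grams.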
Match and output settings
- `max_n_matches`: The maximum number of matching strings in `master` allowed per string in `duplicates`. Defaults to `20`.
- `min_similarity`: The minimum cosine similarity for two strings to be considered a match. Defaults to `0.8`.
- `include_zeroes`: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See tutorials/zero_similarity.md.)
- `ignore_index`: Determines whether indexes are ignored. If `False` (the default), index-columns appear in the output; otherwise they do not. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
- `replace_na`: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced by index-labels from `duplicates`. Defaults to `False`. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
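The effect of `min_similarity` and `max_n_matches` can be sketched with a small standard-library example. It scores strings by cosine similarity over raw character 3-gram counts, which is a simplification: the library works on tf-idf weights, so actual scores will differ. The helper names (`char_ngrams`, `cosine`, `top_matches`) are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_matches(query, master, min_similarity=0.8, max_n_matches=20):
    """Keep matches at or above the threshold, best first, capped in number."""
    query_vec = char_ngrams(query)
    scored = [(m, cosine(query_vec, char_ngrams(m))) for m in master]
    kept = [(m, s) for m, s in scored if s >= min_similarity]  # assumed inclusive
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_n_matches]


master = ["foobar", "foobaz", "quux"]
print(top_matches("foobar", master))                      # default threshold
print(top_matches("foobar", master, min_similarity=0.5))  # looser threshold
```

Lowering `min_similarity` admits weaker matches, while `max_n_matches` caps how many of the surviving matches are kept per string.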
Performance settings
- `number_of_processes`: The number of processes used by the cosine similarity calculation. Defaults to the number of cores on the machine minus 1.
- `n_blocks`: A tuple of two `int`s that can help boost performance when processing large DataFrames (see Subsection Performance). The DataFrames are split into `n_blocks[0]` blocks for the left operand (of the underlying matrix multiplication) and into `n_blocks[1]` blocks for the right operand, and the string comparisons are then performed block-wise. Defaults to `None`, in which case automatic splitting occurs if an `OverflowError` would otherwise occur.
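The idea behind `n_blocks` can be illustrated with plain Python lists. This is only a sketch of block-wise processing, assuming contiguous blocks of near-equal size; the helpers `split_into_blocks` and `blockwise_pairs` are hypothetical, and how the library itself partitions its matrices is not shown here.

```python
def split_into_blocks(items, n):
    """Split items into n contiguous blocks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    blocks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


def blockwise_pairs(left, right, n_blocks=(1, 1)):
    """Yield every (left, right) pair, visiting one pair of blocks at a time."""
    for left_block in split_into_blocks(left, n_blocks[0]):
        for right_block in split_into_blocks(right, n_blocks[1]):
            for l_item in left_block:
                for r_item in right_block:
                    yield l_item, r_item


print(split_into_blocks(list(range(7)), 3))
print(len(list(blockwise_pairs(list("abcd"), list("xyz"), n_blocks=(2, 3)))))
```

Every pair is still compared exactly once; splitting only changes how much data is handled in a single step, which is what keeps each intermediate result small.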
Other settings
- `tfidf_matrix_dtype`: The datatype for the tf-idf values of the matrix components. Allowed values are `numpy.float32` and `numpy.float64`. Defaults to `numpy.float64`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint, at the cost of lower numerical precision than `numpy.float64`.)
- `group_rep`: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See tutorials/group_representatives.md for an explanation.