Options / **kwargs
All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used:
Tokenization settings
ngram_size
: The number of characters in each n-gram. Default is 3.
regex
: The regex string used to clean up the input string. Default is r"[,-./]|\s".
ignore_case
: Determines whether letter case in strings is ignored. Defaults to True.
normalize_to_ascii
: Determines whether Unicode-to-ASCII normalization is performed. Defaults to True.
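To make the interplay of the tokenization settings concrete, here is a minimal sketch using only the standard library. The function name `char_ngrams` and the order of the steps are assumptions made for illustration; this is not the library's own tokenizer, only a plausible reading of what the four settings control.

```python
import re
import unicodedata


def char_ngrams(text, ngram_size=3, regex=r"[,-./]|\s",
                ignore_case=True, normalize_to_ascii=True):
    """Hypothetical helper: split a string into character n-grams."""
    if normalize_to_ascii:
        # Decompose accented characters, then drop anything outside ASCII.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    if ignore_case:
        text = text.lower()
    # Remove every character matched by the clean-up regex.
    text = re.sub(regex, "", text)
    # Slide a window of ngram_size characters over the cleaned string.
    return [text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)]


print(char_ngrams("Caf\u00e9, Inc."))
# ['caf', 'afe', 'fei', 'ein', 'inc']
```

With the default regex, commas, hyphens, periods, slashes and whitespace are stripped before the n-grams are formed, so "Café, Inc." and "cafe inc" would yield the same tokens in this sketch.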
Match and output settings
max_n_matches
: The maximum number of matching strings in master allowed per string in duplicates. Default is 20.
min_similarity
: The minimum cosine similarity for two strings to be considered a match. Defaults to 0.8.
include_zeroes
: When min_similarity ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to True. (See tutorials/zero_similarity.md.)
ignore_index
: Determines whether indexes are ignored. If False (the default), index-columns appear in the output; otherwise they do not. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
replace_na
: For function match_most_similar, determines whether NaN values in index-columns are replaced by index-labels from duplicates. Defaults to False. (See tutorials/ignore_index_and_replace_na.md for a demonstration.)
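The effect of min_similarity and max_n_matches can be pictured as a filter followed by a top-N cut. The sketch below is a conceptual illustration in plain Python: the helper `select_matches`, the sample scores, and the inclusive treatment of the threshold are all assumptions, not the library's internals.

```python
def select_matches(scores, min_similarity=0.8, max_n_matches=20):
    """Hypothetical helper: choose matches for ONE string in duplicates.

    scores maps each candidate string in master to its cosine similarity.
    """
    # Keep candidates at or above the threshold (boundary handling is assumed).
    kept = [(name, sim) for name, sim in scores.items() if sim >= min_similarity]
    # Best matches first, then cap the number of matches per string.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_n_matches]


# Made-up similarity scores for a single query string.
scores = {"Acme Corp": 0.95, "Acme Inc": 0.85, "Apex Ltd": 0.40, "Acme Co": 0.90}

print(select_matches(scores, min_similarity=0.8, max_n_matches=2))
# [('Acme Corp', 0.95), ('Acme Co', 0.9)]
```

Raising min_similarity shrinks the candidate list before the cap is applied, while max_n_matches only matters when more candidates pass the threshold than the cap allows.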
Performance settings
number_of_processes
: The number of processes used by the cosine similarity calculation. Defaults to the number of cores on the machine minus 1.
n_blocks
: A tuple of two ints provided to help boost performance, where possible, when processing large DataFrames (see Subsection Performance). The DataFrames are split into n_blocks[0] blocks for the left operand (of the underlying matrix multiplication) and into n_blocks[1] blocks for the right operand before the string comparisons are performed block-wise. Defaults to None, in which case automatic splitting occurs if an OverflowError would otherwise occur.
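The block-wise idea behind n_blocks can be sketched without any matrix code: each operand is cut into roughly equal blocks, and every left block is paired with every right block. The helpers below (`default_number_of_processes`, `split_into_blocks`, `blockwise_pairs`) are hypothetical names for illustration, and the floor of one process is an added assumption for single-core machines.

```python
import os


def default_number_of_processes():
    """Assumed default: all cores but one, with a floor of one process."""
    return max(1, (os.cpu_count() or 1) - 1)


def split_into_blocks(items, n):
    """Split a list into n contiguous blocks of near-equal size."""
    size, extra = divmod(len(items), n)
    blocks, start = [], 0
    for i in range(n):
        # The first `extra` blocks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


def blockwise_pairs(left, right, n_blocks=(1, 1)):
    """Yield every (left block, right block) pair to be compared."""
    for left_block in split_into_blocks(left, n_blocks[0]):
        for right_block in split_into_blocks(right, n_blocks[1]):
            yield left_block, right_block


left = ["a", "b", "c", "d", "e"]
right = ["u", "v", "w", "x", "y", "z", "q"]
pairs = list(blockwise_pairs(left, right, n_blocks=(2, 3)))
print(len(pairs))  # 6 block-wise comparisons
```

Each block pair is a smaller problem than the full comparison, which is why splitting can help when the unsplit computation would be too large.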
Other settings
tfidf_matrix_dtype
: The datatype for the tf-idf values of the matrix components. Allowed values are numpy.float32 and numpy.float64. Default is numpy.float64. (Note: numpy.float32 often leads to faster processing and a smaller memory footprint, albeit with less numerical precision than numpy.float64.)
group_rep
: For function group_similar_strings, determines how group-representatives are chosen. Allowed values are 'centroid' (the default) and 'first'. See tutorials/group_representatives.md for an explanation.
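The memory-versus-precision trade-off noted for tfidf_matrix_dtype can be seen with the standard library alone. In this sketch the `array` type codes "f" and "d" stand in for numpy.float32 and numpy.float64 (an assumption made to keep the example dependency-free), and the values are made up.

```python
from array import array

# Made-up tf-idf values for illustration.
tfidf_values = [0.1, 0.25, 0.7]

single = array("f", tfidf_values)  # 32-bit floats, standing in for numpy.float32
double = array("d", tfidf_values)  # 64-bit floats, standing in for numpy.float64

# Single precision stores the same values in less memory.
bytes_single = single.itemsize * len(single)
bytes_double = double.itemsize * len(double)

# The cost is a small rounding error on values such as 0.1,
# which is not exactly representable in binary floating point.
error_single = abs(single[0] - 0.1)
error_double = abs(double[0] - 0.1)

print(bytes_single, bytes_double)
print(error_single, error_double)
```

Whether the extra rounding error matters depends on how close the similarities of interest lie to min_similarity; for most thresholds the difference is far below the resolution that matters.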