# Performance
[Plot: semilogx plots of the run-times of `match_strings()` vs the number of blocks (`n_blocks[1]`) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number (`n_blocks[0]`) of blocks into which the left matrix-operand was split.]
String comparison, as implemented by `string_grouper`, is essentially matrix multiplication. A pandas Series of strings is converted (tokenized) into a matrix. Then that matrix is multiplied by itself (or another) transposed. Here is an illustration of the multiplication of two matrices D and Mᵀ:
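The idea can be sketched with scikit-learn's `TfidfVectorizer` (a simplified stand-in for `string_grouper`'s internals, which use TF-IDF weights over character n-grams): each string becomes one row of a sparse matrix, and multiplying that matrix by its own transpose yields all pairwise similarities at once.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

strings = pd.Series(["McDonalds Corp", "Mc Donald's Corp", "Burger King Inc"])

# Tokenize each string into a row of TF-IDF weights over character 3-grams:
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
D = vectorizer.fit_transform(strings)   # sparse matrix, one row per string

# Multiplying D by its own transpose yields the pairwise cosine similarities
# (rows of D are L2-normalized by default):
similarities = D @ D.T
print(similarities.toarray().round(2))
```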
It turns out that when the matrix (or Series) is very large, the computer proceeds quite slowly with the multiplication (apparently due to the RAM being too full). Some computers give up with an `OverflowError`.
To circumvent this issue, `string_grouper` now allows the division of the Series into smaller chunks (or blocks) and multiplies the chunks one pair at a time instead to get the same result:
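A rough sketch of the block-wise scheme (`blockwise_multiply` is a hypothetical helper, not `string_grouper`'s internal code): split the columns of the transposed right operand into blocks, multiply one block at a time, and stitch the partial products back together.

```python
import numpy as np
from scipy.sparse import hstack

def blockwise_multiply(D, M_T, n_right_blocks):
    # Split the columns of the (already transposed) right operand into
    # blocks, multiply one block-pair at a time, then reassemble:
    column_blocks = np.array_split(np.arange(M_T.shape[1]), n_right_blocks)
    partial_products = [D @ M_T[:, cols] for cols in column_blocks]
    return hstack(partial_products)     # identical to D @ M_T
```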
But surprise ... the run-time of the process is sometimes drastically reduced as a result. For example, the speed-up of the following call is about 500% (here, the Series is divided into 200 blocks on the right operand, that is, 1 block on the left × 200 on the right) compared to the same call with no splitting [`n_blocks=(1, 1)`, the default, which is what previous versions (0.5.0 and earlier) of `string_grouper` did]:
```python
import pandas as pd
from string_grouper import match_strings

# A DataFrame of 663 000 records:
companies = pd.read_csv('data/sec__edgar_company_info.csv')

# The following call is more than 6 times faster than earlier versions of
# match_strings() (that is, when n_blocks=(1, 1))!
match_strings(companies['Company Name'], n_blocks=(1, 200))
```
Further exploration of the block number space (see the plot above) has revealed that, for any fixed number of right blocks, the run-time grows with the number of left blocks specified. For this reason, it is recommended not to split the left matrix-operand, that is, to keep `n_blocks[0] = 1`.
In general,

    total runtime = n_blocks[0] × n_blocks[1] × mean runtime per block-pair
                  = (Left Operand Size × Right Operand Size × mean runtime per block-pair)
                    / (Left Block Size × Right Block Size)

since `n_blocks[0]` = Left Operand Size / Left Block Size and `n_blocks[1]` = Right Operand Size / Right Block Size.

So for given left and right operands, minimizing the total runtime is the same as minimizing the

    runtime per string-pair comparison ≝ mean runtime per block-pair / (Left Block Size × Right Block Size)
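As a quick sanity check of this identity, the following snippet plugs in illustrative numbers (the per-block-pair runtime is a made-up figure; only the algebra is being checked):

```python
# Numerical check of the runtime identity above (illustrative values only).
left_size, right_size = 663_000, 663_000   # operand sizes
n0, n1 = 1, 200                            # n_blocks
t_block = 2.0                              # mean runtime per block-pair (made-up)

left_block = left_size / n0                # Left Block Size
right_block = right_size / n1              # Right Block Size

total_1 = n0 * n1 * t_block
total_2 = left_size * right_size * t_block / (left_block * right_block)
assert total_1 == total_2 == 400.0         # both forms agree: 400 seconds
```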
Below is a log-log-log contour plot of the runtime per string-pair comparison scaled by its value at Left Block Size = Right Block Size = 5000. Here, Block Size is the number of strings in that block, and mean runtime per block-pair is the time taken for the following call to run:
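```python
# one block-pair comparison: one left block against one right block
match_strings(left_Series, right_Series)
```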
where `left_Series` and `right_Series`, corresponding to Left Block and Right Block respectively, are random subsets of the Series `companies['Company Name']` from the sec__edgar_company_info.csv sample data file.

[Contour plot: runtime per string-pair comparison vs Left Block Size and Right Block Size]
It can be seen that when `right_Series` has roughly 80 000 strings (denoted by the white dashed line in the contour plot above), the runtime per string-pair comparison is at its lowest for any fixed `left_Series` size. Above Right Block Size = 80 000, the matrix-multiplication routine begins to feel the limits of the computer's available memory, and its performance deteriorates, as evidenced by the increase in runtime per string-pair comparison there (above the white dashed line). This knowledge could serve as a guide for estimating the optimum block numbers, namely, those that divide the Series into blocks of size roughly equal to 80 000 for the right operand (or `right_Series`).
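For instance, a simple rule of thumb based on this observation might look like the following sketch (`suggest_n_blocks` is a hypothetical helper, and the 80 000 figure is specific to the machine used for the plot above):

```python
import math

def suggest_n_blocks(right_series_length: int, right_block_size: int = 80_000):
    # Keep the left operand whole (n_blocks[0] = 1) and split the right
    # operand into blocks of roughly right_block_size strings each.
    return (1, max(1, math.ceil(right_series_length / right_block_size)))

suggest_n_blocks(663_000)   # -> (1, 9)
```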
So what are the optimum block number values for any given Series? That is anyone's guess and likely depends on the data itself. Furthermore, as hinted above, the answer may vary from computer to computer.
We nevertheless encourage the user to make judicious use of the `n_blocks` parameter to boost the performance of `string_grouper` whenever possible.