Finding similar strings within large sets of strings is a problem many people run into. In a previous blog post, Super Fast String Matching, I explained a process of finding similar...
In the beginning there was spam: cheap, unpersonalised, mass-sent junk mail, easily defeated by simple Bayesian filters. Over the years spammers improved, and an arms race between spammers and spam...
Traditional approaches to string matching, such as the Jaro-Winkler or Levenshtein distance measures, are too slow for large datasets. Using TF-IDF with N-Grams as terms to find similar strings transforms...
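To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF over character n-grams with cosine similarity. It is not the implementation from the post: the function names (`char_ngrams`, `tfidf_vectors`, `cosine`), the trigram size, the smoothed IDF formula, and the sample company names are all assumptions chosen for illustration, and a real pipeline on a large dataset would rely on sparse matrix operations rather than Python dictionaries.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one L2-normalised TF-IDF vector (a dict) per input string."""
    docs = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(strings)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumption of this sketch) keeps weights positive.
        vec = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


# Hypothetical sample data: two spellings of one name and an unrelated name.
names = ["McDonalds Corp", "Mc Donalds Corporation", "Burger King"]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
print(sim_close > sim_far)
```

Because every vector is normalised to unit length, the similarity reduces to a dot product over shared n-grams, which is the property that lets this approach scale with sparse matrix multiplication.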
After moving to New York from the Netherlands, I was relieved to find out that biking in Manhattan is actually pretty doable. It’s not really as common as it is...
PySpark Dataframe Distribution Explorer: I found myself using some half-baked, quickly written functions to do data exploration in PySpark, every time using a similar but modified version of the...
I’m a bit later dot N L: ikbenwatlater.nl (I’m a bit later dot N L) is a site I created to notify you when your train is delayed or cancelled....