van den Blog

Python, Data, and more

Jan 2, 2020

String Grouper

Finding similar strings within large sets of strings is a problem many people run into. In a previous blog post, Super Fast String Matching, I explained a process of finding similar...

Apr 13, 2019

The rise of Newsletter Spam: A journey through my Gmail inbox

In the beginning there was spam. Cheap, unpersonalised, mass-sent junk mail, easily defeated by simple Bayesian Filters. Over the years spammers improved and an arms race between spammers and spam...

Oct 14, 2017

Super Fast String Matching in Python

Traditional approaches to string matching, such as the Jaro-Winkler or Levenshtein distance measures, are too slow for large datasets. Using TF-IDF with N-Grams as terms to find similar strings transforms...

Aug 29, 2017

1 Day of Citi Bike availability

After moving to New York from the Netherlands I was relieved to find out that biking in Manhattan is actually pretty doable. It’s not really as common as it is...

Aug 1, 2017

PySpark Dist Explore

PySpark Dataframe Distribution Explorer: I found myself using some half-baked, quickly written functions to do data exploration in PySpark, every time using a similar but modified version of the...

Feb 1, 2017

I’m a bit later dot N L is a site I created to notify you when your train is delayed or cancelled....

This project is maintained by bergvca