stdout.be

A blog about programming, information architecture and journalism

in coding, journalism

Interrogating data

Commentary

source: http://onlinejournalismblog.com/2010/04/26/data...

“Use a spellchecker to check for misspellings. You will probably have to add some words to the computer’s dictionary.”

I guess this sort of thing is why it still pays off to team up with a developer. In this case, for example, Levenshtein distances are often enough to let a computer clean up dirty data by grouping together slight variations on a spelling, and in other cases simple regular expressions can go where plain find and replace can’t.
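To make that a bit more concrete, here is a rough sketch in Python of what I have in mind. It is only an illustration: the function names, the greedy grouping strategy and the distance threshold of 2 are my own assumptions, and real data would need a threshold tuned by eye. A regular expression pass normalizes punctuation and whitespace first, and a Levenshtein distance then decides which spellings belong together.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character to a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def normalize(value: str) -> str:
    """Regex cleanup that plain find and replace cannot do in one pass."""
    value = re.sub(r"[^\w\s]", "", value)       # strip punctuation
    value = re.sub(r"\s+", " ", value).strip()  # collapse runs of whitespace
    return value.lower()


def group_variants(values, max_distance=2):
    """Greedily group values whose normalized forms are within max_distance.

    The first value seen in each group acts as its representative.
    The threshold is an assumption; too high and distinct names get merged.
    """
    groups = []
    for value in values:
        key = normalize(value)
        for group in groups:
            if levenshtein(key, normalize(group[0])) <= max_distance:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


if __name__ == "__main__":
    dirty = ["Birmingham", "Birmingam", "birmingham ", "Manchester", "Manchestr", "Leeds"]
    for group in group_variants(dirty):
        print(group)
```

Comparing each value only against the first member of a group keeps the sketch short, but it also means the result depends on input order. For anything beyond a quick cleanup you would still want a human to review the groups before merging them.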

Thanks for the link to ScraperWiki. It looks like a pretty cool service.