I had an interesting conversation with Max Cutler and Daniel Bachhuber last weekend about machine learning. There’s been an explosion of websites that provide some form of curation using trend tracking. Mediagazer. Viewsflow. Loud3r. The Hourly Press.
Now, I happen to think automated trend tracking is incredibly cool, if only because it upsets the heck out of music reporters who used to trick us into believing they had some sort of a superhuman grip on reality, whereas apparently they were just following three or four different websites to track down “the next big thing”. Sucks to be you, guys.
But anyway. Automated trend tracking is cool, but it’s also disappointing, boring even, from a journalistic perspective. Weren’t we supposed to be the people who unearthed news from all over the globe? The guys who point at stuff nobody would ever stumble across if it weren’t for us — like a very cool blogpost on the educational system in Siberia, written in freakin’ Russian? Or a very insightful critique on the new government coalition, written by a joe-schmoe nobody’s ever heard about before? Yet trend tracking does exactly the opposite: it gives more attention to what’s already in the spotlight.
Readers want to know what everybody else is talking about, and join the debate. So tracking what’s hot is, well, cool. But instead of the bazillionth trend tracking app, I’d much rather see a tool that would use aggregation, machine learning and natural language processing to crawl the far outreaches of the web and spot obscure but valuable news or opinions that could enhance existing coverage. Like a robot army of researchers for every reporter, working 24/24 to suggest new angles and exciting topics we’ve missed. Topics everyone has missed. That one interesting commentary from an otherwise incredibly dreary blog that nobody would ever want to put in their RSS reader.
- An aggregation tool like Superfeedr, combined with Yahoo! BOSS, could provide the raw content from feeds, from Twitter, from all over the web.
- Natural language analysis using the NLTK and Open Calais would ascertain for each piece of incoming content whether it belongs in one of the topical areas our news org covers, and if not, throw it away.
- Machine learning using PyBrain or the Google Prediction API could give a rough estimate of how interesting this piece of news is, based on previous coverage — similar to how Google Reader’s magic ranking works.
- The Google Language API could do rough translations of content in Spanish, French, Ukrainian, whatever. When tied in with the aforementioned language analysis and machine learning, those translations might hint at fascinating content from across the globe.
Will this work? I don’t know. But it’d be damn nifty if it would. You’d do more with fewer people.
Time to experiment, isn’t it?
Our methods of newsgathering have hardly changed in the last century. The web has given every single person with an internet connection the same research tools as reporters. Time to change that around again.
