I was looking over the entries to the Mozilla-Knight digital journalism challenge, which asks people to think of an application that could profoundly impact the way digital journalism is done. (Disclaimer: I’m a participant myself.)
Some people are trying to solve the problem of having to browse twenty different news websites to get the kind of information you want. Um, like Reeder or Flipboard, you mean? Others want to create easy ways for people to share information and for reporters to distill that information into stories. Um, like Facebook, Twitter and Storify?
But the thing that struck me most is how nearly half the entries mention machine learning or language analysis. And the particular way they mention it: vaguely, in a way that doesn’t quite explain how exactly the process should work and how it’ll do its magic. In many product ideas, machine learning seems to fill the role of industrial superglue: it’s what holds an otherwise mediocre application together. “My application proposes bundling comments by what they talk about, so people won’t have to sift through tons of comments to read the ones that interest them. Difficult, time-intensive you say? Nah, we’ll slap some language analysis juju on there, et voilà.”
Machine learning as a meme is very similar to “social” five to ten years ago: you took an okay-ish concept, added some crowdsourcing, folksonomies and social networking, and there it was, your wonderful Web 2.0 brainchild. AJAX used to have the same effect on people: the term found its way into the layman’s lexicon and everybody started talking about how they’d make this beautiful, AJAXy web app without really even knowing what it entailed, just that it’d be really slick. Real-time web has a shot at attaining the same status in the not-too-faraway future, and I can’t count the number of apps that are mobile location-based gamification with coupons.
The fact that machine learning is on people’s minds, and that to a certain extent it has become easy — Google just opened up its Prediction API, taking care of all the fussy details for you, at least for a certain set of problems and provided you only need limited accuracy — has me very excited. But the nonchalance with which people talk about machine learning and natural language analysis also has me worried, because, in entrepreneurspeak, it functions as a sort of magic pixie dust that’ll make everything better.
There’s no substitute for good product design. You still have to make something people will want to use and find your way around technical stumbling blocks, not just fill in all the gaps with “ML will solve any difficulties we face”. Because it won’t.

19 comments
After a long time, an article that provokes enough emotion and appreciation to make me leave a comment just to say 'Well said'.
I really like your analogy with social and AJAX.
Spot on.
The irony, I think, of using ML as spackling material for bad ideas is that to derive non-trivial advantage from it, you have to put a lot of work into massaging your data into the right form, doing feature detection, choosing the right output representation, collecting training data and doing validation, etc.
In other words, your spackle now needs glue and string.
P.S. Great hair. You're up there with Rich Hickey.
Well said Stijn! These algorithms take a lot of experimentation and tuning. And they're hard work. For an example of just how much work, see this great paper on Google News' recommendation algorithm. http://www.ra.ethz.ch/CDstore/www2007/www2007.org/papers/paper570.pdf
Fascinating link, Jonathan, thanks. There's some information around the web / Quora about how Yahoo! News does personalization, but nothing with the level of detail of that paper. Downloaded to my Kindle for quiet reading later today.
Excellent post. I think you could add XML, the semantic web and a host of other tech to that over-hyped, misunderstood, yet must-have list. I will now read your blog.
However, I do disagree with this line: "Others want to create easy ways for people to share information and for reporters to distill that information into stories. Um, like Facebook, Twitter and Storify?"
Implying that these technologies are the end-all, be-all of distilling information into stories is like saying the telegraph is the end-all, be-all of telecom technology. They are amazing breakthroughs, but just the tip of the iceberg.
Spot on. It especially hurts if you have done research on ML and see other people use it as a marketing gimmick w/o really understanding what is going on inside an off-the-shelf ML library.
So if I clap my hands, my business model will work?
It is sad the way people treat ML as a superglue, but things have long been like that. I know students who opt for Computer Science Engineering because they think it is the easiest: it doesn't involve much maths and only has stuff that one can remember before the exam and forget after writing it. Sometimes, I blame myself and other software developers for making things user-friendly so that any Tom, Dick and Harry can use them. Probably, this was started as a marketing strategy by Apple and spread widely by Microsoft [at least in the field of software].
Clearly this magic fairy dust stuff works, as your post is at the top of the 'Hot' list in my RSS reader. ;)
Excellent post. I think the three main factors are:
- wide availability of open source ML libraries (Python and Java are well represented as far as I know)
- little or no understanding of the mathematics or basic workings behind the models...
- the hype and the buzz word effect
Fairy dust, wishful thinking, fools' errands, 'computer science', and 'machine learning' are all, for working productively with 'information', the same -- junk.
E.g., how are we to have confidence in the quality of the results from fairy dust? We can't. And similarly for wishful thinking, fools' errands, 'computer science', and 'machine learning'.
There are some steps. Here are the steps:
Step 1.
Start with a 'real problem' you want to solve. So, for a connection with 'information', you want some information to provide a 'real answer' to solve your real problem.
Step 2.
Look at your real problem and see what data you have or can get and what mathematical assumptions you can make about that data.
Step 3.
With some guessing, and maybe some iterating, hopefully with some educated insight, formulate the real problem as a mathematical problem.
Step 4.
Using what you can assume about your data, get a mathematical solution to your mathematical problem. Strictly, this solution is in the form of theorems and proofs where at the beginning your assumptions about your data provide the hypotheses of the theorems and at the end the conclusions of your theorems provide the mathematical solution.
Step 5.
Write some software to manipulate the input data as specified in your mathematical solution to get output data that is the basis of your real solution.
Step 6.
Usually the output data from your work will require some 'interpretation' to be converted to the real answer and the solution to the real problem you want.
Those are the steps.
Nowhere did I mention 'computer science' or 'machine learning', nor should I have.
You omit steps from the list above at your peril: If you omit steps, then your output should be labeled Snake Oil with a prominent skull and crossbones and XXX.
In particular, there is no effective 'software', or 'software learning', or 'machine learning' to replace the steps above.
For the steps above, the core is the theorems and proofs, and these make a strong 'logical chain' from your input data and assumptions to your solution. Break the logical chain, and you have nonsense.
This is how to work with 'information'. It's not 'computer science'. Instead it's applied math.
The mere fact that there is some computing in one of the steps and that computer science studies computers does NOT mean that computer science has the crucial knowledge for such work. Flatly, the field of computer science does not have that knowledge. Again, the work is applied math, not computer science.
But, then, don't expect the computer science people to rush to explain to you their lack of knowledge of applied math. And, sadly, don't expect the pure mathematicians to be much interested in your applied math.
Here endeth lesson 101 in how to turn real data into real information for fun and profit.
I was too lazy to think of a good response, so I fed your post into my latent semantic analysis and response optimization algorithm and asked it to produce a response that maximizes obtuseness. Its final solution was:
"I agree: To maximize the probability of winning the Mozilla-Knight challenge, you should maximize your application's word-correlation with the normative linguistic word-set of the software-journalism complex."
@Norm: Computer science is a branch of mathematics. For example, at the University of Waterloo, the School of Computer Science is part of the Faculty of Mathematics.
Whoa, some great (and some weird :-)) responses here. Thanks all!
I'm probably more lenient than some of you guys when it comes to accepting somebody's ML-related project ideas. My criterion is: have they done at least a single project where they did or had to use some machine learning? That's usually plenty for somebody to see the basic strengths and limitations of the approach.
Ditto for NLP: you don't need a lot of experience but you need a little bit to see that some things are fairly easy (autoclassification or sentiment analysis on a large corpus with limited accuracy, for example) and some things are nearly impossible, even though they might look like exactly the same problem to an untrained eye.
And actually, that applies to computer programming in general, and it's why working under a non-technical manager can be so frustrating.
@Geoff: of course I'd love to see some innovation in the information/news sharing sphere, but it's a really difficult market to conquer because it depends so heavily on network effects / adoption rates. Especially as journalist-hackers, we have to be more pragmatic and look for the quick wins. Because there are a lot of quick wins.
@Troy: "Computer science is a branch of mathematics. For example, at the University of Waterloo, the School of Computer Science is part of the Faculty of Mathematics."
An organization chart does not mathematics make. Only a little of 'computer science' can claim to be applied math, and that part is generally quite low quality applied math. The profs didn't take the right courses in grad school.
But at Waterloo, the department of Combinatorics and Optimization is definitely applied math. There, tell Cunningham Norm said "Hi!".
@Norm: Your tone is less than inviting for a discussion, just something to consider.
That being said, I certainly empathize with the motivations for your position, but I think you're making things unnecessarily binary. If you consider the number of people outside of computer science who now do computer programming compared to 30 years ago, you'll realize the field of computer science has to evolve. Computer science, as an evolving field, increasingly blurs the line between computer science and statistics and applied math. Whatever you want to call it, the trend for computer scientists to work with information has them taking these "applied math" courses that you want to claim are not their place - but if the bulk of computer scientists are doing it, how is it not computer science?
Additionally, I do believe that it is impractical, due to the scale of the problems, to think of these techniques outside of the computing context, and I do believe computer scientists are in a position to make unique contributions here. Similarly, while massaging of data is always needed, there are places where the automation of this can and should be done. To achieve this, you simply have to develop a mathematical success metric for data preprocessing and run your preprocessing across several design parameters. I hope you consider what I said, cheers!
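To make the idea of "a success metric plus a sweep over preprocessing parameters" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy comments and labels are invented, the three parameters (`lowercase`, `drop_stopwords`, `min_len`) are hypothetical design choices, and the word-overlap classifier is a deliberately crude stand-in for whatever model would really be used. The metric is simply held-out accuracy.

```python
from itertools import product

# Toy labeled headlines, invented purely for illustration.
TRAIN = [
    ("The Budget vote is delayed again", "politics"),
    ("Senate passes the new budget", "politics"),
    ("The striker scored twice in the final", "sports"),
    ("Coach praises the team after the final", "sports"),
]
HELD_OUT = [
    ("BUDGET talks stall in the senate", "politics"),
    ("Team wins the final", "sports"),
]

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "in", "a", "is", "after"}


def preprocess(text, lowercase, drop_stopwords, min_len):
    """Turn raw text into a set of tokens according to the design parameters."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return {t for t in tokens if len(t) >= min_len}


def accuracy(params):
    """Success metric: held-out accuracy of a crude word-overlap classifier."""
    vocab = {}
    for text, label in TRAIN:
        vocab.setdefault(label, set()).update(preprocess(text, **params))
    correct = 0
    for text, label in HELD_OUT:
        tokens = preprocess(text, **params)
        # Pick the label whose training vocabulary shares the most tokens.
        guess = max(sorted(vocab), key=lambda lab: len(tokens & vocab[lab]))
        correct += guess == label
    return correct / len(HELD_OUT)


def sweep():
    """Score every combination of preprocessing parameters."""
    grid = {
        "lowercase": [False, True],
        "drop_stopwords": [False, True],
        "min_len": [1, 4],
    }
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((accuracy(params), params))
    return results


results = sweep()
best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)
```

Even on this tiny example the score depends on the preprocessing choices, which is the point of automating the sweep. It also shows why "simply" is doing a lot of work: someone still has to choose the metric, the parameter grid and the held-out data.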
well put.
The problem is all wrapped up in how funding / opportunities are distributed. The folks with money who want to affect the future positively are generally not brilliant in any domain other than making money. So they consult "experts" as to what is the "next big awesome thing" - or as you put it, what is the "next fairy dust". Once an expert convinces a person with the money of the veracity of the particular brand of dust - it becomes dogma. And those who want money salute the "fairy dust" and make loud protestations as to the general amazingness of "fairy dust". They also shout down "non-believers" to show their undying loyalty to the fairy dust. They give invited keynote speeches about the future of dust.
By doing these things, they get funding. And because they are funded - the market assumes they must be right because no one would fund a "bad idea". Slowly public opposition to fairy dust goes underground until the funds are exhausted (usually having virtually no impact). Since everyone is so ashamed at the amount of waste in the name of "fairy dust", no one ever goes back and checks who was wrong and where they went wrong. It is just easier for the "dust riders" to lay low for a while and then re-emerge to flow to the next source of funding and "fairy dust".
Relying on social media sources (fb, etc.) to build news stories is part of the problem behind news media today. The reliance on such third-party gathering and rewording is not the advancement of the news, it's the Murdochism of the news.
Perhaps a rethink into how we interact with this content is more appropriate, and may lead to better systems for sifting.
Check out SoundCloud. The most distinctive thing about SoundCloud is that comments are targeted to a specific part of the song. Yet with text we always have comments at the end.
If we had the ability to rate or comment on each paragraph or section, a 'bot would have more refined control over what is important in an article, and how people 'felt' about it.
William Gibson, in Neuromancer, had the fairy dust of which you speak, asking for the 'net to build an essay on a gang, and summarize it when it was too general. We are nowhere near that yet.
Hey, does anyone here want to show me their ML skills for Jetsli.de, to improve the relevance of personal news ;) ?