Dan Conover’s been pretty prolific lately in updating his early thoughts about journalism and metadata in two big blogposts (one, two).
Lots of inspiring material in there.
[The] raw material of this information economy is essentially like oil shale: the latent value is obvious, but the cost of extracting these information resources from today’s existing deposits (think web archives) is so high given today’s technology that no one is going to spend a dime to start the project.
The oil shale analogy makes it clear how much the semantic web will depend on efficiency, convenience and turning a profit. I wholeheartedly agree. I’d also add, in a journalistic context: it will depend on tapping into people’s habits, instead of forcing new ways of doing things on them.
Oil shale. I like how down to earth it sounds. A nice contrast to, for example, Reg Chua’s The Molecules of News, which seems to assume that not caring about structured information is equal to leaving money on the table plain and simple, and that all we have to do is “sell the sawdust, not just the logs” as the 37signals proverb goes. It’s not that simple, I don’t think.
And that’s the thing: as much as it pains me to battle a kindred spirit, I’m afraid Dan doesn’t really see the full impact of the oil shale analogy on his own writings: something can be valuable and unreachable at the same time. Like, say, a semantic economy.
Dan’s writings remind me of a post Jonathan Stray did half a year ago, in response to my IA series: The world cannot be represented in machine-readable form. I never did agree with his criticisms, because I’ve always made it abundantly clear that I don’t want to represent the world in machine-readable form, and I don’t think we need to either. Dan, on the other hand, with his insistence on semantic annotations at the level of words and sentences, and on building monetizable databases of facts, comes dangerously close to attempting what artificial intelligence researchers have been trying to do for half a century, and have been failing at spectacularly.
A collection of sentences or single bits of information, stripped from their context, as a potential goldmine… I don’t know. I’d love it but I fear it’s a pipe dream. Words change meaning. Definitions can be disputed. Factual statements are made as part of a broader exposition, and may not make any sense in and of themselves. We write about things with varying amounts of certainty. Can we possibly hope to track all that and more in what Dan calls our directories of meaning? And still profit?
And does Dan truly believe that “if we give people user-friendly tools that provide authoritative access to facts, then over time we will isolate the less credible voices in society to rhetorical ghettos of their own construction”? That’s not how people work. That’s not how language works. That’s not how facts work.
What can we do, then, if we want to move beyond big blobs of text and if we want to get more value out of journalists’ hard labor; what to do if we want to provide readers something truly of the web?
Well, there’s actually a lot that could be done: more emphasis on structured news formats, tailored to video, reviews, blogposts, profiles, recipes and all those other types of content we’re currently all still treating as a generic story, even though they’re not. A clear separation between content and presentation. Rock-solid metadata (like taxonomies) at the story-level — not dissecting every little particle in every story you’ve ever written.
Both approaches wish to extract more value from journalism through structure and relationships. Both approaches have you trade a little hurt during content creation for yet-to-materialize advantages. That’s unavoidable — no such thing as a free lunch.
But what’s nice about the latter approach, the “Adrian Holovaty”-approach if you will, and sorely lacking from Dan’s thoughts, is that it’s a strategy with direct impact on what you can deliver to your readership. You don’t have to wait on a global marketplace of machine-readable information before your efforts start making sense.
Dan is essentially asking the publishing industry to take a huge bet, and I’m not sure he’s thought through all the implications of what he’s saying. He wavers between providing gorgeous amounts of detail, and glossing over important parts of his scheme by saying “yeah, some thingamabob will guide you through creating semantically annotated content, it’ll be a middleware that fits between anything you’re using, and it’ll just work”. Forgive me some pessimism, but it won’t ‘just work’.
You need a way from A to B. And a way to profit even if you’re not quite there.
Don’t bother with an ISO standard. Forget interoperability for a moment. You don’t need directories of single, unambiguous, authoritative representations. Instead, go for the big gains. Think about which parts of your content would benefit most from a little structure, ensure that applying that structure is part of a slick workflow, and then just go ahead and do it.
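To make "a little structure" concrete, here is a minimal sketch of what a structured format for a single content type could look like. Everything in it is an assumption for illustration: the `Review` type, its field names, and the helper functions are hypothetical and not taken from any particular CMS. The point is only that a review is stored as fields with story-level topics, and that presentation is a separate step.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    """A review stored as fields instead of one blob of text (hypothetical schema)."""
    headline: str
    subject: str   # the thing being reviewed
    rating: int    # stars out of 5
    body: str      # the prose itself stays prose; it is not dissected further
    topics: list[str] = field(default_factory=list)  # story-level taxonomy terms


def render_teaser(review: Review) -> str:
    """One possible presentation; the stored content is left untouched."""
    stars = "*" * review.rating
    return f"{review.headline} ({review.subject}, {stars})"


def by_topic(reviews: list[Review], topic: str) -> list[Review]:
    """Story-level metadata alone is enough to assemble a topic page."""
    return [r for r in reviews if topic in r.topics]
```

Nothing here needs a shared standard or a marketplace of facts to be useful: the same stored review can feed a teaser, a ratings overview or a topic page the day it is published.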

7 comments
Hi Stijn. As much as we may disagree over specifics -- my particular critique is with the idea that we can get much out of predicate-based representations of structured knowledge, when they seem equivalent to the symbolic logic GOFAI approach which has failed -- I'm very much with you in saying that the way journalists handle information has to change. You may be interested in the computational journalism reading list I just put together. Your IA series is in there, because it's really the best extension of Holovaty's notion that I've seen.
Hi Jonathan. I actually saw your reading list earlier today — great stuff, and a fair number of nuggets in there I hadn't spotted yet. I've yet to fully digest it, though :-)
Stijn,
Hi. I don't think we actually disagree that much. I don't want a huge, interoperable taxonomy, or dissect every word of every story into a free-floating particle. But at the same time, I do think journalism needs to change its habits and practices, if only to adapt to the new possibilities online - even absent a profit motive.
Extracting relationship data is easy. Changing date references could start tomorrow. Writing to structure, à la PolitiFact, has already been done. All of this involves changes to the process - "a little hurt" is a nice way to put it - but doesn't explode context or meaning; in fact, in some ways, it enhances it.
And we're 100% together on your last point: Figure out the product that needs structure, that can be effectively pulled off by the newsroom, and go for it. Make mistakes. Iterate. And if it works, spread it, either to an adjacent part of your newsroom, or perhaps to a partner organization.
I do think it has to be quasi-organic, or it can't really grow, or we can't really learn. Dan's goals are great - but it's a tad ambitious.
(Jonathan - haven't got to your reading list yet, but I will!)
It's time for a moon mission, boys.
Insightful stuff, Stijn.
One thing to consider, too, is that people's habits can and do change.
For example, my habits years ago are hardly comparable to today's. That applies very much to how I, and even my family, consume things.
So while you don't want to, and shouldn't, force anything on anybody, I think it's about finding the things people want, whether or not those are related to current habits.
Hi Stijn, longtime listener, first-time caller.
I think some of your criticism here raises valid points, while others go astray. I'm glad to discuss them, though.
I wrote the first essay because I wanted to build a cognitive box to put my stuff in. So you'll get no push-back from me on your critique that the semantic economy essay is vague in its connecting pieces. The second essay is less so, but never mind that for the moment.
Some of your disagreements are valid questions, while others are misinterpretations (although I suspect they are common misinterpretations ... I'm still working on that). Here are the areas of disagreement that I find interesting:
What level of definition is useful? Great question, and a sticky matter to which I don't have definitive solutions.
My answer is, it's hard to imagine the optimal level of useful definition of meaning without having experimented with the form. As I've tried to suggest, I think the operational art of inline semantic tagging will be something we'll get better at doing as we do it. My idea is that we need a tool that leaves the level of detail to the user, rather than defining it for all users. A scientific publisher will need an approach that's likely quite different from a local weekly, or a national women's magazine, and so on. In doing that you create the opportunity to go hog wild with anal-retentive annotation, but I doubt that will become the emergent standard.
Is a collection of sentences stripped of their context a potential goldmine? No. It's probably worth pointing out here that some facts are more valuable than others, and that the facts that we would choose to emphasize will likely be determined by market value. Now, am I saying that by organizing information in these ways, using these tools, we'll create new products? Yes. And new markets for things. And new ways of doing things that we do today, only not particularly efficiently or accurately. I'm comfortable with these ideas, even though you're not the only person to interpret my writings this way.
Exactly. Words DO change meaning. Definitions WILL be disputed. Life is messy, meatspace is poorly organized, and not everything lends itself to precise modeling. From such chaotic dynamics we get both progress and confusion. Yet this is exactly why I like my idea. It allows for all of these things. In fact, it's based on this understanding of the world. Not one directory of meaning -- many. Only let's make those directories so that developers can write useful scripts that make use of all of them.
Isn't it better to say what a thing means when you say it than to spend your life constantly defining it for every person who misinterprets it, whether honestly or maliciously? Isn't it better that when definitions are debated, we know what is in question? In my second essay I wrote subpages that defined what I meant by the terms I use, because some of them are invented for these ideas. As my understanding of my ideas improves, I'll likely refine those definitions. That's clearly pretty valuable in some circumstances, although I wish I had a better way of doing it.
Others are not so clear. Do I need to define "democracy" every time I use it? God, I hope not. But how about "socialism?" That's where it gets trickier, since American political discourse is clearly stuck on what that word means. And this is where I think we have to assume that a bottom-up customization of semantic expression, based on the goals of the creator/writer, will develop operational standards that will be more useful than any one-size-fits-all solution promulgated by the W3C. This is, by the way, why I don't like referring to this idea as being a "Semantic Web" idea.
re: broader exposition. As you know, there are some things that have general value (a series of tags describing the content of this discussion, for instance) and other things that have specific value (the street address of my favorite bar has value x... add the grid coordinate to that address and the value is likely higher). My assumption is that we'll tag the things that have specific value all the time and tag the things that have general value some of the time, and that the more we use these tools in our various niches, the better we'll get at figuring out what's worth the effort and what isn't. And some things won't benefit from markup at all. Take erotica, for instance: metadata tagging the object will suffice. It's hard to imagine how inline markup would add any value.
re: Varying degrees of certainty. ABSOLUTELY. All new knowledge is provisional, and if you're not afraid of abstraction, so is most old knowledge. Someone using the tools I describe to tag breaking news would be facing a very different reality than an investigative reporter working up a six-month project. I've done both things, and it's really easy for me to imagine the ways that I'd use the same tool differently for each task. But I've imagined those workflows, and I'm satisfied that I can describe them in useful detail.
re: managing all this via directories of meaning, profitably. Obviously, yes, I think it is possible, even likely. To clarify, I don't think that all communication will work this way. In fact, I think that most communication WON'T take place within these tools and processes. But where there is value, yes, I think organizations will choose to use these tools and techniques. I also think that there are ways this will be immediately valuable, and that the number of ways that value is created will expand over time.
re: the rhetorical ghettos thingy. We just disagree on that. The sentence before the one you quoted was "Debates will continue, as they should." People will still disagree, and some people will always believe the silliest things imaginable. But I think it's possible to improve human discourse, I think there's value in it (both tangible and intangible), and I think there's progress to be made. So color me optimistic.
Where you go farthest astray is in thinking that these tools would require a fully functioning semantic economy to work, or that there are no interim applications, or that I've merely black-boxed the functions that would make these tools work. That's not what I believe, or what I wrote. In fairness to you, I'm not showing all of my work in these public essays, so there are parts you haven't seen. In fairness to me, I think you've found conclusions in my work that I didn't intend.
Like you, I think there is much to be done with structured info, semi-structured info, structured formats, metadata, etc., and I include tools for managing and customizing these approaches in my concept of an SCMS. But while you dismiss the idea that there's value below the document level, I see value there. You are satisfied with the gains that can be made in improving existing processes, and I applaud you in that. I see that as a good use of time, but not a game-changing advancement. And frankly, I went out looking for game-changing ideas, because I'm a journalist, and the game we're in is running down rapidly.
I understand all of this is a leap, and I anticipated that most people would either disagree or disagree passionately (or, to be honest about it, not care one bit). I understand that it's quite possible that I'm wrong. I understand that I am asking the industry to make a big bet (it's not as big a bet as you described, but to go into that would be to quibble). But I think it's a good bet, not a wild Hail Mary. In fact, in terms of risk and reward, I think it's an excellent bet.
You think otherwise, and I don't begrudge you your opinion.
Best of luck with Camayak, btw.
Hi Dan,
I agree that there's value below the document level. After all, the point of structured news formats is to properly store each individual piece of a story in its proper format, rather than as a big blob. The real debate is fields vs. sentences, rather than document vs. sentences.
With regard to your other remarks, I guess we'll have to find out. I sympathize with the fact that you don't want to spill the secret sauce quite yet, but it does mean you'll have to live with the scepticism of me and many of our ilk for a while longer.
I can't really respond to statements about what you think and feel is the case — you feel that there will be short-term applications, you think there's money to be made from directories even taking into account all caveats, you think you've figured out the workflows — other than to say that my imagination does not stretch that far, so I'll know it when I see it. Mind you, that's probably how some readers feel about my work too, so we're in the same boat there :-)
Apologies for not getting back to you earlier.
Stijn