stdout.be

A blog about programming, information architecture and journalism


Tags don't cut it

Summary If we re-imagine tags as rich connections that relate content to the persons, organizations, locations, events and themes they talk about, hopefully magic will happen.

Tagging is a success story. Ten years ago, who would’ve thought that any regular Joe could and would be associating metadata with content? Of their own free will, at that! Collections that would’ve just stayed a random pile of content in the past, like the bookmarks on Delicious or the photos on Flickr, are now being organized by the magic of folksonomies, a showcase for the sheer power of the many.

Tags dovetail nicely with our almost random way of browsing the internet, for instance by being the engine behind great topical pages for blogs that provide a gateway to any and all similar content written in the past. It’s metadata for the masses.

And so the human species does what we’ve always done when we see something that we like and that seems to work: we copy it. We’ve now got labels in Gmail and in our customer relationship management software, tags in iPhoto, hashtags on Twitter and there is tagging going on just about anywhere else where we want some order in what would otherwise be a big blob of undifferentiated content. And the world is better for it.

But because we’ve copied, thinking that if it’s good enough for Flickr, it ought to work for us too, we’ve forgotten what makes tags great, how they can add value and why they work when they work. We’ve forgotten that tags are just one way of bringing order to chaos.

Order to chaos

In the news industry, we desperately need to bring order to chaos. Amy Gahran says that today’s journalists should move away from their exclusive focus on content production ‘toward providing layers of journalistic insight and context’. A piece of news just doesn’t generate enough value on its day of birth to be worth the expense. Yet a day after publication, most content disappears in the depths of our websites, only to be found when a reader explicitly searches for it, if even then.

Each story could function as part of a web of knowledge around a certain topic, but it doesn’t.

So here’s a well-intentioned idea you’ve heard before: journalists should start tagging. Jay Rosen insists that getting disciplined and strategic about tagging “may be one way professional journalism separates itself from the flood of cheap content online.” Tags can show how a news article relates to broader themes and topics. Just the ticket.

But too often, tags don’t live up to their promise. A techie at The Onion mentions that a while after they started tagging stories at their satirical news website, they had ended up with 20 or so different versions of “Bush Administration”, “George Bush”, “George W. Bush”, “Bush”, “bush” and so on, leading him to conclude that tags are evil:

The whole purpose of tags is to relate one piece of content to another, and if the relationship surface is spread out in such a way that we have 10 pieces of content tagged “bush administration” and 10 tagged “george bush” and the groups are orthogonal, even though they’re obviously related by tag name, the data indicates that they are completely different, and the goal of relating things is dead and someone has to go and look at all 10000 tags and decide which ones are thematically the same and remove the duplicates.

Getting strategic about contextualizing our content and bundling it into meaningful dossiers is definitely the future, Rosen is right about that. But such a strategy requires a way of indicating how content relates to other content on our website and on other websites that is more powerful and more expressive than tags. And let’s add another requirement to that challenge: we need this new way of relating content to other content to be easy to use and comprehend even for people who are not trained librarians. Simplicity is the sole reason tags work (when they work) so we can’t lose that.

But we do need something new. Tags are not the right way of disclosing the rich interconnections between news articles and not the right way of packing them together. Getting strategic about tagging means abandoning tagging and searching for something better.

Tagging on Flickr and Delicious works because the masses make it work by their aggregated behavior. Tagging on your personal blog works because you only want to bunch content together that is somewhat related and don’t really care about fancy-schmancy “relationships” between your content. Your readers don’t, either. They just want to read some similar stuff. But it’s different for news organizations.

We can do better

Let’s fix this.

What should we demand from a good categorization scheme?

  1. It should allow us to really talk about our content. One thing that upsets me about tags is that you can’t provide any information beyond ‘this has something to do with that, but I can’t tell you what exactly the relationship is and neither can I tell you what the thing is that it relates to’.
  2. It should be easy to tag our content consistently. I hate careless tagging on news websites. If you’re not doing it right, don’t bother doing it at all. Me and my fellow readers are better off just using a search engine to find what we’re looking for.

As an addendum to (2): it should be obvious what we should tag and how we should tag it. Tagging is confusing for me, because half of the time I don’t know when something qualifies to be a tag or how I should tag. Is it relevant enough? Do we tag locations or just people and themes? Singular or plural form? Ehhhh.

We need a scheme that combines some of the lessons we can learn from information architects with the simplicity of tags. Here are the steps we can take to make things better.

Step 1: vocabularies

There’s a very simple step we can take to make it obvious to reporters what kind of things they are supposed to be adding as keywords to their articles. Split up tags by type, each with their own input field. Some people call ’em vocabularies, others call them dictionaries, but you get the picture.

Drupal makes it easy to split up tags and categories into distinct vocabularies.

The New York Times has dictionaries for subject descriptors, persons, geographic locations and organizations. That’s pretty encompassing, but allow me to add events to that list.

Themes or topics, persons, locations, organizations, events. If we keep these five vocabularies in mind, they’ll serve as a constant reminder that we need to make sure that we keyword all the persons that are involved, that locations are valid keywords and that we should add them, and so on.

Let's replace tags with the things tags actually stand for: people, places, themes, events and organizations.

Vocabularies are a basic necessity for even trivial things like automatically compiling an index of all the people that a site has content on. You can’t provide an index of people if your website cannot discern persons from themes from locations.
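To make that concrete, here is a minimal sketch of typed keywords and a people index built from them. The class names, the `build_index` helper and the sample articles are made up for illustration; they do not come from Drupal, the New York Times or any particular CMS.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Vocabulary(Enum):
    """The five vocabularies discussed above."""
    PERSON = "person"
    ORGANIZATION = "organization"
    LOCATION = "location"
    EVENT = "event"
    THEME = "theme"


@dataclass(frozen=True)
class Term:
    name: str
    vocabulary: Vocabulary


def build_index(articles: dict[str, list[Term]], vocabulary: Vocabulary) -> dict[str, list[str]]:
    """Map each term of one vocabulary to the titles of the articles that use it."""
    index = defaultdict(list)
    for title, terms in articles.items():
        for term in terms:
            if term.vocabulary is vocabulary:
                index[term.name].append(title)
    return dict(index)


# Hypothetical articles and keywords, purely for illustration.
articles = {
    "Budget vote delayed": [
        Term("Jane Doe", Vocabulary.PERSON),
        Term("City Council", Vocabulary.ORGANIZATION),
    ],
    "Mayor opens new library": [
        Term("Jane Doe", Vocabulary.PERSON),
        Term("Springfield", Vocabulary.LOCATION),
    ],
}

people_index = build_index(articles, Vocabulary.PERSON)
print(people_index)
```

Because every keyword carries its vocabulary, the same function can produce an index of locations or organizations without any guessing about what a given string refers to.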

We’ll handle subject descriptors (themes, topics, or whatever you want to call them) in a follow-up post scheduled for later this week. Themes often tie in with site navigation, which is tricky. Themes also come with their own knotty problems because, unlike persons or events, they’re not thing-ish, they’re bundles. Don’t worry, themes and topics will get their own treatment in a day or two. Let’s first tackle the other vocabularies.

Step 2: abolish tags, state relationships

Tags are vague. They’re a very primitive way of spelling out how things relate to each other. A tag on a news article says “this article has something to do with this concept or thing”. But what exactly? A tag doesn’t tell you whether an article is a critique of a person, an interview with a person or whether it just mentions that person in passing. A tag doesn’t even tell you if the reference to Samuel Adams is about the person or about the kind of beer (which is why we so desperately need vocabularies). A tag can’t tell the difference between an event that merely took place at the local café and an event that the aforementioned café actually organized.

It’s only a little confusing to us humans because we can guess the meaning of the tags from the text that they accompany, but computers can’t. We’re really going to need computers to aid us in bringing context — say, for example, listing all the articles where Apple was lauded versus those where the company was criticized.

A few possible relationships between a story and a person.

With just a little more effort, it’s easy to specify real relationships. Let’s not reinvent the wheel. We’ll use the triples our semantic web buddies are so fond of:

<article> critiques <organization>
<article> contains an interview with <person>
<article> revolves around <event>
<article> follows up on <previous article>
<opinion piece> is a riposte vis-a-vis <other opinion piece>

The semantic web movement focuses on making the web understandable for computers, but this kind of metadata is useful for human readers as well. If I’m just skimming through an article and can’t be bothered to read it in its entirety, these relationships provide a pithy summary of what an article is about.

For example, the relationships for Clay Shirky’s The Collapse of Complex Business Models could look something like:

<analysis> is about <business models> (subcategory of <business>)
<analysis> is about <society>
<analysis> contains an example using <AT&T>
<analysis> criticizes <Rupert Murdoch>

Mkay, pithy might be overstating things a bit, but it’s a lot more insightful than what you get using plain tags.
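For what it is worth, the relationships listed above map directly onto a very small data structure. This is only a sketch: the `Relationship` tuple and the `summarize` helper are hypothetical names, and a real system would more likely store such statements in a database or as RDF.

```python
from typing import NamedTuple


class Relationship(NamedTuple):
    subject: str    # the article making the statement
    predicate: str  # how the subject relates to the object
    obj: str        # the person, organization, theme or article referred to


# The example relationships for the Clay Shirky piece, as plain data.
relationships = [
    Relationship("analysis", "is about", "business models"),
    Relationship("analysis", "is about", "society"),
    Relationship("analysis", "contains an example using", "AT&T"),
    Relationship("analysis", "criticizes", "Rupert Murdoch"),
]


def summarize(relationships):
    """Render each relationship as a one-line, human-readable statement."""
    return [f"<{r.subject}> {r.predicate} <{r.obj}>" for r in relationships]


for line in summarize(relationships):
    print(line)
```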

Relationships can also be used to make search queries more intelligent. Here’s an example. I do a search on Politico for Rush Limbaugh. But, you see, I don’t care about mentions of Rush Limbaugh, just give me the articles that cite or interview him. Impossible on any news site I know of, trivial to implement with the right relationship-based metadata.
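A rough sketch of why this becomes trivial once the metadata exists: filtering on the predicate is a one-line condition. The article titles, the predicate names and the `find_articles` function are invented for the example and do not describe how Politico or any other site works.

```python
from typing import NamedTuple


class Relationship(NamedTuple):
    subject: str
    predicate: str
    obj: str


def find_articles(relationships, obj, predicates):
    """Return the articles related to `obj` through one of the given predicates."""
    wanted = set(predicates)
    matches = []
    for r in relationships:
        if r.obj == obj and r.predicate in wanted and r.subject not in matches:
            matches.append(r.subject)
    return matches


# Hypothetical articles and relationships.
relationships = [
    Relationship("Radio hosts react to the vote", "mentions", "Rush Limbaugh"),
    Relationship("A conversation about talk radio", "interviews", "Rush Limbaugh"),
    Relationship("Pundits on the primaries", "cites", "Rush Limbaugh"),
]

results = find_articles(relationships, "Rush Limbaugh", {"cites", "interviews"})
print(results)
```

The article that merely mentions him is left out, which is exactly the distinction plain tags cannot make.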

Relationships can also enrich topical pages and assure that they are more than just link dumps that leave the reader figuring out how exactly the linked content relates to the topic at hand.

Tag pages ought to be more than just a plain and boring list of links. Unharvested potential there.

Step 3: entities, not labels

If we look at our vocabularies and ignore the topics/themes vocabulary for now, you’ll notice that these divisions actually represent real things. Persons. Organizations. Events. Locations.

Persons have first and last names, a birth date, they work for or are otherwise associated with certain organizations. People live somewhere.

Events happen at a certain place and at a certain time. They begin and they end. They might just be a point in time, or they can be planned happenings organized by a person, a group of persons or an organization. Events can recur weekly or monthly or yearly or randomly. Events can lead to other events.

A location has an address or at least refers to a geographical point or area.
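In code, treating these as real things rather than labels could look something like the sketch below. The field names are my own assumptions about what a newsroom might want to record, not a schema from any existing system, and the sample event and coordinates are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Location:
    name: str
    latitude: float
    longitude: float
    address: Optional[str] = None


@dataclass
class Organization:
    name: str


@dataclass
class Person:
    first_name: str
    last_name: str
    birth_date: Optional[date] = None
    organizations: list[Organization] = field(default_factory=list)
    residence: Optional[Location] = None


@dataclass
class Event:
    name: str
    location: Location
    start: date
    end: Optional[date] = None  # None means the event is a single point in time
    organizers: list = field(default_factory=list)  # persons or organizations

    def duration_days(self) -> int:
        """Number of days between start and end, or 0 for a point in time."""
        if self.end is None:
            return 0
        return (self.end - self.start).days


# Hypothetical event at a hypothetical location.
square = Location("Town square", 51.05, 3.72)
fair = Event("Book fair", square, date(2010, 4, 1), date(2010, 4, 4))
print(fair.duration_days())
```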

Persons, organizations, events, places and themes.

Here’s a cool idea: let’s actually treat persons and events and all these other vocabularies like real things instead of like labels. Tags are a concept that we’ve inherited from the physical world: a tag is a label that you attach to something. And as David Weinberger notes:

in the physical world a label has to be smaller than the thing that it labels. That’s why we call it _meta_data: data about data.

But we don’t need the arbitrary distinction between a label and the thing it labels on a website. Let’s unlock the full potential of our relationships by making them relationships between things.

That way, the tag page for an event can show a map of where it took place. You could automatically place recent events you’ve covered on a calendar. The tag page for a person can contain a biography, a timeline of significant events that this person was involved in and that you’ve reported about and so much more. And all this information can be gathered without human intervention, based solely on previously entered relationships, which gives you a good bit of automation without sacrificing quality.

Tag pages are fine as far as they go, but we have the ability to transform those rudimentary link dumps into valuable landing pages and content hubs that collect all the content for a person, an organization or an event. Pages that tie together a whole bunch of information on your website. Those topical pages can become a nexus and an alternative way of exploring what your site has to offer.

Lastly, when we’re thinking about entities, let’s not forget that news articles are a type of entity as well. We should be able to specify relationships to topics, be able to say that an article belongs to a series, and that an article follows up on (or has another relationship to) a previous story.

Step 4: relationship cascades

One of our requirements for a tagging system was that it should be easy to be consistent and that it should make sure that we don’t forget to make the necessary connections. Relationship-augmented content can help with that. The trick: relationship cascades.

A person usually has meaningful connections to other people, places, events and organizations. When a tag is no longer just a tag, it can tell an entire story.

Barack Obama belongs to the Democratic Party and he’s from Chicago. If we tag an article with Barack Obama, it’s likely that the article also has something to do with the Democratic Party. If we’ve specified that the article is about Obama, and we’ve specified that Obama is part of the Democratic Party, the system now has all the necessary information to suggest our article about Obama as a possibly interesting related read on the topical page for the Democratic Party, even if we didn’t explicitly indicate that link.

How best to exploit those cascades? I see two ways.

  1. Automatic suggestions for readers: okay, but not great because they lack a human hand.
  2. Automatic suggestions for editors, serving as a basis for the entry of relationships.

Suggested tags can improve the amount and consistency of our metadata.

If you tag an article with Barack Obama like we did above, a CMS can autosuggest a relationship to the Democratic Party, leaving it up to you to decide whether to add that relationship, but nonetheless providing a very clear suggestion.
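Here is a minimal sketch of such a one-hop cascade, assuming the entity-to-entity links have already been entered. The `suggest_relationships` function and the shape of `entity_links` are hypothetical; a real CMS would need to decide how many hops to follow and how to rank the results.

```python
def suggest_relationships(article_entities, entity_links):
    """Suggest entities one hop away from those already attached to an article.

    Returns (suggested entity, reason) pairs and leaves the decision to the editor.
    """
    attached = set(article_entities)
    seen = set()
    suggestions = []
    for entity in article_entities:
        for predicate, linked in entity_links.get(entity, []):
            if linked in attached or linked in seen:
                continue  # already on the article, or already suggested
            seen.add(linked)
            suggestions.append((linked, f"{entity} {predicate} {linked}"))
    return suggestions


# Links between entities, entered once and reused for every article.
entity_links = {
    "Barack Obama": [("belongs to", "Democratic Party"), ("is from", "Chicago")],
}

suggestions = suggest_relationships(["Barack Obama"], entity_links)
for entity, reason in suggestions:
    print(f"Suggest {entity!r} because {reason}")
```

Keeping the reason alongside each suggestion lets the editor see why something was proposed, which matters because not every article about Obama has something to do with Chicago.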

We can also present these suggestions directly to the reader. I’m not a big fan of anything that reeks of autotagging (we’ll talk more about that below), but a case might be made for it. The quality of our relationships might drop if we trust in the relevancy of cascades too much (not every article about Obama has something to do with Chicago, on the contrary). On the other hand the system becomes more forgiving of forgetfulness, improving its coverage and breadth.

By smartly cascading down relationship chains, using the relationships that you have specified, additional relationships can be suggested to readers. It would probably take some experimentation to get these cascading relationships just right, but they’re worth a look.

Step 5: finishing touches (synonyms and homonyms)

If we’re serious about creating a web of relationships on our website, we’d do well to follow a few best practices familiar to MLIS grads. Reqs that are boring but necessary. Like synonyms. We need a way to specify that NYU is equivalent to New York University.

Synonyms can be a help to editors because they don’t need to know the preferred term for an organization or location that’s already in the system. Editors should be able to enter any common name that describes the thing they’re thinking about, period. The system should be able to point a relationship to New York City, regardless of whether an editor just entered “Big Apple”, “NYC” or “New York City”.

Readers demand the same leniency that synonyms bring: they don’t and shouldn’t care about the idiosyncrasies in how you name things when they’re trying to find something. If somebody searches for Jon Stewart, they should be able to find the topical page for our beloved The Daily Show anchor just as easily as they would with the search query Jonathan Stuart Leibowitz.

Names that can mean more than one thing should point to Wikipedia-style disambiguation pages (but, in contrast to Wikipedia, these pages should be autogenerated, not manually constructed). And the system should be able to differentiate between Stanley Smith the soccer player, Stanley Smith the tennis player and Stanley Smith the eponymous shoe by Adidas. (Should be able to, not must. A database of shoes would probably add little value to the vast majority of journalistic endeavors.)
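Both requirements can be handled by the same lookup table, as the sketch below tries to show. The `EntityRegistry` class is hypothetical and deliberately simplistic: one match means the synonym resolved cleanly, several matches mean the name is a homonym and a disambiguation page could be generated from the result.

```python
from collections import defaultdict


class EntityRegistry:
    """Resolve any common name to the entity (or entities) it may refer to."""

    def __init__(self):
        self._by_alias = defaultdict(list)

    def add(self, preferred_name, kind, aliases=()):
        # Register the preferred name and every alias, ignoring case.
        for name in (preferred_name, *aliases):
            self._by_alias[name.casefold()].append((preferred_name, kind))

    def lookup(self, query):
        """Return every (preferred name, kind) pair that matches the query."""
        return list(self._by_alias.get(query.strip().casefold(), []))


registry = EntityRegistry()
registry.add("New York City", "location", aliases=["NYC", "Big Apple"])
registry.add("New York University", "organization", aliases=["NYU"])
registry.add("Stanley Smith (soccer player)", "person", aliases=["Stanley Smith"])
registry.add("Stanley Smith (tennis player)", "person", aliases=["Stanley Smith"])
registry.add("Stanley Smith (shoe)", "product", aliases=["Stanley Smith"])

print(registry.lookup("big apple"))      # one match: a synonym
print(registry.lookup("Stanley Smith"))  # several matches: needs disambiguation
```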

A disambiguation page on Wikipedia. Stan Smith can mean a lot of things.

In the really real world

Now what? I’ve presented a way of specifying rich relationships between articles and what these articles talk about. Relationships as I see them address the four important problems that tags have when used in a journalistic context. They provide a rich way of talking about your content. It’s metadata that says more than just “this has something to do with that”. Because you’re specifying relationships to things, it’s easier to remember what to link to and how to name those entities. And we can also specify relationships between articles, providing a better way of indicating previous reporting about a subject than the automated guesses we know as “related content” on most news websites.

That’s great. But I’m a firm believer in the mantra that ideas are useless — it’s the implementation and the nitty-gritty details that count. So we’re not done yet. We need to make sure that this way of tagging content (if you can still call it that) really kicks ass. In practice, not just in theory. Here’s the battle plan.

Step 1: make it easy

The first question that comes up in any entrepreneur’s mind when he sees the potential of contextualizing news and also sees the huge existing archives of content that are little more than plain text, is: can’t we automate this? After all, many of the relationships that we’d like to add to an article are ready to harvest from the text.

I mentioned autosuggestions/autotagging earlier. Names of persons can be spotted by cross-referencing body copy with Freebase and, if you’ve been entering relationships and entities for a while, with your own database of people. Same goes for organizations. Locations can be extracted using natural language analysis, e.g. using Yahoo! Placemaker. And we can add in links to other parts of the web using Apture, or even let Thomson Reuters’ Open Calais take care of some of our tagging needs.

We should take a very pragmatic approach to these tools. If you’ve tried them, you’ve seen that automated discovery tools like Open Calais just aren’t all that good. They will get better, but they’re not magical and they can’t see inside a journalist’s head. But what these automatic scans of your content do very well is give suggestions for relationships. As a reporter or an editor, you can then choose to flesh these out, use them as-is or ignore them. A bit like how suggestions work on Delicious, as shown above.

Autotagging is close to worthless on its own, but not if you use the input of e.g. OpenCalais and Placemaker as suggestions for relationships rather than as the final output.
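As a rough illustration of the "suggestions, not final output" idea, the sketch below scans body copy for names that are already in your own database. It is a stand-in only: it does not call Open Calais, Placemaker or Freebase, and the `suggest_entities` function and the sample data are invented for the example.

```python
import re


def suggest_entities(body, known_entities):
    """Suggest known entities whose preferred name or alias appears in the text.

    `known_entities` maps a preferred name to a list of aliases. The result is
    a list of suggestions for an editor to accept or ignore, nothing more.
    """
    suggestions = []
    for preferred, aliases in known_entities.items():
        for name in (preferred, *aliases):
            # Whole-word, case-insensitive match on the name.
            if re.search(rf"\b{re.escape(name)}\b", body, flags=re.IGNORECASE):
                suggestions.append(preferred)
                break
    return suggestions


known_entities = {
    "New York University": ["NYU"],
    "New York City": ["NYC", "Big Apple"],
    "Chicago": [],
}

body = "The mayor spoke at NYU on Tuesday about housing in the Big Apple."
suggested = suggest_entities(body, known_entities)
print(suggested)
```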

Those tools usually suggest related tags, or entities in our case. But we also have the relationships themselves to think about (a is an overview of x, b is a review of y, c contains an interview with z). To be useful, they have to be consistent too. We can’t bundle together all the articles where the mayor of San Antonio is asked for his opinion if one reporter enters that data as this article interviews the mayor, another as this article contains a few soundbites from the mayor and yet another one codifies it as this article quotes the mayor.

We’d best stick to a few agreed-upon relationships for the majority of the cases. We should provide the means of adding new predicates, but gently guide reporters to a basic set of relationships.

A good UI for entering relationships should come with sensible defaults, first row, yet allow users to specify their own predicates, second row, if what they're trying to say is not on the list.

If we really do need the detailed relationships above (e.g. the difference between ‘interviews’ and ‘contains soundbites’), it’s not impossible. We’d just need a way of specifying which relationships are roughly equivalent. And the right mechanisms in place to make sure that every once in a while editors comb through relationships and tidy them up a bit.
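One possible way of recording which relationships are roughly equivalent is a table that maps detailed predicates to a broader one. The predicate names and the choice of "quotes" as the umbrella term are assumptions made for this sketch.

```python
# Detailed predicates on the left, the broader predicate they count as on the right.
CANONICAL_PREDICATE = {
    "interviews": "quotes",
    "contains soundbites from": "quotes",
    "quotes": "quotes",
}


def normalize(predicate):
    """Return the broader predicate, or the predicate itself if none is known."""
    return CANONICAL_PREDICATE.get(predicate, predicate)


def articles_quoting(relationships, person):
    """Find every article in which `person` gets to speak, however it was entered."""
    return [
        article
        for article, predicate, obj in relationships
        if obj == person and normalize(predicate) == "quotes"
    ]


# Three reporters, three ways of saying roughly the same thing (hypothetical data).
relationships = [
    ("Council approves budget", "interviews", "the mayor"),
    ("Flooding downtown", "contains soundbites from", "the mayor"),
    ("New stadium plans", "quotes", "the mayor"),
    ("Election preview", "mentions", "the mayor"),
]

print(articles_quoting(relationships, "the mayor"))
```

The detailed predicate is still stored, so nothing is lost; the table only decides which relationships get bundled together when someone asks a broader question.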

As with curating plain ol’ tags, you’ll have to weigh whether the added detail justifies the added work.

Step 2: make it maintainable

Which brings us to the second crucial issue we need to address if a relationships-based web of content is to be successfully implemented. It has to be maintainable.

Any good implementation of relationships in a CMS should be able to merge different terms or entities. It should give a birds-eye view and suggest possibly problematic content, using symptoms such as too few relationships or a lot of autosuggested relationships that are not being used. We can make it as painless as possible, but structuring your content as a web of relationships isn’t and can never be maintenance-free. This is just as true for plain tags, as they found out over at The Onion.
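Two of these maintenance chores are simple enough to sketch: merging duplicate entities and flagging articles with too few relationships. The function names and the threshold are hypothetical, and the relationships are plain (article, predicate, entity) tuples for brevity.

```python
def merge_entities(relationships, duplicate, keep):
    """Point every relationship that targets `duplicate` at `keep` instead."""
    merged = []
    for article, predicate, obj in relationships:
        triple = (article, predicate, keep if obj == duplicate else obj)
        if triple not in merged:  # merging can create exact duplicates; drop them
            merged.append(triple)
    return merged


def flag_sparse_articles(articles, relationships, minimum=2):
    """List the articles that have fewer than `minimum` relationships."""
    counts = {article: 0 for article in articles}
    for article, _predicate, _obj in relationships:
        if article in counts:
            counts[article] += 1
    return [article for article in articles if counts[article] < minimum]


# Hypothetical data in the spirit of the duplicate tags described at The Onion.
relationships = [
    ("Story A", "mentions", "George Bush"),
    ("Story A", "mentions", "George W. Bush"),
    ("Story B", "critiques", "George Bush"),
]

merged = merge_entities(relationships, duplicate="George Bush", keep="George W. Bush")
print(merged)
print(flag_sparse_articles(["Story A", "Story B", "Story C"], merged, minimum=1))
```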

Although relationships present a bit more work on the part of the journalist or editor entering the content, maintaining the consistency and quality of the relationships should actually be easier than with plain free tags like the ones you see in Wordpress and just about any other CMS out there.

A lament that I’ve often heard from journalists is that it’s hard to know what they should enter in a tag field. Keywords can be anything. What makes a keyword a keyword? What if a keyword would be relevant but really not at all central to the content of the article? Relationships to real entities give a natural answer to these questions.

It’s pretty evident that you’ll want to add references to persons where it says “add related persons to this article”.

Tags are vague as to how the content should relate to the tag, and to the importance that that relationship entails. Relationships aren’t. You can enter a relationship from an article to an organization which specifies that an article discusses momentous events for but just as well that an article just mentions in passing or mentions as an example a certain company. The question “is this important enough to tag” doesn’t arise. One less thing to worry about.

Real relationships make explicit what tags implicitly suggest. Because relationships are so explicit, they’re easy to grasp and easy to enter correctly and comprehensively.

Step 3: make it worthwhile now

Introducing tagging into a newsroom, whether the traditional free tags or the relationship-infused ones I’m proposing, is a cultural issue. The challenge is getting authors and editors to see tagging as an enrichment of their content, rather than a nuisance that is being forced upon them from above.

If you’re the big chief at your organization, you could make it a policy that journalists should tag their content if they expect to get paid. But if their heart isn’t in it, journalists won’t do it consistently and the value of your web of relationships drops.

The utility of tagging is most apparent in the long term, with hundreds or thousands of interrelated articles, but we need some enticement now to bridge that gap and sweeten the deal.

Luckily, relationships to real entities can be the foundation for a lot of added value in the here and now.

Because a location isn’t just a tag, but has an actual geographical area associated with it, a good content management system can easily make a map of all the locations that are mentioned in an article or in a series. I write the code that puts locations on a map once, and from that point on, adding a map to an article becomes as simple as clicking ‘gimme a map!’ in my backend.
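The "write it once" part could be as small as the sketch below, which turns an article's locations into GeoJSON that most web mapping libraries can display. The `locations_to_geojson` function is a hypothetical helper and the coordinates are approximate, included only to make the example runnable.

```python
import json


def locations_to_geojson(locations):
    """Turn (name, latitude, longitude) tuples into a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            # GeoJSON puts longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
            "properties": {"name": name},
        }
        for name, latitude, longitude in locations
    ]
    return {"type": "FeatureCollection", "features": features}


# Locations related to one hypothetical article, with approximate coordinates.
article_locations = [
    ("Chicago", 41.88, -87.63),
    ("Washington, D.C.", 38.90, -77.04),
]

geojson = locations_to_geojson(article_locations)
print(json.dumps(geojson, indent=2))
```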

Specifying a relationship to a person means a little inline biography visible alongside the article is only a click away.

Events can easily be bundled into elegantly styled timelines.
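Leaving the elegant styling aside, the data side of a timeline is little more than a sort. A minimal sketch, with made-up events and a hypothetical `build_timeline` helper:

```python
from datetime import date


def build_timeline(events):
    """Return one line per event, ordered chronologically."""
    ordered = sorted(events, key=lambda event: event[1])
    return [f"{when.isoformat()}: {name}" for name, when in ordered]


# Hypothetical events that articles have been related to.
events = [
    ("Budget approved", date(2010, 3, 15)),
    ("Budget proposed", date(2010, 1, 20)),
    ("Public hearing", date(2010, 2, 8)),
]

for line in build_timeline(events):
    print(line)
```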

Taking things one step further, if you provide topical pages for locations, persons and organizations like the NYT does, a healthy dose of vanity will entice journalists to relate their articles to the right topics, persons and organizations. It’s the only way they can get their work on those topic pages. Since it’s such an easy way for journalists to increase the shelf life of their work, you’d think they’d make the effort.

Topic pages are a wonderful way to go beyond daily reporting. They can also be an enticement to reporters who are wondering why they should care about all this tagging and relationship nonsense.

Beat reporters will need even less encouragement to make use of relationships. Relationships can serve as an internal tool to keep track of a beat. Reporters can keep an overview of their field and share that knowledge with fellow reporters. We’re not there yet, but hopefully we can someday get to the point where a content management system stops being a mere tool to publish to the web, and becomes a digital notebook that aids journalists in their work.

In summation

Tags are labels without any context. Tags are vague, and it’s difficult to tag content consistently.

We need to re-engineer tags so that they’ll allow us to represent the rich relationships between our content and the things that content talks about. If we do that, newspapers can infuse the news with necessary context that allows readers to see the broader picture. Quite literally, too: relationship-infused content can easily be enriched with maps and timelines, which goes way beyond what tags have to offer.

Tags have a deceiving simplicity that hides their complexity as a taxonomic concept. Relationships are closer to the way journalists think about their writing. Relationships are a direct answer to the question “what is this story about?” Because they’re more intuitive than tags, they’re actually harder to mess up.

If we re-imagine tags as rich connections that relate content to the persons, organizations, locations, events and themes they talk about, hopefully magic will happen.


26 comments

First and foremost - Hooray! The OpenCalais team is in violent agreement with the vast majority of what you have to say. "Tags" in and of themselves are fundamentally a dead end. While they may have served as an adequate transitional tool - they absolutely don't serve the purpose of interoperability and understanding of the underlying content assets. It's a chunk of text - maybe a structured chunk of text - but just a chunk of text.

OpenCalais is by no means a complete solution to the challenges you're stating - but we're trying really hard to get there. A few examples:

With the exception of some categorization we don't "tag" anything. We provide information about entities, facts and events in your text and do our best to attribute them well. If we tell you we found a "thing" we'll also tell you it's a company or a person or a product or whatever.

Beyond that are facts and events. If we say an event happened we'll tell you it was an event of type "Management Change" with the attributes of people, positions and companies affiliated with it. And we'll present that whole relationship between people, places, things back to you so the context and relationships are made explicit.

And - where possible - we'll link these things back to Linked Data locations that let you know more about those things - for example with companies we'll provide the company fundamentals and tell you where you can find more information on DBPedia or elsewhere. You can play with these capabilities at http://viewer.opencalais.com to see how far beyond tagging we've moved.

But - we still have some big challenges in front of us. In particular taking a general entities, facts and events model and translating it into a vocabulary that's specific to news. Until this is completed individual organizations can do great things - but leveraging content assets across organizations will be impossible.

Right now we're seeing a proliferation of "tagging" capabilities out there but - with few exceptions- they are dead ends. Unfortunately they provide an initial rush of feeling that you're achieving something and defocus organizations from dealing with the real issues.

Thanks for the insightful article. Maybe we can get some energy and focus going moving beyond tagging solutions into true content description capabilities.

Regards,

Do you publish, manage or present news content? Then you must read “Tags don’t cut it” by Stijn Debrouwere http://bit.ly/dnlxUX

This comment was originally posted on Twitter

Stijn Debrouwere

Thanks for taking the time to respond, Tom, much appreciated.

Leveraging content across organizations is an issue I haven't thought about much. The success of Linked Data is very much dependent on how useful metadata (like the relationships I discuss here) is for individual organizations, so I've purposefully avoided talking too much about what could happen if more people started systematically and semantically annotating their content and making that information available through RDF(a). If it makes economic sense and the infrastructure is in place, it'll happen sooner or later anyway.

I guess, though, if more organizations would link up their data to Freebase / DBPedia like the New York Times does, that could act as a standardized way of identifying who or what a resource talks about.

What has kept me from really diving deeply into Calais is mostly that it doesn't work on Dutch-language content, which is what I deal with professionally. But as I said in my post, while a service like OpenCalais can't replace manual contextualization by journalists, I don't doubt that it can play a useful part in a broader strategy. I don't think the service should try to address all the challenges I state, but it's really cool that you guys are thinking about these broader issues.

Gerd Kamp

As somebody that works now for 10+years at news organizations (the last 5 at a news agency, typical the news ones that care most about metadata) and prior to that worked 10+ years in AI, doing knowledge representation and a PhD in description logics I agree completely ( in principle). IMHO a news org has to do semi-automatic annotation of stories. Semi-automatic because google and other tech companies will always be at least as good as third party providers such as but third party systems can be very helpful at preprocessing things for an editor to make choices.

Hence I'm working very hard to convince the organization to move in that direction. Since I'm the head of R&D I'm able to do some experimenting, doing PoCs with different providers on special niche cases (e.g. identification of addresses and toponyms) where the business value is most immediate. Since I need this for German, OpenCalais is on hold until German is available, but judging from my experiences with others it will not be good enough. I need identification of addresses and places that most likely will result in addresses, e.g. POI, so that I have candidates that I can send to a geocoder. The editor then has to select and rank the resulting locations and addresses and cross-check them.

Right now all systems I tested return information that is too coarse, too many false positives, etc.

And even if they would it is still a difficult sell because of the money and especially the additional time needed by the editorial staff.

@theatermonkey Wow-incredibly well written post you linked to (http://bit.ly/dD2RZu). Lots of great ideas, but implementing will be tough.

This comment was originally posted on Twitter

The real question: what’s a more expressive way than tagging for expressing relationships between content? http://bit.ly/aEExpw ^DB

This comment was originally posted on Twitter

I agree completely with most of your points. Tags are limited.

Vocabularies and relationships (à la Linked Data triples) are surely a good idea. But they have some serious drawbacks that relate to very deep issues in knowledge representation. The world is not neatly categorizable.

You say, "Events happen at a certain place and at a certain time." Sometimes. For a house fire or a shooting, maybe, but how long were the post-election protests in Iran last summer? They continued at varying intensity for several days, then flared up weeks later. Was that one protest or two? And what about a Facebook protest that gathered supporters over the course of a week? "When" and "where" did that happen?

Simple date and place notations don't represent uncertainty well. And knowledge of uncertainty is sometimes critical, especially when trying to make chains of inferences, where uncertainties multiply.

Or, take your examples of describing what an article says about someone. How do we decide when a story "criticizes" someone? There will always be boundary cases -- lots of them in professional reporting. How do we ensure inter-rater reliability? Can we extract any real data from analyses of this tag if we have no other reference points to interpret it?

The reason we use text for reporting is that it's good at representing these sorts of ambiguities. Strict adherence to the religion of finite relationship vocabularies leads one to believe that the world can be modeled in first-order logic (predicate logic), and this just isn't true. The chains of automatic inference you propose will fail very quickly.

That's the great virtue of tags: it's just about the simplest possible way of saying something, and doesn't imply or require any particular inferential framework. A tag says, "there's some association." Full stop. I find this ambiguity a virtue. The meaning comes out of the relationships between the tags, articles, and users. Meaning is always relative, and tags force us to understand this, because there's nothing else to go on.

Tags allow (or force) what we might call the "google solution": let humans describe it in a way that makes sense to them, then sort it all out later algorithmically.

Having said that, I am fully in support of "entity recognition" as performed by OpenCalais, and carefully managed vocabularies, and the more complex relationships available under Linked Data. But please, let's not oversell it. We're a very long way from understanding how to represent all of reality in machine-readable form.

For more on this topic, I recommend:

"Ontology is overrated," by Clay Shirky. http://www.shirky.com/writings/ontology_overrated.html

"Metacrap" by Cory Doctorow. http://www.well.com/~doctorow/metacrap.htm

"What is a knowledge representation?" by Davis et al. at MIT http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html

  • Jonathan

Stijn Debrouwere

Thanks for the comment and for the links, Jonathan.

You're right, trying to describe the world in a neat little system is, to some extent, a fool's errand. I recently read Everything is Miscellaneous by David Weinberger and he makes the very valid point that it's all too easy to apply structure where there is none, or cement vague associations into black-or-white relationships whereas a lot of real-world relationships are "72% something, 5% something else" proverbially speaking. (Weinberger's book is a great read, by the way.)

I mentioned before that I'd shied away from talking too much about the semantic web because I feel that the primary focus should be on creating value right here and right now. Lumping together the post-election protests in Iran into one or a few discrete events with a start and an end date isn't accurate, but it can be damn useful in creating a timeline that provides the starting point for people who want to learn more about the protests or help in constructing a map that shows a few key areas where protests took place.

For that same reason, my suggestion would be to go easy on the relationships and most of the time to just limit yourself to a few basic ones that give a general idea of what we're talking about. E.g. the relationship "contains an interview with" could mean that a reporter quickly telephoned somebody for a quote or that the article is a bona fide interview, and anything in between. This would solve most of the trouble with inter-rater reliability as well. I don't think it wise to go all-out and create fancy ontologies for every possible use-case. They're often an exercise in futility, you're right. So I do hope my thoughts don't read too much like the ramblings of a zealot, because I try hard not to be.

Ambiguity is a virtue. But that's the challenge: go as far as we can in structuring content and how it relates to other content, without losing track of reality. However, that calls for specifics — some of which I've tried to provide here, but hey, I'm just a shmuck like everyone else — and not an abstract discussion about how well tags or linked data triples represent the world.

You mention that you do see some potential in parts of what I'm suggesting. Well, let's talk about which parts you think are usable and which ones we should drop or rethink, and why. That's the kind of discussion this topic now needs if we want to push things forward.

The problem with the Shirky approach (which I'm familiar with) is that we'd be settling for a system we know to be shaky because we can't construct a perfect system that provides flawless metadata. I'd rather settle for a good, workable approach now, and mix-and-match the tools from the Shirky/Weinberger camp with those from the Linked Data crowd (see e.g. Peter Morville's Ambient Findability for some counterweight) in whatever way works best.

Anyhow, it's a very important issue, so thanks again for bringing it up.

(Others reading this, also see Jonathan's blog post about this, which goes into a bit more detail: The world cannot be represented in machine-readable form.)

Great post on how/why content relationships are vital to the future of online news. This is worth the lengthy read. http://bit.ly/bzv71u

This comment was originally posted on Twitter

Hi Stijn, thanks for the post and your reply.

First let me be clear that I'm really glad you're thinking and writing about this topic. I wasn't quite responding to you specifically; your post just gave me an excuse to write about things I'd been thinking about for a while. And your pictures help to explain it :)

First, complex metadata can work well for restricted domains. Take for example the "management change" event that Tom from OpenCalais mentioned. That's a very constrained event, with very standard roles and actors. Many business stories are like this, which might be part of the reason that it's Reuters developing OpenCalais. Car crashes and fires and that sort of thing might have similar forms.

It's also possible to view linked data as a sort of super-tagging system: we still have “tags” in the linked data world, it’s just that they’re now all “uniform resource identifiers” that are visible to everyone on the web. This means that tags can be shared between systems and maintained by communities, which makes them both more powerful and cheaper to use. In fact, this is exactly what we’re already seeing, with the Wikipedia-derived DBpedia at the center of all those linked data “bubble diagrams.”

Linked data also supports predicates that say what the relationship between the tags is, like the “Barack Obama is a member of the Democratic Party” example. But I predict that these will be much less useful, offering almost none of the “machine understanding” that’s supposed to come with the semantic web. I don’t know what “understanding” means if not the ability to draw inferences of some sort, and predicates are just too fragile, too subject to mis-categorization, too limited to capture the rich relationships of the real world. I do believe that we’ll see amazing new “artificial intelligence”-like applications built on top of linked data, but they’ll be built statistically: they’ll ignore the predicates or use them only in special cases, or only in aggregate.

What I'm saying is that linked data is valuable because of the links; rather than trying to encode too much in complex ontologies, we should concentrate on making it easy to express connections, and not worry too much about labeling those connections except when it's extremely obvious. I'm hoping that the set of widely used RDF predicates turns out to be very limited.

Relying on links instead of predicates also helps express uncertainty. If we try to determine whether an article is left-wing or right-wing by looking at the "bias" field, we're in for problems. If we are instead forced to look at how many people linked to it from the blogosphere clusters loosely labelled "left" and "right", we have data that can express very rich opinions.

Just off the top of my head, perhaps I might express this design philosophy as follows:

  • favor general, minimal predicates that don't try to categorize
  • try to get lots and lots of simple links (hyperlinks, triples, comments, whatever)
  • quantize as late as possible; avoid data-losing categorization or predicate selection until the last possible moment in processing
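As a rough illustration of the "quantize as late as possible" idea, here is a minimal Python sketch. The link data, the cluster names and the `lean` helper are all made up for the example. The point is only that the raw links are stored untouched, and a summary number is derived at the very end, when a question is actually asked.

```python
from collections import Counter

# Hypothetical raw link data: (linking_cluster, target_article) pairs.
# Nothing is categorized up front; every link is kept as-is.
links = [
    ("left", "article-1"),
    ("left", "article-1"),
    ("right", "article-1"),
    ("right", "article-2"),
]

def cluster_counts(links, article):
    """Count inbound links to one article, grouped by the linking cluster."""
    return Counter(cluster for cluster, target in links if target == article)

def lean(counts):
    """Share of inbound links coming from the 'left' cluster, or None without data."""
    total = counts["left"] + counts["right"]
    return counts["left"] / total if total else None

# Quantize only now, at the moment a summary is needed.
counts = cluster_counts(links, "article-1")
print(counts, lean(counts))
```

Because the underlying links are never collapsed into a single "bias" field, a different summary (or a different clustering) can be computed later from the same data.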

  • Jonathan

Stijn Debrouwere

Somewhat late, but I still wanted to respond to Jonathan's latest comment.

(1) You assume that predicates like "Barack Obama is a member of the Democratic Party" need to provide some sort of machine understanding to be really useful and worth the trouble. Think more prosaically. If we want a topic page on the Democratic Party, we'll want links to the associated politicians. We can either compile those manually into a plaintext list, or we can specify those relationships in a more controlled, structured way that might serve additional purposes in the future. AI-supported inference chains and machine understanding are hardly the immediate gains of using a relationships-backed system I refer to.

(2) "not worry too much about labeling those connections except when it’s extremely obvious" — I guess we disagree here on how often these connections actually will be obvious.

You gave the example on your own blog of the tricky question of whether a politician in the McCarthy-era belonged to the communist party. And it's true, trying to encode those kinds of subtleties into something structured would be silly, but most of the time it's pretty obvious whether or not a musician belongs to a band, whether two persons are family, whether a person is being interviewed in an article, whether an article is a follow-up on a previous one and so on.

The vagueness of tags is not always a boon either. Tagging a politician who was falsely suspected of being a communist with the tag "communism" would be utterly confusing, even though technically the tag doesn't presuppose anything about what kind of relationship exists.

If it's not obvious how one thing relates to the other, or if we can't easily put it into words, we can just rely on the basic relationship "has something to do with" — essentially what a regular tag is. That way the merits of regular tags you speak of are safeguarded. We have the possibility to be exact when we can be exact, or vague when we need to be vague.
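To make that concrete, here is a small Python sketch of relationships that default to "has something to do with" when nothing more exact is known. The `Relation` type, the `relate` helper and the sample data are hypothetical names chosen for illustration, not part of any existing system.

```python
from typing import NamedTuple

# The vague fallback: essentially what a regular tag expresses.
GENERIC = "has something to do with"

class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str

def relate(subject, obj, predicate=GENERIC):
    """Create a relation, falling back to the generic one when no predicate is given."""
    return Relation(subject, predicate, obj)

# Hypothetical sample data mixing exact and vague relationships.
relations = [
    relate("Barack Obama", "Democratic Party", "is a member of"),
    relate("article-42", "Barack Obama", "contains an interview with"),
    relate("article-43", "communism"),  # no exact predicate: behaves like a plain tag
]

def subjects_of(relations, predicate, obj):
    """List everything that stands in the given relationship to obj."""
    return [r.subject for r in relations if r.predicate == predicate and r.obj == obj]

# For example, the politicians to link from a topic page on the Democratic Party.
print(subjects_of(relations, "is a member of", "Democratic Party"))
```

The exact relationships can feed things like topic pages, while the generic ones still work the way ordinary tags do.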

Dear Stijn,

Parts of what you describe here have already been implemented or modeled one way or another.

First, I agree with you: relationships are of utmost importance. It is all the more surprising to notice that they're most often lacking in current tagging services. Please note that this is something the folks at Flickr (mentioned in your post) have already noticed. That's precisely why they introduced machine tags (tags with structured content of the form "namespace:predicate=value").
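For readers who haven't seen machine tags before, here is a minimal Python sketch that splits a string of the form "namespace:predicate=value" into its three parts. The exact character rules in the pattern are an assumption made for this example, not Flickr's official grammar, and `parse_machine_tag` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional

class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str

# Assumed shape: alphanumeric namespace and predicate, then any non-empty value.
_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.+)$")

def parse_machine_tag(tag: str) -> Optional[MachineTag]:
    """Return the three parts of a machine tag, or None for an ordinary tag."""
    match = _PATTERN.match(tag)
    return MachineTag(*match.groups()) if match else None

print(parse_machine_tag("person:celeb=Paris"))
print(parse_machine_tag("communism"))
```

An ordinary tag simply fails to match, so plain tags and machine tags can live side by side in the same field.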

Now, as early as 2005, people have been working on tag ontologies for the Semantic Web in order to palliate the limitations you described very well. Newman's ontology, SCOT, SIOC, MOAT, Common tag, they all give partial answers to the issues here raised. Except as regards relationships.

This is why, together with other researchers, we've been creating an ontology, the NiceTag ontology, where those are made explicit. Of course, we don't want to model all of them. Plus, that would not be possible. Each site, according to its main theme, may provide users with the possibility to choose some among many.

But there's more than that. It's not just the choice of a relation, but also the action you can perform. Asserting is just one among many. Tags are also used to share some resources, to aggregate others (especially on Twitter during conferences ;-) ), etc.

These are reminiscent of speech acts. As such they also involve relations. That's why I believe the approach taken by OpenCalais, while laudable and very useful, cannot replace tagging or simply be mentioned to dismiss it. Human language is powerful due to its expressivity. Yet it may also be used to act. Entity extraction just won't help you to do that. Tagging, on the other hand, does already.

Here are some slides describing NiceTag on Slideshare if you wish to learn more about it : http://www.slideshare.net/fabien_gandon/nice-tag-ontology-modeling-tags-as-rdf-named-graphs

(they are a bit dated, though, since the speech acts part is not presented)

Thanks,

A.M.

Stijn Debrouwere

Speech acts — hadn't heard about those since language philosophy at uni :-)

NiceTag looks fascinating, I'd have to look into it. Existing ontologies can indeed serve as inspiration or the basis of journalism-as-structured-information, e.g. Dublin Core and SKOS, as well as those you mention. However, I'd add two important qualifications:

1. Structure on the level of each individual news organization should come first, standardization can come later. As I pointed out in We're in the information business, different news outlets do different kinds of reporting, and they should be able to structure their content in a way that fits with their reporting and their strategic goals. Not just by being able to choose the parts of existing ontologies that suit them (as you suggest) but also by trying out their own ways of structuring information.

Of course, that doesn't preclude a healthy exchange of ideas between different organizations, the establishment of best practices or the reuse of existing ontologies where they make sense.

2. I wouldn't conflate a model and an implementation. While the ideas and principles behind structured and linked data might be simple, implementing these in a way that makes sense for a newsroom and provides a return on investment is decidedly not straightforward.

Anyhow, I might just be nitpicking. Thanks for your thoughts!

Thanks for taking the time to answer my comment (and sorry for the typos in my message. 'Twas early in the morning ;) ) .

Actually, I'm a PhD student in Philosophy with a strong interest in the Semantic Web, tags, ontologies, etc. :)

  1. I agree. And I must add that neither FOAF for instance nor SIOC or SCOT or NiceTag are W3C standards (by contrast, SKOS is). These are domain ontologies/vocabularies. Hence, I'm all for structuring information by creating your own voc. It need not be a standard (I believe it should follow the RDF/S/OWL standards; however the freedom is yours to create whatever classes and relations you see fit, especially as a professional).

"Of course, that doesn’t preclude a healthy exchange of ideas between different organizations, the establishment of best practices or the reuse of existing ontologies where they make sense."

I 100% agree with that :)

  2. You're right. There are at least three levels here that I find relevant for the discussion.

a) The model: should help to really leverage the semantics provided by users (that includes speech acts and relations)

b) The implementation : OK, I think the Linked Data part is quite easy (really!). But then you have to associate it with automatic extraction to account for autosuggestion/autotagging. That includes many features not yet working as well as we'd like (image recognition, etc.). Now of course, it's important to know what exactly will be automated and how it will serve humans. Which leads me to a third point.

c) The interface. We have to work really really hard on that one. But we should not, on the other hand, limit the expressivity of our model (get rid of relations) simply because we do not yet see how to integrate it in smooth interfaces. Lack of confidence in interfaces means they will probably be really bad anyway...

Thanks!

Alexandre.

William

The problem is no one easily "gets" RDF predicate logic. Tagging is conceptually simple. Tagging can exist in the context of structured vocabularies like SP 2010. The real challenge is defining the relationships and context of the information in an easy way so that users will actually contribute. A great tool that combines the simplicity of tagging with the structure you crave in RDF is Jumper 2.0, a bookmarking engine. It is open source on SourceForge.

Stijn Debrouwere

@William: "The problem is no one easily 'gets' rdf predicate logic." — I think that would depend entirely on the UI of the implementation. There's nothing particularly difficult about describing content with simple sentences such as "This article contains an interview with the mayor of San Antonio". Conceptually that's as easy as it gets. So the challenge is in finetuning the user interface to honor that simplicity.

Great writeup on how tagging can be improved! I saw someone mention machine tags (http://tagaholic.me/2009/03/26/what-are-machine-tags.html) in the comments as a solution to your proposed scheme but didn't see any follow up to that. To continue that suggestion, here's how I would apply them to the schema steps:

  • Step 1: Vocabularies are just machine tag namespaces.
  • Step 2: Machine tags are triples and thus do relationships easily. Flickr has been using machine tags to let third party services associate themselves with photos for the last 2+ years.
  • Step 3: I've been tagging my bookmarks as entities for the last year and think it has great potential. I explain some of this process here: http://tagaholic.me/2009/04/26/machine-tagging-with-delicious.html
  • Step 4: Hasn't been done yet but I'm actively thinking/coding up ways to do it with machine tags.

I think step 5 is different than the other steps since it's not a step that is essential to creating a new tagging scheme. While it represents an important common tagging problem, it can be solved with the right algorithm and UI, independent of the tagging scheme. To the list of steps/features for better tagging, I would propose that tags remain able to be represented as a URI which a user can understand.

Stijn Debrouwere

@cldwalker: there are solutions and there are solutions. In theory RDF triples that link together different entities with their own URIs satisfy just about all the requirements of the tagging scheme I propose. Machine tags do too. Representing relationships between entities might not be an entirely solved problem, technically speaking, but pretty damn close to it.

However, conceived as a user interface, machine tags à la Flickr are horrible. They provide very little feedback to users about what kind of entities and relationships you can/should specify, and no easy workflow that allows you to add new entities along with the relationships to these entities. And softer requirements, like creating a system that's maintainable, also go beyond just the technical implementation, and even beyond the user interface.

Tagging is not just about technology, it's about having a well-tuned strategy. And I'm afraid machine tags only solve the easy part of the equation :-)

Cool blog by the way, added to my feed reader. Cheers!

Hi again,

I'm pretty sure I was the one mentioning machine tags actually. :)

To be honest, while I greatly admire the work done at Flickr, I think there is a difference between proposals such as NiceTag and machine tags - in their current form.

Indeed, let's say you have a resource, a relation and a label (each term has its complexity but let us skip over that for the sake of the argument).

Right. Machine tags are just strings of characters. Labels. With a particular syntax. But they don't exactly add a typed relation that is normally left implicit.

Imagine my machine tag is "person:celeb=Paris" (for Paris Hilton).

Imagine I tag my resource with this tag:

Resource----???relation???---->"person:celeb=Paris"

So what ?

I know (though I may be wrong) that I'm somewhat referring to Paris Hilton with this label. But the lingering question remains : how is it related to Paris? How am I using it in that particular utterance?

Is it about Paris? Is it written by Paris? Is it her (or a photo depicting her or showing her, if the two are disjoint; there are good reasons for that)?

The disambiguation can be achieved with MOAT too. Just add a URI and the appropriate relation to a label (the cost is higher but the result is better).

Since the relation has been integrated into the label itself, we may ask how that "label-with-a-relation" relates to the resource. And the regression begins...

So, I'd say no, the result isn't the same at all despite appearances to the contrary. :)
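The difference can be sketched in a few lines of Python. This is not how NiceTag or MOAT are actually implemented; `TagAssertion`, the relation names and the example.org URI are hypothetical, chosen only to show a relation that sits between the resource and the label instead of inside the label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagAssertion:
    resource: str      # the thing being tagged
    relation: str      # how the resource relates to what the label means
    label: str         # the human-readable tag
    meaning_uri: str   # placeholder URI disambiguating the label

# Placeholder identifier for the person meant by the label "Paris".
PARIS_HILTON = "http://example.org/people/paris-hilton"

assertions = [
    TagAssertion("photo-123", "depicts", "Paris", PARIS_HILTON),
    TagAssertion("essay-456", "is about", "Paris", PARIS_HILTON),
]

def resources_where(assertions, relation, meaning_uri):
    """Find resources that stand in a given relation to a given meaning."""
    return [a.resource for a in assertions
            if a.relation == relation and a.meaning_uri == meaning_uri]

print(resources_where(assertions, "depicts", PARIS_HILTON))
```

Both resources carry the same label, yet the explicit relation keeps "a photo of her" and "a text about her" apart, which a bare "person:celeb=Paris" string cannot do.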

"Tagging is not just about technology, it’s about having a well-tuned strategy. And I’m afraid machine tags only solve the easy part of the equation :)"

And they did that very well ! But I agree with you.

Cheers,