XML-tagging the news

Writing for XML.com, Adrian Holovaty proposes adding XML tags to news content to facilitate automated transformations, fixing such things as date/time references.

Surprise: Most of those ideas have already been built into NITF, the standard for news text markup. I think <profanity> isn't in there, but you can tag people, money, time references, events, postal addresses and many other items that might be embedded in the text of a news story.
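To make the idea concrete, here is a rough sketch of the kind of automated transformation inline tags would allow: turning a relative time reference into an absolute date. The fragment, the element names (person, chron, money) and the norm and unit attributes are my assumptions modeled on NITF-style inline markup, not text copied from the spec, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical story fragment. Element and attribute names are assumptions
# modeled on NITF-style inline tags; check the actual DTD before relying on them.
FRAGMENT = (
    '<p><person>Jane Doe</person> said <chron norm="20050614">yesterday</chron> '
    'that the project would cost <money unit="USD">2 million</money>.</p>'
)


def absolutize_dates(xml_text: str) -> str:
    """Return plain text with each tagged time reference replaced by its absolute date."""
    root = ET.fromstring(xml_text)
    for chron in root.iter("chron"):
        norm = chron.get("norm")
        if norm:
            # Assumes the normalized value starts with a YYYYMMDD date.
            date = datetime.strptime(norm[:8], "%Y%m%d")
            chron.text = f"on {date:%B} {date.day}, {date.year}"
    # Flatten the markup back to readable text.
    return "".join(root.itertext())


def tagged_items(xml_text: str) -> list[tuple[str, str]]:
    """List (tag, text) pairs for the inline items a robot might harvest."""
    wanted = {"person", "chron", "money"}
    return [(el.tag, el.text) for el in ET.fromstring(xml_text).iter() if el.tag in wanted]


if __name__ == "__main__":
    print(absolutize_dates(FRAGMENT))
    print(tagged_items(FRAGMENT))
```

The transformation itself is only a few lines. The expensive part is getting the norm value into the story in the first place, and that has to be done by a person at editing time.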

The tags are available, but nobody uses them. Inline tagging might facilitate some intriguing mashups (not to mention EPIC's fact-stripping robots), but it's a tremendous amount of work, it intrudes on the content-creation process, and there's not a clear business need.

Comments

Hi Steve,

When you write "tremendous," what exactly do you mean?

Thanks!

It would probably double the manpower requirements in the copy editing process. Doing it well would require a skill set not currently in place -- copy editors aren't semantic markup experts. And this is a time when newsroom budgets are being slashed and reporters and editors are being laid off.

This isn't a good climate in which to sell an idea whose benefits would accrue primarily outside the content-creating organization. What I mean by that is simple: Google and Yahoo would have a feast. Newspapers would do all the cooking.

As an online journalist, I'd like to have structured data to work with.

Back in 1998, at an API Media Center affair, several of us formed a committee to write an XML standard specifically for this purpose.

The idea came from a breakout session led by Dan Froomkin. (Jay Small called it the Danco Web-O-Matic.) We all saw the potential to improve our own site automation processes. The project eventually was called NML, for News Markup Language, and the team was expanded to include people like Dave Megginson and representatives from Atex, DTI, Future Tense and WSJ.com (which had developed its own XML markup).

We were persuaded by the IPTC working group defining the NITF standard to merge our work with theirs. A few months later NITF -- which had been in the works for years -- finally emerged as a published standard.

Today most wire-service copy moves in NITF format, sometimes rolled into a NewsML container (an IPTC standard supporting complex relationships between media objects). But when you look closely at the body content of news stories, you still see plain-Jane minimal markup. The standards support a great deal of structural enhancement, but the economic realities do not.