Controlled Vocabularies in the Trenches

Victor jots down some thoughts about creating controlled vocabularies within the context of the design of a project he's working on. He discusses some real considerations and dependencies related to the development of a controlled vocabulary and implications for systems design. Here are some of my own thoughts and reactions, based on experience.

I've watched the controlled vocabularies of subject headings and company information grow within my organization (a corporate library services org.) over the last four years. The approach we've taken is something like a web services model, or like a vendor service in which a data aggregator provides indexed content with its own proprietary controlled vocabulary (e.g. Factiva). This seems to me to be a good model because it centralizes semantic tagging and the creation of indexing terms in one place, while allowing enterprise use at different levels of granularity. When following this model, you're still confronted with the issues of knowledge representation when developing your terminology, but the system considerations are separated. The design of IR systems that use indexes benefits from documenting scope, domain, documentary units, indexable matter, etc. prior to implementation. I have this great unpublished text by Jim Anderson that serves as a framework for such documentation.

Here's a short description of our approach, which has been both top-down and bottom-up. Our people created our CVs by starting with close relationships with business units to develop a set of subject headings and a company authority list. They iterated through these lists using the top-down approach, informing the lists with their subject area expertise. Then they took the bottom-up approach, adding or modifying terms to reflect subject headings identified while doing the daily work of indexing (knowledge representation). For my org., this is a daily process, since a team of indexers sifts through machine-filtered data and applies more granular indexing or alters machine-applied terms. As the telecom landscape changes or as our indexing needs require, terms are added to the vocabularies. We have one person who manages and develops them, and a few additional subject area experts who work on developing new terms in new subject areas. User feedback informs changes along the way. The controlled vocabularies are offered up for use by disparate systems within our company to represent that corpus of indexed data, or slices of it, as desired.
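To make the two directions concrete, here is a minimal sketch in Python. It is not our actual system: the class, the method names, and the telecom terms are all made up for illustration. The top-down pass seeds the vocabulary with preferred terms and their variants, and the bottom-up pass flags labels met during indexing that the vocabulary doesn't cover yet, so a vocabulary manager can review them.

```python
from dataclasses import dataclass, field


@dataclass
class ControlledVocabulary:
    # Maps a lowercased label (preferred term or variant) to its preferred term.
    preferred_by_label: dict = field(default_factory=dict)

    def add_term(self, preferred, variants=()):
        """Register a preferred term plus any entry terms that should map to it."""
        for label in (preferred, *variants):
            self.preferred_by_label[label.lower()] = preferred

    def normalize(self, label):
        """Return the preferred term for a label, or None if it is uncontrolled."""
        return self.preferred_by_label.get(label.lower())


# Top-down: a seed list agreed with business units (hypothetical terms).
vocab = ControlledVocabulary()
vocab.add_term("Voice over IP", variants=["VoIP", "IP telephony"])
vocab.add_term("Wireless broadband")

# Bottom-up: labels an indexer ran into during the daily work (hypothetical).
seen_while_indexing = ["voip", "WiMAX", "IP Telephony"]

# Anything the vocabulary cannot normalize becomes a candidate for review.
candidates = [label for label in seen_while_indexing if vocab.normalize(label) is None]
```

In this sketch only "WiMAX" ends up as a candidate, because the other two labels already resolve to "Voice over IP". Whether a candidate becomes a new preferred term or a variant of an existing one is still a human decision.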

As an IA, I generally work with our taxonomy specialists to create page inventories -- sort of like microscopic content inventories on steroids -- that specify combinations of index terms used to build content modules. As an example, I show a small piece of one of these inventories on my old and dated portfolio. This use of the term content inventory is not typical in our field, I know. What this really is, is a design document showing such things as rubrics of content modules with their associated labels, and database searches that use terms from a controlled vocabulary. Maybe I should present something on this process some day. It's really a hybrid IA and technical document, but it's a format my entire team uses on all data-dense sections of our site.
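As a rough illustration of what one of these page inventories captures, here is a small Python sketch. The field names, module labels, and index terms are assumptions invented for this example, not our real document format. Each content module pairs a display label with the combination of controlled vocabulary terms that its database search uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentModule:
    label: str          # label shown above the module on the page
    all_of: tuple       # index terms a document must carry
    any_of: tuple = ()  # optional terms; if given, at least one must match

    def matches(self, doc_terms):
        """True if a document's index terms satisfy this module's search."""
        terms = set(doc_terms)
        if not set(self.all_of) <= terms:
            return False
        return not self.any_of or bool(terms & set(self.any_of))


# A tiny page inventory (hypothetical modules and terms).
page_inventory = [
    ContentModule("VoIP news", all_of=("Voice over IP",),
                  any_of=("Acme Telecom", "Globex")),
    ContentModule("Broadband regulation",
                  all_of=("Wireless broadband", "Regulation")),
]

# Indexed documents, each with its controlled vocabulary terms (hypothetical).
documents = {
    "doc-1": ["Voice over IP", "Acme Telecom"],
    "doc-2": ["Wireless broadband", "Regulation", "Globex"],
    "doc-3": ["Voice over IP"],
}

# Which modules would each document appear in?
modules_by_doc = {
    doc_id: [m.label for m in page_inventory if m.matches(terms)]
    for doc_id, terms in documents.items()
}
```

Here "doc-3" lands in no module, because the VoIP module also requires one of the company terms. Writing the term combinations down this way is what lets the same document serve designers and developers.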

Incidentally, the taxonomy guys I'm talking about are presenting on this topic at an ARK seminar in NYC in November in case you're interested. They're really smart. Hopefully they will get to network a bit at this thing, because everyone in our group could get pink slips if the cost-cutting winds decide to blow in our direction.


Tanya's also musing about controlled vocabularies in the trenches.

soft -> hard control & reward systems

The interesting thing I took away from Victor's post was the concept that the "control" in a "controlled vocabulary" doesn't have to be absolute.

On the IAwiki there was a long undisputed comment that the idiom of category pages wasn't an example of controlled vocabularies ... but now with VL's latest the argument can be made that it is, in that known and used terms are *rewarded* with links, while neologisms are not.

My contribution to this discussion, then, is this idea of self-reinforcing rewards to impel a bottom-up controlled vocabulary.
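A minimal sketch of that reward mechanism, in Python, might look like the following. The function name, the URL pattern, and the sample terms are assumptions for illustration only, not how any particular wiki works. Labels that match a known term get a link, and everything else is left as plain text.

```python
def link_known_terms(labels, known_terms):
    """Pair each label with a category URL if it is a known term, else None."""
    known = {term.lower() for term in known_terms}
    linked = []
    for label in labels:
        if label.lower() in known:
            # Hypothetical URL scheme: lowercase, spaces replaced by hyphens.
            slug = label.lower().replace(" ", "-")
            linked.append((label, f"/category/{slug}"))
        else:
            # Neologisms get no link, so there is no reward for using them.
            linked.append((label, None))
    return linked


known_terms = ["Controlled vocabulary", "Taxonomy"]
page_labels = ["taxonomy", "Folksonomy", "Controlled Vocabulary"]
links = link_known_terms(page_labels, known_terms)
```

The control here is soft: nothing stops an author from writing "Folksonomy", but only the established terms lead anywhere.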


Eric, I agree with Victor's point about there being no absolute control in a controlled vocabulary. I think one thing that we need to make organizations aware of when creating something like a subject heading list is that what they're creating are living documents. As times and needs change, the terms used to describe subject headings in your domain will change. The same might be true of more concrete lists of terms, like product family names, but these need to be maintained as well, as new products arrive and existing products change names or are discontinued.

I can speak to the bottom-up approach. This is often how taxonomies are developed in businesses. A corpus of data is available, and it is up to a librarian type or consultant to analyze individual documents (a representative sample, not all of them) and bubble up terms from the mass, later shaping them into a list of terms preferred for indexing and retrieval purposes. These terms often simply go into a hierarchically arranged subject list, and sometimes terms are better described through a thesaurus. So this supports your notion of self-reinforcing rewards. Bottom-up is often the most logical approach when dealing with existing content. But some top-down arrangement and further description usually happens after this bottom-up work, if you're going to be doing more granular description and semantic linking.
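The "bubbling up" step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sample labels are invented, and the threshold of two documents is arbitrary. It counts how many sampled documents each raw label appears in and keeps the ones that recur.

```python
from collections import Counter

# Raw labels pulled from a representative sample of documents (hypothetical).
sample = [
    ["voip", "pricing", "voip"],
    ["VoIP", "regulation"],
    ["broadband", "regulation"],
    ["voip", "broadband"],
]


def candidate_terms(docs, min_docs=2):
    """Return labels that appear in at least min_docs documents, sorted."""
    doc_freq = Counter()
    for labels in docs:
        # Count each label once per document, ignoring case.
        doc_freq.update({label.lower() for label in labels})
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


candidates = candidate_terms(sample)
```

With this sample, "broadband", "regulation", and "voip" survive and "pricing" does not. The output is only raw material: choosing preferred forms and arranging the terms into a hierarchy or thesaurus is the top-down work that follows.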

Andrew's experience

Andrew describes his experience on heyblog.