Meta Content Framework : A Whitepaper

R.V.Guha
(guha@guha.com)

Note: This is an old (1996) paper I wrote about MCF. Most of it is still relevant.

Organizational structures for content, such as file systems, emailbox hierarchies, WWW subject categories, people directories, etc. are almost as important as content itself. The goal of the Meta Content Framework (MCF) is to provide the same benefits for these structures as HTML provided for content.

This is a white paper that describes the basic concepts and ideas behind MCF. This paper is not intended to be a specification of MCF. Instead, it provides the background context for MCF and outlines the user values delivered. The specification can be found in the MCF spec.

Background Context

The Shift in Focus

The last few years have seen a significant change in what we consider productivity applications. The focus has shifted from authoring and analysis applications such as word processors and spreadsheets to information management and communications applications (IMA) such as email. In addition, with increasing disk sizes, our dependence on the traditional IMAs such as the desktop file manager has increased.

The Problem

We use a number of different IMAs --- browsers, email, newsreaders, file systems, etc. to manage our information. They each handle collections of information objects : web pages, images, email messages, files, folders, etc. Each of these IMAs has an information organizing structure. Much of the time spent in front of computers is in manipulating or creating these structures. The structures form the skeleton for the organization of our information. These structures are not the final content, but are meta content.

Each IMA uses its own representations for these structures and provides its own utilities for viewing and manipulating them. Furthermore, the structures in use today are very simple and inexpressive. They don't allow us to represent very much about the content. The structure is typically a tree or graph with a very limited number of attributes such as the author, modification date and size provided as annotations on the nodes.

We claim that the lack of an expressive, open standard for representing these structures is at the root of many of our information management problems. In fact, we have become so accustomed to these problems that we hardly even regard them as problems any more. For example, our information today is divided up into separate containers such as email, files, web pages, etc. This division is based not on what the content is about or which tasks it is relevant to, but on which protocol is used to access/manipulate it. Efforts to remove these boundaries typically involve trying to absorb one into another and are fairly specific to certain platforms (the best example of this is the attempt in Windows 97 to absorb the web into the Win97 desktop).

MCF Goals

The goal of MCF is to abstract and standardize the representation of the structures we use for organizing information. In addition to the usual benefits of open standards, this will also allow for information management (IM) utilities such as viewers to work across different IMAs.

Let us illustrate this concept with a very simple example. Consider the abstract notion of hierarchies, which we find in file systems, mailbox structures, usenet newsgroups, etc. We can use the same representation for hierarchies irrespective of whether the nodes of the hierarchy represent files, email messages or news postings. Once these hierarchies are accessible using the common representation, a hierarchy viewer which operates on this representation can be used by file system browsers, email programs, news readers, etc.
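The hierarchy example can be made concrete with a minimal sketch. It assumes a hypothetical common representation in which each node name maps to a list of child names; the `Hierarchy` type and the `render_outline` function are illustrative names and are not part of MCF.

```python
from typing import Dict, List

# Hypothetical common representation: each node name maps to its child names.
Hierarchy = Dict[str, List[str]]


def render_outline(tree: Hierarchy, root: str, depth: int = 0) -> List[str]:
    """Return an indented outline of the hierarchy rooted at `root`."""
    lines = ["  " * depth + root]
    for child in tree.get(root, []):
        lines.extend(render_outline(tree, child, depth + 1))
    return lines


# The same viewer is used whether the nodes stand for files or for mail.
file_system: Hierarchy = {"Desktop": ["Reports", "todo.txt"], "Reports": ["q3.doc"]}
mailboxes: Hierarchy = {"Inbox": ["Project X", "Personal"], "Project X": ["Re: schedule"]}

file_outline = render_outline(file_system, "Desktop")
mail_outline = render_outline(mailboxes, "Inbox")
```

The viewer never inspects what a node denotes, which is the property that lets one utility serve several IMAs.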

It is important to note that we have not made any assumption about the encoding of the content itself. The actual files, email messages, etc. could be in HTML, Word or any other content encoding.

Our focus is on the representation of the meta content structures. The representation is specified both in terms of the representation model and in terms of a query language. In order to write IM utilities, we will also need APIs for accessing resources like memory and screen space. Most likely, there will be multiple versions of these, based on OpenDoc, Java, ActiveX, etc. Hopefully, object wrapper technology will allow us to ignore these differences.

MCF based IM utilities (viewers, filters, persistent stores, etc.) can be used by IMAs such as email programs and file systems. At the extreme, the IMA might be nothing more than a shell such as a browser or desktop with most of the functionality being provided by these utilities. This is analogous to the browser acting as the shell for the different helper apps such as image displayers, plugins and applets.

The Central Concept

The central concept is the use of rich, standard, structured, extensible, compositable descriptions of information organization structures as the core of information management systems. We now explain what we mean by each of these terms.

Rich Descriptions: Structures such as file systems record very little information. They typically allow a tree structure together with a small number of attributes such as the author, creation date, modification date and size. We cannot, for example, tell the machine that a certain file is a memo written in response to a certain email message. Furthermore, the machine's ontology (the set of things it knows about) is severely restricted. Even though much of the content is about people, organizations, places, etc., the machine only knows about files, folders, messages and such. MCF allows for semantically rich descriptions of content and its relationship to objects such as people, organizations and events.
This is an extremely important point. In order to adequately describe information organization structures, the system needs to be able to express/reference more than just machine internal objects such as files, folders and email messages. Entities such as people, organizations and projects need to be first class citizens as well and MCF provides for this.
Structured Descriptions: MCF descriptions are structured. The distinction between structured and unstructured descriptions is the same as the distinction between a relational database and a text file. Structured descriptions take more work to create, but support sophisticated queries and analysis. MCF is concerned only with machine understandable descriptions, hence the focus on structure.
This is not to say that tools such as text search engines have no role to play in MCF. Quite the contrary. These engines often serve as "sense making tools" that induce structure amongst unstructured content. MCF is an ideal target language for this structure. MCF does not care about how the structure was generated, manually or automatically.
Standard Descriptions: MCF descriptions are standardized at two levels. MCF provides a standard language for accessing and manipulating meta content descriptions just like SQL provides a standard query language for relational databases. In addition, MCF also provides a standardized vocabulary so that different sources/programs use the same terms (such as "author", "fileSize", etc.) to refer to the same concepts. Different SQL databases need not use the same field names and data formats to represent the same concept. E.g., different tables might use different field names and formats for phone numbers. This is precisely what makes it so difficult to integrate data from disparate databases. Integrating meta content from disparate sources is a lot more important than integrating data from disparate relational databases. And so, to make the integration go more smoothly, MCF provides a growing set of standard terms.
Extensible Descriptions: The first and third requirements --- rich and standard --- are somewhat at odds with each other. The list of standard terms cannot cover everything that can be stated using MCF and at some point someone (or some program) will want to express something not covered in the standard vocabulary. So, in addition to the standard terms, programs can introduce new terms to express new kinds of meta content. Furthermore, this extension can happen dynamically and apply to older objects as well (to continue with the database analogy, this would be like adding a field to a relational table on the fly.) While this does impose some fairly strong demands on performance, we feel that this level of extensibility is essential.
Compositable Descriptions: We often need to have multiple layers of descriptions, each adding to or modifying lower layers. This capability can be used to,
- allow divisions, departments and finally end users to create personalized views of partially shared information spaces. The user's changes to a space modify his personal layer. The view he sees is the composition of the more global layers together with his changes. When the global layers get modified, the user's view automatically changes in a fashion that is consistent with the user's changes.
  This, for example, allows a user to create his own personalized view of Yahoo! with additions from his desktop, email, other web hierarchies and certain of the Yahoo! categories deleted. When Yahoo! changes, these changes are absorbed into the user's view of Yahoo!, in a fashion that is maximally consistent with the changes he has made.
- integrate web pages, email, etc. into existing structures such as the file system structure without making any modifications to the file system. To do this, we abstract the existing file system as a layer and then add a layer which holds the (meta content for) the additional information. This, for example, provides a clean and cross platform mechanism for desktop-web integration. This kind of integration is also symmetric in that it allows you to regard, say one of Yahoo!'s subject categories as a folder and put your desktop files in it.
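A minimal sketch of layer composition follows. It assumes, for illustration only, that a shared layer is a set of (category, item) edges and that a personal layer records additions and deletions; the `compose` function and the sample data are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (parent category, child item)


def compose(shared: Set[Edge], added: Set[Edge], removed: Set[Edge]) -> Set[Edge]:
    """Compose a shared layer with a personal layer of additions and deletions."""
    return (shared - removed) | added


shared = {("Sports", "Football"), ("Sports", "Golf")}
personal_added = {("Sports", "my-league-notes.txt")}
personal_removed = {("Sports", "Golf")}

view = compose(shared, personal_added, personal_removed)

# The shared layer changes later; the personal layer is left untouched.
shared_v2 = shared | {("Sports", "Tennis")}
view_v2 = compose(shared_v2, personal_added, personal_removed)
```

Because the view is recomputed from the layers, the new shared category shows up in the user's view while his deletion and addition are preserved.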

The MCF Model and Query API

In this section, we briefly describe the MCF model and query API. MCF is based on predicate logic and is hence very close to both object oriented and relational databases. You can safely skip this section.

An MCF database consists of two parts:

- a set of objects (called units). Strings, numbers and other "native" datatypes are also considered objects.
  - A subset of these objects are called slots (or predicates).
  - Another subset of these objects is called Layers. The layers are arranged in a total order.
- a set of n-tuples (typically triples), each consisting of a slot, an ordered list of n-1 object references and a layer. These tuples are called assertions. Each assertion also has a true/false value associated with it. Assertions are said to be true/false in the layer associated with them. An assertion that is true/false in a layer is also true/false in all the superior layers, unless one of those also contains the assertion with a different true/false value.
Since the layers themselves are units, the relation between the layers themselves is expressed as assertions. These assertions are in the BaseLayer, a special layer that is at the bottom of the total order.
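The way truth values carry upward through layers can be sketched as follows. This is a toy model under stated assumptions: the `LayeredStore` class and its method names are hypothetical, and we assume that the nearest layer at or below the queried layer that contains the assertion decides its value.

```python
from typing import Dict, List, Optional, Tuple

Assertion = Tuple[str, ...]  # (slot, object, object, ...)


class LayeredStore:
    """Toy store; the class and method names are illustrative, not part of MCF."""

    def __init__(self, layers: List[str]) -> None:
        # Layers are listed from the bottom of the total order to the top.
        self.layers = layers
        self.values: Dict[str, Dict[Assertion, bool]] = {name: {} for name in layers}

    def assert_in(self, layer: str, assertion: Assertion, value: bool = True) -> None:
        self.values[layer][assertion] = value

    def truth(self, assertion: Assertion, layer: str) -> Optional[bool]:
        """Return the value seen from `layer`, or None if no layer states it."""
        top = self.layers.index(layer)
        # Assumption: the nearest layer at or below `layer` that holds the
        # assertion decides its value.
        for name in reversed(self.layers[: top + 1]):
            if assertion in self.values[name]:
                return self.values[name][assertion]
        return None


store = LayeredStore(["BaseLayer", "Shared", "Personal"])
golf = ("parent", "Golf", "Sports")
tennis = ("parent", "Tennis", "Sports")
store.assert_in("Shared", golf, True)
store.assert_in("Shared", tennis, True)
store.assert_in("Personal", golf, False)  # the user hides Golf in his own layer
```

Here `tennis` is inherited by the Personal layer unchanged, while `golf` is overridden there and remains true in the Shared layer.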

The MCF Query API provides access to 3 (5) functions.

- Create a new object (Destroy/Delete an existing object).
- Assert (Unassert) a given assertion in a given layer with a specific truth value.
- Query : Given a boolean combination of assertions, some of which have variables in the place of one or more objects, return bindings for these variables. The query can either specify a layer or default to the top most layer.
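To illustrate what returning bindings means, here is a minimal sketch of the Query function restricted to conjunctions of triples, ignoring layers and truth values. The "?name" variable syntax, the `query` and `match` functions and the sample facts are all assumptions made for this example; they are not taken from the MCF specification.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (slot, object, value)
Bindings = Dict[str, str]


def match(pattern: Triple, fact: Triple, env: Bindings) -> Optional[Bindings]:
    """Extend `env` so that `pattern` equals `fact`, or return None."""
    env = dict(env)
    for term, value in zip(pattern, fact):
        if term.startswith("?"):  # a variable: bind it or check the old binding
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env


def query(facts: List[Triple], patterns: List[Triple],
          env: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Yield every set of bindings that satisfies all patterns (a conjunction)."""
    env = {} if env is None else env
    if not patterns:
        yield env
        return
    for fact in facts:
        extended = match(patterns[0], fact, env)
        if extended is not None:
            yield from query(facts, patterns[1:], extended)


facts = [
    ("author", "memo.doc", "guha"),
    ("inReplyTo", "memo.doc", "msg-17"),
    ("author", "notes.txt", "guha"),
    ("author", "plan.doc", "lee"),
]
# Which documents by guha were written in reply to which message?
results = list(query(facts, [("author", "?doc", "guha"), ("inReplyTo", "?doc", "?msg")]))
```

A fuller version would also handle disjunction and negation and would take the layer as a parameter.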

The API has been kept minimalistic in order to make it easy to learn. It is also obvious from the model that MCF is highly reflective. Since both slots and layers are first class objects/units, the representations can be very dynamic.

This section of course does not touch on the other part of the MCF specification, i.e., the standard list of terms. The beginnings of this list are at vocab.html.

The MCF Environment

The environment in which the MCF utilities operate provides the MCF query API. With this API, the utilities get a single uniform view of the information space accessible from that machine.

It is important to note that the MCF APIs do not imply a real MCF database sitting on the local disk. It is possible for the APIs to work without any MCF store. More often than not, there will only be a "virtual" MCF database or the database might reside in memory and get saved out on a server.

E.g., rather than duplicate the file system structure into a different store, the APIs might access the local file system and dynamically make the file structure accessible via the MCF API.
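A "virtual" database of this kind can be sketched as a generator that computes assertions from the file system on demand rather than copying them into a store. The `file_system_assertions` function and the "parent" slot name are hypothetical; "fileSize" is used only as an example of a standard term.

```python
import os
import tempfile
from typing import Iterator, Tuple

Triple = Tuple[str, str, object]  # (slot, object, value)


def file_system_assertions(root: str) -> Iterator[Triple]:
    """Yield assertions computed on demand from the file system under `root`."""
    for folder, subfolders, files in os.walk(root):
        for name in subfolders + files:
            path = os.path.join(folder, name)
            yield ("parent", path, folder)
            if os.path.isfile(path):
                yield ("fileSize", path, os.path.getsize(path))


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as root:
    memo = os.path.join(root, "memo.txt")
    with open(memo, "w") as handle:
        handle.write("hello")
    assertions = list(file_system_assertions(root))
```

Nothing is duplicated: each assertion is derived from the file system at the moment it is requested.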

Most likely, there will be a combination of an MCF store, dynamic translation and remote query processing, with the different answers getting integrated via the layers mechanism. Of course, the MCF utility need not worry about any of this.

Some Salient Aspects of MCF

Scalability

MCF is not inherently desktop or server based. It is meant to be a fully scalable architecture from palmtops to enterprise servers. By replacing components such as the storage/query module, the same architecture is usable everywhere. On an NC for example, there might be no persistent store. The meta content locally available might be just kept in memory. On a Mac or PC, it might be stored in a small database program and on the server, it might be stored in a big SQL 3 database.

Incremental Adoption

MCF can be adopted at a wide range of levels.

- Not all IMAs that a user (or organization) interacts with need to use MCF for the user to realize the benefits. Of course, the value of MCF increases as more and more of the information organization structures are MCF accessible.
- Unlike the adoption of component architectures (like OLE or OpenDoc), where an application is either compliant or not, IMAs can incrementally adopt MCF.
- Given the spate of new technologies, it is important that the upfront cost of adopting a new technology be very low for developers. MCF has been designed with this concern in mind. In the extreme, the developer of an IMA (such as a mail program) can make its mailbox structure accessible via MCF, a task which should take less than a couple of hours, and immediately provide the user the ability to view the mailbox structure using any of the MCF based viewers.

MCF and existing formats

We already have many different formats (some open and many proprietary) for exchanging and storing meta content. Each of these is closely tied to some particular narrow application. It is not our intent to replace any of these. Fortunately, MCF is expressive enough to contain the information in the existing formats. By using translators and the layering mechanism, the information in these extant formats will be assimilated into the richer MCF structures. So, for example,

- File system structures (such as HFS) can be dynamically translated and made accessible via the MCF Query APIs.
- Information in meta content interchange formats such as SOIF, Marc and SiteMap can be translated into MCF stores and then made accessible via the MCF Query APIs.

Unfortunately, none of the existing widely used meta content interchange file formats has the expressiveness to fully exploit MCF. It would be desirable to have an interchange format which can evolve with MCF. Therefore, one of the parts of MCF, the Meta Content Framework, is the Meta Content File Format (MCFF). (The specification for MCFF can be found at mcf.html.)

MCF-based Functionalities

All the user functionality with MCF is provided by MCF based utilities. Some applications (such as the ProjectX application) might encapsulate a bunch of these different utilities into a closed package. But we expect that in future, MCF utilities can be used by many different IMAs. In this section, we describe some of the MCF utilities that are currently under development.

Viewers

Today we use many different viewers to view different kinds of structures. The viewer used is purely a function of the IMA --- each IMA has its own viewer. Ideally, the viewer used should be a function of the user's preferences and properties of the data (such as the density of the graph being viewed) and not of whether one is viewing files or email messages. Furthermore, the user should be able to use multiple metaphors for viewing the data, flipping between different viewers to get a better feel for it.

MCF Viewers are viewers for MCF structures. The entities in the structure might denote any kind of information object. The viewer does not care about this. When the user wants to perform some action on the content, such as edit it, the viewer asks the IMA to perform it.

In addition to generic structure browsers such as outliners, HotSauce-like flythroughs, hyperbolic graphs and the myriad layouts of graphs, some of the interesting viewers being worked on include,

- HTML based viewers : The nodes of a structure together with annotations can be used to generate HTML (whose attributes can be controlled with style sheets) and a standard HTML viewer can act as the viewer engine. This provides a very clean and cross platform way of viewing not just file systems but email messages, etc. as HTML pages.
- Domain specific viewers : Often, a collection of information has no interesting hierarchical structure, but uses other clustering techniques. Consider trip reports, http logs and other such information which have location associated with them. A geographical viewer can be used to lay out the different objects based on their geographical location.
  Another useful way of viewing certain kinds of information collections, especially daily schedules and historical events, is based on a time line.
  Of course, such viewers cannot replace a more standard viewer such as an outliner. But in a multiple viewing metaphor system, it is not a question of exclusively choosing one viewer over another, but of having the option of occasionally viewing your information using a certain kind of viewer.
- Interactive guided tours as viewers : The traditional model of a viewer is that the user is presented with some kind of landscape which she navigates. In certain cases, such as web sites that the user is visiting for the first time, an interactive guided tour where the user is taken along a certain route (but where she can change the route as she goes along) might be more interesting.
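As an illustration of the HTML based viewer idea, the following sketch turns a hierarchy into nested HTML lists that any standard browser can display. The `to_html` function and the dictionary representation are assumptions made for this example.

```python
from html import escape
from typing import Dict, List


def to_html(tree: Dict[str, List[str]], root: str) -> str:
    """Render the hierarchy under `root` as a nested HTML list item."""
    label = escape(root)  # node names may contain characters such as & or <
    children = tree.get(root, [])
    if not children:
        return f"<li>{label}</li>"
    items = "".join(to_html(tree, child) for child in children)
    return f"<li>{label}<ul>{items}</ul></li>"


mailboxes = {"Inbox": ["Q&A", "Personal"]}
page = "<ul>" + to_html(mailboxes, "Inbox") + "</ul>"
```

Styling is left to style sheets, so the same generated markup could be presented in different ways without changing the generator.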

Many kinds of viewers have been written before to work on proprietary, non-standard representations. The viewer developers have to evangelize their formats or produce tools that author in their format. By using a standard such as MCF, we enable the same kind of explosion in tools and utilities that HTML created. The viewer developers don't have to worry about getting content and the content authors just sit back and watch new and exciting visualizations of their information.

Shared Maps

As the amount of information grows, text search engines are becoming essential. Text search engines are very attractive because they do not depend on any content annotation. However, there are some very hard limitations on how well these engines can perform. Most of their limitations arise from their extremely shallow understanding of the content. Maps (such as Yahoo! and Excite) are much more useful, but require manual building which is expensive.

The Sharing utility allows a group of people to work jointly on a distributed MCF database representing the map.

A good use of this is in the intranets of large organizations. It would be nice if each company could afford a team which surfed the intranet and built up a Yahoo like structure of the company's information resources. Unfortunately this is too expensive.

Now imagine a basic Yahoo like structure with two differences :

- it is not of the Internet, but of the company's HR or Finance functions
- there are no leaf/content nodes.

When someone in the organization creates a page that belongs to one of the subject categories in this tree, they simply drop it under that category and everyone else sees the addition. So, in a sense, this is like "barn-raising" a Yahoo like structure. In particular, it leverages the authors' understanding of the content to vastly reduce the cost of building maps.

The basic structure for many functions such as Human Resources and Facilities can be shared across organizations, thereby further reducing the cost of getting started.

Personal Channels

There has been a lot of recent interest around the concept of channels : content companies broadcast a small fixed number of channels of information and users can tune into one or more of these and receive the information.

While this model does have many advantages, it misses out on one of the web's great features : anyone who could afford some disk space at an ISP could become a publisher. Channel publishing is a significantly bigger undertaking than putting up a web site and it is unlikely that even relatively busy web sites such as www.apple.com will add channel provision capabilities.

Personal Channels is an attempt to combine the best of the current web with channels. Every MCF query (e.g., pages belonging to the Excite category on the 49ers and which have scores) defines a channel. The channel contents are obtained by sending this query to MCF Query Service Providers. These providers use traditional robots to collect meta content about pages to build an MCF database which is used to answer these queries. Whenever there is a new page that matches the query, it will be delivered to the user.
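The idea that a query defines a channel can be sketched as follows. The `Channel` class, the meta content keys ("category", "hasScores") and the example.com URLs are all hypothetical; a real provider would evaluate an MCF query against its own database instead of a Python predicate.

```python
from typing import Callable, Dict, List, Set

Page = Dict[str, object]  # meta content collected for one page


class Channel:
    """A channel is a stored query plus a record of what was already delivered."""

    def __init__(self, matches: Callable[[Page], bool]) -> None:
        self.matches = matches
        self.delivered: Set[str] = set()

    def poll(self, index: List[Page]) -> List[str]:
        """Return the URLs of matching pages that have not been delivered yet."""
        fresh = [str(page["url"]) for page in index
                 if self.matches(page) and page["url"] not in self.delivered]
        self.delivered.update(fresh)
        return fresh


def niners_with_scores(page: Page) -> bool:
    return page.get("category") == "49ers" and page.get("hasScores") is True


index: List[Page] = [
    {"url": "http://example.com/recap", "category": "49ers", "hasScores": True},
    {"url": "http://example.com/roster", "category": "49ers", "hasScores": False},
]
channel = Channel(niners_with_scores)
first = channel.poll(index)

# A robot later adds meta content for a new page that matches the query.
index.append({"url": "http://example.com/week2", "category": "49ers", "hasScores": True})
second = channel.poll(index)
```

Each poll delivers only pages the user has not yet received, which is what makes the query behave like a channel rather than a one-time search.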

Meta content (such as the Yahoo and/or Excite categories that the author feels his page belongs to, or more detailed meta content like the kind of information found on that page) can either be embedded within the page using the meta content annotation extensions in HTML 3.2 or separately in a robots.txt file. Alternately, text indexing based pigeon holing techniques can be used to automatically place pages into subject categories.

The same query can be sent to multiple MCF Query Service Providers, who will likely be the same as today's Internet search service providers. In addition, real channel providers can also respond to the MCF query, personalizing the information they deliver based on the query.

In the current model, the web contains pages that you go to, while channels carry information that comes to you. With personal channels, the web is a sea of pages that either come to you or you can go to.

Here are a few examples of personal channels that illustrate the potential of this concept.

- Stock ticker : Stock tickers are a kind of channel. Stock ticker programs typically specify a fixed set of stocks. With Personal Channels, you can have stock tickers like "the 5 high tech. stocks that have changed most today".
- Email channels : Email can be abstracted as a very specialized case of Personal Channels. It is a channel of email messages where the recipient includes you. Now imagine splitting your inbox into several different containers, based on other criteria like who it is from, who it is really addressed to, etc. This might possibly be a way of dealing with the problem of mailing lists that we don't want to deal with on a day to day basis but do want occasional access to.

MCF Maintenance Utilities

In the previous section we described some utilities based on the availability of MCF descriptions of information to provide services to the user. In this section we describe some utilities that help and add value to the MCF descriptions themselves.

Persistence

Most IMAs have a mechanism for storing their meta content in a persistent fashion. If the IMA uses MCF not just to export its meta content but also for its own use, the MCF persistence utility can be used instead. This will take some of the burden off the IMAs.

The persistence utility is defined by a set of persistence APIs. The actual storage and retrieval can be implemented differently on NCs, PCs and servers. For example, on a server one might use an object relational database while on a PC one might use a lighter weight solution.

Since the information organization structures of the different IMAs are stored in a single place, there is no reason to separate them into air tight compartments based on which IMA deposited them there. It makes much more sense to have divisions based on content, tasks the information is used for, etc.

The persistence utility, in effect, provides the user with a single unified view into the meta content for all the information available from a machine. Having a single unified view automatically provides integration between the web, email, desktop files, etc.

MCF and Compound Document Models

In the longer run, the persistence and viewer utilities illustrate a trend that MCF facilitates, i.e., a vertical factoring of IMAs. There is a strong analogy here between Compound Document architectures (such as OLE and OpenDoc) and MCF. Compound document architectures simplify document authoring applications. Instead of building every conceivable feature into a monolithic application, they view the basic document as a shell into which different components plug in. This plugging in can take place because the different pieces of the document share the same "document piece" model that allows the components to negotiate and share user events, document resources such as space, etc. IMAs, in contrast, have to worry not so much about document models as about information space models. MCF provides the common information space model and the different MCF utilities can work together because they share the same information space model.

It is important to note that MCF is neither in competition with nor dependent on architectures such as OpenDoc and OLE. If they are available, MCF utilities can exploit them. The set of services MCF provides is complementary to what OpenDoc/OLE provide.

We hope that MCF will allow for lighter but more flexible IMAs. We also hope that the partial adoption allowed by MCF (which is not supported by the compound document architectures) will allow for a much speedier adoption of MCF.

Heuristic Co-identification of Objects Across Heterogeneous Information Sources

As mentioned earlier, MCF is intended to work with existing sources of information structures. It also enables structures from different sources to be integrated.

Consider integrating the information about people from your contact manager and from your email address book. Most likely, there will be duplication of information. The same people will appear in both sources, with overlapping pieces of information about them. It would be good to properly integrate the information so that we don't have two distinct entries for the same person. Simple name matching will clearly not be adequate. The Heuristic Co-identification utility uses background domain knowledge about people, organizations, etc. to co-identify objects from different information sources based on the attributes that are available.
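A very small sketch of what such a heuristic might look like is given below. The rules used here (matching email addresses decide; otherwise the same organization plus the same first initial and last name) are assumptions chosen for illustration and are far simpler than the background knowledge the utility would use. The record fields and sample people are hypothetical.

```python
from typing import Dict, Tuple

Record = Dict[str, str]


def name_key(name: str) -> Tuple[str, str]:
    """Reduce a non-empty name to (first initial, last name), ignoring case and periods."""
    parts = name.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())


def same_person(a: Record, b: Record) -> bool:
    """Heuristically decide whether two records describe the same person."""
    email_a, email_b = a.get("email"), b.get("email")
    if email_a and email_b:
        # When both sources give an email address, let it decide.
        return email_a.lower() == email_b.lower()
    same_org = a.get("org") is not None and a.get("org") == b.get("org")
    return same_org and name_key(a["name"]) == name_key(b["name"])


contact_manager_entry = {"name": "Patricia Morgan", "org": "Acme"}
address_book_entry = {"name": "P. Morgan", "org": "Acme", "email": "pm@acme.example"}
unrelated_entry = {"name": "Paul Morgan", "org": "Globex"}
```

Plain name matching would miss the first pair, since the strings differ; the attribute based rule co-identifies them while keeping the unrelated entry apart. A rule this crude would also merge two different people with the same initial and last name at one organization, which is why richer domain knowledge is needed in practice.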

MCF and Structured Content

In order to adequately describe information organization structures, MCF allows objects representing entities such as people, organizations and projects to be first class citizens on the same level as files, folders and web pages.

With this important addition, the already blurry distinction between meta content and content gets even thinner. MCF is a general purpose structure description language. In addition to the syntax and semantics, it also provides a standard vocabulary for describing common objects such as people, organizations, meetings, etc.

It is this second aspect of MCF, as a lingua-franca schema for integrating different information sources, that gains prominence in this context. It is important to distinguish a lingua-franca schema from a universal standard schema. There is little chance of everyone using the same database schemas. MCF is not a standard schema. MCF provides a framework with which data in one schema can be automatically and dynamically converted into data in a very different schema.

An important caveat is in order here. The level of effort in standardizing the vocabulary to describe information bearing objects, though not trivial, is something the internet community is very used to. In contrast, as we start using MCF for content itself, the size of the domain and therefore the effort required in standardization goes up by a couple of orders of magnitude. Of course, given MCF's extensibility, users can very easily extend areas of MCF where standardization has not yet occurred, but doing this on a widespread basis could defeat the purpose of standardization.

BabelFish

BabelFish is a prototype program that illustrates the use of MCF as schema translation middleware.

This is what happens traditionally when a user has a question that requires data from multiple heterogeneous information sources : Typically, the user does not know which data sources need to be consulted to answer the question, let alone the data formats of these sources. A data administrator who is familiar with the semantics and formats of the schemas of the tables that need to be accessed writes a piece of SQL which answers the query. Much of the expense is in collecting the schema information required to write the SQL.

BabelFish uses a machine understandable language (MCF) for describing the semantics and data formats of tables. Some important points to note about these descriptions:

- They capture not just format information (such as field st1 is an integer and field st2 is a character field) but the semantics of the table (such as "field st1 has the social security number of the person whose address is in field st2").
- They have to be provided only once per table and not once per query. They have to be changed only when the schema of the table changes.
- The MCF descriptions of different tables can be generated independently, without any central coordination. The central coordination is provided in effect by the use of a common vocabulary.
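The role of such descriptions can be sketched with a toy translator. This is not how BabelFish is implemented; it only shows how a per-table mapping from columns to shared vocabulary terms lets a query stated in those terms be turned into SQL. The table names, the vocabulary terms and the `to_sql` function are hypothetical, and the sketch handles only queries that a single table can answer.

```python
from typing import Dict, List, Optional

# Hypothetical descriptions: table name -> {column name: shared vocabulary term}.
descriptions: Dict[str, Dict[str, str]] = {
    "emp_tbl": {"st1": "socialSecurityNumber", "st2": "address"},
    "phone_tbl": {"id": "socialSecurityNumber", "ph": "phoneNumber"},
}


def to_sql(wanted: List[str]) -> Optional[str]:
    """Translate a list of vocabulary terms into SQL against one covering table."""
    for table, columns in descriptions.items():
        by_term = {term: column for column, term in columns.items()}
        if all(term in by_term for term in wanted):
            select = ", ".join(f"{by_term[term]} AS {term}" for term in wanted)
            return f"SELECT {select} FROM {table}"
    return None  # no single table covers the request


sql = to_sql(["socialSecurityNumber", "phoneNumber"])
```

The caller names only what it wants; which table and which column names to use are worked out from the descriptions, and each description is written once per table.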

BabelFish accepts MCF Queries, which state what the user wants, but not where or how to look (i.e., the MCF query does not contain any information about which tables to look in or what joins need to be done) and translates them into the appropriate SQL queries that contain the where and how. It does this by using a combination of background domain knowledge and the MCF descriptions of the tables available. The SQL generated will account for differences such as,

Different tables might use different codes for the same concept. E.g., one of them might use codes such as "01", "02" for states and another one might use codes such as "CA", "TX" for the same states.
If there is any way at all of answering the query given the tables available, BabelFish is guaranteed to find the answer.

For example, it might be difficult to do joins across tables that were not designed to be joined. As an extreme, they might have no fields in common, but they may still be "essentially joinable" by bringing a third table, that has overlaps with both the tables, into the picture. If such a path exists, BabelFish will find it.
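Finding such a path can be viewed as a graph search, sketched below with a breadth-first search over tables that share at least one concept. This is an illustration of the idea under our own assumptions, not a description of BabelFish's actual algorithm; the table names, concept names and the `join_path` function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def join_path(tables: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return a shortest chain of tables linking `start` to `goal`, or None.

    Two tables are considered joinable when their descriptions share a concept.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for name, concepts in tables.items():
            if name not in seen and concepts & tables[path[-1]]:
                seen.add(name)
                queue.append(path + [name])
    return None


tables = {
    "payroll": {"socialSecurityNumber", "salary"},
    "directory": {"employeeName", "phoneNumber"},
    "hr": {"socialSecurityNumber", "employeeName"},
    "parking": {"lotNumber"},
}
path = join_path(tables, "payroll", "directory")
```

Here the payroll and directory tables have no concept in common, but the hr table overlaps with both and serves as the bridge.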

Note that the SQL generated by BabelFish is passed on to a SQL processing engine such as SQLNet or SQLConnect. BabelFish is not in the business of processing the SQL. BabelFish's architecture also allows it to generate queries using query languages other than SQL.

The two important benefits provided by BabelFish are,

- Dynamic and distributed integration : Imagine if every time we wanted two machines to exchange packets, the network administrators of the two machines had to talk to each other. This is exactly how it is today with interchange of information across heterogeneous databases. With BabelFish, two databases that were designed and built independently could be exchanging information without any human intervention. The implications of this for Electronic Data Interchange are significant.
- Content Level Querying : The MCF Query does not make any assumptions about the data sources to be used, much less about their schemas. This means that if an application uses MCF queries instead of SQL, we can go ahead and change the schemas of the back end data sources without changing or affecting the application. So long as we update the MCF descriptions of the tables and the new tables contain the information that the application needs, the appropriate new SQL will be dynamically generated. The implications of this for "future proofing" are significant.

Summary

Computers are evolving from word processing devices to windows into the world of information. Consequently, the infrastructures for accessing and organizing this information need to evolve.

MCF is a rich, open, extensible language for describing information organization structures. Information management systems that use MCF can provide many useful and interesting functionalities such as the integration of information from an open-ended list of sources (desktop, web, email, etc.) that can be viewed using different metaphors (tree views, web views, flythroughs, etc.).