Meta Content Framework : A Whitepaper
R.V.Guha
(guha@guha.com)
Note: This is an old (1996) paper I wrote about MCF. Most of it is still
relevant.
Organizational structures for content, such as file systems, emailbox hierarchies,
WWW subject categories, people directories, etc. are almost as important as content
itself. The goal of the Meta Content Framework (MCF)
is to provide the same benefits for these structures as HTML
provided for content.
This is a white paper that describes the basic concepts and ideas behind MCF.
This paper is not intended to be a specification of MCF. Instead, it provides
the background context for MCF and outlines the user values delivered. The specification
can be found in the MCF spec.
Background Context
The Shift in Focus
The last few years have seen a significant change in what we consider productivity
applications. The focus has shifted from authoring and analysis
applications such as word processors and spreadsheets to information
management and communications applications (IMA) such as email. In addition, with increasing disk sizes,
our dependence on the traditional IMAs such as the desktop file manager has increased.
The Problem
We use a number of different IMAs --- browsers, email, newsreaders, file
systems, etc. --- to manage our information. They each handle collections of
information objects : web pages, images, email
messages, files, folders, etc. Each of these IMAs has an information organizing structure.
Much of the time spent in front of computers is in manipulating or creating
these structures. The structures form the
skeleton for the organization of our information. These structures are not
the final content, but are meta content.
Each IMA uses its own representations for these structures and provides its own
utilities for viewing and manipulating them. Furthermore, the structures in use today are
very simple and inexpressive. They don't allow us to represent very much
about the content. The structure is typically a tree or graph with a very limited number of
attributes such as the author, modification date and size provided as annotations on the nodes.
We claim that the lack of an expressive, open standard for representing
these structures is at the root of many of our information management problems.
In fact, we have become so accustomed to these problems that we hardly even
regard them as problems any more. For example, our information today is
divided up into separate containers such as email, files, web pages, etc.
This division is based not on what the content is about or which tasks they
are relevant to, but on which protocol is used to access/manipulate them.
Efforts to remove these boundaries typically involve trying to
absorb one into another and are fairly specific to certain platforms (the
best example of this is the attempt in Windows 97 to absorb the web into
the Win97 desktop.)
MCF Goals
The goal of MCF is to abstract and standardize the representation
of the structures we use for organizing information. In addition to the
usual benefits of open standards, this will also allow for information
management (IM) utilities such as viewers to work across different IMAs.
Let us illustrate this concept with a very simple example.
Consider the abstract notion of a hierarchy. We find
it in file systems, mailbox structures, usenet newsgroups, etc. We can
use the same representation for hierarchies irrespective of whether the
nodes of the hierarchy represent files, email messages or news postings.
Once these hierarchies are accessible using the common representation, a
hierarchy viewer which operates on this representation can be used
by file system browsers, email programs, news readers, etc.
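As a minimal sketch of this idea, the Python below renders a file hierarchy and a mailbox hierarchy with the same viewer code. The `Node` type and the `outline` function are hypothetical names chosen for illustration; they are not part of the MCF specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical common representation of one node in a hierarchy."""
    label: str
    kind: str  # e.g. "folder", "file", "mailbox", "message"
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render a hierarchy as indented lines, ignoring what the nodes denote."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# The same viewer works on a file system tree and on a mailbox tree.
files = Node("Documents", "folder", [Node("memo.txt", "file")])
mail = Node("Inbox", "mailbox", [Node("Re: budget", "message")])

file_view = outline(files)
mail_view = outline(mail)
```

The viewer never inspects `kind`, which is the point: what the nodes stand for is left to the IMA.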
It is important to note that we have not made any assumption about the
encoding of the content itself. The actual files, email messages, etc. could
be in HTML, Word or any other content encoding.
Our focus is on the representation of the meta content structures. The representation
is specified both in terms of the representation model and in terms of a query language.
In order to write IM utilities, we will also need APIs for accessing resources
like memory and screen space. Most likely, there will be
multiple versions of these, based on OpenDoc, Java, ActiveX, etc. Hopefully,
object wrapper technology will allow us to ignore these differences.
MCF based IM utilities (viewers, filters, persistent stores, etc.) can be used
by IMAs such as email programs and file systems. At the extreme, the IMA might
be nothing more than a shell such as a browser or desktop with most of the functionality
being provided by these utilities. This is analogous to the browser
acting as the shell for the different helper apps such as image displayers,
plugins and applets.
The Central Concept
The central concept is the use of rich, standard, structured, extensible, compositable descriptions
of information organization structures as the core of information management systems.
We now explain what we mean by each of these terms.
- Rich Descriptions:
Structures such as file systems record very little information. They typically
allow a tree structure together with a small number of attributes such as the author,
creation date, modification date and size. We cannot, for example, tell the machine
that a certain file is a memo written in response to a certain email message. Furthermore,
the machine's ontology (the set of things it knows about) is severely restricted. Even
though much of the content is about people, organizations, places, etc. the machine
only knows about files, folders, messages and such. MCF allows for semantically rich
descriptions of content and its relationship to objects such as people, organizations
and events.
This is an extremely important point. In order to adequately describe information
organization structures, the system needs to be able to express/reference more
than just machine internal objects such as files, folders and email messages. Entities
such as people, organizations and projects need to be first class citizens as well and MCF
provides for this.
- Structured Descriptions:
MCF descriptions are structured. The distinction between structured and unstructured
descriptions is the same as the distinction between a relational database and a text
file. Structured descriptions take more work in creating, but support sophisticated
queries and analysis. MCF is concerned only with machine understandable descriptions and hence the focus
on structure.
This is not to say that tools such as text search engines have no role to play in
MCF. Quite the contrary. These engines often serve as "sense making tools" that
induce structure amongst unstructured content. MCF is an ideal target language for
this structure. MCF does not care about how the structure was generated, manually or
automatically.
- Standard Descriptions:
MCF descriptions are standardized at two levels. MCF provides a standard language
for accessing and manipulating meta content descriptions just like SQL provides a
standard query language for relational databases. In addition, MCF also provides a
standardized vocabulary so that different sources/programs use the same terms (such
as "author", "fileSize", etc.) to refer to the same concepts. Different SQL databases
need not use the same field names and data formats to represent the same concept. E.g.,
different tables might use different field names and formats for phone numbers. This
is precisely what makes it so difficult to integrate data from disparate databases.
Integrating meta content from disparate sources is a lot more important than integrating
data from disparate relational databases. And so, to make the integration go more
smoothly, MCF provides a growing set of
standard terms.
- Extensible Descriptions:
The first and third requirements --- rich and standard --- are somewhat at odds with
each other. The list of standard terms cannot cover everything that can be stated using
MCF and at some point someone (or some program) will want to express something not
covered in the standard vocabulary. So, in addition to the standard terms, programs
can introduce new terms to express new kinds of meta content. Furthermore, this extension
can happen dynamically and apply to older objects as well (to continue with the
database analogy, this would be like adding a field to a relational table on the fly.)
While this does impose some fairly strong demands on performance, we feel that this
level of extensibility is essential.
- Compositable Descriptions:
We often need to have multiple layers of descriptions, each adding to or modifying
lower layers. This capability can be used to:
- allow divisions, departments and finally end users to create personalized views of
partially shared information spaces. The user's changes to a space modify his personal
layer. The view he sees is the composition of the more global layers together with
his changes. When the global layers get modified, the user's view automatically changes
in a fashion that is consistent with the user's changes.
This, for example, allows a user to create his own personalized view of Yahoo! with
additions from his desktop, email, other web hierarchies
and certain of the Yahoo! categories deleted. When Yahoo! changes, these changes
are absorbed into the user's view of Yahoo!, in a fashion that is maximally consistent
with the changes he has made.
- integrate web pages, email, etc. into existing structures such as the file system
structure without making any modifications to the file system. To do this, we abstract
the existing file system as a layer and then add a layer which holds the (meta content
for) the additional information. This, for example, provides a clean and cross platform
mechanism for desktop-web integration. This kind of integration is also symmetric in that
it allows you to regard, say one of Yahoo!'s subject categories as a folder and put your
desktop files in it.
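To make the "rich" and "extensible" properties in the list above more concrete, here is a small sketch in Python. It assumes descriptions are stored as (slot, subject, value) tuples. The slot `inResponseTo` and the identifiers `person:alice` and `email:1234` are invented for illustration, while `author` and `fileSize` are the example terms mentioned earlier.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (slot, subject, value)

# Terms assumed to be in the shared standard vocabulary (illustrative only).
STANDARD_SLOTS: Set[str] = {"author", "fileSize"}

descriptions: Set[Triple] = {
    ("author", "memo.txt", "person:alice"),  # a person is a first class object
    ("fileSize", "memo.txt", "2048"),
}


def describe(slot: str, subject: str, value: str) -> None:
    """Record a description; slots outside the standard set are accepted as extensions."""
    descriptions.add((slot, subject, value))


# A program introduces a new term on the fly and applies it to an older object:
# the memo was written in response to a certain email message.
describe("inResponseTo", "memo.txt", "email:1234")

extension_slots = {slot for slot, _, _ in descriptions} - STANDARD_SLOTS
```

Nothing about the existing entries had to change when the new slot appeared, which is the database analogy of adding a field to a table on the fly.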
The MCF Model and Query API
In this section, we briefly describe the MCF model and query API. MCF is based
on predicate logic and is hence very close to both object oriented and relational databases.
You can safely skip this section and go here.
An MCF database consists of two parts:
- a set of objects (called units). Strings, numbers and other "native"
datatypes are also considered objects. One subset of these objects is called
slots (or predicates). Another subset is called layers. The layers are arranged
in a total order.
- a set of n-tuples (typically triples), each consisting of a slot and an ordered list of
n-1 object references and a layer. These tuples are called assertions. Each assertion
also has a true/false value associated with it. Assertions are said to be true/false in the
layer associated with them. An assertion that is true/false in a layer
is also true/false in all the superior layers, unless one of those also contains the
assertion with a different true/false value.
Since the layers themselves are units, the relation between the layers themselves
is expressed as assertions. These assertions are in the BaseLayer, a special layer
that is at the bottom of the total order.
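The layering rule can be sketched in a few lines of Python. This is only one reading of the model, under the assumption that when several layers mention the same assertion, the nearest layer at or below the viewing layer wins. The layer names `Shared` and `Personal` and the slot `subCategory` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Assertion = Tuple[str, ...]  # (slot, object, object, ...)

# Layers listed from the bottom of the total order to the top.
LAYERS: List[str] = ["BaseLayer", "Shared", "Personal"]

# store[layer][assertion] = truth value asserted in that layer
store: Dict[str, Dict[Assertion, bool]] = {layer: {} for layer in LAYERS}


def assert_in(layer: str, assertion: Assertion, value: bool = True) -> None:
    """Record an assertion with a truth value in one layer."""
    store[layer][assertion] = value


def truth(assertion: Assertion, layer: str) -> Optional[bool]:
    """Truth value as seen from `layer`, inherited from lower layers unless overridden."""
    top = LAYERS.index(layer)
    for name in reversed(LAYERS[: top + 1]):
        if assertion in store[name]:
            return store[name][assertion]
    return None  # not asserted in this layer or any layer below it


football = ("subCategory", "Sports", "Football")
assert_in("Shared", football, True)     # the shared map contains the category
assert_in("Personal", football, False)  # the user has deleted it from his view
```

Seen from `Shared` the assertion is true; seen from `Personal` it is false; seen from `BaseLayer` it is not asserted at all.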
The MCF Query API provides access to 3 (5) functions.
- Create (Destroy) : Create a new object (destroy an existing object).
- Assert (Unassert) a given assertion in a given layer with a specific truth value.
- Query : Given a boolean combination of assertions, some of which have variables
in the place of one or more objects, return bindings for these variables. The query
can either specify a layer or default to the top most layer.
The API has been kept minimalistic in order to make it easy to learn. It is also
obvious from the model that MCF is highly reflective. Since both slots and
layers are first class objects/units, the representations can be very dynamic.
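The following Python sketch shows what a minimal version of these functions might look like. It is an illustration under simplifying assumptions, not the MCF API: there is a single layer, every assertion is a true triple, queries are conjunctions only (no general boolean combinations), and terms beginning with `?` are treated as variables. The class and method names are invented.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (slot, object, object)
Bindings = Dict[str, str]


def unify(pattern: Triple, fact: Triple, env: Bindings) -> Optional[Bindings]:
    """Match one pattern against one fact; terms starting with '?' are variables."""
    env = dict(env)
    for term, value in zip(pattern, fact):
        if term.startswith("?"):
            if env.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None
    return env


class TinyStore:
    """Hypothetical single-layer store offering create, assert/unassert and query."""

    def __init__(self) -> None:
        self.objects: Set[str] = set()
        self.assertions: Set[Triple] = set()

    def create(self, name: str) -> str:
        self.objects.add(name)
        return name

    def assert_(self, triple: Triple) -> None:
        self.assertions.add(triple)

    def unassert(self, triple: Triple) -> None:
        self.assertions.discard(triple)

    def query(self, patterns: List[Triple]) -> Iterator[Bindings]:
        """Yield every set of variable bindings that satisfies all patterns."""
        def solve(remaining: List[Triple], env: Bindings) -> Iterator[Bindings]:
            if not remaining:
                yield dict(env)
                return
            for fact in sorted(self.assertions):
                new_env = unify(remaining[0], fact, env)
                if new_env is not None:
                    yield from solve(remaining[1:], new_env)
        return solve(patterns, {})


db = TinyStore()
for name in ("memo.txt", "notes.txt", "alice"):
    db.create(name)
db.assert_(("author", "memo.txt", "alice"))
db.assert_(("author", "notes.txt", "alice"))
db.assert_(("type", "memo.txt", "memo"))

# "Which documents authored by alice are memos?"
results = list(db.query([("author", "?doc", "alice"), ("type", "?doc", "memo")]))
```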
This section of course does not touch on the other part of the MCF specification, i.e.,
the standard list of terms. The beginnings of this list are at
vocab.html.
The MCF Environment
The environment in which the MCF utilities operate provides the MCF query API.
With this API, the utilities get a single uniform view of the information space
accessible from that machine.
It is important to note that the MCF APIs do not imply a real MCF database sitting
on the local disk. It is possible for the APIs to work without any MCF store. More often than not, there will only be a "virtual" MCF database
or the database might reside in memory and get saved out on a server.
E.g., rather than duplicate the file system structure
into a different store, the APIs might access the local file system and dynamically
make the file structure accessible via the MCF API.
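A rough Python sketch of such dynamic translation follows. It assumes assertions are (slot, object, object) tuples and uses the illustrative slot names `parent` and `fileSize`; the function name is hypothetical. Nothing is copied into a separate store, since the file system is read only as the caller iterates.

```python
import os
import tempfile
from typing import Iterator, Tuple

Triple = Tuple[str, str, str]


def file_system_assertions(root: str) -> Iterator[Triple]:
    """Translate a directory tree into assertions on demand."""
    for folder, subfolders, files in os.walk(root):
        for name in subfolders:
            yield ("parent", os.path.join(folder, name), folder)
        for name in files:
            path = os.path.join(folder, name)
            yield ("parent", path, folder)
            yield ("fileSize", path, str(os.path.getsize(path)))


# Usage: build a throwaway directory and view it through the translator.
with tempfile.TemporaryDirectory() as root:
    memo_path = os.path.join(root, "memo.txt")
    with open(memo_path, "w") as handle:
        handle.write("hello")
    facts = set(file_system_assertions(root))
```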
Most likely, there will be a combination of an MCF store, dynamic translation and
remote query processing, with the different answers getting integrated via the layers
mechanism. Of course, the MCF utility need not worry about any of this.
Some Salient Aspects of MCF
Scalability
MCF is not inherently desktop or server based. It is meant to be a fully scalable architecture
from palmtops to enterprise servers. By replacing components such as the storage/query
module, the same architecture is usable everywhere. On an NC for example, there might
be no persistent store. The meta content locally available might be just kept in memory.
On a Mac or PC, it might be stored in a small database program and on the server, it
might be stored in a big SQL 3 database.
Incremental Adoption
MCF can be adopted at a wide range of levels.
- Not all IMAs that a user (or organization) interacts with need to use MCF for the
user to realize the benefits. Of course, the value of MCF increases as more and more
of the information organization structures are MCF accessible.
- Unlike the adoption of component architectures (like OLE or OpenDoc), where an
application is either compliant or not, IMAs can incrementally adopt MCF.
Given the spate of new technologies, it is important that the upfront cost of adopting a new
technology be very low for developers. MCF has been designed with this concern in mind.
In the extreme, the developer of an IMA (such as a mail program) can make its mailbox structure
accessible via MCF, a task which should take less than a couple of hours,
and immediately provide the user the ability to view the mailbox structure
using any of the MCF based viewers.
MCF and existing formats
We already have many different formats (some open and many proprietary) for exchanging
and storing meta content. Each of these is closely tied to some particular narrow
application. It is not our intent to replace any of these.
Fortunately, MCF is expressive enough to contain the
information in the existing formats. By using translators
and the layering mechanism, the information in these extant formats will be assimilated
into the richer MCF structures. So, for example,
- File system structures (such as HFS) can be dynamically translated and made
accessible via the MCF Query APIs.
- Information in meta content interchange formats such as SOIF, Marc and SiteMap
can be translated into MCF stores and then made accessible via the MCF Query APIs.
Unfortunately, none of the existing widely used meta content interchange file formats
has the expressiveness to fully exploit MCF. It would be desirable to have an
interchange format which can evolve with MCF. Therefore, one of the parts of
MCF is the Meta Content File Format (MCFF).
(The specification for MCFF can be found at
mcf.html.)
MCF-based Functionalities
All the user functionality with MCF is provided by MCF based utilities. Some applications
(such as the ProjectX application) might encapsulate a bunch of these different utilities
into a closed package. But we expect that in future, MCF utilities can be used by many different
IMAs.
In this section, we describe some of the MCF utilities that are currently under development.
Viewers
Today we use many different viewers to view different kinds of structures. The viewer
used is purely a function of the IMA --- each IMA has its own viewer. Ideally, the
viewer used should be a function of the user's preferences and properties of the
data (such as the density of the graph being viewed) and not of whether one
is viewing files or email messages. Furthermore, the user should be able to use
multiple metaphors for viewing the data, flipping between different viewers to get
a better feel for it.
MCF Viewers are viewers for MCF structures. The entities in the structure might
denote any kind of information object. The viewer does not care about this. When
the user wants to perform some action on the content, such as edit it,
the viewer asks the IMA to perform it.
In addition to generic structure browsers such as outliners, HotSauce-like flythroughs,
hyperbolic graphs and the myriad layouts of graphs, some of the interesting viewers being worked on include:
- HTML based viewers : The nodes of a structure together
with annotations can be used to generate HTML (whose attributes can be controlled
with style sheets) and a standard HTML viewer can act as the viewer engine. This
provides a very clean and cross platform way of viewing not just file systems but
email messages, etc. as HTML pages.
- Domain specific viewers : Often, a collection of information has no interesting
hierarchical structure, but uses other clustering techniques. Consider trip reports,
HTTP logs and other such information which has a location associated with it.
A geographical viewer can be used to lay out
the different objects based on their geographical location.
Another useful way of viewing certain kinds of information collections, especially
daily schedules and historical events is based on a time line.
Of course, such
viewers cannot replace a more standard viewer such as an outliner. But in a multiple
viewing metaphor system, it is not a question of exclusively choosing one viewer
over another, but of having the option of occasionally viewing your information
using a certain kind of viewer.
- Interactive guided tours as viewers : The traditional model of a viewer is that
the user is presented with some kind of landscape which she navigates.
In certain cases, such as web sites that the user is visiting for the first time,
an interactive guided tour where the user is taken along a certain route (but where she
can change the route as she goes along) might be more interesting.
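As an illustration of the HTML based viewer described in the list above, the Python sketch below turns a structure into nested HTML lists that any standard HTML viewer could display. The `children` mapping and the `to_html` function are assumptions made for this example; style sheets and annotations are left out.

```python
from html import escape
from typing import Dict, List

# Hypothetical structure: each node maps to the list of its children.
children: Dict[str, List[str]] = {
    "Inbox": ["Re: budget", "Lunch?"],
    "Re: budget": [],
    "Lunch?": [],
}


def to_html(node: str) -> str:
    """Render a node and its descendants as nested HTML list items."""
    items = "".join(to_html(child) for child in children.get(node, []))
    nested = f"<ul>{items}</ul>" if items else ""
    return f"<li>{escape(node)}{nested}</li>"


page = f"<ul>{to_html('Inbox')}</ul>"
```

Here the nodes happen to be a mailbox and its messages, but the generator would work unchanged on folders and files.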
Many kinds of viewers have been written before to work on proprietary, non-standard
representations. The viewer developers have to evangelize their formats or produce
tools that author in their format. By using a standard such as MCF, we enable the
same kind of explosion in tools and utilities that HTML created. The viewer developers
don't have to worry about getting content and the content authors just sit back
and watch new and exciting visualizations of their information.
Shared Maps
As the amount of information grows, text search engines are becoming essential. Text
search engines are very attractive because they do not depend on any content annotation.
However, there are some very hard limitations on how well these engines can perform.
Most of their limitations arise from their extremely shallow understanding of the
content. Maps (such as Yahoo! and Excite) are much more useful, but require manual
building which is expensive.
The Sharing utility allows a group of people to work jointly on a distributed MCF
database representing the map.
A good use of this is in the intranets of large
organizations. It would be nice if each company could afford a team which surfed
the intranet and built up a Yahoo like structure of the company's information
resources. Unfortunately this is too expensive.
Now imagine a basic Yahoo like structure with two differences :
- it is not of the Internet, but of the company's HR or Finance functions
- there are no leaf/content nodes.
When someone
in the organization creates a page that belongs to one of the subject categories in this tree,
they simply drop it under that category and everyone else sees the addition. So,
in a sense, this is like "barn-raising" a Yahoo like structure. In particular, it
leverages the author's understanding of the content to vastly reduce the cost in
building maps.
The basic structure for many functions such as Human Resources and Facilities can
be shared across organizations, thereby further reducing the cost of getting started.
Personal Channels
There has been a lot of recent interest around the concept of channels : content
companies broadcast a small fixed number of channels of information and users
can tune into one or more of these and receive the information.
While this model does have many advantages, it misses out on one of the web's
great features : anyone who could afford some disk space at an ISP could become
a publisher. Channel publishing is a significantly bigger undertaking than
putting up a web site and it is unlikely that even relatively busy web sites
such as www.apple.com will add channel provision capabilities.
Personal Channels is an attempt to combine the best of the current web with
channels. Every MCF query (e.g., pages belonging to the Excite category on
the 49ers and which have scores) defines a channel. The channel contents are obtained
by sending this query to MCF Query Service Providers. These providers use
traditional robots to collect meta content about pages to build an MCF
database which is used to answer these queries. Whenever there is a new
page that matches the query, it will be delivered to the user.
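The idea of a query defining a channel can be sketched as a standing query that delivers each matching page once. In the Python below, the `Channel` class, the slot names `category` and `contains`, and the example URLs are all hypothetical. A real provider would answer MCF queries against its own database rather than call a Python predicate.

```python
from typing import Callable, Dict, List, Set

PageMeta = Dict[str, Set[str]]  # slot name -> set of values, for one page


class Channel:
    """A standing query: pages are delivered the first time they match."""

    def __init__(self, matches: Callable[[PageMeta], bool]) -> None:
        self.matches = matches
        self.delivered: Set[str] = set()

    def poll(self, index: Dict[str, PageMeta]) -> List[str]:
        """Return URLs in the provider's index that match and are new to the user."""
        fresh = sorted(url for url, meta in index.items()
                       if url not in self.delivered and self.matches(meta))
        self.delivered.update(fresh)
        return fresh


# The example query from the text: pages in the 49ers category which have scores.
niners_scores = Channel(
    lambda meta: "49ers" in meta.get("category", set())
    and "scores" in meta.get("contains", set())
)

index: Dict[str, PageMeta] = {
    "http://example.com/a": {"category": {"49ers"}, "contains": {"scores"}},
    "http://example.com/b": {"category": {"49ers"}, "contains": {"roster"}},
}
first = niners_scores.poll(index)

# A robot later finds a new page that matches; only that page is delivered.
index["http://example.com/c"] = {"category": {"49ers"}, "contains": {"scores"}}
second = niners_scores.poll(index)
```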
Meta content (such as the Yahoo and/or Excite categories that the author feels
his page belongs to, or more detailed meta content like the kind of information
found on that page) can either be embedded within the page using the meta content annotation
extensions in HTML 3.2 or separately in a robots.txt file. Alternately, text indexing based
pigeon holing techniques can be used to automatically place pages into
subject categories.
The same query can be sent to multiple MCF Query Service Providers, who
will likely be the same as today's Internet search service providers.
In addition, real channel providers can also respond to the MCF query,
personalizing the information they deliver based on the query.
In the current model,
the web contains pages that you go to, while channels carry information that comes to you. With personal channels,
the web is a sea of pages that either come to you or that you can go to.
Here are a few examples of personal channels that illustrate the potential
of this concept.
- Stock ticker : Stock tickers are a kind of channel. Stock ticker
programs typically specify a fixed set of stocks. With Personal Channels,
you can have stock tickers like "the 5 high tech. stocks that have changed
most today".
- Email channels : Email can be abstracted as a very specialized case
of Personal Channels. It is a channel of email messages where the recipient includes you.
Now imagine splitting your inbox into several different containers, based on
other criteria like who it is from, who it is really addressed to, etc.
This might possibly be a way of dealing with the problem of mailing lists
that we don't want to deal with on a day to day basis but do want occasional
access to.
MCF Maintenance Utilities
In the previous section we described some utilities based on the availability
of MCF descriptions of information to provide services to the user. In this section
we describe some utilities that help and add value to the MCF descriptions themselves.
Persistence
Most IMAs have a mechanism for storing their meta content in a persistent
fashion. If the IMA uses MCF not just to export its meta content but also for
its own use, the MCF persistence utility can be used instead. This will take
some of the burden off the IMAs.
The persistence utility is defined by a set of persistence APIs. The actual
storage and retrieval can be implemented differently on NCs, PCs and servers.
For example, on a server one might use an object relational database while
on a PC one might use a lighter weight solution.
Since the information organization structures of the different IMAs are stored
in a single place, there is no reason to separate them into air tight compartments
based on which IMA deposited them there. It makes much more sense to have
divisions based on content, tasks the information is used for, etc.
The persistence
utility, in effect, provides the user with a single
unified view into the meta content for all the information available from
a machine. Having a single unified view automatically provides integration
between the web, email, desktop files, etc.
MCF and Compound Document Models
In the longer run, the persistence and viewer utilities illustrate a trend
that MCF facilitates, i.e., a vertical factoring of IMAs. There is a strong
analogy here between Compound Document architectures (such as OLE and OpenDoc)
and MCF. Compound document architectures simplify document authoring applications.
Instead of building in every conceivable feature into a monolithic application,
they view the basic document as a shell into which different components plug in.
This plugging in can take place because the different pieces of the document share
the same "document piece" model that allows the components to negotiate and share
user events, document resources such as space, etc. IMAs, in contrast, have to
worry not so much about document models but about information space models.
MCF provides the common information space model and the different
MCF utilities can work together because they share the same information space
model.
It is important to note that MCF is neither in competition with nor dependent
on architectures such as OpenDoc and OLE. If they are available, MCF utilities
can exploit them. The set of services MCF provides is complementary to what
OpenDoc/OLE provide.
We hope that MCF will allow for lighter but more flexible IMAs. We also hope
that the partial adoption allowed by MCF (which is not supported by the compound
document architectures) will allow for a much speedier adoption of MCF.
Heuristic Co-identification of Objects Across Heterogeneous Information Sources
As mentioned earlier, MCF is intended to work with existing sources of information
structures. It also enables structures from different sources to be integrated.
Consider integrating the information about people from your contact manager and from your
email address book. Most likely, there will be duplication of information. The same
people will appear in both sources, with overlapping pieces of information about
them. It would be good to properly integrate the information so that we don't
have two distinct entries for the same person. Simple name matching will clearly
not be adequate. The Heuristic Co-identification utility uses background domain
knowledge about people, organizations, etc. to co-identify objects from different
information sources based on the attributes that are available.
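A toy version of such a heuristic is sketched below in Python. The two rules used here (equal email addresses, or equal normalized names at the same organization) are assumptions chosen for illustration, as are the record fields. The actual utility is described only as using background domain knowledge.

```python
from typing import Dict, List

Record = Dict[str, str]


def normalize_name(name: str) -> str:
    """Lower-case a name and put 'Last, First' into 'first last' order."""
    name = name.strip().lower()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.split())


def same_person(a: Record, b: Record) -> bool:
    """Heuristic: equal emails, or equal normalized names at the same organization."""
    if a.get("email") and a.get("email", "").lower() == b.get("email", "").lower():
        return True
    return (normalize_name(a.get("name", "")) == normalize_name(b.get("name", ""))
            and bool(a.get("org"))
            and a.get("org", "").lower() == b.get("org", "").lower())


def merge(contacts: List[Record], address_book: List[Record]) -> List[Record]:
    """Fold address book entries into matching contacts, keeping existing values."""
    merged = [dict(record) for record in contacts]
    for entry in address_book:
        match = next((r for r in merged if same_person(r, entry)), None)
        if match is None:
            merged.append(dict(entry))
        else:
            for key, value in entry.items():
                match.setdefault(key, value)
    return merged


contacts = [{"name": "Smith, Jane", "org": "Acme", "phone": "555-0100"}]
address_book = [
    {"name": "Jane Smith", "org": "ACME", "email": "jane@example.com"},
    {"name": "Bob Jones", "email": "bob@example.com"},
]
people = merge(contacts, address_book)
```

Plain string comparison of the names would have treated "Smith, Jane" and "Jane Smith" as two people; the merged entry instead combines the phone number from one source with the email address from the other.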
MCF and Structured Content
In order to adequately describe information organization structures, MCF
allows objects representing entities such as people, organizations and projects
to be first class citizens on the same level as files, folders and web
pages.
With this important addition, the already blurry distinction between meta content
and content gets even thinner. MCF is a general purpose structure description
language. In addition to the syntax and semantics, it also provides a standard
vocabulary for describing common objects such as people, organizations, meetings, etc.
It is this second aspect of MCF, as a lingua-franca schema for integrating different
information sources, that gains prominence in this context.
It is important to distinguish a lingua-franca schema from a universal standard
schema. There is little chance of everyone using the same database schemas. MCF
is not a standard schema. MCF provides a framework using which data which is in
one schema can be automatically and dynamically converted into data in a very
different schema.
An important caveat is in order here. The level of effort in standardizing
the vocabulary to describe information bearing objects, though not trivial,
is something the internet community is very used to. In contrast, as we
start using MCF for content itself, the size of the domain and therefore
the effort required
in standardization goes up by a couple of orders of magnitude. Of course,
given MCF's extensibility users can very easily extend areas of MCF where
standardization has not yet occurred, but doing this on a widespread basis
could defeat the purpose of standardization.
BabelFish
BabelFish is a prototype program that illustrates the use of MCF as
schema translation middleware.
This is what happens traditionally when a user has a question that requires
data from multiple heterogeneous information sources : Typically, the user
does not know which data sources need to be consulted to answer the question,
let alone the data formats of these sources. A data administrator
who is familiar with the semantics and formats of the schemas of the tables
that need to be accessed writes a piece of SQL which answers the query.
Much of the expense is in collecting the schema information required to write
the SQL.
BabelFish uses a machine understandable language (MCF) for describing the
semantics and dataformats of tables. Some important points to note
about these descriptions.
- They capture not just format information (such as field st1 is an integer and field
st2 is a character field) but the semantics of the table (such as "field st1 has
the social security number of the person whose address is in field st2").
- They have to be provided only once per table and not once
per query. They have to be changed only when the schema of the table changes.
- The MCF descriptions of different tables can be generated independently,
without any central coordination. The central coordination is provided in effect
by the use of a common vocabulary.
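To illustrate how per-table descriptions written against a common vocabulary can stand in for central coordination, here is a deliberately simplified Python sketch. The table names, column names and vocabulary terms are invented, and the translation handles only the single-table case. It does not show how BabelFish itself works.

```python
from typing import Dict, List

# Hypothetical descriptions: for each table, which column holds which
# vocabulary term. Each description is written once per table.
TABLE_DESCRIPTIONS: Dict[str, Dict[str, str]] = {
    "emp": {"socialSecurityNumber": "st1", "address": "st2"},
    "payroll": {"socialSecurityNumber": "ssn", "salary": "amt"},
}


def translate(wanted_terms: List[str]) -> str:
    """Turn a list of vocabulary terms into SQL for the first table covering them all."""
    for table, columns in TABLE_DESCRIPTIONS.items():
        if all(term in columns for term in wanted_terms):
            select_list = ", ".join(f"{columns[t]} AS {t}" for t in wanted_terms)
            return f"SELECT {select_list} FROM {table}"
    raise LookupError(f"no single table describes all of {wanted_terms}")


# The caller says what it wants, not where to look.
sql = translate(["socialSecurityNumber", "salary"])
```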
BabelFish accepts MCF Queries, which state what the user wants, but not
where or how to look (i.e., the MCF query does not contain any information about
which tables to look in or what joins need to be done) and translates them into the
appropriate SQL queries that contain the where and how. It does this by using a combination
of background domain knowledge and the MCF descriptions of the tables available.
The SQL generated will account for differences such as,
- Different tables might use different codes for the same concept. E.g., one of them
might use codes such as "01", "02" for states and another one might use codes such
as "CA", "TX" for the same states.
- If there is any way at all of answering the query given the tables available,
BabelFish is guaranteed to find the answer.
For example, it might be difficult to do joins across tables that were not
designed to be joined. As an extreme, they might have no fields in common, but they
may still be "essentially joinable" by bringing a third table, that has overlaps
with both the tables, into the picture. If such a path exists, BabelFish will find it.
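The search for an "essentially joinable" path can be pictured as a graph search over tables, where two tables are connected if their descriptions share a vocabulary term. The Python below is a sketch under that assumption, using breadth-first search and made-up table and term names. The source does not say which search technique BabelFish uses.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical descriptions: the vocabulary terms each table has a column for.
TABLE_TERMS: Dict[str, Set[str]] = {
    "orders": {"orderId", "customerId"},
    "customers": {"customerId", "stateCode"},
    "states": {"stateCode", "stateName"},
}


def join_path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a chain of tables in which neighbours share a term."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for table in sorted(TABLE_TERMS):
            if table not in seen and TABLE_TERMS[table] & TABLE_TERMS[path[-1]]:
                seen.add(table)
                queue.append(path + [table])
    return None  # the tables cannot be connected with what is available


# "orders" and "states" have no term in common, but "customers" bridges them.
path = join_path("orders", "states")
```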
Note that the SQL generated by BabelFish is passed on to a SQL processing engine such as
SQLNet or SQLConnect. BabelFish is not in the business of processing the SQL. BabelFish's
architecture also allows it to generate queries using query languages other than SQL.
The two important benefits provided by BabelFish are:
- Dynamic and distributed integration : Imagine if every time we wanted two machines
to exchange packets, the network administrators of the two machines had to talk to each
other. This is exactly how it is today with interchange of information across heterogeneous
databases. With BabelFish, two databases that were designed and built independently could
be exchanging information without any human intervention. The implications of this for
Electronic Data Interchange are significant.
- Content Level Querying : The MCF Query does not make any assumptions about the data sources to be
used, much less about their schemas. This means that if an application uses MCF queries
instead of SQL, we can go ahead and change the schemas of the back end data sources
without changing or affecting the application. So long as we update the MCF descriptions
of the tables and the new tables contain the information that the application needs,
the appropriate new SQL will be dynamically generated. The implications of this for
"future proofing" are significant.
Summary
Computers are evolving from word processing devices to windows into the world of information.
Consequently, the infrastructures for accessing and organizing this information need to
evolve.
MCF is a rich, open, extensible language for describing information organization structures.
Information management systems that use MCF can provide many useful and interesting functionalities
such as the integration of information from an open-ended list of sources (desktop, web, email, etc.)
that can be viewed using different metaphors (tree views, web views, flythroughs, etc.).