Building Management Systems
A CM Domain White Paper
By Bob Boiko

This white paper is produced from the Content Management Domain which features the full text
of the book "Content Management Bible," by Bob Boiko. Owners of the book may access the CM
Domain at www.metatorial.com.


Table of Contents

What's in a Management System?
Building a Repository
    Essential and recommended repository functions
    The content model
Storing Content
    Relational database repositories
        Relational database basics
        Storing component classes and instances
            Fully parsing structured text
            Partially parsing structured text
            Not parsing structured text
            Breaking the spell of rows and columns
        Storing access structures
            Hierarchies in a relational database
            Indexes in relational databases
            Cross-references in relational databases
            Sequences in relational databases
        Storing the content model
    XML-based repositories
        Object databases vs. XML
        Storing component classes and instances
        Storing access structures
            Hierarchies in XML
            Indexes in XML
            Cross-references in XML
            Sequences in XML
        Storing the content model
    File-based repositories
Implementing Localization Strategies
Doing Management Physical Design
    A repository-wide DTD
    Link checking
    Media checking
    Search and replace
    Management integrations
Summary

Copyright 2002 Metatorial Services Inc. & HungryMinds Inc.
Do not reproduce without permission.

The management system within a content management system holds and organizes all the
content that you've collected. In addition to storing content, the management system can provide
a full cataloging and administration system for your content and related data.
In this white paper I discuss the variety of databases and functions that you may encounter or
need to create to store and administer your content.

What's in a Management System?
Many CMS companies describe their entire product as a management system. I take a different
tack. For me, although it's of course true that a content management system is a management
system, it's more instructive to focus the term management on the specific parts of the CMS that
deal with the content that's in the system and differentiate them from the other parts of the CMS
that enable you to get content in (collection) and get it out (publication).
The management system within a CMS has these parts:
• A repository: All the content and control files for the system are stored here. The repository
houses databases and files that hold content. The repository can also store configuration and
administrative files and databases that specify how the CMS runs and produces publications.
• A repository interface: This enables the collection, publishing, workflow and administrative
system to access and work with the repository. The interface provides functions for input,
access, and output of components as well as other files and data that you store in the
repository.
• Connections to other systems: This enables you to send and receive information from the
repository.
• A workflow module: This module embeds each component and publication in a managed
life cycle.
• An administrative module: This module enables you to configure the CMS.

In this white paper I focus most on the repository itself to give you a central place from which to
understand management.

Building a Repository
The repository is the heart of the management system and of the CMS as a whole. Into the
repository flow all the raw materials on which any publication is built. Within the repository,
components are stored and can be continually fortified to increase the quality of their metadata or
content. Out of the repository flow the components and other parts that a page of a publication
needs (as shown in Figure 1).

Figure 1: A high-level view of a CMS repository that shows its different parts and the content
storage options that you have
As a first approximation, you can think of the repository as a database. As does a database, a
repository enables you to store and retrieve information. The repository, however, is much more.
For one thing, the repository can house many databases. It can house files as well. It has an
interface to other systems that goes beyond what a standalone database usually does. If you
stand back from the repository and look at it as a single unit, however, most of what you may know
about databases helps you understand the functions of the repository. In fact, most repositories
have a database at their core. The database, however, is wrapped in so much custom code and a
user interface that end users aren't likely to ever see the database.
Note
My discussion implies that all the authoring, conversion, and aggregation are done on a component
before it enters the repository. This is for clarity of presentation only. In fact, the only thing that must
be done to a component before it enters the repository is that it be segmented. Until it's segmented,
there's no component.
You can, and often should, add a component to the repository before it's fully authored,
converted, edited, and has had metadata added to it. After it's in the repository, these processes
can be brought under the control of your workflow module.

Essential and recommended repository functions
At the most basic level, a repository must provide the same functions as any database, as
follows:
• It must hold your content. Whether you employ a vast distributed network of databases or a
simple file structure on a computer under someone's desk, the central function of a
management system is to contain your content in one "place." In addition, the system must
have some way of segmenting content into individually locatable units (such as files or
database records).
• It must enable you to input content. Whether you have tools for loading multiple
components at a time (bulk processing), automatic inputs via syndication, or one-by-one
entries via Web-based forms, the management system must give you some way to get
content in.
• It must enable you to locate content. Whether it employs sophisticated natural language
searches or a simple index, you must be able to find content in the system.
• It must enable you to output content. Whether it supports advanced transformations or
only the simplest tab-delimited format, the management system must enable you to retrieve a
copy of content that you've found in a format you can use.
• It must enable you to remove content. Whether it can archive automatically or whether you
must delete old content by hand, without the capability to remove content, a management
system is inadequate.
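
As a rough illustration only, the five minimum functions can be sketched as a tiny in-memory class in Python. The class name `MiniRepository`, its method names, and the sample text about vision coverage are hypothetical and invented for this sketch; a real repository would sit on a database or file system rather than a dictionary.

```python
import itertools


class MiniRepository:
    """Hypothetical in-memory repository covering the five minimum functions."""

    def __init__(self):
        self._units = {}                    # hold: one entry per locatable unit
        self._next_id = itertools.count(1)  # simple unique-ID generator

    def add(self, content):
        """Input: store one unit of content and return its ID."""
        unit_id = next(self._next_id)
        self._units[unit_id] = content
        return unit_id

    def find(self, word):
        """Locate: return IDs of units whose text contains the word."""
        return [uid for uid, text in self._units.items()
                if word.lower() in text.lower()]

    def export(self, unit_id):
        """Output: return a tab-delimited copy of one unit."""
        return f"{unit_id}\t{self._units[unit_id]}"

    def remove(self, unit_id):
        """Remove: delete a unit by hand."""
        del self._units[unit_id]


repo = MiniRepository()
dental = repo.add("We now do Teeth!")
vision = repo.add("Vision coverage starts in May.")  # invented sample text
found = repo.find("teeth")
exported = repo.export(dental)
repo.remove(vision)
remaining = repo.find("vision")
```

Each method maps to one of the functions in the list; none of them knows anything about component classes, workflow, or metadata.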
Although a repository that performs the preceding minimum functions would be sufficient to build
a CMS on, it would be far from ideal. And in fact, most repositories go far beyond these basic
functions to enable you to do the following:
• Support the concept of components: Although all management systems must somehow
segment information, a good system facilitates inputting, naming, cataloging, locating, and
extracting content based on its type (or, in my language, its component class).
• Track your content: The management system ought to provide statistics and reports on
your components that enable you to assess the status of individuals or groups of
components.
• Support the notion of workflow: Although not part of the repository, the workflow module
must be tightly integrated with it. As one example among many, events that occur within the
repository, such as adding new components or deleting them, should be capable of triggering
workflow processes.
• Support element and full-text search: You're likely to know one of two things about
components that you want to find in the repository: the value of some piece of metadata that
they contain or some piece of text that you remember that they contain. In the first case, you
want what's called an element search. (In relational databases, this is usually called a fielded
search.) To do an element search, what you want most is a list of the elements and a place
where you can type or select the value that you want. To find components by author, for
example, you want to see an Author box into which you can type a name. For a bonus, the
system can help you type only valid possibilities. The Author box, for example, can be a list
from which you simply choose an author rather than typing her name. In the second case,
where you remember some piece of text that the component contains, a full-text search is
what you want. Here, what you want is to type a word or phrase in a box and have the
system find components that contain that word or phrase in any element. For spice, the
repository can enable you to combine full text and fielded search or to type Boolean
operators such as AND, OR, and NOT to make more precise searches of either type.
• Support bulk processes: Managing components one at a time is far too slow for many
situations. A good repository enables you to specify an operation and then do it over and over
to all the components it applies to. Suppose, for example, that your lead metator is out of
town and you want to extend the expiration date on any components that "turn off" while
she's out. You could do an element search for all components with an expiration date
between today and the day that she returns. Then you could open each of these components
and change its Expire Date element to sometime next week.
• Support all field types: Any repository enables you to type metadata as text, but the one
that you want can do much more. The best kind of repository supports all the types of fields
that I describe in white paper 20, "Working with Metadata," in the section "Metadata fields." In
any repository, for example, you can type the name of an author into each component's
Author element. Spelling errors and variations on the same name (Christopher Scott vs. C.
Scott), however, eventually cause problems. It would be better if you had one place where
you could type all author names once. Then, whenever an author needs to be specified, you
can choose the name rather than type it. The best would be a system that can be linked to
the main sources of metadata in your organization. People log into an organization's network,
for example, based on a user ID and password. This information - as well as the
organizational groups to which they belong - is stored in a registry. Wouldn't you most like to
work with a system that could connect to this registry and find all the people that are in the
Authors group? Then, to have access to all authors' names (not to mention any other
information that the registry stores), you just need to make sure that the Authors group is
correctly maintained by your organization's system administrators. Similarly, if your repository
holds master copies of metadata lists, you want it to be openly accessible to your
organization's other systems.
• Support organization standards: Your repository should access and work within whatever
user security and other network standards that you employ. If you aren't running a TCP/IP
network protocol, for example, the CMS's Web-based forms and administrative tools can't
work on your local area network.
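
To make the difference between element search, full-text search, and a bulk process concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The table name `components`, its columns, and the sample rows (including the author "L. Jones" and all dates) are assumptions invented for illustration. The LIKE query is only a stand-in for a real full-text index, and it searches a single body element rather than every element.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE components (id INTEGER PRIMARY KEY, author TEXT, "
    "body TEXT, expire_date TEXT)"
)
conn.executemany(
    "INSERT INTO components (author, body, expire_date) VALUES (?, ?, ?)",
    [
        ("C. Scott", "We now do Teeth!", "2002-06-03"),
        ("C. Scott", "Vision plan changes", "2002-09-01"),
        ("L. Jones", "Dental forms for teeth cleaning", "2002-06-05"),
    ],
)

# Element (fielded) search: match on the value of one named element.
by_author = [row[0] for row in conn.execute(
    "SELECT id FROM components WHERE author = ? ORDER BY id", ("C. Scott",))]

# Naive full-text search: look for a word anywhere in the body element.
by_text = [row[0] for row in conn.execute(
    "SELECT id FROM components WHERE body LIKE ? ORDER BY id", ("%teeth%",))]

# Bulk process: extend every expiration date that falls within a date range,
# in one operation instead of opening each component by hand.
cursor = conn.execute(
    "UPDATE components SET expire_date = ? "
    "WHERE expire_date BETWEEN ? AND ?",
    ("2002-06-14", "2002-06-01", "2002-06-07"),
)
updated_count = cursor.rowcount
```

Dates are stored as ISO-formatted text here so that a plain string comparison sorts them correctly; that is a convenience of this sketch, not a recommendation for any particular product.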

The content model
Database developers create data models (or database schema). These models establish how
each table in the database is constructed and how it relates to the other tables in the database.
XML developers create DTDs (or XML Schema). DTDs establish how each element in the XML
file is constructed and how it relates to the other elements in the file. CMS developers create
content models that serve the same function - they establish how each component is constructed
and how it relates to the other components in the system.
In particular, the content model specifies the following:
• The name of each component class.
• The allowed elements of each component class.
• The element type and allowed values for each element.
• The access structures in which each component class and instance participate.

The content model puts bones and sinew on the content domain. Although the content domain is
a simple statement, the content model is a fully detailed framework. On the other hand, all the
components that you detail in the model ought to be specifically in support of the domain. If you
can't determine quickly how a particular component serves the domain, you should reconsider the
necessity of the component or the validity of the domain statement.
If your CMS is built on a relational database, your content model gives rise to a database
schema. If your CMS is built on XML files or an XML database, your content model gives rise to a
DTD. The content model, however, isn't simply reducible to either of these models. Suppose, for
example, that you establish that you want an Author element that's an open list. This fact can't be
coded in either a database schema or a DTD. Rather, it must be established in the authoring
environment that you use. Still, the majority of the content model can be coded either explicitly or
implicitly in the database or XML schema that you develop. The rest of the content model
becomes part of the access structures in your repository and the rules that you institute in your
collection system.
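
One way to picture how a content model gives rise to a database schema, and what gets lost along the way, is the following sketch. The `Element` and `ComponentClass` classes, the `SQL_TYPES` mapping, and the `to_create_table` function are all hypothetical names invented for this example; they don't come from any CMS product.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    element_type: str                 # e.g. "text", "date", "open list"
    allowed_values: list = field(default_factory=list)


@dataclass
class ComponentClass:
    name: str
    elements: list
    access_structures: list = field(default_factory=list)


# Assumed mapping from content-model element types to column types.
SQL_TYPES = {"text": "TEXT", "date": "TEXT", "closed list": "INTEGER",
             "open list": "TEXT"}


def to_create_table(component_class):
    """Derive a CREATE TABLE statement from a component class."""
    columns = ["id INTEGER PRIMARY KEY"]
    for element in component_class.elements:
        columns.append(f"{element.name} {SQL_TYPES[element.element_type]}")
    return f"CREATE TABLE {component_class.name} ({', '.join(columns)})"


hr_benefit = ComponentClass(
    name="HRBenefits",
    elements=[
        Element("Name", "text"),
        Element("Author", "open list", ["Christopher Scott"]),
    ],
    access_structures=["benefits index"],
)
ddl = to_create_table(hr_benefit)
```

Notice that the generated statement says only that Author is text. The fact that it's an open list, and the access structures the class participates in, survive only in the model object, which is why the content model isn't reducible to the schema.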

Storing Content
Most content management systems store components in databases. Some store metadata in
databases and keep the component content in files. Although almost all content management
systems use some sort of database, the exact database they employ and how the components
are stored in the database varies widely. The two major classes of databases that a CMS may
use to store content components are shown in Figure 2.

Figure 2: A CMS may store components in a relational database or an XML database.
To date, content management systems have stored content in the following general ways:
• In relational databases, which are the computer industry's standard place to store large
amounts of information.
• In an object (or XML) database, which stores information as XML.

Sometimes the component body elements are stored in files. In these cases, management
elements are generally stored apart from the body elements in a relational or XML database.
As I write, many CMS companies are experimenting with new technologies that seek to make the
best of both the database world and the world of files. In addition, database product companies
themselves are breaking the established boundaries by creating hybrid object-relational
databases that overlay XML Schema onto the basic relational database infrastructure.
Regardless of the type of storage system that you use, it must be capable of storing components,
relationships between components, and the content model, as follows:
• Storing component instances: The primary function of a CMS repository is to store the
content components that you intend to manage. Suppose, for example, that you want to
manage a type of information called an HR benefit that includes a name and some text. If
your system has 50 HR benefits, there must be 50 separately stored entities, each following
the HRBenefits class structure, which can be retrieved one at a time or in groups.
• Storing component classes: To store component instances, the repository needs some
way of representing component classes. Somewhere in your storage system, for example,
there must be a template for an HRBenefit component. After you create a new HRBenefit
component, the system uses this template to decide what the new HRBenefit includes.
• Storing relationships between components: The repository must have some way of
representing and storing the access structures that you create. Any indexes that you decide
that you need, for example, must be capable of being represented somewhere in the
repository and must be capable of linking to the components that are indexed.
• Storing the content model: Your repository system must somehow account for all the rules
in your content model. Most are covered by storing the components and their relationships,
but some aren't. If certain component elements are required (meaning that, in every
component instance in which that element is present, that element must not be blank), for
example, that fact must be somehow stored so that it can be upheld. Similarly, if the content
of a particular element must be a date, or can't be longer than 100 characters, these facts
must also be stored somewhere so that you can enforce these rules.
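
The kinds of rules mentioned in the last item (required, must be a date, no longer than 100 characters) could be stored as data and enforced by a small routine, roughly as sketched below. The `RULES` dictionary, the element names, and the `validate` function are hypothetical, and the sketch simplifies "required" by treating a missing element the same as a blank one.

```python
from datetime import datetime

# Hypothetical stored rule set for one component class.
RULES = {
    "Name": {"required": True, "max_length": 100},
    "ExpireDate": {"required": False, "is_date": True},
}


def validate(instance, rules=RULES):
    """Return a list of rule violations for one component instance."""
    problems = []
    for element, rule in rules.items():
        value = instance.get(element, "")
        if rule.get("required") and not value.strip():
            problems.append(f"{element} is required")
            continue
        if not value:
            continue  # optional and empty: nothing more to check
        if "max_length" in rule and len(value) > rule["max_length"]:
            problems.append(f"{element} is longer than {rule['max_length']}")
        if rule.get("is_date"):
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems.append(f"{element} is not a date")
    return problems


ok = validate({"Name": "Dental plan", "ExpireDate": "2002-06-14"})
bad = validate({"Name": "", "ExpireDate": "next week"})
```

Where such rules live (in the database, in configuration files, or in the authoring forms) varies from system to system; the point is only that they must be written down somewhere a program can read them.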

Relational database repositories
The relational database was invented as a way to store large amounts of related information
efficiently. At this task, it's excelled. The vast majority of computer systems that work with more
than a small amount of information have relational databases behind them. Today, there are a
handful of database product companies (Oracle, Microsoft, IBM, and the like) who supply
database systems to most of the programmers around the world. Programmers use these
commercial database systems to quicken their own time-to-market and increase their capability to
integrate with the databases currently in use by their customers.
The majority of CMS product companies also base their repositories on these commercial
database products. In fact, many require that you buy your database directly from the
manufacturer. (This fact, by the way, puts a convenient-for-them and inconvenient-for-you firewall
between the CMS product support staff and that of the database company.) Buying a database
(or, more accurately, a license) from a commercial company is no big problem; database vendors
are happy to sell to you directly. What's much more of an issue is whether the CMS requires that
you administer the database separately. You may give preference to CMS products that have
integrated database administration into their own user interface and don't require you to
administer the databases separately.

Relational database basics
To help readers with less background in data storage, I provide some database basics before
going into the more technical aspects of representing content in a relational database.
Whatever you store in a relational database must fit into the database's predefined structures, as
follows:
• Databases have tables: Tables contain all the database's content. Loosely, one table
represents one type of significant entity. You may create a table, for example, to hold your
HRBenefit components. The structure of that table represents the structure of the component
it stores. Tables can be related to each other. (This is where the relations in relational
databases are.) Rather than typing in the name of each author, for example, your HRBenefits
table may be linked (via a unique ID) to a separate author table that has the name and e-mail
address of each author.
• Tables have rows (also called records): Loosely, each row represents one instance of its
table's entity. Each HRBenefit component, for example, can occupy one row of the
HRBenefits table.
• Rows have columns (also called fields): Strictly, each field contains a particular piece of
uniquely named information that can be individually accessed. An HRBenefit component, for
example, may have an element called Benefit Name. In a relational database, that element
may be stored in a field called Benefit Name. Using the database's access functions, you can
extract individual Benefit Name elements from the component (or row) that contains them.
• Columns have data types: As you create the column, you assign it one of a limited number
of types. The Benefit Name column, for example, would likely be of the type "text" (generally
with a maximum length of 255 characters). Other relevant column data types include integer,
date, binary large object, or BLOB (for large chunks of binary data such as images), and
large text or memo (for text that's longer than 255 characters).
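
These four structures can be seen together in a short sketch that uses Python's built-in sqlite3 module. The column names (written without spaces, as `BenefitName` and `BenefitText`), the `Authors` table layout, and the e-mail address are assumptions made for illustration. Note also that SQLite accepts the `VARCHAR(255)` declaration but doesn't enforce the length, unlike many other database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Authors (
        AuthorID INTEGER PRIMARY KEY,
        Name     TEXT,
        Email    TEXT
    );
    CREATE TABLE HRBenefits (
        ID          INTEGER PRIMARY KEY,   -- filled in by the database
        BenefitName VARCHAR(255),          -- short text column
        BenefitText TEXT,                  -- large text ("memo") column
        AuthorID    INTEGER REFERENCES Authors(AuthorID)
    );
""")
conn.execute("INSERT INTO Authors (Name, Email) VALUES (?, ?)",
             ("Christopher Scott", "cscott@example.com"))
conn.execute(
    "INSERT INTO HRBenefits (BenefitName, BenefitText, AuthorID) "
    "VALUES (?, ?, ?)",
    ("Dental", "We now do Teeth!", 1),
)

# Extract one field (element) from one row (component instance).
benefit_name = conn.execute(
    "SELECT BenefitName FROM HRBenefits WHERE ID = ?", (1,)).fetchone()[0]
```

The table stands for the component class, the inserted row for one component instance, and each column for one element; the `AuthorID` column is the unique-ID link to the separate author table.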

As you see, even given these exacting constraints, there are many ways to represent content in a
relational database. I don't present the following examples to give you a guide to building a CMS
database. (You'd need much more than I provide.) In addition, if you purchase a CMS product,
you work with a database that the product company's already designed. What I intend is to give
you insight into how the needs of a CMS mesh with the constraints of a relational database so
that you can understand and evaluate the databases that you encounter.

Storing component classes and instances
The simplest way to represent components in a relational database is one component class per
table, one component instance per row, and one element per column. An example of an
HRBenefits component class in Microsoft Access is shown in Figure 3.

Figure 3: A simple table representing the HRBenefits component class
Note
Even if you know nothing about databases, you can likely see that this is very well structured.
Everything is named and organized into a nice table. It's not hard to imagine how database programs
could help you manage, validate, and access content stored in tables. In fact, database programs are
quite mature and can handle tremendous amounts of data in tables. They offer advanced access and
connect easily to other programs. It's no wonder that relational databases are the dominant players in
component storage.
The component class is called HRBenefits. There are three HRBenefit component instances, one
in each row of the table. As shown in the figure, HRBenefit components have six elements, one
per column. Interestingly, you'd likely only ever type two of the elements - Name and Text. The ID
element can be filled in automatically by the database, which has a unique ID feature.
Even this most simple representation of a component in a relational database isn't so simple.
There are really four tables involved in storing component information. The Type, Author, and
EmployeeType columns contain references to other tables (lookup tables in database parlance).
Behind the scenes, what's actually stored in the column isn't the words shown but rather the
unique ID of a row in some other table that contains the words. From a more CMS-focused
vocabulary set, you can say that Type, Author, and EmployeeType are closed list elements. The
lists are stored in other tables and can be made available at author time to ensure that you enter
correct values in the fields for these elements. There may, for example, be three drop-down lists
on the form that you use to create HRBenefit components. In the first is a list of Types, in the
second a list of Authors, and in the third a list of Employee Types. The words in the list are filled
in from the values in three database tables.
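
A minimal sketch of one such lookup table, again using Python's sqlite3 module, might look like the following. Only the EmployeeType list is shown; the table layout and the labels "Full time" and "Part time" are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EmployeeTypes (ID INTEGER PRIMARY KEY, Label TEXT);
    CREATE TABLE HRBenefits (
        ID INTEGER PRIMARY KEY,
        Name TEXT,
        EmployeeTypeID INTEGER REFERENCES EmployeeTypes(ID)
    );
    INSERT INTO EmployeeTypes (Label) VALUES ('Full time'), ('Part time');
    INSERT INTO HRBenefits (Name, EmployeeTypeID) VALUES ('Dental', 2);
""")

# Values for a drop-down list on the authoring form come from the lookup table.
dropdown = [label for (label,) in conn.execute(
    "SELECT Label FROM EmployeeTypes ORDER BY ID")]

# The component row stores only the ID; a join recovers the words.
stored = conn.execute("SELECT EmployeeTypeID FROM HRBenefits").fetchone()[0]
shown = conn.execute(
    "SELECT e.Label FROM HRBenefits b "
    "JOIN EmployeeTypes e ON e.ID = b.EmployeeTypeID").fetchone()[0]
```

Because the component row holds an ID rather than the words themselves, correcting a label in the lookup table corrects it for every component that refers to it.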
I continue to complicate the example to show some of the other issues that come into play
whenever you store components in a relational database. Suppose that there's an image that
goes along with each benefit component (an image of a happy employee, perhaps). To represent
the image you have the following two choices:
• You can actually store the image in the database.
• You can store a reference to the file that contains the image.

The second technique is the usual choice because, historically, databases have been lousy at
storing binary large objects (BLOBs). They became bloated and lost performance. This is often
no longer true, but the perception remains. More important, images (and other media) stored
within a database aren't very accessible to the people who must work on them. Anyone can go to
a directory and use her favorite tool to open, edit, and then resave an image, but you need a
special interface to extract and restore the same image in a database field. This advantage is
rendered moot in many of the more advanced CMS products that create extensive revision
histories. To have your changes logged, you must extract and restore your files by using some
sort of interface anyway. The same interface can easily store and retrieve BLOBs from the
database. All in all, although referencing files instead of storing them in the repository is still the
most popular way to include media in a component, there's often little real advantage to it.
An HR Benefits table with images is shown in Figure 4.

Figure 4: The HR component table with an image reference added
Notice the new ImagePath column where you can enter the directory and name of the picture to
be included with this component. The image path shown here is a relative path. That is, it doesn't
start from a drive letter or URL domain name (such as http://www.xyz.com). Rather, the path
assumes that the file resides in an images/hr directory and that some program adds the right
drive or domain name or the rest of the path later. This ensures that, even if the computer that
houses these files changes, the ImagePath values can stay the same. In real life, you probably
wouldn't even type a relative path. Most likely, you upload a file from your hard drive to the
system, which then decides what the appropriate directory is for the file.
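
The way "some program adds the right drive or domain name" can be sketched in a few lines of Python. The function name `resolve_image`, the file name, and the base directory `/var/cms/files` are assumptions for illustration; only the `images/hr` directory and the example domain come from the discussion above.

```python
import posixpath


def resolve_image(base, image_path):
    """Join a deployment-specific base with a stored relative ImagePath."""
    return posixpath.join(base.rstrip("/"), image_path.lstrip("/"))


image_path = "images/hr/laurasmile.jpg"   # value stored in the ImagePath column
web_url = resolve_image("http://www.xyz.com", image_path)
file_path = resolve_image("/var/cms/files", image_path)
```

If the files move to another computer or domain, only the base passed to the function changes; the stored ImagePath values stay the same.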
The elements of the component are stored in the columns of the database. This system works
very well if you require a small number of management elements and a few larger elements of
body text. It works less well if you have a large number of management elements and a large
number of body text elements. It works very poorly if a component can have a variable number of
management or body elements. Suppose, for example, that the text of an HRBenefit component
looks something like the following example:

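<Title>We now do Teeth!</Title>
<Pullquote><B>My gums and molars never felt so good.</B></Pullquote>
<Image>laurasmile.jpg</Image>
<H1>
a paragraph of text here
</H1>
<H2>
a paragraph of text here
</H2>
<H1>
a paragraph of text here
</H1>
<Conclusion>Put your mouth where our money is-<B>use the plan!</B></Conclusion>
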
Rather than a paragraph or two of untagged text, you have a complex composition that includes
its own images, metadata, and body elements. How should this be represented in a relational
database? Here are the choices:
• Full parsing: You can create a column for each element in the text chunk.
• Partial parsing: You can store the entire text chunk in a single column but pull out some of
the elements into separate columns so that you can access them separately.
• No parsing: You can store the entire text chunk in a single column and not worry about its
internal tagging.
Of these, the last is the most commonly done.

Fully parsing structured text
Certainly, if you wanted to make the elements of your component maximally accessible, you'd
create a column for each element. You want to "explode" the elements of the text chunk into a set
of database columns that you can then access individually. Why? Well suppose, for example, that
you wanted to get to just the pull-quotes and images to create a gallery of smiling employees. It
would be nice to have each of these in its own column so that you could easily find them and
work with them.
To explode the structured text, you parse it and store it. Parsing is the process of finding and
selecting elements, and storing is the process of finding the right database row and column and
putting the element's text within it.
Well, given the sample text in the preceding section, this would yield seven extra columns in the
HR Benefits database table (Title, Pullquote, Image, H1, H2, Conclusion, and B). That doesn't
sound excessive, but then again, it's far from the whole story. Table 32-1 shows how the text is
divided - you can see that this approach doesn't work.
Table 32-1: An Impossible Repository

Column        Value
Title         We now do Teeth!
Pullquote     My gums and molars never felt so good.
B             My gums and molars never felt so good.
Image         laurasmile.jpg
H1            a paragraph of text here
H2            a paragraph of text here
H1            a paragraph of text here
Conclusion    Put your mouth where our money is -
B             Use the plan!

First, breaking each element into a column ruins the element nesting that's critical to the text.
How would you know, for example, that the B column must go inside the Conclusion column for
the text to make sense? Second, columns are repeated. Both the B column and the H1 column
occur twice. This isn't allowed in a database, in which each column must be uniquely named.
Finally, and most important (I could cite more problems, but I'm sure that you get the idea), this is
the text for just one HRBenefit component. Is it reasonable to expect that the others have exactly
these columns? I think not. Others should follow the same general form but not the exact order
and number of elements. The number and names of the elements (and thus the columns) can
vary from component to component, and that's not allowed in a database.
Strange as it may seem, these problems aren't insurmountable. In fact, I know of at least one
CMS company that's working to completely "explode" rule-based but variable text (XML, that is) in
a relational database. But it's not easy. As you can see from the example that I give, the basic
rules of a relational database are at odds with the needs of the text. The rigid regularity of rows
and columns is too far removed from the subtler regularity of well-formed text to enable the two to
overlap easily.

Partially parsing structured text
Given the difficulty of fully exploding structured text into rows and columns, most systems don't try.
Luckily, there are more modest approaches to storing structured text that can often suffice. The
most common modest approach is to parse the text block, looking for relevant elements and
storing them in their own columns.
Consider again, the text chunk to be included in an HRBenefit component, as follows:

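<Text>
<Title>We now do Teeth!</Title>
<Pullquote><B>My gums and molars never felt so good.</B></Pullquote>
<Image>laurasmile.jpg</Image>
<H1>
a paragraph of text here
</H1>
<H2>
a paragraph of text here
</H2>
<H1>
a paragraph of text here
</H1>
<Conclusion>
Put your mouth where our money is-<B>use the plan!</B>
</Conclusion>
</Text>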




How many of the elements here do you really need to access separately? Certainly not all of
them. It's hard to think of a reason, for example, why you'd need to get to all elements. For
the purpose of illustration, say that you really need only the pull-quotes separately. In this case,
you can simply add a Pullquote column to your HR Benefits table, and you're ready (see Figure
5).

Figure 5: The HR Benefits table with a Pullquote column added
It doesn't do to remove the pull-quote from the text. Its position and nesting in the text chunk may
be critical. Rather, you must make a copy of the pull-quote and put it in the Pullquote column.
Notice that you don't need to put the pull-quote tag in the column - only the text. The column
name itself serves as the tag for the pull-quote element. Similarly, if you locate the entire text
chunk (delimited by the <Text> tag), you need to copy only what's inside it to the Text column of
the database.
This type of solution enables you to have your text chunk and your metadata, too. It requires,
however, that you do the following:
• Program or do manual work: You must create either custom programming code or a manual
  process for inputting elements to database columns.
• Synchronize columns: You must make sure that you keep the elements in the text and database
  columns in synch. If someone edits the pull-quote in the text, you must recognize the event
  and make sure that you update the same text that's duplicated in the Pullquote column.
• Synchronize constraints: You must ensure that the constraints on the element and database
  column match. It doesn't do for the database column to be limited to 255 characters if the
  pull-quote element can be as long as the author wants it to be.
Given the extra work involved in pulling metadata from structured text, most people keep the
number of elements they treat this way to a bare minimum.
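The copy-and-synchronize routine described above might look something like the following minimal Python sketch. It assumes the text chunk is well-formed XML with element names modeled on the sample (Text, Pullquote, B). The table layout, the save_benefit helper, and the 255-character limit are illustrative assumptions rather than anything this paper prescribes, and for brevity the sketch stores the whole chunk, wrapper tag included, in the Text column.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed limit for the Pullquote column; it must match whatever limit
# authors are held to for the pull-quote element.
MAX_PULLQUOTE = 255

def save_benefit(db, benefit_id, chunk):
    """Store the whole chunk plus a *copy* of its pull-quote text."""
    root = ET.fromstring(chunk)
    quote = root.find("Pullquote")
    # itertext() flattens nested tags such as <B>, so only text is copied.
    quote_text = "".join(quote.itertext()).strip() if quote is not None else None
    if quote_text is not None and len(quote_text) > MAX_PULLQUOTE:
        raise ValueError("pull-quote exceeds the column constraint")
    # Writing both columns in one statement keeps them in synch.
    db.execute(
        "INSERT OR REPLACE INTO HRBenefits (ID, Pullquote, Text) VALUES (?, ?, ?)",
        (benefit_id, quote_text, chunk),
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE HRBenefits (ID INTEGER PRIMARY KEY, Pullquote TEXT, Text TEXT)")

chunk = (
    "<Text><Title>We now do Teeth!</Title>"
    "<Pullquote><B>My gums and molars never felt so good.</B></Pullquote>"
    "<Image>laurasmile.jpg</Image></Text>"
)
save_benefit(db, 1, chunk)
```

Routing every edit through one function like this is one way to honor the synchronization requirement: the Pullquote column is never written independently of the text it was copied from.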

Not parsing structured text
Exploding structured text completely into database columns is often prohibitively complex. A
controlled explosion of only certain elements into columns is more reasonable but still presents
problems. What most people end up doing then is just storing the entire chunk of structured text
in one database column and ignoring what's inside the text chunk.
This isn't as bad a solution as it may seem at first glance. For one thing, you don't always need to
access elements within a text chunk. In many situations, it's fine to simply wrap an unparsed text
chunk with a few metadata columns and call it done.
For another thing, not all text is structured. The previous solutions demand that the text chunk
you're dealing with be tagged well enough that you can locate elements within it reliably. Most
text chunks that you encounter aren't so well structured. In fact, unless someone's put the effort
into delivering XML code, it's likely that you don't even have the starting place for constructing
any sort of automatic explosion process. Thus, for systems that end up storing text in HTML or
any other less-than-easily parsed markup language, storing entire text chunks may be the only
possible option.



But even if you have well-structured text, it may work out well to save it in a single database field.
Just because the database can't get inside the block of text and deliver a single element doesn't
mean that you can't do so by other methods. Suppose, for example, that you store an entire
HRBenefit text chunk in a single column called Text. Using relational database methodologies, it's
not possible to get directly to the pull-quote elements that may be within the Text column. You
can certainly get the entire text chunk and parse it yourself, however, to find the pull-quote
element! In other words, getting to the pull-quote elements could be a two-step process: The
database gives you the text chunk, and you parse it by using nondatabase tools to get to the
pull-quote element.
Obviously, this is less convenient than having the database just give you what you want. It also
makes it hard to do a fielded search against elements (where you say, for example, "Give me all
the components with this text in their pull-quote element"). Still, storing all your structured text in
one field can get you the elements that you desire. In recognition of this approach, at
least one CMS product offers XML processing tools that you can use against any XML code
that's stored in its relational database. Moreover, some of the database product companies
themselves have developed XML overlays that enable you to package the two-stage process into
a single query.
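As a rough illustration of the two-step process, the Python sketch below fetches whole text chunks from a single Text column and then parses them outside the database. The table name, the element names, and the pullquotes helper are assumptions made for the example. Notice that the "fielded search" has to be done in application code, one chunk at a time, which is exactly the inconvenience described above.

```python
import sqlite3
import xml.etree.ElementTree as ET

def pullquotes(db, needle=None):
    """Step 1: fetch whole chunks. Step 2: parse them outside the database."""
    found = []
    rows = db.execute("SELECT ID, Text FROM HRBenefits ORDER BY ID").fetchall()
    for comp_id, chunk in rows:
        for quote in ET.fromstring(chunk).iter("Pullquote"):
            text = "".join(quote.itertext()).strip()
            # The database cannot filter on the element, so the filter lives here.
            if needle is None or needle.lower() in text.lower():
                found.append((comp_id, text))
    return found

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE HRBenefits (ID INTEGER PRIMARY KEY, Text TEXT)")
db.executemany(
    "INSERT INTO HRBenefits VALUES (?, ?)",
    [
        (1, "<Text><Pullquote><B>My gums and molars never felt so good.</B></Pullquote></Text>"),
        (2, "<Text><Title>401K</Title></Text>"),
    ],
)
print(pullquotes(db, "molars"))
```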

Breaking the spell of rows and columns
I've discussed the simplest approach to storing components in relational databases: one table per
component class, one row per component, and one element per column. Although this approach
is commonly used, it's not the only one possible. The one-table approach has the following
advantages:
• It's easy to understand. You don't need to hunt around in the database to find your
  components. You simply look for the table that has the same name as the components that
  you're seeking.
• It has high performance. Databases go fastest if they're simple. In fact, even the
  straightforward process of referencing other tables instead of retyping author names and the
  like (called normalizing) can slow a database down. For this reason, databases that need to
  be used in high-volume situations (high-transaction Web sites, for example) are often
  "denormalized" first. Their table relationships are broken, and single tables are produced
  from a set of related tables.
The disadvantages of the single-table approach are as follows:
• It's inflexible. To add a new component class to the system, you must create a new table.
  To add an element to a component, you must add a column to a table. Although this may
  seem simple, in many database systems, it requires a fair amount of effort. Relational
  databases aren't really designed to have their tables constantly modified, created, and
  destroyed.
• It has a hard time dealing with irregular information. For example, either a table has a
  particular column or it doesn't. In a CMS, a component instance may or may not have a
  particular element.
• It has a hard time dealing with extra information. The one-table approach enables you to
  specify the component name (in the table name), the component element name (in the
  column names), and the basic data type of the component element (in the data types of the
  columns). For other types of information about the component or its elements (whether a
  column is an open list or a reference, for example), there's no obvious place to store it.
An approach more subtle than the "one table/one component class" model can help overcome
some of these disadvantages. Consider the two tables shown in Figure 6.



Figure 6: Two tables that represent any component class
In the background, there's a Components table. It simply lists all the component classes in the
system and gives them a unique ID. (In real life, a table such as this would be much more
complex and most likely consist of an entire family of tables.) In the foreground, there's an
Elements table. This table stores all the elements of all the components in the system.
Each element has the following:
• A unique ID: The ID is one piece of information about the element that uniquely identifies it,
  never changes, and can be used to quickly locate the element in any search that you may do.
• A component class to which it's tied: The illustration shows that a drop-down feature is
  linked to this column so that you don't need to type the name of the component and element
  with which it's associated. In this example, each element is associated with only one
  component class. In real life, an element may be associated with any number of component
  classes, requiring a more complicated set of tables.
• The element's name: The name is the phrase that people use to recognize the element if
  they see it on entry forms and reports.
• The element's type: The types shown here correspond loosely to the metadata field types.
  The CMS can use these types to decide what to do with an element. If the type is Path, for
  example, the CMS can verify that any typed text has the format of a directory and file name.
This way of representing components has some very nice features. First, to add a new
component class, rather than needing to create a whole new table, you need only add a row to
the Components table. Similarly, to add an element to a component class, you need only add a
row to the Elements table and fill in its values. Second, you can easily extend what the database
"knows" about a component or element. The ElementType column, for example, has additional
information about what the system expects an element to contain. The ElementType extends the
simpler idea of data type. In addition to being able to say what data type an element must have
(date, text, BLOB, and so on), you can use the element type to create additional rules.
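To make the two class-definition tables concrete, here is a minimal Python and SQLite sketch. Only the ComponentID and ElementType column names come from the discussion here. The other column names, the add_class helper, and the sample element types are assumptions made for illustration, and a production schema would likely be a larger family of tables.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Components (
    ComponentID   INTEGER PRIMARY KEY,
    ComponentName TEXT NOT NULL UNIQUE
);
CREATE TABLE Elements (
    ElementID   INTEGER PRIMARY KEY,
    ComponentID INTEGER NOT NULL REFERENCES Components(ComponentID),
    ElementName TEXT NOT NULL,
    ElementType TEXT NOT NULL   -- e.g. 'Path'; read by the CMS, not by the database
);
""")

def add_class(name, elements):
    """Define a new component class by inserting rows; no CREATE TABLE needed."""
    comp_id = db.execute(
        "INSERT INTO Components (ComponentName) VALUES (?)", (name,)
    ).lastrowid
    db.executemany(
        "INSERT INTO Elements (ComponentID, ElementName, ElementType) VALUES (?, ?, ?)",
        [(comp_id, el_name, el_type) for el_name, el_type in elements],
    )
    return comp_id

hr = add_class("HRBenefits", [("Title", "Text"), ("ImagePath", "Path")])
# Adding an element later is a single row insert, not an ALTER TABLE.
db.execute(
    "INSERT INTO Elements (ComponentID, ElementName, ElementType) VALUES (?, ?, ?)",
    (hr, "Pullquote", "Text"),
)
```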
Notice that these tables represent component classes only. There's no place to actually store the
component instances. For this, you need an additional table, as shown in Figure 7.

Figure 7: The Component Instances table stores only element values


This table stores element values and associates them to a particular component instance and to
a general component class as follows:
• Element values: The Value column has the specific content of one element of one component.
• Classes: The ComponentID column identifies the class of the component, and the ElementID
  specifies the various elements of the chosen component class.
• Instances: All rows in this table with the same InstanceID are part of the same component
  instance. All the preceding rows, for example, are part of component instance number 1.
Again, in real life this single table may actually be a family of tables.
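The following Python sketch shows one way such an instances table could be read back into whole components. The column names follow the ones mentioned above (InstanceID, ComponentID, ElementID, Value). The sample rows, the ELEMENT_NAMES lookup that stands in for the Elements table, and the load_instance helper are assumptions made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE ComponentInstances (
    InstanceID  INTEGER NOT NULL,
    ComponentID INTEGER NOT NULL,
    ElementID   INTEGER NOT NULL,
    Value       TEXT,
    PRIMARY KEY (InstanceID, ElementID)
)""")

# Hypothetical lookup standing in for the Elements table.
ELEMENT_NAMES = {1: "Title", 2: "ImagePath", 3: "Pullquote"}

db.executemany(
    "INSERT INTO ComponentInstances VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "We now do Teeth!"),
        (1, 1, 2, "laurasmile.jpg"),
        (1, 1, 3, "My gums and molars never felt so good."),
        (2, 1, 1, "401K"),   # instance 2 simply has no rows for the other elements
    ],
)

def load_instance(instance_id):
    """Gather every row that shares an InstanceID into one component."""
    rows = db.execute(
        "SELECT ElementID, Value FROM ComponentInstances WHERE InstanceID = ?",
        (instance_id,),
    )
    return {ELEMENT_NAMES[el_id]: value for el_id, value in rows}

print(load_instance(1))
```

Because an instance is just a set of rows, an instance that lacks an element simply has no row for it, which is how this layout copes with irregular information.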
In summary, this more abstract way of representing components uses tables to define component
classes and tables to represent actual components. It's far more flexible than the simpler "one
table/one component class" system. Of course, you don't get something for nothing. Clearly, the
more abstract system is harder to understand (and program). In addition, the extra structure and
relationships in the abstract system could slow the CMS down in high-transaction environments.
Still, as your CMS needs become more complex, a more subtle approach to representing
components becomes necessary.
Overall, a more abstract database facilitates authoring, while a more concrete database facilitates
delivery. Because of this, many CMS designers choose to keep one database structure for
authoring and then transform it to a simpler structure as they move the content from the authoring
platform (the local LAN, say) to the delivery environment (an Internet server outside the firewall,
for example).

Storing access structures
In the following sections, I discuss how you may store each kind of access structure in a relational
database.

Hierarchies in a relational database
Relational databases aren't great at representing outlines. They just don't fit conveniently into
rows and columns. Instead, a more abstract approach is needed to put an outline in a table. As
an example of how it's done, consider the following Table of Contents for an intranet:
HR Benefits
    401K
    Medical
    Dental
Events
    Event 1
    Event 2
    Event 3
Useful Sites
    Site 1
    Site 2
    Site 3
News
    Industry
        Story 1
        Story 2
    Organization
        Story 1
        Story 2
How may this outline be represented in a database? Surely, you could pack the entire outline as
text into a single field. Or you could put each line of text in one row of a database table called
TOC, being careful to keep the right indentation by preserving the right number of spaces or tabs.
But neither of these methods gets you very far. In either of these forms, the outline is
unmanageable. How do you add or update a line? What if the name of a listed component
changes? How does that change get into the outline that you typed into the field or cells?
Unfortunately, a more sophisticated approach is needed. To be truly useful, the approach must
accomplish the following tasks:
• Represent nesting: It must represent the nesting in the outline.
• Reference IDs: It must reference components by ID (and not require you to type the
  component name itself if you refer to it) so that, if the name changes, you don't need to retype
  anything.
• Be complete: There must be enough information in the outline to enable the system to
  format it later as a set of links to the component pages (assuming that one of your outputs is
  a Web publication).
As one solution among many, consider the table shown in Figure 8.

Figure 8: A simple hierarchy database table
The following list takes the table apart, column by column, to see how it meets the tasks that I set
out for it:
• The TOCID column simply gives each line of the outline a unique ID.
• The Text column specifies the text that should appear on each line of the outline. Notice at
  this point that there are two kinds of lines in the outline: lines that name the folders of the
  outline and lines that name particular components. You can tell the two types apart because
  the rows with no ComponentID correspond to lines of the outline that are folders. Folder rows
  have the name of the folder in the Text column, while component rows have the name of the
  component in the Text column. Because you're entering the component IDs, you don't actually
  need to put in component names. I include them in the preceding table only to make the table
  easier to read.
• The ParentID and ChildNumber columns establish the nesting of the outline. Notice, for
  example, that the ParentID of the HR Benefits folder is 1. This is the TOCID of the Our
  Intranet folder. This means that the HR Benefits folder is under the Our Intranet folder. The
  ChildNumber of the HR Benefits folder is 1. That means that it's the first child under Our
  Intranet. The Events folder, with a ParentID of 1 and a ChildNumber of 2, is the second child
  under Our Intranet. The News folder is child number 3 under the Our Intranet folder.
• The ComponentID column has a component ID in it if the row has a component, and not a
  folder, listed in it. The CMS uses this distinction to decide which outline rows to make
  expandable folders and which to make links to components.
If this seems a bit complex, you've gotten the general idea. You must play some tricks to get an
outline into a table. After it's tricked, however, the database performs as expected and can store
any outline you want effectively. One of the nice things about the preceding structure is that it can
store as many outlines as you want. All you need to do is create a new folder that has no parent.
Then any other folder or component that lists the new folder as its parent is in a new outline. This
feature comes in handy, as you may need more than one outline in your CMS.
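A minimal Python and SQLite sketch of such a hierarchy table follows. The column names are the ones discussed above. The sample rows are a small subset of the outline, the component IDs (101, 102, 201) are made up, and the use of a NULL ParentID to mark the top of an outline is an assumption of this sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE TOC (
    TOCID       INTEGER PRIMARY KEY,
    Text        TEXT,
    ParentID    INTEGER,   -- NULL marks the top folder of an outline
    ChildNumber INTEGER,
    ComponentID INTEGER    -- NULL marks a folder row
)""")
db.executemany(
    "INSERT INTO TOC VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Our Intranet", None, 1, None),
        (2, "HR Benefits", 1, 1, None),
        (3, "Events", 1, 2, None),
        (4, "401K", 2, 1, 101),      # component IDs here are made up
        (5, "Medical", 2, 2, 102),
        (6, "Event 1", 3, 1, 201),
    ],
)

def outline(parent_id=None, depth=0):
    """Walk the table recursively, yielding one indented line per row."""
    # In SQLite the IS operator matches NULL as well as ordinary values.
    rows = db.execute(
        "SELECT TOCID, Text, ComponentID FROM TOC "
        "WHERE ParentID IS ? ORDER BY ChildNumber",
        (parent_id,),
    ).fetchall()
    for toc_id, text, component_id in rows:
        kind = "folder" if component_id is None else f"component {component_id}"
        yield f"{'  ' * depth}{text} ({kind})"
        yield from outline(toc_id, depth + 1)

print("\n".join(outline()))
```

Adding a second outline under this scheme means inserting another row with no parent; the same function then renders both outlines one after the other.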

Indexes in relational databases
An index connects phrases to content (or to other phrases). In books, index entries point to
pages. In a CMS, index entries (also called keywords) can point to pages, components, elements,
or text positions.
Indexes that point to components are fairly straightforward to represent in a relational database. A
very simple but quite adequate index table is shown in Figure 9.

Figure 9: A simple Index database table
The first column lists the index term, while the second column lists the IDs of the components to
which the index term applies. In real life, you may make some modifications to this simple format
to increase its quality. First, you may want to make it a two-level index, like what you see in the
back of a book. To do so, you'd need to somehow represent an outline in the table. A set of fields
such as the ones that I used previously to create an outline would do, or you could do something
simpler given that it's only a two-level outline and that it's alphabetical. Second, as most database
programmers would tell you immediately, you'd probably not want to list all the component IDs in
one column. Rather, you'd put them in a separate "bridge" table that holds only one ID per row.
The hard part of indexes, of course, isn't creating the database tables to support them but putting
the effort into indexing your content.
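The bridge-table variation can be sketched in Python and SQLite as follows. All table names, column names, and helper functions here are hypothetical, as are the sample terms and component IDs; the point is only that each term-to-component pairing gets a row of its own.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE IndexTerms (
    TermID INTEGER PRIMARY KEY,
    Term   TEXT NOT NULL UNIQUE
);
-- Bridge table: one (term, component) pair per row.
CREATE TABLE TermComponents (
    TermID      INTEGER NOT NULL REFERENCES IndexTerms(TermID),
    ComponentID INTEGER NOT NULL,
    PRIMARY KEY (TermID, ComponentID)
);
""")

def index_component(term, component_id):
    """Attach an index term to a component, creating the term if needed."""
    db.execute("INSERT OR IGNORE INTO IndexTerms (Term) VALUES (?)", (term,))
    (term_id,) = db.execute(
        "SELECT TermID FROM IndexTerms WHERE Term = ?", (term,)
    ).fetchone()
    db.execute(
        "INSERT OR IGNORE INTO TermComponents VALUES (?, ?)", (term_id, component_id)
    )

def components_for(term):
    """Return the IDs of every component that an index term points to."""
    rows = db.execute(
        "SELECT ComponentID FROM TermComponents "
        "JOIN IndexTerms USING (TermID) WHERE Term = ? ORDER BY ComponentID",
        (term,),
    )
    return [component_id for (component_id,) in rows]

index_component("dental", 23)
index_component("dental", 33)
index_component("401K", 33)
print(components_for("dental"))
```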
If your system is going to produce primarily Web pages, you may be tempted to follow the most
well-worn path of associating index keywords with Web pages. Most indexes on the Web today
use the <META> tag to create a keyword facility that their own site can use - as well as any
Web-crawling search engine - as follows:
<META NAME="keywords" CONTENT="...">


This is a fine approach, and in many cases, it's the only approach to indexing a Web site. It
needn't preclude creating an index in your CMS that points to components, however, and not
pages. As you build a page, it's easy enough to populate the values of the <META> tag from the
index terms that apply to any of the components that you've put on the page. That way, as
components are added to or deleted from the page, their keywords take care of themselves.
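Building the keywords tag from component-level index terms could be as simple as the Python sketch below. The TERMS_BY_COMPONENT mapping stands in for a lookup against the index tables, and its contents, like the keywords_meta function name, are assumptions made for illustration.

```python
from html import escape

# Hypothetical index: component ID -> index terms that apply to it.
TERMS_BY_COMPONENT = {
    23: ["dental", "benefits"],
    33: ["401K", "benefits"],
}

def keywords_meta(component_ids):
    """Merge the terms of every component on a page into one keywords tag."""
    terms = []
    for component_id in component_ids:
        for term in TERMS_BY_COMPONENT.get(component_id, []):
            if term not in terms:      # keep first-seen order, drop repeats
                terms.append(term)
    content = escape(", ".join(terms))
    return f'<META NAME="keywords" CONTENT="{content}">'

print(keywords_meta([23, 33]))
```

Because the tag is recomputed from whatever components are on the page at build time, removing component 33 from the page removes "401K" from the keywords without any hand editing.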

Cross-references in relational databases
Cross-references have referents and references. The referent is the thing linked to. The reference
is the thing doing the pointing. As an example, consider the following HTML cross-reference:
<A HREF="target.htm">Click me</A>
The referent is target.htm. You go there if you follow the link. The reference is the entire line. It's
the thing in HTML that says, "There's a link here."
Cross-references can present a bit of a dilemma in a relational database. Although the referent
can usually be a component, the reference can be an entire component, an element within a
component (an image, say), or even a single word within an element. In database lingo, a
cross-reference can apply to a row, a column within a row, or to a word within a column. Applying a
cross-reference to a component is relatively easy. In Figure 10, you can see a table that does the
trick.

Figure 10: A table to cross-reference components to other components
The ReferenceID column has the ID of the component from which you're linking. The ReferentID
column has the ID of the component that you're linking to. The LinkText column has the text that
becomes the link. The idea here is that, if you publish the component onto some sort of page,
your software can look in this table to see whether there are any links. If there are, you can
render them in the appropriate way. Having your cross references organized this way gives you a
tremendous advantage in keeping them under control. If you delete component 23, for example,
this table tells you that you'd better fix the cross-reference to it in component 33. In addition, you
can use this table to tell you which components are linked to most often and other very
nice-to-know facts about your cross-referencing system.
Notice that this approach is neutral with respect to the kind of publication that you want to create.
You can, for example, use the same information to create an HTML link, as follows:
<A HREF="...">More about benefits</A>
Or you can create a print link, as follows:
More about benefits can be found in the section titled "Benefits
and You."
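The sketch below, in Python with SQLite, shows how one cross-reference row might be rendered for two different publications and how the same table answers the question "what breaks if I delete this component?" The column names follow the discussion above. The URL scheme (component ID plus ".htm"), the TITLES lookup, and both helper functions are assumptions made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE CrossReferences (ReferenceID INTEGER, ReferentID INTEGER, LinkText TEXT)"
)
db.execute("INSERT INTO CrossReferences VALUES (33, 23, 'More about benefits')")

# Hypothetical component titles, used only for the print rendering.
TITLES = {23: "Benefits and You"}

def render_links(component_id, medium):
    """Render the links that start at one component for a given output."""
    rows = db.execute(
        "SELECT ReferentID, LinkText FROM CrossReferences WHERE ReferenceID = ?",
        (component_id,),
    )
    out = []
    for referent_id, link_text in rows:
        if medium == "html":
            # The URL scheme (component ID + ".htm") is an assumption.
            out.append(f'<A HREF="{referent_id}.htm">{link_text}</A>')
        else:
            title = TITLES[referent_id]
            out.append(f'{link_text} can be found in the section titled "{title}."')
    return out

def referencing(referent_id):
    """List the components whose links break if the referent is deleted."""
    rows = db.execute(
        "SELECT ReferenceID FROM CrossReferences WHERE ReferentID = ?", (referent_id,)
    )
    return [reference_id for (reference_id,) in rows]

print(render_links(33, "html"))
print(referencing(23))
```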
So much for cross-references where the reference is an entire component. How about ones
where you need to link from an element within a component? It becomes a bit stickier here, and
you have some choices based on the assumptions that you can make, as follows:
• Extra metadata: If the element is always going to have a link, you can add an extra column
  to the database that holds the link referent (and link text if necessary). In other words, the
  cross-reference can become another element of the component that's always tied to a
  particular other element.
• Element references: You can add a column to the table in Figure 10 that contains an
  element name or ID. As each component element is rendered, your software can look in the
  table and see whether that element has a link. This is a bit of overkill, however, because the
  vast majority of elements don't have links. Still, it would work.
• Link as structured text: You can treat the element link as a text link, as I describe next.

The most common type of cross-reference is between a phrase and a component. Too bad that
it's so unnatural for a relational database to manage such links. The problem lies in the fact that
databases have no built-in tools to look into columns and deal with what's inside of them. As long
as you obey the data type of the column, you can put as many broken and malformed links inside
them as you want.
So how do you manage cross-references where the reference is a word or phrase? First, rather
than typing any sort of link in the text of the column, you may instead put in just a link ID. Rather
than a link that looks as follows:
<A HREF="...">More about benefits</A>
Put in something like the following example:

Now there's no information in the text of the column that can change or go bad. Next, use a table
similar to the one shown in Figure 11.

Figure 11: A table for managing links that are embedded within database columns
This table looks a lot like the one I used to link component to component. That fact comes in
handy in just a moment. First, I need to discuss how this table works. If you publish component
33, your CMS finds the text:

After it does, the software can look up LinkID 55 to retrieve the information to make a link. If
component 23 (the referent of the link) is deleted, no problem; just as before, your software can
look in this table and tell you to go back to component 33 and change link 55.
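Resolving link IDs at publication time might look like the following Python sketch. The placeholder syntax (a LINK tag with an ID attribute) is invented for this example, as are the LINKS mapping, the URL scheme, and the choice to fall back to plain text when the referent no longer exists; a real system might instead flag the broken link for an editor.

```python
import re

# Hypothetical rows of the link table: LinkID -> (ReferenceID, ReferentID, LinkText).
LINKS = {55: (33, 23, "More about benefits")}

# The placeholder syntax is invented for this sketch.
PLACEHOLDER = re.compile(r'<LINK ID="(\d+)"\s*/>')

def publish(text, live_components):
    """Swap each placeholder for a real link, or plain text if the target is gone."""
    def expand(match):
        _, referent_id, link_text = LINKS[int(match.group(1))]
        if referent_id not in live_components:
            return link_text           # referent deleted: degrade to plain text
        return f'<A HREF="{referent_id}.htm">{link_text}</A>'
    return PLACEHOLDER.sub(expand, text)

body = 'Dental is new this year. <LINK ID="55"/>'
print(publish(body, live_components={23, 33}))
```

Since the stored text holds only the ID, nothing inside the column needs to change when the referent is renamed or moved; only the link table row does.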
So you have ways to deal with links from any level of a component. But rather than three different
ways to handle three different kinds of links, it would be much better to have one way that covers
all three situations, as shown in Figure 12.

Figure 12: A table to deal with every level of linking
I've added LinkType and Element columns to the table, and now it can cover any situation, as
follows:


• Text-level links: These have a link type of Text and list the name of the element that
  contains the text link.
• Element-level links: These have a link type of Element and list the element that becomes
  the link.
• Component-level links: These have a link type of Component and have no value in the
  Element column.

Sequences in relational databases
Component sequences specify a next and a previous component of the one that you happen to
be positioned on. They're the easiest of the structures to store in a relational database. First, the
component outline that you create is a built-in sequence. The components that are next and
previous in the outline are likely to be of use to you. If that's the only sequence that you need, you
have no work to do to store a sequence.
Many sequences can be generated on the fly without any storage at all. A sequence by date, for
example, can simply query the repository for components and sort them by a date column.
If you need other sequences, you can create a three-column table. In column 1 is a sequence ID,
in column 2 is a component ID, and in column 3 is an order (1, 2, 3, and so on). To construct a
sequence, you find all the rows that have the sequence ID that you want to use and order them
by the number that's in the order column.
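The three-column sequence table can be sketched in Python and SQLite as follows. The table and column names, the sample rows, and the neighbors helper are assumptions made for illustration (the order column is called Position here because ORDER is a reserved word in SQL).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Sequences (SequenceID INTEGER, ComponentID INTEGER, Position INTEGER)"
)
db.executemany(
    "INSERT INTO Sequences VALUES (?, ?, ?)",
    [(1, 33, 2), (1, 23, 1), (1, 41, 3), (2, 41, 1)],
)

def neighbors(sequence_id, component_id):
    """Return (previous, next) component IDs, with None at either end."""
    ordered = [
        cid
        for (cid,) in db.execute(
            "SELECT ComponentID FROM Sequences WHERE SequenceID = ? ORDER BY Position",
            (sequence_id,),
        )
    ]
    i = ordered.index(component_id)
    previous = ordered[i - 1] if i > 0 else None
    following = ordered[i + 1] if i + 1 < len(ordered) else None
    return previous, following

print(neighbors(1, 33))
```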

Storing the content model
As I mention in the section "The Content Model," earlier in this white paper, the content model is
all the rules of formation for your content components. The rules fall into the following categories:
• The name of each component class.
• The allowed elements of each component class.
• The element type and allowed values for each element.
• The access structures in which each component class and instance participates.

I've covered the storage of much of the content model in discussing how content and its
relationships are stored in a relational database. Here, however, I summarize and discuss the
issues in a bit more depth, covering both the simple and abstract relational database models.
In the simple "one table/one component class" scheme that I discuss earlier in this white paper,
you create a single table for each component class that you intend to manage. In the more
abstract scheme that I discuss earlier in this white paper, you create tables that define component
classes and elements and other tables where the element values are stored.
The major difference between these two approaches, and the reason that I contrast them in some
detail, is that, in the simple scheme, the content model can't be stored explicitly. Rather, it's
implicit in the names of the database parts that you create. In the abstract scheme, most of the
content model can be stored explicitly. At first, this may seem like a small distinction, but I don't
believe it is. Your content model is the heart of your CMS. If it's buried in the base structure of
your repository and not available for review and frequent tweaking, your system isn't flexible
enough to flow with the changes that you're going to want to make.
In the simple scheme, the name of each component class is stored implicitly in the name of a
table. The HRBenefits class, for example, is named in the name of its table. The allowed
elements for each class are stored implicitly in the column names in the simple scheme. The fact
that the HR Benefits table has five columns means, for example, that the HR component is
allowed five elements.


In the abstract scheme, classes are named explicitly in the Components table. There's an actual
column where you type in the name of the component class. To rename a class, you need only
change the value in one row and column. The elements allowed in each component are also
named explicitly in the Elements table. There's a particular row and column intersection where
you type the name of the element.
Unlike the simple scheme, the abstract scheme gives you the capability to view and modify your
component class structures easily.
Element field types can't be stored explicitly in the simple database scheme. They can, however,
be represented... more or less. A closed list, for example, can be created for the author element
by linking it to an Author table. Only people listed in the Author table can be chosen. Similarly, if
the user can add a new row to the Author table as well as select one from the existing list, you
can create an open list. It's important to notice, however, that the list isn't open or closed because
of the database structure; it's open or closed based on the user interface that you put around the
database structure.
In the abstract scheme, there's a place to enter the element field type explicitly. Each element
has a specific column for just this purpose (the ElementType column), as shown in Figure 13.

Figure 13: The two tables of the abstract component system
The types shown here extend the element field types. The Pattern field type, for example, also
states what kind of pattern to expect (ImagePath expects a path, and Text expects XML).
As opposed to the simple scheme in which a list was only open or closed based on the user
interface that you applied to it, here the list is explicitly set to open or closed. Your system can
now read the fact that Author is an open list and provide the appropriate user interface.
Allowed element values may or may not be explicit in the simple scheme. On the one hand, you can
explicitly set the data type of a column to "date." On the other hand, other allowed values can't be
made explicit.
There's no place, for example, to type explicitly the rule that a pattern element called ImagePath
must have a valid path and file name in it. This rule is implicit in the validation program that you
may create that checks the contents of this column. In general, the best that you can do to
represent allowed values and element types in the simple scheme is to match them to the closest
data type.
In the abstract scheme, of course, all allowed values can be made explicit. The ImagePath
element, for example, is set specifically to look for a path pattern. You still need to code a
validation that enforces this pattern, but at least you can code it once and then automatically
apply it to any element of the type Pattern:Path. In general, you can use an abstract database
schema to represent any sort of element field type that you want and include enough extra
information that your validation and form-building software can figure out how to handle element
fields of that type.
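The "code it once, apply it to every element of that type" idea can be sketched in Python as a lookup from element type to validation function. The type name Pattern:Path is the one used above; Pattern:XML, the deliberately loose path pattern, and the ELEMENT_TYPES mapping (standing in for rows of the Elements table) are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

def is_path(value):
    # Deliberately loose: optional directory segments, then name.extension.
    return re.fullmatch(r"(?:[\w.-]+/)*[\w-]+\.\w+", value) is not None

def is_xml(value):
    try:
        ET.fromstring(value)
        return True
    except ET.ParseError:
        return False

# One validator per element type, written once and reused everywhere.
VALIDATORS = {"Pattern:Path": is_path, "Pattern:XML": is_xml}

# Hypothetical rows from the Elements table: element name -> element type.
ELEMENT_TYPES = {"ImagePath": "Pattern:Path", "Text": "Pattern:XML"}

def validate(element_name, value):
    """Look up the element's declared type and apply the matching rule."""
    check = VALIDATORS.get(ELEMENT_TYPES[element_name])
    # Types without a registered rule are accepted as-is.
    return True if check is None else check(value)

print(validate("ImagePath", "images/laurasmile.jpg"))
```

The same lookup could drive form building: a form generator that reads the element type can choose a file picker for Pattern:Path and a text editor for Pattern:XML without any per-element code.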


XML-based repositories
Object databases were invented as a convenient way to store or serialize the data that
programmers needed to handle in their object-oriented programming. Programmers, who
traditionally used a relational database to store their data, got tired of trying to fit hierarchical data
into rows and columns, so they invented a hierarchical storage system.
Object-oriented data mirrors the structure of the programming objects that process it. As a very
simple example, suppose that you're writing a program to deal with a university curriculum. You
may create an object called Course that does all the processing for particular courses, an object
called Department that does the department-level processing, and a final object called School
that contains all the functions that are needed for the school as a whole. These three objects
have the following relationships:
• The way that you process a course depends on which department that course is in. All courses
  in the English department, for example, may require the taker to be an English major. Thus you
  want to access the Department object whenever doing the Course-Signup function.
• The way that you process a department may depend on which school it's in. All course changes
  in the English department, for example, may need to be approved by the dean of the Humanities
  school. Thus you need to access the School object in performing the
  Department-ChangeCurriculum function.
In plain English, you can represent this relationship as follows:
School (has approver's name)
    Department (has course requirement)
        Course (has course data)
Somewhere, you must store the approver's name, course requirements, and course data, as well
as a lot of other data for large varieties of courses, departments, and schools. A programmer
could (and still most often does) create relational database tables to keep track of all this data.
There's no natural fit, however, between this hierarchical data and the rows and columns of a
database. A much more straightforward way to store the data for this object model may be
something like the following example:









To store data into this structure, you simply "walk" down the object hierarchy storing object
names and their data. To reload a set of data, you read the data hierarchy, create object
instances as you come across their names and load them with the data that's listed. Of course, in
real life there's a bit more to it than the simple explanation that I've given, but I hope that you get
the point. The preceding XML structure fits the structure of an object-oriented programmer's world
like a hand in a glove. A relational database fits an object world like a square peg in a round hole.
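The store-and-reload walk described above can be sketched in a few lines. This is a minimal illustration, not the format of any particular object database: the class fields, the element and attribute names, and the use of Python's standard `xml.etree.ElementTree` module are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Course:
    data: str


@dataclass
class Department:
    requirement: str
    courses: list = field(default_factory=list)


@dataclass
class School:
    approver: str
    departments: list = field(default_factory=list)


def store(school):
    """Walk down the object hierarchy, writing object names and their data."""
    school_el = ET.Element("School", approver=school.approver)
    for dept in school.departments:
        dept_el = ET.SubElement(school_el, "Department", requirement=dept.requirement)
        for course in dept.courses:
            ET.SubElement(dept_el, "Course").text = course.data
    return ET.tostring(school_el, encoding="unicode")


def reload(xml_text):
    """Read the data hierarchy, creating an object instance for each name found."""
    school_el = ET.fromstring(xml_text)
    school = School(approver=school_el.get("approver"))
    for dept_el in school_el.findall("Department"):
        dept = Department(requirement=dept_el.get("requirement"))
        dept.courses = [Course(data=c.text) for c in dept_el.findall("Course")]
        school.departments.append(dept)
    return school
```

Storing a School and reloading the resulting text yields an equal object, because the nesting of the XML elements mirrors the nesting of the objects one for one.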
So object databases were invented to provide the programmer with a more straightforward (and
faster) way to store and retrieve data for their object hierarchies. I wouldn't have bothered with
such a detailed discussion but for the fact that content components are a lot like programming
objects. Just as objects do, components come in a sort of hierarchy. And component hierarchies
are just as inconvenient to store in relational databases as are object hierarchies. Thus content
programmers have turned to object databases for many of the same reasons that object
programmers have.
The hierarchical way that object databases store information is a compelling reason to consider
them for a CMS. More important is the fact that the syntax that they use to create the information
hierarchies is XML. For a CMS that uses XML as its basic content format, an object database is a
natural choice for a repository.

Object databases vs. XML
You don't need an object database to program in XML. In fact, few XML programmers know
much about object databases. They work with XML files. An XML file carries the same structure
and syntax as an object database. On the other hand, there are a number of reasons
you may choose (as some CMS product companies have) to use an object database:
• Multiple files: You may need to work with many XML files and would like them all united by a
  single hierarchy and searchable by the cross-file capabilities that an object database
  supplies.
• Delivery: You may need a more sophisticated delivery environment than a file system can
  offer. Many object databases, for example, come with the Web caching, load balancing, and
  replication functionality you need to run a high-throughput site.
• Development environment: You may prefer the programming and administration
  environment that an object database provides. Many object databases, for example, come
  with their own equivalents or extensions of XML standards such as XPath and XSLT that you
  may prefer to use rather than the less developed standards.
• Performance: Object databases offer some performance gains over XML files. Many provide
  indexes, for example, that enable you to find commonly searched-for metadata more quickly
  than you could in an XML file.
Just about all of what I cover in the sections that follow applies equally well to an object database
or XML files. In either case, the main event isn't the container (the file or database) but the XML it
contains.
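To make the multiple-files and performance points above concrete, here is a rough sketch that contrasts scanning several XML files on every query with building a metadata index once. It is not how any particular object database works. The file names, the `Component` element, its `id` attribute, and the second author are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up stand-ins for XML files on disk (file name -> file content).
DOCS = {
    "benefits.xml": (
        "<Components>"
        "<Component id='1'><Author>Derek Andrews</Author></Component>"
        "<Component id='2'><Author>Pat Example</Author></Component>"
        "</Components>"
    ),
    "policies.xml": (
        "<Components>"
        "<Component id='3'><Author>Derek Andrews</Author></Component>"
        "</Components>"
    ),
}


def scan(docs, tag, value):
    """Without an index: parse and search every file on each query."""
    hits = []
    for name, text in docs.items():
        for comp in ET.fromstring(text).iter("Component"):
            if comp.findtext(tag) == value:
                hits.append((name, comp.get("id")))
    return hits


def build_index(docs, tag):
    """With an index: parse every file once, mapping each value to its components."""
    index = defaultdict(list)
    for name, text in docs.items():
        for comp in ET.fromstring(text).iter("Component"):
            index[comp.findtext(tag)].append((name, comp.get("id")))
    return index
```

Both approaches return the same components for a given author. The index pays the parsing cost once, after which each lookup is a single dictionary access across all files.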

Storing component classes and instances
The simplest way to store components in XML is inside a single element that bears the name of
the component class:

1
401K
Standard
Derek Andrews
FT
images/hr/joesmile.jpg
Our great 401K plan...
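The values listed above can be wrapped in a single element named for the component class. The following sketch shows one way that might look. The class name `HRBenefit` and most of the child element names are assumptions made for illustration. Only Author and ImagePath are named as elements in the discussion above, and the values are the ones listed.

```python
import xml.etree.ElementTree as ET

# Values are those listed above; the tag names are assumptions.
FIELDS = [
    ("ID", "1"),
    ("Name", "401K"),
    ("Type", "Standard"),
    ("Author", "Derek Andrews"),
    ("EmployeeType", "FT"),
    ("ImagePath", "images/hr/joesmile.jpg"),
    ("Text", "Our great 401K plan..."),
]


def build_component(class_name, fields):
    """Create one element named for the component class, with one child per field."""
    component = ET.Element(class_name)
    for tag, value in fields:
        ET.SubElement(component, tag).text = value
    return component


component = build_component("HRBenefit", FIELDS)
ET.indent(component)  # pretty-printing helper; requires Python 3.9 or later
print(ET.tostring(component, encoding="unicode"))
```

Each component instance is then one such element, and the element's name identifies the component class that it belongs to.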