Presented to the Art Libraries Society of North America
Baltimore, March 25, 2003
Stephen R. Toney
President, Systems Planning
160 Dragoon Court, Cross Junction, VA 22625 USA
toney@systemsplanning.com
http://www.systemsplanning.com
Copyright © 2003 Systems Planning
The specific features and limitations of MWeb discussed herein are no longer relevant. Please see the MWeb Enterprise homepage for current capabilities, or email us at the address above.
Integrating records from any two data sources involves both technical issues of mapping and intellectual issues of the representation and meaning of the data. Both of these are especially difficult when one data source is in the MARC format. This talk discusses how these issues were addressed using MWeb™ at the Los Angeles County Museum of Art.
MWeb was developed in 1997 for the J. Paul Getty Trust to provide web access to the Census of Antique Art & Architecture Known to the Renaissance, an art-history database with images containing about 250,000 records of 14 record types. In 1998 MWeb was redeveloped as a product for web publishing of cultural heritage information.
MWeb is a trademark of Systems Planning.
Integrating disparate datasets involves two kinds of problems: format conversion and intellectual integration.
The difficulty of integrating formats obviously depends on the formats to be integrated and the data itself, but certain kinds of problems occur over and over.
MARC is harder to work with than most formats:
(On the plus side, MARC's leader, embedded metadata, directory structure, and documentation are terrific enablers of any kind of data manipulation.)
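To illustrate why the leader and directory make a MARC record easy to take apart, here is a minimal Python sketch. It assumes the standard MARC 21 layout: a 24-character leader, 12-character directory entries, and the usual field, subfield, and record terminators. The functions build_record and parse_record and the sample field values are invented for this illustration; they are not part of MWeb or of any MARC library.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
SF = b"\x1f"  # subfield delimiter


def build_record(fields):
    """Assemble a minimal MARC-style record from (tag, body) pairs (illustration only)."""
    directory = b""
    data = b""
    for tag, body in fields:
        body = body + FT
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1  # leader + directory + directory terminator
    total = base + len(data) + 1    # plus the record terminator
    # Leader positions 0-4 hold the record length, 12-16 the base address of data.
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FT + data + RT


def parse_record(raw):
    """Use the leader and directory to pull out each field without scanning the data."""
    base = int(raw[12:17].decode("ascii"))
    directory = raw[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[0:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # drop the field terminator
        fields.append((tag, body))
    return fields


# Invented sample data: a control number and a title field with indicators "10".
sample = [("001", b"12345"), ("245", b"10" + SF + b"aSample title")]
raw = build_record(sample)
parsed = parse_record(raw)
```

Because every field's tag, length, and position is stated in the directory, a program can locate any field directly, which is the sense in which the metadata rides with the data.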
Because the MARC format is hard to work with, I can conceive of only one reason for mapping museum data to MARC records prior to loading them: your system has only a MARC loader, or its MARC loader is something special. This might be the case if you are loading museum records into a library system. Otherwise, since few if any systems use the MARC format internally, there is no point in mapping museum data to MARC; the format was never intended to be more than a means of communicating bibliographic data between systems.
The opposite approach, mapping the MARC records to the format used by the museum records, is perhaps easier, but can also get complex. For example, it takes 26 tables to convert a MARC record to a fully normalized relational model.
Regardless of the formats to be integrated, as a general rule it is harder to convert one format to the other than to convert both to a neutral target format. (Naturally, whether this is an option depends on the purpose of the conversion.)
The neutral target should be as accommodating as possible
MWeb was designed especially to be accommodating to conversions, as one of its strong points is the ability to integrate any kind of data.
It should be noted that mapping to a common format is easier if the metadata is stored with the data. This is actually one of the strong points of MARC, as the content designators ride with the data. Compare this to a typical delimited ASCII file in which you need external documentation to understand the data:
63836Þ1812Þ1812Þ}~
119188Þ1943Þ1946Þ}~
2268Þ1890Þ1899Þ}~
112989Þ1923Þ1960Þ}~
68187Þ1601Þ1700Þ}~
33516Þ1966Þ1966Þ}~
33687Þ1971Þ1971Þ}~
68639Þ1926Þ1926Þ}~
This is also a big attraction of XML, and also of the MWeb database. However, these three formats differ greatly in processing efficiency, so storing the metadata with the data is not what determines how efficiently a format can be processed.
MWeb uses a proprietary format that cannot be discussed in detail; however, it does also store the metadata with the data. We overcome the inefficiencies by the design of the database, as well as by other strategies such as data redundancy, reducing joins, proprietary indexes, and especially by preprocessing data as the database is built.
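The difference between a delimited file and a self-describing record can be sketched in a few lines of Python. The field names below (record_id, year_from, year_to) are guesses made for this illustration, which is exactly the point: nothing in the file itself says what the three columns mean. The JSON record stands in for any format that carries its own labels; it is not the MWeb format.

```python
import json

DELIM = "\u00de"  # the thorn character used as the field delimiter in the sample file
RAW_LINE = "63836\u00de1812\u00de1812\u00de}~"

# Assumption: these names would have to come from external documentation.
FIELD_NAMES = ["record_id", "year_from", "year_to"]


def parse_delimited(line, names):
    """Split one delimited line and attach externally supplied field names."""
    if line.endswith("}~"):          # strip the end-of-record marker
        line = line[:-2]
    values = line.split(DELIM)
    if values and values[-1] == "":  # the trailing delimiter leaves an empty value
        values.pop()
    return dict(zip(names, values))


# Without FIELD_NAMES the line is just three unlabeled numbers.
from_delimited = parse_delimited(RAW_LINE, FIELD_NAMES)

# A self-describing record needs no outside documentation to be labeled.
self_describing = '{"record_id": "63836", "year_from": "1812", "year_to": "1812"}'
from_self_describing = json.loads(self_describing)
```

Both routes produce the same labeled record, but only the second can do so without a separate description of the file.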
One type of "neutral target" is the data warehouse. You retain the original systems for maintaining the data but develop a new database for queries. Only a few people have access to the live data. This is an established practice in industry, and is the model that MWeb implements.
Besides achieving the virtues of a neutral target format for integrating datasets, a data warehouse has additional benefits as well:
The easiest format conversion is none at all.
You can always merge data. But will it mean anything when you do? Integration of datasets is more than a question of formats. Unfortunately, format conversion -- though difficult -- is at least soluble; some of the intellectual-integration issues are not.
The same law applies as for format conversion: it is harder to convert A to B than to convert both to a neutral target. The Dublin Core is a perfect example of solving the intellectual issues in this manner -- by providing a common target for conversion.
An example of insolubility can be seen in MWeb's Advanced Search (AS). This feature permits researchers to specify the fields in which they want values to be found. To make this easy, MWeb displays a textbox for each searchable field. There would be no way to provide such an interface for more than one record type at a time.
This is why we are considering changing the Advanced Search to use the Dublin Core (DC). We would map each field in each record type to the DC and show only the DC fields in the Advanced Search. Some degree of specificity would be lost, but searching would be greatly improved and would probably be more comprehensible to most users.
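A mapping of this kind might look like the following Python sketch. It is not how MWeb is implemented. The MARC tags (100, 245, 650) and the Dublin Core element names are real, but the museum field names, the sample records, and the functions to_dublin_core and search are hypothetical, and the mappings are deliberately simplified.

```python
# Simplified crosswalks from each record type to Dublin Core elements.
MARC_TO_DC = {"100": "creator", "245": "title", "650": "subject"}
MUSEUM_TO_DC = {"artist": "creator", "object_title": "title", "medium": "format"}


def to_dublin_core(record, mapping):
    """Map a record's fields to DC elements; unmapped fields are dropped."""
    dc = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            dc.setdefault(element, []).append(value)
    return dc


def search(records, element, term):
    """Return the DC records whose given element contains the term (case-insensitive)."""
    term = term.casefold()
    return [r for r in records
            if any(term in v.casefold() for v in r.get(element, []))]


# Invented sample records, one per record type.
library_record = {"100": "Smith, John", "245": "A sample catalogue", "300": "2 v."}
museum_record = {"artist": "John Smith", "object_title": "Sample landscape",
                 "accession_no": "0000"}

merged = [to_dublin_core(library_record, MARC_TO_DC),
          to_dublin_core(museum_record, MUSEUM_TO_DC)]
hits = search(merged, "creator", "smith")
```

A single "creator" textbox now searches both record types. The cost is visible too: the library's physical description (300) and the museum's accession number have no place in the mapping and disappear from the search form.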
Here are some of the reasons intellectual integration is hard:
Does a field called "volume" make sense for a database of merged library and museum data? Their use of the term is quite different.
This is the problem of how to store data. Some examples for personal names are:
Is the library's "Smith, John" the same person as the museum's "Smith, John"? A name would only be sure to refer to the same person if all catalogers used the same rules -- of which the likelihood is zero.
Do the datasets agree on whether to use real names or pseudonyms? Here is the AACR2 heading for the person known to museum catalogers as El Greco: Theotocopuli, Dominico, called El Greco, 1541?-1614. Clearly only a keyword search would be likely to find these two names together.
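The El Greco case can be checked with a small Python sketch. The keywords function is a deliberately crude tokenizer written for this illustration (lowercase alphabetic words only); a real index would apply its own rules.

```python
import re


def keywords(name):
    """Break a name heading into lowercase alphabetic keywords."""
    return set(re.findall(r"[a-z]+", name.lower()))


library_heading = "Theotocopuli, Dominico, called El Greco, 1541?-1614"
museum_name = "El Greco"

exact_match = library_heading == museum_name                # the headings differ
shared = keywords(library_heading) & keywords(museum_name)  # keywords in common
```

An exact comparison of the two headings fails, while the keyword sets overlap on "el" and "greco", so only a keyword search brings the two records together.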
What fields should be indexed, and into which index should the terms go? The library might consider a personal-name subject a subject, but the museum might consider it a personal name.
Will a stopword in one dataset be highly significant in another? If so, how will you explain to the user that there are no records with a significant term?
You may be able to merge the datasets, but can you make searching and search results comprehensible to the user? How do you use terms that are natural to each dataset without being obscure or verbose? For example, which of these buttons would you use for a search of merged library and museum data to achieve both clarity and simplicity:
Likewise in the display of search results, will you interfile records from the various datasets based on some sort order? If so, how will you identify each record's type to the user? Will the terms you use be both brief and clear?
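One possible way to interfile results while still telling the user what each record is can be sketched as follows. The labels "Library" and "Museum", the sample titles, and the interfile function are all invented for this illustration; choosing labels that are both brief and clear is the hard part that code cannot settle.

```python
def interfile(datasets):
    """Merge labeled record lists and sort by title, keeping each record's type."""
    merged = []
    for record_type, records in datasets.items():
        for record in records:
            merged.append({**record, "record_type": record_type})
    # Sort case-insensitively so capitalization does not separate the datasets.
    return sorted(merged, key=lambda r: r["title"].casefold())


datasets = {
    "Library": [{"title": "Painting in Spain"},
                {"title": "drawings of the Renaissance"}],
    "Museum": [{"title": "Landscape with Figures"}],
}

results = interfile(datasets)
display = [f'[{r["record_type"]}] {r["title"]}' for r in results]
```

Each line of the display carries a short type label in front of the title, so library and museum records can share one sorted list.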
These kinds of issues are a problem in all systems, whether or not they relate to multiple datasets. However, the more complex the data, the more numerous and difficult these issues become.
This is much harder if you are merging formats in a production system in which you must maintain the data and process transactions. Libraries and museums sometimes use the same or similar terms for very different concepts. For example, both have loans, but the processing of loans is completely different.
We were able to take a simple approach for two reasons. First, the system is used by the general public, art historians, and teachers, who don't care about the details of bibliographic data. Second, the MWeb database is for searching only, so we didn't need to retain many of the codes and control fields.
So far the library data is still far less visible on the LACMA site than the museum data. There are several reasons for this: the default search is for images only, which library records do not have; the library hits are all displayed after the museum records; and there are no links to or from authority records, as there are for museum object records. These decisions are understandable in a project run by the museum collections management office, but we plan to improve the visibility of library records in the next year by intersorting, by adding links to authority records, and by adding images and scanned text to some library records.
The MWeb solution:
We hope to better integrate the library and museum data at LACMA and at other MWeb sites:
Discussions of merging datasets usually focus purely on format conversion. I hope you are now convinced that this is just the technical aspect of a more abstract problem; your purpose is ultimately to communicate with the intended audience. Therefore don't forget that the new data model must be clear to the intended audience if they are to use the system. Your solution may be technically impeccable and intellectually sound, but if no one else can understand it, it is to little purpose. In other words, consider the users' need to understand the data model from the very beginning. Ideally you will find a model that is inherently clear, but if not, then provide the interface, training, and documentation to make it so.
We invite you to try the LACMA MWeb site! The handout explains how to find library records.