Structured Information Retrieval using XML

Daniel Egnor and Robert Lord
XYZFind Corporation
Seattle, Washington, USA

We suggest several ways information retrieval systems can exploit semantically structured XML documents to improve precision and recall. Most of these techniques require the addition of structure to the user's query. To keep the system usable by end users who are not XML schema experts or able to construct database queries, we propose a user interface that allows the user to begin with an unstructured keyword query and, with guidance by the system, apply structure in a dialog based on iterative refinement. Finally, we introduce an extended form of the inverted index used to implement this interface.

Introduction

The adoption of XML as a standard for the publication and interchange of structured data creates a great opportunity for better information retrieval. XML has the ability to represent the semantics of data in a structured, documented, machine-readable form. Organizations are emerging to define standard schemas to capture the semantics of many domains, and content providers are beginning to publish information using XML and standard schemas.

We have identified several ways information retrieval systems can use semantically structured XML (XML whose structure represents in some fashion the semantics of the data) to improve precision and recall:

To take advantage of semantically structured XML documents, the query must have semantic structure as well. Schema restrictions, term context limitations, data type information, and other elements of structured queries can be applied in three ways:

Users who are not domain experts find proactively structured systems difficult. As the diversity of a corpus increases, it becomes less likely that search users will know the structure of the data they wish to locate. In order to support users who are not domain experts, users who are not proficient with formal query languages, and users who wish to search large, diverse corpora, our research centers on systems which do not require proactively structured queries.

Adding Structure to Search

To help users create structured queries, we propose a user interface for semistructured information retrieval that uses reactive and automatic techniques to add structure to initially unstructured queries. Tehse techniques are what distinguishes this work from database query systems which presuppose proactively structured queries expressed in a formal query language (e.g. SQL or XML-QL).

Assumptions

To simplify the process of adding structure to queries, we make several assumptions about the form of the final query. We sacrifice the expressiveness of our "query language" in exchange for simplicity.

While none of these assumptions are valid for all queries and all documents, we believe that despite the restructions the system still represents a significant improvement over unstructured information retrieval. The performance of our system and experience with user testing will help us evaluate the validity of these assumptions and inform future work.

User Interface

Keeping these assumptions in mind, we implemented a four-step search interface:

  1. Step 1

    The user submits an unstructured full-text query. (This step is shared by most information retrieval systems.)

  2. Step 2

    The corpus is searched for documents which contain all of the terms in the user's query. The set of unique schemas represented by these matching documents is generated. The user is asked to choose from this set the schema most relevant to their query. (If there is only one possible schema, this step is skipped.)

    If metadata is available, schemas may be labelled with "friendly" names (such as "Used Cars for Sale" or "News Headlines"). Otherwise, schemas are identified by the name of their root element.

  3. Step 3

    The user is given a "search form" specific to the selected schema. This form contains one entry blank for each unique XML path which occurs in documents in the corpus which belong to the selected schema. Visual cues inform the user which paths are known to contain the user's search terms.

    Depending on the data types (text, numbers, etc.) present in the corpus for each path, the input blanks may be specialized. Specific metadata may also be authored to control the layout and presentation of the form.

  4. Step 4

    The query, now fully structured, is executed, and results are returned to the user.

    Metadata, if available, controls the formatting of results. Otherwise, generic XML presentation is used.

As you can see, this user interface is fully reactive (and occasionally automatic); it presupposes no domain knowledge or schema expertise on the part of the user. At the same time, within the parameters of the limited query capability, it is able to take advantage of semantically structured XML to improve precision and recall.

Implementation Experience

Index structures for proactive database query of structured or semistructured data are well understood. Similarly, index structures for unstructured full-text information retrieval are also well understood. However, the need to support reactive and automatic structured query imposes some unique requirements on the index used for our implementation. Our index design is therefore relatively unique.

Specifically, we identified these requirements for our index:

Our implementation effort focused on developing an index which satisfies these requirements efficiently.

Extending the Inverted Index

The index structure we use is a modification of the classic "inverted index" used by information retrieval systems. The simplest form of the standard inverted index is a relation between keywords and documents: found_in(keyword,document). A common extension includes the location of the keyword within the document, to aid phrase and proximity matching: found_in(keyword,document,offset).

Our implementation modifies the inverted index to first associate keywords and XML path specifications where those keywords appear; (keyword,path) tuples are then associated with documents and locations: found_in(keyword,path,pathwordid) and occurs_in(pathwordid,document,address). Additionally, rather than merely storing the offset from the beginning of the document, this index records the XML address (a sequence of numbers representing which outer elements contain this content) of each word. For example, consider the following pair of XML files describing used car dealerships:

billiebrown.xml

<Dealer>
  <Name>Billie Brown Chevy</Name>
  <Car> 
    <Year>1999</Year> 
    <Model>Chevrolet Caprice</Model> 
    <Color>white</Color> 
    <Price>$7500</Price 
  </Car>
</Dealer>

joebob.xml

<Dealer>
  <Name>Joe Bob Ford</Name>
  <Car> 
    <Year>1992</Year> 
    <Model>Ford Mustang</Model> 
    <Color>white</Color> 
    <Price>$7500</Price 
  </Car>
  <Car> 
    <Year>1972</Year> 
    <Model>Dodge Dart</Model> 
    <Color>brown</Color>
    <Price>$1999</Price 
  </Car>
</Dealer>

These files result in the following index entries:

found_in(1972,'Dealer/Car/Year',0)
found_in(1992,'Dealer/Car/Year',1)
found_in(1999,'Dealer/Car/Price',2)
found_in(1999,'Dealer/Car/Year',3)
found_in(7500,'Dealer/Car/Price',4)
found_in('billie','Dealer/Name',5)
found_in('bob','Dealer/Name',6)
found_in('brown','Dealer/Car/Color',7)
found_in('brown','Dealer/Name',8)
found_in('caprice','Dealer/Car/Model',9)
found_in('chevrolet','Dealer/Car/Model',10)
found_in('chevy','Dealer/Name',11)
found_in('dart','Dealer/Car/Model',12)
found_in('dodge','Dealer/Car/Model',13)
found_in('ford','Dealer/Car/Model',14)
found_in('ford','Dealer/Name',15)
found_in('joe','Dealer/Name',16)
found_in('white','Dealer/Car/Color',17)

occurs_in(0,'joebob.xml',[3,1])
occurs_in(1,'joebob.xml',[2,1])
occurs_in(2,'joebob.xml',[3,4])
occurs_in(3,'billiebrown.xml',[2,1])
occurs_in(4,'billiebrown.xml',[2,4])
occurs_in(4,'joebob.xml',[2,4])
occurs_in(5,'billiebrown.xml',[1])
occurs_in(6,'joebob.xml',[1])
occurs_in(7,'joebob.xml',[3,3])
occurs_in(8,'billebrown.xml',[1])
occurs_in(9,'billebrown.xml',[2,2])
occurs_in(10,'billebrown.xml',[2,2])
occurs_in(11,'billebrown.xml',[1])
occurs_in(12,'joebob.xml',[3,2])
occurs_in(13,'joebob.xml',[3,2])
occurs_in(14,'joebob.xml',[2,2])
occurs_in(15,'joebob.xml',[1])
occurs_in(16,'joebob.xml',[1])
occurs_in(17,'billiebrown.xml',[2,3])
occurs_in(17,'joebob.xml',[2,3])

If a user enters an unstructured query for "brown dart", the query processor can use the found_in relation to determine that "brown" could refer to either of "Dealer/Car/Color" or "Dealer/Name", but "dart" must refer to "Dealer/Car/Model". Once the correct path is associated (reactively or automatically) with "brown", the occurs_in relation supplies location information.

For example, let us suppose (as is likely) that the user meant to search for brown-colored cars, so "brown" is associated with "Dealer/Car/Color". This pair has pathwordid #7. The only pathwordid available for "dart" is #12. Fortunately, occurs_in associates both with the document 'joebob.xml' and the third child of the root element. This document -- or just the relevant element, i.e. the matching car listing -- is returned to the user.

These relations are not sufficient to complete the search experience. Additional relations associate all known XML paths with content statistics, and associate XML addresses with the content contained at those addresses (to allow the search engine to regenerate the original XML content for formatted output).

Numeric Data

An additional set of extensions to the index structure are used to allow queries to specify numeric ranges. Information retrieval systems normally treat numbers as keywords; a user searching for "1999" can find the exact value "1999", but it is not possible to search for "between 1000 and 2200".

To enable numeric queries, when the implementation encounters a number it stores it as several pseudo-keywords representing different numeric ranges which contain the target number. For example, "1999" might be indexed under the pseudo-keywords "1000-2000", "1900-2000", and "1990-2000", as well as the actual value "1999".

When a user searches for a range, the pseudo-keywords necessary to exactly cover the range are identified. Query results from each of these pseudo-keywords are combined disjunctively. For example, a user specifying a numeric range of 1000-2200 might cause a query for the pseudo-keywords "1000-2000", "2000-2100", and "2100-2200".

An open question is the correct set of pseudo-keywords to use. If the ranges used are too sparse, queries will have to combine results from many pseudo-keywords, resulting in slow execution times. If the ranges are too dense, the index will contain lots of repeated information, resulting in a large index size. The optimal boundaries for pseudo-keywords are not well-defined; for example, using powers of 10 may align well with common user queries and thereby improve system performance. We are currently conducting research in this area to improve the efficiency of our implementation.

Implementation Notes

The current implementation stores relations with the widely-used Berkeley DB index package [5]. Element names, path specifications, and content text are normalized to save space. Paths and addresses are stored in sorted order to faciliate efficient merging of search results.

Related Work

Semistructured data retrieval in general, and XML search in particular, is an active field of research and development; indeed, the potential for structured search is often cited as a motivation for the development of XML.

The semistructured database community has understandably embraced XML as an interchange format for semistructured data [6]; many semistructured database systems have been recast as "XML databases". Examples of semistructured databases include research systems like XSet [7] and Niagara [4] as well as commercial products like XDex and Tamino. These systems generally accept some formally specified query language (such as XQL, XPath, or XML-QL) and return precise results. The structure of the query must be proactively specified; the person or software submitting the query can only take advantage of XML schemas they already know. Our work focuses instead on end users who are not schema experts.

Lore is such a semistructured database, and normally accepts the proactively structured query language Lorel [1]. However, researchers have used Lore's "data guides" and search capability to experiment with reactively structured information retrieval. [3]

GoXML [8] is a commercially deployed "context-based XML search engine" and one of the few search systems using a reactive approach to structured query. However, GoXML does not allow different search terms in the same query to refer to content in different elements, and distinguishes content only by its containing element; furthermore, GoXML has no capacity for searching non-text data, and returns only links to XML files. Our work allows the association of an independent XML path with each search term, can process numeric range queries, and returns excerpted and formatted content.

Xyleme is a promising new research action (associated with the Verso project at INRIA). Our work is similar in intent to Xyleme's semantic data integration research [2], but we take a different approach. Rather than focusing on formal knowledge representation and extensive metadata to structurally annotate the user's query a priori (fully automatic structural reformulation), we use an inverted index to determine candidate structures for the user's query, and employ a combination of heuristics and user interaction to add structural annotations (reactive structural reformulation).

Summary

We have presented three main ideas:

This is very much a work in progress, both as research and as commercial development. In the future, we hope to explore automatic schema and context selection as an adjunct or replacement for reactive choice by the user, support for additional data types, and performance improvements. Above all, we plan to measure and objectively evaluate our progress against existing information retrieval technology.

Meanwhile, interested parties may experiment with a demo of this technology and download beta versions of a product using these techniques from our corporate Web site at www.xyzfind.com. Please check this site for updated information about product features and ongoing research.


  1. S. Abiteboul, D. Quass, J. McHugh, J. Widorn, and J. Wiener. The Lorel Query Language for Semistructured Data, Journal of Digital Libraries, 1(1), November 1996.

  2. M. Rousset, Semantic Data Integration in Xyleme, Presentation at INRIA, September 1999.

  3. R. Goldman and J. Widom, Interactive Query and Search in Semistructured Databases, Technical Report, Stanford University, 1998.

  4. J. Naughton, D. DeWitt, D. Maier et al., The Niagara Internet Query System.

  5. M. Olson, K. Bostic, and M. Seltzer, Berkeley DB, Proceedings of the 1999 Summer Usenix Technical Conference, Monterey, California, June 1999.

  6. D. Suciu, Semistructured Data and XML, Proceedings of International Conference on Foundations of Data Organization, Kobe, Japan, November 1998.

  7. B. Y. Zhao and A. Joseph, XSet: A Lightweight XML Search Engine for Internet Applications.

  8. GoXML.com - Search and Index XML Documents. http://www.goxml.com/