An Efficient Index Strategy for Multiattribute Queries on
Very Large Data Warehouses

David M. Calascibetta
DePaul University
School of Computer Science, Telecommunications
and Information Systems
Chicago, IL
David@Calascibetta.com

Abstract

The characteristics of very large historical databases are significantly different than common transaction processing systems. Large quantities of data are typically "loaded" by a single individual at a predetermined time. Then, the database becomes effectively "read-only" until the next load event occurs. Frequently, obsolete periods of data, approximately an equal amount to the load, are unloaded from the database. All of this processing typically runs overnight for several hours.

This paper addresses the specific requirements of constructing DBMS technologies to achieve acceptable performance in these applications. Although related topics of concern are described, we concentrate on secondary indexing issues. The most common forms of relational database indexing are discussed, including the B-Tree and its derivatives, grid files, bit maps, and inverted lists. Each index is evaluated for suitability for use in very large databases, with a criteria consisting of physical index size and scalability, maintainability, access IOs, cardinality sensitivity, parallelization effectiveness, and usefulness in multiattribute queries, including whether the index allows dynamic intersection or union operations.

The use of single attribute inverted list indexes, combined with dynamic bit maps is proposed. It is shown that this combination indexing method yields the smallest physical index, while providing complete capability to intersect or union indexes dynamically to resolve a multiattribute query. The goal of this method is to eliminate completely the combinatorial explosion in index storage and maintenance when many attributes of a relation require frequent referencing.

Today, significant effort is applied to making databases easier to access by non-technical people. But a simple, easy to use interface to a database will not solve the problem of "runaway" queries. In fact, these interfaces enable users to launch queries against a database which may process for days. Instead of solving the DBMS processing problem, the front end products allow developers to anticipate long running queries and provide governors which trap these queries and shut them down before execution. If the DBMS had a better approach to processing multiattribute queries, fewer of these requests would need to be trapped and killed. Therefore, this new indexing method can be viewed as a usability enhancement to front end systems.

This is an optimization problem. If the DBMS can’t answer a query within a reasonable amount of time, then it does not provide a solution for the problem. The solution described here is an example of ultimate dynamism, with only the most basic information stored. This is an enabling technology, allowing queries to execute which would be prohibitive without this solution.

1 Introduction

1.1 Problem Statement and Motivation

Considerable research has been made into the efficient indexing of historical databases. This paper addresses the specific problem of indexing a very large relation for efficient retrieval when many of the relations’s attributes are referenced in widely varying combinations. An example of the problem we will solve looks like this:

These queries are referred to as "multiattribute" or "multidimensional" queries.

The significant factor here is that the relation is very large. As we will demonstrate, many of the indexing methods described in previous research which facilitate multiattribute queries do not scale up well. The index size grows disproportionately with the data, in an environment where the data volume itself is difficult to manage.

This research is concerned only with secondary indexing issues. Index methods for the primary key of a relation allow the data to be stored optimally for retrieval through that primary key. These methods are typically clustering and hashing. Organizing the data for optimal retrieval for one set of queries is probably not optimal for the next set in our case. Therefore, this discussion will not evaluate and compare primary key indexing methods, like clustering and hashing, since they cannot be applied to situations like the 30 attribute problem, stated above.

We will review prior work on relevant indexing methods, demonstrate the scalability problems with each of these methods, and propose a new indexing approach which scales up well and is efficient in solving multiattribute queries. To do this, we will analyze structural database characteristics which allow more efficient data storage, which subsequently enhances index generation and maintenance, which leads to efficient query processing for large historical databases.

1.2 Example Database

To demonstrate the issues of each indexing method, it is appropriate to construct a synthetic database with characteristics commonly observed in historical databases. This database consists of two relations, and is used to compare the various approaches to indexing a database for efficient resolution of multiattribute queries. This example database allows computations on index size and performance and illustrates the issues with indexing for multiattribute queries.

The first relation, FACTS, is time variant, and contains 10 attributes of 10 bytes each. This provides a physical record length of 100 bytes, and the relation contains one million records, producing a storage requirement of 100MBytes. The primary key is composed of four attributes, TEN_K, TIME, TWO, and MANY. The cardinality of each attribute is important. TEN_K is a high cardinality attribute, composed of 10,000 unique values. TIME represents the time dimension and contains 200 unique values. TWO is a low cardinality attribute composed of two distinct values, and MANY is composed of 1000 unique values.

The second relation, REFERENCE, is not time variant and is composed of 100 attributes of 10 bytes each. This provides a physical record length of 1,000 bytes, and the relation contains 10,000 records, producing a storage requirement of 10MBytes. The primary key is a single attribute, TEN_K. TEN_K contains 10,000 unique values. Refer to Figure 2 for SQL definitions of the two tables.

Two SQL demonstration queries are provided in Figure 2. Later, these queries will be used to demonstrate the requirements of different indexing approaches, as they attempt to respond to each query efficiently. Each method is compared on its strength and weakness in physical index size and scalability, maintainability, access IOs, cardinality sensitivity, usefulness in multiattribute queries, and parallelization effectiveness.

Figure 1. The Example Database

Figure 2. Demonstration Queries 1 and 2

2 Related Work

To begin, it is necessary to develop a criteria which can be used to compare each approach to solving the multiattribute query problem.

2.1 Evaluation Criteria

There are two indexing methods which are in dominant use in commercial applications which address historical databases: B-Trees and bit maps. B-Trees are a "balanced tree" approach to indexing database attributes. Bit maps are a form of file inversion, where the relative bit corresponding to a record number in an array of bits is turned on when a particular value is contained in that record. This chapter describes efforts at solving the index part of the multidimensional query problem by using B-Trees.

To evaluate and compare this index method, the following criteria is used:

Physical index size/Scalability

Often overlooked or trivialized in research, the physical size of the index can become its greatest problem. Index size is addressed as both main memory and disk space requirements. When the index space is larger than the data space, significant index management problems arise.

Maintainability

Since the typical processing associated with very large database updates is a many record load, followed by a many record unload, what effect does the index method have on load and unload times?

Access IOs

The number of IO operations that are required to access one record on average. The cost of index processing is dominated by the number of IOs required to retrieve the desired record. Any other processing is inconsequential.

Cardinality Sensitivity

This characteristic is related to Scalability. Some indexes, like B-Tree and inverted lists, are insensitive to the cardinality of the relation’s attribute. The index requires the same amount of space whether there are two values in the attribute or 10,000. However, bit maps are extremely sensitive, changing significantly the physical space requirements as the cardinality increases.

Usefulness in multiattribute queries

How well does the index method support conjuncts and disjuncts in query processing? Can indexes be intersected or unioned dynamically, or do the relation’s attributes require pre-concatenation within the index? Pre-concatenation eliminates the disjunction capability and causes a combinatorial explosion. Typically, when indexes are evaluated for functionality, their effectiveness on range operations is explored. All of the index methods discussed here support adequately range operations, so this characteristic will not be addressed further.

Parallelization Effectiveness

Very large databases benefit significantly from executing queries in parallel across multiple processors. What information is available to the query optimizer about the database and its indexes to allow it to divide a query across multiple processors? The structure of the database and its indexes are examined for their adaptability to, and effectiveness in, a parallel environment.

2.2 B-Tree Indexes

B-Tree indexes, in some form, are the primary indexing method provided in the major commercial RDBMS products. This is a good place to begin our evaluation.

Each node of the B-Tree index is composed of ‘item+week+two+many+pointer’. The storage requirement of each node is calculated as

4 attributes of 10 bytes each + a 4 byte pointer = 44 bytes/node

There is one node for each record. Note that the TWO attribute of the FACTS relation cannot be referenced through the index unless ITEM and WEEK are also referenced in the query. This indicates that attributes concatenated in the index must be referenced in a left to right order. All attributes need not be accessed, as long as those that are referenced are done so contiguously from left to right. Since we can’t assume that all four primary key attributes are referenced on every query, we need to construct indexes to ensure that any query received by the DBMS on relation FACTS will result in an indexed retrieval. Figure 3 demonstrates the required indexes and their storage requirements.

Figure 3. Index storage requirements

Graphically, this looks like (attribute numbers):

Figure 4.

The first index created on (item, week, two, many) can be used to process queries which reference (item), (item, week), (item, week, two), and of course, (item, week, two, many). As long as attributes referenced in the query are referenced in a left to right order within the index, the index is usable. For those queries which reference attributes not in the above order, i.e. (week, many), alternate indexes must be defined.

In Figure 3, the storage requirements for each index are also calculated. Observe that the total index storage requirement of 232MBytes is more than two times the size of the relation indexed (100MBytes). Although a 200Mbyte index may be manageable for this small relation, when this technique is applied to very large historical databases, for example a one terabyte relation, a two terabyte index would make the application intractable. Managing the indexes becomes a significantly greater problem than managing the data. This is a relatively common example of the indexing requirements of the time-oriented "fact" table in a historical database application. It is left to the reader to contemplate the indexes resulting on the REFERENCE table, with an index requirement for 30 attributes, all referenced in random order, with most queries referencing five of the 30 indexed attributes, plus zero or more non-indexed attributes.

Finally, the processing time for utilizing any of the indexes in Figure 3 is two IOs. None of the indexes are small enough for the lowest level to fit in memory, so there will always be one IO to access the index, and one IO to access the desired record. We assume the higher levels of the B-Tree are held in main memory.

2.3 B-Tree Index Evaluation

The following is a list of characteristics of B-Tree indexes as applied to historical database systems.

Physical index size/Scalability

As described above, the B-Tree index creates a combinatorial explosion of indexes in a normal, fact table primary key. This creates an index space too large to manage efficiently for large relations. It creates an intractable problem. This problem is amplified when more than four attributes of the relation require indexing.

Maintainability

The B-Tree is good for insert, update, and delete of a record within an individual index. When many indexes require maintenance for a single record, the maintenance property becomes poor. In our example relation FACTS above, seven indexes must be updated for each row inserted or removed from the relation. B-Tree’s primary advantage of maintainability is lost in a historical database application.

Access IOs

On average, two IOs are required to retrieve a record through a B-Tree index [Salz88]. Assuming the higher level indexes are held in memory, one IO is required to access the first-level index, and one IO is required to access the desired record. Most queries presented to historical databases return many records, on an order of 5,000 to 250,000 records. Although two IOs may be acceptable to transaction processing systems where single records are inserted, updated, or retrieved, it is too great an overhead for a historical database.

Cardinality sensitivity

B-Trees are insensitive to attribute cardinality. Their size is determined by the number of records in the relation and the number of bytes defined for the attribute (or concatenation of attributes). The B-Tree indexes on FACTS(TWO) and FACTS(MANY) are the same size.

Usefulness in multiattribute queries

Conjunction is supported by B-Tree indexes by defining pre-concatenated indexes before their use is required. Therefore, B-Trees only support static conjunction. Unless the DBA constructs all possible combinations of indexes required to resolve any query, pre-knowledge of the universe of queries is required. This knowledge is usually unavailable in Decision Support System/Executive Information System (DSS/EIS) applications. DSS/EIS systems are the primary clients of very large historical databases.

Disjunction is not supported by B-Tree indexes.

Parallelization Effectiveness

Effective parallelization of queries depends on the ability of the DBMS optimizer to delegate segments of the data and indexes to separate processors. B-Tree indexes as implemented in commercial RDMBS products are monolithic structures which are not easily divisible across multiple processors.

2.4 Comparison of Multiattribute Index Methods

A summarization of various multiattribute indexing method is provided in Figure 5. The table evaluates each approach against the criteria established above, and describes the relative effectiveness of the approach when applied to large historical databases. The "dynamic, unrestricted, partitioned bitmaps" is the proposed approach to multiattribute index processing.

Figure 5. Comparison summary of major approaches to
multiattribute queries

3 Research

We propose to describe an efficient solution to the indexing scaleability problem with respect to multiattribute queries, with recommendations for other techniques which improve further the effectiveness of managing historical databases.

An effective solution to the management of historical databases is more than just an indexing method. It is a total approach involving the physical layout of the data, with small, efficient indexes on that data. This solution includes a horizontal partition data layout recommendation, similar to those described in parallel database.

Once the data is physically stored optimally, inverted list indexes can be applied to individual columns of the relation. The important rule here is to never pre-concatenate attributes of the relation, which causes a combinatorial explosion. Each of these inverted lists are created on individual attributes, as many as required. They may be created or dropped at any time without affecting the indexes of other attributes. During query processing, each referenced, indexed attribute constructs a dynamic bit map for its own attribute. After each bit map is populated from its corresponding inverted list, they are "ANDed" or "ORed" together, as indicated by the parse tree, to form a single, in-memory bit map. The records are then retrieved as indicated by the combined bit map. This method uses minimal memory, both disk and main, utilizes all available information on the relation, without creating redundant information in the indexes.

Hypothesis

The hypothesis of this paper is that the index methodology proposed here is better, or more efficient, at solving multiattribute queries on very large, historical databases than methods described in earlier research, or implemented in commercial databases. It has no restrictions on the type of queries it can resolve efficiently.

We intend to demonstrate this by comparing the new, proposed method with the methods described in previous research. For our analysis, each method will be compared on its strength and weakness in physical index size and scalability, maintainability, access IOs, cardinality sensitivity, usefulness in all multiattribute queries, and parallelization effectiveness.

Data

The data used in this analysis is extrapolated from the definition of the Example database defined in Figure 1. Properties of theoretical indexes can be derived from this definition for comparison purposes.

Research Methodology

Using these relations, we can calculate the sizes of the indexes proposed, as well as the sizes of the previously published methods. Also, the number of rows retrieved can be calculated from indexes on these relations. The two demonstration queries, described in Figure 2, are used in the retrieval analysis to illustrate the problems and solutions in resolving the queries with each index method.

In addition to the analysis of the different indexing methodologies which address these multiattribute queries, we will describe the algorithm necessary to implement this new technology. Also, maintenance costs associated with each method be compared.

A detailed description of the new indexing methodology will be described. This includes:

4. Bibliography

[Salz88] Salzberg, B. File Structures: An Analytical Approach. Simon & Schuster, Englewood Cliffs, New Jersey, 1988.