Friday, November 26, 2010

Index Intrigued

It IS back to the basics. I have tried my hands at understanding this oracle concept a number of times and once before i did too but fate had it that i would have to do it again and this blog is a result of that.

So here's the whole diatribe:


Indexes are Oracle objects, used to store the data of a certain column/attribute of a table. The purpose of indexes is to improve the performance of queries on that table (i.e. to speed ‘em up). The ROWID is used to identify the row in the table and get the data from there, once the particular entries in index have been identified.

The indexes are stored in the index segments.

Major classification:

B-Tree Indexes
Bitmap Indexes


B-tree (Balanced not binary, tree) –

Let’s first build our understanding on the balanced tree structure:
First things first … there are following parameters/entities to/in a balanced tree:

- Root node
- Intermediate nodes
- Leaf node
- Tree depth
- Maximum number of child nodes for each node (root/intermediate nodes)

A Balanced tree data structure is about keeping the tree balanced. This means that all the leaf nodes must be at the same depth. By maximizing the number of child nodes within each internal node, the height of the tree decreases and the number of expensive node accesses can be reduced. But like everything else, that too has its trade off. You wouldn’t want a balanced tree with one root and 2 leaf nodes, as that way you would just end up splitting the table data into two halves … wouldn’t give much of a performance boost up :-). As the elements are added to the tree, the tree branches are split, until the maximum number of child nodes for each node is reached, in which case the tree depth is increased.

B-tree indexes are the default indexes created by oracle (using Create Index command). Data is always at the leaf nodes. B trees with >5 nodes are pretty uncommon even for very large tables.





A B-tree index contains following 3 things for each row:

- Some overhead (u can imagine it as pointers from root node to the leaf node, which of course contains the actual data)
- ROWID of the row (in case of unique index, one row id is present in each leaf node, thus, identifying the row uniquely using the indexed column’s value. In case of non-unique index, the leaf node contains the ROWID’s respective to that non-unique value in a sorted order.)
- The actual data value of indexed column (in the leaf node)


General rules of use (of course everything depends, but they apply in most cases):

- They are used for high cardinality columns/attributes.
- B-tree indexes are better to use in cases where the table is either read-only or is very unlikely to be updated. (but that’s true of all indexes)

Any index needs to be updated whenever table is updated... i.e. on every DML run. Thus, making DML slower.

If the database does not change then the index too doesn’t need to be changed. But if there are changes, then managing the database and its index becomes more complicated.


Deleting is not that BAD...


Deleting records from a database doesn't cause much trouble. The index can stay the same, and the record can just be marked as deleted. The database stays in sorted order. If there are a lot of deletions, then the searching and storage become less efficient.


But Insertion is just about moving 1 Mega-Ton boulders up Mount Everest.


Insertions are a disaster in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record in the file requires shifting all of the records down one. Such an operation is just too expensive to be practical.


How to deal with it…


A trick is to leave some space lying around to be used for insertions. Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions. Those records would be marked as if they were "deleted" records.
Now, both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is enough space is nearby that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.


How B-Tree Indexes Speed things up:


1. By Avoiding Full Table scans
2. By Avoiding table access altogether
3. By avoiding sort. Sort is avoided in following cases:

- Group by
- Order by
- Distinct

.

.

Now let’s move on to the dodgy one … Below whatever you find in quotes (“from oracle doc”) is from Oracle documentation. I have just supplemented it with inputs from my thought process. Your inputs are welcome too.

.

Bitmap –


Based on Oracle documentation, a bitmap index’s purpose is to “provide pointers to the rows in a table that contain a given key value”. In a normal (read, B-tree) index, these pointers are the ROWID’s stored in the leaf node, as mentioned earlier.


Again, based on Oracle documentation: “In a bitmap index, a bitmap for each key value is used instead of a list of ROWID’s”.


“Each bit in the bitmap corresponds to a possible ROWID. If bit is set (1), then that row with corresponding ROWID contains the key value” (Well that’s obvious).


“A mapping function converts the bit position to an actual ROWID” (and what is that mapping function? For now and until I get my hands on it, let’s call it overhead).

The Bitmap indexes are typically used when the column has Low cardinality ( i.e. number of possible distinct values … check the example).


For example, consider the below scenario, where a low cardinality column ‘Gender’ (for those wondering what’s cardinality ;-), here the cardinality is 2).
Here the 2 key values are ‘M’ and ‘F’.


Drawing on this, let’s create the bitmap for these 2 key values.


Id Gender BITMAP

.........................F M
1 Female.......1 0
2 Male..........0 1
3 Male..........0 1
4 Female......1 0

So the bitmap for the M is 0110.


This means that entries 2, 3 match for ‘M’ and respective rows have the value in ‘Gender’ column as ‘M’.


And for F the bitmap is 1001.


This means that entries 1, 4 match for ‘F’ and respective rows have the value in ‘Gender’ column as ‘F’.

Further, the documentation states: “Bitmap indexing efficiently merges indexes that correspond to several conditions in a WHERE clause.”
Now let’s extend above example, by introducing err … a new possible gender value. No Pun intended.

Id Gender BITMAP

.....................F M U
1 Female...... ..1 0 0
2 Male............0 1 0
3 Male............0 1 0
4 Unknown... .0 0 1
5 Female....... .1 0 0

Consider a query with WHERE clause as: Where gender=’Female’ OR gender=’Unknown’;


So now the bitmaps for ‘F’ and ‘U’ are:
F: 10001
U:00010

In such a query on such an indexed column, Oracle will take the advantage of each entry. Because of the way bitmap entries store rows (as streams of bits), it's very easy to do such operations. Like:


F U
1 0 - Match
0 0
0 0
0 1 - Match
1 0 - Match

Oracle just does some real simple math to find all of the matching rows ONCE, then figures out what those rows are (using Overhead, or mapping function) and return the data to you.

So, the more the number of the low cardinality columns in a table, more bitmaps can be made and the queries will be more "efficient" based on the number of columns in the WHERE clause.

That's why bitmap indexes are so popular in datawarehousing: they're very helpful in ad hoc queries.

Summarizing, a Bitmap index entry contains following 3 things:

- Overhead (as described, is a mapping function that converts the bit position to an actual ROWID)
- The actual data value, among the possible values
- A list of bits (one for each row)


General rules of use (of course everything depends, but they apply in most cases):

Bitmap indexes are used when we have:
- Table column with low cardinality.
- Largely read-only tables
- Multiple low cardinality columns.

1 comment:

  1. I have always try to get away from this topic as I find it impossible to understand. Its not the point that I have not tried to learn it but everytime I got mixed up between types. This is the first time I found this topic little interesting after reading from your explanation. Thanks for this creative post.
    sap upgrade tool

    ReplyDelete