Thursday, November 11, 2010

INDEXES

An Overview of Oracle Indexes

Oracle provides many different types of indexes for us to use. Briefly they are as follows:
a) B*Tree Indexes – These are what I refer to as 'conventional' indexes. They are by far the most common indexes in use in Oracle, and most other databases. Similar in construct to a binary tree, they provide fast access, by key, to an individual row or range of rows, normally requiring few reads to find the correct row. The B*Tree index has several 'subtypes':

b) Index Organized Tables – A table stored in a B*Tree structure. We discussed these in some detail Chapter 4, Tables. That section also covered the physical storage of B*Tree structures, so we will not cover that again here.

c) B*Tree Cluster Indexes – These are a slight variation of the above index. They are used to index the cluster keys (see Index Clustered Tables in Chapter 4) and will not be discussed again in this chapter. They are used not to go from a key to a row, but rather from a cluster key to the block that contains the rows related to that cluster key.

d) Reverse Key Indexes – These are B*Tree indexes where the bytes in the key are 'reversed'. This is used to more evenly distribute index entries throughout an index that is populated with increasing values. For example, if I am using a sequence to generate a primary key, the sequence will generate values like 987500, 987501, 987502, and so on. Since these values are sequential, they would all tend to go the same block in the index, increasing contention for that block. With a reverse key index, Oracle would be indexing: 205789, 105789, 005789 instead. These values will tend to be 'far away' from each other in the index and would spread out the inserts into the index over many blocks.

e) Descending Indexes – In the future, descending indexes will not be notable as a special type of index. However, since they are brand new with Oracle 8i they deserve a special look. Descending indexes allow for data to be sorted from 'big' to 'small' (descending) instead of small to big (ascending) in the index structure. We'll take a look at why that might be important and how they work.

f) Bitmap Indexes – Normally in a B*Tree, there is a one-to-one relationship between an index entry and a row – an index entry points to a row. With a bitmap index, a single index entry uses a bitmap to point to many rows simultaneously. They are appropriate for low cardinality data (data with few distinct values) that is mostly read-only. A column that takes on three values: Y, N, and NULL, in a table of one million rows might be a good candidate for a bitmap index. Bitmap indexes should never be considered in an OLTP database for concurrency related issues (which we'll discuss in due course).

g) Function-based indexes – These are B*Tree or Bitmap indexes that store the computed result of a function on a rows column(s) – not the column data itself. These may be used to speed up queries of the form: SELECT * FROM T WHERE
FUNCTION(DATABASE_COLUMN) = SOME_VALUE since the value FUNCTION(DATABASE_COLUMN) has already been computed and stored in the index.

h) Application Domain Indexes – These are indexes you build and store yourself, either in Oracle or perhaps even outside of Oracle. You will tell the optimizer how selective your index is, how costly it is to execute and the optimizer will decide whether or not to use your index, based on that information. The interMedia text index is an example of an application domain index; it is built using the same tools you may use to build your own index.

i) interMedia Text Indexes – This is a specialized index built into Oracle to allow for keyword searching of large bodies of text. We'll defer a discussion of these until the Chapter 17 on interMedia.

SQL> create or replace
procedure show_space
( p_segname in varchar2,
p_owner in varchar2 default user,
p_type in varchar2 default 'TABLE',
p_partition in varchar2 default NULL )
as
l_free_blks number;
l_total_blocks number;
l_total_bytes number;
l_unused_blocks number;
l_unused_bytes number;
l_LastUsedExtFileId number;
l_LastUsedExtBlockId number;
l_last_used_block number;
procedure p( p_label in varchar2, p_num in number )
is
begin
dbms_output.put_line( rpad(p_label,40,'.') ||
p_num );
end;
begin
dbms_space.free_blocks
( segment_owner => p_owner,
segment_name => p_segname,
segment_type => p_type,
partition_name => p_partition,
freelist_group_id => 0,
free_blks => l_free_blks );
dbms_space.unused_space
( segment_owner => p_owner,
segment_name => p_segname,
segment_type => p_type,
partition_name => p_partition,
total_blocks => l_total_blocks,
total_bytes => l_total_bytes,
unused_blocks => l_unused_blocks,
unused_bytes => l_unused_bytes,
last_used_extent_file_id => l_LastUsedExtFileId,
last_used_extent_block_id => l_LastUsedExtBlockId,
last_used_block => l_last_used_block );
p( 'Free Blocks', l_free_blks );
p( 'Total Blocks', l_total_blocks );
p( 'Total Bytes', l_total_bytes );
p( 'Unused Blocks', l_unused_blocks );
p( 'Unused Bytes', l_unused_bytes );
p( 'Last Used Ext FileId', l_LastUsedExtFileId );
p( 'Last Used Ext BlockId', l_LastUsedExtBlockId );
p( 'Last Used Block', l_last_used_block );
end;
/

SQL> exec show_space( 'HASH_CLUSTER', user, 'CLUSTER' )


Eg:-

SQL> create table colocated ( x int, y varchar2(2000) ) pctfree 0;
Table created.

SQL> begin
for i in 1 .. 100000
loop
insert into colocated values ( i, rpad(dbms_random.random,75,'*') );
end loop;
end;
/

PL/SQL procedure successfully completed.

SQL> alter table colocated
add constraint colocated_pk primary key(x);

Table altered.

This is a table fitting the description we laid out above – about 100 rows/block in my 8 KB database. In this table there is a very good chance that the rows with x = 1, 2, 3 are on the same block. Now, we'll take this table and purposely 'disorganize' it. In the COLOCATED table above, we created the Y column with a leading random number – we'll use that fact to 'disorganize' the data – so it will definitely not be
ordered by primary key anymore:


SQL> create table disorganized nologging pctfree 0
as
select x, y from colocated ORDER BY y
/

Table created.

SQL> alter table disorganized
add constraint disorganized_pk primary key(x);

Table altered.

Arguably, these are the same tables – it is a relational database, physical organization plays no bearing on how things work (at least that's what they teach in theoretical database courses). In fact, the performance characteristics of these two tables are different as 'night and day'. Given the same exact query:

SQL> select * from COLOCATED where x between 20000 and 40000;
20001 rows selected.

Elapsed: 00:00:01.02
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'COLOCATED'
2 1 INDEX (RANGE SCAN) OF 'COLOCATED_PK' (UNIQUE)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
2909 consistent gets
258 physical reads
0 redo size
1991367 bytes sent via SQL*Net to client
148387 bytes received via SQL*Net from client
1335 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
20001 rows processed

SQL> select * from DISORGANIZED where x between 20000 and 40000;
20001 rows selected.
Elapsed: 00:00:23.34
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'DISORGANIZED'
2 1 INDEX (RANGE SCAN) OF 'DISORGANIZED_PK' (UNIQUE)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
21361 consistent gets
1684 physical reads
0 redo size
1991367 bytes sent via SQL*Net to client
148387 bytes received via SQL*Net from client
1335 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
20001 rows processed


I think this is pretty incredible. What a difference physical data layout can make! To summarize the results:

Table Elapsed Time Logical I/O
Colocated 1.02 seconds 2,909
Disorganized 23.34 seconds 21,361

In my database using an 8 KB block size – these tables had 1,088 total blocks apiece. The query against disorganized table bears out the simple math we did above – we did 20,000 plus logical I/Os. We processed each and every block 20 times!! On the other hand, the physically COLOCATED data took the logical I/Os way down. Here is the perfect example of why rules of thumb are so hard to provide – in one case, using the index works great, in the other – it stinks. Consider this the next time you dump data from your production system and load it into development – it may very well be part of the answer to the question that typically comes up of 'why is it running differently on this machine – they are identical?' They are not identical.

Just to wrap up this example – let's look at what happens when we full scan the disorganized table:

SQL> select /*+ FULL(DISORGANIZED) */ *
2 from DISORGANIZED
3 where x between 20000 and 40000;
20001 rows selected.
Elapsed: 00:00:01.42
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=162 Card=218 Bytes=2
1 0 TABLE ACCESS (FULL) OF 'DISORGANIZED' (Cost=162 Card=218 B
Statistics
----------------------------------------------------------
0 recursive calls
15 db block gets
2385 consistent gets
404 physical reads
0 redo size
1991367 bytes sent via SQL*Net to client
148387 bytes received via SQL*Net from client
1335 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
20001 rows processed

That shows that in this particular case – due to the way the data is physically stored on disk, the full scan is very appropriate. This begs the question – so how can I accommodate for this? The answer is – use the cost based optimizer (CBO) and it will do it for you. The above case so far has been executed in RULE mode since we never gathered any statistics. The only time we used the cost based optimizer was
when we hinted the full table scan above and asked the cost based optimizer to do something specific for us. If we analyze the tables, we can take a peek at some of the information Oracle will use to optimize the above queries:

SQL> analyze table colocated
compute statistics
for table
for all indexes
for all indexed columns
/
Table analyzed.

SQL> analyze table disorganized
compute statistics
for table
for all indexes
for all indexed columns
/
Table analyzed.

Now, we'll look at some of the information Oracle will use. We are specifically going to look at the CLUSTERING_FACTOR column found in the USER_INDEXES view. The Oracle Reference Manual tells us this column has the following meaning, it:
Indicates the amount of order of the rows in the table based on the values of the index:

a) If the value is near the number of blocks, then the table is very well ordered. In this case, the index entries in a single leaf block tend to point to rows in the same data blocks.

b) If the value is near the number of rows, then the table is very randomly ordered. In this case, it is unlikely that index entries in the same leaf block point to rows in the same data blocks.

The CLUSTERING_FACTOR is an indication of how ordered the table is with respect to the index itself, when we look at these indexes we find:


SQL> select a.index_name,
b.num_rows,
b.blocks,
a.clustering_factor
from user_indexes a, user_tables b
where index_name in ('COLOCATED_PK', 'DISORGANIZED_PK' )
and a.table_name = b.table_name
/
INDEX_NAME NUM_ROWS BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ---------- -----------------
COLOCATED_PK 100000 1063 1063
DISORGANIZED_PK 100000 1064 99908

The COLOCATED_PK is a classic 'the table is well ordered' example, whereas the DISORGANIZE_PK is the classic 'the table is very randomly ordered' example. It is interesting to see how this affects the optimizer now. If we attempted to retrieve 20,000 rows, Oracle will now choose a full table scan for both queries (retrieving 20 percent of the rows via an index is not the optimal plan even for the very
ordered table). However, if we drop down to 10 percent of the table data:


SQL> select * from COLOCATED where x between 20000 and 30000;
10001 rows selected.
Elapsed: 00:00:00.11
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=129 Card=9996 Bytes=839664)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'COLOCATED' (Cost=129 Card=9996
2 1 INDEX (RANGE SCAN) OF 'COLOCATED_PK' (UNIQUE) (Cost=22 Card=9996)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1478 consistent gets
107 physical reads
0 redo size
996087 bytes sent via SQL*Net to client
74350 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
10001 rows processed

SQL> select * from DISORGANIZED where x between 20000 and 30000;
10001 rows selected.
Elapsed: 00:00:00.42
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=162 Card=9996 Bytes=839664)
1 0 TABLE ACCESS (FULL) OF 'DISORGANIZED' (Cost=162 Card=9996
Statistics
----------------------------------------------------------
0 recursive calls
15 db block gets
1725 consistent gets
707 physical reads
0 redo size
996087 bytes sent via SQL*Net to client
74350 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
10001 rows processed

Here we have the same table structures – same indexes, but different clustering factors. The optimizer in this case chose an index access plan for the COLOCATED table and a full scan access plan for the DISORGANIZED table.
The key point to this discussion is that indexes are not always the appropriate access method. The optimizer may very well be correct in choosing to not use an index, as the above example demonstrates. There are many factors that influence the use of an index by the optimizer – including physical data layout. You might run out and try to rebuild all of your tables now to make all indexes have a good
clustering factor but that most likely would be a waste of time in most cases. It will affect cases where you do index range scans of a large percentage of a table – not a common case in my experience. Additionally, you must keep in mind that in general the table will only have one index with a good clustering factor! The data can only be sorted in one way. In the above example – if I had another index
on the column Y – it would be very poorly clustered in the COLOCATED table, but very nicely clustered in the DISORGANIZED table. If having the data physically clustered is important to you – consider the use of an IOT over a table rebuild.


When Should you use a Bitmap Index?

Bitmap indexes are most appropriate on low cardinality data. This is data where the number of distinct items in the set of rows divided by the number of rows is a small number (near zero). For example, a GENDER column might take on the values M, F, and NULL. If you have a table with 20000 employee records in it then you would find that 3/20000 = 0.00015. This column would be a candidate for a bitmap index. It would definitely not be a candidate for a B*Tree index as each of the values would tend
to retrieve an extremely large percentage of the table. B*Tree indexes should be selective in general as outlined above. Bitmap indexes should not be selective, on the contrary they should be very 'unselective'.

Bitmap indexes are extremely useful in environments where you have lots of ad-hoc queries, especially queries that reference many columns in an ad-hoc fashion or produce aggregations such as COUNT. For example, suppose you have a table with three columns GENDER, LOCATION, and AGE_GROUP. In this table GENDER has a value of M or F, and LOCATION can take on the values 1 through 50 and AGE_GROUP is a code representing 18 and under, 19-25, 26-30, 31-40, 41 and over. You have to
support a large number of ad-hoc queries that take the form:

Select count(*)
from T
where gender = 'M'
and location in ( 1, 10, 30 )
and age_group = '41 and over';

select *
from t
where ( ( gender = 'M' and location = 20 )
or ( gender = 'F' and location = 22 ))
and age_group = '18 and under';

select count(*) from t where location in (11,20,30);

select count(*) from t where age_group = '41 and over' and gender = 'F';

You would find that a conventional B*Tree indexing scheme would fail you. If you wanted to use an index to get the answer, you would need at least three and up to six combinations of possible B*Tree indexes in order to access the data. Since any of the three columns or any subset of the three columns may appear, you would need large concatenated B*Tree indexes on:

a) GENDER, LOCATION, AGE_GROUP – For queries that used all three, or GENDER with LOCATION, or GENDER alone.
b) LOCATION, AGE_GROUP – For queries that used LOCATION and AGE_GROUP or LOCATION
alone.
c) AGE_GROUP, GENDER – For queries that used AGE_GROUP with GENDER or AGE_GROUP alone. In order to reduce the amount of data being searched, other permutations might be reasonable as well, to reduce the size of the index structure being scanned. This is ignoring the fact that a B*Tree index on such low cardinality data is not a good idea. Here the bitmap index comes into play. With three small bitmap indexes, one on each of the individual columns, you will be able to satisfy all of the above predicates efficiently. Oracle will simply use the functions AND, OR, and XOR with the bitmaps of the three indexes together, to find the solution set
for any predicate that references any set of these three columns. It will take the resulting merged bitmap, convert the 1s into row IDs if necessary, and access the data (if we are just counting rows that match the criteria, Oracle will just count the 1 bits). There are times when bitmaps are not appropriate as well. They work well in a read intensive environment, but they are extremely ill-suited for a write intensive environment. The reason is that a single bitmap index key entry points to many rows. If a session modifies the indexed data, then all of the rows that index entry points to are effectively locked. Oracle cannot lock an individual bit in a bitmap index entry; it locks the entire bitmap. Any other modifications that need to update that same bitmap will be locked out. This will seriously inhibit concurrency –each update will appear to lock potentially hundreds of rows preventing their bitmap columns from being concurrently updated. It will not lock every row as you might think, just many of them. Bitmaps are stored in chunks, using the EMP example
from above we might find that the index key ANALYST appear s in the index many times, each time pointing to hundreds of rows. An update to a row that modifies the JOB column will need to get exclusive access to two of these index key entries; the index key entry for the old value and the index key entry for the new value. The hundreds of rows these two entries point to, will be unavailable for
modification by other sessions until that UPDATE commits.


Why isn't my Index Getting Used?

There are many possible causes of this – we'll take a look at some of the most common.

Case 1

You are using a B*Tree index and your predicate does not use the leading edge of an index. In this case, you might have a table T with an index on T(x,y). You query SELECT * FROM T WHERE Y = 5. The optimizer will tend not to use the index since your predicate did not involve the column X – it must inspect each and every index entry in this case. It will typically opt for a full table scan of T instead.
That does not preclude the index from being used. If the query was
SELECT X,Y FROM T WHERE Y = 5, the optimizer would notice that it did not have to go to the table in order to get either X or Y (they are in the index) and may very well opt for a Fast Full Scan of the index itself, as the index is typically much
smaller than the underlying table. Note also that this access path is only available with the Cost Based Optimizer (CBO).

Case 2

Your are using a SELECT COUNT(*) FROM T query (or something similar) and you have a B*Tree index on table T. However, the optimizer is full scanning the table, rather than counting the (much smaller) index entries. In this case, the index is probably on a set of columns that can contain Nulls. Since a totally Null index entry would never be made, the count of rows in the index will not be the count of
rows in the table. Here the optimizer is doing the right thing – it would get the wrong answer if it used the index to count rows.

Case 3For an indexed column, you query using:

select * from t where f(indexed_column) = value
and find that the index on INDEX_COLUMN is not used. This is due to the use of the function on the column. You indexed the values of INDEX_COLUMN – not the value of F(INDEXED_COLUMN). The index is nullified here. You can index the function if you choose to do it.

Case 4You have indexed a character column. This column contains only numeric data. You query using the following syntax:
select * from t where indexed_column = 5
Note that the number five in the query is the constant number five (not a character string). The index on INDEXED_COLUMN is not used. This is because the above query is the same as:
select * from t where to_number(indexed_column) = 5
You have implicitly applied a function to the column and, as case 3 noted, this will nullify the use of the index. This is very easy to see with a small example:

SQL> create table t ( x char(1) primary key );
Table created.

SQL> insert into t values ( '5' );
1 row created.

SQL> set autotrace on explain
SQL> select * from t where x = 5;
X
-
5
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (FULL) OF 'T'

SQL> select * from t where x = '5';
X
-
5
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 INDEX (UNIQUE SCAN) OF 'SYS_C0038216' (UNIQUE)

You should always avoid implicit conversions anyway. Always compare apples to apples and oranges to oranges. Another case where this comes up frequently is with dates. You try to query:
-- find all records for today
select * from t where trunc(date_col) = trunc(sysdate);
You discover that the index on DATE_COL will not be used. You can either index the
TRUNC(DATE_COL) or, perhaps more easily, query using the BETWEEN comparison operator. The following demonstrates the use of BETWEEN on a date. Once you realize that the condition:
TRUNC(DATE_COL) = TRUNC(SYSDATE)
is the same as the condition:
DATE_COL BETWEEN TRUNC(SYSDATE)AND TRUNC (SYSDATE) PLUS ONE DAY MINUS ONE SECOND,
Then using BETWEEN clause is straightforward.
select *
from t
where date_col between trunc(sysdate) and trunc(sysdate)+1-1/(1*24*60*60)
Note: the expression 1/(1*24*60*60) is the fraction of one day that equals one second. Subtracting 1 would take away one day, 1/24 – one hour, and 1/(24*60) – one minute.
This moves all of the functions to the right hand side of the equation, allowing us to use the index on DATE_COL (and has the same exact effect as WHERE TRUNC(DATE_COL) = TRUNC(SYSDATE)). If possible, you should always remove the functions from database columns when they are in the predicate.

Not only will it allow for more indexes to be considered for use, it will reduce the amount of processing the database needs to do. In the above case, when we used:
between trunc(sysdate) and trunc(sydate)+1/(1*24*60*60) the values are computed once for the query and then an index could be used to find just the qualifying
values. When we used TRUNC(DATE_COL) = TRUNC(SYSDATE), the TRUNC(DATE_COL) had to be
evaluated once per row for every row in the entire table (no indexes).


Case 5

The index, if used, would actually be slower. I see this a lot – people assume that, of course, an index will always make a query go faster. So, they set up a small table, analyze it, and find that the optimizer doesn't use the index. The optimizer is doing the exactly the right thing in this case. Oracle (under the
CBO) will use an index only when it makes sense to do so. Consider this example:
SQL> create table t
( x, y null, primary key (x) )
as
select rownum x, username
from all_users
where rownum <= 100
/
Table created.

SQL> analyze table t compute statistics;
Table analyzed.

SQL> analyze table t compute statistics for all indexes;
Table analyzed.

SQL> set autotrace on explain
SQL> select count(y) from t where x < 50;

Case 6
You haven't analyzed your tables in a while and they used to be small, but now when you look at them they have grown quite large. An index will now make sense, whereas it didn't originally. If you analyze the table, it will use the index.


Ref : Expert Oracle - Thomas Kyte

No comments:

Post a Comment