I’ll preface this question by stating that I’m using Oracle 10g Enterprise edition and I’m relatively new to Oracle.
I’ve got a table with the following schema:
ID integer (pk) -- unique index
PERSON_ID integer (fk) -- b-tree index
NAME_PART nvarchar -- b-tree index
NAME_PART_ID integer (fk) -- bitmap index
The PERSON_ID is the foreign key for the unique id of a person record. The NAME_PART_ID is the foreign key of a lookup table with static values like “First Name”, “Middle Name”, “Last Name”, etc. The point of the table is to store individual parts of people’s names separately. Every person record has at least a first name. When trying to pull the data out, I first considered using joins, like so:
select
    first_name.person_id,
    first_name.name_part,
    middle_name.name_part,
    last_name.name_part
from
    NAME_PARTS first_name
    left join NAME_PARTS middle_name
        on first_name.person_id = middle_name.person_id
    left join NAME_PARTS last_name
        on first_name.person_id = last_name.person_id
where
    first_name.name_part_id = 1
    and middle_name.name_part_id = 2
    and last_name.name_part_id = 3;
But the table has tens of millions of records, and the bitmap index for the NAME_PART_ID column isn’t being used. The explain plan indicates that the optimizer is using full table scans and hash joins to retrieve the data.
Any suggestions?
EDIT: The table was designed this way because the database is used across several different cultures, each of which has different conventions for how individuals are named (e.g. in some Middle Eastern cultures, individuals usually have a first name, then their father’s name, then his father’s name, etc). It is difficult to create one table with multiple columns that accounts for all of the cultural differences.
Given that you’re essentially doing a full table scan anyway (as your query is extracting all data from this table, excluding the few rows that wouldn’t have name parts that were first, middle or last), you may want to consider writing the query so that it just returns the data in a slightly different format, such as:
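A minimal sketch of what such a query could look like, assuming the NAME_PARTS table and the lookup ids 1, 2 and 3 from the question (this is an illustration, not necessarily the exact query originally posted):

```sql
-- Sketch only: one output row per (person, name part).
-- Assumes 1 = first name, 2 = middle name, 3 = last name, as in the question.
select person_id,
       name_part_id,
       name_part
from   name_parts
where  name_part_id in (1, 2, 3)
order  by person_id, name_part_id;  -- keeps each person's parts adjacent
```

The ORDER BY is what lets client code stitch consecutive rows for the same person_id back together; drop it if the client does its own grouping.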
Of course, you’ll end up with 3 rows instead of one for each name, but it may be trivial for your client code to roll these together. You can also roll the 3 rows up into one by using decode, group by and max:
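The decode / group by / max roll-up could be sketched as follows, again assuming the table and lookup ids from the question. The column aliases are illustrative names, not part of the original schema:

```sql
-- Sketch only: pivot the three name parts into one row per person.
-- decode() with no default returns NULL, and max() ignores NULLs,
-- so each column picks up the single matching name_part (if any).
select person_id,
       max(decode(name_part_id, 1, name_part)) as first_name,
       max(decode(name_part_id, 2, name_part)) as middle_name,
       max(decode(name_part_id, 3, name_part)) as last_name
from   name_parts
where  name_part_id in (1, 2, 3)
group  by person_id;
```

One difference worth checking against your data: this sketch returns a row for any person with at least one of the three parts, leaving the missing ones NULL. The join version in the question filters on middle_name.name_part_id and last_name.name_part_id in the WHERE clause, which discards persons who lack a middle or last name, so row counts can differ.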
This will produce results identical to your original query. Both versions will only scan the table once (with a sort), instead of dealing with the 3-way join. If you made the table an index-organized table with person_id as the leading primary key column, you’d save the sort step.
I ran a test with a table with 56,150 persons, and here’s a rundown of the results:
Original query:
My query #1 (3 rows/person):
My query #2 (1 row/person):
Turns out, you can squeeze it a bit faster still; I tried to avoid the sort by adding an index hint to force the use of the person_id index. I managed to knock off another 10%, but it still looks like it’s sorting:
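An index hint of this kind might look like the sketch below. The index name name_parts_person_id_ix is hypothetical; substitute whatever the b-tree index on PERSON_ID is actually called:

```sql
-- Sketch only: ask the optimizer to access NAME_PARTS via the PERSON_ID index.
-- name_parts_person_id_ix is a made-up index name used for illustration.
select /*+ index(np name_parts_person_id_ix) */
       np.person_id,
       max(decode(np.name_part_id, 1, np.name_part)) as first_name,
       max(decode(np.name_part_id, 2, np.name_part)) as middle_name,
       max(decode(np.name_part_id, 3, np.name_part)) as last_name
from   name_parts np
where  np.name_part_id in (1, 2, 3)
group  by np.person_id;
```

Note that the hint has to reference the table alias (np here) when one is used, and that Oracle silently ignores a hint it cannot apply, so it is worth confirming in the explain plan that the index access actually appears.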
However, the plans above are all based on the assumption that you’re selecting from the entire table. If you constrain the results based on person_id (e.g., person_id between 55968 and 56000), it turns out that your original query with the hash joins is the fastest (27 vs. 106 consistent gets for the constraint I specified).
On the THIRD hand, if the queries above are being used to populate a GUI that uses a cursor to scroll over the result set (such that you would only see the first N rows of the result set initially, reproduced here by adding an “and rownum < 50” predicate), my versions of the query once again become faster, and by a wide margin (4 consistent gets vs. 417).
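For reference, a first-N-rows test using Oracle's ROWNUM pseudocolumn could be sketched like this, assuming the same table and lookup ids as in the question:

```sql
-- Sketch only: stop after the first 49 rows of the 3-rows-per-person query.
select person_id,
       name_part_id,
       name_part
from   name_parts
where  name_part_id in (1, 2, 3)
and    rownum < 50;

-- ROWNUM is assigned before GROUP BY and ORDER BY are applied, so for the
-- grouped (1 row/person) form the limit belongs in an outer query:
select *
from  (select person_id,
              max(decode(name_part_id, 1, name_part)) as first_name,
              max(decode(name_part_id, 2, name_part)) as middle_name,
              max(decode(name_part_id, 3, name_part)) as last_name
       from   name_parts
       where  name_part_id in (1, 2, 3)
       group  by person_id)
where  rownum < 50;
```

Putting the ROWNUM predicate directly inside the grouped query would cap the input rows before grouping and could return persons with incomplete names, which is why the second form wraps it in an inline view.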
The moral of the story is that it really depends exactly how you’re accessing the data. Queries that work well on the entire result set may be worse when applied against different subsets.