When rebuilding a database with a RELOAD.SQL file, the rebuild process generally creates foreign keys after the data is loaded.
Is that a performance advantage when compared to including the foreign key definition right in the table creation step? – Or is that just in order to prevent FK violations during the data loading phase?
Or asked more generally:
Will the "quality" of the index automatically built for a FK be different when the FK definition is before or after the table is filled (with LOAD TABLE)?
If the order of steps is important, would it then be as useful to defer PRIMARY KEY definitions as well (what DBUNLOAD does not)?
Current system: 188.8.131.5274, also relating to 11.0.1 and 12 beta (I'm aware that PK/FK indices are handled differently in 10+ databases).
asked 09 Mar '10, 11:49
There are three performance advantages that come from deferring index creation when loading bulk data.
First, index creation is more efficient when the table is already loaded. Consider a large database with two tables P and F with a foreign key from F to P. Assume that the buffer pool is not large enough to store a significant portion of P. As rows of F are loaded, the server checks for referential integrity (RI) violations. If the foreign key on F is already defined, the server needs to probe the primary key index on P to ensure there is a matching row. Since the buffer pool can't hold enough of P, many of these requests will require a disk seek (perhaps 10ms per row). This is effectively a nested loop join; for a table with millions of rows, it could take thousands of seconds to verify the RI. Instead of using a nested loop join, it would be more efficient in some cases to use a merge join or a hash join.
The server uses sorting during index creation to convert the nested loop join to a merge join. This reduces the random reads on the primary key index, and reduces the cost of verifying RI for foreign key indexes. Sorting is not just beneficial for foreign key indexes. Sorted rows allow the index to be built in physical order, avoiding random I/O within the index being built. This benefits all index types, not just foreign key indexes.
In version 8.0 and above, the server will consider sorting the keys of an index when building an index for a table that is already loaded. In versions 8.0 to 10.0, the server begins loading the table and monitors how many times it needs to block for a disk read (for example, to read the primary index or to read within the current index structure). Once a threshold is met, the server switches to a sorting plan where the rows are sorted in index order. This means each primary key value is probed at most once and the index is built in physical order.
In version 11.0 and later, the server always uses a sorted strategy. Further, the server considers parallel access plans that perform the sort on multiple threads and verifies referential integrity using joins (which may be parallelized). There is special JNLO(m) join type that is a memoizing join. The optimizer can construct plans that sort rows of F in foreign-key order then join to P with only one probe of P's primary key for each unique key. Other join types can also be selected by the query optimizer.
A second benefit to deferred index creation is that index pages are not interspersed with table pages. When an index is created before data is loaded, table pages and index pages are allocated as rows are loaded, and this can lead to index pages being interspersed with table pages. Normally this is not a very significant issue, but it can reduce the efficiency of sequential scans of the table. Sequential scans issue read hints to load data asynchronously. If index pages are interspersed, some additional number of seeks may be needed. If this does occur, reorganizing the table will correct the physical layout issue.
A third benefit to deferred index creation is that index leaf pages can benefit from sorting. Index leaf pages use prefix compression that can pack more entries into a page if they have a common prefix. When the data is loaded in sorted order, the compression does the best job that is possible. If the data is not loaded, there can be cases where compression could have saved space but the load order meant the opportunity is lost. Reorganizing an index also restores it to this "best possible" state.
So, there are some benefits to deferring creation of indexes when loading bulk data. The benefits improve the time needed to create the indexes and also can improve the physical layout of the table and indexes (although this is usually a minor factor). The physical structure can also be restored by reorganizing.
answered 10 Mar '10, 21:01
Ivan T. Bowman
My vote is to "prevent FK violations during the data loading phase". And I don't think that you will get some performance penalty to create FK after load data.
answered 10 Mar '10, 12:38