Requirement: Gather distinct primary key values from several tables t2 ... tn into one table t1. Which is faster, this? INSERT t1 ON EXISTING SKIP SELECT [primary key columns] FROM t2 UNION ALL SELECT [primary key columns] FROM t3 UNION ALL SELECT [primary key columns] FROM t4 ... UNION ALL SELECT [primary key columns] FROM tn; ...or this? INSERT t1 SELECT [primary key columns] FROM t2 UNION SELECT [primary key columns] FROM t3 UNION SELECT [primary key columns] FROM t4 ... UNION SELECT [primary key columns] FROM tn; ...or, it probably doesn't matter? asked 07 Nov '10, 16:20 Breck Carter Volker Barth |
I am going to say that "it depends" but my initial thought would be that the second form would be better. My reasoning is that the first form is going to force an index probe into T1 for every insert to check for existence, whereas the second form will first generate a list of unique values (by either doing a sort or more likely generating an - hopefully in-memory - hash of all values) and then doing the insert thus reducing the number of probes into the T1 primary key index. To verify my intuition I ran the following tests: Step 1: create tables
Step 2: populate the tables with unique values
Step 3: run first form - insert ... on existing skip
Repeat step 3 five times in total, recording the time of the insert on each iteration Step 4: run second form - insert ... union ...
Repeat step 4 five times in total, recording the time of the insert on each iteration Step 5: drop all tables
then repeat step 1, then populate the tables with non-unique values (as a new form of step 2)
then repeat steps 3 and 4 to get new timings. The results that I got on my dual core laptop are:
So in both the unique value case and the non-unique value case the second form (using 'union') is faster. Note that in my case the amount of data being used fits into cache and therefore for larger data sets you may get different results. HTH answered 07 Nov '10, 21:07 Mark Culp 3
Cool! You have verified Murphy's Law for me: I picked Method 1 to use, before reading your reply :) |
Is there an underlying assumption that t1 does not contain any PK values from t2..tn beforehand? Otherwise, the second statement is doomed to fail, and that would matter...
Note that in order to use the ON EXISTING SKIP clause the table must have a primary key. So if the table does not have a PK then the first statement is doomed to fail. My answer (see below) assumes that T1 had the same PK columns as the T2, T3, ... Tn.
@Mark: To clarify: My comment was not dealing with the existence of a PK column in t1 (that a necessity, of course), but with the question whether the table t1 might already contain any rows with PK values that also exist in the other tables.
@Everyone: Sorry for the implicit assumption: The schema and data are such that both methods actually work :)
@Mark: "doomed to fail"... I need that on a coffee cup, I'm too fat to wear it on a tshirt :)
@Breck: Hey, that was mine contribution, and I definetely owe it to Joe Jackson's "Beat Crazy" album:)