This is a late follow-up on this question:

Say, I have two tables T1 and T2 with identical schema and nearly identical contents in the same database. They have a primary key pk.

Now I want to list those rows from table T1 (i.e. select T1.*) that don't exist at all in T2 - i.e. T1's primary key doesn't exist in T2.

I can think of three four different ways to do so (and I'm gonna list them as answers...). Is there any preferred way among them w.r.t. to performance? Or is there another, better approach?

Or should I prefer the way that does seem most logical/comprehensible to me?

(Yes, Breck, of course I should...)

In my tests, the queries do use different plans and perform not identical well (and I'm using sa_flush_cache()) in between). But I would like if there's a general answer or at least a rule-of-thumb.

The result sets of approaches A-D should be identical.

FWIW, some test results with 12.0.1.3389 on a "slow" box:

A: 1750 - 1875 ms
B: 3312 - 3516 ms
C: 2203 - 2343 ms
D: 1328 - 1563 ms

So my current personal preference is A, both because of semantics and performance...

asked 26 Aug '11, 13:18

Volker%20Barth's gravatar image

Volker Barth
40.1k361549819
accept rate: 34%

edited 23 Sep '11, 07:21

I am looking forward to the answer here because I usually pick B which evidently performs worst.

(26 Aug '11, 16:03) Siger Matt

Any hints to general performance hints from the Sybase query experts like Glenn or Ivan - or does it read "it depends"?

(01 Sep '11, 03:51) Volker Barth
Replies hidden

Anybody from the Alpha class listening?

Cf. Breck's blog

(12 Sep '11, 16:30) Volker Barth

Yes, I'm still interested in more opinions...

From the current remarks, I would think that SQL language features like EXCEPT still seem rather unfamiliar, and that therefore one would prefer not to use them - from a practical point of view: Even if oneself does understand them, the next guy might not.

I do appreciate this opinion, but such kind of thinking might freeze the SQL usage to something like SQL/92 in the long run...

(17 Sep '11, 15:31) Volker Barth
Comment Text Removed

Glenn, Ivan, anybody - any hints would still be highly appreciated ... particularly as the public opinion seems to prefer those approaches that do perform worse in this case... (which is alright with me, of course!).

See the tag I've added:)

(23 Sep '11, 07:21) Volker Barth

Approach C: Use a left join and filter out the joined rows

select T1.* from T1
   left join T2 on T1.pk = T2.pk
where T2.pk is null
order by T1.pk

Personally, I think of this as a rather "tricky" (i.e. incomprehensible) approach.

permanent link

answered 26 Aug '11, 13:22

Volker%20Barth's gravatar image

Volker Barth
40.1k361549819
accept rate: 34%

edited 27 Aug '11, 06:31

I like this approach best.. it's the cleanest to my fuzzy brain, although I would probably typically write it as you did in Approach B because that's the way I'd incrementally think it through. The disadvantage with approach A is that the next guy that comes along to work in your code is not going to understand it.

Now, I'd like to think this is an ad-hoc query, or at least a daily or weekly report, rather than a real production frequently run query. In case this assumption is correct, the diference in processing time does't matter.

I read somewhere that computer code is first meant to be read by humans, not computers. I agree with that. your code needs to be intelligible to the next guy... the computer just doesn't care how eloquent you are.

(27 Aug '11, 20:29) Ron Hiner
Replies hidden

@Ron: You're correct that this has come up while doing some ad-hoc queries. Nevertheless, it's a general question that I have beared for a long time, as the according query task used often as "building block" of more complex queries...

(28 Aug '11, 05:42) Volker Barth

to me the way to go is an outer join with a where clause like C. The left join lists all matches in T.2 as well as all rows from T1. I would think this would be the fastest, but i would have to do some testing. it's also the least complicated which in my book most of the time is the way to go.

(31 Aug '11, 14:34) izatt82

Approach B: Use a NOT IN subquery

select * from T1
where pk not in
(select pk from T2)
order by pk

That might be the "classical" approach.

permanent link

answered 26 Aug '11, 13:20

Volker%20Barth's gravatar image

Volker Barth
40.1k361549819
accept rate: 34%

edited 27 Aug '11, 06:32

I think, that the best approach depends on data statistics in both tables. I am personally prefer human readable, clear and easy to understand sql queries. Personally for me, approach C (leader) and approach B is much easy to read and understand. But,data statistics is very important.

permanent link

answered 29 Aug '11, 05:33

AlexeyK77's gravatar image

AlexeyK77
70761224
accept rate: 8%

edited 29 Aug '11, 05:34

Approach A: Use an IN subquery with a table with EXCEPT

select * from T1
where pk in
(select pk from T1
 except
 select pk from T2)
order by pk

I guess by "subqery-flattening" this will be transformed internally to the following JOIN with a derived table with EXCEPT variant (which is less comprehensible IMHO).

CORRECTION: My assumption about this transformation seem (partly) wrong. Whereas calling PLAN() on both queries yields similar results, calling REWRITE() leads to different optimizations, and the actual query execution time is different, as well. Therefore I turn the JOIN version into a forth answer as approach D.

I guess the set operators EXCEPT and INTERSECT are still somewhat underestimated. Personally, I like to use them as they cope fine with SQL's set-based logic.

permanent link

answered 26 Aug '11, 13:20

Volker%20Barth's gravatar image

Volker Barth
40.1k361549819
accept rate: 34%

edited 29 Aug '11, 04:21

For performance reasons, one might use "EXCEPT ALL" instead of "EXCEPT" here - by definition the "select pk from T1" cannot contain duplicates (it simply selects the PRIMARY KEY), so the default "Duplicate elimination algorithm" use by EXCEPT would be unnecessary...

(28 Aug '11, 05:53) Volker Barth

Approach D: Use a JOIN with a derived table with EXCEPT

select T1.* from T1 inner join
  (select pk from T1
   except
   select pk from T2) DT
   on T1.pk = DT.pk
order by T1.pk

I thought this was just a rewrite of approach A but it performs different (and better in my tests).

permanent link

answered 29 Aug '11, 04:23

Volker%20Barth's gravatar image

Volker Barth
40.1k361549819
accept rate: 34%

See the notes on EXCEPT ALL vs. EXCEPT on approach A.

(29 Aug '11, 04:27) Volker Barth
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×438
×275
×7
×5
×1

question asked: 26 Aug '11, 13:18

question was seen: 3,804 times

last updated: 23 Sep '11, 07:21