Say, I have a child table (1:n) and want to find out which (if any) parent has entries with different values in the child table. Sounds unclear, methinks, so I try to be more concrete: For example, in the SA 12 sample database, I want to find out whether any Customer has entries in the SalesOrders table with different regions. So, obviously, a GROUP BY is needed, and CustomerID is the parent id. But how do I "count" the different region entries? - Since there are different entries per CustomerID, a simple count(*) won't do. Classically, I have used min/max-comparisons for that, e.g.
But a count(distinct) query might seem more adequate, such as
However, the latter seems (a bit) worse in performance, possibly as it requires a duplicate elimination. (I respect that it does offer more information, as it really counts the different entries whereas the first query just returns those with at least 2 different entries.) Is there any preferred version for queries like that (both for reasons of comprehensibility and performance)? |
It's hard (impossible?) to tell for sure with such a small table, but my first guess that query 2 is not only simpler but faster seems to be born out by the Version 12.0.0.2566 runtimes and the plans after calling sa_flush_cache(). Here's the query I might have written; its plan is more complex than query 2 but the runtime looks the same (BUT AGAIN, the table is too small to tell anything)... CALL sa_flush_cache(); SELECT CustomerID, COUNT(*) AS RegionCount FROM ( SELECT DISTINCT CustomerID, Region FROM SalesOrders ) AS CustomerRegion GROUP BY CustomerID HAVING RegionCount > 1 ORDER BY 1; PS the derived table is there, not for performance, but to help me divide-and-conquer the problem. After writing that, I might have turned it into your query 2 if that turned out to be faster... but I'm not smart enough to come up with query 2 from scratch :) Interestingly enough, I'm frequently using the "SELECT col1, count() FROM T GROUP BY col HAVING count() > 1" kind of queries. As such, my query 2 seemed not that complicated. I'm having (a bit) more trouble with the derived query approach... 1
Your query 2 uses COUNT ( DISTINCT column ) together with GROUP BY different-column which is considerably more sophisticated (IMO) than a simple SELECT column, COUNT() GROUP BY same-column... which I, too, use all time. Derived tables are great because they let you solve part of the problem first, and test the solution separately: "Now I have all the distinct CustomerID - Region pairs in a derived table, now I can write an ordinary simple SELECT column, COUNT() GROUP BY same-column just like I (and you) frequently do. Having said that, in a million-row table query 2 may run faster... or, maybe, the plans will look completely different with more rows. The bottom line (the answer to your original question) is that there is NOTHING WRONG with query 2, at least in the tiny database case. 2
FWIW I have babbled about "divide and conquer" on a few occasions, in this blog post and the others it links to: http://sqlanywhere.blogspot.com/2008/10/divide-and-conquer-by-working-backwards.html @Breck: I generally prefer the divide-and-conquer-aprroach, too, and derived tables are very useful here - so I fully agree. - Having said that, it's obviously a question of one's own usage of SQL to tell whether a particular solution is more or less comprehensible. That may be a good or bad situation:) More comments hidden
|