Sometimes one has to use a dummy aggregate function, for example when using a window function over a single row (cf. this answer. Here a "dummy aggregate" means an aggregate function like sum() or max() calculated over a single row (and therefore, a single value). For that reason, it would be semantically irrelevant whether one uses max(), min(), sum(), avg() or even list() (unless the datatype does not allow for that).

Question:

Is there any reason to prefer one of those aggregates over others when it comes to applying them to a single row?

Or is there some optimization that would just use the single value in all these cases - thereby make the calculation (nearly) identically simple?

asked 02 Aug '11, 15:12

Volker%20Barth's gravatar image

Volker Barth
40.2k361549822
accept rate: 34%

edited 04 Aug '11, 15:29

I like min/max because (I think) they work for all data types and even a cow knows that min ( x ) = x for a single x.

(02 Aug '11, 16:08) Breck Carter
Replies hidden

Well, German cows even know that sum (x) = x for a single x. Though they sometimes get the datatype checks wrong:)

(02 Aug '11, 16:15) Volker Barth

For sure - only those with brain defects had to be slaughtered. And guess what the test was:)

(03 Aug '11, 09:31) Volker Barth

I don't think it matters all that much which function you use, because a difference in the choice of function is not going to significantly affect the choice of access plan. I, too, like Breck's choice of MIN() or MAX() since they do work for all data types, and yes even cows know what the semantics are. I will point out, though, that you can also use FIRST_VALUE() and/or LAST_VALUE() instead, which might be clearer when looking at the SQL statement.

permanent link

answered 02 Aug '11, 16:35

Glenn%20Paulley's gravatar image

Glenn Paulley
10.8k576106
accept rate: 43%

edited 03 Aug '11, 05:48

Breck%20Carter's gravatar image

Breck Carter
32.5k5417261050

For a moment, I was very irritated - not by Glenn's clear answer but by the reference of a Dilbert comic. But looking at the edit history made me feel alright again:)

And I agree that FIRST_VALUE() might be even more appropriate here. Thanks!

(03 Aug '11, 09:39) Volker Barth
Replies hidden

@Volker: for the record, my use of the phrase "even a cow knows" in the earlier comment was referring to the query engine, not you: "even a query engine written by a cow should be able to optimize max(x) = x"... and I am sure Glenn would say the same.

@Everyone: FWIW the same cartoon appeared recently here: http://sqlanywhere.blogspot.com/2011/07/friday-file-channeling-gary-larson.html ...and that's where it came from, here.

@Management: Even though iAnywhere is located very close to farm country, there are no cows employed in Engineering, and thus "even a cow knows" cannot be regarded as being insensitive to cows or politically incorrect in any way :)

(03 Aug '11, 15:35) Breck Carter

@Breck: that will teach you for editing an answer, where before it's all over, you have to provide a longer explanation for your edit than the entire question and answer put together! :-)

(03 Aug '11, 16:32) Chris Kleisath

I am regularly taught, perhaps someday I shall learn :)

(03 Aug '11, 16:53) Breck Carter

@Breck and the world: I live on the countryside, too, so I surely do respect cows. And I surely do respect (and like) a sense of humour - and that's what has happened here. Besides that, I was aware of that comic...

So in my humble opinion, there's no need at all

  1. for anyone to feel embarrassed or something like that and
  2. for this site to lose this kind of "taking-things-not-too-serious" approach.

Political correctness has its place, there's no doubt about that, but we should keep on with having some fun here...

(04 Aug '11, 15:28) Volker Barth

I have tried to leave this question alone because Glenn's answer is helpful and correct for the question as asked. However, when I first read the question subject, I presumed a different question because I didn't read the parentheses at first. The following is an answer to that presumed question.

The cheapest aggregate function to use (when possible) is COUNT(*). The COUNT(*) aggregate is cheap for several reasons:

  • The data type of the aggregate is simple (integer)
  • The evaluation code for COUNT is cheap (increment an integer)
  • There is no argument that needs to be evaluated by the aggregate (this can save intermediate materializations as well as computing expensive arguments)
  • The aggregate can be used in parallel plans (each branch counts a subset and SUM() finds the total)

If COUNT(*) is not suitable because you need to identify a property of the value, you can move to COUNT(argument). This variant doesn't have as many advantages because the argument needs to be evaluated. There are some interesting things one can do such as COUNT( if [condition] then 1 end if ).

In your examples, you need to actually extract a value from a group. This comes up in a number of contexts; another one I have seen occurs when you need to retrieve a column that is functionally determined by the GROUP BY but the server is not able to detect that. Consider:

select a,b, any_value(c), sum(d)
from T
group by a,b

Here, assume that (a,b) is a candidate key for T that the optimizer does not recognize. The server will give an error if you don't use an aggregate for c: you could put it in the GROUP BY, but you might prefer not to. Then, you could use an aggregate that selects any value of c. The MIN() and MAX() aggregates work for most data types, so they are good candidates. SUM() may be a little cheaper but is only appropriate for numeric types. The FIRST_VALUE() is acceptable with a window specification, but it will not be parallelized by current servers.

All these aggregates have the unfortunate property that they evaluate the argument of the aggregate for each row in the group (yes, even the FIRST_VALUE). This can be a problem if the argument is a UDF or subquery. It would be nice if for these purposes there were an "ANY_VALUE()" aggregate that only evaluated the argument one time (and Ani Nica has requested this a few times). Until it is available, the best recommendation I can make is to use any of the aggregates discussed; if you do have a UDF or subquery, then prefer to evaluate it with the aggregate as an argument instead of as an argument to the aggregate (so it will be evaluated only once per group instead of once per row).

However, from my formative years on a dairy farm, I'm pretty sure even a cow knows all of the above.

permanent link

answered 12 Aug '11, 21:50

Ivan%20T.%20Bowman's gravatar image

Ivan T. Bowman
2.8k22732
accept rate: 39%

edited 12 Aug '11, 22:00

1

Great enhancement to the topic, Ivan, as usual. I have to confess it took me a while to understand the phrase "evaluate it with the aggregate as an argument instead of as an argument to the aggregate"...

Possibly Breck and me might be able to withdraw our repeated calls for a "blog on your own" if you do give such detailed insight now and then:) (By the way, blogging rather infrequently doesn't seem to be a real obstacle for some SQL Anywhere bloggers....)

For the any_value(c) sample, that's a kind of shortcoming I do stumble over sometimes, too. Isn't that the topic of "Functional dependency" - something that is on the wish list for the query optimizer for newer versions (I guess something like that was part of the Nagano poll)?

(13 Aug '11, 17:06) Volker Barth
Replies hidden
1

Hi Volker, the any_value(c) aggregate is a way to deal with functional dependencies that the optimizer does not recognize. The query optimizer does recognize some functional dependencies now, and indeed there is a wish-list item to extend the class of recognized dependencies. Even with the fullest set of changes we can consider, though, the any_value(c) aggregate would still be useful for cases where the application author knows some special properties about the data that can not be determined from the expressed constraints.

I should have spelled out the "evaluate it with the aggregate as an argument instead of as an argument to the aggregate". I know you have it now Volker, but for others the point is that it is more efficient to write MyUDF( max( T.x ) ) than max( MyUDF( T.X ) ) if both of them are logically equivalent. The former evaluates the function only once per group, the latter is once per row in the group. In Volker's question, there is only one row in the group so there is little difference, it is more important when there may be several rows per group. Function caching will avoid re-evaluating the UDF in cases where the arguments are repeated, so it may as usual be best to write something that is clear and easy to understand rather than tuning for the most perfect performance.

(17 Aug '11, 09:35) Ivan T. Bowman
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×438
×21
×19
×19

question asked: 02 Aug '11, 15:12

question was seen: 7,755 times

last updated: 17 Aug '11, 09:35