Fun with Aggregates

Posted by Paul White on SQL Blog See other posts from SQL Blog or by Paul White
Published on Sun, 11 Mar 2012 15:49:38 GMT Indexed on 2012/03/20 5:36 UTC
Read the original article Hit count: 314

Filed under:

There are interesting things to be learned from even the simplest queries.  For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten.

One possible query to meet that specification is:

SELECT
    p.Name
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    p.ProductID = th.ProductID
GROUP BY
    p.ProductID,
    p.Name
HAVING
    COUNT_BIG(*) < 10;

That query correctly returns 23 rows (execution plan and data sample shown below):

image

The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join.  The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten.

This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.  The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too?  To answer that, let’s look at some other ways to formulate this query.  This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.  The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above.  It joins the result of pre-aggregating the history table to the Product table before filtering:

SELECT p.Name
FROM
(
    SELECT 
        th.ProductID, 
        cnt = COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th 
    GROUP BY
        th.ProductID
) AS q1
JOIN Production.Product AS p 
    ON p.ProductID = q1.ProductID
WHERE
    q1.cnt < 10;

Perhaps a little surprisingly, we get a slightly different execution plan:

image

The results are the same (23 rows) but this time the Filter is pushed below the join!  The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join.  In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above.

The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive.

Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID 
    HAVING COUNT_BIG(*) < 10
);

Unfortunately, this query produces an incorrect result (86 rows):

image

The problem is that it lists products with no history rows, though the reasons are interesting.  The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set.  In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL).  To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:

-- Scalar aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th 
WHERE th.ProductID = 709;
 
-- Vector aggregate
SELECT COUNT_BIG(*)
FROM Production.TransactionHistory AS th 
WHERE th.ProductID = 709
GROUP BY th.ProductID;

The estimated execution plans for these two statements are almost identical:

image

You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case.  The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away.

In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate.  This is something I would like to see exposed in show plan so I suggested it on Connect.  Anyway, the results of running the two queries show the difference at runtime:

image

The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all.  Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID 
    HAVING COUNT_BIG(*) BETWEEN 1 AND 9
);

The query now returns the correct 23 rows:

image

Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans.  Let’s try adding a redundant GROUP BY instead of changing the HAVING clause:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID
    GROUP BY th.ProductID
    HAVING COUNT_BIG(*) < 10
);

Not only do we now get correct results (23 rows), this is the execution plan:

image

I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)  The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.  What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set:

SELECT p.Name
FROM Production.Product AS p
WHERE EXISTS 
(
    SELECT *
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
    HAVING COUNT_BIG(*) < 10
);

I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too).  Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated).

The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms):

WHERE Clause

SELECT
    p.Name
FROM Production.Product AS p
WHERE 
(
    SELECT COUNT_BIG(*) 
    FROM Production.TransactionHistory AS th
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
) < 10;

APPLY

SELECT p.Name
FROM Production.Product AS p
CROSS APPLY
(
    SELECT NULL
    FROM Production.TransactionHistory AS th 
    WHERE th.ProductID = p.ProductID
    GROUP BY ()
    HAVING COUNT_BIG(*) < 10
) AS ca (dummy);

FROM Clause

SELECT q1.Name 
FROM
(
    SELECT
        p.Name, 
        cnt = 
        (
            SELECT COUNT_BIG(*) 
            FROM Production.TransactionHistory AS th 
            WHERE th.ProductID = p.ProductID 
            GROUP BY ()
        )
    FROM Production.Product AS p
) AS q1 
WHERE
    q1.cnt < 10;

This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :)

SELECT q.Name 
FROM
(
    SELECT
        p.Name, 
        cnt = 
        (
            SELECT SUM(1)
            FROM Production.TransactionHistory AS th 
            WHERE th.ProductID = p.ProductID
        )
    FROM Production.Product AS p
) AS q
WHERE q.cnt < 10;

image

The semantics of SQL aggregates are rather odd in places.  It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.  As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both.

© 2012 Paul White

Twitter: @SQL_Kiwi
email: [email protected]

© SQL Blog or respective owner