Friday, May 10, 2013

Only GROUP BY what you really need to

The old rule was that if you had a query containing aggregated columns, you had to GROUP BY every other column selected. These days you are allowed to omit columns that are provably functionally dependent on one or more of the grouped-by columns. In practice, that means you can omit any column that is not in the table's primary key, as long as all of the primary key columns are grouped by.
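
As a minimal sketch of that newer rule (the customers and orders tables here are purely hypothetical), PostgreSQL 9.1 and later will accept this even though c.name is neither aggregated nor listed in the GROUP BY, because it is functionally dependent on the grouped-by primary key:
-- hypothetical tables, just to illustrate the functional-dependency rule
CREATE TABLE customers (
   id   int PRIMARY KEY,
   name text
);
CREATE TABLE orders (
   customer_id int REFERENCES customers(id),
   amount      numeric
);
-- c.name is functionally dependent on customers' primary key (c.id),
-- so it does not have to appear in the GROUP BY
SELECT c.id, c.name, sum(o.amount) AS total
FROM customers c
   JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;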

People, myself often included, do this fairly mindlessly, but sometimes it bites you. Consider this simple query:
SELECT a.id as a_id, a.properties, sum(b.amount) as expenses
FROM people a
   JOIN expenses b on a.id = b.person_id
GROUP BY a.id, a.properties
We don't really want the expenses grouped by the person's properties. We just put that in because the parser complains if we don't. And if people turns out to be a view which joins a couple of tables, we probably can't leave it out either. This can increase the amount of sorting the GROUP BY requires, which can sometimes have dramatic effects on performance. But even worse, there are cases where it can actually make the query unrunnable. One such case is when properties is a JSON column.

That might surprise you. It has certainly surprised a couple of people I know. The reason is that there is no equality operator for JSON.
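
You can see this for yourself with a throwaway comparison; something like the following fails on versions that have the json type (the error text here is from memory and may vary slightly by version):
-- the json type has no equality operator, so this fails:
SELECT '{"a": 1}'::json = '{"a": 1}'::json;
-- ERROR:  operator does not exist: json = json
-- and grouping by a json column runs into the same wall:
--   could not identify an equality operator for type json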

So, how can we write this so we only GROUP BY what we really need to? One way is to pick up the extra column later in the query, after we have done the grouping, like this:
WITH exp as 
(
  SELECT a.id as a_id, sum(b.amount) as expenses
  FROM people a
     JOIN expenses b on a.id = b.person_id
  GROUP BY a.id
)
SELECT exp.*, p.properties
FROM exp 
   JOIN people p ON p.id = exp.a_id
This might look a bit silly. We're adding an extra join to people that we shouldn't need. But in my experience this often works quite well: the cost of the extra join is frequently repaid by the simpler GROUP BY, which has less sorting to do and processes smaller rows, uncluttered by the extra columns you want carried through. And, in the case of a JSON column, it has the virtue that it will actually work.

I often get called in to look at queries that run slowly and have huge GROUP BY clauses (I have seen them with 50 or so columns). I almost always start by reducing the GROUP BY to the smallest set possible, and this almost always results in a performance gain.

10 comments:

  1. Is there a reason why you don't just use a subquery in this case? Like this:
    SELECT a.id, a.properties, (SELECT sum(b.amount) FROM expenses b WHERE b.person_id = a.id) AS expenses FROM people a
    Or were you just looking for an example for GROUP BY?

    Replies
    1. I avoid subselects in the SELECT clause like the plague. I have found them to be major performance killers in far too many cases, and very rarely to perform better than alternatives. I regard them as almost a definable anti-pattern. In any case, what if this query had forty aggregates instead of one? Would you put in forty subselects? That would pretty much guarantee performance would go through the floor.

      The post was prompted by someone who ran into the JSON problem I mentioned. That's what got me thinking about why they were grouping by the JSON column in the first place. And, yes, BTW, the original query from which this was abstracted does have a huge number of aggregates.
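
      As a rough sketch of the difference being described here (the second aggregate, expense_count, is hypothetical, and people with no expenses are ignored), compare one correlated subselect per aggregate with a single grouped pass:
      -- one subselect per aggregate: each is, in effect, another probe of expenses per person
      SELECT a.id,
             (SELECT sum(b.amount) FROM expenses b WHERE b.person_id = a.id) AS expenses,
             (SELECT count(*)      FROM expenses b WHERE b.person_id = a.id) AS expense_count
      FROM people a;
      -- versus one grouped pass that computes every aggregate at once
      SELECT a.id, sum(b.amount) AS expenses, count(*) AS expense_count
      FROM people a
         JOIN expenses b ON a.id = b.person_id
      GROUP BY a.id;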

    2. When there are too many rows to be calculated in the WITH clause, does it affect performance? Each session running such queries will require too much RAM.

    3. Why should it? The CTE will surely be calculated faster and with less space than the original query, because it has less sorting work to do and because the rows it's sorting are smaller. Picking up the extra value later will be relatively cheap, because the table is probably already in cache, and it will be done via an index scan. As for RAM - CTEs spill to disk like other intermediate work products of queries. You can control that via the work_mem setting. If you are running into memory issues from running a query like this then you have much bigger problems.

      This sort of technique doesn't always work. But it very often does. Assuming it's going to perform worse is a big mistake.
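
      For reference, work_mem can be raised just for the session running such a query, along these lines (the value here is only an example):
      -- per-session bump of the sort/hash memory budget
      SET work_mem = '256MB';
      -- and back to the configured default afterwards
      RESET work_mem;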

  2. Andrew,

    Why can't you just do this:

    SELECT a.id as a_id, a.properties, sum(b.amount) as expenses
    FROM people a
    JOIN expenses b on a.id = b.person_id
    GROUP BY a.id


    I thought the rule for PostgreSQL 9.1+ was that you don't need to group by any additional columns
    in A if you are grouping by the primary key, even if those columns are selected? Or are you working with 9.0 or lower?

    Note verbatim from 9.1 release notes:
    "Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)

    The SQL standard allows this behavior, and because of the primary key, the result is unambiguous"

    Replies
    1. The use case that prompted this was where the primary table was actually a view:

      "And if people turns out to be a view which joins a couple of tables, we probably can't leave it out either. "

    2. Exactly. If you can use Regina's query then do it. But then you're already doing what I suggested, i.e. only grouping by what you need to. The more complex query is for when you can't use that simpler query. Sorry if that wasn't clear.

    3. The relational model prescribes primary keys on views, since views are derived relations. It's a pity that SQL doesn't allow us to define a primary key on a view.
      It would not be an easy constraint to enforce, but it could be good documentation and an optimizer aid.

    4. lveronese, I think you mean "proscribes", which is kind of the opposite of "prescribes". :-) Personally I'm very suspicious of unenforced constraints. If you can't rely on them, I think they are arguably worse than useless.

  3. Ah sorry, I guess I read through it too quickly and didn't see the "but if it is a view" part. :)
