Friday, May 10, 2013

Only GROUP BY what you really need to

The old rule was that if you had a query containing aggregated columns, you had to GROUP BY every other column selected. These days you are allowed to omit columns that are provably functionally dependent on one or more of the grouped-by columns. In practice, that means you can omit any column that is not in the table's primary key, as long as all of the primary key columns are grouped by.
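
As a minimal sketch of that newer rule (the customers and orders tables here are purely hypothetical), PostgreSQL 9.1 and later will accept this even though c.name is neither aggregated nor listed in the GROUP BY, because it is functionally dependent on the grouped-by primary key:
-- hypothetical tables, just to illustrate the functional-dependency rule
CREATE TABLE customers (
   id   int PRIMARY KEY,
   name text
);
CREATE TABLE orders (
   customer_id int REFERENCES customers(id),
   amount      numeric
);
-- c.name is functionally dependent on customers' primary key (c.id),
-- so it does not have to appear in the GROUP BY
SELECT c.id, c.name, sum(o.amount) AS total
FROM customers c
   JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;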

People, myself often included, do this fairly mindlessly, but sometimes it bites you. Consider this simple query:
SELECT a.id as a_id, a.properties, sum(b.amount) as expenses
FROM people a
   JOIN expenses b on a.id = b.person_id
GROUP BY a.id, a.properties
We don't really want the expenses grouped by the person's properties. We just put that in because the parser complains if we don't. And if people turns out to be a view which joins a couple of tables, we probably can't leave it out either. This can increase the amount of sorting the GROUP BY requires, which can sometimes have dramatic effects on performance. But even worse, there are cases where it can actually make the query unrunnable. One such case is when properties is a JSON column.

That might surprise you. It has certainly surprised a couple of people I know. The reason is that there is no equality operator for JSON.
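
You can see this for yourself with a throwaway comparison; something like the following fails on versions that have the json type (the error text here is from memory and may vary slightly by version):
-- the json type has no equality operator, so this fails:
SELECT '{"a": 1}'::json = '{"a": 1}'::json;
-- ERROR:  operator does not exist: json = json
-- and grouping by a json column runs into the same wall:
--   could not identify an equality operator for type json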

So, how can we write this so we only GROUP BY what we really need to? One way is to pick up the extra column later in the query, after we have done the grouping, like this:
WITH exp as 
(
  SELECT a.id as a_id, sum(b.amount) as expenses
  FROM people a
     JOIN expenses b on a.id = b.person_id
  GROUP BY a.id
)
SELECT exp.*, p.properties
FROM exp 
   JOIN people p ON p.id = exp.a_id
This might look a bit silly. We're adding an extra join to people that we shouldn't need. But in my experience this often works quite well: the cost of the extra join is frequently repaid by the simpler GROUP BY, which has less sorting to do and processes smaller rows, uncluttered by the extra columns you want carried through. And, in the case of a JSON column, it has the virtue that it will actually work.

I often get called in to look at queries that run slowly and have huge GROUP BY clauses (I have seen them with 50 or so columns). I almost always start by reducing the GROUP BY to the smallest set possible, and this almost always results in a performance gain.

10 comments:

  1. Is there a reason why you don't just use a subquery in this case? Like this:
    SELECT a.id, a.properties, (SELECT sum(b.amount) FROM expenses b WHERE b.person_id = a.id) AS expenses FROM people a
    Or were you just looking for an example for GROUP BY?

    Replies
    1. I avoid subselects in the SELECT clause like the plague. I have found them to be major performance killers in far too many cases, and very rarely to perform better than alternatives. I regard them as almost a definable anti-pattern. In any case, what if this query had forty aggregates instead of one? Would you put in forty subselects? That would pretty much guarantee performance would go through the floor.

      The post was prompted by someone who ran into the JSON problem I mentioned. That's what got me thinking about why they were grouping by the JSON column in the first place. And, yes, BTW, the original query from which this was abstracted does have a huge number of aggregates.
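
      As a rough sketch of the difference being described here (the second aggregate, expense_count, is hypothetical, and people with no expenses are ignored), compare one correlated subselect per aggregate with a single grouped pass:
      -- one subselect per aggregate: each is, in effect, another probe of expenses per person
      SELECT a.id,
             (SELECT sum(b.amount) FROM expenses b WHERE b.person_id = a.id) AS expenses,
             (SELECT count(*)      FROM expenses b WHERE b.person_id = a.id) AS expense_count
      FROM people a;
      -- versus one grouped pass that computes every aggregate at once
      SELECT a.id, sum(b.amount) AS expenses, count(*) AS expense_count
      FROM people a
         JOIN expenses b ON a.id = b.person_id
      GROUP BY a.id;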

    2. When there are too many rows to be calculated in the WITH clause, does it affect performance? Each session running such queries will require too much RAM.

    3. Why should it? The CTE will surely be calculated faster and with less space than the original query, because it has less sorting work to do and because the rows it's sorting are smaller. Picking up the extra value later will be relatively cheap, because the table is probably already in cache, and it will be done via an index scan. As for RAM - CTEs spill to disk like other intermediate work products of queries. You can control that via the work_mem setting. If you are running into memory issues from running a query like this then you have much bigger problems.

      This sort of technique doesn't always work. But it very often does. Assuming it's going to perform worse is a big mistake.
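
      For reference, work_mem can be raised just for the session running such a query, along these lines (the value here is only an example):
      -- per-session bump of the sort/hash memory budget
      SET work_mem = '256MB';
      -- and back to the configured default afterwards
      RESET work_mem;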

  2. Andrew,

    Why can't you just do this:

    SELECT a.id as a_id, a.properties, sum(b.amount) as expenses
    FROM people a
    JOIN expenses b on a.id = b.person_id
    GROUP BY a.id


    I thought the rule for PostgreSQL 9.1+ was that you don't need to group by any additional columns
    in A if you are grouping by the primary key, even if those columns are selected? Or are you working with 9.0 or lower?

    Note verbatim from 9.1 release notes:
    "Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)

    The SQL standard allows this behavior, and because of the primary key, the result is unambiguous"

    Replies
    1. The use case that prompted this was where the primary table was actually a view:

      "And if people turns out to be a view which joins a couple of tables, we probably can't leave it out either. "

    2. Exactly. If you can use Regina's query then do it. But then you're already doing what I suggested, i.e. only grouping by what you need to. The more complex query is for when you can't use that simpler query. Sorry if that wasn't clear.

    3. The relational model prescribes primary keys on views, since views are derived relations. It's a pity that SQL doesn't allow us to define a primary key on a view.
      It would not be an easy constraint to enforce, but it could be good documentation and an optimizer aid.

    4. lveronese, I think you mean "proscribes", which is kind of the opposite of "prescribes". :-) Personally I'm very suspicious of unenforced constraints. If you can't rely on them, I think they are arguably worse than useless.

  3. Ah sorry, I guess I read through it too quickly and didn't see the "but if it is a view" part. :)
