Sometimes people, often including me, just do this fairly mindlessly, but sometimes it bites you. Consider this simple query:

```sql
SELECT a.id AS a_id, a.properties, sum(b.amount) AS expenses
FROM people a
JOIN expenses b ON a.id = b.person_id
GROUP BY a.id, a.properties
```

We don't really want the expenses grouped by the person's properties. We just put that in because the parser complains if we don't. And if people turns out to be a view which joins a couple of tables, we probably can't leave it out either. This can increase the amount of sorting that the GROUP BY requires, which can sometimes have dramatic effects on performance. But even worse, there are cases where this can actually make the query unrunnable. One such case is when properties is a JSON column.
That might surprise you. It has certainly surprised a couple of people I know. The reason is that there is no equality operator for the JSON type, and without one the database has no way to decide which rows belong in the same group.
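As a concrete illustration (assuming PostgreSQL, where the plain json type, unlike jsonb, has no equality operator; the table here is a hypothetical cut-down version of people), grouping by such a column fails outright:

```sql
-- Hypothetical table with a json (not jsonb) column.
CREATE TABLE people (
    id         integer PRIMARY KEY,
    properties json
);

-- PostgreSQL rejects this with:
--   ERROR:  could not identify an equality operator for type json
SELECT properties, count(*)
FROM people
GROUP BY properties;
```

Declaring the column as jsonb instead would sidestep this particular failure, since jsonb does define equality, but the performance argument for a smaller GROUP BY still applies.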
So, how can we write this so we only GROUP BY what we really need to? One way is to pick up the extra column later in the query, after we have done the grouping, like this:

```sql
WITH exp AS (
    SELECT a.id AS a_id, sum(b.amount) AS expenses
    FROM people a
    JOIN expenses b ON a.id = b.person_id
    GROUP BY a.id
)
SELECT exp.*, p.properties
FROM exp
JOIN people p ON p.id = exp.a_id
```

This might look a bit silly: we're adding an extra join to people that we shouldn't need. But in my experience this often works pretty well, and what you pay by way of the extra join is often repaid by the simpler GROUP BY, which now processes smaller rows, uncluttered by the extra columns you want carried through. And, in the case of a JSON column, it has the virtue that it will work.
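One side note, assuming PostgreSQL 9.1 or later: when a.id is the primary key of people, you may not need the rewrite at all, because the planner recognizes that every other column of people is functionally dependent on the key and lets you group by the key alone:

```sql
-- Allowed in PostgreSQL 9.1+ when people.id is the primary key:
-- a.properties is functionally dependent on a.id, so it may appear
-- in the SELECT list without appearing in the GROUP BY.
SELECT a.id AS a_id, a.properties, sum(b.amount) AS expenses
FROM people a
JOIN expenses b ON a.id = b.person_id
GROUP BY a.id;
```

Since the grouping itself only compares id values, this also avoids the JSON equality problem. It stops working, though, as soon as people is a view rather than a keyed table, which is exactly the situation described above, so the CTE rewrite remains the more general tool.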
I often get called in to look at queries that run slowly and have huge GROUP BY clauses (I have seen them with 50 or so columns). I almost always start by reducing the GROUP BY to the smallest set possible, and this almost always results in a performance gain.