Andrew Dunstan's PostgreSQL and Technical blog: Set types

Friday, May 15, 2015

Set types

Every so often I've wanted a SET metatype in Postgres.

We do have bit fields, which can be used as sets with a little work, and this is how MySQL does it's SET types as I understand it. But what if you want a set of enums (our enums can have very large lists of values), or a set of text values, or of integers? Bit fields won't work for these - we would need some sort of sparse representation.

One obvious way would be to store the set as a one-dimensional array, kept in sorted order for efficient searching. For text values, we could probably use some smart form of sorting such as is used by jsonb and hstore to store object keys, to reduce the need for string comparisons.

Obviously, I have done no work on this, but it seems like it would be a worthwhile project.

In the meantime, you can use hstore more or less in this way. Just treat it as a mapping from whatever values you have to 1. It's a bit of a hack, but it should work. You'll just need to cast your values to text.

4 comments:

HansMay 15, 2015 at 12:14 PM
I never understood the need for a SET data type. A table is a set - so why not just use a simple 1:N relationship?
ReplyDelete
Replies
Andrew DunstanMay 15, 2015 at 1:51 PM
The same sort of argument applies to pretty much every non-atomic type, such as arrays and records. My argument is that it's simply a part of a complete type system. If you can say
update mytable set mylist = mylist || 'foo';
why should you not be able to say
update mytable set myset = myset || 'foo';
? The equivalent with a separate table to store the set values is a heck of a lot more complex, although it will become easier in 9.5 with the addition of INSERT... ON CONFLICT IGNORE;
ReplyDelete
Replies
Josh BerkusMay 15, 2015 at 9:53 PM
I could really use this for a number of analytics projects. Currently I'm using ARRAYs as a de-facto SET type.

Where I need this even more is time-series data. This would be a SET of numeric, where the external keys would be timestamps, and permit a compact representation of "data points for hour X" in large time-series databases. As such values would compress very well, it would allow us to store billions of time-oriented data points much cheaper.
ReplyDelete
Replies
Andrew DunstanMay 17, 2015 at 1:08 PM
Josh, what do your time series data look like? I'm having a little trouble visualizing them using a set type.
ReplyDelete
Replies

Add comment