Andrew Dunstan's PostgreSQL and Technical blog: Importing JSON data

Friday, September 26, 2014

Importing JSON data

Say you have a file that consists of one JSON document per line. How can you import it into a table easily? This is a problem I was called on to help a colleague with yesterday. Using COPY is the obvious answer, but this turns out not to be quite so simple to do.

In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.

CSV mode is more flexible, but poses different problems. Here, instead of backslash causing a problem, QUOTE characters can cause a problem. First, JSON itself uses the default QUOTE character (double quote) to quote all string values. But if we change use an alternative like single quote, then the presence of any single quote in the JSON leads us into difficulties. Second, JSON also uses the default DELIMITER (comma) extensively. So, clearly we need to use something else for the QUOTE and DELIMITER options. (By default, in CSV mode, the ESCAPE character is the same as the QUOTE character, so we don't need to worry about it separately.)

What we in fact want is to specify QUOTE and DELIMITER characters that can't appear at all in the JSON. Then the whole line will be seen as a single unquoted datum, which is exactly what we want. There is a small set of single-byte characters that happen to be illegal in JSON, so we can be sure that choosing them for these options should do the right thing with any legal JSON. These are the control characters. So the solution we came up with looks like this:

copy the_table(jsonfield) 
from '/path/to/jsondata' 
csv quote e'\x01' delimiter e'\x02';

Of course, if the JSON has embedded newlines as punctuation, this wont work. So it's important that you configure whatever is producing the JSON not to insert newlines anywhere but at the end of each JSON document.

Now this solution is a bit of a hack. I wonder if there's a case for a COPY mode that simply treats each line as a single datum. I also wonder if we need some more specialized tools for importing JSON, possibly one or more Foreign Data Wrappers. Such things could handle, say, embedded newline punctuation.

Note too that files produced by PostgreSQL's COPY ... TO command will be properly quoted and escaped and won't need to be handled like this to read them back. Of course, if you want them to be readable by other non-CSV processors, then you might need to use similar options to those above to avoid unwanted quoting and escaping.

9 comments:

CPU BarbequeJanuary 24, 2015 at 5:21 PM
Wow. Very helpful Andrew. Thanks.
I had heard that Postgres supported JSON directly.. apparently not. drat ;).
I have a bunch of large Mongodb databases Im bringing across...
..
Can you comment on indexing JSON records/tables/databases within Postgres?
I assume it is not aware of the various field names within each JSON record, and hence would not be capable of indexing them.
..
Thanks Muchly
ReplyDelete
Replies
CPU BarbequeJanuary 25, 2015 at 1:29 PM
My apologies, I did not see the "json" type defined, Im simultaneously doing this and trying to convert the Ajax calls from Node to PHP...
In fact I was going to request a retraction of my previous comment.. It is entirely an artifact of my rampaging ignorance.
My tone in that part detracts from my appreciation of the article :)

I am working on a project, and we have just switched to PostgreSQL from Mongodb, and the extra sophistication and diligence with PostgreSQL is palpable, and highly appreciated.
I have given myself the (Long) weekend to get the import scripts written for our (LARGE) JSON files.. there are options in Mongo for preparing the indexes on import as well as sorting to optimise for an index, I will be reading the docs, but was wondering if there are any relevant features/gotchas in the PostgreSQL imports (specifically here COPY).
.
Once again
Thanks again for a great product.
And infinite thanks for saving me from a javascript command line for db admin :)
ReplyDelete
Replies
Andrew DunstanJanuary 25, 2015 at 6:01 PM
read http://www.postgresql.org/docs/current/static/datatype-json.html very carefully. It constains lots of information about indexing. Mostly you really want the jsonb type for that, which means you want to be using PostgreSQL 9.4.
ReplyDelete
Replies
azenhiApril 2, 2015 at 2:39 PM
Thank you for posting. I am like CPU Barbeque just getting into PostgreSQL. I am thankful for the people who have used it for years and share their knowledge.
ReplyDelete
Replies
UnknownMay 22, 2015 at 6:34 AM
Thanks Andrew, I have now improved pgloader so that it's possible to implement your idea easily, as in the following test-case:

https://github.com/dimitri/pgloader/blob/master/test/csv-json.load

Before that the escape character was quite limited (double-quote or backslash-quote).
ReplyDelete
Replies
UnknownSeptember 7, 2015 at 6:02 PM
Thanks Andrew. I see others are running into the same issue. I wonder if you know how to setup copy in a way that the input file not only contains a single column with JSON content, but with a file that looks more like this:

"product_id","product_name","info"
"1234","Red Shoes","{"sizes" : [8, 9, 10, 11]}"

Thanks in advance!
ReplyDelete
Replies
asdklhfdsJuly 22, 2021 at 10:09 AM
many thanks
ReplyDelete
Replies

Add comment