Friday, September 26, 2014

Importing JSON data

Say you have a file that consists of one JSON document per line. How can you import it into a table easily? This is a problem I was called on to help a colleague with yesterday. Using COPY is the obvious answer, but this turns out not to be quite so simple to do.

In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.

CSV mode is more flexible, but poses different problems. Here, instead of backslash causing a problem, QUOTE characters can cause a problem. First, JSON itself uses the default QUOTE character (double quote) to quote all string values. But if we change use an alternative like single quote, then the presence of any single quote in the JSON leads us into difficulties. Second, JSON also uses the default DELIMITER (comma) extensively. So, clearly we need to use something else for the QUOTE and DELIMITER options. (By default, in CSV mode, the ESCAPE character is the same as the QUOTE character, so we don't need to worry about it separately.)

What we in fact want is  to specify QUOTE and DELIMITER characters that can't appear at all in the JSON. Then the whole line will be seen as a single unquoted datum, which is exactly what we want. There is a small set of single-byte characters that happen to be illegal in JSON, so we can be sure that choosing them for these options should do the right thing with any legal JSON. These are the control characters. So the solution we came up with looks like this:
copy the_table(jsonfield) 
from '/path/to/jsondata' 
csv quote e'\x01' delimiter e'\x02';
Of course, if the JSON has embedded newlines as punctuation, this wont work. So it's important that you configure whatever is producing the JSON not to insert newlines anywhere but at the end of each JSON document.

Now this solution is a bit of a hack. I wonder if there's a case for a COPY mode that simply treats each line as a single datum. I also wonder if we need some more specialized tools for importing JSON, possibly one or more Foreign Data Wrappers. Such things could handle, say, embedded newline punctuation.

Note too that files produced by PostgreSQL's COPY ... TO command will be properly quoted and escaped and won't need to be handled like this to read them back. Of course, if you want them to be readable by other non-CSV processors, then you might need to use similar options to those above to avoid unwanted quoting and escaping.

8 comments:

  1. Wow. Very helpful Andrew. Thanks.
    I had heard that Postgres supported JSON directly.. apparently not. drat ;).
    I have a bunch of large Mongodb databases Im bringing across...
    ..
    Can you comment on indexing JSON records/tables/databases within Postgres?
    I assume it is not aware of the various field names within each JSON record, and hence would not be capable of indexing them.
    ..
    Thanks Muchly

    ReplyDelete
    Replies
    1. I have no idea why you say "apparently not". Postgres supports json fields. There is no such thing as a json table or database. We do have facilities for turning a record into json. If you want to retrieve a table as json, it is as simple as:

      select json_agg(r) from tablename r;

      We also (with 9.4 and jsonb) have support for indexing json data, so your assumption on that is just wrong. Read the docs. Of course, none of this is actually relevant to the blog post.

      Delete
  2. My apologies, I did not see the "json" type defined, Im simultaneously doing this and trying to convert the Ajax calls from Node to PHP...
    In fact I was going to request a retraction of my previous comment.. It is entirely an artifact of my rampaging ignorance.
    My tone in that part detracts from my appreciation of the article :)

    I am working on a project, and we have just switched to PostgreSQL from Mongodb, and the extra sophistication and diligence with PostgreSQL is palpable, and highly appreciated.
    I have given myself the (Long) weekend to get the import scripts written for our (LARGE) JSON files.. there are options in Mongo for preparing the indexes on import as well as sorting to optimise for an index, I will be reading the docs, but was wondering if there are any relevant features/gotchas in the PostgreSQL imports (specifically here COPY).
    .
    Once again
    Thanks again for a great product.
    And infinite thanks for saving me from a javascript command line for db admin :)

    ReplyDelete
  3. read http://www.postgresql.org/docs/current/static/datatype-json.html very carefully. It constains lots of information about indexing. Mostly you really want the jsonb type for that, which means you want to be using PostgreSQL 9.4.

    ReplyDelete
  4. Thank you for posting. I am like CPU Barbeque just getting into PostgreSQL. I am thankful for the people who have used it for years and share their knowledge.

    ReplyDelete
  5. Thanks Andrew, I have now improved pgloader so that it's possible to implement your idea easily, as in the following test-case:

    https://github.com/dimitri/pgloader/blob/master/test/csv-json.load

    Before that the escape character was quite limited (double-quote or backslash-quote).

    ReplyDelete
  6. Thanks Andrew. I see others are running into the same issue. I wonder if you know how to setup copy in a way that the input file not only contains a single column with JSON content, but with a file that looks more like this:

    "product_id","product_name","info"
    "1234","Red Shoes","{"sizes" : [8, 9, 10, 11]}"

    Thanks in advance!

    ReplyDelete
    Replies
    1. Your example won't work. You would either need to use a different quoting character for the CSV (single quote would be the logical choice) or escape the quotes inside the JSON.

      Delete