Hong Zheng

Convert json to csv

Hong Zheng / 2017-12-06


It might be a minor issue, but has troubled me for a while.

Sometimes the data you gather is in json format. For example, I work with TCGA data from GDC data portal for a while. The metadata is only available in json format (example of two entries ). I have to convert it to csv, or tab-delimited format before I can process it. At first I thought it should be quite simple, but it still took me several hours.

Conclusions

So, what is the best way to convert json to cvs format?

Details

convertcsv.com

I like this site a lot. You just need to upload the json file, and it will output every field of the entries. It deals with nested record very neatly. For example, for the several sub-records (is it the right term?) in “downstream_analyses”, it will give easy-to-understand column names like:

downstream_analyses/0/workflow_type
downstream_analyses/0/output_files/0/file_name
downstream_analyses/1/workflow_type
downstream_analyses/1/output_files/0/file_name

Just select the columns of interest and done!

However, when the file size is too big, say, when I get the metadata record for over 9,000 samples, the file size goes as big as 90Mb. This webpage would crash if I try to upload it.

jsonv deals with simple entries quite well.

For example,

"md5sum": "68dc4de8dcc516b08aea9ce16017c505", 
"data_type": "Aligned Reads", 
"file_name": "859bbf6f-a492-4bf7-b09a-732514c56f5b_gdc_realn_rehead.bam", 
"analysis": {
  "analysis_id": "ddd46b10-b399-4f82-8475-bf2dc7953c2f", ......
  }

Command:

less example.json | jsonv md5sum,data_type,file_name,analysis.analysis_id | less

Output:

"68dc4de8dcc516b08aea9ce16017c505","Aligned Reads","859bbf6f-a492-4bf7-b09a-732514c56f5b_gdc_realn_rehead.bam","ddd46b10-b399-4f82-8475-bf2dc7953c2f"
"601c3840d65943ea46669c97fad0324e","Aligned Reads","608f820c-e8d5-473a-aaab-fdc50ff7da05_gdc_realn_rehead.bam","ea4d6a39-d0e1-47e0-a830-16ac87c40267"

However, it is unable to deal with complex nested records, such as

"index_files": [
    {
      "file_id": "f9247bc4-d0e1-4557-8a13-e407d700331c"
    }
  ], 

jq is able to correctly extract “file_id” from “index_files”.

Command:

less example.json | jq -r '.[] | [.md5sum,.data_type,.file_name,.analysis.analysis_id,.index_files[].file_id  | tostring ]|  @csv '

Output:

"68dc4de8dcc516b08aea9ce16017c505","Aligned Reads","859bbf6f-a492-4bf7-b09a-732514c56f5b_gdc_realn_rehead.bam","ddd46b10-b399-4f82-8475-bf2dc7953c2f","f9247bc4-d0e1-4557-8a13-e407d700331c"
"601c3840d65943ea46669c97fad0324e","Aligned Reads","608f820c-e8d5-473a-aaab-fdc50ff7da05_gdc_realn_rehead.bam","ea4d6a39-d0e1-47e0-a830-16ac87c40267","2b554cbd-eb98-4483-9a31-8858166339ce"

Notice the difference bewteen “.analysis.analysis_id” and “.index_files[].file_id”.

However, both jsonv and jq require users to specify the output fields. I wonder is there any tool that works with large-size data files and outputs all fields correctly without selection? Hope the tools used in convertcsv.com would be suitable for use in linux/mac one day.