Convert json to csv
Hong Zheng / 2017-12-06
It might be a minor issue, but has troubled me for a while.
Sometimes the data you gather is in json format. For example, I work with TCGA data from GDC data portal for a while. The metadata is only available in json format (example of two entries ). I have to convert it to csv, or tab-delimited format before I can process it. At first I thought it should be quite simple, but it still took me several hours.
Conclusions
So, what is the best way to convert json to cvs format?
If the json file is quite small in size, use the online tool provided in convertcsv.com
If the file size is too big the online tool to handle, use either
Details
I like this site a lot. You just need to upload the json file, and it will output every field of the entries. It deals with nested record very neatly. For example, for the several sub-records (is it the right term?) in “downstream_analyses”, it will give easy-to-understand column names like:
downstream_analyses/0/workflow_type
downstream_analyses/0/output_files/0/file_name
downstream_analyses/1/workflow_type
downstream_analyses/1/output_files/0/file_name
Just select the columns of interest and done!
However, when the file size is too big, say, when I get the metadata record for over 9,000 samples, the file size goes as big as 90Mb. This webpage would crash if I try to upload it.
jsonv deals with simple entries quite well.
For example,
"md5sum": "68dc4de8dcc516b08aea9ce16017c505",
"data_type": "Aligned Reads",
"file_name": "859bbf6f-a492-4bf7-b09a-732514c56f5b_gdc_realn_rehead.bam",
"analysis": {
"analysis_id": "ddd46b10-b399-4f82-8475-bf2dc7953c2f", ......
}
Command:
less example.json | jsonv md5sum,data_type,file_name,analysis.analysis_id | less
Output:
"68dc4de8dcc516b08aea9ce16017c505","Aligned Reads","859bbf6f-a492-4bf7-b09a-732514c56f5b_gdc_realn_rehead.bam","ddd46b10-b399-4f82-8475-bf2dc7953c2f"
"601c3840d65943ea46669c97fad0324e","Aligned Reads","608f820c-e8d5-473a-aaab-fdc50ff7da05_gdc_realn_rehead.bam","ea4d6a39-d0e1-47e0-a830-16ac87c40267"
However, it is unable to deal with complex nested records, such as
"index_files": [
{
"file_id": "f9247bc4-d0e1-4557-8a13-e407d700331c"
}
],
jq is able to correctly extract “file_id” from “index_files”.
Command:
less example.json | jq -r '.[] | [.md5sum,.data_type,.file_name,.analysis.analysis_id,.index_files[].file_id | tostring ]| @csv '
Output:
"68dc4de8dcc516b08aea9ce16017c505","Aligned Reads","859bbf6f-a492-4bf7-b09a-732514c56f5b_gdc_realn_rehead.bam","ddd46b10-b399-4f82-8475-bf2dc7953c2f","f9247bc4-d0e1-4557-8a13-e407d700331c"
"601c3840d65943ea46669c97fad0324e","Aligned Reads","608f820c-e8d5-473a-aaab-fdc50ff7da05_gdc_realn_rehead.bam","ea4d6a39-d0e1-47e0-a830-16ac87c40267","2b554cbd-eb98-4483-9a31-8858166339ce"
Notice the difference bewteen “.analysis.analysis_id” and “.index_files[].file_id”.
However, both jsonv and jq require users to specify the output fields. I wonder is there any tool that works with large-size data files and outputs all fields correctly without selection? Hope the tools used in convertcsv.com would be suitable for use in linux/mac one day.