
How to convert 80 GB XML files to CSV



Posted

I tried Excel, but it is not working.

I tried SQL too.

I uploaded them to Azure Blob Storage, but was unsuccessful when using Hive.

Any other suggestions?

Posted
4 hours ago, Sarvapindi said:

Databricks, Spark, Python

Will it work?

Posted

Option 1: Use a SAX parser and stream the file in Java.

Option 2: Apache Tika.

Posted
7 hours ago, kevinUsa said:

??

There is EMR + Spark + Python, isn't there, brother?

Posted
3 hours ago, phoenix_nebula said:

Option 1: Use a SAX parser and stream the file in Java.

Option 2: Apache Tika.

This works. It is simple: write two lines of SAX parser code and generate the CSV.
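A rough sketch of the SAX idea, written in Python rather than the Java suggested above, and using only the standard library. The element names (`record`, `id`, `name`) and the helper `xml_to_csv` are made up for illustration; the real file's tags would have to be substituted. In practice it takes more than two lines, but the parser only ever holds one record in memory, which is what matters for an 80 GB file.

```python
import csv
import io
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Collects chosen child elements of each row element and writes a CSV row."""

    def __init__(self, writer, row_tag, fields):
        super().__init__()
        self.writer = writer
        self.row_tag = row_tag   # hypothetical name of the repeating element
        self.fields = fields     # hypothetical child elements to export
        self.row = None          # dict for the record being read, or None
        self.current = None      # field currently being read, or None

    def startElement(self, name, attrs):
        if name == self.row_tag:
            self.row = {}
        elif self.row is not None and name in self.fields:
            self.current = name
            self.row[name] = ""

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so append.
        if self.current is not None:
            self.row[self.current] += content

    def endElement(self, name):
        if name == self.current:
            self.current = None
        elif name == self.row_tag:
            self.writer.writerow([self.row.get(f, "") for f in self.fields])
            self.row = None


def xml_to_csv(xml_source, csv_file, row_tag, fields):
    """Stream xml_source (a path or binary file object) into csv_file."""
    writer = csv.writer(csv_file)
    writer.writerow(fields)
    xml.sax.parse(xml_source, RowHandler(writer, row_tag, fields))


# Tiny in-memory sample standing in for the real file.
sample = (b"<records><record><id>1</id><name>Ann</name></record>"
          b"<record><id>2</id><name>Bob, Jr.</name></record></records>")
out = io.StringIO()
xml_to_csv(io.BytesIO(sample), out, "record", ["id", "name"])
print(out.getvalue())
```

For a real run, pass the XML path as `xml_source` and open the output with `open("out.csv", "w", newline="")` so the `csv` module controls the line endings.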

Posted
8 hours ago, kevinUsa said:

??

Just change the file extension from .xml to .csv.

Posted

Jokes apart, use Python to convert the XML to CSV with pandas.

You may need to spin up a large virtual machine or a Databricks cluster to make it quick.
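As a possible starting point before reaching for a big VM, here is a minimal sketch that swaps pandas for the standard library's `xml.etree.ElementTree.iterparse`, so the file is read incrementally instead of being loaded whole. The tag names (`record`, `id`, `name`) and the function `stream_xml_to_csv` are assumptions for illustration, not something from the thread.

```python
import csv
import io
import xml.etree.ElementTree as ET


def stream_xml_to_csv(xml_source, csv_file, row_tag, fields):
    """Write one CSV row per row_tag element; return the number of rows."""
    writer = csv.writer(csv_file)
    writer.writerow(fields)
    count = 0
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == row_tag:
            writer.writerow([elem.findtext(f, default="") for f in fields])
            count += 1
            root.clear()     # drop finished records so memory does not grow
    return count


# Tiny in-memory sample standing in for the real file.
sample = (b"<records><record><id>1</id><name>Ann</name></record>"
          b"<record><id>2</id></record></records>")
out = io.StringIO()
rows = stream_xml_to_csv(io.BytesIO(sample), out, "record", ["id", "name"])
print(rows, "rows written")
print(out.getvalue())
```

This runs on a single core, so it will be slower than a cluster, but it should not need memory proportional to the file size.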

 

 

Posted
4 hours ago, kevinUsa said:

Will it work?

Yes, you can spin up 8 clusters and maybe get the job done in 10 minutes.

XML is semi-structured data and CSV is structured. Converting is like flattening the data, which means a lot of writes and processing.

An 80 GB XML file can easily turn into 200 GB of CSV, which requires a lot of parallel processing. The quickest and easiest way is Databricks, as it is readily available and can get the job done for maybe $5.

Open a Python notebook and start fiddling with this code example:

https://stackoverflow.com/questions/49898661/xml-to-csv-python

 

Posted
17 minutes ago, soodhilodaaram said:

Yes, you can spin up 8 clusters and maybe get the job done in 10 minutes.

XML is semi-structured data and CSV is structured. Converting is like flattening the data, which means a lot of writes and processing.

An 80 GB XML file can easily turn into 200 GB of CSV, which requires a lot of parallel processing. The quickest and easiest way is Databricks, as it is readily available and can get the job done for maybe $5.

Open a Python notebook and start fiddling with this code example:

https://stackoverflow.com/questions/49898661/xml-to-csv-python

 

Why would 80 GB become 200 GB? You will remove all the XML tags, which consume a lot of space. The CSV has just the needed data. I am thinking it will be less than 2 GB in CSV.

Posted
14 minutes ago, FLraja said:

Why would 80 GB become 200 GB? You will remove all the XML tags, which consume a lot of space. The CSV has just the needed data. I am thinking it will be less than 2 GB in CSV.

Maybe you are right. I was thinking of hierarchical flattening of the data, which results in more records in the CSV.
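To make the two effects concrete, here is a small sketch on an invented sample (one customer with two orders; all tag names are hypothetical). Flattening writes one row per child and repeats the parent fields on each row, while dropping the tags removes most of the bytes. It parses the whole string at once, which is fine for a toy sample but not for the real 80 GB file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: one customer with two orders.
SAMPLE = """<customers>
  <customer>
    <name>Ann</name>
    <city>Austin</city>
    <order><sku>A1</sku><qty>2</qty></order>
    <order><sku>B7</sku><qty>1</qty></order>
  </customer>
</customers>"""


def flatten(xml_text):
    """Return CSV text with one row per <order>, repeating the parent fields."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "city", "sku", "qty"])
    for customer in ET.fromstring(xml_text).iter("customer"):
        parent = [customer.findtext("name"), customer.findtext("city")]
        for order in customer.iter("order"):
            writer.writerow(parent + [order.findtext("sku"), order.findtext("qty")])
    return out.getvalue()


csv_text = flatten(SAMPLE)
print(csv_text)
print(len(SAMPLE.encode()), "bytes of XML ->", len(csv_text.encode()), "bytes of CSV")
```

In this sample the CSV comes out smaller because the tags dominate. With many parent fields and many children per parent, the repeated parent values could push the size the other way, so which effect wins depends on the actual data.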
