kevinUsa (Author), posted April 10, 2020: I tried Excel and SQL; neither worked. I also uploaded it to Azure Blob storage, but was unsuccessful when using Hive. Any other suggestions?
Bhumchik, posted April 10, 2020: Very simple. Just rename the file extension to .CSV
reddyeee, posted April 10, 2020: Not sure how long it will take, but you can give this a try: https://pypi.org/project/xmlutils/1.1/
kevinUsa (Author), posted April 10, 2020, quoting Sarvapindi ("Databricks Spark python"): Will it work?
phoenix_nebula, posted April 10, 2020: Option 1: use a SAX parser with streaming in Java. Option 2: Apache Tika.
WigsandThighs, posted April 10, 2020: Start keying it in manually, word for word.
kathanayaka, posted April 10, 2020, quoting kevinUsa ("??"): There is EMR + Spark + Python, isn't there, brother?
FLraja, posted April 10, 2020, quoting phoenix_nebula ("Option 1: use sax parser and streaming in java Option 2: Apache tika"): This works and it is simple. Write a SAX parser in 2 lines of code and generate the CSV.
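For anyone following the SAX suggestion, here is a minimal sketch of the streaming idea. It uses Python's standard `xml.sax` module rather than the Java parser mentioned above, and it assumes a hypothetical layout where each record is an `<order>` element with flat `<id>` and `<item>` children; the real file's tag names are not given in this thread, so `record_tag` and `fields` would need to be adjusted.

```python
import csv
import io
import xml.sax


class RecordToCsvHandler(xml.sax.ContentHandler):
    """Writes one CSV row per record element as the parser streams past it."""

    def __init__(self, writer, record_tag, fields):
        super().__init__()
        self.writer = writer
        self.record_tag = record_tag  # hypothetical record element name
        self.fields = fields          # hypothetical child element names
        self.row = None               # dict for the record being read
        self.field = None             # name of the field being read
        self.text = []                # text chunks of the current field

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.row = {}
        elif self.row is not None and name in self.fields:
            self.field = name
            self.text = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == self.record_tag and self.row is not None:
            self.writer.writerow([self.row.get(f, "") for f in self.fields])
            self.row = None  # drop the record so memory use stays flat
        elif name == self.field:
            self.row[self.field] = "".join(self.text).strip()
            self.field = None


def xml_to_csv(xml_source, csv_out, record_tag, fields):
    """xml_source may be a file path or a binary file object."""
    writer = csv.writer(csv_out)
    writer.writerow(fields)
    xml.sax.parse(xml_source, RecordToCsvHandler(writer, record_tag, fields))


# Tiny in-memory sample standing in for the real file.
sample = (
    b"<orders>"
    b"<order><id>1</id><item>pen</item></order>"
    b"<order><id>2</id><item>ink, blue</item></order>"
    b"</orders>"
)
out = io.StringIO()
xml_to_csv(io.BytesIO(sample), out, "order", ["id", "item"])
csv_text = out.getvalue()
print(csv_text)
```

Because only one record is held in memory at a time, the same handler should in principle work on a file far larger than RAM; for a real run, pass a file path and an open output file (`open("out.csv", "w", newline="")`) instead of the in-memory buffers.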
soodhilodaaram, posted April 10, 2020, quoting kevinUsa ("??"): Just change the file extension from xml to csv.
soodhilodaaram, posted April 10, 2020: Jokes apart, use Python to convert the XML to CSV using pandas. You may need to spin up a large virtual machine or Databricks clusters to make it quick.
soodhilodaaram, posted April 10, 2020, quoting kevinUsa ("Will it work"): Yes, you can spin up 8 clusters and get the job done in maybe 10 minutes. XML is semi-structured data and CSV is structured, so the conversion is like flattening the data, which means a lot of writes and processing. 80 GB of XML can easily turn into 200 GB of CSV, which requires a lot of parallel processing. The quickest and easiest way is Databricks, as it is readily available and can get the job done for maybe $5. Open a Python notebook and start fiddling with this code example: https://stackoverflow.com/questions/49898661/xml-to-csv-python
FLraja, posted April 10, 2020, quoting soodhilodaaram ("80GB xml can easily output to 200gb csv"): Why would 80 GB become 200 GB? You will remove all the XML tags, which consume a lot of space. The CSV has just the needed data. I am thinking it will be less than 2 GB as CSV.
soodhilodaaram, posted April 10, 2020, quoting FLraja ("Why would 80gb become 200gb?"): Maybe you are right. I was thinking of hierarchical flattening of the data, which results in more records in the CSV.
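The two effects discussed above (tags disappear, but flattening a hierarchy repeats parent values on every child row) can be seen on a toy example. This is only an illustrative sketch: the `<customer>`/`<order>` structure and all values are made up, and it says nothing about how the original 80 GB file would actually behave.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up nested record: one parent with three children.
xml_text = """<customers>
  <customer>
    <name>Asha</name>
    <city>Austin</city>
    <order><sku>A1</sku></order>
    <order><sku>B2</sku></order>
    <order><sku>C3</sku></order>
  </customer>
</customers>"""


def flatten(xml_string):
    """Return one row per <order>, repeating the parent's name and city."""
    root = ET.fromstring(xml_string)
    rows = []
    for customer in root.iter("customer"):
        name = customer.findtext("name", default="")
        city = customer.findtext("city", default="")
        for order in customer.findall("order"):
            rows.append([name, city, order.findtext("sku", default="")])
    return rows


rows = flatten(xml_text)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["name", "city", "sku"])
writer.writerows(rows)
csv_text = out.getvalue()

# Compare record counts and character counts for this sample.
print(len(rows), "CSV rows from 1 XML record")
print(len(xml_text), "characters of XML ->", len(csv_text), "characters of CSV")
```

One XML record becomes three CSV rows, each carrying the same name and city. Whether the CSV ends up larger or smaller than the XML depends on how long the tags are compared with the repeated parent values and on how many children each parent has, so measuring a sample of the real file is the safer way to estimate the output size.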