kevinUsa, posted April 10, 2020: ??
kevinUsa (Author), posted April 10, 2020: Tried Excel; not working. SQL too. Uploaded it to Azure Blob, but was unsuccessful when using Hive. Any other suggestions?
Sarvapindi, posted April 10, 2020: Databricks, Spark, Python.
Bhumchik, posted April 10, 2020: Very simple. Just rename the file extension to .csv.
reddyeee, posted April 10, 2020: Not sure how long it takes, but you can give this a try: https://pypi.org/project/xmlutils/1.1/
kevinUsa (Author), posted April 10, 2020: [Quoting Sarvapindi, 4 hours ago: "Databricks Spark python"] Will it work?
phoenix_nebula, posted April 10, 2020: Option 1: use a SAX parser and streaming in Java. Option 2: Apache Tika.
WigsandThighs, posted April 10, 2020: Start keying it in manually, word by word.
kathanayaka, posted April 10, 2020: [Quoting kevinUsa, 7 hours ago: "??"] There is EMR + Spark + Python, isn't there, brother?
FLraja, posted April 10, 2020: [Quoting phoenix_nebula, 3 hours ago: "Option 1: use sax parser and streaming in java Option 2: Apache tika"] This works. It is simple: write a couple of lines of SAX parser code and generate the CSV.
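The SAX suggestion above can be sketched in Python with the standard-library `xml.sax` module (the thread proposes Java; Python is swapped in here only for illustration). This is a minimal sketch under assumptions the thread does not confirm: the names `RecordToCsvHandler` and `xml_to_csv`, the `record` tag, and the `id`/`name` fields are all hypothetical, and the XML is assumed to be a flat list of records whose fields are simple child elements.

```python
import csv
import io
import xml.sax


class RecordToCsvHandler(xml.sax.ContentHandler):
    """Write one CSV row per record element as the parser streams past it."""

    def __init__(self, writer, record_tag, fields):
        super().__init__()
        self.writer = writer
        self.record_tag = record_tag  # hypothetical: tag that wraps one record
        self.fields = fields          # hypothetical: child tags to export
        self.row = None               # dict for the open record, else None
        self.current = None           # field currently being read, else None
        self.text = []

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.row = {}
        elif self.row is not None and name in self.fields:
            self.current = name
            self.text = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self.current is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == self.record_tag and self.row is not None:
            self.writer.writerow([self.row.get(f, "") for f in self.fields])
            self.row = None
        elif name == self.current:
            self.row[self.current] = "".join(self.text).strip()
            self.current = None


def xml_to_csv(xml_source, csv_out, record_tag, fields):
    """Stream xml_source (a path or binary file object) into csv_out."""
    writer = csv.writer(csv_out)
    writer.writerow(fields)
    xml.sax.parse(xml_source, RecordToCsvHandler(writer, record_tag, fields))


if __name__ == "__main__":
    sample = b"<root><record><id>1</id><name>Ann</name></record></root>"
    out = io.StringIO()
    xml_to_csv(io.BytesIO(sample), out, "record", ["id", "name"])
    print(out.getvalue())
```

Because the handler keeps only the record currently being read, memory use should stay roughly constant regardless of file size, which is the reason a streaming parser is suggested for a file this large.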
soodhilodaaram, posted April 10, 2020: [Quoting kevinUsa, 8 hours ago: "??"] Just change the file extension from .xml to .csv.
soodhilodaaram, posted April 10, 2020: Jokes apart, use Python to convert the XML to CSV using pandas. You may need to spin up a large virtual machine or Databricks clusters to make it quick.
soodhilodaaram, posted April 10, 2020: [Quoting kevinUsa, 4 hours ago: "Will it work"] Yes, you can spin up 8 clusters and get the job done in maybe 10 minutes. XML is semi-structured data and CSV is structured, so the conversion is like flattening the data, which means a lot of writes and processing. An 80 GB XML file can easily turn into a 200 GB CSV, which requires a lot of parallel processing. The quickest and easiest way is Databricks, as it is readily available and can get the job done for maybe $5. Open a Python notebook and start fiddling with this code example: https://stackoverflow.com/questions/49898661/xml-to-csv-python
FLraja, posted April 10, 2020: [Quoting soodhilodaaram, 17 minutes ago: "Yes, you can spin up 8 clusters and get the job done in maybe 10 minutes. XML is semi-structured data and CSV is structured, so the conversion is like flattening the data, which means a lot of writes and processing. An 80 GB XML file can easily turn into a 200 GB CSV, which requires a lot of parallel processing. The quickest and easiest way is Databricks, as it is readily available and can get the job done for maybe $5. Open a Python notebook and start fiddling with this code example: https://stackoverflow.com/questions/49898661/xml-to-csv-python"] Why would 80 GB become 200 GB? You will remove all the XML tags, which consume a lot of space. The CSV has just the needed data. I am thinking it will be less than 2 GB in CSV.
soodhilodaaram, posted April 10, 2020: [Quoting FLraja, 14 minutes ago: "Why would 80gb become 200gb? You will remove all xml tags which are consuming lot of space.csv has just the needed data.i am thinking it will be less than 2gb in csv"] Maybe you are right. I was thinking of hierarchical flattening of the data, which results in more records in the CSV.
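Both effects discussed above can be seen in a small sketch: flattening a hierarchy repeats the parent's fields on every child row, while the XML tag overhead disappears. The sample data, the `order`/`item` tags, and the function name `flatten_orders` are hypothetical, since the thread never shows the actual file; which effect dominates for the real 80 GB file depends on its structure.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: each <order> holds several <item> children.
SAMPLE = b"""<orders>
  <order id="1" customer="Ann">
    <item sku="A1" qty="2"/>
    <item sku="B7" qty="1"/>
    <item sku="C3" qty="5"/>
  </order>
  <order id="2" customer="Bob">
    <item sku="A1" qty="1"/>
  </order>
</orders>"""


def flatten_orders(xml_stream, csv_out):
    """Write one CSV row per <item>, repeating the parent order's fields."""
    writer = csv.writer(csv_out)
    writer.writerow(["order_id", "customer", "sku", "qty"])
    rows = 0
    # iterparse hands back each element when its end tag is seen, so the
    # parser reads the input incrementally instead of all at once.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "order":
            for item in elem.findall("item"):
                writer.writerow([elem.get("id"), elem.get("customer"),
                                 item.get("sku"), item.get("qty")])
                rows += 1
            elem.clear()  # discard the finished order's children and attributes
    return rows


if __name__ == "__main__":
    out = io.StringIO()
    count = flatten_orders(io.BytesIO(SAMPLE), out)
    print(count, "rows from 2 orders")
    print(out.getvalue())
```

Here two orders become four rows, and the values `1` and `Ann` are written three times. With deep nesting or wide parent records this repetition can outweigh the savings from dropping tags; with shallow data the CSV will usually be smaller than the XML.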