bleachreddy Posted April 10, 2020
4 hours ago, phoenix_nebula said: Option 1: use a SAX parser and streaming in Java. Option 2: Apache Tika.
Extension to Option 1: non-blocking I/O is available in Java 8+, and it comes in handy. Also use the streams provided by Java 8+; they are very good. I don't have enough time right now, but I remember doing this for an XML file of at least 20 GB.
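The SAX idea from Option 1 can be sketched roughly as below. The posts above suggest Java; this sketch swaps in Python's standard-library `xml.sax`, since Python is what the original poster ended up using. The element name `row` and the attribute `PostTypeId` are assumptions about how the dump is laid out, and `RowCounter` and `count_rows` are hypothetical names used only for illustration.

```python
import io
import xml.sax


class RowCounter(xml.sax.ContentHandler):
    """Counts <row> elements and tallies one attribute without building a tree."""

    def __init__(self, tag="row", attr="PostTypeId"):
        super().__init__()
        self.tag = tag      # assumed element name
        self.attr = attr    # assumed attribute name
        self.rows = 0
        self.by_value = {}

    def startElement(self, name, attrs):
        # Called once per opening tag while the parser streams through the input.
        if name == self.tag:
            self.rows += 1
            value = attrs.get(self.attr)
            self.by_value[value] = self.by_value.get(value, 0) + 1


def count_rows(source):
    """source is a file path or a binary file-like object."""
    handler = RowCounter()
    xml.sax.parse(source, handler)
    return handler


# Tiny stand-in for the real file, using the assumed layout.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="0" />
</posts>"""
```

Calling `count_rows(io.BytesIO(SAMPLE))` exercises the handler on the sample; passing a file path instead should work the same way, because the parser reads the input incrementally and only the counters stay in memory.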
pencil Posted April 10, 2020
An 80 GB XML file? Even the IRS database probably doesn't hold that much data.
Ellen Posted April 10, 2020
Use ElementTree, then pandas, in Python.
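A minimal sketch of the ElementTree-then-pandas suggestion, assuming the file is a flat list of `<row ... />` elements whose data lives in attributes. For a file this large, `ET.parse` would try to hold the whole tree in memory, so the sketch uses `ET.iterparse` and hands rows to pandas in chunks. `iter_rows`, `iter_frames`, the tag name, and the chunk size are all hypothetical choices, not something stated in the thread.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice


def iter_rows(source, tag="row"):
    """Yield each <tag> element's attributes as a dict, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib)
            root.clear()  # drop already-processed children so memory stays flat


def iter_frames(source, chunk_size=100_000):
    """Yield pandas DataFrames holding at most chunk_size rows each."""
    import pandas as pd  # third-party; imported here so iter_rows stays stdlib-only

    rows = iter_rows(source)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        yield pd.DataFrame(chunk)


# Tiny stand-in for the real file, using the assumed layout.
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" Score="5" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="0" />
</posts>"""
```

Something like `for frame in iter_frames("Posts.xml"): ...` would then process one chunk at a time (the file name here is a guess). All values arrive as strings, so numeric columns would still need converting in pandas.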
kevinUsa Posted April 10, 2020 Author
I used Excel. It took me around 12 hours, and the 73 GB of data is now 12 GB.
kevinUsa Posted April 10, 2020 Author
2 hours ago, pencil said: An 80 GB XML file? Even the IRS database probably doesn't hold that much data.
I used this data: https://ia800107.us.archive.org/27/items/stackexchange/stackoverflow.com-Posts.7z
kevinUsa Posted April 14, 2020 Author
Well, I used Python to convert the XML to CSV. Then I set up an HDInsight cluster, converted the data from .csv to .gz, uploaded it to Azure Blob storage, used Pig to clean out all the null values, used HCatalog to move the data from Pig to Hive, and then performed the queries.
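The first two steps above (XML to CSV, then CSV to .gz) might look something like the sketch below, which writes the gzip-compressed CSV in one pass. This is not the poster's actual script. The `row` element name and the column list in `FIELDS` are assumptions, and `xml_to_csv_gz` is a hypothetical helper. Missing attributes are written as empty fields, which is one place the null values cleaned later in Pig could come from.

```python
import csv
import gzip
import io
import xml.etree.ElementTree as ET

FIELDS = ["Id", "PostTypeId", "Score", "Tags"]  # hypothetical subset of columns


def xml_to_csv_gz(source, dest, fields=FIELDS, tag="row"):
    """Stream <tag> elements from source into a gzip-compressed CSV at dest.

    source and dest may be paths or binary file-like objects.
    Returns the number of data rows written.
    """
    written = 0
    with gzip.open(dest, "wt", encoding="utf-8", newline="") as out:
        # Attributes not listed in fields are ignored; missing ones become "".
        writer = csv.DictWriter(out, fieldnames=fields,
                                extrasaction="ignore", restval="")
        writer.writeheader()
        context = ET.iterparse(source, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                writer.writerow(elem.attrib)
                written += 1
                root.clear()  # free rows that have already been written
    return written


# Tiny stand-in for the real file, using the assumed layout.
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" Score="5" Tags="&lt;python&gt;" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="0" />
</posts>"""
```

A call such as `xml_to_csv_gz("Posts.xml", "posts.csv.gz")` (both file names are guesses) would produce a single compressed file ready to upload to blob storage.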