mettastar Posted November 20, 2017 Gurus, I need some help. I have a large dataset that I read in Spark, and I want to dynamically partition the data into multiple folders based on a date field. I was able to do that using hiveContext, but it is taking a lot of time. So instead, I want to read the distinct dates from that date field, store them in a variable, and use a for loop to manually create the folders and load the data into them. That way I don't have to use HiveQL. Any other suggestions?
mettastar Posted November 20, 2017 (Author) Where has everyone who works on Spark disappeared to?
Bhai Posted November 20, 2017 Replying to mettastar's original post: vendetta is the one for Spark and Scala, and used to post all sorts of things about it.
ranku_mogudu Posted November 20, 2017 Replying to Bhai: Calling @Keerthana. If nobody tags her, she will read the thread and just stay quiet.
former Posted November 20, 2017 Replying to mettastar's original post: Convert the data into DataFrames and create different DataFrames based on the date.
kasi Posted November 20, 2017 Replying to mettastar's original post: Use partitioning. Read up on it.
kasi Posted November 20, 2017 partitioned by (<columnname>). Partitioning on the date is not that good an option; better to partition weekly.
mettastar Posted November 20, 2017 (Author) Replying to former ("Convert the data into DataFrames and create different DataFrames based on the date."): I am already creating DataFrames, bro. The data is 1.7 billion records. The job had been running for 6+ hours and I just killed it. If we create DataFrames based on the date, how do we automate that?
mettastar Posted November 20, 2017 (Author) Replying to kasi ("Partitioning on the date is not that good an option; better to partition weekly."): It has to be on the date, bro. By "partitioned by", do you mean in the Hive table DDL?
kasi Posted November 20, 2017 Replying to mettastar ("By 'partitioned by', do you mean in the Hive table DDL?"): You can also use it directly in the write: import org.apache.spark.sql.functions.col; df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status")).write.partitionBy("entity", "year", "month", "day", "status").mode("append").parquet(location) Enjoy.
mettastar Posted November 20, 2017 (Author) Replying to kasi's repartition/partitionBy write snippet: I tried this, bro. There is not much difference between this and using the Hive query. I tried it with one month of data and it was comparatively fast. So now I want to identify the max and min dates in the dataset and loop through in 1-month increments. How do I assign HiveQL output to variables, bro? I can't find it by googling. val max = hiveContext(s"""select max(processed_day) from table """) -- this is not assigning the output
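A note on the snippet in the post above: `hiveContext(...)` is missing the `.sql` call, and `hiveContext.sql(...)` returns a DataFrame rather than a value, so the result still has to be pulled out of a Row. Below is a minimal sketch of one way to do that, not necessarily what the posters ended up using. `minMaxAsStrings` is a hypothetical helper name; `processed_day` and `table` come from the post. Casting to string is an assumption made here because the thread never states the column's type.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, min}

// Hypothetical helper (not from the thread): aggregates one column and
// brings the single result row back to the driver.
// Assumption: casting to string is acceptable because the real column type
// of processed_day is not stated in the thread.
def minMaxAsStrings(df: DataFrame, column: String): (String, String) = {
  val row = df
    .agg(min(col(column)).cast("string"), max(col(column)).cast("string"))
    .first() // one Row with two fields
  (row.getString(0), row.getString(1))
}

// Usage sketch, assuming `hiveContext` already exists and `table` is the
// placeholder table name used in the post:
// val (minDay, maxDay) = minMaxAsStrings(hiveContext.table("table"), "processed_day")
```

The same idea works with a query string: `hiveContext.sql("select max(processed_day) from table").first().get(0)` returns the single value as `Any`, which then needs a cast to the column's actual type.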
kasi Posted November 20, 2017 Did you create the hiveContext in the first place?
kasi Posted November 20, 2017 sc = SparkContext(); sqlContext = HiveContext(sc); sqlContext.sql("""select * from table""") ... that is how you have to write it.
siritptpras Posted November 20, 2017 A totally different question: how are the job opportunities for Spark and Scala? Apologies to the OP for posting here; I am planning to learn them.
mettastar Posted November 21, 2017 (Author) Replying to kasi's HiveContext snippet: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc); val rowRDD = r.map(row => Row.fromSeq(row.split("\t", -1))); val df = hiveContext.createDataFrame(rowRDD, schema); df.write.mode(SaveMode.Overwrite).format("orc").partitionBy("processed_day").save("/user/hive/warehouse/df_d_distributor_return_items/") I tried it this way and it ran forever. I tried the same thing with Hive dynamic partitioning and that also ran forever. Hive dynamic partitioning on one month of data finished in about 20 minutes. So now I want to get the max and min of the processed_day column and then use a for loop to go through the data in one-month increments, doing the dynamic partitioning one month at a time. Can you tell me how to find that max and min and run the for loop in one-month increments, bro? I am googling it too. Thanks.
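The one-month loop asked about above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution for the poster's cluster. `monthWindows` is a hypothetical helper name, and the sketch assumes processed_day holds dates in yyyy-MM-dd form (the thread does not say). The Spark part is left as comments because it depends on the `df` and output path from the post.

```scala
import java.time.LocalDate

// Hypothetical helper (not from the thread): splits [minDay, maxDay] into
// consecutive windows (start, endExclusive), each beginning on the first
// day of a calendar month.
def monthWindows(minDay: LocalDate, maxDay: LocalDate): List[(LocalDate, LocalDate)] = {
  val firstOfMonth = minDay.withDayOfMonth(1)
  Iterator
    .iterate(firstOfMonth)(_.plusMonths(1))
    .takeWhile(start => !start.isAfter(maxDay)) // stop once a window would start after maxDay
    .map(start => (start, start.plusMonths(1)))
    .toList
}

// Usage sketch with Spark, assuming `df` and the output path from the post,
// minDay/maxDay already read as yyyy-MM-dd strings, and these imports:
//   import org.apache.spark.sql.SaveMode
//   import org.apache.spark.sql.functions.col
//
// for ((start, end) <- monthWindows(LocalDate.parse(minDay), LocalDate.parse(maxDay))) {
//   df.filter(col("processed_day") >= start.toString && col("processed_day") < end.toString)
//     .write.mode(SaveMode.Append).format("orc")
//     .partitionBy("processed_day")
//     .save("/user/hive/warehouse/df_d_distributor_return_items/")
// }
```

The sketch uses SaveMode.Append rather than the SaveMode.Overwrite from the post, on the assumption that each pass should add its month's partitions; overwriting the same base path on every pass would likely replace the output of earlier months.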