
Spark - Scala - help


mettastar


Gurus,

I need some help. I have a large dataset that I read in Spark, and I want to dynamically partition the data into multiple folders based on a date field.

I was able to do that using hiveContext but it is taking a lot of time..

So instead, I want to read the distinct dates from that date field, store them in a variable, and use a for loop to manually create the folders and load the data into them. That way I don't have to use HiveQL.
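Roughly what I have in mind, just a sketch (table name and output path are placeholders):

// sketch of the manual approach: collect distinct dates, then one write per date
import org.apache.spark.sql.SaveMode

val df = hiveContext.sql("select * from my_table")  // placeholder table name

// pull the distinct dates back to the driver
val dates = df.select("processed_day").distinct().collect().map(_.get(0))

// filter and write each date into its own folder
for (d <- dates) {
  df.filter(df("processed_day") === d)
    .write.mode(SaveMode.Overwrite)
    .parquet(s"/output/path/processed_day=$d")  // placeholder output path
}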

 

Any other suggestions?


1 hour ago, mettastar said:

Gurus, I need some help. I have a large dataset that I read in Spark, and I want to dynamically partition it into multiple folders based on a date field.

Spark with Scala is pure torture.

It keeps throwing all sorts of stuff at you.


1 hour ago, mettastar said:

Gurus, I need some help. I have a large dataset that I read in Spark, and I want to dynamically partition it into multiple folders based on a date field.

Convert data into DataFrames and create different DataFrames based on the date. 


2 hours ago, mettastar said:

Gurus, I need some help. I have a large dataset that I read in Spark, and I want to dynamically partition it into multiple folders based on a date field.

Use partitioning.

Read up on it.


24 minutes ago, former said:

Convert data into DataFrames and create different DataFrames based on the date. 

I'm already creating DataFrames, bro. The data is 1.7 billion rows; it had been running for 6+ hours, so I just killed it.

When you say DataFrames based on date, how do we automate that?


10 minutes ago, kasi said:

partitionBy(<columnname>)

Partitioning on date is not that good an option; better to use weekly.

It has to be on date, bro. Is partitionBy something from the Hive table DDL?


30 minutes ago, mettastar said:

It has to be on date, bro. Is partitionBy something from the Hive table DDL?

You can also use it directly in the write:

df.repartition($"entity", $"year", $"month", $"day", $"status").write.partitionBy("entity", "year", "month", "day", "status").mode("append").parquet(s"$location")

 

Enjoy.
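Repartitioning on the same columns you partition by means each output folder gets written by one task, so you end up with one file per folder instead of lots of small ones. And if you go weekly like I said, derive the columns first; a rough sketch (assumes processed_day is a date/timestamp column, with df and location as above):

import org.apache.spark.sql.functions.{col, weekofyear, year}

// derive coarser partition columns instead of partitioning on the raw date
val withWeek = df
  .withColumn("yr", year(col("processed_day")))
  .withColumn("wk", weekofyear(col("processed_day")))

withWeek.repartition(col("yr"), col("wk"))
  .write.partitionBy("yr", "wk")
  .mode("append")
  .parquet(s"$location")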


11 minutes ago, kasi said:

You can also use it directly in the write:

df.repartition($"entity", $"year", $"month", $"day", $"status").write.partitionBy("entity", "year", "month", "day", "status").mode("append").parquet(s"$location")

Enjoy.

I tried this, bro. There's not much difference between this and using the Hive query.

I tried it with one month of data, and it was comparably fast. So now I want to identify the max and min dates in the dataset and loop through in one-month increments.

How do I assign HiveQL output to variables, bro? Googling isn't turning anything up.

val max = hiveContext.sql("""select max(processed_day) from table""")  // this is not assigning the output
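Is it because sql() gives back a DataFrame, so I have to pull the row out myself? Something like this maybe (just guessing):

val max = hiveContext.sql("""select max(processed_day) from table""").first().get(0)  // collect the single result row back to the driver?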


10 minutes ago, kasi said:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext()
val sqlContext = new HiveContext(sc)

sqlContext.sql("""select * from table""")..... write it like this

import org.apache.spark.sql.{Row, SaveMode}

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// r is the raw RDD of tab-delimited lines; split each into a Row
val rowRDD = r.map(row => Row.fromSeq(row.split("\t", -1)))
val df = hiveContext.createDataFrame(rowRDD, schema)

df.write.mode(SaveMode.Overwrite).format("orc").partitionBy("processed_day").save("/user/hive/warehouse/df_d_distributor_return_items/")

 

I tried it this way and it was running forever. I also tried Hive dynamic partitioning the same way, and that was running forever too.

With just one month of data, Hive dynamic partitioning finished in about 20 minutes.

Now I want to get the max and min of the processed_day column, and then use a for loop to step through in one-month increments, so the dynamic partitioning runs one month at a time.

So can you tell me how to find that max and min and loop in one-month increments, bro? I'm googling too. Thanks.
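Roughly what I'm aiming for, a rough sketch, not tested (assumes processed_day is a DATE column, with the same table and path as above):

import java.sql.Date
import java.util.Calendar
import org.apache.spark.sql.SaveMode

// pull min/max of processed_day back to the driver
val row = hiveContext.sql("select min(processed_day), max(processed_day) from table").first()
val (minDay, maxDay) = (row.getDate(0), row.getDate(1))

val cal = Calendar.getInstance()
cal.setTime(minDay)

// write one month at a time, each slice dynamically partitioned by processed_day
while (!cal.getTime.after(maxDay)) {
  val start = new Date(cal.getTimeInMillis)
  cal.add(Calendar.MONTH, 1)
  val end = new Date(cal.getTimeInMillis)

  hiveContext.sql(s"select * from table where processed_day >= '$start' and processed_day < '$end'")
    .write.mode(SaveMode.Append)
    .format("orc")
    .partitionBy("processed_day")
    .save("/user/hive/warehouse/df_d_distributor_return_items/")
}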

