Problem Scenario 74 : You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table : (order_id, order_date, order_customer_id, order_status)
Columns of order_items table : (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish the following activities.
1. Copy the "retail_db.orders" and "retail_db.order_items" tables to HDFS in the respective directories p89_orders and p89_order_items.
2. Join these data using order_id in Spark and Python.
3. Now fetch selected columns from the joined data: OrderId, order date and amount collected on this order.
4. Calculate the total orders placed for each date, and produce the output sorted by date.
Answer(s): A
Solution:
Step 1: Import each table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p89_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p89_order_items -m 1
Note : Please check that you don't have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2: Read the data from one of the part files created by the commands above.
hadoop fs -cat p89_orders/part-m-00000
hadoop fs -cat p89_order_items/part-m-00000
Step 3: Load these two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).
orders = sc.textFile("p89_orders")
orderItems = sc.textFile("p89_order_items")
Step 4: Convert each RDD into key-value pairs (order_id as the key and the whole line as the value). Sqoop writes comma-delimited text by default, so the lines are split on ",".
# First value is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
# Second value is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5: Join both RDDs using order_id.
joinedData = orderItemsKeyValue.join(ordersKeyValue)
# print the joined data
for line in joinedData.collect():
    print(line)
Format of joinedData as below.
(OrderId, ('All columns from orderItemsKeyValue', 'All columns from ordersKeyValue'))
Step 6: Now fetch selected values: OrderId, order date and amount collected on this order.
revenuePerOrderPerDay = joinedData.map(lambda row: (row[0], row[1][1].split(",")[1], float(row[1][0].split(",")[4])))
# print the result
for line in revenuePerOrderPerDay.collect():
    print(line)
Step 7: Select distinct order ids for each date.
# distinct (date, order_id)
distinctOrdersDate = joinedData.map(lambda row: row[1][1].split(",")[1] + "," + str(row[0])).distinct()
for line in distinctOrdersDate.collect():
    print(line)
Step 8: Similar to word count, generate a (date, 1) record for each row.
newLineTuple = distinctOrdersDate.map(lambda line: (line.split(",")[0], 1))
Step 9: Do the count for each key (date) to get the total orders per date.
totalOrdersPerDate = newLineTuple.reduceByKey(lambda a, b: a + b)
# print results
for line in totalOrdersPerDate.collect():
    print(line)
Step 10: Sort the results by date.
sortedData = totalOrdersPerDate.sortByKey().collect()
# print results
for line in sortedData:
    print(line)
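The join and the per-date count in Steps 4 to 10 can be checked without a cluster. The following is a minimal plain-Python sketch, not PySpark: it mirrors the same logic with dictionaries and lists. The sample rows are hypothetical and only follow the column layout given in the problem statement.

```python
from collections import defaultdict

# Hypothetical sample rows (assumption), comma-delimited like Sqoop's default output
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "4,2013-07-26 00:00:00.0,8827,CLOSED",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,4,897,2,49.98,24.99",
]

# Step 4: key both datasets by order_id (field 0 in orders, field 1 in order_items)
orders_by_id = {int(line.split(",")[0]): line for line in orders}
items_keyed = [(int(line.split(",")[1]), line) for line in order_items]

# Step 5: inner join, same shape as the RDD join: (order_id, (item_line, order_line))
joined = [(oid, (item, orders_by_id[oid])) for oid, item in items_keyed if oid in orders_by_id]

# Step 6: (order_id, order_date, order_item_subtotal)
revenue_per_order = [
    (oid, order.split(",")[1], float(item.split(",")[4])) for oid, (item, order) in joined
]

# Steps 7-9: distinct (date, order_id) pairs, then count orders per date
distinct_pairs = {(order.split(",")[1], oid) for oid, (_item, order) in joined}
orders_per_date = defaultdict(int)
for date, _oid in distinct_pairs:
    orders_per_date[date] += 1

# Step 10: sort by date
sorted_counts = sorted(orders_per_date.items())
for row in sorted_counts:
    print(row)
```

Order 2 has two items in this sample but is counted once for its date, which is why the distinct step comes before the count.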
Problem Scenario 34 : You have been given a file named spark6/user.csv.
Data is given below:
user.csv
id, topic, hits
Rahul, scala, 120
Nikita, spark, 80
Mithun, spark, 1
myself, cca175, 180
Now write Spark code in Scala which will remove the header part and create an RDD of values as below, for all rows. Also, if id is "myself" then filter out the row.
Map(id -> om, topic -> scala, hits -> 120)
Solution :
Step 1: Create the file in HDFS (we will do this using Hue). However, you can first create it in the local filesystem and then upload it to HDFS.
Step 2: Load the user.csv file from HDFS and create an RDD.
val csv = sc.textFile("spark6/user.csv")
Step 3: Split and clean the data.
val headerAndRows = csv.map(line => line.split(",").map(_.trim))
Step 4: Get the header row.
val header = headerAndRows.first
Step 5: Filter out the header (we need to check whether the first value matches the first header name).
val data = headerAndRows.filter(_(0) != header(0))
Step 6: Convert the splits to a map (header/value pairs).
val maps = data.map(splits => header.zip(splits).toMap)
Step 7: Filter out the user "myself".
val result = maps.filter(map => map("id") != "myself")
Step 8: Save the output as a text file.
result.saveAsTextFile("spark6/result.txt")
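The header/value zip in Steps 3 to 7 can be sketched in plain Python to check the expected records. This is an illustration of the same logic without Spark, using the rows from the problem statement; it is not the Scala solution itself.

```python
# Rows copied from the problem statement
lines = [
    "id, topic, hits",
    "Rahul, scala, 120",
    "Nikita, spark, 80",
    "Mithun, spark, 1",
    "myself, cca175, 180",
]

# Step 3: split on commas and trim whitespace
rows = [[field.strip() for field in line.split(",")] for line in lines]

# Step 4: the first row is the header
header = rows[0]

# Step 5: drop any row whose first value equals the first header name
data = [row for row in rows if row[0] != header[0]]

# Step 6: pair each header name with the value in the same position
maps = [dict(zip(header, row)) for row in data]

# Step 7: drop the user "myself"
result = [m for m in maps if m["id"] != "myself"]

for record in result:
    print(record)
```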
Problem Scenario 39 : You have been given two files.
spark16/file1.txt
1, 9, 5
2, 7, 4
3, 8, 3
spark16/file2.txt
1, g, h
2, i, j
3, k, l
Load these two files as Spark RDDs and join them to produce the results below.
(1, ((9, 5), (g, h)))
(2, ((7, 4), (i, j)))
(3, ((8, 3), (k, l)))
Then write a code snippet which will sum the second columns of the above joined results (5+4+3).
Solution :
Step 1: Create the files in HDFS using Hue.
Step 2: Create a pair RDD for both files. The fields are trimmed so that the split works whether or not the file has a space after each comma.
val one = sc.textFile("spark16/file1.txt").map { _.split(",", -1).map(_.trim) match { case Array(a, b, c) => (a, (b, c)) } }
val two = sc.textFile("spark16/file2.txt").map { _.split(",", -1).map(_.trim) match { case Array(a, b, c) => (a, (b, c)) } }
Step 3: Join both RDDs.
val joined = one.join(two)
Step 4: Sum the second column values.
val sum = joined.map { case (_, ((_, num2), (_, _))) => num2.toInt }.reduce(_ + _)
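To confirm the joined shape and the sum, here is a minimal plain-Python sketch of the same steps. It assumes each key appears once per file, as in the sample data, so a dictionary lookup can stand in for the RDD join.

```python
# Rows copied from the problem statement
file1 = ["1, 9, 5", "2, 7, 4", "3, 8, 3"]
file2 = ["1, g, h", "2, i, j", "3, k, l"]

def to_pair(line):
    # (key, (second field, third field)), with whitespace trimmed
    a, b, c = [field.strip() for field in line.split(",")]
    return a, (b, c)

one = dict(to_pair(line) for line in file1)
two = dict(to_pair(line) for line in file2)

# Inner join on the key: (key, ((b1, c1), (b2, c2)))
joined = [(key, (one[key], two[key])) for key in one if key in two]

# Sum the second value of the left-hand tuple (5 + 4 + 3)
total = sum(int(left[1]) for _key, (left, _right) in joined)

print(joined)
print(total)
```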
Problem Scenario 58 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, Seq[String])] = Array((4, ArrayBuffer(lion)), (6, ArrayBuffer(spider)), (3, ArrayBuffer(dog, cat)), (5, ArrayBuffer(tiger, eagle)))
Solution :
b.groupByKey.collect
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key component of each pair is automatically presented to the partitioner.
Listing Variants
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
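As a rough illustration of what keyBy followed by groupByKey computes, the grouping can be sketched in plain Python. This shows only the grouping; the order of keys in a real collect depends on partitioning and is not reproduced here.

```python
from collections import defaultdict

words = ["dog", "tiger", "lion", "cat", "spider", "eagle"]

# keyBy(_.length): pair each word with its length as the key
pairs = [(len(word), word) for word in words]

# groupByKey: collect all values that share a key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

for key, values in grouped.items():
    print(key, values)
```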
Problem Scenario 63 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, String)] = Array((4, lion), (3, dogcat), (7, panther), (5, tigereagle))
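Solution :
b.reduceByKey(_ + _).collect
reduceByKey [Pair] : This function provides the well-known reduce functionality in Spark. Please note that any function f you provide should be commutative and associative in order to generate reproducible results. String concatenation, as used here, is not commutative, so the order within each value (dogcat rather than catdog) depends on how the elements are partitioned.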
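The effect of a non-commutative function can be seen in a small plain-Python sketch. The helper reduce_by_key below is hypothetical, not a Spark API: it reduces within each partition first and then merges the partial results in partition order. The split of the six words into two partitions of three is also an assumption made for illustration.

```python
def reduce_by_key(partitions, f):
    """Hypothetical helper: reduce per partition, then merge partials in order."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = f(acc[key], value) if key in acc else value
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            merged[key] = f(merged[key], value) if key in merged else value
    return merged

words = ["dog", "tiger", "lion", "cat", "panther", "eagle"]
pairs = [(len(word), word) for word in words]

# Assumed split into two partitions of three elements each
partitions = [pairs[:3], pairs[3:]]

concat = lambda a, b: a + b
result = reduce_by_key(partitions, concat)

# Merging the partitions in the opposite order changes the concatenated strings
swapped = reduce_by_key(partitions[::-1], concat)

print(result)
print(swapped)
```

A commutative and associative function such as integer addition would give the same result for both merge orders.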