Problem Scenario 74 : You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table : (order_id, order_date, order_customer_id, order_status)
Columns of order_items table : (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish the following activities.
1. Copy the "retail_db.orders" and "retail_db.order_items" tables to HDFS in the respective directories p89_orders and p89_order_items.
2. Join these data using order_id in Spark and Python.
3. Now fetch selected columns from the joined data: OrderId, Order date and amount collected on this order.
4. Calculate the total orders placed for each date, and produce the output sorted by date.
Answer(s): A
Solution:
Step 1: Import each table with its own Sqoop command.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p89_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p89_order_items -m 1
Note : Please check that there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from an RDBMS to HDFS.
Step 2: Read the data from one of the partitions created by the above commands.
hadoop fs -cat p89_orders/part-m-00000
hadoop fs -cat p89_order_items/part-m-00000
Step 3: Load the above two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).
orders = sc.textFile("p89_orders")
orderItems = sc.textFile("p89_order_items")
Step 4: Convert each RDD into key-value pairs (order_id as the key and the whole line as the value). Sqoop writes comma-delimited text by default, so split on "," with no space.
# In orders, the first value is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
# In order_items, the second value is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5: Join both RDDs using order_id.
joinedData = orderItemsKeyValue.join(ordersKeyValue)
# print the joined data
for line in joinedData.collect():
    print(line)
Format of joinedData is as below.
(OrderId, ('All columns from orderItemsKeyValue', 'All columns from ordersKeyValue'))
Step 6: Now fetch selected values: OrderId, Order date and amount collected on this order.
revenuePerOrderPerDay = joinedData.map(lambda row: (row[0], row[1][1].split(",")[1], float(row[1][0].split(",")[4])))
# print the result
for line in revenuePerOrderPerDay.collect():
    print(line)
Step 7: Select distinct order ids for each date.
# distinct (date, order_id)
distinctOrdersDate = joinedData.map(lambda row: row[1][1].split(",")[1] + "," + str(row[0])).distinct()
for line in distinctOrdersDate.collect():
    print(line)
Step 8: Similar to word count, generate a (date, 1) record for each row.
newLineTuple = distinctOrdersDate.map(lambda line: (line.split(",")[0], 1))
Step 9: Do the count for each key (date) to get the total orders per date.
totalOrdersPerDate = newLineTuple.reduceByKey(lambda a, b: a + b)
# print results
for line in totalOrdersPerDate.collect():
    print(line)
Step 10: Sort the results by date.
sortedData = totalOrdersPerDate.sortByKey().collect()
# print results
for line in sortedData:
    print(line)
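The join-and-count logic for Scenario 74 can be checked locally without a cluster. The following is a minimal plain-Python sketch, not Spark code: the sample rows are hypothetical and only assume Sqoop's default comma-delimited layout, and lists and dicts stand in for the RDDs.

```python
from collections import defaultdict

# Hypothetical sample rows (not taken from the real retail_db tables),
# laid out the way Sqoop writes them by default: comma-delimited.
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "4,2013-07-26 00:00:00.0,8827,CLOSED",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,4,897,2,49.98,24.99",
]

# Key each orders line by order_id (column 0).
orders_by_id = {int(line.split(",")[0]): line for line in orders}

# Inner join on order_id (column 1 in order_items):
# (order_id, (order_item_line, order_line)), mirroring joinedData.
joined = []
for item in order_items:
    order_id = int(item.split(",")[1])
    if order_id in orders_by_id:
        joined.append((order_id, (item, orders_by_id[order_id])))

# (order_id, order_date, subtotal) for every joined row.
revenue_rows = [
    (order_id, order.split(",")[1], float(item.split(",")[4]))
    for order_id, (item, order) in joined
]

# Distinct (date, order_id) pairs, then a word-count style tally per date.
distinct_pairs = {(order.split(",")[1], order_id) for order_id, (item, order) in joined}
orders_per_date = defaultdict(int)
for date, _ in distinct_pairs:
    orders_per_date[date] += 1

# Sort by date, as sortByKey() would.
sorted_counts = sorted(orders_per_date.items())
print(sorted_counts)
```

Order 2 has two items here, so taking distinct (date, order_id) pairs before counting keeps it from being counted twice for its date.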
Problem Scenario 34 : You have been given a file named spark6/user.csv. Data is given below:
user.csv
id, topic, hits
Rahul, scala, 120
Nikita, spark, 80
Mithun, spark, 1
myself, cca175, 180
Now write Spark code in Scala which will remove the header part and create an RDD of values as below, for all rows. Also, if id is "myself" then filter out that row.
Map(id -> Rahul, topic -> scala, hits -> 120)
Solution :
Step 1: Create the file in HDFS (we will do this using Hue). However, you can first create it in the local filesystem and then upload it to HDFS.
Step 2: Load the user.csv file from HDFS and create an RDD.
val csv = sc.textFile("spark6/user.csv")
Step 3: Split and clean the data. Splitting on "," and trimming each field works whether or not the file has a space after each comma.
val headerAndRows = csv.map(line => line.split(",").map(_.trim))
Step 4: Get the header row.
val header = headerAndRows.first
Step 5: Filter out the header (we need to check if the first value matches the first header name).
val data = headerAndRows.filter(_(0) != header(0))
Step 6: Convert each row to a map (header/value pairs).
val maps = data.map(splits => header.zip(splits).toMap)
Step 7: Filter out the user "myself".
val result = maps.filter(map => map("id") != "myself")
Step 8: Save the output as a text file.
result.saveAsTextFile("spark6/result.txt")
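The header-zip idea from Scenario 34 can be tried outside Spark. This is a minimal plain-Python sketch, assuming the five lines of user.csv shown in the problem; lists stand in for the RDD, and `dict(zip(...))` plays the role of `header.zip(splits).toMap`.

```python
# The lines of user.csv as given in the problem statement.
lines = [
    "id, topic, hits",
    "Rahul, scala, 120",
    "Nikita, spark, 80",
    "Mithun, spark, 1",
    "myself, cca175, 180",
]

# Split on commas and trim whitespace around every field.
rows = [[field.strip() for field in line.split(",")] for line in lines]

# The first row is the header; drop any row whose first field equals
# the first header name.
header = rows[0]
data = [row for row in rows if row[0] != header[0]]

# Pair each header name with the value in the same position.
maps = [dict(zip(header, row)) for row in data]

# Filter out the user "myself".
result = [m for m in maps if m["id"] != "myself"]
for m in result:
    print(m)
```

All values stay strings, as in the Scala solution; converting hits to a number would be a separate step.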
Problem Scenario 39 : You have been given two files.
spark16/file1.txt
1, 9, 5
2, 7, 4
3, 8, 3
spark16/file2.txt
1, g, h
2, i, j
3, k, l
Load these two files as Spark RDDs and join them to produce the results below.
(1,((9,5),(g,h)))
(2,((7,4),(i,j)))
(3,((8,3),(k,l)))
Also write a code snippet which will sum the second column of the above joined results (5+4+3).
Solution :
Step 1: Create the files in HDFS using Hue.
Step 2: Create a pair RDD for both files.
val one = sc.textFile("spark16/file1.txt").map { _.split(", ", -1) match { case Array(a, b, c) => (a, (b, c)) } }
val two = sc.textFile("spark16/file2.txt").map { _.split(", ", -1) match { case Array(a, b, c) => (a, (b, c)) } }
Step 3: Join both RDDs.
val joined = one.join(two)
Step 4: Sum the second column values.
val sum = joined.map { case (_, ((_, num2), (_, _))) => num2.toInt }.reduce(_ + _)
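The shape of the joined records in Scenario 39 is easier to see with the data inline. Below is a minimal plain-Python sketch (not Spark) using the two files from the problem; dicts stand in for the pair RDDs, which assumes every key appears at most once per file, as it does here.

```python
# Contents of spark16/file1.txt and spark16/file2.txt from the problem.
file1 = ["1, 9, 5", "2, 7, 4", "3, 8, 3"]
file2 = ["1, g, h", "2, i, j", "3, k, l"]


def to_pair(line):
    """Turn 'a, b, c' into (a, (b, c)), like the Scala pattern match."""
    a, b, c = line.split(", ")
    return a, (b, c)


# Assumption: keys are unique within each file, so a dict is enough.
one = dict(to_pair(line) for line in file1)
two = dict(to_pair(line) for line in file2)

# Inner join on the key: (key, ((b1, c1), (b2, c2))).
joined = [(key, (one[key], two[key])) for key in one if key in two]

# Sum the second column of the file1 side: 5 + 4 + 3.
total = sum(int(num2) for _, ((_, num2), _) in joined)
print(joined)
print(total)
```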
Problem Scenario 58 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))
Solution :
b.groupByKey.collect
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key component of each pair will automatically be presented to the partitioner.
Listing Variants
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
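What keyBy followed by groupByKey computes in Scenario 58 can be illustrated in a few lines. This is a plain-Python sketch of the grouping only, not Spark code: a `defaultdict` stands in for the shuffle, and the key order of the result is not meant to match Spark's partition-dependent ordering.

```python
from collections import defaultdict

words = ["dog", "tiger", "lion", "cat", "spider", "eagle"]

# keyBy(_.length): pair every word with its length as the key.
pairs = [(len(word), word) for word in words]

# groupByKey: collect all values that share a key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

grouped = dict(groups)
print(grouped)
```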
Problem Scenario 63 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
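Solution :
b.reduceByKey(_ + _).collect
reduceByKey [Pair] : This function provides the well-known reduce functionality in Spark. Please note that any function f you provide should be associative and commutative in order to generate reproducible results. String concatenation, as used here, is not commutative, so the order of the letters within each value (for example dogcat versus catdog) can depend on how the data is partitioned.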
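The remark about commutativity in Scenario 63 can be made concrete. The sketch below is plain Python, not Spark: `reduce_by_key` is a hypothetical helper written for this illustration, and reversing the input list is only a stand-in for the values of a key arriving in a different order.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical helper: fold the values of each key with func, in input order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


words = ["dog", "tiger", "lion", "cat", "panther", "eagle"]
pairs = [(len(word), word) for word in words]

# Same data, two arrival orders.
forward = reduce_by_key(pairs, lambda a, b: a + b)
backward = reduce_by_key(list(reversed(pairs)), lambda a, b: a + b)

print(forward)
print(backward)
```

With a commutative function such as integer addition, both orders would give the same result; with concatenation they differ for every key that has more than one value.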