Certified Associate Developer for Apache Spark Dumps PDF Free Download

QUESTION: 31

Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?

A. itemsDf.withColumn(["supplier", "manufacturer"])
B. itemsDf.withColumn("supplier").alias("manufacturer")
C. itemsDf.withColumnRenamed("supplier", "manufacturer")
D. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
E. itemsDf.withColumnsRenamed("supplier", "manufacturer")

Answer(s): C

Explanation:

itemsDf.withColumnRenamed("supplier", "manufacturer")
Correct! This uses the DataFrame method withColumnRenamed to rename column supplier to manufacturer. Note that the question asks for "a copy of DataFrame itemsDf". This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied.

itemsDf.withColumnsRenamed("supplier", "manufacturer")
Incorrect. In the PySpark 3.1 API that this question references, DataFrame has no withColumnsRenamed() method. A method of that name was added in PySpark 3.4, but it takes a single dictionary that maps old names to new names, so this call with two strings would fail there as well.

itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
No. Watch out: although col() works with many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below, withColumnRenamed expects strings. The arguments are also in the wrong order, since the existing column name has to come first.

itemsDf.withColumn(["supplier", "manufacturer"])
Wrong. While DataFrame.withColumn() exists in Spark, its purpose is not renaming columns. withColumn is typically used to add columns to DataFrames, taking the name of the new column as its first argument and a Column as its second. Learn more via the documentation linked below.

itemsDf.withColumn("supplier").alias("manufacturer")
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, DataFrame.alias() would not rename a column: it assigns an alias to the whole DataFrame, which can be useful for referring to the inputs of a join. That is outside the scope of this question. If you are curious nevertheless, check out the link below.
More info:
pyspark.sql.DataFrame.withColumnRenamed, PySpark 3.1.1 documentation (https://bit.ly/3aSB5tm)
pyspark.sql.DataFrame.withColumn, PySpark 3.1.1 documentation (https://bit.ly/2Tv4rbE)
pyspark.sql.DataFrame.alias, PySpark 3.1.2 documentation (https://bit.ly/2RbhBd2)
Static notebook | Dynamic notebook: see test 1, question 31 (https://flrs.github.io/spark_practice_tests_code/#1/31.html)
Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions


QUESTION: 32

Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?

A. transactionsDf.sort(asc_nulls_last("predError"))
B. transactionsDf.orderBy("predError").desc_nulls_last()
C. transactionsDf.sort("predError", ascending=False)
D. transactionsDf.desc_nulls_last("predError")
E. transactionsDf.orderBy("predError").asc_nulls_last()

Answer(s): C

Explanation:

transactionsDf.sort("predError", ascending=False)
Correct! When using DataFrame.sort() and setting ascending=False, the DataFrame is sorted by the specified column in descending order. For a descending sort, Spark places missing values last by default. An alternative, although not listed as an answer here, would be transactionsDf.sort(desc_nulls_last("predError")).

transactionsDf.sort(asc_nulls_last("predError"))
Incorrect. While this is valid syntax, it sorts the DataFrame on column predError in ascending rather than descending order, putting missing values last.

transactionsDf.desc_nulls_last("predError")
Wrong, this is invalid syntax. There is no method DataFrame.desc_nulls_last() in the Spark API. There is, however, a Spark function desc_nulls_last() (see the link below).

transactionsDf.orderBy("predError").desc_nulls_last()
No. While transactionsDf.orderBy("predError") is correct syntax and returns a DataFrame (sorted by column predError in ascending order), there is no method DataFrame.desc_nulls_last() in the Spark API. There is, however, a Spark function desc_nulls_last() (see the link below).

transactionsDf.orderBy("predError").asc_nulls_last()
Incorrect. There is no method DataFrame.asc_nulls_last() in the Spark API (see above).
More info:
pyspark.sql.functions.desc_nulls_last, PySpark 3.1.2 documentation (https://bit.ly/3g1JtbI)
pyspark.sql.DataFrame.sort, PySpark 3.1.2 documentation (https://bit.ly/2R90NCS)
Static notebook | Dynamic notebook: see test 1, question 32 (https://flrs.github.io/spark_practice_tests_code/#1/32.html)
Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions


QUESTION: 33

The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.

Find the error.
Code block:
transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")

A. The "outer" argument should be eliminated, since "outer" is the default join type.
B. The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.
C. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.
D. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").
E. The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.

Answer(s): C

Explanation:

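Correct code block:
transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer")

Because the two join columns have different names, the join condition has to be passed as a single Column expression that compares them. A list of two Column objects does not express that comparison. The other options do not hold up either: the default join type is "inner", not "outer", and join().outer(), DataFrame.col(), and joinOuter() are not part of the PySpark DataFrame API.

Static notebook | Dynamic notebook: see test 1, question 33 (https://flrs.github.io/spark_practice_tests_code/#1/33.html)
Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions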


QUESTION: 34

Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?

A. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
B. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
C. itemsDf.merge(transactionsDf, "itemsDf.itemId == transactionsDf.storeId", "broadcast")
D. itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
E. itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))

Answer(s): D

Explanation:

The issue with all answers that have "broadcast" as the very last argument is that "broadcast" is not a valid join type. In addition, merge() is a pandas method rather than part of the PySpark DataFrame API. While the entry with "right_outer" is a valid statement, it is not a broadcast join. The item where broadcast() is wrapped around the equality condition is not valid code in Spark. broadcast() needs to be wrapped around the small DataFrame that should be broadcast.

More info: Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 1, Question: 34 ( Databricks import instructions)


QUESTION: 35

Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?

A. DataFrame.repartition(12)
B. DataFrame.coalesce(6).shuffle()
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.repartition(6)

Answer(s): E

Explanation:

DataFrame.repartition(6)
Correct. repartition() always triggers a full shuffle (different from coalesce()).

DataFrame.repartition(12)
No, this would just leave the DataFrame with 12 partitions and not 6.

DataFrame.coalesce(6)
coalesce() does not perform a full shuffle of the data. Whenever you see "full shuffle", you know that you are not dealing with coalesce(). coalesce() merges existing partitions and tries to minimize the amount of data that is sent between executors, although some data may still move when the partitions being merged live on different executors. Here, 12 partitions can easily be reduced to 6 simply by stitching every two partitions into one.

DataFrame.coalesce(6, shuffle=True) and DataFrame.coalesce(6).shuffle()
These statements are not valid Spark API syntax. DataFrame.coalesce() only takes the number of partitions; a shuffle flag exists on RDD.coalesce(), not on the DataFrame method.

More info: "Spark Repartition & Coalesce - Explained" and "Repartition vs Coalesce in Apache Spark" (Rock the JVM Blog)

