-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam dumps PDF (Page: 3)

QUESTION: 17

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

A) 10
B) Same number as the cluster executors
C) 1
D) 20

Answer(s): A

Explanation:

The .coalesce(numPartitions) function is used to reduce the number of partitions in a DataFrame. It does not increase the number of partitions. If the specified number of partitions is greater than the current number, it will not have any effect.

From the official Spark documentation:

"coalesce() results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim one or more of the current partitions."
However, if you try to increase partitions using coalesce (e.g., from 10 to 20), the number of partitions remains unchanged.

Hence, df.coalesce(20) will still return a DataFrame with 10 partitions.
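
The rule above can be sketched with a toy model in plain Python. This is not Spark's implementation; the `coalesce` function and the list-of-lists partitions are assumptions made only for illustration. It shows that merging partitions can lower the count but never raise it. In PySpark itself, the count can be checked with df.coalesce(20).rdd.getNumPartitions(), and repartition(20) is the call that can increase it (at the cost of a shuffle).

```python
def coalesce(partitions, num_partitions):
    """Toy model: merge existing partitions into at most num_partitions groups.

    Partitions are only merged, never split, so the result cannot
    contain more partitions than the input.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    target = min(len(partitions), num_partitions)
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        # Assign each input partition to one output partition,
        # keeping neighbouring partitions together.
        merged[i * target // len(partitions)].extend(part)
    return merged


ten_partitions = [[i] for i in range(10)]
print(len(coalesce(ten_partitions, 20)))  # 10 (request to grow is ignored)
print(len(coalesce(ten_partitions, 4)))   # 4  (request to shrink is honoured)
```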

Reference:

Apache Spark 3.5 Programming Guide RDD and DataFrame Operations coalesce()

QUESTION: 18

Given the following code snippet in my_spark_app.py:

What is the role of the driver node?

A) The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes
B) The driver node only provides the user interface for monitoring the application
C) The driver node holds the DataFrame data and performs all computations locally
D) The driver node stores the final result after computations are completed by worker nodes

Answer(s): A

Explanation:

In the Spark architecture, the driver node is responsible for orchestrating the execution of a Spark application. It converts user-defined transformations and actions into a logical plan, optimizes it into a physical plan, and then splits the plan into tasks that are distributed to the executor nodes.

As per Databricks and Spark documentation:

"The driver node is responsible for maintaining information about the Spark application, responding to a user's program or input, and analyzing, distributing, and scheduling work across the executors."

This means:

Option A is correct because the driver schedules and coordinates the job execution.

Option B is incorrect because the driver does more than just UI monitoring.

Option C is incorrect since data and computations are distributed across executor nodes.

Option D is incorrect; results are returned to the driver but not stored long-term by it.
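
The division of labour described above can be illustrated with a small plain-Python analogy. It is only a sketch: `run_job` is a hypothetical name, and threads stand in for executors, which in a real cluster are separate JVM processes on worker nodes. The "driver" function splits the data into partitions, schedules one task per partition, and combines the partial results that come back.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_partitions, num_executors):
    """Toy 'driver': partition the data, schedule tasks, combine results."""
    # The driver decides how the work is split (one task per partition).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # The pool plays the role of the executors that run the tasks.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_sums = list(executors.map(sum, partitions))
    # Only the small partial results return to the driver.
    return sum(partial_sums)


print(run_job(list(range(100)), num_partitions=8, num_executors=4))  # 4950
```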

Reference:

Databricks Certified Developer Spark 3.5 Documentation Spark Architecture Driver vs Executors.

QUESTION: 19

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:

import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives the error:

What should the signature of the shake_256() function be changed to in order to fix this error?

A) def shake_256(df: pd.Series) -> str:
B) def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
C) def shake_256(raw: str) -> str:
D) def shake_256(df: pd.Series) -> pd.Series:

Answer(s): D

Explanation:

When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the function must operate on a Pandas Series as input and return a Pandas Series as output.

In this case, the original function signature:

def shake_256(raw: str) -> str takes and returns a single value, so it is not compatible with Pandas UDFs, which receive a whole batch of values at once.

According to the official Spark documentation:

"Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:

def my_udf(s: pd.Series) -> pd.Series:

and it must be registered using pandas_udf(...)."

Therefore, to fix the error:

The function should be updated to:

def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

This will allow Spark to efficiently execute the Pandas UDF in vectorized form, improving performance compared to standard UDFs.
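
The per-value logic that the Pandas UDF applies to each element of a batch can be checked without Spark. In the minimal sketch below, a plain Python list stands in for pd.Series, and `shake_256_batch` is a hypothetical helper used only for illustration. It also confirms that hexdigest(20) yields 20 bytes, that is, 40 hexadecimal characters per value.

```python
import hashlib


def shake_256(raw: str) -> str:
    # Per-value logic from the original UDF: 20-byte SHAKE-256 digest as hex.
    return hashlib.shake_256(raw.encode()).hexdigest(20)


def shake_256_batch(values):
    # Batch-in, batch-out shape of a Series-to-Series Pandas UDF,
    # with a list standing in for pd.Series.
    return [shake_256(v) for v in values]


digests = shake_256_batch(["alice", "bob"])
print(len(digests))     # 2  (one output per input)
print(len(digests[0]))  # 40 (20 bytes rendered as hex)
```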

Reference:

Apache Spark 3.5 Documentation User-Defined Functions Pandas UDFs

QUESTION: 20

A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

A) Use the applyInPandas API:
df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B) Use the mapInPandas API:
df.mapInPandas(mean_func, schema="user_id long, value double").show()

C) Use a regular Spark UDF:
from pyspark.sql.functions import mean
df.groupBy("user_id").agg(mean("value")).show()

D) Use a Pandas UDF:
@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()
df.groupby("user_id").agg(mean_func(df["value"])).show()

Answer(s): A

Explanation:

The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.

As per the Databricks documentation:

"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."

Option A is correct and achieves this parallel execution.

Option B (mapInPandas) applies to the entire DataFrame, not grouped operations.

Option C uses built-in aggregation functions, which are efficient but not customizable with Pandas logic.

Option D defines a Series-to-scalar (grouped aggregate) Pandas UDF, which reduces each group to a single value rather than applying a pandas DataFrame function to each group.

Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the only correct answer.
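
The grouped-map pattern behind applyInPandas (split rows by key, apply a function to each group, concatenate the outputs) can be sketched in plain Python. This is a toy model, not the Spark API: `apply_per_group` is a hypothetical helper, lists of dicts stand in for pandas DataFrames, and the groups are processed sequentially here, whereas Spark distributes them across executors.

```python
from collections import defaultdict
from statistics import mean


def apply_per_group(rows, key, func):
    """Toy grouped map: split by key, apply func per group, concatenate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    output = []
    for group_key in sorted(groups):
        # func receives every row of one group and returns output rows.
        output.extend(func(groups[group_key]))
    return output


def mean_func(group_rows):
    # One output row per group, matching "user_id long, value double".
    return [{
        "user_id": group_rows[0]["user_id"],
        "value": mean(r["value"] for r in group_rows),
    }]


rows = [
    {"user_id": 1, "value": 2.0},
    {"user_id": 1, "value": 4.0},
    {"user_id": 2, "value": 10.0},
]
print(apply_per_group(rows, "user_id", mean_func))
# [{'user_id': 1, 'value': 3.0}, {'user_id': 2, 'value': 10.0}]
```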

Reference:

Apache Spark 3.5 Documentation Pandas API on Spark Grouped Map Pandas UDFs (applyInPandas)

QUESTION: 21

Given:

spark.sparkContext.setLogLevel("<LOG_LEVEL>")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

A) ALL, DEBUG, FAIL, INFO
B) ERROR, WARN, TRACE, OFF
C) WARN, NONE, ERROR, FATAL
D) FATAL, NONE, INFO, DEBUG

Answer(s): B

Explanation:

The setLogLevel() method of SparkContext sets the logging level on the driver, which controls the verbosity of logs emitted during job execution. Supported levels are inherited from log4j and include the following:

ALL

DEBUG

ERROR

FATAL

INFO

OFF

TRACE

WARN

According to official Spark and Databricks documentation:

"Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN."

Among the choices provided, only option B (ERROR, WARN, TRACE, OFF) includes four valid log levels and excludes invalid ones like "FAIL" or "NONE".
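
The elimination above can be reproduced mechanically. The sketch below checks each answer option against the eight level names listed in the explanation; `invalid_levels` and the `options` dictionary are hypothetical names introduced only for this check.

```python
# Level names as listed in the explanation above.
VALID_LOG_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def invalid_levels(candidates):
    """Return the candidate names that are not valid log levels."""
    return [c for c in candidates if c.upper() not in VALID_LOG_LEVELS]


options = {
    "A": ["ALL", "DEBUG", "FAIL", "INFO"],
    "B": ["ERROR", "WARN", "TRACE", "OFF"],
    "C": ["WARN", "NONE", "ERROR", "FATAL"],
    "D": ["FATAL", "NONE", "INFO", "DEBUG"],
}

for label, levels in options.items():
    print(label, invalid_levels(levels))
# A ['FAIL']
# B []
# C ['NONE']
# D ['NONE']
```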

Reference:

Apache Spark API docs SparkContext.setLogLevel

QUESTION: 22

An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

A) The code fails to execute because the column names employee_id and emp_id do not match automatically
B) The code fails to execute because it must use on='employee_id' to specify the join column explicitly
C) The code fails to execute because PySpark does not support joining DataFrames with a different structure
D) The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2

Answer(s): D

Explanation:

In PySpark, when performing a join between two DataFrames, the columns do not have to share the same name. You can explicitly provide a join condition by comparing specific columns from each DataFrame.

This syntax is correct and fully supported:

df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

This will perform an inner join between df1 and df2 using the employee_id from df1 and emp_id from df2.
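
The semantics of an inner join on differently named key columns can be sketched in plain Python. This is a toy model, not PySpark: `inner_join` is a hypothetical helper, lists of dicts stand in for DataFrames, and the sample rows are made up. As with a join on a column expression in PySpark, both key columns appear in the output.

```python
def inner_join(left, right, left_key, right_key):
    """Toy inner join: keep only row pairs whose key values are equal."""
    # Index the right side by its key for quick lookup.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Emit one merged row per matching (left, right) pair.
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[left_key], [])
    ]


df1_rows = [{"employee_id": 1, "name": "Ana"}, {"employee_id": 2, "name": "Ben"}]
df2_rows = [{"emp_id": 2, "department": "Ops"}, {"emp_id": 3, "department": "HR"}]

print(inner_join(df1_rows, df2_rows, "employee_id", "emp_id"))
# [{'employee_id': 2, 'name': 'Ben', 'emp_id': 2, 'department': 'Ops'}]
```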

Reference:

Databricks Spark 3.5 Documentation DataFrame API join()

QUESTION: 23

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

A)
final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

B)
final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

C)
final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

D)
final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Answer(s): D

Explanation:

To ensure that data written out to disk is sorted, it is important to consider how Spark writes data when saving to Parquet tables. The methods .sort() or .orderBy() apply a global sort but do not guarantee that the sorting will persist in the final output files unless certain conditions are met (e.g. a single partition via .coalesce(1) -- which is not scalable).

Instead, the proper method in distributed Spark processing to ensure rows are sorted within their respective partitions when written out is:

.sortWithinPartitions("column_name")

According to Apache Spark documentation:

"sortWithinPartitions() ensures each partition is sorted by the specified columns. This is useful for downstream systems that require sorted files."

This method works efficiently in distributed settings, avoids the cost of the global sort performed by .orderBy() or .sort(), and guarantees that the records within each output partition are sorted. It does not order rows across partitions, so this answer treats the requirement as sorted output files rather than a single global ordering.

Thus:

Option A and B do not guarantee the persisted file contents are sorted.

Option C introduces a bottleneck via .coalesce(1) (single partition).

Option D correctly applies sorting within partitions and is scalable.
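
The difference between sorting within partitions and sorting globally can be made concrete with a toy model in plain Python. The helper names and the sample values are assumptions for illustration only; lists of integers stand in for partitions of market_time values.

```python
def sort_within_partitions(partitions, key):
    """Sort each partition independently; no data moves between partitions."""
    return [sorted(p, key=key) for p in partitions]


def global_sort(partitions, key):
    """Sort all rows together, as a total ordering would require."""
    return sorted((row for p in partitions for row in p), key=key)


partitions = [[5, 1, 9], [4, 8, 2]]

within = sort_within_partitions(partitions, key=lambda t: t)
print(within)  # [[1, 5, 9], [2, 4, 8]]

# Each partition is sorted, but reading them back to back is not.
flattened = [t for p in within for t in p]
print(flattened == sorted(flattened))  # False

print(global_sort(partitions, key=lambda t: t))  # [1, 2, 4, 5, 8, 9]
```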

Reference:

Databricks & Apache Spark 3.5 Documentation DataFrame API sortWithinPartitions()

QUESTION: 24

In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A) complete
B) append
C) replace
D) aggregate

Answer(s): A

Explanation:

The correct output mode for streaming aggregations that need to output the full updated results at each trigger is "complete".

From the official documentation:

"complete: The entire updated result table will be output to the sink every time there is a trigger."

This is ideal for aggregations, such as counts or averages grouped by a key, where the result table changes incrementally over time.

append: only outputs newly added rows
replace and aggregate: invalid values for output mode
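
What "complete" means for a running aggregation can be sketched with a toy model in plain Python. It is not Structured Streaming itself: `run_complete_mode` is a hypothetical function, each inner list stands in for one micro-batch, and the aggregation is a simple count per key. After every trigger the whole result table is emitted, including rows that did not change.

```python
from collections import Counter


def run_complete_mode(micro_batches):
    """Toy model: emit the entire aggregated table after every micro-batch."""
    counts = Counter()
    outputs = []
    for batch in micro_batches:
        counts.update(batch)  # incrementally update the running counts
        # Complete mode: write out the full table, not only the changes.
        outputs.append(dict(sorted(counts.items())))
    return outputs


for table in run_complete_mode([["a", "b"], ["a"], ["c"]]):
    print(table)
# {'a': 1, 'b': 1}
# {'a': 2, 'b': 1}
# {'a': 2, 'b': 1, 'c': 1}
```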

Reference:

Spark Structured Streaming Programming Guide Output Modes

Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam (page: 3) Databricks Certified Associate Developer for Apache Spark 3.5 - Python Updated on: 21-Feb-2026
