Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam (page: 3)
Databricks Certified Associate Developer for Apache Spark 3.5 - Python
Updated on: 21-Feb-2026

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

  1. 10
  2. Same number as the cluster executors
  3. 1
  4. 20

Answer(s): A

Explanation:

The .coalesce(numPartitions) function is used to reduce the number of partitions in a DataFrame. It does not increase the number of partitions. If the specified number of partitions is greater than the current number, it will not have any effect.

From the official Spark documentation:

"coalesce() results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim one or more of the current partitions."
However, if you try to increase partitions using coalesce (e.g., from 10 to 20), the number of partitions remains unchanged.

Hence, df.coalesce(20) will still return a DataFrame with 10 partitions.


Reference:

Apache Spark 3.5 Programming Guide: RDD and DataFrame Operations, coalesce()



Given the following code snippet in my_spark_app.py (snippet not reproduced):

What is the role of the driver node?

  1. The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes
  2. The driver node only provides the user interface for monitoring the application
  3. The driver node holds the DataFrame data and performs all computations locally
  4. The driver node stores the final result after computations are completed by worker nodes

Answer(s): A

Explanation:

In the Spark architecture, the driver node is responsible for orchestrating the execution of a Spark application. It converts user-defined transformations and actions into a logical plan, optimizes it into a physical plan, and then splits the plan into tasks that are distributed to the executor nodes.

As per Databricks and Spark documentation:

"The driver node is responsible for maintaining information about the Spark application, responding to a user's program or input, and analyzing, distributing, and scheduling work across the executors."

This means:

Option A is correct because the driver schedules and coordinates the job execution.

Option B is incorrect because the driver does more than just UI monitoring.

Option C is incorrect since data and computations are distributed across executor nodes.

Option D is incorrect; results are returned to the driver but not stored long-term by it.


Reference:

Databricks Certified Developer Spark 3.5 Documentation: Spark Architecture, Driver vs. Executors



A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:



import hashlib

import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives an error (message not reproduced).

What should the signature of the shake_256() function be changed to in order to fix this error?

  1. def shake_256(df: pd.Series) -> str:
  2. def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
  3. def shake_256(raw: str) -> str:
  4. def shake_256(df: pd.Series) -> pd.Series:

Answer(s): D

Explanation:

When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the function must operate on a Pandas Series as input and return a Pandas Series as output.

In this case, the original function signature:

def shake_256(raw: str) -> str

operates on one scalar value at a time and is therefore not compatible with Pandas UDFs.

According to the official Spark documentation:

"Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:

def my_udf(s: pd.Series) -> pd.Series:

and it must be registered using pandas_udf(...)."

Therefore, to fix the error:

The function should be updated to:

def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

This will allow Spark to efficiently execute the Pandas UDF in vectorized form, improving performance compared to standard UDFs.
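The corrected function can be exercised with plain pandas, independent of a Spark cluster; a runnable sketch (the sample inputs are assumptions, and the pandas_udf registration is shown as a comment because it needs a live Spark session):

```python
import hashlib
import pandas as pd

def shake_256(raw: pd.Series) -> pd.Series:
    # Operates on a whole batch (a pandas Series) at a time
    return raw.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

# In Spark this would then be registered along the lines of:
# shake_256_udf = sf.pandas_udf(shake_256, StringType())

digests = shake_256(pd.Series(["spark", "databricks"]))
```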


Reference:

Apache Spark 3.5 Documentation: User-Defined Functions, Pandas UDFs



A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

  1. Use the applyInPandas API:
    df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
  2. Use the mapInPandas API:
    df.mapInPandas(mean_func, schema="user_id long, value double").show()
  3. Use a regular Spark UDF:
    from pyspark.sql.functions import mean
    df.groupBy("user_id").agg(mean("value")).show()
  4. Use a Pandas UDF:
    @pandas_udf("double")
    def mean_func(value: pd.Series) -> float:
        return value.mean()
    df.groupby("user_id").agg(mean_func(df["value"])).show()

Answer(s): A

Explanation:

The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.

As per the Databricks documentation:

"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."

Option A is correct and achieves this parallel execution.

Option B (mapInPandas) applies to the entire DataFrame, not grouped operations.

Option C uses built-in aggregation functions, which are efficient but not customizable with Pandas logic.

Option D creates a scalar Pandas UDF which does not perform a group-wise transformation.

Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the only correct answer.


Reference:

Apache Spark 3.5 Documentation: Pandas API on Spark, Grouped Map Pandas UDFs (applyInPandas)



Given:

spark.sparkContext.setLogLevel("<LOG_LEVEL>")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

  1. ALL, DEBUG, FAIL, INFO
  2. ERROR, WARN, TRACE, OFF
  3. WARN, NONE, ERROR, FATAL
  4. FATAL, NONE, INFO, DEBUG

Answer(s): B

Explanation:

The setLogLevel() method of SparkContext sets the logging level on the driver, which controls the verbosity of logs emitted during job execution. Supported levels are inherited from log4j and include the following:

ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

According to official Spark and Databricks documentation:

"Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN."

Among the choices provided, only option B (ERROR, WARN, TRACE, OFF) includes four valid log levels and excludes invalid ones like "FAIL" or "NONE".


Reference:

Apache Spark API docs: SparkContext.setLogLevel



An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

  1. The code fails to execute because the column names employee_id and emp_id do not match automatically
  2. The code fails to execute because it must use on='employee_id' to specify the join column explicitly
  3. The code fails to execute because PySpark does not support joining DataFrames with a different structure
  4. The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2

Answer(s): D

Explanation:

In PySpark, when performing a join between two DataFrames, the columns do not have to share the same name. You can explicitly provide a join condition by comparing specific columns from each DataFrame.

This syntax is correct and fully supported:

df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

This will perform an inner join between df1 and df2 using the employee_id from df1 and emp_id from df2.


Reference:

Databricks Spark 3.5 Documentation: DataFrame API, join()



A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

  1. final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  2. final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  3. final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  4. final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Answer(s): D

Explanation:

To ensure that data written out to disk is sorted, it is important to consider how Spark writes data when saving to Parquet tables. The methods .sort() or .orderBy() apply a global sort but do not guarantee that the sorting will persist in the final output files unless certain conditions are met (e.g. a single partition via .coalesce(1) -- which is not scalable).

Instead, the proper method in distributed Spark processing to ensure rows are sorted within their respective partitions when written out is:

.sortWithinPartitions("column_name")

According to Apache Spark documentation:

"sortWithinPartitions() ensures each partition is sorted by the specified columns. This is useful for downstream systems that require sorted files."

This method works efficiently in distributed settings, avoids the performance bottleneck of global sorting (as in .orderBy() or .sort()), and guarantees each output partition has sorted records -- which meets the requirement of consistently sorted data.

Thus:

Options A and B do not guarantee that the persisted file contents are sorted.

Option C introduces a bottleneck via .coalesce(1) (single partition).

Option D correctly applies sorting within partitions and is scalable.


Reference:

Databricks and Apache Spark 3.5 Documentation: DataFrame API, sortWithinPartitions()



In the code block (not reproduced here), aggDF contains aggregations on a streaming DataFrame.

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

  1. complete
  2. append
  3. replace
  4. aggregate

Answer(s): A

Explanation:

The correct output mode for streaming aggregations that need to output the full updated results at each trigger is "complete".

From the official documentation:

"complete: The entire updated result table will be output to the sink every time there is a trigger."

This is ideal for aggregations, such as counts or averages grouped by a key, where the result table changes incrementally over time.

append: only outputs newly added rows

replace and aggregate: invalid values for output mode


Reference:

Spark Structured Streaming Programming Guide: Output Modes


