Databricks Certified Associate Developer for Apache Spark 3.5 - Python
Updated on: 09-Apr-2026

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

  1. 10
  2. Same number as the cluster executors
  3. 1
  4. 20

Answer(s): A

Explanation:

The .coalesce(numPartitions) function is used to reduce the number of partitions in a DataFrame. It does not increase the number of partitions. If the specified number of partitions is greater than the current number, it will not have any effect.

From the official Spark documentation:

"coalesce() results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim one or more of the current partitions."
However, if you try to increase partitions using coalesce (e.g., from 10 to 20), the number of partitions remains unchanged.

Hence, df.coalesce(20) will still return a DataFrame with 10 partitions.
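A quick way to see this behaviour is the following minimal sketch, assuming an active SparkSession named spark (the sample data is illustrative):

df = spark.range(100).repartition(10)
print(df.rdd.getNumPartitions())                  # 10

result = df.coalesce(20)                          # coalesce cannot increase the partition count
print(result.rdd.getNumPartitions())              # still 10

# repartition(), by contrast, performs a shuffle and can increase partitions
print(df.repartition(20).rdd.getNumPartitions())  # 20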


Reference:

Apache Spark 3.5 Programming Guide, RDD and DataFrame Operations, coalesce()



Given the following code snippet in my_spark_app.py:



What is the role of the driver node?

  1. The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes
  2. The driver node only provides the user interface for monitoring the application
  3. The driver node holds the DataFrame data and performs all computations locally
  4. The driver node stores the final result after computations are completed by worker nodes

Answer(s): A

Explanation:

In the Spark architecture, the driver node is responsible for orchestrating the execution of a Spark application. It converts user-defined transformations and actions into a logical plan, optimizes it into a physical plan, and then splits the plan into tasks that are distributed to the executor nodes.

As per Databricks and Spark documentation:

"The driver node is responsible for maintaining information about the Spark application, responding to a user's program or input, and analyzing, distributing, and scheduling work across the executors."

This means:

Option A is correct because the driver schedules and coordinates the job execution.

Option B is incorrect because the driver does more than just UI monitoring.

Option C is incorrect since data and computations are distributed across executor nodes.

Option D is incorrect; results are returned to the driver but not stored long-term by it.
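
As a minimal sketch of what a driver program such as my_spark_app.py could look like (the actual snippet is not shown above, so the names and data here are illustrative assumptions): the driver builds the SparkSession and the lazy logical plan, and only when an action runs does it split the work into tasks for the executors.

from pyspark.sql import SparkSession

# This script runs in the driver process.
spark = SparkSession.builder.appName("my_spark_app").getOrCreate()

df = spark.range(1_000_000)           # transformation: only recorded in the plan
even = df.filter(df.id % 2 == 0)      # still lazy, nothing executes yet

# The action below makes the driver optimize the plan, split it into tasks,
# and distribute those tasks to the executors; only the small count comes back.
print(even.count())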


Reference:

Databricks Certified Developer Spark 3.5 Documentation, Spark Architecture: Driver vs. Executors



A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:



import hashlib

import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives the error:

What should the signature of the shake_256() function be changed to in order to fix this error?

  1. def shake_256(df: pd.Series) -> str:
  2. def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
  3. def shake_256(raw: str) -> str:
  4. def shake_256(df: pd.Series) -> pd.Series:

Answer(s): D

Explanation:

When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the function must operate on a Pandas Series as input and return a Pandas Series as output.

In this case, the original function signature, def shake_256(raw: str) -> str, operates on a single scalar value and is therefore not compatible with a Series-to-Series Pandas UDF.

According to the official Spark documentation:

"Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:

def my_udf(s: pd.Series) -> pd.Series:

and it must be registered using pandas_udf(...)."

Therefore, to fix the error:

The function should be updated to:

def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

This will allow Spark to efficiently execute the Pandas UDF in vectorized form, improving performance compared to standard UDFs.
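
Put together, a minimal end-to-end sketch could look like the following, assuming an active SparkSession named spark and an installed pyarrow, which Pandas UDFs require (the column name raw and the sample rows are illustrative):

import hashlib

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

# Series-to-Series Pandas UDF: receives and returns a pandas.Series
def shake_256(raw: pd.Series) -> pd.Series:
    return raw.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

shake_256_udf = sf.pandas_udf(shake_256, StringType())

df = spark.createDataFrame([("alpha",), ("beta",)], ["raw"])
df.select(shake_256_udf("raw").alias("hashed")).show(truncate=False)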


Reference:

Apache Spark 3.5 Documentation, User-Defined Functions, Pandas UDFs



A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

  1. Use the applyInPandas API:
    df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
  2. Use the mapInPandas API:
    df.mapInPandas(mean_func, schema="user_id long, value double").show()
  3. Use a regular Spark UDF:
    from pyspark.sql.functions import mean
    df.groupBy("user_id").agg(mean("value")).show()
  4. Use a Pandas UDF:
    @pandas_udf("double")
    def mean_func(value: pd.Series) -> float:
        return value.mean()
    df.groupby("user_id").agg(mean_func(df["value"])).show()

Answer(s): A

Explanation:

The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.

As per the Databricks documentation:

"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."

Option A is correct and achieves this parallel execution.

Option B (mapInPandas) applies to the entire DataFrame, not grouped operations.

Option C uses built-in aggregation functions, which are efficient but not customizable with Pandas logic.

Option D creates a scalar Pandas UDF which does not perform a group-wise transformation.

Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the only correct answer.
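
A runnable sketch of Option A follows; this is a minimal example in which the SparkSession name spark, the sample rows, and the body of mean_func are illustrative assumptions:

import pandas as pd

df = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 30.0)],
    ["user_id", "value"],
)

# Grouped-map function: each group arrives as a pandas DataFrame and a
# pandas DataFrame matching the declared schema is returned.
def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "value":   [pdf["value"].mean()]})

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()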


Reference:

Apache Spark 3.5 Documentation, Pandas API on Spark, Grouped Map Pandas UDFs (applyInPandas)



Given:

spark.sparkContext.setLogLevel("<LOG_LEVEL>")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

  1. ALL, DEBUG, FAIL, INFO
  2. ERROR, WARN, TRACE, OFF
  3. WARN, NONE, ERROR, FATAL
  4. FATAL, NONE, INFO, DEBUG

Answer(s): B

Explanation:

The setLogLevel() method of SparkContext sets the logging level on the driver, which controls the verbosity of logs emitted during job execution. Supported levels are inherited from log4j and include the following:

ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

According to official Spark and Databricks documentation:

"Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN."

Among the choices provided, only option B (ERROR, WARN, TRACE, OFF) includes four valid log levels and excludes invalid ones like "FAIL" or "NONE".
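
For illustration, a minimal sketch assuming an active SparkSession named spark:

spark.sparkContext.setLogLevel("ERROR")   # suppress everything below ERROR
spark.range(10).count()                   # runs with far less log output

spark.sparkContext.setLogLevel("WARN")    # back to a more verbose level
# Values outside the valid set (e.g. "FAIL" or "NONE") are rejected with an error.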


Reference:

Apache Spark API docs, SparkContext.setLogLevel



An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

  1. The code fails to execute because the column names employee_id and emp_id do not match automatically
  2. The code fails to execute because it must use on='employee_id' to specify the join column explicitly
  3. The code fails to execute because PySpark does not support joining DataFrames with a different structure
  4. The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2

Answer(s): D

Explanation:

In PySpark, when performing a join between two DataFrames, the columns do not have to share the same name. You can explicitly provide a join condition by comparing specific columns from each DataFrame.

This syntax is correct and fully supported:

df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

This will perform an inner join between df1 and df2 using the employee_id from df1 and emp_id from df2.
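
A small illustration, as a minimal sketch in which the SparkSession name spark and the sample rows are assumptions:

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["employee_id", "name"])
df2 = spark.createDataFrame([(1, "Engineering"), (3, "Finance")], ["emp_id", "department"])

result = df1.join(df2, df1.employee_id == df2.emp_id, how="inner")
result.show()
# Only employee_id 1 has a matching emp_id, so a single row is returned.
# Both join columns remain in the output; drop one afterwards if it is not needed.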


Reference:

Databricks Spark 3.5 Documentation, DataFrame API, join()



A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

  1. final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  2. final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  3. final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  4. final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Answer(s): D

Explanation:

To ensure that data written out to disk is sorted, it is important to consider how Spark writes data when saving to Parquet tables. The methods .sort() or .orderBy() apply a global sort but do not guarantee that the sorting will persist in the final output files unless certain conditions are met (e.g. a single partition via .coalesce(1) -- which is not scalable).

Instead, the proper method in distributed Spark processing to ensure rows are sorted within their respective partitions when written out is:

.sortWithinPartitions("column_name")

According to Apache Spark documentation:

"sortWithinPartitions() ensures each partition is sorted by the specified columns. This is useful for downstream systems that require sorted files."

This method works efficiently in distributed settings, avoids the performance bottleneck of global sorting (as in .orderBy() or .sort()), and guarantees each output partition has sorted records -- which meets the requirement of consistently sorted data.

Thus:

Option A and B do not guarantee the persisted file contents are sorted.

Option C introduces a bottleneck via .coalesce(1) (single partition).

Option D correctly applies sorting within partitions and is scalable.


Reference:

Databricks & Apache Spark 3.5 Documentation, DataFrame API, sortWithinPartitions()



In the code block below, aggDF contains aggregations on a streaming DataFrame:



Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

  1. complete
  2. append
  3. replace
  4. aggregate

Answer(s): A

Explanation:

The correct output mode for streaming aggregations that need to output the full updated results at each trigger is "complete".

From the official documentation:

"complete: The entire updated result table will be output to the sink every time there is a trigger."

This is ideal for aggregations, such as counts or averages grouped by a key, where the result table changes incrementally over time.

append: only outputs newly added rows.

replace and aggregate: invalid values for output mode.
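
A minimal hedged sketch of the pattern is shown below; the rate source and the windowed count stand in for the hidden code block, and an active SparkSession named spark is assumed:

from pyspark.sql import functions as F

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

aggDF = stream.groupBy(F.window("timestamp", "10 seconds")).agg(F.count("*").alias("events"))

query = (aggDF.writeStream
         .outputMode("complete")   # re-emit the entire result table on every trigger
         .format("console")
         .start())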


Reference:

Spark Structured Streaming Programming Guide, Output Modes


