Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam (page: 1)
Databricks Certified Associate Developer for Apache Spark 3.5 - Python
Updated on: 21-Feb-2026

A data scientist of an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

  A. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
  B. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
  C. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
  D. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")

Answer(s): A

Explanation:

To remove specific columns from a PySpark DataFrame, the drop() method is used. This method returns a new DataFrame without the specified columns. The correct syntax for dropping multiple columns is to pass each column name as a separate argument to the drop() method.

Correct Usage:

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

This line of code will return a new DataFrame df_user_non_pii that excludes the specified PII columns.

Explanation of Options:

A. Correct. Uses the drop() method with multiple column names passed as separate arguments, which is the standard and correct usage in PySpark.

B. As printed, this option is identical to Option A and would work equally well; the answer key nonetheless lists only A. Had the column names been misquoted or misspelled, the call would fail.

C. Incorrect. DataFrame has no dropfields() method. The similarly named Column.dropFields() exists, but it drops fields from nested StructType columns, not top-level DataFrame columns.

D. Incorrect. Passing a single comma-separated string of column names to dropfields() is not valid syntax in PySpark.
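For readers more familiar with pandas, the same column removal looks like this (an illustrative aside, not part of the exam answer; note that pandas uses a columns= keyword rather than positional names, and likewise returns a new DataFrame):

```python
import pandas as pd

# Hypothetical user data with the same PII columns as df_user
df_user = pd.DataFrame({
    "user_id": [1, 2],
    "first_name": ["Ana", "Bob"],
    "last_name": ["Silva", "Lee"],
    "email": ["a@x.com", "b@x.com"],
    "birthdate": ["1990-01-01", "1985-05-05"],
})

# drop() returns a new DataFrame; the original is left unchanged
df_user_non_pii = df_user.drop(columns=["first_name", "last_name", "email", "birthdate"])
print(list(df_user_non_pii.columns))  # ['user_id']
```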


Reference:

PySpark Documentation: DataFrame.drop

Stack Overflow Discussion: How to delete columns in PySpark DataFrame



A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:



Which operation is supported with streaming_df?

  A. streaming_df.select(countDistinct("Name"))
  B. streaming_df.groupby("Id").count()
  C. streaming_df.orderBy("timestamp").limit(4)
  D. streaming_df.filter(col("count") < 30).show()

Answer(s): B

Explanation:

In Structured Streaming, only a limited subset of operations is supported due to the nature of unbounded data. Operations like sorting (orderBy) and global aggregation (countDistinct) require a full view of the dataset, which is not possible with streaming data unless specific watermarks or windows are defined.

Review of Each Option:

A. select(countDistinct("Name"))
Not allowed -- a global aggregation like countDistinct() requires the full dataset and is not supported directly in streaming without watermark and windowing logic.


Reference:

Databricks Structured Streaming Guide - Unsupported Operations.

B. groupby("Id").count()
Supported -- streaming aggregations over a key (like groupBy("Id")) are supported. Spark maintains intermediate state for each key.


Databricks Docs: Aggregations in Structured Streaming (https://docs.databricks.com/structured-streaming/aggregation.html)

C. orderBy("timestamp").limit(4)
Not allowed -- sorting and limiting require a full view of the stream (which is unbounded), so this is unsupported on streaming DataFrames.


Spark Structured Streaming - Unsupported Operations (ordering without watermark/window is not allowed).

D. filter(col("count") < 30).show()
Not allowed -- while filtering itself is supported on streams, show() is a blocking operation used for debugging batch DataFrames; it is not supported on streaming DataFrames.


Structured Streaming Programming Guide - output operations like show() are not supported.

Reference Extract from Official Guide:

"Operations like orderBy, limit, show, and countDistinct are not supported in Structured Streaming because they require the full dataset to compute a result. Use groupBy(...).agg(...) instead for incremental aggregations."
-- Databricks Structured Streaming Programming Guide
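The reason keyed aggregation works on an unbounded stream can be sketched without Spark at all: the engine only needs one running counter per key, updated micro-batch by micro-batch. A minimal pure-Python sketch of that idea (illustrative only; Spark's real state store is far more involved):

```python
from collections import defaultdict

def update_counts(state, micro_batch):
    """Fold one micro-batch of records into per-key running counts."""
    for record in micro_batch:
        state[record["Id"]] += 1
    return state

state = defaultdict(int)  # Spark keeps analogous per-key state between triggers

# Two micro-batches arriving over time, each processed incrementally
update_counts(state, [{"Id": 1}, {"Id": 2}, {"Id": 1}])
update_counts(state, [{"Id": 2}, {"Id": 1}])

print(dict(state))  # {1: 3, 2: 2}
```

A global sort or limit, by contrast, cannot be maintained this way: any future record could change the ordering, which is why orderBy/limit require the full (infinite) dataset.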



An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:



def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

  A. Convert the Pandas UDF to a PySpark UDF
  B. Convert the Pandas UDF from a Series -> Series UDF to a Series -> Scalar UDF
  C. Run the in_spanish_inner() function in a mapInPandas() function call
  D. Convert the Pandas UDF from a Series -> Series UDF to an Iterator[Series] -> Iterator[Series] UDF

Answer(s): D

Explanation:

The provided code defines a Pandas UDF of type Series-to-Series, where a new instance of the language model is created on each call, which happens per batch. This is inefficient and results in significant overhead due to repeated model initialization.

To reduce the frequency of model loading, the engineer should convert the UDF to an iterator-based Pandas UDF (Iterator[pd.Series] -> Iterator[pd.Series]). This allows the model to be loaded once per executor and reused across multiple batches, rather than once per call.

From the official Databricks documentation:

"Iterator of Series to Iterator of Series UDFs are useful when the UDF initialization is expensive... For example, loading a ML model once per executor rather than once per row/batch." -- Databricks Official Docs: Pandas UDFs

Correct implementation looks like:

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def translate_udf(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang='es')  # loaded once per call, not per batch
    for batch in batch_iter:
        yield batch.apply(model)

This refactor ensures that get_translation_model() is invoked once per UDF invocation (roughly once per partition), rather than once per batch, significantly improving pipeline performance.



A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is too large to fit entirely in memory.

What is the likely behavior when Spark runs out of memory to store the DataFrame?

  A. Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in memory, the DataFrame is stored and retrieved from the disk entirely.
  B. Spark splits the DataFrame evenly between memory and disk, ensuring balanced storage utilization.
  C. Spark will store as much data as possible in memory and spill the rest to disk when memory is full, continuing processing with performance overhead.
  D. Spark stores the frequently accessed rows in memory and less frequently accessed rows on disk, utilizing both resources to offer balanced performance.

Answer(s): C

Explanation:

When using the MEMORY_AND_DISK storage level, Spark attempts to cache as much of the DataFrame in memory as possible. If the DataFrame does not fit entirely in memory, Spark will store the remaining partitions on disk. This allows processing to continue, albeit with a performance overhead due to disk I/O.

As per the Spark documentation:

"MEMORY_AND_DISK: It stores partitions that do not fit in memory on disk and keeps the rest in memory. This can be useful when working with datasets that are larger than the available memory."

-- Perficient Blogs: Spark - StorageLevel

This behavior ensures that Spark can handle datasets larger than the available memory by spilling excess data to disk, thus preventing job failures due to memory constraints.
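The spill behavior can be sketched as a tiny cache model (illustrative only; Spark actually makes this decision per partition and its memory management is far more sophisticated):

```python
def cache_with_spill(partitions, memory_slots):
    """Place partitions in memory until it is full, then spill the rest to disk."""
    memory, disk = [], []
    for p in partitions:
        if len(memory) < memory_slots:
            memory.append(p)   # fast path: fits in memory
        else:
            disk.append(p)     # spill: still cached, but slower to read back
    return memory, disk

memory, disk = cache_with_spill(["p0", "p1", "p2", "p3", "p4"], memory_slots=3)
print(memory)  # ['p0', 'p1', 'p2']
print(disk)    # ['p3', 'p4']
```

The key point matching option C: nothing is dropped or duplicated; the overflow simply moves to the slower tier.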



A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

  A. By configuring the option checkpointLocation during readStream
  B. By configuring the option recoveryLocation during the SparkSession initialization
  C. By configuring the option recoveryLocation during writeStream
  D. By configuring the option checkpointLocation during writeStream

Answer(s): D

Explanation:

To enable a Structured Streaming query to recover from failures or intentional shutdowns, it is essential to specify the checkpointLocation option during the writeStream operation. This checkpoint location stores the progress information of the streaming query, allowing it to resume from where it left off.

According to the Databricks documentation:

"You must specify the checkpointLocation option before you run a streaming query, as in the following example:

.option("checkpointLocation", "/path/to/checkpoint/dir")

.toTable("catalog.schema.table")

-- Databricks Documentation: Structured Streaming checkpoints

By setting the checkpointLocation during writeStream, Spark can maintain state information and ensure exactly-once processing semantics, which are crucial for reliable streaming applications.
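The recovery mechanism can be sketched in plain Python: progress (here, the last completed offset) is committed to durable storage after each record, and a restarted process reads it back and continues from there. This is illustrative only; Spark's checkpoints also persist aggregation state and query metadata, not just offsets:

```python
import json
import os
import tempfile

def process_stream(records, checkpoint_path):
    """Process records, persisting the last completed offset after each one."""
    start = 0
    if os.path.exists(checkpoint_path):        # recovery: resume where we left off
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    processed = []
    for offset in range(start, len(records)):
        processed.append(records[offset])
        with open(checkpoint_path, "w") as f:  # commit progress durably
            json.dump({"offset": offset + 1}, f)
    return processed

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run = process_stream(["a", "b"], path)        # processes a, b
second_run = process_stream(["a", "b", "c"], path)  # resumes: only c
print(first_run, second_run)  # ['a', 'b'] ['c']
```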



A data scientist is analyzing a large dataset and has written a PySpark script that includes several transformations and actions on a DataFrame. The script ends with a collect() action to retrieve the results.

How does Apache Spark's execution hierarchy process the operations when the data scientist runs this script?

  A. The script is first divided into multiple applications, then each application is split into jobs, stages, and finally tasks.
  B. The entire script is treated as a single job, which is then divided into multiple stages, and each stage is further divided into tasks based on data partitions.
  C. The collect() action triggers a job, which is divided into stages at shuffle boundaries, and each stage is split into tasks that operate on individual data partitions.
  D. Spark creates a single task for each transformation and action in the script, and these tasks are grouped into stages and jobs based on their dependencies.

Answer(s): C

Explanation:

In Apache Spark, the execution hierarchy is structured as follows:

Application: The highest-level unit, representing the user program built on Spark.

Job: Triggered by an action (e.g., collect(), count()). Each action corresponds to a job.

Stage: A job is divided into stages based on shuffle boundaries. Each stage contains tasks that can be executed in parallel.

Task: The smallest unit of work, representing a single operation applied to a partition of the data.

When the collect() action is invoked, Spark initiates a job. This job is then divided into stages at points where data shuffling is required (i.e., wide transformations). Each stage comprises tasks that are distributed across the cluster's executors, operating on individual data partitions.

This hierarchical execution model allows Spark to efficiently process large-scale data by parallelizing tasks and optimizing resource utilization.



A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:



fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?

  A. fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')
  B. fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))
  C. fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))
  D. fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')

Answer(s): A

Explanation:

In Spark, the default join type is an inner join, which returns only the rows with matching keys in both DataFrames. To retain all records from the left DataFrame (purch_df) and include matching records from the right DataFrame (cust_df), a left outer join should be used.

By specifying the join type as 'left', the modified code ensures that all records from purch_df are preserved, and matching records from cust_df are included. Records in purch_df without a corresponding match in cust_df will have null values for the columns from cust_df.

This approach is consistent with standard SQL join operations and is supported in PySpark's DataFrame API.
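The difference between the default inner join and a left join can be seen with plain Python data structures (illustrative only; the column names mirror the question but the data is hypothetical):

```python
purchases = [  # fact rows keyed by customer_id
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 20},
    {"customer_id": 3, "amount": 30},  # no matching customer record
]
customers = {1: "Ana", 2: "Bob"}  # custid -> name; customer 3 is missing

# Inner join (Spark's default): unmatched purchase rows are dropped
inner = [{**p, "name": customers[p["customer_id"]]}
         for p in purchases if p["customer_id"] in customers]

# Left join: every purchase is kept; missing customers become None (null)
left = [{**p, "name": customers.get(p["customer_id"])} for p in purchases]

print(len(inner))  # 2 -- customer 3's purchase was dropped
print(len(left))   # 3 -- all purchases preserved; name is None for customer 3
```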



A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

  A. The Spark engine requires manual intervention to start executing transformations.
  B. Only actions trigger the execution of the transformation pipeline.
  C. Transformations are executed immediately to build the lineage graph.
  D. The Spark engine optimizes the execution plan during the transformations, causing delays.
  E. Transformations are evaluated lazily.

Answer(s): B,E

Explanation:

Apache Spark employs a lazy evaluation model for transformations. This means that when transformations (e.g., map(), filter()) are applied to a DataFrame, Spark does not execute them immediately. Instead, it builds a logical plan (lineage) of transformations to be applied.

Execution is deferred until an action (e.g., collect(), count(), save()) is called. At that point, Spark's Catalyst optimizer analyzes the logical plan, optimizes it, and then executes the physical plan to produce the result.

This lazy evaluation strategy allows Spark to optimize the execution plan, minimize data shuffling, and improve overall performance by reducing unnecessary computations.
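Lazy evaluation is easy to observe with Python generators, which behave analogously (an analogy only; Spark's Catalyst planning and optimization go well beyond this):

```python
evaluated = []

def transform(xs):
    """A lazy 'transformation': nothing runs until the result is consumed."""
    for x in xs:
        evaluated.append(x)  # side effect so we can see when work happens
        yield x * 2

pipeline = transform([1, 2, 3])  # builds the pipeline; no work yet
print(evaluated)                 # [] -- defining the transformation ran nothing

result = list(pipeline)          # the 'action': consuming it triggers execution
print(result)                    # [2, 4, 6]
print(evaluated)                 # [1, 2, 3] -- work happened only at the action
```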


