Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam (page: 2)
Databricks Certified Associate Developer for Apache Spark 3.5 - Python
Updated on: 09-Apr-2026

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:



The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?


  1. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
    )
  2. regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
    )
  3. regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
    )
  4. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
    )

Answer(s): A

Explanation:

The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers. Furthermore, it asks to retrieve only the smallest 3 region_id values.

Key observations:

.select('region', 'region_id') puts the column order as expected by dict() -- where the first column becomes the key and the second the value.

.sort('region_id') ensures sorting in ascending order so the smallest IDs are first.

.take(3) retrieves exactly 3 rows.

Wrapping the result in dict(...) correctly builds the required Python dictionary: { 'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2 }.

Incorrect options:

Option B flips the order to region_id first, resulting in a dictionary with integer keys -- not what's asked.

Option C uses .limit(3) without sorting, which leads to non-deterministic rows based on partition layout.

Option D sorts in descending order, giving the largest rather than smallest region_ids.

Hence, Option A meets all the requirements precisely.
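Because each Row returned by take(3) behaves like a two-element tuple, dict() can consume the result directly. A minimal plain-Python sketch of the same mechanics (the region names and IDs are illustrative, mirroring the sample output above):

```python
# Toy stand-in for the table rows: each row is a (region, region_id)
# pair, just as a two-column Row object unpacks.
rows = [('AMERICA', 1), ('AFRICA', 0), ('ASIA', 2), ('EUROPE', 3)]

# Sort ascending by region_id and keep the smallest three, as in option A.
smallest_three = sorted(rows, key=lambda r: r[1])[:3]

# dict() treats each pair as key -> value: region -> region_id.
regions = dict(smallest_three)
print(regions)  # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}
```

Swapping the pair order, as option B does, would instead produce integer keys mapping to region names.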



An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment selects only the columns col1 and col2 during the reading process?

  1. spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")
  2. spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")
  3. spark.read.orc("/file/test_data.orc").selected("col1", "col2")
  4. spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Answer(s): D

Explanation:

The correct way to load specific columns from an ORC file is to first load the file using .load() and then apply .select() on the resulting DataFrame. This works with .read.format("orc") as well as the shortcut .read.orc().

df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Why others are incorrect:

A filters first and then selects only col2, so it does not return both requested columns.

B incorrectly tries to use .select() before .load(), which is invalid.

C uses a non-existent .selected() method.

D correctly loads and then selects.


Reference:

Apache Spark SQL API - ORC Format



Given the code fragment:



import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

  1. psdf.to_spark()
  2. psdf.to_pyspark()
  3. psdf.to_pandas()
  4. psdf.to_dataframe()

Answer(s): A

Explanation:

Pandas API on Spark (pyspark.pandas) allows interoperability with PySpark DataFrames. To convert a pyspark.pandas.DataFrame to a standard PySpark DataFrame, you use .to_spark().

Example:

df = psdf.to_spark()

This is the officially supported method as per Databricks Documentation.

Incorrect options:

B, D: Invalid or nonexistent methods.

C: Converts to a local pandas DataFrame, not a PySpark DataFrame.



A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

  1. Optimize the data processing logic by repartitioning the DataFrame.
  2. Modify the Spark configuration to disable garbage collection.
  3. Increase the memory allocated to the Spark Driver.
  4. Cache large DataFrames to persist them in memory.

Answer(s): C

Explanation:

The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much time in garbage collection with little memory recovery. This suggests that the driver or executor is under-provisioned in memory.

The most effective remedy is to increase the driver memory using:

--driver-memory 4g

This is confirmed in Spark's official troubleshooting documentation:

"If you see a lot of GC overhead limit exceeded errors in the driver logs, it's a sign that the driver is running out of memory."
-- Spark Tuning Guide

Why others are incorrect:

A may help but does not directly address the driver memory shortage.

B is not a valid action; GC cannot be disabled.

D increases memory usage, worsening the problem.
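The same setting can also be applied through spark-defaults.conf instead of the command line; the 4g figure below is illustrative and should be sized to the workload:

```
# spark-defaults.conf (illustrative value)
spark.driver.memory    4g

# or, equivalently, at submit time:
# spark-submit --driver-memory 4g app.py
```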



A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

  1. df.orderBy(col("age").asc(), col("salary").asc()).show()
  2. df.sort("age", "salary", ascending=[True, True]).show()
  3. df.sort("age", "salary", ascending=[False, True]).show()
  4. df.orderBy("age", "salary", ascending=[True, False]).show()

Answer(s): D

Explanation:

To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:

df.orderBy("age", "salary", ascending=[True, False])

age is sorted in ascending order, and salary in descending order.

The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for each column.



Reference:

PySpark API - DataFrame.orderBy
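The effect of the mixed-direction sort can be sketched in plain Python; the (name, age, salary) rows below are illustrative, with the numeric salary negated in the sort key to mirror ascending=[True, False]:

```python
# Illustrative (name, age, salary) rows.
rows = [('a', 30, 50000), ('b', 25, 60000), ('c', 30, 70000), ('d', 25, 40000)]

# Ascending age, then descending salary: negate salary in the sort key.
ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
print([r[0] for r in ordered])  # ['b', 'd', 'c', 'a']
```

Within each age group (25, then 30), the higher salary now comes first, which is exactly what option D produces.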



What is the difference between df.cache() and df.persist() in Spark DataFrame?

  1. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
  2. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
  3. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
  4. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame.

Answer(s): D

Explanation:

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)

df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK_SER, etc.

By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.


Reference:

Spark Programming Guide - Caching and Persistence



A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

  1. groupBy
  2. filter
  3. select
  4. coalesce

Answer(s): A

Explanation:

The groupBy() operation causes a shuffle because it requires all values for a specific key to be brought together, which may involve moving data across partitions.

In contrast:

filter() and select() are narrow transformations and do not cause shuffles.

coalesce() reduces the number of partitions by merging existing ones locally, avoiding a full shuffle (unlike repartition()).


Reference:

Apache Spark - Understanding Shuffle
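As a toy model of why groupBy forces a shuffle, the sketch below gathers same-keyed records that start out spread across partitions before aggregating them (the partition contents are illustrative):

```python
from collections import defaultdict

# Two "partitions" of (key, value) rows; rows for key 'A' start out
# on different partitions.
partitions = [[('A', 1), ('B', 2)], [('A', 3), ('C', 4)]]

# The shuffle step: co-locate every row that shares a key.
grouped = defaultdict(list)
for partition in partitions:
    for key, value in partition:
        grouped[key].append(value)

# Aggregation can only run once each key's rows are together.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'A': 4, 'B': 2, 'C': 4}
```

filter() and select(), by contrast, can be evaluated row by row within each partition, so no such data movement is needed.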



A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.

A few examples of the directory structure are:



Which of the following code snippets will read all the data within the directory structure?

  1. df = spark.read.option("inferSchema", "true").parquet("/path/events/data/")
  2. df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")
  3. df = spark.read.parquet("/path/events/data/*")
  4. df = spark.read.parquet("/path/events/data/")

Answer(s): B

Explanation:

To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup option to be explicitly enabled. According to Databricks official documentation, when dealing with deeply nested Parquet files in a directory tree (as shown in this example), you should set:

df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")

This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.

Option A is incorrect because while it includes an option, inferSchema is irrelevant here and does not enable recursive file reading.

Option C is incorrect because wildcards may not reliably match deep nested structures beyond one directory level.

Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.

Databricks documentation reference:

"To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures" -- Databricks documentation on Parquet files ingestion and options.


