Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam (page: 2)
Databricks Certified Associate Developer for Apache Spark 3.5 - Python
Updated on: 21-Feb-2026

A developer needs to produce a Python dictionary using data stored in a small Parquet table with two columns, region and region_id. The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

  1. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
    )
  2. regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
    )
  3. regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
    )
  4. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
    )

Answer(s): A

Explanation:

The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers. Furthermore, it asks to retrieve only the smallest 3 region_id values.

Key observations:

.select('region', 'region_id') puts the columns in the order dict() expects: each returned Row behaves like a tuple, so the first column becomes the key and the second the value.

.sort('region_id') ensures sorting in ascending order so the smallest IDs are first.

.take(3) retrieves exactly 3 rows.

Wrapping the result in dict(...) correctly builds the required Python dictionary: { 'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2 }.

Incorrect options:

Option B flips the order to region_id first, resulting in a dictionary with integer keys -- not what's asked.

Option C uses .limit(3) without sorting, which leads to non-deterministic rows based on partition layout.

Option D sorts in descending order, giving the largest rather than smallest region_ids.

Hence, Option A meets all the requirements precisely.
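A minimal, Spark-free sketch of the dict() mechanics, using plain tuples to stand in for the Row objects that .take(3) returns (sample values taken from the expected result above):

```python
# Each pyspark Row behaves like a tuple, so dict() treats the first
# field as the key and the second as the value. Plain tuples stand in
# for Rows here so the sketch runs without a Spark session.
rows = [("AFRICA", 0), ("AMERICA", 1), ("ASIA", 2)]

regions = dict(rows)
print(regions)  # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}
```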



An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment selects only the columns col1 and col2 during the reading process?

  1. spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")
  2. spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")
  3. spark.read.orc("/file/test_data.orc").selected("col1", "col2")
  4. spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Answer(s): D

Explanation:

The correct way to load specific columns from an ORC file is to first load the file with .load() and then apply .select() on the resulting DataFrame. This works with .read.format("orc") or the shortcut .read.orc(). Spark's optimizer pushes the column pruning down to the ORC reader, so only the selected columns are actually read from disk.

df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Why the others are incorrect:

A filters first and then selects only col2, so it does not return both requested columns.

B calls .select() on the DataFrameReader before .load(); .select() is a DataFrame method, so this fails.

C uses a non-existent .selected() method.


Reference:

Apache Spark SQL API - ORC Format



Given the code fragment:



import pyspark.pandas as ps
psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

  1. psdf.to_spark()
  2. psdf.to_pyspark()
  3. psdf.to_pandas()
  4. psdf.to_dataframe()

Answer(s): A

Explanation:

Pandas API on Spark (pyspark.pandas) allows interoperability with PySpark DataFrames. To convert a pyspark.pandas.DataFrame to a standard PySpark DataFrame, you use .to_spark().

Example:

df = psdf.to_spark()

This is the officially supported method as per Databricks Documentation.

Incorrect options:

B, D: Invalid or nonexistent methods.

C: Converts to a local pandas DataFrame, not a PySpark DataFrame.



A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

  1. Optimize the data processing logic by repartitioning the DataFrame.
  2. Modify the Spark configuration to disable garbage collection.
  3. Increase the memory allocated to the Spark Driver.
  4. Cache large DataFrames to persist them in memory.

Answer(s): C

Explanation:

The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much time in garbage collection with little memory recovery. This suggests that the driver or executor is under-provisioned in memory.

The most effective remedy is to increase the driver memory using:

--driver-memory 4g

This is confirmed in Spark's official troubleshooting documentation:

"If you see a lot of GC overhead limit exceeded errors in the driver logs, it's a sign that the driver is running out of memory."
-- Spark Tuning Guide

Why others are incorrect:

A may help but does not directly address the driver memory shortage.

B is not a valid action; GC cannot be disabled.

D increases memory usage, worsening the problem.
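As a sketch (the 8g value and application name are illustrative, not from the question), driver memory can be raised at submit time or through configuration:

```shell
# Raise the driver heap; the right value depends on the workload
# and on what the cluster manager allows.
spark-submit \
  --driver-memory 8g \
  my_app.py

# Equivalent setting in spark-defaults.conf:
# spark.driver.memory   8g
```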



A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

  1. df.orderBy(col("age").asc(), col("salary").asc()).show()
  2. df.sort("age", "salary", ascending=[True, True]).show()
  3. df.sort("age", "salary", ascending=[False, True]).show()
  4. df.orderBy("age", "salary", ascending=[True, False]).show()

Answer(s): D

Explanation:

To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:

df.orderBy("age", "salary", ascending=[True, False])

age is sorted in ascending order and salary in descending order.

The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for each column.



Reference:

PySpark API - DataFrame.orderBy
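The mixed-direction ordering above can be mimicked in plain Python to check the expected row order; the sample rows below are illustrative, with (name, age, salary) tuples standing in for DataFrame rows:

```python
# Sort by age ascending, then salary descending -- the same ordering
# that df.orderBy("age", "salary", ascending=[True, False]) produces.
people = [("alice", 30, 50000), ("bob", 25, 60000), ("carol", 25, 70000)]

ordered = sorted(people, key=lambda r: (r[1], -r[2]))
print([name for name, _, _ in ordered])  # ['carol', 'bob', 'alice']
```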



What is the difference between df.cache() and df.persist() in Spark DataFrame?

  1. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
  2. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
  3. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
  4. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist()
    - Can be used to set different storage levels to persist the contents of the DataFrame

Answer(s): D

Explanation:

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)

df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK_SER, etc.

By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.


Reference:

Spark Programming Guide - Caching and Persistence



A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

  1. groupBy
  2. filter
  3. select
  4. coalesce

Answer(s): A

Explanation:

The groupBy() operation causes a shuffle because it requires all values for a specific key to be brought together, which may involve moving data across partitions.

In contrast:

filter() and select() are narrow transformations and do not cause shuffles.

coalesce() tries to reduce the number of partitions and avoids shuffling by moving data to fewer partitions without a full shuffle (unlike repartition()).
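A toy, Spark-free sketch of why grouping forces a shuffle: every row carrying the same key must land in the same partition, which conceptually means routing each row by a hash of its key (the data and partition count here are illustrative):

```python
# Conceptual routing step behind a shuffle: assign each row to a
# partition by hashing its key, so all rows for a given key co-locate
# and can then be aggregated locally within one partition.
rows = [("ASIA", 10), ("AFRICA", 5), ("ASIA", 7), ("AFRICA", 3)]
num_partitions = 2

partitions = [[] for _ in range(num_partitions)]
for key, value in rows:
    partitions[hash(key) % num_partitions].append((key, value))
```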


Reference:

Apache Spark - Understanding Shuffle



A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.

A few examples of the directory structure are:

/path/events/data/2023/01/01/
/path/events/data/2023/01/02/

Which of the following code snippets will read all the data within the directory structure?

  1. df = spark.read.option("inferSchema", "true").parquet("/path/events/data/")
  2. df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")
  3. df = spark.read.parquet("/path/events/data/*")
  4. df = spark.read.parquet("/path/events/data/")

Answer(s): B

Explanation:

To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup option to be explicitly enabled. According to Databricks official documentation, when dealing with deeply nested Parquet files in a directory tree (as shown in this example), you should set:

df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")

This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.

Option A is incorrect because while it includes an option, inferSchema is irrelevant here and does not enable recursive file reading.

Option C is incorrect because wildcards may not reliably match deep nested structures beyond one directory level.

Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.

Databricks documentation reference:

"To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures" -- Databricks documentation on Parquet files ingestion and options.


