Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam (page: 2)
Databricks Certified Associate Developer for Apache Spark 3.5 - Python
Updated on: 09-Apr-2026

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:



The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?


  1. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
    )
  2. regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
    )
  3. regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
    )
  4. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
    )

Answer(s): A

Explanation:

The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers. Furthermore, it asks to retrieve only the smallest 3 region_id values.

Key observations:

.select('region', 'region_id') puts the column order as expected by dict() -- where the first column becomes the key and the second the value.

.sort('region_id') ensures sorting in ascending order so the smallest IDs are first.

.take(3) retrieves exactly 3 rows.

Wrapping the result in dict(...) correctly builds the required Python dictionary: { 'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2 }.

Incorrect options:

Option B flips the order to region_id first, resulting in a dictionary with integer keys -- not what's asked.

Option C uses .limit(3) without sorting, which leads to non-deterministic rows based on partition layout.

Option D sorts in descending order, giving the largest rather than smallest region_ids.

Hence, Option A meets all the requirements precisely.
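Because each Row returned by take(3) behaves like a two-element tuple, dict() can consume the result directly. A minimal plain-Python sketch of the same mechanics (the region names and IDs are illustrative, mirroring the sample output above):

```python
# Toy stand-in for the table rows: each row is a (region, region_id)
# pair, just as a two-column Row object unpacks.
rows = [('AMERICA', 1), ('AFRICA', 0), ('ASIA', 2), ('EUROPE', 3)]

# Sort ascending by region_id and keep the smallest three, as in option A.
smallest_three = sorted(rows, key=lambda r: r[1])[:3]

# dict() treats each pair as key -> value: region -> region_id.
regions = dict(smallest_three)
print(regions)  # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}
```

Swapping the pair order, as option B does, would instead produce integer keys mapping to region names.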



An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment selects only the columns col1 and col2 during the reading process?

  1. spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")
  2. spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")
  3. spark.read.orc("/file/test_data.orc").selected("col1", "col2")
  4. spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Answer(s): D

Explanation:

The correct way to load specific columns from an ORC file is to first load the file using .load() and then apply .select() on the resulting DataFrame. This works with .read.format("orc") as well as the shortcut .read.orc().

df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Why others are incorrect:

A filters first and then selects only col2, so it does not return both requested columns.

B incorrectly tries to use .select() before .load(), which is invalid.

C uses a non-existent .selected() method.

D correctly loads and then selects.


Reference:

Apache Spark SQL API - ORC Format



Given the code fragment:



import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

  1. psdf.to_spark()
  2. psdf.to_pyspark()
  3. psdf.to_pandas()
  4. psdf.to_dataframe()

Answer(s): A

Explanation:

Pandas API on Spark (pyspark.pandas) allows interoperability with PySpark DataFrames. To convert a pyspark.pandas.DataFrame to a standard PySpark DataFrame, you use .to_spark().

Example:

df = psdf.to_spark()

This is the officially supported method as per Databricks Documentation.

Incorrect options:

B, D: Invalid or nonexistent methods.

C: Converts to a local pandas DataFrame, not a PySpark DataFrame.



A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

  1. Optimize the data processing logic by repartitioning the DataFrame.
  2. Modify the Spark configuration to disable garbage collection.
  3. Increase the memory allocated to the Spark Driver.
  4. Cache large DataFrames to persist them in memory.

Answer(s): C

Explanation:

The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much time in garbage collection with little memory recovery. This suggests that the driver or executor is under-provisioned in memory.

The most effective remedy is to increase the driver memory using:

--driver-memory 4g

This is confirmed in Spark's official troubleshooting documentation:

"If you see a lot of GC overhead limit exceeded errors in the driver logs, it's a sign that the driver is running out of memory."
-- Spark Tuning Guide

Why others are incorrect:

A may help but does not directly address the driver memory shortage.

B is not a valid action; GC cannot be disabled.

D increases memory usage, worsening the problem.
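The same setting can also be applied through spark-defaults.conf instead of the command line; the 4g figure below is illustrative and should be sized to the workload:

```
# spark-defaults.conf (illustrative value)
spark.driver.memory    4g

# or, equivalently, at submit time:
# spark-submit --driver-memory 4g app.py
```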



A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

  1. df.orderBy(col("age").asc(), col("salary").asc()).show()
  2. df.sort("age", "salary", ascending=[True, True]).show()
  3. df.sort("age", "salary", ascending=[False, True]).show()
  4. df.orderBy("age", "salary", ascending=[True, False]).show()

Answer(s): D

Explanation:

To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:

df.orderBy("age", "salary", ascending=[True, False])

age is sorted in ascending order, and salary in descending order.

The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for each column.



Reference:

PySpark API - DataFrame.orderBy
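The effect of the mixed-direction sort can be sketched in plain Python; the (name, age, salary) rows below are illustrative, with the numeric salary negated in the sort key to mirror ascending=[True, False]:

```python
# Illustrative (name, age, salary) rows.
rows = [('a', 30, 50000), ('b', 25, 60000), ('c', 30, 70000), ('d', 25, 40000)]

# Ascending age, then descending salary: negate salary in the sort key.
ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
print([r[0] for r in ordered])  # ['b', 'd', 'c', 'a']
```

Within each age group (25, then 30), the higher salary now comes first, which is exactly what option D produces.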



What is the difference between df.cache() and df.persist() in Spark DataFrame?

  1. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
  2. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
  3. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
  4. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame.

Answer(s): D

Explanation:

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)

df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK_SER, etc.

By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.


Reference:

Spark Programming Guide - Caching and Persistence



A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

  1. groupBy
  2. filter
  3. select
  4. coalesce

Answer(s): A

Explanation:

The groupBy() operation causes a shuffle because it requires all values for a specific key to be brought together, which may involve moving data across partitions.

In contrast:

filter() and select() are narrow transformations and do not cause shuffles.

coalesce() reduces the number of partitions by merging existing ones locally, avoiding a full shuffle (unlike repartition()).


Reference:

Apache Spark - Understanding Shuffle
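As a toy model of why groupBy forces a shuffle, the sketch below gathers same-keyed records that start out spread across partitions before aggregating them (the partition contents are illustrative):

```python
from collections import defaultdict

# Two "partitions" of (key, value) rows; rows for key 'A' start out
# on different partitions.
partitions = [[('A', 1), ('B', 2)], [('A', 3), ('C', 4)]]

# The shuffle step: co-locate every row that shares a key.
grouped = defaultdict(list)
for partition in partitions:
    for key, value in partition:
        grouped[key].append(value)

# Aggregation can only run once each key's rows are together.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'A': 4, 'B': 2, 'C': 4}
```

filter() and select(), by contrast, can be evaluated row by row within each partition, so no such data movement is needed.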



A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.

A few examples of the directory structure are:



Which of the following code snippets will read all the data within the directory structure?

  1. df = spark.read.option("inferSchema", "true").parquet("/path/events/data/")
  2. df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")
  3. df = spark.read.parquet("/path/events/data/*")
  4. df = spark.read.parquet("/path/events/data/")

Answer(s): B

Explanation:

To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup option to be explicitly enabled. According to Databricks official documentation, when dealing with deeply nested Parquet files in a directory tree (as shown in this example), you should set:

df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")

This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.

Option A is incorrect because while it includes an option, inferSchema is irrelevant here and does not enable recursive file reading.

Option C is incorrect because wildcards may not reliably match deep nested structures beyond one directory level.

Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.

Databricks documentation reference:

"To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures" -- Databricks documentation on Parquet files ingestion and options.


