Databricks Certified Associate Developer for Apache Spark 3.5 - Python Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam Questions in PDF

Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Dumps Questions (page: 2)

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:



The resulting Python dictionary must contain a mapping of region -> region id containing the smallest 3 region_id values.

Which code fragment meets the requirements?

A)



B)



C)



D)



The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

  1. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
    )
  2. regions = dict(
    regions_df

    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
    )
  3. regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
    )
  4. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
    )
  5. Option A
  6. Option B
  7. Option C
  8. Option D

Answer(s): A

Explanation:

The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers. Furthermore, it asks to retrieve only the smallest 3 region_id values.

Key observations:

.select('region', 'region_id') puts the column order as expected by dict() -- where the first column becomes the key and the second the value.

.sort('region_id') ensures sorting in ascending order so the smallest IDs are first.

.take(3) retrieves exactly 3 rows.

Wrapping the result in dict(...) correctly builds the required Python dictionary: { 'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2 }.

Incorrect options:

Option B flips the order to region_id first, resulting in a dictionary with integer keys -- not what's asked.

Option C uses .limit(3) without sorting, which leads to non-deterministic rows based on partition layout.

Option D sorts in descending order, giving the largest rather than smallest region_ids.

Hence, Option A meets all the requirements precisely.



An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select the columns, i.e., col1, col2, during the reading process?

  1. spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")
  2. spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")
  3. spark.read.orc("/file/test_data.orc").selected("col1", "col2")
  4. spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Answer(s): D

Explanation:

The correct way to load specific columns from an ORC file is to first load the file using .load() and then apply .select() on the resulting DataFrame. This is valid with .read.format("orc") or the shortcut

.read.orc().

df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Why others are incorrect:

A performs selection after filtering, but doesn't match the intention to minimize memory at load.

B incorrectly tries to use .select() before .load(), which is invalid.

C uses a non-existent .selected() method.

D correctly loads and then selects.


Reference:

Apache Spark SQL API - ORC Format



Given the code fragment:



import pyspark.pandas as ps psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

  1. psdf.to_spark()
  2. psdf.to_pyspark()
  3. psdf.to_pandas()
  4. psdf.to_dataframe()

Answer(s): A

Explanation:

Pandas API on Spark (pyspark.pandas) allows interoperability with PySpark DataFrames. To convert a pyspark.pandas.DataFrame to a standard PySpark DataFrame, you use .to_spark().

Example:

df = psdf.to_spark()

This is the officially supported method as per Databricks Documentation.

Incorrect options:

B, D: Invalid or nonexistent methods.

C: Converts to a local pandas DataFrame, not a PySpark DataFrame.



A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

  1. Optimize the data processing logic by repartitioning the DataFrame.
  2. Modify the Spark configuration to disable garbage collection
  3. Increase the memory allocated to the Spark Driver.
  4. Cache large DataFrames to persist them in memory.

Answer(s): C

Explanation:

The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much time in garbage collection with little memory recovery. This suggests that the driver or executor is under-provisioned in memory.

The most effective remedy is to increase the driver memory using:

--driver-memory 4g

This is confirmed in Spark's official troubleshooting documentation:

"If you see a lot of GC overhead limit exceeded errors in the driver logs, it's a sign that the driver is running out of memory."
-- Spark Tuning Guide

Why others are incorrect:

A may help but does not directly address the driver memory shortage.

B is not a valid action; GC cannot be disabled.

D increases memory usage, worsening the problem.



A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

  1. df.orderBy(col("age").asc(), col("salary").asc()).show()
  2. df.sort("age", "salary", ascending=[True, True]).show()
  3. df.sort("age", "salary", ascending=[False, True]).show()
  4. df.orderBy("age", "salary", ascending=[True, False]).show()

Answer(s): D

Explanation:

To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:

python

CopyEdit df.orderBy("age", "salary", ascending=[True, False])

age will be sorted in ascending order salary will be sorted in descending order

The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for each column.

Documentation


Reference:

PySpark API - DataFrame.orderBy



What is the difference between df.cache() and df.persist() in Spark DataFrame?

  1. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
  2. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
  3. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
  4. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist()
    - Can be used to set different storage levels to persist the contents of the DataFrame

Answer(s): D

Explanation:

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)

df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK_SER, etc.

By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.


Reference:

Spark Programming Guide - Caching and Persistence



A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

  1. groupBy
  2. filter
  3. select
  4. coalesce

Answer(s): A

Explanation:

The groupBy() operation causes a shuffle because it requires all values for a specific key to be brought together, which may involve moving data across partitions.

In contrast:

filter() and select() are narrow transformations and do not cause shuffles.

coalesce() tries to reduce the number of partitions and avoids shuffling by moving data to fewer partitions without a full shuffle (unlike repartition()).


Reference:

Apache Spark - Understanding Shuffle



A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.

A few examples of the directory structure are:



Which of the following code snippets will read all the data within the directory structure?

  1. df = spark.read.option("inferSchema", "true").parquet("/path/events/data/")
  2. df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")
  3. df = spark.read.parquet("/path/events/data/*")
  4. df = spark.read.parquet("/path/events/data/")

Answer(s): B

Explanation:

To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup option to be explicitly enabled. According to Databricks official documentation, when dealing with deeply nested Parquet files in a directory tree (as shown in this example), you should set:

df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")

This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.

Option A is incorrect because while it includes an option, inferSchema is irrelevant here and does not enable recursive file reading.

Option C is incorrect because wildcards may not reliably match deep nested structures beyond one directory level.

Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.

Databricks documentation reference:

"To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures" -- Databricks documentation on Parquet files ingestion and options.



Share your comments for Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam with other users:

S
sachin
6/27/2023 1:22:00 PM

can you share the pdf

B
Blessious Phiri
8/13/2023 10:26:00 AM

admin ii is real technical stuff

L
Luis Manuel
7/13/2023 9:30:00 PM

could you post the link

V
vijendra
8/18/2023 7:54:00 AM

hello send me dumps

S
Simeneh
7/9/2023 8:46:00 AM

it is very nice

J
john
11/16/2023 5:13:00 PM

i gave the amazon dva-c02 tests today and passed. very helpful.

T
Tao
11/20/2023 8:53:00 AM

there is an incorrect word in the problem statement. for example, in question 1, there is the word "speci c". this is "specific. in the other question, there is the word "noti cation". this is "notification. these mistakes make this site difficult for me to use.

P
patricks
10/24/2023 6:02:00 AM

passed my az-120 certification exam today with 90% marks. studied using the dumps highly recommended to all.

A
Ananya
9/14/2023 5:17:00 AM

i need it, plz make it available

J
JM
12/19/2023 2:41:00 PM

q47: intrusion prevention system is the correct answer, not patch management. by definition, there are no patches available for a zero-day vulnerability. the way to prevent an attacker from exploiting a zero-day vulnerability is to use an ips.

R
Ronke
8/18/2023 10:39:00 AM

this is simple but tiugh as well

C
CesarPA
7/12/2023 10:36:00 PM

questão 4, segundo meu compilador local e o site https://www.jdoodle.com/online-java-compiler/, a resposta correta é "c" !

J
Jeya
9/13/2023 7:50:00 AM

its very useful

T
Tracy
10/24/2023 6:28:00 AM

i mastered my skills and aced the comptia 220-1102 exam with a score of 920/1000. i give the credit to for my success.

J
James
8/17/2023 4:33:00 PM

real questions

A
Aderonke
10/23/2023 1:07:00 PM

very helpful assessments

S
Simmi
8/24/2023 7:25:00 AM

hi there, i would like to get dumps for this exam

J
johnson
10/24/2023 5:47:00 AM

i studied for the microsoft azure az-204 exam through it has 100% real questions available for practice along with various mock tests. i scored 900/1000.

M
Manas
9/9/2023 1:48:00 AM

please upload 1z0-1072-23 exam dups

S
SB
9/12/2023 5:15:00 AM

i was hoping if you could please share the pdf as i’m currently preparing to give the exam.

J
Jagjit
8/26/2023 5:01:00 PM

i am looking for oracle 1z0-116 exam

S
S Mallik
11/27/2023 12:32:00 AM

where we can get the answer to the questions

P
PiPi Li
12/12/2023 8:32:00 PM

nice questions

D
Dan
8/10/2023 4:19:00 PM

question 129 is completely wrong.

G
gayathiri
7/6/2023 12:10:00 AM

i need dump

D
Deb
8/15/2023 8:28:00 PM

love the site.

M
Michelle
6/23/2023 4:08:00 AM

can you please upload it back?

A
Ajay
10/3/2023 12:17:00 PM

could you please re-upload this exam? thanks a lot!

H
him
9/30/2023 2:38:00 AM

great about shared quiz

S
San
11/14/2023 12:46:00 AM

goood helping

W
Wang
6/9/2022 10:05:00 PM

pay attention to questions. they are very tricky. i waould say about 80 to 85% of the questions are in this exam dump.

M
Mary
5/16/2023 4:50:00 AM

wish you would allow more free questions

T
thomas
9/12/2023 4:28:00 AM

great simulation

S
Sandhya
12/9/2023 12:57:00 AM

very g inood

AI Tutor 👋 I’m here to help!