Databricks Certified Associate Developer for Apache Spark 3.5 - Python Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam Questions in PDF


Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

  A. 10
  B. Same number as the cluster executors
  C. 1
  D. 20

Answer(s): A

Explanation:

The .coalesce(numPartitions) function is used to reduce the number of partitions in a DataFrame. It does not increase the number of partitions. If the specified number of partitions is greater than the current number, it will not have any effect.

From the official Spark documentation:

"Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions."
So requesting an increase with coalesce (e.g., from 10 to 20) leaves the number of partitions unchanged.

Hence, df.coalesce(20) will still return a DataFrame with 10 partitions.
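
The partition-count rule can be sketched in plain Python, without a cluster. This is a toy model and not Spark's implementation: partitions are modeled as lists, and the helper name coalesce_model is made up for illustration. It mirrors the two behaviors described above: whole partitions are merged when fewer are requested, and the layout is left untouched when more are requested.

```python
def coalesce_model(partitions, num_partitions):
    """Toy model of coalesce: merge whole partitions, never split them.

    `partitions` is a list of lists standing in for a DataFrame's partitions.
    This is an illustrative assumption, not how Spark is implemented.
    """
    current = len(partitions)
    if num_partitions >= current:
        # Asking for as many or more partitions than exist is a no-op.
        return partitions
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Each new partition claims one or more whole existing partitions.
        target = index * num_partitions // current
        merged[target].extend(part)
    return merged


# Ten partitions of ten rows each, like the df in the question.
df_partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(10)]

print(len(coalesce_model(df_partitions, 20)))  # stays at 10
print(len(coalesce_model(df_partitions, 4)))   # reduced to 4
```

In real PySpark the count can be checked with df.rdd.getNumPartitions(). To actually increase the number of partitions, repartition(20) is the call to use, at the cost of a shuffle.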


Reference:

Apache Spark 3.5 Programming Guide RDD and DataFrame Operations coalesce()



Given the following code snippet in my_spark_app.py:



What is the role of the driver node?

  A. The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes
  B. The driver node only provides the user interface for monitoring the application
  C. The driver node holds the DataFrame data and performs all computations locally
  D. The driver node stores the final result after computations are completed by worker nodes

Answer(s): A

Explanation:

In the Spark architecture, the driver node is responsible for orchestrating the execution of a Spark application. It converts user-defined transformations and actions into a logical plan, optimizes it into a physical plan, and then splits the plan into tasks that are distributed to the executor nodes.

As per Databricks and Spark documentation:

"The driver node is responsible for maintaining information about the Spark application, responding to a user's program or input, and analyzing, distributing, and scheduling work across the executors."

This means:

Option A is correct because the driver schedules and coordinates the job execution.

Option B is incorrect because the driver does more than just UI monitoring.

Option C is incorrect since data and computations are distributed across executor nodes.

Option D is incorrect; results are returned to the driver but not stored long-term by it.


Reference:

Databricks Certified Developer Spark 3.5 Documentation Spark Architecture Driver vs Executors.



A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:



import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives the error:

What should the signature of the shake_256() function be changed to in order to fix this error?

  A. def shake_256(df: pd.Series) -> str:
  B. def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
  C. def shake_256(raw: str) -> str:
  D. def shake_256(df: pd.Series) -> pd.Series:

Answer(s): D

Explanation:

When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the function must operate on a Pandas Series as input and return a Pandas Series as output.

In this case, the original function, def shake_256(raw), takes a single string and returns a single string. That scalar shape is not compatible with a Pandas UDF, which receives a whole batch of values as a pandas Series.

According to the official Spark documentation:

"Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:

def my_udf(s: pd.Series) -> pd.Series:

and it must be registered using pandas_udf(...)."

Therefore, to fix the error:

The function should be updated to:

def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

This will allow Spark to efficiently execute the Pandas UDF in vectorized form, improving performance compared to standard UDFs.
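
The per-element work inside the rewritten function is ordinary hashlib, so it can be checked without Spark or pandas. The sketch below is only an illustration: the helper name hash_one and the sample inputs are assumptions, and a plain list stands in for the pandas Series that Spark would pass in.

```python
import hashlib


def hash_one(raw: str) -> str:
    # SHAKE-256 is an extendable-output function, so the digest length
    # (here 20 bytes) must be passed explicitly.
    return hashlib.shake_256(raw.encode()).hexdigest(20)


# A plain list stands in for one batch of column values (hypothetical data).
batch = ["alice", "bob", "alice"]
hashed = [hash_one(value) for value in batch]

# 20 bytes rendered as hex gives 40 characters per value.
print([len(h) for h in hashed])
# Equal inputs hash to equal outputs.
print(hashed[0] == hashed[2])
```

Inside the Pandas UDF, df.apply(...) runs this same per-value call over each batch, so the gain over a row-at-a-time UDF comes mainly from the Arrow-based batch transfer rather than from the hash itself.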


Reference:

Apache Spark 3.5 Documentation User-Defined Functions Pandas UDFs



A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

  A. Use the applyInPandas API:
    df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
  B. Use the mapInPandas API:
    df.mapInPandas(mean_func, schema="user_id long, value double").show()
  C. Use a regular Spark UDF:
    from pyspark.sql.functions import mean
    df.groupBy("user_id").agg(mean("value")).show()
  D. Use a Pandas UDF:
    @pandas_udf("double")
    def mean_func(value: pd.Series) -> float:
        return value.mean()
    df.groupby("user_id").agg(mean_func(df["value"])).show()

Answer(s): A

Explanation:

The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.

As per the Databricks documentation:

"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."

Option A is correct and achieves this parallel execution.

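Option B (mapInPandas) maps a function over batches of the whole DataFrame and has no notion of groups.

Option C, despite its label, uses the built-in mean aggregation rather than a UDF. It is efficient but cannot be customized with Pandas logic.

Option D defines a Series-to-scalar (grouped aggregate) Pandas UDF. It reduces each group to a single value and does not hand each group to the function as a pandas DataFrame.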

Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the only correct answer.


Reference:

Apache Spark 3.5 Documentation Pandas API on Spark Grouped Map Pandas UDFs (applyInPandas)



Given:

spark.sparkContext.setLogLevel("<LOG_LEVEL>")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

  A. ALL, DEBUG, FAIL, INFO
  B. ERROR, WARN, TRACE, OFF
  C. WARN, NONE, ERROR, FATAL
  D. FATAL, NONE, INFO, DEBUG

Answer(s): B

Explanation:

The setLogLevel() method of SparkContext sets the logging level on the driver, which controls the verbosity of logs emitted during job execution. Supported levels are inherited from log4j and include the following:

ALL

DEBUG

ERROR

FATAL

INFO

OFF

TRACE

WARN

According to official Spark and Databricks documentation:

"Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN."

Among the choices provided, only option B (ERROR, WARN, TRACE, OFF) includes four valid log levels and excludes invalid ones like "FAIL" or "NONE".
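
The elimination above amounts to a subset check, which can be sketched in a few lines. The VALID_LEVELS set is copied from the list above; the options dictionary and its name are just an illustration of the four choices.

```python
# Log levels listed above as accepted by setLogLevel().
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}

# The four answer choices from the question.
options = {
    "A": {"ALL", "DEBUG", "FAIL", "INFO"},
    "B": {"ERROR", "WARN", "TRACE", "OFF"},
    "C": {"WARN", "NONE", "ERROR", "FATAL"},
    "D": {"FATAL", "NONE", "INFO", "DEBUG"},
}

for letter, levels in options.items():
    # Anything left after removing the valid levels is an invalid name.
    invalid = sorted(levels - VALID_LEVELS)
    status = "valid" if not invalid else f"invalid: {invalid}"
    print(letter, status)

# Choices whose every entry is a valid level.
fully_valid = [letter for letter, levels in options.items() if levels <= VALID_LEVELS]
print(fully_valid)
```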


Reference:

Apache Spark API docs SparkContext.setLogLevel



An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

  A. The code fails to execute because the column names employee_id and emp_id do not match automatically
  B. The code fails to execute because it must use on='employee_id' to specify the join column explicitly
  C. The code fails to execute because PySpark does not support joining DataFrames with a different structure
  D. The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2

Answer(s): D

Explanation:

In PySpark, when performing a join between two DataFrames, the columns do not have to share the same name. You can explicitly provide a join condition by comparing specific columns from each DataFrame.

This syntax is correct and fully supported:

df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

This will perform an inner join between df1 and df2 using the employee_id from df1 and emp_id from df2.


Reference:

Databricks Spark 3.5 Documentation DataFrame API join()



A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

  A. final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  B. final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  C. final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")
  D. final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Answer(s): D

Explanation:

To ensure that data written out to disk is sorted, it is important to consider how Spark writes data when saving to Parquet tables. The methods .sort() or .orderBy() apply a global sort but do not guarantee that the sorting will persist in the final output files unless certain conditions are met (e.g. a single partition via .coalesce(1) -- which is not scalable).

Instead, the proper method in distributed Spark processing to ensure rows are sorted within their respective partitions when written out is:

.sortWithinPartitions("column_name")

According to Apache Spark documentation:

"sortWithinPartitions() ensures each partition is sorted by the specified columns. This is useful for downstream systems that require sorted files."

This method works efficiently in distributed settings, avoids the cost of a global sort (as in .orderBy() or .sort()), and guarantees that each output partition has its records sorted. Note that this ordering holds within each partition only; it is not a total order across partitions or output files.

Thus:

Options A and B are equivalent (orderBy is an alias of sort) and do not guarantee the persisted file contents are sorted.

Option C introduces a bottleneck via .coalesce(1) (single partition).

Option D correctly applies sorting within partitions and is scalable.
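
The difference between a per-partition sort and a global sort can be seen in a small plain-Python model. This is not Spark code: each inner list stands in for one partition (and so one output file), and the market_time values are made up for illustration.

```python
# Three hypothetical partitions of market_time values (toy data).
partitions = [[9, 3, 7], [4, 8, 1], [6, 2, 5]]

# Analogue of sortWithinPartitions: sort each partition independently.
sorted_within = [sorted(part) for part in partitions]

# Analogue of sort / orderBy: one total order across all rows.
globally_sorted = sorted(value for part in partitions for value in part)

# Reading the per-partition output back, one file after another.
concatenated = [value for part in sorted_within for value in part]

print(sorted_within)                    # [[3, 7, 9], [1, 4, 8], [2, 5, 6]]
print(concatenated == globally_sorted)  # False
```

Each file is ordered internally, but a consumer that reads the files one after another does not see a single ordered stream, which is worth keeping in mind when interpreting the requirement.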


Reference:

Databricks & Apache Spark 3.5 Documentation DataFrame API sortWithinPartitions()



In the code block below, aggDF contains aggregations on a streaming DataFrame:



Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

  A. complete
  B. append
  C. replace
  D. aggregate

Answer(s): A

Explanation:

The correct output mode for streaming aggregations that need to output the full updated results at each trigger is "complete".

From the official documentation:

"complete: The entire updated result table will be output to the sink every time there is a trigger."

This is ideal for aggregations, such as counts or averages grouped by a key, where the result table changes incrementally over time.

append: only outputs newly added rows

replace and aggregate: invalid values for output mode


Reference:

Spark Structured Streaming Programming Guide Output Modes


