Databricks Databricks-Machine-Learning-Associate Exam (page: 2)
Databricks Certified Machine Learning Associate
Updated on: 02-Jan-2026

A data scientist is wanting to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?

  1. spark_df.summary ()
  2. spark_df.stats()
  3. spark_df.describe().head()
  4. spark_df.printSchema()
  5. spark_df.toPandas()

Answer(s): A

Explanation:

The summary() function in PySpark's DataFrame API provides descriptive statistics which include count, mean, standard deviation, min, max, and quantiles for numeric columns. Here are the steps on how it can be used:
Import PySpark: Ensure PySpark is installed and correctly configured in the Databricks environment.
Load Data: Load the data into a Spark DataFrame.
Apply Summary: Use spark_df.summary() to generate summary statistics. View Results: The output from the summary() function includes the statistics specified in the query (count, mean, standard deviation, min, max, and potentially quartiles which approximate the interquartile range).
Reference
PySpark Documentation:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.summary.ht ml



An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one- hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?

  1. One-hot encoding is not supported by most machine learning libraries.
  2. One-hot encoding is dependent on the target variable's values which differ for each application.
  3. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
  4. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
  5. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Answer(s): E

Explanation:

One-hot encoding transforms categorical variables into a format that can be provided to machine learning algorithms to better predict the output. However, when done prematurely or universally within a feature repository, it can be problematic:
Dimensionality Increase: One-hot encoding significantly increases the feature space, especially with high cardinality features, which can lead to high memory consumption and slower computation. Model Specificity: Some models handle categorical variables natively (like decision trees and boosting algorithms), and premature one-hot encoding can lead to inefficiency and loss of information (e.g., ordinal relationships).
Sparse Matrix Issue: It often results in a sparse matrix where most values are zero, which can be inefficient in both storage and computation for some algorithms. Generalization vs. Specificity: Encoding should ideally be tailored to specific models and use cases rather than applied generally in a feature repository.
Reference
"Feature Engineering and Selection: A Practical Approach for Predictive Models" by Max Kuhn and Kjell Johnson (CRC Press, 2019).



A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable.
When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?

  1. The second model is much more accurate than the first model
  2. The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE
  3. The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE
  4. The first model is much more accurate than the second model
  5. The RMSE is an invalid evaluation metric for regression problems

Answer(s): E

Explanation:

The Root Mean Squared Error (RMSE) is a standard and widely used metric for evaluating the accuracy of regression models. The statement that it is invalid is incorrect. Here's a breakdown of why the other statements are or are not valid:

Transformations and RMSE Calculation: If the model predictions were transformed (e.g., using log), they should be converted back to their original scale before calculating RMSE to ensure accuracy in the evaluation. Missteps in this conversion process can lead to misleading RMSE values. Accuracy of Models: Without additional information, we can't definitively say which model is more accurate without considering their RMSE values properly scaled back to the original price scale. Appropriateness of RMSE: RMSE is entirely valid for regression problems as it provides a measure of how accurately a model predicts the outcome, expressed in the same units as the dependent variable.
Reference
"Applied Predictive Modeling" by Max Kuhn and Kjell Johnson (Springer, 2013), particularly the chapters discussing model evaluation metrics.



A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:
· 10.0
· 12.0
· 17.0
Which of the following values represents the overall cross-validation root-mean-squared error?

  1. 13.0
  2. 17.0
  3. 12.0
  4. 39.0
  5. 10.0

Answer(s): A

Explanation:

To calculate the overall cross-validation root-mean-squared error (RMSE), you average the RMSE values obtained from each validation fold. Given the RMSE values of 10.0, 12.0, and 17.0 for the three folds, the overall cross-validation RMSE is calculated as the average of these three values:
OverallCVRMSE=10.0+12.0+17.03=39.03=13.0OverallCVRMSE=310.0+12.0+17.0=339.0=13.0 Thus, the correct answer is 13.0, which accurately represents the average RMSE across all folds.


Reference:

Cross-validation in Regression (Understanding Cross-Validation Metrics).



A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:



A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.

Which of the following is a negative consequence of the approach suggested by the colleague?

  1. The model will take longer to train for each unique combination of hvperparameter values
  2. The feature engineering stages will be computed using validation data
  3. The cross-validation process will no longer be
  4. The cross-validation process will no longer be reproducible
  5. The model will be refit one more per cross-validation fold

Answer(s): B

Explanation:

If the model object is passed to the estimator parameter of CrossValidator and the cross-validation object itself is placed as a stage in the pipeline, the feature engineering stages within the pipeline would be applied separately to each training and validation fold during cross-validation. This leads to a significant issue: the feature engineering stages would be computed using validation data, thereby leaking information from the validation set into the training process. This would potentially invalidate the cross-validation results by giving an overly optimistic performance estimate.


Reference:

Cross-validation and Pipeline Integration in MLlib (Avoiding Data Leakage in Pipelines).



Viewing Page 2 of 16



Share your comments for Databricks Databricks-Machine-Learning-Associate exam with other users:

Reddy 12/14/2023 2:42:00 AM

these are pretty useful
Anonymous


Daisy Delgado 1/9/2023 1:05:00 PM

awesome
UNITED STATES


Atif 6/13/2023 4:09:00 AM

yes please upload
UNITED STATES


Xunil 6/12/2023 3:04:00 PM

great job whoever put this together, for the greater good! thanks!
Anonymous


Lakshmi 10/2/2023 5:26:00 AM

just started to view all questions for the exam
NETHERLANDS


rani 1/19/2024 11:52:00 AM

helpful material
Anonymous


Greg 11/16/2023 6:59:00 AM

hope for the best
UNITED STATES


hi 10/5/2023 4:00:00 AM

will post exam has finished
UNITED STATES


Vmotu 8/24/2023 11:14:00 AM

really correct and good analyze!
AZERBAIJAN


hicham 5/30/2023 8:57:00 AM

excellent thanks a lot
FRANCE


Suman C 7/7/2023 8:13:00 AM

will post once pass the cka exam
INDIA


Ram 11/3/2023 5:10:00 AM

good content
Anonymous


Nagendra Pedipina 7/13/2023 2:12:00 AM

q:32 answer has to be option c
INDIA


Tamer Barakat 12/7/2023 5:17:00 PM

nice questions
Anonymous


Daryl 8/1/2022 11:33:00 PM

i really like the support team in this website. they are fast in communication and very helpful.
UNITED KINGDOM


Curtis Nakawaki 6/29/2023 9:13:00 PM

a good contemporary exam review
UNITED STATES


x-men 5/23/2023 1:02:00 AM

q23, its an array, isnt it? starts with [ and end with ]. its an array of objects, not object.
UNITED STATES


abuti 7/21/2023 6:24:00 PM

cool very helpfull
Anonymous


Krishneel 3/17/2023 10:34:00 AM

i just passed. this exam dumps is the same one from prepaway and examcollection. it has all the real test questions.
INDIA


Regor 12/4/2023 2:01:00 PM

is this a valid prince2 practitioner dumps?
UNITED KINGDOM


asl 9/14/2023 3:59:00 PM

all are relatable questions
CANADA


Siyya 1/19/2024 8:30:00 PM

might help me to prepare for the exam
Anonymous


Ted 6/21/2023 11:11:00 PM

just paid and downlaod the 2 exams using the 50% sale discount. so far i was able to download the pdf and the test engine. all looks good.
GERMANY


Paul K 11/27/2023 2:28:00 AM

i think it should be a,c. option d goes against the principle of building anything custom unless there are no work arounds available
INDIA


ph 6/16/2023 12:41:00 AM

very legible
Anonymous


sephs2001 7/31/2023 10:42:00 PM

is this exam accurate or helpful?
Anonymous


ash 7/11/2023 3:00:00 AM

please upload dump, i have exam in 2 days
INDIA


Sneha 8/17/2023 6:29:00 PM

this is useful
CANADA


sachin 12/27/2023 2:45:00 PM

question 232 answer should be perimeter not netowrk layer. wrong answer selected
Anonymous


tomAws 7/18/2023 5:05:00 AM

nice questions
BRAZIL


Rahul 6/11/2023 2:07:00 AM

hi team, could you please provide this dump ?
INDIA


TeamOraTech 12/5/2023 9:49:00 AM

very helpful to clear the exam and understand the concept.
Anonymous


Curtis 7/12/2023 8:20:00 PM

i think it is great that you are helping people when they need it. thanks.
UNITED STATES


sam 7/17/2023 6:22:00 PM

cannot evaluate yet
Anonymous