Databricks Databricks-Machine-Learning-Associate Exam (page: 1)
Databricks Certified Machine Learning Associate
Updated on: 09-Apr-2026

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs.
When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.
Which of the following lines of code will return the metadata description?

  1. There is no way to return the metadata description programmatically.
  2. fs.create_training_set("new_table")
  3. fs.get_table("new_table").description
  4. fs.get_table("new_table").load_df()
  5. fs.get_table("new_table")

Answer(s): C

Explanation:

To retrieve the metadata description of a feature table created using the Feature Store Client (referred here as fs), the correct method involves calling get_table on the fs client with the table name as an argument, followed by accessing the description attribute of the returned object. The code snippet fs.get_table("new_table").description correctly achieves this by fetching the table object for "new_table" and then accessing its description attribute, where the metadata is stored. The other options do not correctly focus on retrieving the metadata description.


Reference:

Databricks Feature Store documentation (Accessing Feature Table Metadata).



A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?

  1. spark_df[spark_df["price"] > 0]
  2. spark_df.filter(col("price") > 0)
  3. SELECT * FROM spark_df WHERE price > 0
  4. spark_df.loc[spark_df["price"] > 0,:]
  5. spark_df.loc[:,spark_df["price"] > 0]

Answer(s): B

Explanation:

To filter rows in a Spark DataFrame based on a condition, you use the filter method along with a column condition. The correct syntax in PySpark to accomplish this task is spark_df.filter(col("price") > 0), which filters the DataFrame to include only those rows where the value in the "price" column is greater than 0. The col function is used to specify column-based operations. The other options provided either do not use correct Spark DataFrame syntax or are intended for different types of data manipulation frameworks like pandas.


Reference:

PySpark DataFrame API documentation (Filtering DataFrames).



A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?

  1. RMSE
  2. Precision
  3. Area under the residual operating curve
  4. Accuracy
  5. Recall

Answer(s): E

Explanation:

When the goal is to maximize the identification of positive cases in a classification task, the metric of interest is Recall. Recall, also known as sensitivity, measures the proportion of actual positives that are correctly identified by the model (i.e., the true positive rate). It is crucial for scenarios where missing a positive case (false negative) has serious implications, such as in medical diagnostics. The other metrics like Precision, RMSE, and Accuracy serve different aspects of performance measurement and are not specifically focused on maximizing the detection of positive cases alone.


Reference:

Classification Metrics in Machine Learning (Understanding Recall).



In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

  1. When the features are of the categorical type
  2. When the features are of the boolean type
  3. When the features contain a lot of extreme outliers
  4. When the features contain no outliers
  5. When the features contain no missing no values

Answer(s): C

Explanation:

Imputing missing values with the median is often preferred over the mean in scenarios where the data contains a lot of extreme outliers. The median is a more robust measure of central tendency in such cases, as it is not as heavily influenced by outliers as the mean. Using the median ensures that the imputed values are more representative of the typical data point, thus preserving the integrity of the dataset's distribution. The other options are not specifically relevant to the question of handling outliers in numerical data.


Reference:

Data Imputation Techniques (Dealing with Outliers).



A data scientist has replaced missing values in their feature set with each respective feature variable's median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?

  1. Impute the missing values using each respective feature variable's mean value instead of the median value
  2. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
  3. Remove all feature variables that originally contained missing values from the feature set
  4. Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed
  5. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing

Answer(s): D

Explanation:

By creating a binary feature variable for each feature with missing values to indicate whether a value has been imputed, the data scientist can preserve information about the original state of the data. This approach maintains the integrity of the dataset by marking which values are original and which are synthetic (imputed). Here are the steps to implement this approach:
Identify Missing Values: Determine which features contain missing values. Impute Missing Values: Continue with median imputation or choose another method (mean, mode, regression, etc.) to fill missing values.
Create Indicator Variables: For each feature that had missing values, add a new binary feature. This feature should be '1' if the original value was missing and imputed, and '0' otherwise. Data Integration: Integrate these new binary features into the existing dataset. This maintains a record of where data imputation occurred, allowing models to potentially weight these observations differently.
Model Adjustment: Adjust machine learning models to account for these new features, which might involve considering interactions between these binary indicators and other features.
Reference
"Feature Engineering for Machine Learning" by Alice Zheng and Amanda Casari (O'Reilly Media, 2018), especially the sections on handling missing data. Scikit-learn documentation on imputing missing values: https://scikit- learn.org/stable/modules/impute.html



Viewing Page 1 of 16



Share your comments for Databricks Databricks-Machine-Learning-Associate exam with other users:

Av dey 8/16/2023 2:35:00 PM

can you please upload the dumps for 1z0-1096-23 for oracle
INDIA


Mayur Shermale 11/23/2023 12:22:00 AM

its intresting, i would like to learn more abouth this
JAPAN


JM 12/19/2023 2:23:00 PM

q252: dns poisoning is the correct answer, not locator redirection. beaconing is detected from a host. this indicates that the system has been infected with malware, which could be the source of local dns poisoning. location redirection works by either embedding the redirection in the original websites code or having a user click on a url that has an embedded redirect. since users at a different office are not getting redirected, it isnt an embedded redirection on the original website and since the user is manually typing in the url and not clicking a link, it isnt a modified link.
UNITED STATES


Freddie 12/12/2023 12:37:00 PM

helpful dump questions
SOUTH AFRICA


Da Costa 8/25/2023 7:30:00 AM

question 423 eigrp uses metric
Anonymous


Bsmaind 8/20/2023 9:22:00 AM

hello nice dumps
Anonymous


beau 1/12/2024 4:53:00 PM

good resource for learning
UNITED STATES


Sandeep 12/29/2023 4:07:00 AM

very useful
Anonymous


kevin 9/29/2023 8:04:00 AM

physical tempering techniques
Anonymous


Blessious Phiri 8/15/2023 4:08:00 PM

its giving best technical knowledge
Anonymous


Testbear 6/13/2023 11:15:00 AM

please upload
ITALY


shime 10/24/2023 4:23:00 AM

great question with explanation thanks!!
ETHIOPIA


Thembelani 5/30/2023 2:40:00 AM

does this exam have lab sections?
Anonymous


Shin 9/8/2023 5:31:00 AM

please upload
PHILIPPINES


priti kagwade 7/22/2023 5:17:00 AM

please upload the braindump for .net
UNITED STATES


Robe 9/27/2023 8:15:00 PM

i need this exam 1z0-1107-2. please.
Anonymous


Chiranthaka 9/20/2023 11:22:00 AM

very useful!
Anonymous


Not Miguel 11/26/2023 9:43:00 PM

for this question - "which three type of basic patient or member information is displayed on the patient info component? (choose three.)", list of conditions is not displayed (it is displayed in patient card, not patient info). so should be thumbnail of chatter photo
Anonymous


Andrus 12/17/2023 12:09:00 PM

q52 should be d. vm storage controller bandwidth represents the amount of data (in terms of bandwidth) that a vms storage controller is using to read and write data to the storage fabric.
Anonymous


Raj 5/25/2023 8:43:00 AM

nice questions
UNITED STATES


max 12/22/2023 3:45:00 PM

very useful
Anonymous


Muhammad Rawish Siddiqui 12/8/2023 6:12:00 PM

question # 208: failure logs is not an example of operational metadata.
SAUDI ARABIA


Sachin Bedi 1/5/2024 4:47:00 AM

good questions
Anonymous


Kenneth 12/8/2023 7:34:00 AM

thank you for the test materials!
KOREA REPUBLIC OF


Harjinder Singh 8/9/2023 4:16:00 AM

its very helpful
HONG KONG


SD 7/13/2023 12:56:00 AM

good questions
UNITED STATES


kanjoe 7/2/2023 11:40:00 AM

good questons
UNITED STATES


Mahmoud 7/6/2023 4:24:00 AM

i need the dumb of the hcip security v4.0 exam
EGYPT


Wei 8/3/2023 4:18:00 AM

upload the dump please
HONG KONG


Stephen 10/3/2023 6:24:00 PM

yes, iam looking this
AUSTRALIA


Stephen 8/4/2023 9:08:00 PM

please upload cima e2 managing performance dumps
Anonymous


hp 6/16/2023 12:44:00 AM

wonderful questions
Anonymous


Priyo 11/14/2023 2:23:00 AM

i used this site since 2000, still great to support my career
INDONESIA


Jude 8/29/2023 1:56:00 PM

why is the answer to "which of the following is required by scrum?" all of the following stated below since most of them are not mandatory? sprint retrospective. members must be stand up at the daily scrum. sprint burndown chart. release planning.
UNITED STATES