Professional Machine Learning Engineer Exam Questions and Answers (Page: 3)

QUESTION: 17

You have trained a model on a dataset that required computationally expensive preprocessing operations. You need to execute the same preprocessing at prediction time. You deployed the model on AI Platform for high-throughput online prediction.
Which architecture should you use?

· Validate the accuracy of the model that you trained on preprocessed data
· Create a new model that uses the raw data and is available in real time
· Deploy the new model onto AI Platform for online prediction

· Send incoming prediction requests to a Pub/Sub topic
· Transform the incoming data using a Dataflow job
· Submit a prediction request to AI Platform using the transformed data
· Write the predictions to an outbound Pub/Sub queue

· Stream incoming prediction request data into Cloud Spanner
· Create a view to abstract your preprocessing logic
· Query the view every second for new records
· Submit a prediction request to AI Platform using the transformed data
· Write the predictions to an outbound Pub/Sub queue

· Send incoming prediction requests to a Pub/Sub topic
· Set up a Cloud Function that is triggered when messages are published to the Pub/Sub topic
· Implement your preprocessing logic in the Cloud Function
· Submit a prediction request to AI Platform using the transformed data
· Write the predictions to an outbound Pub/Sub queue

Answer(s): D

Explanation:

Option A is incorrect because creating a new model that uses the raw data and is available in real time would require retraining the model and deploying it again, which is not efficient or scalable. Option B is incorrect because using a Dataflow job to transform the incoming data would introduce unnecessary latency and complexity for online prediction, which requires fast and simple processing. Option C is incorrect because using Cloud Spanner to stream and query the incoming data would incur high costs and overhead for online prediction, which does not need a relational database. Option D is correct because using a Cloud Function to preprocess the data and submit a prediction request to AI Platform is a simple and scalable solution for online prediction, which leverages the serverless and event-driven features of Cloud Functions.


QUESTION: 18

You are building a model to predict daily temperatures. You split the data randomly and then transformed the training and test datasets. Temperature data for model training is uploaded hourly. During testing, your model performed with 97% accuracy; however, after deploying to production, the model's accuracy dropped to 66%. How can you make your production model more accurate?

Normalize the data for the training and test datasets as two separate steps.
Split the training and test data based on time rather than a random split to avoid leakage
Add more data to your test set to ensure that you have a fair distribution and sample for testing
Apply data transformations before splitting, and cross-validate to make sure that the transformations are applied to both the training and test sets.

Answer(s): B

Explanation:

When building a model to predict daily temperatures, it is important to split the training and test data based on time rather than a random split. This is because temperature data is likely to have temporal dependencies and patterns, such as seasonality, trends, and cycles. If the data is split randomly, there is a risk of data leakage, which occurs when information from the future is used to train or validate the model. Data leakage can lead to overfitting and unrealistic performance estimates, as the model may learn from data that it should not have access to. By splitting the data based on time, such as using the most recent data as the test set and the older data as the training set, the model can be evaluated on how well it can forecast future temperatures based on past data, which is the realistic scenario in production. Therefore, splitting the data based on time rather than a random split is the best way to make the production model more accurate.
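The time-based split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline from the question: the `time_based_split` helper, the (timestamp, temperature) record layout, the synthetic readings, and the 80/20 ratio are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def time_based_split(records, train_fraction=0.8):
    """Split (timestamp, value) records so no test row is earlier than a training row.

    Hypothetical helper for illustration; the 0.8 default is an assumption.
    """
    ordered = sorted(records, key=lambda r: r[0])  # oldest first
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Synthetic hourly readings (made-up values, for illustration only).
start = datetime(2024, 1, 1)
readings = [(start + timedelta(hours=h), 10.0 + (h % 24) * 0.5) for h in range(100)]

train, test = time_based_split(readings)
```

With a random split, readings from neighbouring hours of the same day would land on both sides of the split; ordering by timestamp first keeps the test set strictly in the "future" relative to the training set.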


QUESTION: 19

You have a demand forecasting pipeline in production that uses Dataflow to preprocess raw data prior to model training and prediction. During preprocessing, you employ Z-score normalization on data stored in BigQuery and write it back to BigQuery. New training data is added every week. You want to make the process more efficient by minimizing computation time and manual intervention.
What should you do?

Normalize the data using Google Kubernetes Engine
Translate the normalization algorithm into SQL for use with BigQuery
Use the normalizer_fn argument in TensorFlow's Feature Column API
Normalize the data with Apache Spark using the Dataproc connector for BigQuery

Answer(s): B

Explanation:

Z-score normalization is a technique that transforms the values of a numeric variable into standardized units, such that the mean is zero and the standard deviation is one. It puts variables with different scales and ranges on a comparable footing. Because it is a linear rescaling, it does not change the shape of the distribution. The formula for z-score normalization is:
z = (x - mu) / sigma
where x is the original value, mu is the mean of the variable, and sigma is the standard deviation of the variable.
Dataflow is a service that allows you to create and run data processing pipelines on Google Cloud. You can use Dataflow to preprocess raw data prior to model training and prediction, such as applying z-score normalization on data stored in BigQuery. However, using Dataflow for this task may not be the most efficient option, as it involves reading and writing data from and to BigQuery, which can be time-consuming and costly. Moreover, using Dataflow requires manual intervention to update the pipeline whenever new training data is added.
A more efficient way to perform z-score normalization on data stored in BigQuery is to translate the normalization algorithm into SQL and use it with BigQuery. BigQuery is a service that allows you to analyze large-scale and complex data using SQL queries. You can use BigQuery to perform z-score normalization on your data using the aggregate functions AVG() and STDDEV_POP() together with an OVER() window clause. For example, the following SQL query can normalize the values of a column called temperature in a table called weather:
SELECT (temperature - AVG(temperature) OVER ()) / STDDEV_POP(temperature) OVER () AS normalized_temperature FROM weather;
By using SQL to perform z-score normalization on BigQuery, you can make the process more efficient by minimizing computation time and manual intervention. You can also leverage the scalability and performance of BigQuery to handle large and complex datasets. Therefore, translating the normalization algorithm into SQL for use with BigQuery is the best option for this use case.
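As a cross-check on the formula, the same computation can be written in plain Python. This is a small sketch for verifying the arithmetic on a handful of values, not a replacement for the BigQuery query; the `z_scores` helper and the sample temperatures are made up for the example.

```python
import statistics


def z_scores(values):
    """Return population z-scores, mirroring AVG() and STDDEV_POP() in the query above."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


temperatures = [12.0, 15.0, 18.0, 21.0, 24.0]  # made-up sample values
normalized = z_scores(temperatures)
```

After normalization the values have mean 0 and population standard deviation 1, which is the property the SQL version is expected to preserve.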


QUESTION: 20

You were asked to investigate failures of a production line component based on sensor readings. After receiving the dataset, you discover that less than 1% of the readings are positive examples representing failure incidents. You have tried to train several classification models, but none of them converge. How should you resolve the class imbalance problem?

Use the class distribution to generate 10% positive examples
Use a convolutional neural network with max pooling and softmax activation
Downsample the data with upweighting to create a sample with 10% positive examples
Remove negative examples until the numbers of positive and negative examples are equal

Answer(s): C

Explanation:

The class imbalance problem is a common challenge in machine learning, especially in classification tasks. It occurs when the distribution of the target classes is highly skewed, such that one class (the majority class) has many more examples than the other class (the minority class). The minority class is often the more interesting or important class, such as failure incidents, fraud cases, or rare diseases. However, most machine learning algorithms are designed to optimize overall accuracy, which can be biased towards the majority class and ignore the minority class. This can result in poor predictive performance, especially for the minority class.
There are different techniques to deal with the class imbalance problem, such as data-level methods, algorithm-level methods, and evaluation-level methods [1]. Data-level methods involve resampling the original dataset to create a more balanced class distribution. There are two main types of data-level methods: oversampling and undersampling. Oversampling methods increase the number of examples in the minority class, either by duplicating existing examples or by generating synthetic examples. Undersampling (downsampling) methods reduce the number of examples in the majority class, either by randomly removing examples or by using clustering or other criteria to select representative examples. Downsampling is commonly paired with upweighting: the downsampled class receives an example weight equal to the downsampling factor, so that the weighted data still reflects the original class proportions.
For the use case of investigating failures of a production line component based on sensor readings, the best option is to downsample the data with upweighting to create a sample with 10% positive examples. This option involves randomly removing some of the negative examples (the majority class) until the ratio of positive to negative examples is 1:9, and then assigning each remaining negative example a weight equal to the factor by which the negatives were downsampled. The model then sees positive examples far more often during training, which helps it converge, while the weights keep its predicted probabilities consistent with the original class distribution. This option also reduces the computation time and memory usage, as the size of the dataset is reduced. Therefore, downsampling the data with upweighting to create a sample with 10% positive examples is the best option for this use case.
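Downsampling with upweighting can be sketched as follows. This is an illustrative outline under stated assumptions, not a production implementation: the `downsample_and_upweight` helper, the (features, label) layout with label 1 meaning failure, and the synthetic data are all hypothetical. The weight follows the usual convention of upweighting the downsampled class by the downsampling factor.

```python
import random


def downsample_and_upweight(examples, target_pos_fraction=0.10, seed=0):
    """Downsample negatives to reach the target positive fraction, then upweight them.

    `examples` is a list of (features, label) pairs with label 1 = failure.
    Returns (features, label, weight) triples. Hypothetical helper for illustration.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]

    # Number of negatives to keep so positives make up the target fraction.
    keep = round(len(positives) * (1 - target_pos_fraction) / target_pos_fraction)
    keep = min(keep, len(negatives))
    if keep == 0:
        raise ValueError("need at least one positive and one negative example")
    kept_negatives = random.Random(seed).sample(negatives, keep)

    # Each kept negative stands in for `factor` original negatives.
    factor = len(negatives) / keep
    sample = [(x, y, 1.0) for x, y in positives]
    sample += [(x, y, factor) for x, y in kept_negatives]
    return sample


# Synthetic data: 10 failures among 1,000 readings (1% positive).
data = [([i], 1) for i in range(10)] + [([i], 0) for i in range(10, 1000)]
balanced = downsample_and_upweight(data)
```

The weight column would typically be passed to the training routine as a per-example weight, so the loss still reflects the original 1% failure rate even though the sample is 10% positive.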

Reference:

A Systematic Study of the Class Imbalance Problem in Convolutional Neural Networks


QUESTION: 21

You need to design a customized deep neural network in Keras that will predict customer purchases based on their purchase history. You want to explore model performance using multiple model architectures, store training data, and be able to compare the evaluation metrics in the same dashboard.
What should you do?

Create multiple models using AutoML Tables
Automate multiple training runs using Cloud Composer
Run multiple training jobs on AI Platform with similar job names
Create an experiment in Kubeflow Pipelines to organize multiple runs

Answer(s): D

Explanation:

Kubeflow Pipelines is a service that allows you to create and run machine learning workflows on Google Cloud using various features, model architectures, and hyperparameters. You can use Kubeflow Pipelines to scale up your workflows, leverage distributed training, and access specialized hardware such as GPUs and TPUs [1]. An experiment in Kubeflow Pipelines is a workspace where you can try different configurations of your pipelines and organize your runs into logical groups. You can use experiments to compare the performance of different models and track the evaluation metrics in the same dashboard [2].
For the use case of designing a customized deep neural network in Keras that will predict customer purchases based on their purchase history, the best option is to create an experiment in Kubeflow Pipelines to organize multiple runs. This option allows you to explore model performance using multiple model architectures, store training data, and compare the evaluation metrics in the same dashboard. You can use Keras to build and train your deep neural network models, and then package them as pipeline components that can be reused and combined with other components. You can also use Kubeflow Pipelines SDK to define and submit your pipelines programmatically, and use Kubeflow Pipelines UI to monitor and manage your experiments. Therefore, creating an experiment in Kubeflow Pipelines to organize multiple runs is the best option for this use case.

Reference:

Kubeflow Pipelines documentation
Experiment | Kubeflow


QUESTION: 22

Your team needs to build a model that predicts whether images contain a driver's license, passport, or credit card. The data engineering team already built the pipeline and generated a dataset composed of 10,000 images with driver's licenses, 1,000 images with passports, and 1,000 images with credit cards. You now have to train a model with the following label map: ['drivers_license', 'passport', 'credit_card'].
Which loss function should you use?

Categorical hinge
Binary cross-entropy
Categorical cross-entropy
Sparse categorical cross-entropy

Answer(s): C

Explanation:

Categorical cross-entropy is a loss function that is suitable for multi-class classification problems, where the target variable has more than two possible values. Categorical cross-entropy measures the difference between the true probability distribution of the target classes and the predicted probability distribution of the model. It is defined as:
L = -sum(y_i * log(p_i))
where y_i is the true probability of class i, and p_i is the predicted probability of class i. Categorical cross-entropy penalizes the model for making incorrect predictions, and encourages the model to assign high probabilities to the correct classes and low probabilities to the incorrect classes. It expects the true labels as one-hot vectors (for example, [1, 0, 0] for 'drivers_license'), whereas sparse categorical cross-entropy computes the same quantity from integer class indices.
For the use case of building a model that predicts whether images contain a driver's license, passport, or credit card, categorical cross-entropy is the appropriate loss function to use. This is because the problem is a multi-class classification problem, where the target variable has three possible values: ['drivers_license', 'passport', 'credit_card']. The label map is a list that maps the class names to the class indices, such that 'drivers_license' corresponds to index 0, 'passport' corresponds to index 1, and 'credit_card' corresponds to index 2. The model should output a probability distribution over the three classes for each image, and the categorical cross-entropy loss function should compare the output with the true labels. Therefore, categorical cross-entropy is the best loss function for this use case.
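To make the formula concrete, here is a small worked example in plain Python. It is a sketch for a single image rather than a framework loss function: the `categorical_cross_entropy` helper, the `eps` guard, and the predicted probabilities are assumptions made for illustration.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum(y_i * log(p_i)) for one example; eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


label_map = ["drivers_license", "passport", "credit_card"]

# One-hot target for a passport image (index 1 in the label map).
y_true = [0.0, 1.0, 0.0]
y_pred = [0.2, 0.7, 0.1]  # made-up softmax output

loss = categorical_cross_entropy(y_true, y_pred)  # equals -log(0.7)
```

Because the target is one-hot, only the term for the true class survives, so the loss reduces to the negative log of the probability assigned to 'passport'.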


QUESTION: 23

You are an ML engineer at a global car manufacturer. You need to build an ML model to predict car sales in different cities around the world.
Which features or feature crosses should you use to train city-specific relationships between car type and number of sales?

Three individual features: binned latitude, binned longitude, and one-hot encoded car type
One feature obtained as an element-wise product between latitude, longitude, and car type
One feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type
Two feature crosses as element-wise products: the first between binned latitude and one-hot encoded car type, and the second between binned longitude and one-hot encoded car type

Answer(s): C

Explanation:

A feature cross is a synthetic feature that is obtained by combining two or more existing features, usually by taking their product or concatenation. A feature cross can help to capture the nonlinear and interaction effects between the original features, and improve the predictive performance of the model. A feature cross can be applied to different types of features, such as numeric, categorical, or geospatial features [1].
For the use case of building an ML model to predict car sales in different cities around the world, the best option is to use one feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type. This option involves creating a feature cross that combines three individual features: binned latitude, binned longitude, and one-hot encoded car type. Binning is a technique that transforms a continuous numeric feature into a discrete categorical feature by dividing its range into equal intervals, or bins. One-hot encoding is a technique that transforms a categorical feature into a binary vector, where each element corresponds to a possible category, and has a value of 1 if the feature belongs to that category, and 0 otherwise. By applying binning and one-hot encoding to the latitude, longitude, and car type features, the feature cross can capture the city-specific relationships between car type and number of sales, as each combination of bins and car types can represent a different city and its preference for a certain car type. For example, the feature cross can learn that a city with a latitude bin of [40, 50], a longitude bin of [-80, -70], and a car type of SUV has a higher number of sales than a city with a latitude bin of [-10, 0], a longitude bin of [10, 20], and a car type of sedan. Therefore, using one feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type is the best option for this use case.
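The three-way cross can be sketched in plain Python as the flattened outer product of the three one-hot vectors. This is a minimal illustration: the helper names, the 10-degree bin width, the `CAR_TYPES` vocabulary, and the sample coordinates are assumptions, not values from the question.

```python
def one_hot(index, size):
    """Binary vector with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def bin_index(value, low, high, n_bins):
    """Equal-width bin index for `value`, clamped to the range [0, n_bins - 1]."""
    width = (high - low) / n_bins
    return max(0, min(int((value - low) / width), n_bins - 1))


def feature_cross(*one_hots):
    """Flattened outer product of one-hot vectors; exactly one element is 1."""
    crossed = [1]
    for vec in one_hots:
        crossed = [a * b for a in crossed for b in vec]
    return crossed


CAR_TYPES = ["sedan", "suv", "truck"]  # hypothetical vocabulary

lat_bin = bin_index(45.0, -90.0, 90.0, 18)     # 10-degree latitude bins
lon_bin = bin_index(-75.0, -180.0, 180.0, 36)  # 10-degree longitude bins
crossed = feature_cross(
    one_hot(lat_bin, 18),
    one_hot(lon_bin, 36),
    one_hot(CAR_TYPES.index("suv"), len(CAR_TYPES)),
)
```

Each position in `crossed` corresponds to one (latitude bin, longitude bin, car type) combination, so a linear model gets a separate weight per location cell and car type. Crossing latitude and longitude separately with car type, as in the last option, cannot isolate a single cell in this way.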

Reference:

Feature Crosses | Machine Learning Crash Course


QUESTION: 24

You trained a text classification model. You have the following SignatureDefs:

What is the correct way to write the predict request?

data = json.dumps({"signature_name": "serving_default", "instances": [['ab', 'bc', 'cd']]})
data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b', 'c', 'd', 'e', 'f']]})
data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b', 'c'], ['d', 'e', 'f']]})
data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b'], ['c', 'd'], ['e', 'f']]})

Answer(s): D

Explanation:

A predict request is a way to send data to a trained model and get predictions in return. A predict request can be written in different formats, such as JSON, protobuf, or gRPC, depending on the service and the platform that are used to host and serve the model. A predict request usually contains the following information:
The signature name: This is the name of the signature that defines the inputs and outputs of the model. A signature is a way to specify the expected format, type, and shape of the data that the model can accept and produce. A signature can be specified when exporting or saving the model, or it can be automatically inferred by the service or the platform. A model can have multiple signatures, but only one can be used for each predict request.
The instances: This is the data that is sent to the model for prediction. The instances can be a single instance or a batch of instances, depending on the size and shape of the data. The instances should match the input specification of the signature, such as the number, name, and type of the input tensors.
For the use case of training a text classification model, the correct way to write the predict request is option D: data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b'], ['c', 'd'], ['e', 'f']]}). This option writes the predict request in JSON format, which is a common and convenient format for sending and receiving data over the web. JSON stands for JavaScript Object Notation, and it is a way to represent data as a collection of name-value pairs or an ordered list of values. JSON can be easily converted to and from Python objects using the json module. This option also uses the signature name "serving_default", which is the default signature name that is assigned to the model when it is saved or exported without specifying a custom signature name. The serving_default signature defines the input and output tensors of the model based on the SignatureDef given in the question. According to the SignatureDef, the model expects an input tensor called "text" that has a shape of (-1, 2) and a type of DT_STRING, and produces an output tensor called "softmax" that has a shape of (-1, 2) and a type of DT_FLOAT. The -1 in the shape indicates that the dimension can vary depending on the number of instances, and the 2 indicates that the dimension is fixed at 2. The DT_STRING and DT_FLOAT indicate that the data type is string and float, respectively.
This option also sends a batch of three instances to the model for prediction. Each instance is a list of two strings, such as ['a', 'b'], ['c', 'd'], or ['e', 'f']. Together these instances match the input specification of the signature, as the batch has a shape of (3, 2) and a type of string. The other options do not: their instances contain three or six strings each, so the second dimension is not 2. The model will process these instances and produce a batch of three predictions with an overall shape of (3, 2), where each prediction is a softmax output of two floats. The softmax output is a probability distribution over the two possible classes that the model can predict, such as positive or negative sentiment. Therefore, writing the predict request as data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b'], ['c', 'd'], ['e', 'f']]}) is the correct and valid way to send data to the text classification model and get predictions in return.
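The request body from option D can be built and sanity-checked locally before it is sent anywhere. The sketch below only constructs the JSON string and verifies the batch shape; the serving endpoint is not given in the question, so the HTTP call itself is deliberately left out, and the `payload` and `shape` names are introduced just for this check.

```python
import json

# Request body from option D: a batch of 3 instances, each with 2 strings.
data = json.dumps({
    "signature_name": "serving_default",
    "instances": [["a", "b"], ["c", "d"], ["e", "f"]],
})

# Decode the string again and check it against the (-1, 2) input shape.
payload = json.loads(data)
shape = (len(payload["instances"]), len(payload["instances"][0]))
all_strings = all(
    isinstance(token, str) for row in payload["instances"] for token in row
)
```

Here `shape` comes out as (3, 2), which fits the (-1, 2) input: the first dimension is the batch size and can vary, while every instance must contain exactly two strings.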

Reference:

json: JSON encoder and decoder (Python documentation)


Google Professional Machine Learning Engineer Exam (page: 3) Google Professional Machine Learning Engineer Updated on: 10-Oct-2025
