Google Professional Machine Learning Engineer Exam (page: 3)
Updated on: 10-Oct-2025

You have trained a model on a dataset that required computationally expensive preprocessing operations. You need to execute the same preprocessing at prediction time. You deployed the model on AI Platform for high-throughput online prediction.
Which architecture should you use?

  1. · Validate the accuracy of the model that you trained on preprocessed data
    · Create a new model that uses the raw data and is available in real time
    · Deploy the new model onto AI Platform for online prediction
  2. · Send incoming prediction requests to a Pub/Sub topic
    · Transform the incoming data using a Dataflow job
    · Submit a prediction request to AI Platform using the transformed data
    · Write the predictions to an outbound Pub/Sub queue
  3. · Stream incoming prediction request data into Cloud Spanner
    · Create a view to abstract your preprocessing logic
    · Query the view every second for new records
    · Submit a prediction request to AI Platform using the transformed data
    · Write the predictions to an outbound Pub/Sub queue
  4. · Send incoming prediction requests to a Pub/Sub topic
    · Set up a Cloud Function that is triggered when messages are published to the Pub/Sub topic
    · Implement your preprocessing logic in the Cloud Function
    · Submit a prediction request to AI Platform using the transformed data
    · Write the predictions to an outbound Pub/Sub queue

Answer(s): D

Explanation:

Option A is incorrect because creating a new model that uses the raw data and is available in real time would require retraining the model and deploying it again, which is not efficient or scalable.
Option B is incorrect because using a Dataflow job to transform the incoming data would introduce unnecessary latency and complexity for online prediction, which requires fast and simple processing.
Option C is incorrect because using Cloud Spanner to stream and query the incoming data would incur high costs and overhead for online prediction, which does not need a relational database.
Option D is correct because using a Cloud Function to preprocess the data and submit a prediction request to AI Platform is a simple and scalable solution for online prediction, which leverages the serverless and event-driven features of Cloud Functions.



You are building a model to predict daily temperatures. You split the data randomly and then transformed the training and test datasets. Temperature data for model training is uploaded hourly. During testing, your model performed with 97% accuracy; however, after deploying to production, the model's accuracy dropped to 66%. How can you make your production model more accurate?

  1. Normalize the data for the training and test datasets as two separate steps.
  2. Split the training and test data based on time rather than a random split to avoid leakage.
  3. Add more data to your test set to ensure that you have a fair distribution and sample for testing.
  4. Apply data transformations before splitting, and cross-validate to make sure that the transformations are applied to both the training and test sets.

Answer(s): B

Explanation:

When building a model to predict daily temperatures, it is important to split the training and test data based on time rather than a random split. This is because temperature data is likely to have temporal dependencies and patterns, such as seasonality, trends, and cycles. If the data is split randomly, there is a risk of data leakage, which occurs when information from the future is used to train or validate the model. Data leakage can lead to overfitting and unrealistic performance estimates, as the model may learn from data that it should not have access to. By splitting the data based on time, such as using the most recent data as the test set and the older data as the training set, the model can be evaluated on how well it can forecast future temperatures based on past data, which is the realistic scenario in production. Therefore, splitting the data based on time rather than a random split is the best way to make the production model more accurate.
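The time-based split described above can be sketched in a few lines of Python. This is a minimal illustration, not the exam's reference solution: the function name `time_based_split`, the synthetic hourly readings, and the 80/20 ratio are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def time_based_split(records, test_fraction=0.2):
    """Split (timestamp, value) records so every test row is later than every training row."""
    ordered = sorted(records, key=lambda r: r[0])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    # Older rows go to training, the most recent rows go to testing.
    return ordered[:cut], ordered[cut:]


# Hypothetical data: 10 days of synthetic hourly temperature readings.
start = datetime(2024, 1, 1)
readings = [(start + timedelta(hours=h), 10.0 + (h % 24) * 0.5) for h in range(240)]

train, test = time_based_split(readings, test_fraction=0.2)
print(len(train), len(test))
```

With a random split, hours from the same day would land on both sides of the split, so the test score would be inflated by near-duplicate neighbors. Cutting on the timestamp avoids that.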



You have a demand forecasting pipeline in production that uses Dataflow to preprocess raw data prior to model training and prediction. During preprocessing, you employ Z-score normalization on data stored in BigQuery and write it back to BigQuery. New training data is added every week. You want to make the process more efficient by minimizing computation time and manual intervention.
What should you do?

  1. Normalize the data using Google Kubernetes Engine
  2. Translate the normalization algorithm into SQL for use with BigQuery
  3. Use the normalizer_fn argument in TensorFlow's Feature Column API
  4. Normalize the data with Apache Spark using the Dataproc connector for BigQuery

Answer(s): B

Explanation:

Z-score normalization is a technique that transforms the values of a numeric variable into standardized units, such that the mean is zero and the standard deviation is one. It makes variables with different scales and ranges comparable. Because it is a linear rescaling, it does not remove outliers or change the skewness of the distribution. The formula for z-score normalization is:
$z = (x - \mu) / \sigma$
where $x$ is the original value, $\mu$ is the mean of the variable, and $\sigma$ is the standard deviation of the variable.
Dataflow is a service that allows you to create and run data processing pipelines on Google Cloud. You can use Dataflow to preprocess raw data prior to model training and prediction, such as applying z-score normalization on data stored in BigQuery. However, using Dataflow for this task may not be the most efficient option, as it involves reading and writing data from and to BigQuery, which can be time-consuming and costly. Moreover, using Dataflow requires manual intervention to update the pipeline whenever new training data is added.
A more efficient way to perform z-score normalization on data stored in BigQuery is to translate the normalization algorithm into SQL and run it in BigQuery. BigQuery is a service that allows you to analyze large-scale and complex data using SQL queries. You can compute z-scores with the aggregate functions AVG() and STDDEV_POP(), used as window functions through an OVER () clause. For example, the following SQL query normalizes the values of a column called temperature in a table called weather:
SELECT (temperature - AVG(temperature) OVER ()) / STDDEV_POP(temperature) OVER () AS normalized_temperature FROM weather;
By using SQL to perform z-score normalization on BigQuery, you can make the process more efficient by minimizing computation time and manual intervention. You can also leverage the scalability and performance of BigQuery to handle large and complex datasets. Therefore, translating the normalization algorithm into SQL for use with BigQuery is the best option for this use case.
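To make the arithmetic concrete, the same calculation can be reproduced in plain Python. This is only a sketch for checking the formula on a handful of values, not a replacement for the BigQuery query. The helper name `z_scores` and the sample temperatures are assumptions. `pstdev` is used because it is the population standard deviation, which is what STDDEV_POP computes.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Return (x - mean) / population standard deviation for each value."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical sample of the temperature column.
temperatures = [12.0, 15.0, 18.0, 21.0, 24.0]
normalized = z_scores(temperatures)
print(normalized)
```

After normalization, the values have a mean of zero and a population standard deviation of one, which is a quick sanity check to run on the SQL output as well.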



You were asked to investigate failures of a production line component based on sensor readings. After receiving the dataset, you discover that less than 1% of the readings are positive examples representing failure incidents. You have tried to train several classification models, but none of them converge. How should you resolve the class imbalance problem?

  1. Use the class distribution to generate 10% positive examples
  2. Use a convolutional neural network with max pooling and softmax activation
  3. Downsample the data with upweighting to create a sample with 10% positive examples
  4. Remove negative examples until the numbers of positive and negative examples are equal

Answer(s): C

Explanation:

The class imbalance problem is a common challenge in machine learning, especially in classification tasks. It occurs when the distribution of the target classes is highly skewed, such that one class (the majority class) has many more examples than the other class (the minority class). The minority class is often the more interesting or important class, such as failure incidents, fraud cases, or rare diseases. However, many machine learning algorithms optimize overall accuracy, which biases them toward the majority class at the expense of the minority class. This can result in poor predictive performance, especially for the minority class. There are different techniques to deal with the class imbalance problem, such as data-level methods, algorithm-level methods, and evaluation-level methods [1].
Data-level methods involve resampling the original dataset to create a more balanced class distribution. There are two main types of data-level methods: oversampling and undersampling. Oversampling methods increase the number of examples in the minority class, either by duplicating existing examples or by generating synthetic examples. Undersampling methods reduce the number of examples in the majority class, either by randomly removing examples or by using clustering or other criteria to select representative examples. Both oversampling and undersampling can be combined with example weighting, which assigns different weights to examples according to how their class was resampled.
For the use case of investigating failures of a production line component based on sensor readings, the best option is to downsample the data with upweighting to create a sample with 10% positive examples. This option involves randomly removing negative examples (the majority class) until the ratio of positive to negative examples is 1:9, and then upweighting the remaining negative examples by the downsampling factor so that the weighted class proportions still reflect the original data. The smaller, more balanced sample lets the model see positive examples more often during training, which helps it converge, while the weights keep its predicted probabilities calibrated. This option also reduces computation time and memory usage, as the size of the dataset is reduced. Therefore, downsampling the data with upweighting to create a sample with 10% positive examples is the best option for this use case.
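The downsampling-with-upweighting step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `downsample_and_upweight`, the `(features, label)` tuple layout, and the synthetic readings are hypothetical, and the weight is applied to the downsampled (negative) class, equal to the factor by which it was reduced.

```python
import random


def downsample_and_upweight(examples, target_positive_rate=0.10, seed=0):
    """Downsample negatives to reach the target positive rate and upweight the kept negatives.

    examples: list of (features, label) pairs, where label 1 marks a failure.
    Returns a list of (features, label, weight) triples.
    """
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    # Number of negatives to keep so that positives make up the target share.
    keep = round(len(positives) * (1 - target_positive_rate) / target_positive_rate)
    keep = max(1, min(keep, len(negatives)))
    kept_negatives = random.Random(seed).sample(negatives, keep)

    # Each kept negative stands in for `factor` original negatives.
    factor = len(negatives) / keep
    weighted = [(x, y, 1.0) for x, y in positives]
    weighted += [(x, y, factor) for x, y in kept_negatives]
    return weighted


# Hypothetical dataset: 10 failures among 2,000 sensor readings (0.5% positive).
readings = [([float(i)], 1 if i < 10 else 0) for i in range(2000)]
sample = downsample_and_upweight(readings)
print(len(sample))
```

The weights would typically be passed to the training routine as per-example sample weights, so the loss still accounts for the original class proportions.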


Reference:

A Systematic Study of the Class Imbalance Problem in Convolutional Neural Networks



You need to design a customized deep neural network in Keras that will predict customer purchases based on their purchase history. You want to explore model performance using multiple model architectures, store training data, and be able to compare the evaluation metrics in the same dashboard.
What should you do?

  1. Create multiple models using AutoML Tables
  2. Automate multiple training runs using Cloud Composer
  3. Run multiple training jobs on AI Platform with similar job names
  4. Create an experiment in Kubeflow Pipelines to organize multiple runs

Answer(s): D

Explanation:

Kubeflow Pipelines is a platform for building and running machine learning workflows, and it lets you run the same workflow with different features, model architectures, and hyperparameters. You can use Kubeflow Pipelines to scale up your workflows, leverage distributed training, and access specialized hardware such as GPUs and TPUs [1]. An experiment in Kubeflow Pipelines is a workspace where you can try different configurations of your pipelines and organize your runs into logical groups. You can use experiments to compare the performance of different models and track the evaluation metrics in the same dashboard [2].
For the use case of designing a customized deep neural network in Keras that will predict customer purchases based on their purchase history, the best option is to create an experiment in Kubeflow Pipelines to organize multiple runs. This option allows you to explore model performance using multiple model architectures, store training data, and compare the evaluation metrics in the same dashboard. You can use Keras to build and train your deep neural network models, and then package them as pipeline components that can be reused and combined with other components. You can also use Kubeflow Pipelines SDK to define and submit your pipelines programmatically, and use Kubeflow Pipelines UI to monitor and manage your experiments. Therefore, creating an experiment in Kubeflow Pipelines to organize multiple runs is the best option for this use case.


Reference:

Kubeflow Pipelines documentation
Experiment | Kubeflow



Your team needs to build a model that predicts whether images contain a driver's license, passport, or credit card. The data engineering team already built the pipeline and generated a dataset composed of 10,000 images with driver's licenses, 1,000 images with passports, and 1,000 images with credit cards. You now have to train a model with the following label map: ['drivers_license', 'passport', 'credit_card'].
Which loss function should you use?

  1. Categorical hinge
  2. Binary cross-entropy
  3. Categorical cross-entropy
  4. Sparse categorical cross-entropy

Answer(s): C

Explanation:

Categorical cross-entropy is a loss function that is suitable for multi-class classification problems, where the target variable has more than two possible values. Categorical cross-entropy measures the difference between the true probability distribution of the target classes and the predicted probability distribution of the model. It is defined as:
$L = -\sum_i y_i \log(p_i)$
where $y_i$ is the true probability of class $i$, and $p_i$ is the predicted probability of class $i$. Categorical cross-entropy penalizes the model for making incorrect predictions, and encourages the model to assign high probabilities to the correct classes and low probabilities to the incorrect classes.
For the use case of building a model that predicts whether images contain a driver's license, passport, or credit card, categorical cross-entropy is the appropriate loss function to use. This is because the problem is a multi-class classification problem, where the target variable has three possible values: ['drivers_license', 'passport', 'credit_card']. The label map is a list that maps the class names to the class indices, such that 'drivers_license' corresponds to index 0, 'passport' corresponds to index 1, and 'credit_card' corresponds to index 2. The model should output a probability distribution over the three classes for each image, and the categorical cross-entropy loss function should compare that output with the true labels, encoded as one-hot vectors. Therefore, categorical cross-entropy is the best loss function for this use case.
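The loss formula can be checked by hand with a small Python sketch. The helper names `one_hot` and `categorical_cross_entropy`, the predicted probabilities, and the `eps` guard against log(0) are assumptions made for illustration. A real model would use the loss provided by its training framework.

```python
import math

LABELS = ["drivers_license", "passport", "credit_card"]


def one_hot(label):
    """Encode a class name as a one-hot vector following the label map order."""
    return [1.0 if name == label else 0.0 for name in LABELS]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(p_i), with p_i clipped away from zero."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


# Hypothetical prediction for an image whose true class is 'passport'.
y_true = one_hot("passport")
y_pred = [0.2, 0.7, 0.1]
loss = categorical_cross_entropy(y_true, y_pred)
print(loss)
```

Because the true label is one-hot, only the term for the correct class survives, so the loss reduces to the negative log of the probability assigned to 'passport'.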



You are an ML engineer at a global car manufacturer. You need to build an ML model to predict car sales in different cities around the world.
Which features or feature crosses should you use to train city-specific relationships between car type and number of sales?

  1. Three individual features: binned latitude, binned longitude, and one-hot encoded car type
  2. One feature obtained as an element-wise product between latitude, longitude, and car type
  3. One feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type
  4. Two feature crosses as element-wise products: the first between binned latitude and one-hot encoded car type, and the second between binned longitude and one-hot encoded car type

Answer(s): C

Explanation:

A feature cross is a synthetic feature that is obtained by combining two or more existing features, usually by taking their product or concatenation. A feature cross can help to capture the nonlinear and interaction effects between the original features, and improve the predictive performance of the model. A feature cross can be applied to different types of features, such as numeric, categorical, or geospatial features [1].
For the use case of building an ML model to predict car sales in different cities around the world, the best option is to use one feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type. This option involves creating a feature cross that combines three individual features: binned latitude, binned longitude, and one-hot encoded car type. Binning is a technique that transforms a continuous numeric feature into a discrete categorical feature by dividing its range into equal intervals, or bins. One-hot encoding is a technique that transforms a categorical feature into a binary vector, where each element corresponds to a possible category, and has a value of 1 if the feature belongs to that category, and 0 otherwise. By applying binning and one-hot encoding to the latitude, longitude, and car type features, the feature cross can capture the city-specific relationships between car type and number of sales, as each combination of bins and car types can represent a different city and its preference for a certain car type. For example, the feature cross can learn that a city with a latitude bin of [40, 50], a longitude bin of [-80, -70], and a car type of SUV has a higher number of sales than a city with a latitude bin of [-10, 0], a longitude bin of [10, 20], and a car type of sedan. Therefore, using one feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type is the best option for this use case.
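A small Python sketch shows what such a cross looks like in practice. It assumes 10-degree bins, a hypothetical three-item car type vocabulary, and hypothetical helper names (`bin_index`, `one_hot`, `cross`). The cross is built here as the flattened product of the one-hot vectors, which is one common way to realize the "product" in the option text.

```python
def bin_index(value, low, high, n_bins):
    """Map a continuous value to the index of its equal-width bin."""
    width = (high - low) / n_bins
    idx = int((value - low) // width)
    return max(0, min(idx, n_bins - 1))  # clamp the upper edge into the last bin


def one_hot(index, size):
    """Return a vector of length `size` with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cross(*vectors):
    """Multiply every combination of elements, producing one flattened crossed feature."""
    result = [1.0]
    for vec in vectors:
        result = [a * b for a in result for b in vec]
    return result


CAR_TYPES = ["sedan", "suv", "truck"]  # hypothetical vocabulary

# Hypothetical sale: an SUV sold near latitude 45.5, longitude -73.6.
lat = one_hot(bin_index(45.5, -90.0, 90.0, 18), 18)
lon = one_hot(bin_index(-73.6, -180.0, 180.0, 36), 36)
car = one_hot(CAR_TYPES.index("suv"), len(CAR_TYPES))

crossed = cross(lat, lon, car)
print(len(crossed), crossed.index(1.0))
```

The crossed vector has one slot per (latitude bin, longitude bin, car type) combination, so a linear model can learn a separate weight for each location and car type pair. Three separate features, or two partial crosses, cannot express that.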


Reference:

Feature Crosses | Machine Learning Crash Course



You trained a text classification model. You have the following SignatureDefs:
(Image not reproduced. Per the explanation below, the serving_default SignatureDef has an input tensor "text" of type DT_STRING with shape (-1, 2) and an output tensor "softmax" of type DT_FLOAT with shape (-1, 2).)



What is the correct way to write the predict request?

  1. data = json.dumps({"signature_name": "serving_default", "instances": [['ab', 'bc', 'cd']]})
  2. data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b', 'c', 'd', 'e', 'f']]})
  3. data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b', 'c'], ['d', 'e', 'f']]})
  4. data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b'], ['c', 'd'], ['e', 'f']]})

Answer(s): D

Explanation:

A predict request is a way to send data to a trained model and get predictions in return. Depending on the service and the platform that host and serve the model, a predict request can be sent as JSON over REST or as protocol buffers over gRPC. A predict request usually contains the following information:
  · The signature name: This is the name of the signature that defines the inputs and outputs of the model. A signature is a way to specify the expected format, type, and shape of the data that the model can accept and produce. A signature can be specified when exporting or saving the model, or it can be automatically inferred by the service or the platform. A model can have multiple signatures, but only one can be used for each predict request.
  · The instances: This is the data that is sent to the model for prediction. The instances can be a single instance or a batch of instances, depending on the size and shape of the data. The instances should match the input specification of the signature, such as the number, name, and type of the input tensors.
For the use case of training a text classification model, the correct way to write the predict request is option D:
data = json.dumps({"signature_name": "serving_default", "instances": [['a', 'b'], ['c', 'd'], ['e', 'f']]})
This option writes the predict request in JSON format, which is a common and convenient format for sending and receiving data over the web. JSON stands for JavaScript Object Notation, and it represents data as a collection of name-value pairs or an ordered list of values. JSON can be easily converted to and from Python objects using the json module.
This option also uses the signature name "serving_default", which is the default signature name that is assigned to the model when it is saved or exported without specifying a custom signature name. The serving_default signature defines the input and output tensors of the model based on the SignatureDef shown in the question. According to the SignatureDef, the model expects an input tensor called "text" that has a shape of (-1, 2) and a type of DT_STRING, and produces an output tensor called "softmax" that has a shape of (-1, 2) and a type of DT_FLOAT. The -1 in the shape indicates that the dimension can vary depending on the number of instances, and the 2 indicates that the dimension is fixed at 2. DT_STRING and DT_FLOAT indicate that the data type is string and float, respectively.
This option also sends a batch of three instances to the model for prediction. Each instance is a list of two strings: ['a', 'b'], ['c', 'd'], and ['e', 'f']. These instances match the input specification of the signature, as the batch has a shape of (3, 2) and a type of string. The other options do not match: their instances contain three or six strings each, so the second dimension is not 2. The model will process these instances and produce a batch of three predictions, each a softmax output with two float values, for an overall output shape of (3, 2). The softmax output is a probability distribution over the two possible classes that the model can predict, such as positive or negative sentiment. Therefore, option D is the correct and valid way to send data to the text classification model and get predictions in return.
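The request body can be built and checked locally before it is sent. The sketch below only constructs the JSON payload. The shape check is an added illustration based on the (-1, 2) input shape described above, and sending the payload to a serving endpoint is deliberately left out because the endpoint URL is not given in the question.

```python
import json

# Three instances, each a list of exactly two strings, matching shape (-1, 2).
instances = [["a", "b"], ["c", "d"], ["e", "f"]]

# Guard against the malformed shapes seen in the other options.
if not all(len(row) == 2 for row in instances):
    raise ValueError("every instance must contain exactly two strings")

data = json.dumps({"signature_name": "serving_default", "instances": instances})
print(data)

# Round-trip to confirm the payload is valid JSON with the expected structure.
decoded = json.loads(data)
```

Note that `json.dumps` emits double-quoted strings, so the single quotes in the Python source do not appear in the request body.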


Reference:

json: JSON encoder and decoder | Python documentation


