Databricks Databricks-Certified-Professional-Data-Engineer Exam
Databricks Certified Data Engineer Professional
Updated on: 02-Jan-2026

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

  1. Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
  2. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
  3. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
  4. Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
  5. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.

Answer(s): E
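The key detail is Delta Lake's default deleted-file retention of 7 days: a DELETE only marks the underlying data files as removed in the transaction log, and VACUUM purges only files past the retention threshold. The Monday VACUUM runs roughly 26 hours after the Sunday deletes, so it skips those files; they are purged by the VACUUM run 8 days after the delete, and until then remain reachable via time travel. A minimal PySpark sketch of the weekly cycle (the user_data and forget_requests table names and the version number are illustrative, not from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sunday 1am: the delete job rewrites the affected data files. The old
    # files are only tombstoned in the transaction log; they stay in cloud
    # storage and remain reachable through time travel.
    spark.sql("DELETE FROM user_data WHERE user_id IN (SELECT user_id FROM forget_requests)")

    # Monday 3am: with the default 7-day retention, this VACUUM skips files
    # tombstoned only ~26 hours earlier; the run one week later (8 days
    # after the delete) is the one that actually purges them.
    spark.sql("VACUUM user_data")

    # Until that purge, the pre-delete snapshot is still queryable:
    spark.sql("SELECT * FROM user_data VERSION AS OF 41")  # version number illustrative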



A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.

[JSON job specification not shown in the source.]

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

  1. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
  2. The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
  3. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
  4. One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
  5. The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.

Answer(s): C
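Each POST to 2.0/jobs/create registers a new job definition and returns a fresh job_id; the endpoint does not deduplicate by name, and defining a job never starts a run (runs require a schedule in the spec or a call to 2.0/jobs/run-now). Three identical POSTs therefore yield three dormant jobs. A hedged sketch of the behavior, assuming the workspace URL and token sit in illustrative environment variables and standing in for the unshown payload:

    import os
    import requests

    # DATABRICKS_HOST / DATABRICKS_TOKEN are illustrative variable names.
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]

    job_spec = {
        "name": "Ingest new data",
        # task and cluster settings from the question's JSON would go here;
        # the payload itself is not shown in the source.
    }

    for _ in range(3):
        resp = requests.post(
            f"{host}/api/2.0/jobs/create",
            headers={"Authorization": f"Bearer {token}"},
            json=job_spec,
        )
        print(resp.json())  # each call returns a new job_id; no run is triggered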



An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?

  1. Create a separate history table for each pk_id; resolve the current state of the table by running a UNION ALL and filtering the history tables for the most recent state.
  2. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
  3. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
  4. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
  5. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.

Answer(s): E
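Option E keeps the raw CDC log as an immutable audit record in the bronze table and collapses it to current state in silver. Because a key may change several times within an hour, each batch must first be deduplicated to the latest change per pk_id before the merge. A sketch under stated assumptions: the bronze_cdc and silver_current table names and the change_ts ordering column are illustrative (the question only guarantees a change type and post-change field values), and the UPDATE SET * / INSERT * clauses assume the two tables share a column set.

    from pyspark.sql import Window, functions as F

    # Latest change per key within the newly ingested bronze records.
    latest = (
        spark.table("bronze_cdc")
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("pk_id").orderBy(F.col("change_ts").desc())))
        .filter("rn = 1")
        .drop("rn")
    )
    latest.createOrReplaceTempView("latest_changes")

    # Apply inserts/updates/deletes to the silver "current state" table.
    spark.sql("""
        MERGE INTO silver_current AS t
        USING latest_changes AS s
        ON t.pk_id = s.pk_id
        WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
    """)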



An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

  1. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.
  2. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  3. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
  4. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
  5. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.

Answer(s): C
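Option C touches only one hour of history instead of rescanning millions of accounts: filter account_history down to the rows written in the hour just processed, deduplicate to one row per user_id, and merge. A sketch, with batch_start as an illustrative cutoff for the processed hour (the question does not specify how the job tracks it):

    from pyspark.sql import Window, functions as F

    batch_start = 1735689600  # illustrative epoch cutoff for the processed hour

    recent = (
        spark.table("account_history")
        .filter(F.col("last_updated") >= batch_start)
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())))
        .filter("rn = 1")
        .drop("rn")
    )
    recent.createOrReplaceTempView("account_updates")

    # Type 1 upsert: keep only the newest value per user_id.
    spark.sql("""
        MERGE INTO account_current AS t
        USING account_updates AS s
        ON t.user_id = s.user_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

The tens of thousands of hourly rows, not the millions of accounts, drive the merge cost, since account_current and account_history share the same schema and the join touches only matching keys.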



A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?

  1. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
  2. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
  3. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
  4. Modify the overwrite logic to include a field populated by calling pyspark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
  5. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Answer(s): E
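Option E works because a MERGE rewrites only the rows that actually changed, and Delta Lake's change data feed then exposes exactly those rows to the prediction job. A sketch under stated assumptions: the change data feed must be enabled before changes are captured, and last_version (a tracker for the last table version already scored) is an illustrative variable, not part of the question:

    # One-time: enable the change data feed on the table.
    spark.sql("""
        ALTER TABLE customer_churn_params
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    last_version = 42  # illustrative: the last table version already scored

    changed = spark.sql(f"""
        SELECT * FROM table_changes('customer_churn_params', {last_version + 1})
        WHERE _change_type IN ('insert', 'update_postimage')
    """)
    # Score the churn model on `changed` only (model invocation omitted).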


