Amazon AWS Certified Data Engineer - Associate DEA-C01 Exam Questions in PDF

Free Amazon AWS Certified Data Engineer - Associate DEA-C01 Dumps Questions (page: 4)

A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.
Which solution will meet this requirement MOST cost-effectively?

  A. Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
  B. Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
  C. Use Amazon Athena Federated Query to join the data from all data sources.
  D. Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

Answer(s): C

Explanation:

The correct answer is C.
A) Incorrect: A provisioned EMR cluster incurs provisioning and ongoing compute costs, so it is not the most cost-effective option for a one-time analysis.
B) Incorrect: Copying the data into S3 adds ETL effort, storage cost, and time, which increases the total cost of a one-time analysis.
C) Correct: Athena Federated Query provides serverless, on-demand access to multiple data sources (DynamoDB, RDS, Redshift, S3) through data source connectors, with pay-per-query pricing. This minimizes setup and cost for a one-off analysis.
D) Incorrect: Redshift Spectrum queries data in Amazon S3 only. It cannot read directly from DynamoDB or Amazon RDS, and it requires a running Redshift cluster.
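
As a rough illustration of the federated approach, the sketch below builds the parameters for a single Athena query that joins a DynamoDB-backed catalog with an RDS-backed catalog. The catalog, database, table, and column names are hypothetical, and the sketch assumes the matching Athena data source connectors are already registered under those catalog names.

```python
def build_federated_join(ddb_catalog, rds_catalog, output_location):
    """Build parameters for an Athena query that joins two federated catalogs.

    All database, table, and column names below are hypothetical placeholders.
    """
    query = (
        "SELECT c.customer_id, c.name, o.order_total\n"
        f'FROM "{rds_catalog}"."sales"."customers" AS c\n'
        f'JOIN "{ddb_catalog}"."default"."orders" AS o\n'
        "  ON c.customer_id = o.customer_id"
    )
    return {
        "QueryString": query,
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


params = build_federated_join(
    "ddb_catalog", "rds_catalog", "s3://example-athena-results/"
)
print(params["QueryString"])
# With boto3 installed and credentials configured, the query could be submitted with:
#   boto3.client("athena").start_query_execution(**params)
```

Because the join runs once and is billed per query, nothing has to be provisioned or copied beforehand apart from the connectors.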



A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance.
Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)

  A. Use Hadoop Distributed File System (HDFS) as a persistent data store.
  B. Use Amazon S3 as a persistent data store.
  C. Use x86-based instances for core nodes and task nodes.
  D. Use Graviton instances for core nodes and task nodes.
  E. Use Spot Instances for all primary nodes.

Answer(s): B,D

Explanation:

A robust, cost-effective EMR setup uses S3 for persistent storage and Graviton-based core/task nodes for efficiency and price performance.
A) HDFS as persistent storage is discouraged for long-running, cost-optimized EMR workloads because S3 provides durable, scalable object storage with lower management overhead.
B) S3 as persistent data store is correct due to durability, lifecycle management, and lower maintenance for long-running Spark jobs.
C) x86-based instances for core and task nodes are generally not as cost-efficient as Graviton-based instances for many EMR workloads.
D) Graviton instances offer better price/performance for Spark workloads on EMR, improving TCO.
E) Spot Instances for all primary nodes risks interruption and is unsuitable for continuous, high-reliability workloads.
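
The combination of B and D can be sketched as a cluster request. The snippet below is a minimal sketch that only builds the request dictionary. The cluster name, release label, instance type, and bucket name are assumptions chosen for illustration, and the IAM roles that a real request needs are omitted.

```python
def build_emr_cluster_request(log_bucket, core_count=2, task_count=2):
    """Build a hypothetical EMR request: Graviton nodes, S3 for persistence."""
    graviton_type = "m6g.xlarge"  # assumed Graviton2 instance type

    def group(name, role, count):
        return {
            "Name": name,
            "InstanceRole": role,
            "Market": "ON_DEMAND",  # primary node is not placed on Spot
            "InstanceType": graviton_type,
            "InstanceCount": count,
        }

    return {
        "Name": "spark-analytics",      # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",   # assumed release label
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
            "InstanceGroups": [
                group("primary", "MASTER", 1),
                group("core", "CORE", core_count),
                group("task", "TASK", task_count),
            ],
        },
    }


request = build_emr_cluster_request("example-data-lake")
# Spark jobs would then read and write s3:// paths instead of hdfs:// paths,
# so the data outlives any individual node or cluster.
# With boto3 and the required roles added, the request could be passed to:
#   boto3.client("emr").run_job_flow(**request)
```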



A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.
Which solution will meet these requirements with the LEAST operational overhead?

  A. Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
  B. Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
  C. Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
  D. Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

Answer(s): C

Explanation:

The correct answer is C.
A) Incorrect: Staging in S3 and loading with COPY introduces batching and latency, so the data is not available in near real time, and the extra storage step adds overhead.
B) Incorrect: A materialized view can read from the stream only after an external schema maps the stream into Redshift, and refreshing the view manually adds operational overhead compared with auto refresh.
C) Correct: Redshift streaming ingestion uses an external schema that maps Kinesis Data Streams into Redshift and a materialized view that reads from the stream. With auto refresh turned on, new records are ingested without S3 staging or manual ETL, which delivers near real-time insights with minimal operational overhead.
D) Incorrect: Firehose, S3 staging, and COPY add extra components and latency, so this option has more overhead than a materialized view with auto refresh.
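
The two SQL statements behind answer C can be sketched as follows. This is a minimal sketch that only builds the statements as strings. The schema, view, stream, and IAM role names are hypothetical, and the view assumes the stream carries JSON records.

```python
def build_streaming_ingestion_sql(schema, view, stream_name, iam_role_arn):
    """Return the two statements for Redshift streaming ingestion from Kinesis.

    All identifiers passed in are hypothetical placeholders.
    """
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM KINESIS\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
    create_view = (
        f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS\n"
        f"SELECT approximate_arrival_timestamp,\n"
        f"       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream_name}";'
    )
    return create_schema, create_view


schema_sql, view_sql = build_streaming_ingestion_sql(
    "kds",
    "clickstream_mv",
    "example-clickstream",
    "arn:aws:iam::123456789012:role/example-redshift-streaming",
)
print(schema_sql)
print(view_sql)
```

Existing BI tools can then query the materialized view like any other Redshift relation.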



A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day.
A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs.
Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)

  A. Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
  B. Increase the AWS Glue instance size by scaling up the worker type.
  C. Convert the AWS Glue schema to the DynamicFrame schema class.
  D. Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.
  E. Modify the IAM role that grants access to AWS Glue to grant access to all S3 features.

Answer(s): A,B

Explanation:

A) Correct: Partitioning reduces the amount of data each job must scan, which speeds up Glue ETL over large S3 datasets and the downstream QuickSight queries.
B) Correct: A larger worker type provides more memory and compute per worker, which improves throughput and reduces job runtimes.
C) Incorrect: Choosing DynamicFrame over DataFrame changes the transformation API, not the core performance of the job.
D) Incorrect: Running the jobs less often delays updates and does not make each run faster.
E) Incorrect: Overly broad IAM permissions do not improve ETL performance.
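
To make option A concrete, the sketch below builds a Hive-style `year=/month=/day=` prefix for a given date. The bucket and base prefix are hypothetical. The key layout is the common convention that AWS Glue recognizes as partition columns.

```python
from datetime import date


def partition_prefix(base_prefix, day):
    """Return a Hive-style S3 prefix partitioned by year, month, and day."""
    base = base_prefix.rstrip("/")
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


prefix = partition_prefix("s3://example-bucket/usage", date(2024, 3, 7))
print(prefix)  # s3://example-bucket/usage/year=2024/month=03/day=07/
```

A job that only needs yesterday's data can then read a single day prefix instead of listing and scanning the whole bucket.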



A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must parallel process a large collection of data files and apply a specific transformation to each file.
Which Step Functions state should the data engineer use to meet these requirements?

  A. Parallel state
  B. Choice state
  C. Map state
  D. Wait state

Answer(s): C

Explanation:

The correct answer is C, the Map state.
A) Parallel state is for running multiple branches concurrently but does not automatically apply per-item transformation with dynamic collection; it is not designed to iterate over a collection of items.
B) Choice state selects between branches based on conditions, not for per-item processing across a collection.
C) Map state scales per-element processing by applying a defined workflow to each item in an input array, ideal for transforming every file in parallel.
D) Wait state introduces a delay and does not perform any per-item processing or parallel work.
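
A Map state of this kind might look like the following Amazon States Language definition, built here as a Python dictionary. This is a minimal sketch: the state names, the `$.files` input path, and the Lambda ARN are hypothetical, and it uses inline mode. Very large collections may call for distributed mode instead.

```python
import json


def build_map_state_machine(transform_lambda_arn, max_concurrency=10):
    """Build a state machine that applies one task to every item in $.files."""
    return {
        "StartAt": "TransformEachFile",
        "States": {
            "TransformEachFile": {
                "Type": "Map",
                "ItemsPath": "$.files",             # array of files in the input
                "MaxConcurrency": max_concurrency,  # parallel iterations allowed
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "TransformFile",
                    "States": {
                        "TransformFile": {
                            "Type": "Task",
                            "Resource": transform_lambda_arn,
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


definition = build_map_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:example-transform"
)
print(json.dumps(definition, indent=2))
```

Unlike a Parallel state, whose branches are fixed when the workflow is written, the number of iterations here follows the length of the input array.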



A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.
The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?

  A. Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame.drop_duplicates() function by importing the Pandas library to perform data deduplication.
  B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
  C. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
  D. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.

Answer(s): B

Explanation:

The correct answer is B. The AWS Glue FindMatches ML transform provides deduplication with minimal operational overhead.
A) Incorrect: Pandas drop_duplicates is in-memory and requires custom orchestration, not scalable with large S3 data; increases operational overhead.
B) Correct: AWS Glue FindMatches ML transform identifies duplicates with built-in, serverless deduplication; minimal maintenance and seamless integration with Glue ETL.
C) Incorrect: Python dedupe library requires custom code and management of similarity schemas and performance tuning; higher operational burden.
D) Incorrect: Importing Python dedupe in AWS Glue adds dependency management and custom logic, increasing complexity versus using managed FindMatches.
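
As a rough sketch of option B, the snippet below builds the parameters for creating a FindMatches transform through the AWS Glue API. The transform name, database, table, primary key column, role ARN, and tuning values are all hypothetical. A real transform also has to be taught with labeled examples before it is used in a job.

```python
def build_find_matches_request(database, table, primary_key, role_arn):
    """Build hypothetical parameters for a Glue FindMatches ML transform."""
    return {
        "Name": "legacy-dedup",  # hypothetical transform name
        "Role": role_arn,
        "InputRecordTables": [
            {"DatabaseName": database, "TableName": table}
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": primary_key,
                # Assumed midpoint settings; tune for the data set.
                "PrecisionRecallTradeoff": 0.5,
                "AccuracyCostTradeoff": 0.5,
            },
        },
    }


request = build_find_matches_request(
    "legacy_db",
    "customers",
    "record_id",
    "arn:aws:iam::123456789012:role/example-glue-role",
)
# With boto3 installed and credentials configured, this could be submitted with:
#   boto3.client("glue").create_ml_transform(**request)
```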



A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3.
Which actions will provide the FASTEST queries? (Choose two.)

  A. Use gzip compression to compress individual files to sizes that are between 1 GB and 5 GB.
  B. Use a columnar storage file format.
  C. Partition the data based on the most common query predicates.
  D. Split the data into files that are less than 10 KB.
  E. Use file formats that are not splittable.

Answer(s): B,C

Explanation:

Using a columnar storage file format and partitioning the data by common predicates yields the fastest Redshift Spectrum queries.
A) Not correct: A gzip-compressed file generally cannot be split, so each 1 GB to 5 GB file must be read as a whole by a single request. This limits parallelism and does not make scans faster.
B) Correct: Columnar formats (e.g., ORC, Parquet) enable predicate pushdown and selective column reading, speeding scans.
C) Correct: Partitioning by common predicates reduces the data scanned and improves query performance via pruning.
D) Not correct: 10 KB files create excessive metadata operations and overhead, hurting performance.
E) Not correct: Non-splittable formats hinder parallelism and slow queries; splittable formats enable efficient parallel reads.
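
The combined effect of B and C can be estimated with a toy model. The sketch below assumes every column has roughly the same width, which is a simplification made only for illustration, and the partition sizes are made-up numbers.

```python
def estimate_scan_gb(partitions, predicate, columns_read, total_columns):
    """Estimate data scanned after partition pruning and column selection.

    Assumes all columns are about the same size (a simplifying assumption).
    """
    matching = [p for p in partitions if predicate(p)]
    return sum(p["size_gb"] * columns_read / total_columns for p in matching)


# Twelve hypothetical monthly partitions of 10 GB each.
partitions = [{"year": 2024, "month": m, "size_gb": 10} for m in range(1, 13)]

full_scan = estimate_scan_gb(partitions, lambda p: True, 10, 10)
pruned_scan = estimate_scan_gb(partitions, lambda p: p["month"] == 3, 2, 10)
print(full_scan, pruned_scan)  # 120.0 2.0
```

In this model, a query that filters on one month and reads 2 of 10 columns touches about 2 GB instead of 120 GB.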



A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance.
The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet.
Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)

  A. Turn on the public access setting for the DB instance.
  B. Update the security group of the DB instance to allow only Lambda function invocations on the database port.
  C. Configure the Lambda function to run in the same subnet that the DB instance uses.
  D. Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
  E. Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.

Answer(s): C,D

Explanation:

A) Incorrect: Turning on public access would expose the DB instance to the internet, which contradicts the requirement for private connectivity.
B) Incorrect: A security group rule cannot reference "Lambda function invocations", and a rule on the DB instance alone is not sufficient while the function runs outside the VPC.
C) Correct: Running the Lambda function in the same VPC and subnet as the DB instance keeps the function's traffic inside the private network, with no internet exposure.
D) Correct: Attaching the same security group to both the Lambda function and the DB instance, with a self-referencing rule on the database port, allows traffic between members of the group without additional routing or public endpoints.
E) Incorrect: Modifying network ACLs adds unnecessary complexity and is not required when security groups already provide the isolation.
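
Steps C and D can be sketched as two API requests. The snippet below only builds the request dictionaries. The function name, subnet ID, security group ID, and port are hypothetical placeholders.

```python
def build_private_access_config(function_name, subnet_ids, shared_sg_id, db_port):
    """Build the Lambda VPC settings and a self-referencing security group rule."""
    lambda_vpc_config = {
        "FunctionName": function_name,
        "VpcConfig": {
            "SubnetIds": list(subnet_ids),        # subnet used by the DB instance
            "SecurityGroupIds": [shared_sg_id],   # same group as the DB instance
        },
    }
    self_referencing_rule = {
        "GroupId": shared_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": db_port,
                "ToPort": db_port,
                # The source is the group itself, so members can reach each other.
                "UserIdGroupPairs": [{"GroupId": shared_sg_id}],
            }
        ],
    }
    return lambda_vpc_config, self_referencing_rule


vpc_config, sg_rule = build_private_access_config(
    "example-writer", ["subnet-0123456789abcdef0"], "sg-0123456789abcdef0", 3306
)
# With boto3 installed and credentials configured, these could be applied with:
#   boto3.client("lambda").update_function_configuration(**vpc_config)
#   boto3.client("ec2").authorize_security_group_ingress(**sg_rule)
```

The function's execution role also needs permission to manage network interfaces in the VPC before the first call succeeds.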



