AWS Certified Data Engineer - Associate Exam Questions PDF download

QUESTION: 33

A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway.
Which solution will meet these requirements with the LEAST operational overhead?

A) Deploy a custom Python script on an Amazon Elastic Container Service (Amazon ECS) cluster.
B) Create an AWS Lambda Python function with provisioned concurrency.
C) Deploy a custom Python script that can integrate with API Gateway on Amazon Elastic Kubernetes Service (Amazon EKS).
D) Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.

Answer(s): B

Explanation:

A) Incorrect: running the script on Amazon ECS means provisioning and maintaining a cluster, task definitions, and container images, which is far more overhead than a Lambda function for a script that is invoked only occasionally.
B) Correct: AWS Lambda Python function with provisioned concurrency minimizes latency and management effort; API Gateway integration is direct, and no infrastructure provisioning is required.
C) EKS adds substantial operational overhead for Kubernetes management and does not align with “least overhead” for a small script invoked via API Gateway.
D) Incorrect: keeping a Lambda function warm with a scheduled EventBridge rule and mock events is a workaround that must be built and maintained; provisioned concurrency achieves the same result without the extra scheduling and complexity.
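
As a minimal sketch of what such a function could look like, the handler below returns a response in the shape that an API Gateway Lambda proxy integration expects (a `statusCode`, optional `headers`, and a string `body`). The `name` query parameter and the greeting payload are hypothetical and only illustrate passing results back to API Gateway.

```python
import json


def lambda_handler(event, context):
    """Return a JSON result to API Gateway (Lambda proxy integration format)."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")  # hypothetical input parameter

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # The proxy integration requires the body to be a string.
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

Provisioned concurrency is configured on the function version or alias, not in the handler code, so the script itself stays the same.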

QUESTION: 34

A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.
The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account.
Which solution will meet these requirements?

A) Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.
B) Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.
C) Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.
D) Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.

Answer(s): D

Explanation:

The Kinesis data stream must reside in the destination (security) account, and the subscription filter must be created in the source (production) account, where the CloudWatch Logs log groups live. An IAM role whose trust policy lets CloudWatch Logs assume it grants the permission to put records into the stream.
A) Wrong: the stream is created in the production account, so the logs are never delivered to a stream that the security account owns, and no subscription filter is configured.
B) Wrong: the stream and the IAM role are in the right place, but the subscription filter is created in the security account; it must be created in the production account, where the log groups are.
C) Wrong: the stream is created in the production account instead of the security account, and no subscription filter is configured to send the logs.
D) Correct: the destination stream is in the security account, the IAM role and trust policy allow CloudWatch Logs to write to it, and the subscription filter in the production account forwards the production logs.
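
The IAM pieces named in option D can be sketched as plain policy documents. This is an illustration under assumptions, not a complete cross-account setup: the stream ARN, account ID, and function names below are placeholders, and the documents would still need to be attached to a role with IAM.

```python
def build_trust_policy() -> dict:
    """Trust policy that lets the CloudWatch Logs service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "logs.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_permissions_policy(stream_arn: str) -> dict:
    """Permissions policy that allows writing records to one Kinesis stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "kinesis:PutRecord",
                "Resource": stream_arn,
            }
        ],
    }


# Placeholder ARN for a stream owned by the security account (hypothetical).
EXAMPLE_STREAM_ARN = "arn:aws:kinesis:us-east-1:111122223333:stream/example"
```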

QUESTION: 35

A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes.
A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake.
Which solution will capture the changed data MOST cost-effectively?

A) Create an AWS Lambda function to identify the changes between the previous data and the current data. Configure the Lambda function to ingest the changes into the data lake.
B) Ingest the data into Amazon RDS for MySQL. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
C) Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.
D) Ingest the data into an Amazon Aurora MySQL DB instance that runs Aurora Serverless. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.

Answer(s): C

Explanation:

The correct answer is C because using an open source data lake format (such as Apache Iceberg or Delta Lake) enables ACID-compliant upserts/merges on a large-scale S3 data lake, allowing efficient CDC by merging daily full snapshots with existing data without heavy per-row processing or data movement. It minimizes storage and compute costs for tens of terabytes and small files, and supports scalable incremental updates.
A) Lambda-based diffing on large datasets is prohibitively expensive and slow for multi-terabyte files.
B) DMS with RDS MySQL adds relational DB maintenance and ongoing replication cost; CDC via DMS is not optimal for bulk S3 lake merging.
D) Aurora Serverless with DMS adds database compute cost and complexity; not the most cost-effective for bulk lake merges.
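
The merge (upsert) semantics that option C relies on can be illustrated in a few lines of plain Python. This is only a toy model of what a table format's merge operation does, not Iceberg or Delta Lake code; the `id` key and the row shapes are assumptions made for the example.

```python
def merge_snapshot(table: dict, snapshot: list, key: str = "id"):
    """Upsert a full snapshot into `table` (a dict keyed by `key`).

    Returns the keys that were inserted and the keys that were updated.
    Rows that are identical to the stored version are left untouched.
    """
    inserted, updated = [], []
    for row in snapshot:
        k = row[key]
        if k not in table:
            inserted.append(k)       # new row: insert
            table[k] = dict(row)
        elif table[k] != row:
            updated.append(k)        # changed row: update in place
            table[k] = dict(row)
    return inserted, updated


# Usage: yesterday's state merged with today's full snapshot.
lake = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
today = [{"id": 1, "v": "a"}, {"id": 2, "v": "c"}, {"id": 3, "v": "d"}]
new_keys, changed_keys = merge_snapshot(lake, today)
```

In a real data lake the same matched/not-matched logic is expressed declaratively and executed by the query engine, so no custom diffing code has to be maintained.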

QUESTION: 36

A data engineer runs Amazon Athena queries on data that is in an Amazon S3 bucket. The Athena queries use AWS Glue Data Catalog as a metadata table.
The data engineer notices that the Athena query plans are experiencing a performance bottleneck. The data engineer determines that the cause of the performance bottleneck is the large number of partitions that are in the S3 bucket. The data engineer must resolve the performance bottleneck and reduce Athena query planning time.
Which solutions will meet these requirements? (Choose two.)

A) Create an AWS Glue partition index. Enable partition filtering.
B) Bucket the data based on a column that the data have in common in a WHERE clause of the user query.
C) Use Athena partition projection based on the S3 bucket prefix.
D) Transform the data that is in the S3 bucket to Apache Parquet format.
E) Use the Amazon EMR S3DistCP utility to combine smaller objects in the S3 bucket into larger objects.

Answer(s): A,C

Explanation:

Athena query planning slows down when it has to retrieve metadata for a very large number of partitions. A Glue partition index and partition projection both reduce that work, so they are the two options that shorten planning time.
A) Correct: a Glue partition index with partition filtering lets Athena fetch only the partitions that match the query predicates instead of listing all of them at planning time.
B) Incorrect: bucketing changes how data is laid out within partitions; it does not reduce the number of partitions that must be resolved during planning.
C) Correct: partition projection computes partition values and locations from the table configuration and the S3 prefix pattern, so Athena does not need to look up each partition in the Data Catalog.
D) Incorrect: Parquet improves scan efficiency and I/O cost, but it does not reduce partition discovery or planning time.
E) Incorrect: merging small objects with S3DistCP reduces small-object overhead at scan time but does not affect partition planning.
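
To make partition projection more concrete, the sketch below builds the table properties for a single date-typed partition column and then expands the location template the way projection does conceptually. The partition column name `dt`, the bucket prefix, and the helper names are assumptions for illustration; the property keys follow Athena's `projection.*` and `storage.location.template` naming.

```python
from datetime import date, timedelta


def projection_properties(bucket_prefix: str, start: str, end: str = "NOW") -> dict:
    """Table properties for projecting a date partition column named `dt`."""
    return {
        "projection.enabled": "true",
        "projection.dt.type": "date",
        "projection.dt.format": "yyyy-MM-dd",
        "projection.dt.range": f"{start},{end}",
        # ${dt} is a literal placeholder that is filled in per partition.
        "storage.location.template": f"{bucket_prefix}/dt=${{dt}}/",
    }


def projected_locations(template: str, start: date, end: date) -> list:
    """Expand the template for each day: computed, not looked up in a catalog."""
    locations = []
    day = start
    while day <= end:
        locations.append(template.replace("${dt}", day.isoformat()))
        day += timedelta(days=1)
    return locations


props = projection_properties("s3://example-bucket/logs", "2024-01-01")
paths = projected_locations(
    props["storage.location.template"], date(2024, 1, 1), date(2024, 1, 3)
)
```

Because the locations are derived from a pattern, adding a new day of data requires no catalog update.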

QUESTION: 37

A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant.
Which solution will meet these requirements with the LEAST operational overhead?

A) Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.
B) Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.
C) Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.
D) Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.

Answer(s): D

Explanation:

D) Fully managed Apache Flink service provides stateful, fault-tolerant stream processing with native support for time-based windows (up to 30 minutes) and multiple aggregations, minimizing operational overhead. It handles exactly-once semantics and scalable stateful processing for real-time analytics.
A) Incorrect: putting both the business and the analytics logic in Lambda is operationally heavy, and Lambda has no built-in fault-tolerant state management for long windows; reprocessing and state handling would have to be built by hand.
B) Incorrect: this option also uses Managed Service for Apache Flink, but it is framed around data that might contain duplicates and does not address the required time-based aggregation over a window of up to 30 minutes.
C) Incorrect: Lambda tumbling windows for stream sources are limited to 15 minutes, so a 30-minute window would still require custom state handling, orchestration, and fault handling.
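
For readers unfamiliar with the term, the following is a small pure-Python illustration of a time-based (tumbling) window aggregation over event timestamps. It is not Flink code and has none of Flink's state management or fault tolerance; the `(timestamp_seconds, value)` event shape is an assumption for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute window from the question


def tumbling_window_sums(events, window_seconds: int = WINDOW_SECONDS) -> dict:
    """Sum values per fixed, non-overlapping window keyed by window start.

    `events` is an iterable of (event_timestamp_in_seconds, value) pairs.
    """
    totals = defaultdict(float)
    for ts, value in events:
        # Align the event timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        totals[window_start] += value
    return dict(totals)


sums = tumbling_window_sums([(0, 1.0), (1799, 2.0), (1800, 5.0)])
```

A managed Flink application performs the same grouping continuously on an unbounded stream and checkpoints the per-window state, which is the part that is hard to reproduce in Lambda.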

QUESTION: 38

A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage.
Which solution will meet these requirements with the LEAST operational overhead?

A) Create snapshots of the gp2 volumes. Create new gp3 volumes from the snapshots. Attach the new gp3 volumes to the EC2 instances.
B) Create new gp3 volumes. Gradually transfer the data to the new gp3 volumes. When the transfer is complete, mount the new gp3 volumes to the EC2 instances to replace the gp2 volumes.
C) Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput.
D) Use AWS DataSync to create new gp3 volumes. Transfer the data from the original gp2 volumes to the new gp3 volumes.

Answer(s): C

Explanation:

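Amazon EBS supports changing the type of an existing volume in place (Elastic Volumes), so a gp2 volume can be modified to gp3, with new size, IOPS, and throughput values, while it stays attached and in use. No data migration or detach/attach step is needed, which minimizes downtime and operational overhead.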
A) Incorrect: snapshots then create and attach new volumes introduces downtime during detachment/attachment and data consistency concerns; extra steps increase risk and cost.
B) Incorrect: gradual data transfer requires coordination and can still affect availability; mounting changes can cause brief interruption.
D) Incorrect: AWS DataSync transfers data between storage services and file systems; it does not create EBS volumes, and copying data to new volumes adds complexity that an in-place volume type change avoids.
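
The in-place change in option C maps to a single EC2 `ModifyVolume` call. The helper below is a hedged sketch that only builds the request parameters; the helper name and the volume ID are hypothetical, and the defaults of 3,000 IOPS and 125 MiB/s are the gp3 baseline values.

```python
def gp3_modification(volume_id: str, iops: int = 3000,
                     throughput: int = 125, size_gib=None) -> dict:
    """Build keyword arguments for an EC2 ModifyVolume request to gp3."""
    params = {
        "VolumeId": volume_id,
        "VolumeType": "gp3",
        "Iops": iops,              # provisioned IOPS
        "Throughput": throughput,  # MiB/s
    }
    if size_gib is not None:
        params["Size"] = size_gib  # optional new size in GiB
    return params


request = gp3_modification("vol-0123456789abcdef0")  # placeholder volume ID

# With credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("ec2").modify_volume(**request)
```

The volume stays attached while the modification runs, which is why no snapshot or data copy step is needed.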

QUESTION: 39

A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3.
Which solution will meet these requirements in the MOST operationally efficient way?

A) Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
B) Schedule SQL Server Agent to run a daily SQL query that selects the desired data elements from the EC2 instance-based SQL Server databases. Configure the query to direct the output .csv objects to an S3 bucket. Create an S3 event that invokes an AWS Lambda function to transform the output format from .csv to Parquet.
C) Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
D) Create an AWS Lambda function that queries the EC2 instance-based databases by using Java Database Connectivity (JDBC). Configure the Lambda function to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. Use Amazon EventBridge to schedule the Lambda function to run every day.

Answer(s): C

Explanation:

A) Using Glue to read directly from a live view on EC2 is not straightforward since Glue typically crawls data stores for cataloging; accessing an on-premises or EC2-hosted SQL Server view would require a persistent connection and proper JDBC/ODBC setup, which adds operational overhead.
B) SQL Server Agent on EC2 producing CSV to S3 and Lambda to convert adds unnecessary steps and potential ETL drift; managing two services increases operational burden.
C) Creating a view, then using a Glue crawler to catalog the view and a Glue job to extract and convert to Parquet into S3 provides a fully managed, serverless, repeatable, and scalable workflow with minimal maintenance.
D) Lambda with JDBC requires continuous connection management, cold start considerations, and scripting for Parquet conversion, which is less operationally efficient than Glue-based ETL.

QUESTION: 40

A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues.
Which table views should the data engineer use to meet this requirement?

A) STL_USAGE_CONTROL
B) STL_ALERT_EVENT_LOG
C) STL_QUERY_METRICS
D) STL_PLAN_INFO

Answer(s): B

Explanation:

A) STL_USAGE_CONTROL is related to usage controls and does not capture optimizer anomaly events.
B) STL_ALERT_EVENT_LOG is the system view that records anomalies and alerts detected by the query optimizer when potential performance issues are identified, making it the appropriate source for monitoring long-running or problematic queries.
C) STL_QUERY_METRICS contains per-query metrics but does not specifically log anomalies identified by the optimizer.
D) STL_PLAN_INFO provides plan details but not a centralized anomaly/alert log.
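
As a rough sketch of how that view might be inspected, the helper below builds a query for recent optimizer alerts. The helper name, the 24-hour lookback, and the row limit are assumptions for illustration; the selected columns (`query`, `event`, `solution`, `event_time`) are columns of STL_ALERT_EVENT_LOG.

```python
def recent_alerts_sql(hours: int = 24, limit: int = 50) -> str:
    """Build a Redshift query listing recent optimizer alerts, newest first."""
    if hours <= 0 or limit <= 0:
        raise ValueError("hours and limit must be positive")
    return (
        "SELECT query, event, solution, event_time\n"
        "FROM stl_alert_event_log\n"
        f"WHERE event_time >= DATEADD(hour, -{int(hours)}, GETDATE())\n"
        "ORDER BY event_time DESC\n"
        f"LIMIT {int(limit)};"
    )


sql = recent_alerts_sql(hours=6, limit=10)
```

The `solution` column carries the recommended action for each alert, which makes the output useful as a starting point for tuning.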

Amazon AWS Certified Data Engineer - Associate Amazon-DEA-C01 Dumps in PDF
