Amazon (MLS-C01) Exam Questions And Answers page 6
A Data Scientist is building a model to predict customer churn using a dataset of 100 continuous numerical features. The Marketing team has not provided any insight about which features are relevant for churn prediction. The Marketing team wants to interpret the model and see the direct impact of relevant features on the model outcome. While training a logistic regression model, the Data Scientist observes that there is a wide gap between the training and validation set accuracy.
Which methods can the Data Scientist use to improve the model performance and satisfy the Marketing team's needs? (Choose two.)
Add features to the dataset
Perform recursive feature elimination
Perform t-distributed stochastic neighbor embedding (t-SNE)
Perform linear discriminant analysis
Exploratory Data Analysis
Machine Learning Implementation and Operations
A Data Scientist is developing a binary classifier to predict whether a patient has a particular disease on a series of test results. The Data Scientist has data on 400 patients randomly selected from the population. The disease is seen in 3% of the population.
Which cross-validation strategy should the Data Scientist adopt?
A k-fold cross-validation strategy with k=5
A stratified k-fold cross-validation strategy with k=5
A k-fold cross-validation strategy with k=5 and 3 repeats
An 80/20 stratified split between training and validation
Data Engineering
Exploratory Data Analysis
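To make the numbers in the cross-validation question above concrete: 400 patients at 3% prevalence is roughly 12 positive cases, so an unstratified split can easily leave a fold with very few of them. The sketch below is a minimal, standard-library illustration of stratified fold assignment. The function name `stratified_folds`, the fixed seed, and the assumption of exactly 12 positives are all illustrative and not part of the question.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Assumption: 3% of 400 patients is treated as exactly 12 positive cases.
labels = [1] * 12 + [0] * 388
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
fold_sizes = [len(fold) for fold in folds]
print(positives_per_fold, fold_sizes)
```

With stratification every fold keeps at least two positive cases. A plain random split offers no such guarantee at this sample size.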
A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations.
The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives.
Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)
Change the XGBoost eval_metric parameter to optimize based on rmse instead of error.
Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.
Increase the XGBoost max_depth parameter because the model is currently underfitting the data.
Change the XGBoost eval_metric parameter to optimize based on AUC instead of error.
Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.
Model Development
Machine Learning Implementation and Operations
A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations.
The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist needs to reduce the number of false negatives.
Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)
Change the XGBoost eval_metric parameter to optimize based on Root Mean Square Error (RMSE).
Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.
Increase the XGBoost max_depth parameter because the model is currently underfitting the data.
Change the XGBoost eval_metric parameter to optimize based on Area Under the ROC Curve (AUC).
Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.
Model Development
Machine Learning Implementation and Operations
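The class counts in the fraud question above explain why 99.1% accuracy is not very informative here, and they also give the usual starting value for `scale_pos_weight`. The sketch below only does that arithmetic. The `params` dictionary is a hypothetical configuration showing how two of the listed options would be expressed; it is not a tuned or recommended setup.

```python
negatives = 100_000
positives = 1_000

# Accuracy of a trivial model that labels every transaction as non-fraudulent.
baseline_accuracy = negatives / (negatives + positives)

# Common starting heuristic for scale_pos_weight: negative count / positive count.
scale_pos_weight = negatives / positives

# Hypothetical XGBoost parameters; only scale_pos_weight is derived from the data.
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,
}
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(params)
```

A model that never predicts fraud already scores about 99% accuracy on this data, which is why accuracy alone says little about false negatives.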
A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.
Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.
How should the Data Scientist correct this issue?
Drop all records from the dataset where age has been set to 0.
Replace the age field value for records with a value of 0 with the mean or median value from the dataset.
Drop the age feature from the dataset and train the model using the rest of the features.
Use k-means clustering to handle missing features.
Data Engineering
Exploratory Data Analysis
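The mean-or-median replacement option in the patient-age question above can be sketched in a few lines. The function name `impute_zero_ages` and the sample ages are hypothetical. The sketch assumes that an age of 0 marks a missing entry, which fits a study population that is entirely over 65, and it computes the median from the valid ages only so the zeros do not drag it down.

```python
from statistics import median

def impute_zero_ages(ages):
    """Replace age values of 0 with the median of the valid (non-zero) ages."""
    valid = [a for a in ages if a > 0]
    fill = median(valid)
    return [fill if a == 0 else a for a in ages]

# Hypothetical sample: real patients are over 65, zeros are data-entry gaps.
ages = [72, 0, 68, 81, 0, 77, 69]
cleaned = impute_zero_ages(ages)
print(cleaned)
```

In the question, the 450 affected records are 11.25% of the 4,000 observations, so dropping them would discard a sizeable share of the data.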
A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed.
The solution needs to do the following:
• Calculate an anomaly score for each web traffic entry.
• Adapt unusual event identification to changing web patterns over time.
Which approach should the data scientist implement to meet these requirements?
Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random Cut Forest (RCF) built-in model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the RCF model to calculate the anomaly score for each record.
Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker built-in XGBoost model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the XGBoost model to calculate the anomaly score for each record.
Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the k-Nearest Neighbors (kNN) SQL extension to calculate anomaly scores for each record using a tumbling window.
Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the Amazon Random Cut Forest (RCF) SQL extension to calculate anomaly scores for each record using a sliding window.
Data Engineering
Exploratory Data Analysis
A data scientist is evaluating a GluonTS on Amazon SageMaker DeepAR model. The evaluation metrics on the test set indicate that the coverage score is 0.489 and 0.889 at the 0.5 and 0.9 quantiles, respectively.
What can the data scientist reasonably conclude about the distributional forecast related to the test set?
The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should be approximately equal to each other at all quantiles.
The coverage scores indicate that the distributional forecast is poorly calibrated. These scores should peak at the median and be lower at the tails.
The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should always fall below the quantile itself.
The coverage scores indicate that the distributional forecast is correctly calibrated. These scores should be approximately equal to the quantile itself.
Model Development
Machine Learning Implementation and Operations
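For the DeepAR question above, it helps to see how a coverage score is computed: for a given quantile, it is the fraction of actual values that fall below the forecast at that quantile. The sketch below is a minimal standard-library version with hypothetical data. The helper name `coverage` is an assumption and is not the GluonTS API. The last lines measure how far the two reported scores sit from their nominal quantiles.

```python
def coverage(actuals, quantile_forecasts):
    """Fraction of actual values that fall below the forecast for one quantile."""
    below = sum(1 for y, q in zip(actuals, quantile_forecasts) if y < q)
    return below / len(actuals)

# Hypothetical test-set values and matching 0.9-quantile forecasts.
actuals = [10, 12, 9, 15, 11, 14, 13, 10, 12, 16]
q90 = [14, 15, 12, 17, 13, 16, 15, 13, 14, 15]
score = coverage(actuals, q90)

# Scores reported in the question, compared with their nominal quantiles.
reported = {0.5: 0.489, 0.9: 0.889}
gaps = {q: round(abs(c - q), 3) for q, c in reported.items()}
print(score, gaps)
```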
A data scientist is reviewing customer comments about a company's products. The data scientist needs to present an initial exploratory analysis by using charts and a word cloud. The data scientist must use feature engineering techniques to prepare this analysis before starting a natural language processing (NLP) model.
Which combination of feature engineering techniques should the data scientist use to meet these requirements? (Choose two.)
Named entity recognition
Coreference
Stemming
Term frequency-inverse document frequency (TF-IDF)
Sentiment analysis
Exploratory Data Analysis
Model Development
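As an illustration of one technique listed in the customer-comments question above, the following sketch computes TF-IDF scores with the standard library only. It assumes whitespace tokenization and the plain `tf * log(N / df)` variant. Library implementations often add smoothing or normalization, so their numbers will differ. The sample comments are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical customer comments.
comments = [
    "great battery great screen",
    "battery died fast",
    "screen cracked fast",
]
scores = tf_idf(comments)
print(scores[0])
```

Terms that are frequent in one comment but rare across the collection receive the highest scores, which makes the scores a natural weighting for a word cloud.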
A Data Scientist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve an acceptable recall metric. The Data Scientist has already tried varying the number and size of the MLP's hidden layers, which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible.
Which techniques should be used to meet these requirements?
Gather more data using Amazon Mechanical Turk and then retrain
Train an anomaly detection model instead of an MLP
Train an XGBoost model instead of an MLP
Add class weights to the MLP's loss function and then retrain
Exploratory Data Analysis
Machine Learning Implementation and Operations
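The class-weighting option in the MLP question above can be illustrated without a deep learning framework. The sketch below computes inverse-frequency class weights and a weighted cross-entropy loss. The function names, the labels, and the predicted probabilities are all hypothetical, and the weighting formula `n_samples / (n_classes * class_count)` is one common heuristic rather than the only choice.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Hypothetical 3-class labels where class 2 is the rare class of interest.
labels = [0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(labels)
# Hypothetical predictions: confident on classes 0 and 1, poor on class 2.
probs = [[0.8, 0.1, 0.1]] * 4 + [[0.1, 0.8, 0.1]] * 3 + [[0.4, 0.4, 0.2]]
uniform = {0: 1.0, 1: 1.0, 2: 1.0}
unweighted_loss = weighted_cross_entropy(probs, labels, uniform)
weighted_loss = weighted_cross_entropy(probs, labels, weights)
print(unweighted_loss, weighted_loss)
```

The weighted loss is larger because the single badly predicted rare-class sample now counts as much as all samples of a common class, which pushes training toward better recall on that class.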
A data scientist is training a text classification model by using the Amazon SageMaker built-in BlazingText algorithm. There are 5 classes in the dataset, with 300 samples for category A, 292 samples for category B, 240 samples for category C, 258 samples for category D, and 310 samples for category E.
The data scientist shuffles the data and splits off 10% for testing. After training the model, the data scientist generates confusion matrices for the training and test sets.
What could the data scientist conclude from these results?
Classes C and D are too similar.
The dataset is too small for holdout cross-validation.
The data distribution is skewed.
The model is overfitting for classes B and E.
Exploratory Data Analysis
Machine Learning Implementation and Operations
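The confusion matrices referenced in the BlazingText question above are not reproduced here, but the class counts alone show how small the 10% test set is and how balanced the classes are. The sketch below only does that arithmetic from the numbers given in the question and draws no conclusion about the answer. The expected per-class counts assume the shuffle spreads classes proportionally.

```python
class_counts = {"A": 300, "B": 292, "C": 240, "D": 258, "E": 310}

total = sum(class_counts.values())
test_size = total // 10  # 10% holdout

# Expected number of test samples per class under a proportional shuffle.
expected_test_per_class = {c: n / 10 for c, n in class_counts.items()}

# Ratio between the largest and smallest class as a rough imbalance measure.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(total, test_size)
print(expected_test_per_class)
print(round(imbalance_ratio, 2))
```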