# Amazon (MLS-C01) Exam Questions And Answers page 6

Which methods can the Data Scientist use to improve the model performance and satisfy the Marketing team's needs? (Choose two.)

Add features to the dataset

Perform recursive feature elimination

Perform t-distributed stochastic neighbor embedding (t-SNE)

Perform linear discriminant analysis

Exploratory Data Analysis
Machine Learning Implementation and Operations

Which cross-validation strategy should the Data Scientist adopt?

A k-fold cross-validation strategy with k=5

A stratified k-fold cross-validation strategy with k=5

A k-fold cross-validation strategy with k=5 and 3 repeats

An 80/20 stratified split between training and validation

Data Engineering
Exploratory Data Analysis
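
To make the difference between the plain and stratified options above concrete, here is a minimal sketch of a stratified k-fold split in pure Python. The function name `stratified_k_fold` and the imbalanced label list are hypothetical and not taken from the question. The sketch does not shuffle, which a real split usually would.

```python
from collections import defaultdict


def stratified_k_fold(labels, k=5):
    """Yield (train_idx, val_idx) pairs whose class ratios mirror the full dataset."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold gets its share.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, val_idx


# Hypothetical imbalanced labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
splits = list(stratified_k_fold(labels, k=5))
positives_per_fold = [sum(labels[i] for i in val) for _, val in splits]
```

With these labels every validation fold keeps the same 10% positive rate as the full dataset, which a plain k-fold split does not guarantee. Libraries such as scikit-learn provide a tested equivalent (`StratifiedKFold`).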

The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives.

Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)

Change the XGBoost eval_metric parameter to optimize based on rmse instead of error.

Increase the XGBoost max_depth parameter because the model is currently underfitting the data.

Change the XGBoost eval_metric parameter to optimize based on AUC instead of error.

Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.

Model Development
Machine Learning Implementation and Operations

The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist needs to reduce the number of false negatives.

Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)

Change the XGBoost eval_metric parameter to optimize based on Root Mean Square Error (RMSE).

Increase the XGBoost max_depth parameter because the model is currently underfitting the data.

Change the XGBoost eval_metric parameter to optimize based on Area Under the ROC Curve (AUC).

Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.

Model Development
Machine Learning Implementation and Operations
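
The confusion matrix referenced in the question is not reproduced on this page, so the counts below are hypothetical. This short sketch only illustrates why a high accuracy figure can coexist with many false negatives on imbalanced data, which is the situation the question describes.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Recall falls as false negatives rise.
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, not the matrix from the question.
metrics = classification_metrics(tp=5, fp=5, tn=980, fn=10)
```

With these counts accuracy is 98.5% while recall is only one third, so a metric that looks only at the overall error rate says little about the missed positives.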

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

Drop all records from the dataset where age has been set to 0.

Drop the age feature from the dataset and train the model using the rest of the features.

Use k-means clustering to handle missing features.

Data Engineering
Exploratory Data Analysis
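
As a rough illustration of the trade-off, the sketch below computes how much data would be lost by dropping the affected records, and shows one common alternative: treating the zeros as missing and filling them with the median of the observed ages. Median imputation is an assumption here and is not among the options listed above. The helper `impute_zero_ages` and the sample ages are hypothetical.

```python
from statistics import median


def impute_zero_ages(ages):
    """Replace age values of 0 (assumed to be missing) with the median observed age."""
    observed = [age for age in ages if age > 0]
    fill_value = median(observed)
    return [age if age > 0 else fill_value for age in ages]


# Share of records lost if every row with age 0 were dropped (figures from the question).
missing_fraction = 450 / 4000

# Hypothetical sample, not real patient data.
ages = [34, 0, 51, 47, 0, 29]
cleaned = impute_zero_ages(ages)
```

Dropping the rows would discard 11.25% of the observations even though their other features look normal.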

The solution needs to do the following:

• Calculate an anomaly score for each web traffic entry.

• Adapt unusual event identification to changing web patterns over time.

Which approach should the data scientist implement to meet these requirements?

Data Engineering
Exploratory Data Analysis
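
The answer options for this question are not shown on this page. Purely to illustrate the two requirements (a score per entry, and a baseline that adapts over time), here is a minimal sketch using an exponentially weighted mean and variance. It is a simple stand-in and not any particular AWS service or algorithm. The class name, the `alpha` parameter, and the traffic values are hypothetical.

```python
import math


class StreamingAnomalyScorer:
    """Score each value by its distance from an exponentially weighted baseline."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # Higher alpha adapts faster to new patterns.
        self.mean = None
        self.var = 0.0

    def score(self, value):
        if self.mean is None:
            self.mean = value
            return 0.0
        deviation = value - self.mean
        std = math.sqrt(self.var)
        result = abs(deviation) / std if std > 0 else 0.0
        # Update the baseline after scoring so it tracks recent traffic.
        increment = self.alpha * deviation
        self.mean += increment
        self.var = (1 - self.alpha) * (self.var + deviation * increment)
        return result


# Hypothetical requests-per-minute series with one spike at the end.
traffic = [100, 102, 98, 101, 99, 100, 500]
scorer = StreamingAnomalyScorer(alpha=0.1)
scores = [scorer.score(value) for value in traffic]
```

Because the mean and variance are updated after every entry, a level that was once unusual gradually stops producing high scores if it persists.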

What can the data scientist reasonably conclude about the distributional forecast related to the test set?

Model Development
Machine Learning Implementation and Operations

Which combination of feature engineering techniques should the data scientist use to meet these requirements? (Choose two.)

Named entity recognition

Coreference

Stemming

Term frequency-inverse document frequency (TF-IDF)

Sentiment analysis

Exploratory Data Analysis
Model Development
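
Of the techniques listed above, TF-IDF is the easiest to show in a few lines. The sketch below uses the textbook definition (term frequency times the log of inverse document frequency) with whitespace tokenization. The documents are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical review snippets.
docs = [
    "the service was slow",
    "the food was great",
    "the service was great",
]
weights = tf_idf(docs)
```

Terms that occur in every document ("the", "was") receive a weight of zero, while a term unique to one document ("slow") receives the highest weight in that document.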

Which techniques should be used to meet these requirements?

Gather more data using Amazon Mechanical Turk and then retrain

Train an anomaly detection model instead of an MLP

Train an XGBoost model instead of an MLP

Add class weights to the MLP's loss function and then retrain

Exploratory Data Analysis
Machine Learning Implementation and Operations
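
To show what adding class weights to a loss function means in practice, here is a minimal sketch of a weighted cross-entropy. The inverse-frequency weighting scheme, the function names, and the 8-to-2 label split are assumptions for illustration, since the question's dataset is not described on this page.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Average negative log-likelihood, with each example scaled by its class weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical imbalanced labels: 8 of class 0, 2 of class 1.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)

# A model that always predicts class 0 with probability 0.9.
probs = [[0.9, 0.1]] * len(labels)
unweighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
```

The weighted loss penalizes this majority-class predictor more heavily than the unweighted loss does, which is what pushes training toward the rare class.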

The data scientist shuffles the data and splits off 10% for testing. After training the model, the data scientist generates confusion matrices for the training and test sets.

What could the data scientist conclude from these results?

Classes C and D are too similar.

The dataset is too small for holdout cross-validation.

The data distribution is skewed.

The model is overfitting for classes B and E.

Exploratory Data Analysis
Machine Learning Implementation and Operations
