Random Forest Algorithm

Author

Amy Brown, Jeff Eddy, Armani Harris

Published

August 6, 2023

Introduction

Random Forests (RF) is a popular ensemble learning algorithm that combines multiple decision trees to improve predictive accuracy and handle complex datasets. This literature review aims to explore ten articles that discuss various aspects of the Random Forest algorithm, including its theoretical foundation, feature selection, data imbalance resolution, interpretability, and comparisons with other classification methods. The selected articles offer valuable insights into the strengths, weaknesses, and practical applications of Random Forest in different domains.

Literature Review

Breiman (Breiman 2001) introduced the concept of Random Forests and highlighted the role of the Law of Large Numbers in explaining why adding more trees does not lead to overfitting. Liaw and Wiener (Liaw and Wiener 2002) expanded on this work, focusing on the classification and regression capabilities of RF. They discussed the use of bagging and highlighted the significance of variable importance in RF models.

Jaiswal and Samikannu discuss the importance of feature selection when dealing with datasets containing a high number of variables (Jaiswal and Samikannu 2017). Han and Kim investigated the optimal size of candidate feature sets in Random Forest for classification and regression tasks. Their study reveals that the default candidate feature set is not always the best performing and that the optimal size can vary depending on the dataset used (Han and Kim 2019). Zhu proposes a new algorithm, the SearchSize method, which estimates the optimal feature set size using the out-of-bag error (Zhu 2020). Abdulkareem and Abdulazeez note that potential challenges such as overfitting and handling multi-valued attributes exist when using the Random Forest algorithm, but they emphasize its advantages: accuracy, efficiency with large datasets, and minimal data preprocessing requirements (Abdulkareem and Abdulazeez 2021). To address the issue of data imbalance, More and Rana (More and Rana 2017) reviewed techniques for handling imbalanced datasets with RF, including sampling, weighted Random Forest, and cost-sensitive methods.

A practical application of Random Forest in computational biology and bioinformatics was described by Boulesteix, A. L. et al. (Ali et al. 2012), providing insights into decision tree construction, handling bias, and recommendations for parameter tuning. Schonlau and Zou (Schonlau and Yuyan Zou 2020) focused on optimizing RF models by tuning parameters and subtree iterations, noting that tuning these parameters helps achieve higher prediction accuracy. They discussed the use of out-of-bag (OOB) samples to approximate the model's error during training. Aria et al. (Aria 2022) discussed internal approaches that provide a global overview of the model. They mentioned the Mean Decrease Impurity (MDI) as a measure of variable importance and emphasized the visualization of RF classifiers using toolkits.

Methods

The data flow diagram of a random forest is shown in Figure 1. The process starts by building multiple decision trees. Each tree begins at the root node and splits the data into internal nodes; for classification, the split chosen at each node is the one that most reduces Gini Impurity. Feature importance can then be measured through Gini Impurity or Permutation Importance. The Gini importance of a feature is calculated by summing the Gini decrease attributed to that feature across every tree in the forest and dividing by the number of trees in the forest to provide an average. The Permutation Importance is calculated by evaluating the decrease in accuracy, over all trees in the forest, when the values of a feature are randomly shuffled. Each decision tree continues to grow as the data splits at the internal nodes, and a branch ends in a leaf node when no further split is made. A leaf node predicts a category in a classification model and a numerical value in a regression model. The prediction of the forest is the majority vote of the trees for classification and the average of the trees' predictions for regression (Abdulkareem and Abdulazeez 2021).

Figure 1 (Abdulkareem and Abdulazeez 2021)

The Random Forest algorithm is a supervised learning model consisting of multiple uncorrelated un-pruned decision trees creating a forest. The algorithm can be used for classification or regression models. The target selected for the algorithm identifies if the algorithm will use a classification or regression model (Abdulkareem and Abdulazeez 2021).
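
As a minimal sketch of the Gini Impurity calculation described above, the following base R code computes the impurity of a node and the impurity decrease produced by one candidate split. The function names gini_impurity and split_impurity and the toy vectors y and x are hypothetical and chosen only for illustration; they are not part of the randomForest or caret packages.

```r
# Gini impurity of a node: 1 - sum(p_k^2), where p_k is the share of class k
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two child nodes of a candidate split
# (hypothetical helper for illustration)
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini_impurity(left) +
    (length(right) / n) * gini_impurity(right)
}

# Toy node: four observations of class 1 and four of class 2
y <- c(1, 1, 1, 1, 2, 2, 2, 2)
x <- c(0, 0, 0, 1, 1, 1, 1, 1)

parent <- gini_impurity(y)            # 0.5 for a perfectly mixed two-class node
child <- split_impurity(y, x == 0)    # 0.2 after splitting on x
decrease <- parent - child            # 0.3, the Gini decrease of this split
```

A split with a larger Gini decrease separates the classes better, which is why the candidate with the largest decrease is preferred at a node.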

There are five basic steps of the Random Forest algorithm. The steps are as follows:

1. Randomly select k features from a total of m features, where k < m.
2. For the k features from step 1, specify a node size d between the minimum and maximum node size, using the best split point described in step 3.
3. Split each internal node into additional internal nodes at the best split point, measured by Gini Impurity.
4. Repeat steps 1 through 3 until no node can be split further.
5. Continue to build the forest by repeating steps 1 through 4 n times to create n trees (Jaiswal and Samikannu 2017).
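
The randomization in these steps can be sketched in base R as follows. This is only an illustration of the sampling, not a tree-growing implementation: the feature names, the object forest_plan, and the small value of n_trees are assumptions made for the example, and the per-tree bootstrap sample of rows is the bagging step described elsewhere in this report rather than one of the five steps listed here. In an actual forest the k candidate features are redrawn at every split, whereas the sketch draws them once per tree to keep it short.

```r
set.seed(123)

m <- 13         # total number of features (as in the heart disease data)
k <- 2          # number of candidate features, k < m
n_obs <- 189    # number of training observations
n_trees <- 5    # kept small for illustration

feature_names <- paste0("x", seq_len(m))   # hypothetical feature names

# For each tree, draw a bootstrap sample of rows and k candidate features
forest_plan <- lapply(seq_len(n_trees), function(tree_id) {
  list(
    rows = sample(n_obs, size = n_obs, replace = TRUE),
    candidates = sample(feature_names, size = k)
  )
})

# Candidate features for the first tree
forest_plan[[1]]$candidates
```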

Once the final model is created from the training data of the Random Forest algorithm, predictions can be made based on the following steps:

1. Run the test data through each decision tree in the final model to predict the outcome.
2. Store the predicted target outcome of each decision tree from step 1.
3. Count the votes for each predicted target.
4. Take the majority vote (classification) or the average (regression) of the predicted targets as the final prediction (Abdulkareem and Abdulazeez 2021).
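
The aggregation in these prediction steps can be illustrated with a small base R sketch. The matrices tree_votes and tree_values hold made-up per-tree predictions, and majority_vote is a hypothetical helper; none of these come from the fitted model in this report.

```r
# Each row is one test observation, each column is one tree's predicted class
tree_votes <- matrix(
  c(1, 2, 2,
    1, 1, 2,
    2, 2, 2),
  nrow = 3, byrow = TRUE
)

# Classification: return the most frequent class among the trees
# (ties go to the class that sorts first)
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

class_pred <- apply(tree_votes, 1, majority_vote)   # "2" "1" "2"

# Regression: average the numeric predictions of the trees
tree_values <- matrix(
  c(1, 2, 3,
    4, 4, 7),
  nrow = 2, byrow = TRUE
)
reg_pred <- rowMeans(tree_values)                    # 2 5
```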

Analysis and Results

Packages

The R packages used for the statistical modeling of the Random Forest algorithm were the dplyr, reshape2, and caret packages. The dplyr package is used to select the dataframe columns for the box plots of the features. The reshape2 package reshapes the data into the long format needed to build the box plots, which are drawn with ggplot2 (attached automatically when caret is loaded). The caret package provides a common interface to various classification and regression models. The method parameter used in the caret package is rf for the Random Forest algorithm. The metric parameter specifies the performance measure used to select the optimal model, such as Accuracy (Johnson 2023). The trControl parameter specifies the resampling method. The resampling methods evaluated during the statistical modeling process are k-fold cross-validation and bootstrapping. The k-fold cross-validation sampling method splits the training data randomly into k folds. The model is fit on k minus 1 folds and validated on the remaining fold; this is repeated so that each fold is used once for validation, and the scores are averaged (Li 2021). The bootstrap sampling method draws each resample from the training data with replacement, so roughly two-thirds of the observations appear in any one sample. The chosen samples are referred to as the bag, whereas the remaining samples are called out of bag (OOB). The OOB samples are used to estimate the prediction error (Jaiswal and Samikannu 2017).
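
The two resampling schemes can be sketched in base R as shown here. This is an illustration under simplifying assumptions, not what caret does internally: the folds below are not stratified by class, so the fold sizes differ slightly from the sample sizes that caret reports, and the variable names are hypothetical.

```r
set.seed(123)
n <- 189   # size of the training set used in this analysis
k <- 10    # number of folds

# k-fold cross-validation: shuffle a balanced vector of fold labels
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(fold_id)                 # nine folds of 19, one fold of 18
train_sizes <- n - as.integer(fold_sizes)    # 170 or 171 rows used for fitting

# Bootstrap: draw n rows with replacement; rows never drawn are out of bag
boot_rows <- sample(n, size = n, replace = TRUE)
oob_rows <- setdiff(seq_len(n), boot_rows)
in_bag_share <- length(unique(boot_rows)) / n   # close to 1 - exp(-1), about 0.632
```

The in-bag share of roughly 0.632 is where the "about two-thirds" rule of thumb for bootstrap samples comes from.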

Code

library(caret)
library(dplyr)
library(reshape2)

Load Data

Code

data<- read.csv('C:/Users/amybr/OneDrive/Desktop/dataset_heart.csv')

Data and Visualization

The Random Forest algorithm is reviewed using the Heart Disease Prediction dataset found on Kaggle. There are 270 observations in the dataset without any missing values. The target is heart disease, coded as the absence or presence of heart disease. The 14 variables with their values are listed in Table 1 (Singh 2023):

Link to Documentation for Heart Disease Prediction Data

Heart Disease Prediction Features and Target:

No.	Data Variables	Data Values
1	age	range: 29-77
2	sex	0, 1
3	chest pain type	1, 2, 3, 4
4	resting blood pressure	range: 94-200
5	serum cholesterol in mg/dl	range: 126-564
6	fasting blood sugar > 120 mg/dl	0, 1
7	resting electrocardiographic results	0, 1, 2
8	maximum heart rate achieved	range: 71-202
9	exercise induced angina	0, 1
10	oldpeak: ST depression (positions on the ECG plot) induced by exercise relative to rest	range: 0-6.2
11	slope of the peak exercise ST segment	1, 2, 3
12	number of major vessels colored by fluoroscopy	0, 1, 2, 3
13	thal: genetic blood disorder	3 = normal; 6 = fixed defect; 7 = reversible defect
14	heart disease (target)	1 = absence; 2 = presence

Table 1 (Singh 2023)

To visualize the data, two box plots were created so that features with similar ranges share a scale. The first box plot has a scale of up to 600 and shows an extreme outlier in the feature serum cholesterol at a value of 564. It also shows mild outliers in the features resting blood pressure and max heart rate. The second box plot has a scale of up to 7 and shows mild outliers for chest pain type, fasting blood sugar, old peak, and major vessels. Since there are only 270 observations, the extreme outlier data point is removed before running the model.

Code

# Split dataframe into dataframe for box plots only
data_gt_100=data
data_lt_100=data

# Box plot age, serum cholesterol, resting blood pressure, and max heart rate
data_gt_100 = data %>% select(1,4,5,8)

# Reshape sample data to long form
data_gt_100_melt <- melt(data_gt_100)

# Add variable parameter for axis label in dataset
levels(data_gt_100_melt$variable) <- c("age","resting.blood.pressure","serum.cholesterol","max.heart.rate")
  
# Draw box plot
ggplot(data_gt_100_melt, aes(variable, value)) + 
geom_boxplot(width = .25)  +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Code

# Box plot of all other variables
data_lt_100 = data_lt_100 %>% select(2,3,6,7,9:14)

# Reshape sample data to long form
data_lt_100_melt <- melt(data_lt_100)

# Add variable parameter for axis label in dataset
levels(data_lt_100_melt$variable) <- c("sex","chest.pain.type","fasting.blood.sugar","resting.electrocardiographic.results", "exercise.induced.angina","oldpeak","ST.segment","major.vessels", "thal", "heart.disease")
  
# Draw box plot
ggplot(data_lt_100_melt, aes(variable, value)) + 
geom_boxplot(width = .25)  +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Removal of Extreme Outlier

The extreme outlier for the feature serum cholesterol is removed from the dataset. The removal is validated by rerunning the first box plot: the scale is reduced from 600 to 450, indicating that the data point 564 was removed for the feature serum cholesterol.

Code

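# Remove the extreme outlier (serum cholesterol of 564) from the box plot data
data_gt_100 = data_gt_100[data$serum.cholestoral != 564, ]
# Remove the same observation from the data used for modeling
data1 = data
data1 = data1[data1$serum.cholestoral != 564, ]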

# Rerun box plot after extreme outlier removed 

# Reshape sample data to long form
data_gt_100_melt <- melt(data_gt_100)

# Add variable parameter for axis label in dataset
levels(data_gt_100_melt$variable) <- c("age","resting.blood.pressure","serum.cholesterol","max.heart.rate")
  
# Draw box plot
ggplot(data_gt_100_melt, aes(variable, value)) + 
geom_boxplot(width = .25)  +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Data Transformation

In order to run a classification model of the random forest algorithm with heart disease as a target, the heart disease variable was transformed to a factor variable. The random forest classification model requires a dependent factor variable.

Code

data1$heart.disease<- as.factor(data1$heart.disease)

Statistical Modeling

Model Training

The target variable was binary with categories of no heart disease (1) and heart disease (2). After removing the outlier observation in cholesterol, 269 observations were used to build a random forest model to predict the presence of heart disease. Using the caret package in R, the data set was split into a training set and a testing set at a 0.7 ratio. This resulted in 189 observations in the training data set and 80 in the testing data set. The training dataset was then used to search for the parameter values that produced the greatest estimated accuracy. The algorithm parameters are defined as follows:

Ntree: The number of trees.

Mtry: Controls the amount of randomness that is added to the process. The number of randomly sampled variables drawn at each split.

Terminal nodes: The number of leaf nodes in the tree.

Max nodes: The maximum number of terminal nodes that a tree in the forest can have. A way to control the complexity of the trees.

Nodesize: Minimum size of terminal nodes.

Feature: An input (predictor) variable.

The testing dataset was then passed into the final tuned model to generate predictions.

Code

set.seed(123)
split <- createDataPartition(data1$heart.disease, p = 0.7, list = FALSE)
train <- data1[split, ]
test <- data1[-split, ]

K-Fold Cross-Validation Sampling Method

A default training model using the k-fold cross-validation sampling method based on classification was created to establish a baseline accuracy. Random Forests are predictive models that form multiple decision trees. At each tree node, a random subset of the 13 predictor features was considered to introduce randomness into the model. The default training model used 500 decision trees (ntree = 500), and the best number of features to consider at each node was found to be 2 (mtry = 2). For this classification RF, the split at each tree node is chosen by Gini Impurity; Permutation Importance, which removes the association between a feature and the target by shuffling the feature, is used to rank features rather than to choose splits. The accuracy of the default model was 0.8358.

Code

set.seed(123)
# Define the control
trControl <- trainControl(method = "cv", number = 10, search = "grid")
#model with default values
rf_default <- train(heart.disease~.,
    data = train,
    method = "rf",
    metric = "Accuracy",
    trControl = trControl)
# Print the results
print(rf_default)

Random Forest 

189 samples
 13 predictor
  2 classes: '1', '2' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 169, 170, 170, 171, 169, 171, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8357895  0.6638724
   7    0.8311404  0.6531982
  13    0.8264035  0.6444556

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

The training model was then tuned for the number of decision trees, the maximum number of nodes, and the number of features to consider at each node. After performing the following procedures, the greatest accuracy was obtained with 300 trees, a node size of 14, max nodes of 6, and 2 features considered at each node. The first tuning parameter evaluated was the number of features the model could consider at each node in a tree. Accuracy was highest at mtry = 2, with a value of 0.8522. This mtry value is the same as in the default model.

Code

#Search best mtry
set.seed(123)
tuneGrid <- expand.grid(.mtry = c(1: 10))
rf_mtry <- train(heart.disease~.,
    data = train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 300)
print(rf_mtry)

Random Forest 

189 samples
 13 predictor
  2 classes: '1', '2' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 169, 170, 170, 171, 169, 171, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   1    0.8308187  0.6497031
   2    0.8521637  0.6938909
   3    0.8407895  0.6711918
   4    0.8410819  0.6746692
   5    0.8305556  0.6515543
   6    0.8311404  0.6546218
   7    0.8261404  0.6452077
   8    0.8150292  0.6223296
   9    0.8150292  0.6232846
  10    0.8047953  0.6038736

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Code

best_mtry <- rf_mtry$bestTune$mtry

The subsequent procedure evaluates the optimal number of max nodes for the model. Among the values run between 5 and 15, maxnodes = 5 and maxnodes = 6 tie for the highest mean accuracy (0.8674); maxnodes = 6 was used in the final model.

Code

# Search maxnodes - look for highest accuracy

store_maxnode <- list()
# Assigns best mtry to tuneGrid
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5: 15)) {
    set.seed(123)
    rf_maxnode <- train(heart.disease~.,
        data = train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    current_iteration <- toString(maxnodes)
    store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry, metric = "Accuracy")


Call:
summary.resamples(object = results_mtry, metric = "Accuracy")

Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 
Number of resamples: 10 

Accuracy 
        Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
5  0.6842105 0.8355263 0.8947368 0.8674269 0.9000000    1    0
6  0.6842105 0.8355263 0.8947368 0.8674269 0.9000000    1    0
7  0.6842105 0.7807018 0.8210526 0.8360819 0.8947368    1    0
8  0.7368421 0.8355263 0.8460526 0.8624269 0.8986842    1    0
9  0.6842105 0.8355263 0.8460526 0.8519006 0.8947368    1    0
10 0.6842105 0.7921053 0.8377193 0.8363743 0.8815789    1    0
11 0.7368421 0.7938596 0.8421053 0.8463450 0.8835526    1    0
12 0.6842105 0.7833333 0.8421053 0.8413450 0.8855263    1    0
13 0.6842105 0.8355263 0.8421053 0.8521930 0.8835526    1    0
14 0.6842105 0.8083333 0.8421053 0.8521637 0.8986842    1    0
15 0.6842105 0.8355263 0.8421053 0.8569006 0.9000000    1    0

The final parameter evaluated was the number of trees for the model. Among the ntree values run between 100 and 1,000, the mean accuracy is highest at 300 trees (0.8674) and stays constant at 0.8572 from 400 trees onward.

Code

# Search best ntree
store_maxtrees <- list()
for (ntree in c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)) {
    set.seed(123)
    rf_maxtrees <- train(heart.disease~.,
        data = train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = 6,
        ntree = ntree)
    key <- toString(ntree)
    store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree, metric="Accuracy")


Call:
summary.resamples(object = results_tree, metric = "Accuracy")

Models: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 
Number of resamples: 10 

Accuracy 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
100  0.7368421 0.7777778 0.8947368 0.8457895 0.8986842 0.9444444    0
200  0.7368421 0.7807018 0.8947368 0.8618713 0.9000000 1.0000000    0
300  0.6842105 0.8355263 0.8947368 0.8674269 0.9000000 1.0000000    0
400  0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0
500  0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0
600  0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0
700  0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0
800  0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0
900  0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0
1000 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000    0

The final training model produced an accuracy of 0.8674 compared to 0.8358 for the default training model, a relative improvement of 3.78% (3.16 percentage points) over the default RF parameters.

Code

# Train final model
set.seed(123)
fit_rf <- train(heart.disease~.,
    data = train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 300,
    maxnodes = 6)
print(fit_rf)

Random Forest 

189 samples
 13 predictor
  2 classes: '1', '2' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 169, 170, 170, 171, 169, 171, ... 
Resampling results:

  Accuracy   Kappa    
  0.8674269  0.7242847

Tuning parameter 'mtry' was held constant at a value of 2
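
The improvement of the final training model over the default model can be checked with a short calculation. The two accuracies are the rounded values reported in this section; the variable names are only for illustration.

```r
default_acc <- 0.8358   # default k-fold model
tuned_acc <- 0.8674     # final tuned k-fold model

absolute_gain <- tuned_acc - default_acc       # difference in accuracy
relative_gain <- absolute_gain / default_acc   # gain relative to the default

round(100 * absolute_gain, 2)   # 3.16 percentage points
round(100 * relative_gain, 2)   # 3.78 percent
```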

Bootstrap Sampling Method

A default training model using the bootstrap sampling method based on classification was created to compare with the k-fold cross-validation sampling method. Bootstrap resampling is used here in place of cross-validation. Random Forests are predictive models that form multiple decision trees. At each tree node, a random subset of the 13 predictor features was considered to introduce randomness into the model. The default training model used 500 decision trees (ntree = 500), and the best number of features to consider at each node was found to be 2 (mtry = 2). For this classification RF, the split at each tree node is chosen by Gini Impurity, which estimates the likelihood that a randomly chosen observation in a node would be misclassified if it were labeled at random according to the class distribution in that node; the split with the lowest impurity is selected at that node. The accuracy of the bootstrap default model is 0.8322 compared to 0.8358 for the k-fold cross-validation sampling method.

Code

set.seed(123)
fit_rf2 <- train(heart.disease~.,
    data = train,
    method = "rf",
    importance = TRUE)
print(fit_rf2)

Random Forest 

189 samples
 13 predictor
  2 classes: '1', '2' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8321817  0.6571619
   7    0.8229013  0.6393790
  13    0.8117677  0.6172537

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

To compare the two sampling methods, the final training settings from the k-fold cross-validation sampling method were also used with the bootstrap sampling method. The bootstrap accuracy is still lower, at 0.8345 compared to 0.8674 for the k-fold cross-validation sampling method. Since the k-fold cross-validation sampling method gives the higher accuracy, the predictive modeling is based on the k-fold cross-validation model.

Code

# Train final model
set.seed(123)
fit_rf2 <- train(heart.disease~.,
    data = train,
    method = "rf",
    tuneGrid = tuneGrid,
    importance = TRUE,
    nodesize = 14,
    ntree = 300,
    maxnodes = 6)
print(fit_rf2)

Random Forest 

189 samples
 13 predictor
  2 classes: '1', '2' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ... 
Resampling results:

  Accuracy   Kappa    
  0.8344737  0.6605306

Tuning parameter 'mtry' was held constant at a value of 2

Predictive Modeling

Code

# Evaluate model
prediction <-predict(fit_rf, test)

The final training model was run against the test data. The variable importance scores of the final model indicated that major vessels was the most influential feature when predicting heart disease from the current dataset, with a variable importance score of 100.00. The five most influential features all had variable importance scores above 50.00. These features were major vessels, thal, chest pain type, old peak, and max heart rate.

Code

varImp(fit_rf)

rf variable importance

                                     Importance
major.vessels                           100.000
thal                                     94.662
chest.pain.type                          76.738
oldpeak                                  67.389
max.heart.rate                           54.524
exercise.induced.angina                  47.493
sex                                      35.574
age                                      32.505
ST.segment                               30.580
resting.blood.pressure                   15.787
resting.electrocardiographic.results      3.221
fasting.blood.sugar                       2.348
serum.cholestoral                         0.000
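
The idea behind Permutation Importance from the Methods section can be illustrated with a small, self-contained base R sketch. It is a generic illustration and not a reproduction of how varImp computes the scores above: the toy data, the fixed rule predict_rule that stands in for a fitted model, and the helper permutation_importance are all hypothetical.

```r
set.seed(123)

# Toy data: x1 determines the class, x2 is pure noise
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- ifelse(toy$x1 > 0.5, 2, 1)

# Stand-in for a fitted model: a fixed rule that only looks at x1
predict_rule <- function(d) ifelse(d$x1 > 0.5, 2, 1)

accuracy <- function(d) mean(predict_rule(d) == d$y)

# Shuffle one feature to break its link with the target,
# then report the resulting drop in accuracy
permutation_importance <- function(d, feature) {
  shuffled <- d
  shuffled[[feature]] <- sample(shuffled[[feature]])
  accuracy(d) - accuracy(shuffled)
}

imp <- sapply(c("x1", "x2"), function(f) permutation_importance(toy, f))
imp
```

Shuffling x1 costs roughly half of the accuracy because the rule depends on it, while shuffling x2 changes nothing, so x2 receives an importance of zero.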

Code

rfImp<- varImp(fit_rf)
ggplot(rfImp, aes(x = importance(), y = value)) +
  geom_bar(stat = "identity", linewidth = 0.5,fill="blue")

A confusion matrix was run for the predictive model. Accuracy for the predictive model was calculated to be 83.8% (0.8375, 95% CI: 0.7382, 0.9105, p < 0.001). With class 1 (absence of heart disease) treated as the positive class, sensitivity and specificity were calculated at 0.909 and 0.75, respectively.

Code

cm <- confusionMatrix(prediction, test$heart.disease)
cm

Confusion Matrix and Statistics

          Reference
Prediction  1  2
         1 40  9
         2  4 27
                                          
               Accuracy : 0.8375          
                 95% CI : (0.7382, 0.9105)
    No Information Rate : 0.55            
    P-Value [Acc > NIR] : 5.091e-08       
                                          
                  Kappa : 0.6675          
                                          
 Mcnemar's Test P-Value : 0.2673          
                                          
            Sensitivity : 0.9091          
            Specificity : 0.7500          
         Pos Pred Value : 0.8163          
         Neg Pred Value : 0.8710          
             Prevalence : 0.5500          
         Detection Rate : 0.5000          
   Detection Prevalence : 0.6125          
      Balanced Accuracy : 0.8295          
                                          
       'Positive' Class : 1
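
The statistics in this output can be reproduced by hand from the four cell counts. The sketch below copies the counts from the confusion matrix above, with class 1 as the positive class; the variable names are only for illustration.

```r
tp <- 40   # predicted 1, reference 1
fp <- 9    # predicted 1, reference 2
fn <- 4    # predicted 2, reference 1
tn <- 27   # predicted 2, reference 2

accuracy    <- (tp + tn) / (tp + fp + fn + tn)   # 67/80 = 0.8375
sensitivity <- tp / (tp + fn)                    # 40/44 = 0.9091
specificity <- tn / (tn + fp)                    # 27/36 = 0.7500
ppv         <- tp / (tp + fp)                    # 40/49 = 0.8163
npv         <- tn / (tn + fn)                    # 27/31 = 0.8710
balanced    <- (sensitivity + specificity) / 2   # 0.8295

round(c(accuracy, sensitivity, specificity, ppv, npv, balanced), 4)
```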

Code

draw_confusion_matrix <- function(cm) {
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # Create the matrix 
  rect(150, 430, 240, 370, col='#3F97D0')
  text(195, 435, 'Positive', cex=1.2)
  rect(250, 430, 340, 370, col='#F7AD50')
  text(295, 435, 'Negative', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#F7AD50')
  rect(250, 305, 340, 365, col='#3F97D0')
  text(140, 400, 'True', cex=1.2, srt=90)
  text(140, 335, 'False', cex=1.2, srt=90)

  # Add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # Add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # Add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  

draw_confusion_matrix(cm)

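The predictive model has an estimated false positive rate of 25.00%. The false positive rate is calculated as the number of false positives divided by the sum of false positives and true negatives: FPR = FP/(FP + TN). With class 1 as the positive class, the confusion matrix gives FP = 9 and TN = 27, so the rate equals 1 minus the specificity of 0.75.

Code

9/(9+27)

[1] 0.25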

Conclusion

In closing, the Random Forest algorithm is a powerful supervised learning model that combines multiple uncorrelated decision trees to form a forest. Two sampling methods were evaluated for estimating model accuracy: k-fold cross-validation and bootstrap. With both methods, decision tree splits are constructed based on Gini Impurity, while Permutation Importance is used to rank the features. The algorithm can be applied to both classification and regression tasks, with the prediction determined by the majority vote or the average of the trees' predictions, respectively. The Random Forest algorithm offers a versatile approach to solving complex problems and is widely used in various domains of machine learning and data analysis.

References

Abdulkareem, Nasiba Mahdi, and Adnan Mohsin Abdulazeez. 2021. “Machine Learning Classification Based on Radom Forest Algorithm: A Review.” International Journal of Science and Business 5 (2): 128–42.

Ali, Jehad, Rehanullah Khan, Nasir Ahmad, and Imran Maqsood. 2012. “Random Forests and Decision Trees.” International Journal of Computer Science Issues (IJCSI) 9 (September).

Aria, Corrado & Gnasso, Massimo & Cuccurullo. 2022. “A Comparison Among Interpretative Proposals for Random Forests.” R News 10: 100438.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32. https://doi.org/10.1023/A:1010933404324.

Han, Sunwoo, and Hyunjoong Kim. 2019. “On the Optimal Size of Candidate Feature Set in Random Forest.” Applied Sciences 9 (5). https://doi.org/10.3390/app9050898.

Jaiswal, Jitendra Kumar, and Rita Samikannu. 2017. “Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression.” In 2017 World Congress on Computing and Communication Technologies (WCCCT), 65–68. https://doi.org/10.1109/WCCCT.2016.25.

Johnson, Daniel. 2023. “R Random Forest Tutorial with Example.” Guru99. https://www.guru99.com/r-random-forest-tutorial.html.

Li, Gangmin. 2021. “Do a Data Science Project in 10 Days.” https://bookdown.org/Do-Data-Science-in-10days/.

Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.

More, A. S., and Dipti P. Rana. 2017. “Review of Random Forest Classification Techniques to Resolve Data Imbalance.” In 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), 72–78. https://doi.org/10.1109/ICISIM.2017.8122151.

Schonlau, Matthias, and Rosie Yuyan Zou. 2020. “The Random Forest Algorithm for Statistical Learning.” The Stata Journal 20: 3–29. https://doi.org/10.1177/1536867X20909688.

Singh, Utkarsh. 2023. “Heart Disease Prediction Dataset.” Kaggle. https://www.kaggle.com/datasets/utkarshx27/heart-disease-diagnosis-dataset.

Zhu, Tongtian. 2020. “Analysis on the Applicability of the Random Forest.” Journal of Physics: Conference Series 1607: 119–21.