Random Forests (RF) is a popular ensemble learning algorithm that combines multiple decision trees to improve predictive accuracy and handle complex datasets. This literature review aims to explore ten articles that discuss various aspects of the Random Forest algorithm, including its theoretical foundation, feature selection, data imbalance resolution, interpretability, and comparisons with other classification methods. The selected articles offer valuable insights into the strengths, weaknesses, and practical applications of Random Forest in different domains.
Breiman (Breiman 2001) introduced the concept of Random Forests and highlighted the importance of the Law of Large Numbers in avoiding overfitting. Liaw and Wiener (Liaw and Wiener 2002) expanded on this work, focusing on the classification and regression capabilities of RF. They discussed the use of bagging and highlighted the significance of variable importance in RF models. Jaiswal and Samikannu discuss the importance of feature selection when dealing with datasets containing a high number of variables (Jaiswal and Samikannu 2017). Han and Kim investigated the optimal size of candidate feature sets in Random Forest for classification and regression tasks. Their study reveals that the default candidate feature set is not always the best performing and that the optimal size can vary depending on the dataset used (Han and Kim 2019). Abdulkareem and Abdulazeez note that potential challenges, such as overfitting and handling multi-valued attributes, exist when using the Random Forest (RF) algorithm. However, they emphasize its advantages: accuracy, efficiency with large datasets, and minimal data preprocessing requirements (Abdulkareem and Abdulazeez 2021). To address the issue of data imbalance, More and Rana (More and Rana 2017) reviewed techniques for handling imbalanced datasets using RF, including sampling, weighted Random Forest, and cost-sensitive methods. A practical application of Random Forest in computational biology and bioinformatics was conducted by Boulesteix, A. L. et al. (Ali et al. 2012), providing insights into decision tree construction, handling bias, and recommendations for parameter tuning. Schonlau and Zou (Schonlau and Yuyan Zou 2020) focused on optimizing RF models by tuning parameters and subtree iterations. They discussed the use of out-of-bag (OOB) samples to approximate the model's error during training.
Zhu proposes a new algorithm, the SearchSize method, which estimates the optimal feature set size using out-of-bag error methodology (Zhu 2020). According to Schonlau and Zou, tuning these parameters helps achieve higher prediction accuracy (Schonlau and Yuyan Zou 2020). Aria et al. (Aria 2022) discussed internal approaches that provide a global overview of the model. They mentioned the Mean Decrease Impurity (MDI) as a measure of variable importance and emphasized the visualization of RF classifiers using toolkits.
The data flow diagram of a random forest is shown in Figure 1. A random forest process starts by building multiple decision trees, each starting at the root node and splitting into internal nodes by measuring the feature importance through Gini Impurity or Permutation Importance. The Gini Impurity is calculated by summing the Gini decrease across every tree in the forest and dividing by the number of trees in the forest to provide an average. The Permutation Importance is calculated by evaluating the decrease in accuracy, over all trees in the forest, when a feature is randomly shuffled. Each decision tree continues to grow as the data splits at the internal nodes. A branch of the decision tree ends at a leaf node, where no further split is made. The leaf node predicts a category for a classification model and a numerical value for a regression model. The final prediction is determined by taking the majority vote of the leaf nodes reached across the trees for classification and the average of those leaf nodes for regression (Abdulkareem and Abdulazeez 2021).
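As a rough illustration of the impurity measure described above, the sketch below computes the Gini impurity of a node and the impurity decrease for one candidate split. The labels and the split are made-up values, and `gini_impurity()` and `gini_decrease()` are hypothetical helper names used only for illustration; they are not functions from caret or randomForest.

```r
# Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Decrease in impurity when a parent node is split into left and right children
gini_decrease <- function(labels, go_left) {
  left <- labels[go_left]
  right <- labels[!go_left]
  w_left <- length(left) / length(labels)
  w_right <- length(right) / length(labels)
  gini_impurity(labels) -
    (w_left * gini_impurity(left) + w_right * gini_impurity(right))
}

# Hypothetical node with four observations of class 1 and four of class 2
labels <- c(1, 1, 1, 1, 2, 2, 2, 2)
# Hypothetical split sending three class-1 and one class-2 observation left
go_left <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

parent_gini <- gini_impurity(labels)        # 0.5 for a balanced two-class node
split_gain <- gini_decrease(labels, go_left)
print(parent_gini)
print(split_gain)
```

Averaging such decreases for a feature over all trees gives the kind of forest-level summary described in the paragraph above.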
The Random Forest algorithm is a supervised learning model consisting of multiple uncorrelated un-pruned decision trees creating a forest. The algorithm can be used for classification or regression models. The target selected for the algorithm identifies if the algorithm will use a classification or regression model (Abdulkareem and Abdulazeez 2021).
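The difference between the two model types shows up most clearly in how the tree outputs are combined. The following minimal sketch uses made-up predictions from five hypothetical trees; `majority_vote()` is an illustrative helper, not part of any package, and ties are resolved in favour of the first class, which is an assumption of this sketch.

```r
# Hypothetical class predictions from five trees (columns) for three observations (rows)
tree_votes <- matrix(
  c("1", "2", "1", "1", "2",
    "2", "2", "1", "2", "2",
    "1", "1", "1", "2", "1"),
  nrow = 3, byrow = TRUE
)

# Classification: majority vote across trees for each observation
majority_vote <- function(votes) names(which.max(table(votes)))
class_prediction <- apply(tree_votes, 1, majority_vote)

# Hypothetical numeric predictions from five trees for two observations
tree_values <- matrix(
  c(10, 12, 11, 13, 14,
    20, 22, 21, 19, 18),
  nrow = 2, byrow = TRUE
)

# Regression: average of the tree predictions for each observation
regression_prediction <- rowMeans(tree_values)

print(class_prediction)
print(regression_prediction)
```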
There are five basic steps of the Random Forest algorithm. The steps are as follows:
Once the final model is created from the training data of the Random Forest algorithm, predictions can be made based on the following steps:
The R packages utilized for running the statistical modeling for the Random Forest algorithm were the dplyr, reshape2 and caret packages. The dplyr package is used to select the columns of the data frames for the box plots of the features. The reshape2 package reshapes the data into the long format needed to build the box plots. The caret package is used for various classification and regression models. The method parameter used in the caret package is rf for the Random Forest algorithm. The metric parameter specifies the summary measure used to select the optimal model, such as Accuracy (Johnson 2023). The trControl parameter specifies the sampling method. The trControl settings evaluated during the statistical modeling process are k-fold cross-validation and bootstrapping. The k-fold cross-validation sampling method splits the training data randomly into k folds. The training model is fit on k minus 1 folds and validated on the remaining fold, with the score noted for each fold for the final estimate (Li 2021). The bootstrap sampling method samples about two-thirds of the training data, drawing the sample with replacement for each resampling. The chosen samples are referred to as the bag, whereas the remaining samples are called out of bag (OOB). The OOB samples for each tree are used to vote for the final prediction (Jaiswal and Samikannu 2017).
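To make the two resampling schemes concrete, here is a small base R sketch. It does not reproduce what caret does internally (for example, caret may stratify folds by class); the row count `n` and the fold count `k` are hypothetical values chosen for illustration.

```r
set.seed(123)
n <- 189  # hypothetical number of training rows
k <- 10   # hypothetical number of folds

# k-fold cross-validation: assign each row to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(fold_id)

# For fold 1: fit on the other k - 1 folds, validate on fold 1
validation_rows <- which(fold_id == 1)
training_rows <- which(fold_id != 1)

# Bootstrap: draw n rows with replacement; rows never drawn are out of bag (OOB)
in_bag <- sample(seq_len(n), size = n, replace = TRUE)
oob_rows <- setdiff(seq_len(n), in_bag)
in_bag_fraction <- length(unique(in_bag)) / n

print(fold_sizes)
print(in_bag_fraction)
```

With sampling with replacement, the expected share of distinct rows in the bag is about 63%, which is close to the two-thirds figure quoted above.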
library(caret)
library(dplyr)
library(reshape2)
library(ggplot2)  # ggplot() is used for the box plots below
data <- read.csv('C:/Users/amybr/OneDrive/Desktop/dataset_heart.csv')
The Random Forest algorithm is reviewed using the Heart Disease Prediction dataset found on Kaggle. There are 270 observations in the dataset without any missing values. The target is heart disease predicting the presence or absence of heart disease. The 14 variables with their values are listed in Table 1 (Singh 2023):
Link to Documentation for Heart Disease Prediction Data
Heart Disease Prediction Features and Target:
No. | Data Variables | Data Values |
---|---|---|
1 | age | range: 29-77 |
2 | sex | 0, 1 |
3 | chest pain type | 1, 2, 3, 4 |
4 | resting blood pressure | range: 94-200 |
5 | serum cholesterol in mg/dl | range: 126-564 |
6 | fasting blood sugar > 120 mg/dl | 0, 1 |
7 | resting electrocardiographic results | 0, 1, 2 |
8 | maximum heart rate achieved | range: 71-202 |
9 | exercise induced angina | 0, 1 |
10 | oldpeak: ST depression (positions on the ECG plot) induced by exercise relative to rest | range: 0-6.2 |
11 | slope of the peak exercise ST segment | 1, 2, 3 |
12 | number of major vessels colored by fluoroscopy | 0, 1, 2, 3 |
13 | thal: genetic blood disorder | 3 = normal; 6 = fixed defect; 7 = reversible defect |
14 | heart disease (target) | 1 = absence; 2 = presence |
Table 1 (Singh 2023)
To visualize the data, two box plots were created so that features on similar scales are plotted together. The first box plot has a scale of up to 600 and shows an extreme outlier in the feature serum cholesterol at a value of 564, as well as mild outliers in the features resting blood pressure and max heart rate. The second box plot has a scale of up to 7 and shows mild outliers for chest pain type, fasting blood sugar, old peak, and major vessels. Since there are only 270 observations, the extreme outlier data point will be removed before running the model.
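One common way to separate mild from extreme outliers numerically is the interquartile-range rule that box plots are built on. The sketch below applies base R's `boxplot.stats()` to a made-up vector; only the value 564 comes from the text above, and the 1.5 and 3 multipliers are a widely used convention rather than something stated in the source.

```r
# Hypothetical cholesterol-like values; only 564 is taken from the text above
x <- c(200, 210, 220, 230, 240, 250, 260, 270, 280, 360, 564)

# Points beyond 1.5 * IQR from the hinges (mild or extreme outliers)
mild_or_extreme <- boxplot.stats(x, coef = 1.5)$out

# Points beyond 3 * IQR from the hinges (extreme outliers only)
extreme <- boxplot.stats(x, coef = 3)$out

print(mild_or_extreme)
print(extreme)
```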
# Split dataframe into dataframe for box plots only
data_gt_100 = data
data_lt_100 = data
# Box plot age, serum cholesterol, resting blood pressure, and max heart rate
data_gt_100 = data %>% select(1,4,5,8)
# Reshape sample data to long form
data_gt_100_melt <- melt(data_gt_100)
# Add variable parameter for axis label in dataset
levels(data_gt_100_melt$variable) <- c("aging","resting.blood.pressure","serum.cholesterol","max.heart.rate")
# Draw box plot
ggplot(data_gt_100_melt, aes(variable, value)) +
  geom_boxplot(width = .25) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
# Box plot of all other variables
data_lt_100 = data_lt_100 %>% select(2,3,6,7,9:14)
# Reshape sample data to long form
data_lt_100_melt <- melt(data_lt_100)
# Add variable parameter for axis label in dataset
levels(data_lt_100_melt$variable) <- c("sex","chest.pain.type","fasting.blood.sugar","resting.electrocardiographic.results", "exercise.induced.angina","oldpeak","ST.segment","major.vessels", "thal", "heart.disease")
# Draw box plot
ggplot(data_lt_100_melt, aes(variable, value)) +
  geom_boxplot(width = .25) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
The extreme outlier for the feature serum cholesterol is removed from the dataset and validated by rerunning the first box plot. The scale was reduced from 600 to 450 indicating the data point 564 was removed for feature serum cholesterol.
# Remove extreme outlier (compare against the numeric value 564)
data_gt_100 = data_gt_100[data$serum.cholestoral != 564, ]
data1 = data
data1 = data1[data1$serum.cholestoral != 564, ]
# Rerun box plot after extreme outlier removed
# Reshape sample data to long form
data_gt_100_melt <- melt(data_gt_100)
# Add variable parameter for axis label in dataset
levels(data_gt_100_melt$variable) <- c("aging","resting.blood.pressure","serum.cholesterol","max.heart.rate")
# Draw box plot
ggplot(data_gt_100_melt, aes(variable, value)) +
  geom_boxplot(width = .25) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
In order to run a classification model of the random forest algorithm with heart disease as a target, the heart disease variable was transformed to a factor variable. The random forest classification model requires a dependent factor variable.
data1$heart.disease <- as.factor(data1$heart.disease)
The target variable was binary with categories of no heart disease (1) and heart disease (2). After removing the outlier observation in cholesterol, 269 observations were used to create a random forest algorithm to predict the presence of heart disease. Using the caret package in R, the data set was split for a training model and a testing model at a 0.7 ratio. This resulted in 189 observations in the training data set and 80 in the testing data set. The training dataset was then assessed for the best parameter options that produced the greatest estimated accuracy. Algorithm parameters are defined as such:
Ntree: The number of trees.
Mtry: Controls the amount of randomness that is added to the process. The number of randomly sampled variables drawn at each split.
Terminal nodes: The number of leaf nodes in the tree.
Max nodes: A way to control the complexity of the trees. Maximum number of terminal nodes trees in the forest can have.
Nodesize: Minimum size of terminal nodes.
Feature: A set of inputs.
The testing dataset was then passed into the final tuned algorithm to create a predictive model.
set.seed(123)
split <- createDataPartition(data1$heart.disease, p = 0.7, list = FALSE)
train <- data1[split, ]
test <- data1[-split, ]
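For readers unfamiliar with class-balanced partitioning, the base R sketch below shows one way to draw a 70/30 split within each class so that both subsets keep roughly the same class proportions. It illustrates the idea only and is not the internals of `createDataPartition`; the class counts are hypothetical and were chosen only so that the total matches the 269 observations mentioned above.

```r
set.seed(123)
# Hypothetical outcome vector: 149 observations of class 1 and 120 of class 2
y <- factor(c(rep("1", 149), rep("2", 120)))
p <- 0.7

# Within each class, sample a fraction p of the row indices for training
train_idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = ceiling(p * length(rows)))
}))

train_y <- y[train_idx]
test_y <- y[-train_idx]

# Class proportions should be similar in both subsets
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```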
A default training model using the k-fold cross-validation sampling method based on classification was created to establish a baseline accuracy. Random Forests are predictive models that form multiple decision trees. At each tree node, a subset of the 13 predictor features was used to introduce randomness into the model. The default training model used 500 decision trees (ntree = 500) and determined the optimal number of features to compare to be 2 (mtry = 2). For this classification RF, the decision at each tree node was conducted by permutation importance by removing the association between the feature and target. Accuracy of the default model was 0.8358.
set.seed(123)
# Define the control
trControl <- trainControl(method = "cv", number = 10, search = "grid")
#model with default values
rf_default <- train(heart.disease~.,
    data = train,
    method = "rf",
    metric = "Accuracy",
    trControl = trControl)
# Print the results
print(rf_default)
Random Forest
189 samples
13 predictor
2 classes: '1', '2'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 169, 170, 170, 171, 169, 171, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8357895 0.6638724
7 0.8311404 0.6531982
13 0.8264035 0.6444556
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
The training model was assessed for the best possible number of decision trees, nodes, and number of features to consider at each node. After performing the following procedures, the greatest accuracy was obtained with 300 trees, a node size of 14, max nodes of 6, and 2 features considered at each node. The first tuning parameter evaluated was the number of features the model could consider at each node in a tree. The highest accuracy, 0.8522, occurs at mtry = 2, which is the same mtry value as the default model.
#Search best mtry
set.seed(123)
tuneGrid <- expand.grid(.mtry = c(1: 10))
rf_mtry <- train(heart.disease~.,
    data = train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 300)
print(rf_mtry)
Random Forest
189 samples
13 predictor
2 classes: '1', '2'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 169, 170, 170, 171, 169, 171, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
1 0.8308187 0.6497031
2 0.8521637 0.6938909
3 0.8407895 0.6711918
4 0.8410819 0.6746692
5 0.8305556 0.6515543
6 0.8311404 0.6546218
7 0.8261404 0.6452077
8 0.8150292 0.6223296
9 0.8150292 0.6232846
10 0.8047953 0.6038736
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
best_mtry <- rf_mtry$bestTune$mtry
The subsequent procedure evaluates the optimal number of max nodes for the model. Of the max nodes values run between 5 and 15, maxnodes = 5 and maxnodes = 6 tie for the highest mean accuracy (0.8674); maxnodes = 6 was used for the model.
# Search maxnodes - look for highest accuracy
store_maxnode <- list()
# Assigns best mtry to tuneGrid
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5: 15)) {
    set.seed(123)
    rf_maxnode <- train(heart.disease~.,
        data = train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    current_iteration <- toString(maxnodes)
    store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry, metric = "Accuracy")
Call:
summary.resamples(object = results_mtry, metric = "Accuracy")
Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5 0.6842105 0.8355263 0.8947368 0.8674269 0.9000000 1 0
6 0.6842105 0.8355263 0.8947368 0.8674269 0.9000000 1 0
7 0.6842105 0.7807018 0.8210526 0.8360819 0.8947368 1 0
8 0.7368421 0.8355263 0.8460526 0.8624269 0.8986842 1 0
9 0.6842105 0.8355263 0.8460526 0.8519006 0.8947368 1 0
10 0.6842105 0.7921053 0.8377193 0.8363743 0.8815789 1 0
11 0.7368421 0.7938596 0.8421053 0.8463450 0.8835526 1 0
12 0.6842105 0.7833333 0.8421053 0.8413450 0.8855263 1 0
13 0.6842105 0.8355263 0.8421053 0.8521930 0.8835526 1 0
14 0.6842105 0.8083333 0.8421053 0.8521637 0.8986842 1 0
15 0.6842105 0.8355263 0.8421053 0.8569006 0.9000000 1 0
The final parameter evaluated was the number of trees for the model. Of the ntree values run between 100 and 1,000, 300 trees gave the highest mean accuracy (0.8674), and the output shows that the accuracy no longer changes from 400 trees onward.
# Search best ntree
store_maxtrees <- list()
for (ntree in c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)) {
    set.seed(123)
    rf_maxtrees <- train(heart.disease~.,
        data = train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = 6,
        ntree = ntree)
    key <- toString(ntree)
    store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree, metric="Accuracy")
Call:
summary.resamples(object = results_tree, metric = "Accuracy")
Models: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.7368421 0.7777778 0.8947368 0.8457895 0.8986842 0.9444444 0
200 0.7368421 0.7807018 0.8947368 0.8618713 0.9000000 1.0000000 0
300 0.6842105 0.8355263 0.8947368 0.8674269 0.9000000 1.0000000 0
400 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
500 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
600 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
700 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
800 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
900 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
1000 0.6842105 0.8355263 0.8460526 0.8571637 0.8986842 1.0000000 0
The final training model produced an accuracy of 0.8674 compared to the default training model with an accuracy of 0.8358, a 3.78% relative improvement in accuracy over the default RF parameters.
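The arithmetic behind this comparison can be checked in a few lines. The two accuracy values are the rounded figures quoted in the text; the variable names are illustrative.

```r
default_accuracy <- 0.8358  # default model accuracy quoted in the text
tuned_accuracy <- 0.8674    # tuned model accuracy quoted in the text

# Absolute gain in accuracy, and the gain relative to the default model
absolute_gain <- tuned_accuracy - default_accuracy
relative_gain_pct <- 100 * absolute_gain / default_accuracy

print(round(absolute_gain, 4))      # 0.0316, i.e. 3.16 percentage points
print(round(relative_gain_pct, 2))  # 3.78
```

The 3.78% figure is therefore the gain relative to the default accuracy, while the absolute gain is 3.16 percentage points.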
# Train final model
set.seed(123)
fit_rf <- train(heart.disease~.,
    data = train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 300,
    maxnodes = 6)
print(fit_rf)
Random Forest
189 samples
13 predictor
2 classes: '1', '2'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 169, 170, 170, 171, 169, 171, ...
Resampling results:
Accuracy Kappa
0.8674269 0.7242847
Tuning parameter 'mtry' was held constant at a value of 2
A default training model using the bootstrap sampling method based on classification was created to compare with the k-fold cross-validation sampling method. Bootstrap does not need cross-validation. Random Forests are predictive models that form multiple decision trees. At each tree node, a subset of the 13 predictor features was used to introduce randomness into the model. The default training model used 500 decision trees (ntree = 500) and determined the optimal number of features to compare to be 2 (mtry = 2). For this classification RF, the decision at each tree node was conducted by Gini Impurity, which is a process of estimating the likelihood that a feature may be misclassified at random; the feature with the lowest likelihood of misclassification is then selected at that node. Accuracy of the bootstrap default model is 0.8322 compared to 0.8358 for the k-fold cross-validation sampling method.
set.seed(123)
fit_rf2 <- train(heart.disease~.,
    data = train,
    method = "rf",
    importance = TRUE)
print(fit_rf2)
Random Forest
189 samples
13 predictor
2 classes: '1', '2'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8321817 0.6571619
7 0.8229013 0.6393790
13 0.8117677 0.6172537
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
To compare the k-fold cross-validation sampling method with the bootstrap sampling method, the same final training settings were used to train a bootstrap model. The bootstrap sampling method is still lower, with 0.8345 accuracy compared to 0.8674 for the k-fold cross-validation sampling method. Since the k-fold cross-validation sampling method has the higher accuracy, the predictive modeling is based on the k-fold cross-validation sampling method.
# Train final model
set.seed(123)
fit_rf2 <- train(heart.disease~.,
    data = train,
    method = "rf",
    tuneGrid = tuneGrid,
    importance = TRUE,
    nodesize = 14,
    ntree = 300,
    maxnodes = 6)
print(fit_rf2)
Random Forest
189 samples
13 predictor
2 classes: '1', '2'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 189, 189, 189, 189, 189, 189, ...
Resampling results:
Accuracy Kappa
0.8344737 0.6605306
Tuning parameter 'mtry' was held constant at a value of 2
# Evaluate model
prediction <- predict(fit_rf, test)
The final training model was run against the test data. The results from the predictive test model indicated that the feature major vessels was the most influential feature when predicting heart disease from the current dataset with a variable importance score of 100.00. The five most influential features all had variable importance scores above 50.00. These features were major vessels, thal, chest pain type, old peak, and max heart rate.
varImp(fit_rf)
rf variable importance
Importance
major.vessels 100.000
thal 94.662
chest.pain.type 76.738
oldpeak 67.389
max.heart.rate 54.524
exercise.induced.angina 47.493
sex 35.574
age 32.505
ST.segment 30.580
resting.blood.pressure 15.787
resting.electrocardiographic.results 3.221
fasting.blood.sugar 2.348
serum.cholestoral 0.000
rfImp <- varImp(fit_rf)
ggplot(rfImp, aes(x = importance(), y = value)) +
  geom_bar(stat = "identity", linewidth = 0.5,fill="blue")
A confusion matrix was run for the predictive model. Accuracy for the predictive model was calculated to be 83.8% (0.8375, 95% CI: 0.7382, 0.9105, p < 0.001). Sensitivity and specificity were calculated at 0.909 and 0.75, respectively.
cm <- confusionMatrix(prediction, test$heart.disease)
cm
Confusion Matrix and Statistics
Reference
Prediction 1 2
1 40 9
2 4 27
Accuracy : 0.8375
95% CI : (0.7382, 0.9105)
No Information Rate : 0.55
P-Value [Acc > NIR] : 5.091e-08
Kappa : 0.6675
Mcnemar's Test P-Value : 0.2673
Sensitivity : 0.9091
Specificity : 0.7500
Pos Pred Value : 0.8163
Neg Pred Value : 0.8710
Prevalence : 0.5500
Detection Rate : 0.5000
Detection Prevalence : 0.6125
Balanced Accuracy : 0.8295
'Positive' Class : 1
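The headline statistics can be recomputed by hand from the four cell counts, which helps clarify how each one is defined. The sketch below copies the counts from the output above and treats class 1 as the positive class, as that output reports; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm_table <- matrix(c(40, 4, 9, 27), nrow = 2,
                   dimnames = list(Prediction = c("1", "2"),
                                   Reference = c("1", "2")))

# Class 1 is the positive class
tp <- cm_table["1", "1"]  # predicted 1, actually 1
fn <- cm_table["2", "1"]  # predicted 2, actually 1
fp <- cm_table["1", "2"]  # predicted 1, actually 2
tn <- cm_table["2", "2"]  # predicted 2, actually 2

accuracy <- (tp + tn) / sum(cm_table)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
pos_pred_value <- tp / (tp + fp)
neg_pred_value <- tn / (tn + fn)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = pos_pred_value,
              npv = neg_pred_value), 4))
```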
draw_confusion_matrix <- function(cm) {
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# Create the matrix
rect(150, 430, 240, 370, col='#3F97D0')
text(195, 435, 'Positive', cex=1.2)
rect(250, 430, 340, 370, col='#F7AD50')
text(295, 435, 'Negative', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(140, 400, 'True', cex=1.2, srt=90)
text(140, 335, 'False', cex=1.2, srt=90)
# Add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# Add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# Add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
draw_confusion_matrix(cm)
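The predictive model has an estimated false positive rate of 25.00%. The false positive rate is the number of false positives divided by the sum of false positives and true negatives: FPR = FP/(FP + TN). With class 1 as the positive class, the confusion matrix gives FP = 9 and TN = 27, so the rate equals 1 minus the specificity of 0.75.
9/(9+27)
[1] 0.25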
In closing, the Random Forest algorithm is a powerful supervised learning model that combines multiple uncorrelated decision trees to form a forest. Two sampling methods were evaluated: k-fold cross-validation and bootstrap. In k-fold cross-validation, decision tree splits are constructed based on Permutation Importance, while in bootstrap they are constructed based on Gini Impurity. The algorithm can be applied to both classification and regression tasks, with the prediction determined by the majority vote or the average of the leaf nodes, respectively. The Random Forest algorithm offers a versatile approach to solving complex problems and is widely used in various domains of machine learning and data analysis.