
What Is Random About Random Forest

  • Writer: Himanshu Sachdeva
  • Oct 12, 2020
  • 5 min read

Ensemble models are very commonly used in machine learning to boost the performance of various machine learning algorithms. An ensemble means a group of things viewed as a whole rather than individually. In an ensemble, a collection of models is used to make predictions rather than a single model. Arguably the most popular member of the family of ensemble models is the random forest: an ensemble built by combining a large number of decision trees. The great thing about random forests is that they almost always outperform a single decision tree in terms of accuracy.


An ensemble is successful when each model of the ensemble complies with the following conditions:

  • Each model should be diverse. Diversity ensures that the models serve complementary purposes, which means that the individual models make predictions independent of each other.

  • Each model should be acceptable. Acceptability implies that each model is at least better than a random model. A small numerical sketch after this list shows why these two conditions help.
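
To see why these conditions matter, consider a toy calculation (a minimal sketch, not from the original post, assuming the models vote independently, which real trees only approximate): if each of n models is correct with probability p > 0.5, the accuracy of the majority vote climbs towards 1 as n grows.

    # A minimal sketch: probability that a majority vote of n independent models,
    # each correct with probability p, is correct. Real trees are only
    # approximately independent, so treat this as an idealized illustration.
    from math import comb

    def majority_vote_accuracy(n, p):
        """P(majority of n independent models is correct), for odd n."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    for n in (1, 5, 25, 101):
        print(n, round(majority_vote_accuracy(n, p=0.6), 3))
    # Accuracy rises from 0.6 for a single model towards 1.0 as models are added,
    # but only because each model is "acceptable" (p > 0.5) and the votes are diverse.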


Randomness of Random Forests


Random forests use a special ensemble method called bagging, which stands for Bootstrap Aggregation. Bootstrapping means creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the given data set uniformly and with replacement, and it typically contains about 30 - 70% of the data from the original data set. Aggregation implies combining the results of the different models present in the ensemble.
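
As a quick illustration (a minimal sketch using NumPy, not code from the post), a bootstrap sample of the same size as the data set can be drawn by sampling row indices with replacement; on average only about 63% of the original rows appear in it, which fits the 30 - 70% figure above.

    # A minimal sketch, assuming NumPy is available: drawing one bootstrap sample.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000                                   # number of observations (toy value)
    X = rng.normal(size=(N, 5))                # toy data: 1000 rows, 5 features

    idx = rng.integers(0, N, size=N)           # sample N row indices with replacement
    bootstrap_sample = X[idx]                  # the bootstrap sample (same size as X)

    unique_fraction = len(np.unique(idx)) / N  # fraction of original rows that appear
    print(f"unique rows in bootstrap sample: {unique_fraction:.0%}")  # ~63% on average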


Plain bagging considers all features when splitting a node. This can be a problem: the bootstrap samples tend to produce very similar trees, because only the strongest features end up being used to split the nodes in every tree. In a random forest, however, only a subset of features is selected at random out of the total, and the best splitting feature from that subset is used to split each node in a tree. The candidate features for a single split could be drawn as in the sketch below.
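
For example (a minimal sketch, not the post's code; M = 16 is just an assumed feature count), a fresh random subset of features is drawn every time a node is split:

    # A minimal sketch: picking a random subset of features for one split.
    import numpy as np

    rng = np.random.default_rng(0)
    M = 16                                        # total number of features (assumed)
    subset_size = int(np.sqrt(M))                 # a common choice for classification
    features_for_this_node = rng.choice(M, size=subset_size, replace=False)
    print(features_for_this_node)                 # indices of the candidate features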


Here we can see how random sampling of the data to build each tree and random feature selection to split the nodes enable a random forest to perform much better than an individual decision tree.


Steps to create a random forest



The training process of each tree in a random forest is the same as for a decision tree, except that at each node only a random selection of features is considered for the split at that node.


Following are the steps to create a random forest (a code sketch follows the list):

  1. Create a bootstrap sample from the training data set.

  2. Construct a decision tree using the bootstrap sample. As mentioned earlier, while splitting a node of the tree, consider only a random subset of features. Every time a node has to be split, a different random subset of features is considered.

  3. Repeat the above steps n times to construct n trees in the forest. Remember each tree is constructed independently, so it is possible to construct each tree in parallel.

  4. While predicting for a test case, each tree predicts individually, and the final prediction is given by the majority vote of all the trees. When the prediction is made for an observation from the training set (as in the out-of-bag estimate described below), only the votes of those trees that did not have that observation in their bootstrap sample are counted.
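
The steps above can be sketched in a few lines (a simplified illustration, assuming scikit-learn and NumPy; real implementations such as RandomForestClassifier handle the per-node feature subsetting and out-of-bag bookkeeping internally):

    # A simplified from-scratch sketch of the four steps above.
    # DecisionTreeClassifier(max_features="sqrt") performs the per-node random
    # feature selection of step 2 for us.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(42)
    n_trees, N = 50, len(X)

    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)                     # step 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")   # step 2: random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)                                   # step 3: repeat to build the forest

    # Step 4: majority vote over all trees for some test rows (binary labels; ties go to class 0).
    votes = np.array([t.predict(X[:5]) for t in trees])
    majority = (votes.mean(axis=0) > 0.5).astype(int)
    print("ensemble prediction for first 5 rows:", majority)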

OOB (Out-of-Bag) error


Random forests do not have to follow the usual convention of separate training and test data. They use the concept of the Out-of-Bag (OOB) error. Since each tree is built on a bootstrap sample, a fraction of the data points is left unseen by each decision tree while it is being built. Each observation can therefore be used as a test observation by those trees that did not have it in their bootstrap sample. All such trees predict on that observation, yielding a prediction (and possibly a misclassification) for it from the ensemble. The OOB error is calculated by treating each observation of the training set as a test observation in this way, computing the error on each observation, and aggregating the results.
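
With scikit-learn (a minimal sketch, assuming the library is installed), the OOB estimate is available directly by setting oob_score=True:

    # A minimal sketch of the OOB estimate using scikit-learn's RandomForestClassifier.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)

    # Each observation is scored only by the trees that did not see it in their
    # bootstrap sample, so this is an estimate of performance on unseen data.
    print("OOB accuracy estimate:", round(rf.oob_score_, 3))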


Time required to construct a random forest model


To construct a forest of S trees on a data set with M features and N observations, the time taken depends on the following factors (see the sketch after this list):

  1. The number of trees. The time is directly proportional to the number of trees, but it can be reduced by creating the trees in parallel.

  2. The size of the bootstrap sample. Generally the size of a bootstrap sample is 30 - 70% of N. The smaller the sample, the faster the forest is built.

  3. The size of the subset of features considered while splitting a node. Generally this is taken as √M for classification and M/3 for regression.
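
In scikit-learn these three factors map directly to constructor parameters (a minimal sketch; the values shown are arbitrary examples, and the actual speed-up from parallelism depends on your hardware):

    # A minimal sketch: the three cost factors above as RandomForestClassifier parameters.
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=300,      # 1. number of trees S (built in parallel via n_jobs)
        max_samples=0.5,       # 2. bootstrap sample size as a fraction of N
        max_features="sqrt",   # 3. features considered per split (≈ √M for classification)
        n_jobs=-1,             # use all available CPU cores to build trees in parallel
        random_state=0,
    )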

Hyperparameter Tuning


A random forest has a tendency to grow very complex as we increase the depth and the number of constituent trees, which may impact the performance of the model.


The following hyperparameters are available in a random forest classifier to tune the performance of the model (the names below follow scikit-learn's RandomForestClassifier). Note that most of these hyperparameters are the same as for the decision trees in the forest. A tuning sketch follows the list.

  • n_estimators: The number of trees in the forest.

  • criterion: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

  • max_features : The number of features to consider when looking for the best split.

  • max_depth : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split : The minimum number of samples required to split an internal node. The default is 2.

  • min_samples_leaf : The minimum number of samples required to be at a leaf node. The default is 1.

  • min_weight_fraction_leaf : The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_leaf_nodes : Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined in terms of relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_split : Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
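
A common way to tune these hyperparameters is a cross-validated grid search (a minimal sketch, assuming scikit-learn; the grid values here are arbitrary examples, not recommendations):

    # A minimal sketch of hyperparameter tuning with a cross-validated grid search.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 5, 10],
        "max_features": ["sqrt", 0.5],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("best CV accuracy:", round(search.best_score_, 3))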

Advantages of a random forest

  1. A random forest is more stable than any single decision tree because the results get averaged out; it is not affected by the instability and bias of an individual tree.

  2. A random forest is relatively immune to the curse of dimensionality, since only a subset of features is used to split each node.

  3. You can parallelize the training of a forest since each tree is constructed independently.

  4. You can calculate the OOB (Out-of-Bag) error using the training set which gives a really good estimate of the performance of the forest on unseen data. Hence there is no need to split the data into training and validation; you can use all the data to train the forest.

Disadvantages of a random forest

  1. A big disadvantage of a random forest is that it is essentially a black-box model: we lose the interpretability of a single decision tree and cannot easily point out the factors that led to a specific prediction.

  2. Besides, a large number of trees can make the computation much slower and inefficient for real-time predictions.


The random forest is a widely used supervised learning algorithm. Though it works for both classification and regression, it is mainly used for classification problems.

