Monday, August 29, 2011

Ensemble learning, bagging, boosting

Ensemble learning - Wikipedia, the free encyclopedia:
"an ensemble is a technique for combining many weak learners in an attempt to produce a strong learner"

Bagging, short for 'bootstrap aggregating', trains each model in the ensemble on a bootstrap sample of the training set, i.e. a random sample drawn with replacement. Each model is therefore trained independently of the others. The random forest algorithm, for instance, combines randomized decision trees with bagging to achieve high classification accuracy.
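To make the bagging idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the base learner is a one-feature decision stump rather than a full decision tree, and the names (bootstrap_sample, train_stump, bagging_fit, bagging_predict) and the toy data are made up for this example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(data):
    """Fit a threshold classifier on (x, label) pairs with labels 0 or 1.

    Tries every observed x as a threshold and both label orientations,
    and keeps the combination with the fewest training errors.
    """
    best = None
    for threshold, _ in data:
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in data
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def bagging_fit(data, n_models, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Simple majority vote over the individual models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: small x values are class 0, large ones class 1.
toy_data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
            (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(toy_data, n_models=25)
print(bagging_predict(models, 0.0), bagging_predict(models, 5.0))
```

No model sees the output of any other model, so the 25 stumps could be trained in any order or in parallel.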

Boosting builds the ensemble incrementally: each new model is trained to emphasize the training instances that previous models misclassified. A common implementation of boosting is AdaBoost.
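The reweighting step can be sketched as follows. This is a simplified AdaBoost written for illustration, assuming labels of -1/+1 and one-feature threshold stumps as weak learners; the function names and the toy data are invented for the example, not taken from any library.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump.

    A stump predicts `polarity` when x <= threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (polarity if x <= threshold else -polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform instance weights
    ensemble = []                    # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        error, threshold, polarity = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points (y * pred == -1) get heavier, the rest lighter.
        updated = []
        for x, y, w in zip(xs, ys, weights):
            pred = polarity if x <= threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(
        alpha * (polarity if x <= threshold else -polarity)
        for alpha, threshold, polarity in ensemble
    )
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy set the best single stump still misclassifies three points, while the weighted vote of three stumps reproduces every training label. Unlike bagging, the rounds cannot be reordered, because each round's weights depend on the errors of the previous round.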

Another explanation, from the randomForest documentation:

Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Shapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees— each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction.

1 comment:

  1. On slide 14 (of 28) in this powerpoint (

    Bagging alone utilizes the same full set of predictors to determine each split.
    However, random forest applies another judicious injection of randomness: it selects a random subset of the predictors for each split (Breiman 2001).
    The number of predictors to try at each split is known as mtry.
    While this becomes a new parameter, typically sqrt(p) (classification) or p/3 (regression), where p is the number of predictors, works quite well.
    RF is not overly sensitive to mtry

    Bagging is a special case of random forest where mtry equals the total number of predictors (mtry = p).
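The role of mtry can be sketched in a few lines of Python. This is an illustration only: default_mtry mirrors the defaults of R's randomForest package as I understand them (floor(sqrt(p)) for classification, max(floor(p/3), 1) for regression), and candidate_predictors is a hypothetical helper showing which predictor indices a tree would be allowed to consider at one split.

```python
import math
import random


def default_mtry(n_predictors, task):
    """Assumed defaults: floor(sqrt(p)) for classification, max(floor(p/3), 1) for regression."""
    if task == "classification":
        return max(math.isqrt(n_predictors), 1)
    if task == "regression":
        return max(n_predictors // 3, 1)
    raise ValueError("task must be 'classification' or 'regression'")


def candidate_predictors(n_predictors, mtry, rng):
    """Indices of the predictors considered at a single split."""
    return sorted(rng.sample(range(n_predictors), mtry))


rng = random.Random(0)
p = 9

# Random forest: a fresh random subset of predictors at every split.
print(candidate_predictors(p, default_mtry(p, "classification"), rng))
print(candidate_predictors(p, default_mtry(p, "classification"), rng))

# Bagging: mtry = p, so every split sees the full set of predictors.
print(candidate_predictors(p, p, rng))
```

With mtry = p the random draw has no effect, which is why bagging falls out as the special case.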