December 11, 2013 in answer
I think you may be a bit confused, or using terminology I’m not familiar with. Nothing in AdaBoost or more general stage-wise additive models is independent or mutually exclusive, nor were they designed to be mutually exclusive.
Does this explain the success of Adaboost with stump learners or decision trees compared to (from my experience) Adaboost with more comprehensive classifiers like SVM, or a linear model?
No. Methods that produce an ensemble of classifiers can be powerful, and their power comes mostly from the ability to reduce the error caused by the variance of the base model. AdaBoost and others can also reduce the bias, but it is much easier to reduce variance-induced error.
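You can see the variance-reduction effect directly with a small simulation. The sketch below (my own illustrative setup, not from the original discussion) repeatedly redraws a noisy training set, then compares how much a single fully-grown tree's prediction at one fixed point jumps around versus the average of 25 bootstrapped trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0 = np.array([[1.0]])  # fixed query point where we measure prediction variance

single_preds, ensemble_preds = [], []
for _ in range(100):
    # a fresh noisy training set each trial simulates "different training data"
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

    # one fully-grown tree: low bias, high variance
    single_preds.append(DecisionTreeRegressor(random_state=0).fit(X, y).predict(x0)[0])

    # averaging 25 trees fit on bootstrap resamples smooths that variance out
    boot = []
    for _ in range(25):
        idx = rng.randint(0, len(X), len(X))
        boot.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]).predict(x0)[0])
    ensemble_preds.append(np.mean(boot))

var_single = np.var(single_preds)
var_ensemble = np.var(ensemble_preds)
print(var_single, var_ensemble)  # the ensemble's variance is much smaller
```

Averaging many imperfectly-correlated base models shrinks the variance component of the error, which is exactly where most of the ensemble's advantage comes from.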
For this reason we use decision trees, as we can control the level of bias/variance in the tree by altering its maximum depth. This makes life easy, but they are not the be-all and end-all of boosting (for example, boosting in a high-dimensional space is quite difficult, and trees are horrible in such situations).
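In scikit-learn terms, that knob is just the base tree's `max_depth`. A minimal sketch (synthetic data; the parameter name for the base learner is `estimator` in recent scikit-learn and `base_estimator` in older versions, so both are tried):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

scores = {}
for depth in (1, 3, 6):  # depth 1 is a stump: high bias, low variance
    base = DecisionTreeClassifier(max_depth=depth)
    try:
        clf = AdaBoostClassifier(estimator=base, n_estimators=100, random_state=0)
    except TypeError:  # older scikit-learn uses `base_estimator`
        clf = AdaBoostClassifier(base_estimator=base, n_estimators=100, random_state=0)
    scores[depth] = clf.fit(Xtr, ytr).score(Xte, yte)
print(scores)
```

Sweeping `max_depth` like this is usually the cheapest way to tune where each base learner sits on the bias/variance spectrum.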
We don’t usually use linear models in boosting because they simply aren’t that good at it. Without much thought, we can produce “easy” data sets that boosted linear models will not converge well on (consider one ring inside another, with classes of equal size: any linear base learner must cut the inner ring, and therefore the outer ring, in half). A lowly decision stump is often better simply because it has a non-linearity in it, allowing much faster adaptation to the data.
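The ring example is easy to reproduce. A sketch with scikit-learn (my own setup; `make_circles` generates the two concentric rings, and AdaBoost's default base learner is a depth-1 tree, i.e. a decision stump):

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# one ring inside another, classes of equal size
X, y = make_circles(n_samples=1000, factor=0.4, noise=0.05, random_state=0)

# a single linear cut must split both rings in half, so it sits near chance
linear = LogisticRegression().fit(X, y)

# boosted stumps: the non-linearity lets the vote carve out the inner ring
stumps = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)

print(linear.score(X, y), stumps.score(X, y))
```

Boosting the linear model doesn't rescue it here: each reweighted round still fits a line that cuts both rings, so the first base learner is roughly no better than random and the ensemble stalls (scikit-learn's AdaBoost may even abort on it for exactly that reason).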
We avoid complex models such as SVMs because they take a long time to train. Regardless of which base model you choose, AdaBoost is going to run towards the same type of solution (it tries to maximize the L1 margin, where SVMs maximize the L2 margin). If you have to boost 1000 trees or 500 SVMs, it’s probably going to be a lot faster to boost the trees. That doesn’t even get into the parameter search you would have to do for each SVM for each model added. It is simply too time-consuming. However, there are cases where this can work well – here is a face detection case.
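The margin contrast in the parenthetical can be written out explicitly. A sketch using the standard notation from the boosting-margin literature (the symbols here are my own labeling, not from the original post: $h_t$ are the base classifiers, $\alpha_t$ their vote weights, $w$ the SVM weight vector):

```latex
% AdaBoost drives up the minimum L1-normalized margin of the weighted vote:
\mathrm{margin}_{\text{boost}}(x_i, y_i)
  = \frac{y_i \sum_t \alpha_t h_t(x_i)}{\sum_t \lvert \alpha_t \rvert}

% while an SVM maximizes the L2-normalized geometric margin:
\mathrm{margin}_{\text{SVM}}(x_i, y_i)
  = \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert_2}
```

Both objectives push toward large-margin solutions; only the norm used to measure the margin differs, which is why stacking one on top of the other buys little.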
There is also the issue of prediction time. If you need to boost 100 or 1000 models, prediction time goes up by 2 or 3 orders of magnitude. SVMs are already not the fastest predictors, and this only makes the situation worse.
The details of this are picked up more from the math than from discussion in English. If you are interested in a more explicit discussion of why such models work, read some of Leo Breiman’s papers.