请教：有关boosting的原理

sweetjujube · 发表于 2008-10-18 00:00:54

请问各位大侠，有没有看过这篇文章：
Additive logistic regression: a statistical view of boosting
里面有许多公式没有推导，本人读起来非常费劲，但我觉得后续许多boosting的改进算法，都是以此文为理论依据。拜托读过此文的高人给讲解讲解。本人QQ：1061093265

路呆 · 发表于 2008-10-18 01:11:02

提出 logic 和gentle两个boost的那篇？

sweetjujube · 发表于 2008-10-18 11:27:21

不是吧，是运用概率论相关理论，从加性模型和极大似然率角度对boosting的原理进行分析，有两个版本，一个是98年发的PS文件，另一个是00年版本，有PDF文件可下载。这是00年版的摘要：
Boosting (Freund & Schapire 1995) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on bestfirst truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule.

路呆 · 发表于 2008-10-18 12:26:23

1998年Jerome Friedman & Trevor Hastie & Robert Tibshirani发表文章Additive Logistic Regression: a Statistical View of Boosting

一些重要的结论:
Bagging是一个纯粹的降低相关度的方法,如果树的节点具有很高的相关性,bagging就会有好的结果
1.早期的AdaBoost在第二步的时候采用重采样方法,即使某些样本权重增加.这种方法与bagging存在某种关联,它也是Boost的成功之处中降低相关度方面的重要部分.
2.在第二步中如果使用加权的tree-growing算法,而不是重采样算法,效果会更好. This removes the randomization component essential in bagging
3.使用stumps作为弱分类器

Logit和Gentle算法的提出过程大致是这样的
1. 验证Boosting algorithms是一种拟合一个additive logistic regression model(加性的逻辑回归模型)的阶段式估计过程.它有最优一个指数判据,这个判据由第二定理与二项式对数似然判据是等价的.
2. 作者证明Discrete是使用adaptive Newton updates拟合一个additive logistic regression model来最小化Ee^(-yF(x))的过程,其中F(x)=求和fm(x),而fm(x)就是各层分类器的结果
3. 作者证明Real是使用层级最优的方法来拟合一个additive logistic regression model.
4. 作者说明了为什么要选择Ee^(-yF(x))作为目标:因为大家都用这个
5. 作者证明了当（blabla一个很复杂的公式，贴不出来）时Ee^(-yF(x))最小
6. 作者证明了每次权值更新以后,最近一次训练出的弱分类器错误率为50%.
7. 作者证明了对于最优的F(x),样本的分类乘以权值的和应该为0.

于是作者用80年代兴起的逻辑回归的寻优方法中提炼出了LogitBoost(我终于找到logitBoost的logic了)

自适应的牛顿法,拟合加性logistic回归模型
1. 获得样本,(x,y)序列.将分类y*=(y+1)/2
2. 设置初值,F(x)=0,p(xi)=1/2

进入循环
1. 依式计算zi,wi.
2. 通过加权最小二乘的方式拟合函数fm(x).由zi拟合xi,权重为wi
3. 更新F(x),p(x)
退出循环

输出分类器sign[F(x)].

作者提出,logitAdaBoost在每一轮中都使Ee^(-y(F(x)+f(x)))最优,会使训练样本的代表性下降,于是提出了Gentle AdaBoost(牛顿步长法)

这篇？

sweetjujube · 发表于 2008-10-18 18:21:18

感谢路呆兄的讲解，看样子路呆兄在这方面很有研究，本人还有一些不太明白之处，望兄不吝点拨：

1.p(x)=P(y=1|x)=exp(F(x))/[exp(F(x))+exp(-F(x))]，P(y=-1|x)=1-p(x)，y*=(y+1)/2，所以P(y*|x)都是F(x)的函数。
2.由于y*只能取0和1，所以对数似然函数可写成log[P(y*|x)]=2y*F(x)-log[1+exp(2F(x))]，
由于是F(x)的函数，所以用F(x)为变量对它的数学期望求导得s(F(x))，用牛顿切线法求满足s(F(x))=0的F(x)，每次迭代更新量为E[y*-p(x)|x]/2E[p(x)((1-p(x))|x]=Ew{[y*-p(x)]/[p(x)(1-p(x)]|x}/2，这个等式只有wi做归一化才能成立，不过它的算法伪代码里没出现。
3.用牛顿切线法求满足s(F(x))=0的F(x)，这种方法是最优的吗？别的方程根的数值解法不知是否有没有人尝试过？
4.拟合fm(x)，即为从所有弱分类器中找出符合对所有样本xi求和([fm(xi)-zi/2]*wi)^2，取得最小值的fm(x)。
5.文中log(y*,p(x))=-log(1+exp(-2yF(x)))，在F=0处二阶泰勒展开，怎么就成了exp(-yF)+log(2)-1?
6.exp(-yF(x))=|y*-p(x)|/sqrt(p(x)(1-p(x)))，右侧怎么与卡方分布挂上钩？卡方分布因该是独立同分布的若干个标准正态平方和的分布，假如y*表示正态分布的话，充其量只能说是对它做了个标准化变换。
除非变成E(exp(-2yF(x)))=E{[|y*-p(x)|/sqrt(p(x)(1-p(x)))]^2}，这样就成了卡方分布了。所以只能说E(exp(-2yF(x)))服从卡方分布。

路呆 · 发表于 2008-10-18 18:46:59

没...能看到那一步已经是我的极限了。这篇文章丢了好久了，让我慢慢看看啊...不一定搞的定的....

其实这篇文章中的很多地方被大牛们批过的，具体见这篇Discussion of the Paper Additive Logistic Regression A Statistical View of Boosting by Jerome Friedman, Trevor Hastie and Robert Tibshirani

sweetjujube · 发表于 2008-10-19 18:10:15

这篇Discussion of the Paper Additive Logistic Regression A Statistical View of Boosting by Jerome Friedman, Trevor Hastie and Robert Tibshirani主要讨论了有关Boosting的overfitting问题，不过两个人好象互相否定对方，Jerome觉得Yoav的bounds “our bounds are very rough and yield bounds that are not useful in practice.” ；而Yoav觉得Jerome的analysis “It is completely unclear whether this analysis can be used to predict the performance of classification rules outside of the training sample.”
但是Yoav认为，Jerome的数据是基于pre-specified class分布，因此在这种类分布中，特征和类标签间的关系可用对数线性函数描述，存在局限性，。但这对某些情况，如人脸检测，这种限定是不是已经足够了？
另外，Yoav关于Jerome对有关错分样本的过分强调的分析，挺有道理。

sweetjujube · 发表于 2008-10-22 00:02:01

请教路呆兄:
在自适应的牛顿法,拟合加性logistic回归模型中：
“2. 通过加权最小二乘的方式拟合函数fm(x).由zi拟合xi,权重为wi”
需要遍历每个特征，针对每个特征都要用fm(xi的该特征)对zi进行逼近拟合？一般怎样拟合？用分段函数？还是一个由“x的该特征值”组成的表达式？
别的几种方式不存在这个问题，如Discrete boost,只需要找出p和threshold就行；Real 和 Gentle boost,只需统计特征值的分段区间内的正负样本就可以。

sweetjujube · 发表于 2008-10-23 21:05:04

logit的 fm(x) =Ew(z|x);
gentle的 fm(x) =Ew(y|x);

		自动登录	找回密码
密码			立即注册