快速方差更新算法:Welford Online算法
方差Two Pass算法
最近在数据处理中验证数据时发现细小的理论偏差,排查后发现是数据的方差更新方法出现了错误,起源是想优化这么一组函数:
1
2
3
4
5
6
7
8
9
10
11
12auto calculateVariance = [](const std::deque<int>& data){
if(data.empty()){
return 0;
}
int mean = std::accumulate(data.begin(), data.end(), 0)/data.size();
double sum_sq_diff = 0.0f;
for(const auto& pos : data){
int diff = pos - mean;
sum_sq_diff += diff*diff;
}
return (int)(sum_sq_diff/data.size());
};1
2
3
4
5
6
7
8
9
10
11
12
13
14//这是错误的方法:
auto calculateVariance = [&sum, &buffer, &sum_variance](int value){
sum += value;
int size = buffer.size() + 1;
int mean = sum/size;
int diff = value-mean;
sum_variance += diff*diff; //更新错误,没有重算每组均值diff
int var = (int)(sum_variance/size);
if(var > thresh){
sum -= value;
sum_variance -= diff*diff;
}
return var;
};
方差Naive方法
从概率论二次项原理原理可知,方差计算与期望、平方数期望存在数学关系,为:
Welford Online算法
无论从性能还是数值稳定性入手,Welford在线算法都是最被推荐的方差更新算法,尤其适合这种需要动态输入更新方差的场合。
其数学原理结论给出了递推形式的均值和方差计算:
其推导也比较简单:
均值有:
对方差有:
由均值公式:
C++实现
根据公式,方差更新的方法就出来了: 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16auto calculateVariance = [&average,&buffer, &sum_variance](int value){
int size = buffer.size() + 1; //n+1
int diff = value - average; //X_{n+1} - X_{n}_bar
average += (double)diff/size;
int diff_new = value - average; //X_{n+1} - X_{n+1}_bar
sum_variance += diff*diff_new; //加前为(n)σ_{n},加后为(n+1)σ_{n+1}
int var = (int)(sum_variance/size); //σ_{n+1}
if(var > 10){ //回退:由σ_{n+1} 计算 σ_{n}
int diff_new = value - average;
average = (size*average - value)/(size - 1);
int diff = value - average;
sum_variance -= diff_new * diff;
}
return var;
};
参考链接:

