by David A. Burton
I asked NCSU Statistics Prof. Dave Dickey something similar to the following question:
Suppose there is a dataset consisting of a large set of N observations (measurements), call them X1 … Xn.
The overall mean or “grand mean” for the whole dataset (of all N observations) is:
GM = (X1 + X2 + … + Xn) / N
The overall variance or “grand variance” for the whole dataset of N observations is:
GV = ( (X1-GM)² + (X2-GM)² + … + (Xn-GM)² ) / (N-1)
The overall standard deviation or “grand standard deviation” for the whole dataset is the square root of the variance, √GV.
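As a quick illustration, these definitions can be written out directly in Python. This is only a minimal sketch: the function names `grand_mean` and `grand_variance` and the sample observations are made up for illustration and are not part of the original question.

```python
import math
import statistics

def grand_mean(xs):
    """GM: the sum of all observations divided by N."""
    return sum(xs) / len(xs)

def grand_variance(xs):
    """GV: the sum of squared deviations from GM, divided by N-1."""
    gm = grand_mean(xs)
    return sum((x - gm) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical observations, chosen only for illustration.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

gm = grand_mean(xs)
gv = grand_variance(xs)
sd = math.sqrt(gv)  # the "grand standard deviation"

# The standard library's sample statistics use the same N-1 definition.
assert math.isclose(gv, statistics.variance(xs))
assert math.isclose(sd, statistics.stdev(xs))
```

The rest of the question is about getting the same `sd` without access to `xs` itself.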
Now, suppose we partition the set of observations into groups.
For each group we can calculate a mean and standard deviation.
Now, what can we say about the whole dataset of N observations, given only the statistics for each group? (For each group we know the mean, the number of samples, and the standard deviation.)
Calculating GM (the mean for the whole dataset) from the individual group means (and sample counts, if the groups vary in size) is easy.
But is it possible to calculate √GV (the standard deviation for the whole dataset) from the individual group means and standard deviations? How is it done?
For example, perhaps N is 365, and the measurements were taken daily, and the groups are months. There are 12 groups of measurements, with 28 to 31 measurements in each group (month). Given the mean and standard deviation for each month's data, calculating the mean for the whole year's data is simple, but how can we calculate the standard deviation for the whole year's data?
On Fri, Apr 16, 2010 at 10:21 AM, Prof. Dickey replied. This is my heavily edited & annotated version of his reply:
Let G be the number of groups, and i be a group number (i=1,2,…,G).
Let n(i) be the number of observations in group i.
N is the sum of the individual n(i) observation counts, i.e. the total number of observations:
N = n(1) + n(2) + … + n(G)
Let's renumber the observations X1…Xn according to which groups they are in. Let Y(i,j) be observation number j in group i. So Y(1,1) is the first observation in group 1, Y(G,1) is the first observation in the last group, Y(1,n(1)) is the last observation in the first group, and Y(G,n(G)) is the last observation in the last group.
Let Y(i) be the group i mean:
Y(i) = ( Y(i,1) + Y(i,2) + … + Y(i,n(i)) ) / n(i)
Let V(i) be the group i variance:
V(i) = ( (Y(i,1)-Y(i))² + (Y(i,2)-Y(i))² + … + (Y(i,n(i))-Y(i))² ) / (n(i)-1)
The numerator is called the Error Sum of Squares for group i:
ESSG(i) = (Y(i,1)-Y(i))² + (Y(i,2)-Y(i))² + … + (Y(i,n(i))-Y(i))²
GM is the overall mean or “grand mean.” Get it by first summing n(i) times Y(i) over the groups (i=1,2,…,G), then dividing the sum by N:
GM = ( (n(1) · Y(1)) + (n(2) · Y(2)) + … + (n(G) · Y(G)) ) / N
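The weighted-mean formula above can be sketched in Python as follows. The function name `grand_mean_from_groups` and the three small groups are hypothetical, used only to show that the result matches the mean of the pooled observations.

```python
def grand_mean_from_groups(counts, means):
    """GM: sum of n(i) * Y(i) over the groups, divided by N."""
    total_n = sum(counts)
    return sum(n * m for n, m in zip(counts, means)) / total_n

# Hypothetical groups of unequal size.
groups = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0, 5.0, 5.0, 5.0]]
counts = [len(g) for g in groups]          # n(i)
means = [sum(g) / len(g) for g in groups]  # Y(i)

gm = grand_mean_from_groups(counts, means)

# Cross-check against the mean of all observations pooled together.
all_obs = [x for g in groups for x in g]
direct = sum(all_obs) / len(all_obs)
assert abs(gm - direct) < 1e-12

# A plain average of the group means ignores the group sizes and differs.
unweighted = sum(means) / len(means)
assert abs(unweighted - gm) > 1.0
```

Weighting by n(i) is what makes the result independent of how unevenly the groups are sized.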
What we're seeking is the overall variance for the whole dataset, GV:
GV = ( (Y(1,1)-GM)² + (Y(1,2)-GM)² + … + (Y(1,n(1))-GM)²
+ (Y(2,1)-GM)² + (Y(2,2)-GM)² + … + (Y(2,n(2))-GM)²
+ …
+ (Y(G,1)-GM)² + (Y(G,2)-GM)² + … + (Y(G,n(G))-GM)² ) / (N-1)
The “error sum of squares” (call it ESS) for the whole dataset is the sum of all values of Y(i,j)-Y(i) squared:
ESS = (Y(1,1)-Y(1))² + (Y(1,2)-Y(1))² + … + (Y(1,n(1))-Y(1))²
+ (Y(2,1)-Y(2))² + (Y(2,2)-Y(2))² + … + (Y(2,n(2))-Y(2))²
+ …
+ (Y(G,1)-Y(G))² + (Y(G,2)-Y(G))² + … + (Y(G,n(G))-Y(G))²
where Y(i,j) is observation j in group i and Y(i) is the group i mean.
(Notice the similarity between V(i) and each of the lines in the summation for ESS.)
The number you seek is the square root of what is called the Total Mean Square in a so-called “Analysis of Variance.” Here is what you do:
(1) Square the standard deviations within each group to get the variances, V(i).
Multiply each variance by n(i)-1, the so-called “degrees of freedom” for each group, getting what is called the “Error Sum of Squares” within each group, ESSG(i):
ESSG(i) = V(i) · (n(i)-1)
For example, if group 5 has n(5)=16 observations and standard deviation 3, the variance is 3² = 9, so you would compute 9 · (16-1) = 135.
(2) Add these up over all of your groups, getting an overall Error Sum of Squares:
ESS = ESSG(1) + ESSG(2) + … + ESSG(G)
But wait - there's more! Your individual group means are varying around the overall mean GM and we have to take that into account, so....
(3) Compute the deviation Y(i)-GM of each group mean from your overall grand mean GM. Square each one and multiply by its n(i).
For example, if in group 5 you have mean 82 and the overall mean is GM=80, you would compute 16 · 4 = 64, because we had n(5)=16 observations in group 5, and 4 is the square of 2 (i.e. the square of 82-80). This is the “Group Sum of Squares” for group 5.
GSS(i) = (Y(i)-GM)² · n(i)
(4) Sum these group sums of squares over all G of your groups getting the total (overall) group sum of squares:
TGSS = GSS(1) + GSS(2) + … + GSS(G)
(5) Add the overall Error Sum of Squares ESS from step (2) to the overall Group Sum of Squares TGSS from step (4) to get the “Total Sum of Squares.” Now divide this by the “degrees of freedom” N-1 where you recall that N is the total number of observations you have. This is the grand variance you seek:
GV = (ESS + TGSS) / (N-1)
Take the square root of that to get the standard deviation you seek:
composite standard deviation = √GV
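Steps (1) through (5) can be sketched in Python as follows. This is an illustrative sketch, not the downloadable implementation mentioned at the end of this page: the function name `composite_sd`, its argument layout, and the sample groups are all assumptions made for the example.

```python
import math
import statistics

def composite_sd(counts, means, sds):
    """Combine per-group (n, mean, SD) into the SD of the pooled data."""
    total_n = sum(counts)  # N
    gm = sum(n * m for n, m in zip(counts, means)) / total_n  # grand mean

    # Steps (1)-(2): ESSG(i) = V(i) * (n(i)-1), summed over the groups.
    ess = sum(sd ** 2 * (n - 1) for n, sd in zip(counts, sds))

    # Steps (3)-(4): GSS(i) = (Y(i)-GM)^2 * n(i), summed over the groups.
    tgss = sum(n * (m - gm) ** 2 for n, m in zip(counts, means))

    # Step (5): divide the Total Sum of Squares by N-1, then take the root.
    gv = (ess + tgss) / (total_n - 1)
    return math.sqrt(gv)

# Hypothetical groups, used only to produce per-group statistics.
groups = [[3.0, 5.0, 7.0], [10.0, 12.0, 14.0, 16.0], [1.0, 2.0]]
counts = [len(g) for g in groups]
means = [statistics.mean(g) for g in groups]
sds = [statistics.stdev(g) for g in groups]

combined = composite_sd(counts, means, sds)

# Cross-check against the SD computed from the pooled observations.
pooled = [x for g in groups for x in g]
direct = statistics.stdev(pooled)
assert math.isclose(combined, direct)
```

Note that `composite_sd` never sees the raw observations; the pooled data appears here only to check the result.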
To calculate the overall variance, GV, what we need is the sum of all (Y(i,j) - GM)² (which we'll then divide by (N-1)). But what we have is sums of (Y(i,j) - Y(i))² for each group.
So, how do the (Y(i,j) - GM)² terms (which we need) differ from the (Y(i,j) - Y(i))² terms (which we have)?
Let D(i) = Y(i) - GM, the difference between the group mean for group i and the overall dataset mean.
So GM = Y(i) - D(i)
Substitute for GM in one of the terms of the summation in the definition of the overall variance, GV:
(Y(i,j) - GM)² = (Y(i,j) - (Y(i)-D(i)))²
= (Y(i,j) - Y(i) + D(i))²
= ((Y(i,j)-Y(i)) + D(i))²
= (Y(i,j)-Y(i))² + (D(i))² + 2·D(i)·(Y(i,j)-Y(i))
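But since Y(i) is the average of the Y(i,j) in group i, the deviations (Y(i,j)-Y(i)) sum to zero over that group. D(i) is the same for every observation in the group, so when we sum over all observations in group i, the 2·D(i)·(Y(i,j)-Y(i)) terms sum to zero.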
So we can ignore the 2·D(i)·(Y(i,j)-Y(i)) terms in the summation.
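A small numeric check of the vanishing cross terms, as a sketch. The three observations and the grand mean of 10 are hypothetical values picked so the arithmetic is easy to follow by hand.

```python
# Hypothetical observations in one group, and a hypothetical grand mean
# for the larger dataset that the group belongs to.
group = [4.0, 7.0, 13.0]
gm = 10.0

group_mean = sum(group) / len(group)  # Y(i) = 8.0
d = group_mean - gm                   # D(i) = -2.0

# Deviations from the group mean: -4, -1, 5. They always sum to zero.
deviations = [y - group_mean for y in group]

# Cross terms 2*D(i)*(Y(i,j)-Y(i)): 16, 4, -20.
cross_terms = [2 * d * dev for dev in deviations]
cross_sum = sum(cross_terms)

assert abs(sum(deviations)) < 1e-12
assert abs(cross_sum) < 1e-12
```

Because D(i) factors out of the sum, the result is zero whatever value the grand mean takes.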
Thus, the Total Sum of Squares is just the sum of all the (Y(i,j)-Y(i))² and (D(i))² terms, over all observations j in all groups i.
The sum of (Y(i,j)-Y(i))² is the “error sum of squares,” ESS, calculated above.
The sum of (D(i))² for all observations j in group i is simply ((D(i))² · n(i)) = ((Y(i)-GM)² · n(i)) = GSS(i).
So, summing for all groups, GSS(1) + GSS(2) + … + GSS(G) = TGSS.
Thus the summation used in the variance calculation, which is called the Total Sum of Squares, is TSS = ESS + TGSS, and the overall variance is GV = (ESS + TGSS) / (N-1).
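The identity TSS = ESS + TGSS can be checked numerically from raw observations. The following is a sketch with made-up groups (including a one-observation group); the lowercase variable names simply mirror the symbols used in the text.

```python
# Hypothetical groups of observations, for illustration only.
groups = [[1.0, 3.0], [6.0, 8.0, 10.0], [20.0]]

pooled = [y for g in groups for y in g]
total_n = len(pooled)       # N = 6
gm = sum(pooled) / total_n  # GM = 8.0

# Total Sum of Squares: squared deviations from the grand mean.
tss = sum((y - gm) ** 2 for y in pooled)

ess = 0.0   # squared deviations from each group's own mean
tgss = 0.0  # n(i) * (Y(i) - GM)^2 for each group
for g in groups:
    group_mean = sum(g) / len(g)
    ess += sum((y - group_mean) ** 2 for y in g)
    tgss += len(g) * (group_mean - gm) ** 2

gv = (ess + tgss) / (total_n - 1)

assert abs(tss - (ess + tgss)) < 1e-9
```

For these numbers, ESS is 10 and TGSS is 216, which add up to the TSS of 226.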
Note that it doesn't matter how the observations were partitioned into groups. Regardless of whether the observations were assigned to the groups randomly, or in sorted order, or in any other way, the resulting grand variance GV and combined standard deviation √GV will be unaffected.
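To illustrate, here is a sketch that partitions one made-up dataset three different ways and combines the group statistics each time. The helper `composite_sd` and the data are hypothetical. Treating a one-observation group as having zero within-group spread is an assumption of this sketch, since the sample standard deviation of a single value is undefined.

```python
import math
import random
import statistics

def composite_sd(groups):
    """Composite SD computed only from each group's count, mean, and SD."""
    counts = [len(g) for g in groups]
    means = [statistics.mean(g) for g in groups]
    # Assumption: a one-observation group contributes no within-group spread.
    sds = [statistics.stdev(g) if len(g) > 1 else 0.0 for g in groups]
    total_n = sum(counts)
    gm = sum(n * m for n, m in zip(counts, means)) / total_n
    ess = sum(sd ** 2 * (n - 1) for n, sd in zip(counts, sds))
    tgss = sum(n * (m - gm) ** 2 for n, m in zip(counts, means))
    return math.sqrt((ess + tgss) / (total_n - 1))

# Hypothetical observations.
data = [12.0, 15.0, 9.0, 21.0, 18.0, 14.0, 30.0, 11.0, 17.0, 25.0, 8.0, 20.0]

# Partition 1: sorted order, three equal chunks.
ordered = sorted(data)
sorted_chunks = [ordered[0:4], ordered[4:8], ordered[8:12]]

# Partition 2: every third observation.
interleaved = [data[0::3], data[1::3], data[2::3]]

# Partition 3: random assignment (fixed seed) into uneven groups.
shuffled = data[:]
random.Random(0).shuffle(shuffled)
uneven = [shuffled[0:2], shuffled[2:7], shuffled[7:12]]

direct = statistics.stdev(data)
results = [composite_sd(p) for p in (sorted_chunks, interleaved, uneven)]
assert all(math.isclose(r, direct) for r in results)
```

The split between ESS and TGSS changes from one partition to the next, but their sum does not.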
Finally, notice what happens in the two degenerate cases:
On Wed, Apr 21, 2010 at 9:36 AM, Prof. Dickey replied again:
“Indeed you interpreted things correctly and developed the classical analysis of variance breakdown between the variation due to differing group means and the variation within the groups.
“The analysis of variance F test is a ratio of the group to group mean square to the within group mean square to see if the variation between groups is more than you'd expect from a random assignment of observations to groups - in your case you'd be wondering if there were a "month effect" in your data, that is, a seasonal component. Each mean square is the sum of squares you discuss divided by its degrees of freedom. Details are available in most stat books.”
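The F ratio Prof. Dickey describes can be sketched from the same group statistics. The degrees of freedom used here (G-1 between groups, N-G within groups) are the standard one-way analysis of variance values and are not spelled out in his reply. The function name `anova_f` and the sample numbers are hypothetical.

```python
def anova_f(counts, means, sds):
    """Ratio of the between-group mean square to the within-group mean square."""
    total_n = sum(counts)  # N
    num_groups = len(counts)  # G
    gm = sum(n * m for n, m in zip(counts, means)) / total_n

    ess = sum(sd ** 2 * (n - 1) for n, sd in zip(counts, sds))
    tgss = sum(n * (m - gm) ** 2 for n, m in zip(counts, means))

    # Each mean square is a sum of squares divided by its degrees of freedom.
    ms_between = tgss / (num_groups - 1)
    ms_within = ess / (total_n - num_groups)
    return ms_between / ms_within

# Hypothetical statistics for three groups of three observations each,
# e.g. the groups [1, 2, 3], [2, 3, 4], and [5, 6, 7].
counts = [3, 3, 3]
means = [2.0, 3.0, 6.0]
sds = [1.0, 1.0, 1.0]

f_ratio = anova_f(counts, means, sds)
```

Judging whether the ratio is larger than chance would produce requires comparing it with an F distribution on (G-1, N-G) degrees of freedom, which this sketch leaves out.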
I've written implementations of this algorithm in Perl and in Python, and Eric Jennings kindly translated the Python version to PHP.
You can download all three versions here: