Information Scientist and Professor of Electrical Engineering and Law, University of Southern California; Author, Noise, Fuzzy Thinking

I have changed my mind about using the sample mean as the best way to combine measurements into a single predictive value.  Sometimes it is the best way to combine data but in general you do not know that in advance.  So it is not the one number from or about a data set that I would want to know in the face of total uncertainty if my life depended on the predicted outcome.

Using the sample mean always seemed like the natural thing to do.  Just add up the numerical data and divide by the number of data points.  I do not recall ever doubting that procedure until my college years.  Even then I kept running into the mean in science classes and even in philosophy classes where the discussion of ethics sometimes revolved around Aristotle's theory of the "golden mean." There were occasional mentions of medians and modes and other measures of central tendency but they were only occasional.

The sample mean also kept emerging as the optimal way to combine data in many formal settings.  At least it did given what appeared to be the reasonable criterion of minimizing the squared errors of the observations.  The sample mean falls out from just one quick application of the differential calculus.  So the sample mean had on its side not only mathematical proof but also the resulting prominence of appearing in hundreds if not thousands of textbooks and journal articles.  It was and remains the evidentiary workhorse of modern applied science and engineering.  The sample mean summarizes test scores and gets plotted in trend lines and centers confidence intervals among numerous other applications.

Then I ran into the counter-example of Cauchy data.  These data come from bell curves with tails just slightly thicker than those of the familiar "normal" bell curve.  Cauchy bell curves still describe "normal" events that correspond to the main bell of the curve.  But their thicker tails allow for many more "outliers" or rare events.  And Cauchy bell curves arise in a variety of real and theoretical cases.  The counter-example is that the sample mean of Cauchy data does not improve no matter how many samples you combine.  This contrasts with the usual result from sampling theory that the variance of the sample mean falls with each new measurement and hence predictive accuracy improves with sample size (assuming that the square-based variance term measures dispersion and that such a mathematical construct produces a finite value — which it need not produce in general, and which the Cauchy curve in fact does not produce since it has no finite variance or even a finite mean).  The sample mean of ten thousand Cauchy data points has no more predictive power than does the sample mean of ten such data points.  Indeed the sample mean of Cauchy data has no more predictive power than does any one of the data points picked at random.  This counter-example is but one of the anomalous effects that arise from averaging data from many real-world probability curves that deviate from the normal bell curve or from the twenty or so other closed-form probability curves that have found their way into the literature in the last century.
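A quick simulation makes the anomaly concrete.  The sketch below (Python with NumPy; the sample sizes, trial count and interquartile-range yardstick are my own arbitrary choices) repeatedly draws samples and measures how tightly the resulting sample means cluster.  The normal spread shrinks as the sample grows; the Cauchy spread does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mean_spread(draw, n, trials=1000):
    """Interquartile range of the sample mean across many repeated trials."""
    means = np.array([draw(n).mean() for _ in range(trials)])
    q75, q25 = np.percentile(means, [75, 25])
    return q75 - q25

normal = rng.standard_normal
cauchy = rng.standard_cauchy

for n in (10, 10_000):
    print(f"n={n:6d}  normal spread={sample_mean_spread(normal, n):.4f}"
          f"  cauchy spread={sample_mean_spread(cauchy, n):.4f}")
```

The underlying fact is that the sample mean of independent Cauchy data points has exactly the same Cauchy probability curve as a single data point, so no sample size helps.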

Nor have scientists always used the sample mean.  Historians of mathematics have pointed to the late sixteenth century and the introduction of the decimal system as the start of the modern practice of computing the sample mean of data sets to estimate typical parameters.  Before then the mean apparently meant the arithmetic average of just two numbers as it did with Aristotle.  So Hernan Cortes may well have had a clear idea about the typical height of an adult male Aztec in the early sixteenth century.  But he quite likely did not arrive at his estimate of the typical height by adding measured heights of Aztec males and then dividing by the number added.  We have no reason to believe that Cortes would have resorted to such a computation if the Church or King Charles had pressed him to justify his estimate.  He might just as well have lined up a large number of Aztec adult males from shortest to tallest and then reported the height of the one in the middle.

There was a related and deeper problem with the sample mean:  It is not robust.  Extremely small or large values distort it.  This rotten-apple property stems from working not with measurement errors but with squared errors.  The squaring operation exaggerates extreme data even though it greatly simplifies the calculus when one searches for the estimate that minimizes the observed errors.  Minimizing the total squared error yields the sample mean.  The statistical surprise of sorts is that minimizing the raw or absolute error of the data instead yields the sample median as the optimal estimate.
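Both optimality facts are easy to check numerically.  The sketch below (Python with NumPy; the data values and the search grid are invented for illustration) scans a grid of candidate estimates and finds the minimizer of each error criterion:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one "rotten apple" at the end

# Scan a fine grid of candidate estimates under each error criterion.
candidates = np.linspace(0.0, 110.0, 110_001)
diffs = data[None, :] - candidates[:, None]
best_sq = candidates[(diffs ** 2).sum(axis=1).argmin()]
best_abs = candidates[np.abs(diffs).sum(axis=1).argmin()]

print(best_sq, data.mean())        # squared error picks ~22.0, the sample mean
print(best_abs, np.median(data))   # absolute error picks ~3.0, the sample median
```

Note how the single extreme value drags the squared-error estimate all the way to 22 while the absolute-error estimate sits at 3 among the bulk of the data.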

The sample median is robust against outliers.  If you throw away the largest and smallest values in a data set then the median does not change but the sample mean does (averaging what remains after such trimming gives the more robust "trimmed" mean used to combine judging scores in figure skating and elsewhere to remove judging bias).  Realtors have long stated typical housing prices as sample medians rather than sample means because a few mansions can so easily skew the sample mean.  The sample median would not change even if the price of the most expensive house rose to infinity.  The median would still be the price of the middle-ranked house if the number of houses were odd.  But this robustness is not a free lunch.  It comes at the cost of ignoring some of the information in the numerical magnitudes of the data and it has its own complexities for multidimensional data.
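The housing example can be checked directly (a sketch in Python; the prices are invented, in thousands of dollars):

```python
import numpy as np

# Invented prices in thousands of dollars; the last one is the mansion.
prices = np.array([250.0, 270.0, 300.0, 320.0, 350.0, 5000.0])

def trimmed_mean(x, k=1):
    """Drop the k smallest and k largest values, then average the rest."""
    s = np.sort(x)
    return s[k:len(s) - k].mean()

print(np.mean(prices))       # ~1081.67: dragged far upward by the mansion
print(np.median(prices))     # 310.0: barely notices it
print(trimmed_mean(prices))  # 310.0: robust once the extremes are dropped

mansion_to_the_moon = prices.copy()
mansion_to_the_moon[-1] = 1e15          # send its price toward infinity
print(np.median(mansion_to_the_moon))   # still 310.0
```

The median reports what the typical house costs; the mean reports what you would pay per house if you bought the whole neighborhood, mansion included.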

Other evidence pointed to using the sample median rather than the sample mean.  Statisticians have computed the so-called breakdown point of these and other statistical measures of central tendency.  The breakdown point measures the largest proportion of data outliers that a statistic can endure before it breaks down in the formal sense of producing very large deviations.  The sample median achieves the theoretical maximum breakdown point.  The sample mean does not come close.  The sample median also turns out to be the optimal estimate for certain types of data (such as Laplacian data) found in many problems of image processing and elsewhere — if the criterion is maximizing the probability or likelihood of the observed data.  And the sample median can also center confidence intervals.  So it too gives rise to hypothesis tests and does so while making fewer assumptions about the data than the sample mean often requires for the same task.
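The breakdown point can be watched in action.  The sketch below (Python with NumPy; the corruption fractions and the huge stand-in value are arbitrary choices) replaces a growing fraction of a clean data set with one wild value.  The mean is ruined by even a ten-percent corruption while the median drifts but stays bounded until nearly half the data go bad, which is the theoretical maximum breakdown point:

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.standard_normal(1001)        # true center is 0

def corrupt(x, frac, bad=1e9):
    """Replace the given fraction of the data with one wild outlier value."""
    y = x.copy()
    y[: int(frac * len(y))] = bad
    return y

for frac in (0.1, 0.4, 0.49):
    y = corrupt(clean, frac)
    print(f"{frac:4.2f} corrupted:  mean={np.mean(y):.3e}  median={np.median(y):.3f}")
```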

The clincher was the increasing use of adaptive or neural-type algorithms in engineering and especially in signal processing.  These algorithms cancel echoes and noise on phone lines as well as steer antennas and dampen vibrations in control systems.  The whole point of using an adaptive algorithm is that the engineer cannot reasonably foresee all the statistical patterns of noise and signals that will bombard the system over its lifetime.  No type of lifetime average will give the kind of performance that real-time adaptation will give if the adaptive algorithm is sufficiently sensitive and responsive to its measured environment.  The trouble is that most of the standard adaptive algorithms derive from the same old non-robust assumption of minimizing squared errors and thus they rely on sample means or related non-robust quantities.  So real-world gusts of data wind tend to destabilize them.  That is a high price to pay merely because squared errors make nineteenth-century calculus computations easy and because such easy computations still hold sway in so much of the engineering curriculum.  It is an unreasonably high price to pay in the many cases where a comparable robust median-based system or its kin avoids such destabilization, performs similarly in good data weather and does so at only a slightly higher computational cost.  There is a growing trend toward using robust algorithms.  But engineers have still launched thousands of these non-robust adaptive systems into the stream of commerce in recent years.  We do not know whether the social costs of using these non-robust algorithms are negligible or substantial.
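The contrast shows up in even the simplest adaptive filter.  The sketch below (Python with NumPy; the unknown filter taps, step size and impulse schedule are all invented for illustration) compares the classic least-mean-square update, which follows the gradient of the squared error, against the sign-error variant, which follows the gradient of the absolute error and so inherits the median-style robustness:

```python
import numpy as np

rng = np.random.default_rng(2)

# System identification: learn the taps of an unknown FIR filter from
# input/output data corrupted by occasional large noise impulses.
true_w = np.array([0.5, -0.3, 0.2, 0.1])
n_steps = 5000
x = rng.standard_normal(n_steps + len(true_w))
noise = np.zeros(n_steps)
noise[::50] = 50.0                       # rare but violent gusts of noise

def adapt(step_fn, mu=0.01):
    """Generic stochastic-gradient loop; step_fn shapes the error feedback."""
    w = np.zeros(len(true_w))
    for i in range(n_steps):
        u = x[i:i + len(true_w)][::-1]           # newest-first input window
        e = (true_w @ u + noise[i]) - (w @ u)    # prediction error
        w = w + mu * step_fn(e) * u
    return w

w_lms = adapt(lambda e: e)   # classic LMS: gradient of the squared error
w_sign = adapt(np.sign)      # sign-error LMS: gradient of the absolute error

print(np.linalg.norm(w_lms - true_w))   # knocked about by every impulse
print(np.linalg.norm(w_sign - true_w))  # settles close to the true taps
```

Each noise impulse kicks the LMS weights in proportion to the full error, while the sign algorithm caps every kick at the same small step, so the gusts pass through without destabilizing it.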

So if under total uncertainty I had to pick a predictive number from a set of measured data and if my life depended on it — I would now pick the median.