Estimates of the Center
The Sample Mean
- The sample mean $\bar{x}$ is defined as
$\bar{x} = \dfrac{x_1 + \cdots + x_n}{n}$
- For a dataset that has a bell-shaped histogram, the sample mean is the most
efficient estimate of the center of the histogram.
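As a quick check of the definition, the following R sketch computes the sample mean by hand and with the built-in mean(). The data vector is made up for illustration.

```r
# Made-up data for illustration only
x <- c(2, 4, 4, 5, 10)

# Sample mean from the definition: (x1 + ... + xn) / n
xbar_by_hand <- sum(x) / length(x)

# Built-in equivalent
xbar <- mean(x)

print(c(by_hand = xbar_by_hand, builtin = xbar))
```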
The Median
- For a dataset that has a skewed histogram (for example, one with
a long right tail):
  - The sample mean $\bar{x}$ is pulled in the direction of the
long tail, so the median Q2 (the second quartile) better represents the "center" of the histogram.
  - The sample mean is more influenced by outliers than the median is.
- The median Q2 is also a better estimate of the histogram center if the histogram has thick tails
(outliers on both sides).
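The effect of a single outlier on the two statistics can be seen in a small R sketch. The numbers are invented for illustration and are not from any dataset in these notes.

```r
# Invented data: five values, then the same values plus one large outlier
x     <- c(1, 2, 3, 4, 5)
x_out <- c(x, 100)

# Without the outlier the mean and median agree
print(c(mean = mean(x), median = median(x)))

# With the outlier the mean is pulled toward the long right tail,
# while the median barely moves
print(c(mean = mean(x_out), median = median(x_out)))
```

Here the mean moves from 3 to about 19.17, while the median moves only from 3 to 3.5.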
Other Measures of Central Tendency
- A third statistic that has been proposed
(in addition to the mean and median) to estimate the center of a
dataset is the 5%-trimmed mean: throw out the bottom 2.5% and top 2.5%
of the observations, then compute the sample mean of the remaining
observations.
- The median and the 5%-trimmed mean are called resistant or robust
statistics.
- Resistant means that the value of the statistic is not affected very much by outliers.
- If fewer than 2.5% of the observations are outliers on the left and fewer than
2.5% are outliers on the right, then the trimmed mean is more efficient for
estimating the center of the histogram than the mean or the median.
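A minimal R sketch of the 5%-trimmed mean follows. It assumes R's convention that the trim argument of mean() is the fraction removed from each end, so trim = 0.025 corresponds to the 5%-trimmed mean described above. The data are invented: 38 ordinary values plus one outlier on each side, so that 2.5% of the 40 observations is exactly one observation per tail.

```r
# Invented data: 1, 2, ..., 38 plus one low and one high outlier (n = 40)
x <- c(1:38, -50, 500)
n <- length(x)

# Trimmed mean by hand: drop k observations from each end of the sorted data
k <- 1                                   # 2.5% of 40 observations
by_hand <- mean(sort(x)[(k + 1):(n - k)])

# Built-in version: trim is the fraction dropped from EACH end
trimmed <- mean(x, trim = 0.025)

print(c(mean = mean(x), median = median(x),
        trimmed = trimmed, by_hand = by_hand))
```

With these values the ordinary mean is pulled up by the high outlier, while the trimmed mean and the median both stay at 19.5.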
Normal Histograms
- Many histograms of real data are normal or bell shaped. Here is the standard normal curve:
The bell-shaped curve is symmetric around its center.
If we disregard the two extreme outliers,
the histogram of the NBS-10 data is roughly bell-shaped.
- Use R to do the following with the NBS-10 data (file nist-10.txt):
- Find the dataset mean.
- Graph the histogram and the boxplot.
- Delete the outliers according to the boxplot.
- Plot a histogram with superimposed normal curve.
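One possible way to carry out these steps is sketched below. The layout of nist-10.txt is not shown in these notes, so the scan() call is only an assumption (one numeric value per whitespace-separated field) and is left commented out. Simulated stand-in data with two planted extreme values are used so that the sketch runs on its own; they are not the NBS-10 measurements.

```r
# Assumption: the file holds plain numeric values separated by whitespace
# x <- scan("nist-10.txt")

# Simulated stand-in data (NOT the NBS-10 data): 98 normal values
# plus two planted extreme outliers
set.seed(1)
x <- c(rnorm(98), -10, 10)

# 1. Dataset mean
print(mean(x))

# 2. Histogram and boxplot
hist(x, main = "Histogram of x")
bp <- boxplot(x, main = "Boxplot of x")

# 3. Delete the points that the boxplot flags as outliers
#    (points beyond the whiskers are returned in bp$out)
x_clean <- x[!(x %in% bp$out)]

# 4. Density-scale histogram with a superimposed normal curve that uses
#    the mean and standard deviation of the cleaned data
hist(x_clean, freq = FALSE, main = "Outliers removed")
curve(dnorm(x, mean = mean(x_clean), sd = sd(x_clean)), add = TRUE)
```

The histogram is drawn with freq = FALSE so that its vertical axis is on the density scale, which is what makes the normal curve comparable to the bars.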
- If a histogram is bell shaped, it can be parsimoniously described
by its center and spread.
The center is the location of its axis of symmetry.
The spread is the distance between the center and one of its inflection points.
- Here is a bell-shaped histogram with its
inflection points marked.
- Here is the histogram of some times between eruptions of the
Old Faithful Geyser in minutes:
- This histogram is not bell-shaped, so the center and spread are
not a good summary of the data.
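A histogram of this kind can be drawn in R with the built-in faithful data frame, whose waiting column records waiting times between eruptions in minutes. Whether it is the same dataset as the one pictured in these notes is an assumption. The dashed and dotted lines mark the mean and the median, so their position relative to the bulk of the data can be inspected.

```r
# Assumption: R's built-in `faithful` data stand in for the data in the notes
waiting <- faithful$waiting          # minutes between eruptions

hist(waiting, breaks = 20,
     main = "Old Faithful waiting times", xlab = "Minutes")

# Mark the two estimates of the center
abline(v = mean(waiting),   lty = 2) # dashed: sample mean
abline(v = median(waiting), lty = 3) # dotted: median
```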
- Here are some histograms and the terms used to describe them:
- The right-skewed and J-shaped histograms have
long right tails.