Numerical Summaries for Your Data

Definitions

Arithmetic Mean: (a.k.a. mean) the sum of a set of numbers divided by the number of numbers in the set.

Average: see "Central Tendency."

Bell-Shaped Distribution: a distribution shaped like a bell. We will later learn about two very common and useful bell-shaped distributions: the normal distribution and Student's t distribution.

Box-and-Whisker Plot: see "Box Plot."

Box Plot: a graphical presentation of the Five-Number Summary.

Central Tendency: a single value that represents a "typical" value in a data set (a.k.a. what to expect if you look at an observation from that data set). A central tendency is sometimes referred to as an "average."

Coefficient of Correlation: a measure of the direction and strength of the linear relationship between two variables.

Coefficient of Variation: a relative measure of dispersion calculated as the ratio of the standard deviation to the mean.

Covariance: a measure of the direction of the linear relationship between two variables.

Decile: one of the values of a variable that divides the distribution of the variable into ten groups having equal frequencies.

Dispersion: see "Variation."

Empirical Rule: the statement that tells what proportion of the data, approximately, falls within 1, 2, or 3 standard deviations around the mean.

First Quartile: the value of the variable such that one quarter of all the values in the data set are smaller than the first quartile and three quarters of all the values in the data set are greater than the first quartile.

Five-Number Summary: the list of five values: the minimum, the first quartile, the median, the third quartile, and the maximum of the data set.

Gap: an empty numerical class in a distribution (i.e., the class with zero frequency) surrounded by non-empty classes.

Inter-Quartile Range: a measure of dispersion that is equal to the difference between the third and the first quartile.

Kurtosis: a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

Median: the value of the variable such that one half of all the values in the data set are smaller than the median and one half of all the values in the data set are greater than the median. The median is the same value as the second quartile.

Mean: see "Arithmetic Mean."

Mode: the value that appears most often in a data set.

Outlier: an unusual value in a data set, i.e., a value that lies far away from most of the other values.

Percentile: one of the values of a variable that divides the distribution of the variable into one hundred groups having equal frequencies.

Quantile: one of the values of a variable that divides the distribution of the variable into several groups having equal frequencies.

Quartile: one of the values of a variable that divides the distribution of the variable into four groups having equal frequencies.

Quintile: one of the values of a variable that divides the distribution of the variable into five groups having equal frequencies.

Range: a measure of dispersion that is equal to the difference between the maximum and the minimum.

Rectangular Distribution: see "Uniform Distribution."

Resistant Measure: a measure that is not strongly influenced by outliers.

Scatter: see "Variation."

Second Quartile: see "Median."

Shape: the pattern of the distribution of values from the lowest value to the highest value.

Skewed Distribution: asymmetrical distribution. Opposite of symmetrical distribution.

Spread: see "Variation."

Standard Deviation: a measure of dispersion that is equal to the square root of the variance.

Summary Measure: a single number that describes the whole set of data; designed to show as much important information about a data set as possible in as simple a form as possible.

Symmetrical Distribution: a distribution in which the frequencies of the classes to the left of the mean are the same as the frequencies of the classes to the right of the mean (for each pair of left and right classes that are the same distance from the mean). Opposite of skewed distribution.

Third Quartile: the value of the variable such that three quarters of all the values in the data set are smaller than the third quartile and one quarter of all the values in the data set are greater than the third quartile.

Uniform Distribution: the distribution with equal frequencies across the classes (a.k.a. a rectangular distribution).

Variability: see "Variation."

Variance: a measure of dispersion that is calculated as the average (mean) of the squared deviations of the values in a data set from their mean.

Variation: (a.k.a. dispersion, or variability, or scatter, or spread) a measure of how much the individual values in a data set differ from each other.

Z-Score: the difference between the value and the mean, divided by the standard deviation.
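
Several of the definitions above can be sketched in a few lines of Python. The data set below is made up for illustration, and the variable names are arbitrary. Note one assumption: statistics.pvariance and statistics.pstdev divide by the number of values, which matches the definition of variance given above; the sample versions (statistics.variance and statistics.stdev) divide by one less than the number of values and give slightly larger results.

```python
import statistics

# Made-up data set, used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(data)            # arithmetic mean
variance = statistics.pvariance(data)    # mean of the squared deviations from the mean
std_dev = statistics.pstdev(data)        # square root of the variance
coef_var = std_dev / mean                # coefficient of variation (relative dispersion)
data_range = max(data) - min(data)       # range: maximum minus minimum

# Z-score of each value: (value - mean) / standard deviation.
z_scores = [(x - mean) / std_dev for x in data]

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std_dev)
print("coefficient of variation:", coef_var)
print("range:", data_range)
print("z-scores:", z_scores)
```

A z-score of 2.0 for the value 9 says that 9 lies two standard deviations above the mean of this data set.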

These are the illustrations I have used in class:

illustration of the mean, median, and mode

Figure 0040.040. An illustration of the mean, median, and mode for a small data set.

the mean and median income over time

Figure 0040.050. Here, the mean income is growing faster than the median income. You should think, "What does this development imply about the changing shape of the income distribution?"

symmetrical distribution

Figure 0040.060. This is a symmetrical distribution.

a distribution with a smaller range

Figure 0040.070. This distribution has a smaller range than the one above (Figure 0040.060).

a distribution with gaps

Figure 0040.080. This distribution has got a gap at the class with midpoint 26. Also, you could think about the class with midpoint 11 as a gap.

a distribution with two peaks

Figure 0040.090. This distribution has got two peaks.

a symmetrical distribution with two peaks

Figure 0040.100. This distribution also has got two peaks.

a distribution with an outlier

Figure 0040.110. This distribution has got an outlier at 39.

an obviously bell-shaped distribution

Figure 0040.120. This distribution is bell-shaped.

a bell-shaped distribution with few classes

Figure 0040.130. This distribution is approximately bell-shaped, but it is harder to see because there are few classes. For a tip, see the next graph (Figure 0040.140).

a bell-shaped distribution with few classes and a polygon that helps to see the shape

Figure 0040.140. This is the same distribution as above (Figure 0040.130), but I have added a polygon that connects the class midpoints. This polygon does suggest that the distribution is approximately bell-shaped, and it is easier to see than in Figure 0040.130.

a skewed distribution

Figure 0040.150. This distribution is skewed (to the right).

a uniform distribution

Figure 0040.160. This distribution is uniform. You can also see why a uniform distribution is sometimes called a rectangular distribution.

a distribution that students should try and describe

Figure 0040.170. Can you describe this distribution's shape?

Notes

Numerical summaries (a.k.a. summary statistics) are a collection of measures that try to describe as much as possible about the data set in as few numbers/words as possible. Obviously, trying to achieve both of these objectives simultaneously is going to make your head hurt; they conflict with each other, and the conflict results in trade-offs: "Should I keep some important details and necessarily use more numbers/words to describe the set, or should I use very few numbers/words that are easy to understand/remember and give up some of those details?" The answer is, "It depends." It depends on how important the details are, how informative the summary numbers are, how simple the data set is, how statistically savvy your audience is, and many other things. In other words, do not expect to just learn a fixed set of steps for summarizing data sets. Rather, try to understand what the summaries do, and then use this understanding in your projects. You will have to make decisions about those trade-offs every time you work with data.

Luckily, even though there is no general rule for summarizing every data set, some summaries work very well for most data sets and are used almost universally. These are the most common measures of central tendency, dispersion, distribution shape, statistical dependence, and special features of a distribution.

A central tendency is a "typical" value for a variable in the data set. What is meant by "typical" is open to interpretation, so we have several possible measures of central tendency (I could name about 12; we will study the 3 most common ones). Which one is most appropriate depends on the questions asked in a study, the characteristics of the data set, the source of the data, etc. I suggest that you think about a central tendency as the value that you expect to encounter in a data set. When you actually see a datum, it may be different from what you expected. A good central tendency measure should try to somehow minimize the errors in your expectations.

The arithmetic mean minimizes the sum of the squared differences between the values in a data set and the mean. These differences are the errors we would make if we tried to predict a value from the data set by using the mean. When they are squared, larger errors gain disproportionately more weight in constructing the measure. Translation: Think about the mean as an expected value of a datum. This expected value is formed in such a way as to avoid large errors (and worry less about small errors).
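
The claim about the mean can be checked numerically. The following is a minimal sketch, assuming a small made-up data set; the function name sum_squared_errors and the grid of candidate guesses are invented for illustration. It tries many candidate "expected values" and keeps the one with the smallest sum of squared errors.

```python
def sum_squared_errors(data, guess):
    """Sum of the squared differences between each value and a single guess."""
    return sum((x - guess) ** 2 for x in data)

data = [1, 2, 3, 4, 10]            # made-up values; 10 sits far from the rest
mean = sum(data) / len(data)

# Try guesses 0.0, 0.1, ..., 12.0 and keep the one with the smallest total.
candidates = [i / 10 for i in range(0, 121)]
best_guess = min(candidates, key=lambda g: sum_squared_errors(data, g))

print("mean:", mean)
print("guess with the smallest sum of squared errors:", best_guess)
```

On this grid the best guess coincides with the mean. The value 10 pulls the mean upward because its error counts heavily once it is squared.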

The median, unlike the mean, treats all errors (small or large) equally. It minimizes the sum of the absolute errors. Translation: Think about the median as an expected value of a datum. This expected value is formed so that (an error of) expecting a value above the actual datum is just as likely as (an error of) expecting a value below the actual datum.
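
The same kind of numerical check works for the median. This is a sketch under the same assumptions as before: the data are made up, and sum_absolute_errors is a hypothetical helper name. It also illustrates why the median is a resistant measure: stretching the largest value leaves the median where it was but moves the mean.

```python
import statistics

def sum_absolute_errors(data, guess):
    """Sum of the absolute differences between each value and a single guess."""
    return sum(abs(x - guess) for x in data)

data = [1, 2, 3, 4, 10]            # made-up values
median = statistics.median(data)

# Try guesses 0.0, 0.1, ..., 12.0 and keep the one with the smallest total.
candidates = [i / 10 for i in range(0, 121)]
best_guess = min(candidates, key=lambda g: sum_absolute_errors(data, g))

# Guessing the median is as likely to overshoot as to undershoot.
below = sum(1 for x in data if x < median)
above = sum(1 for x in data if x > median)

# Replace 10 with 100: the median stays put, the mean does not.
stretched = [1, 2, 3, 4, 100]

print("median:", median, "best guess:", best_guess)
print("values below:", below, "values above:", above)
print("stretched median:", statistics.median(stretched))
print("stretched mean:", statistics.fmean(stretched))
```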

The mode minimizes the number of errors, no matter what their magnitudes are. Translation: Think about the mode as an expected value of a datum. This expected value is formed in such a way as to make the likelihood of observing exactly the expected value the largest possible.
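
For the mode, the "error" is simply a miss: the datum is either exactly equal to the guess or it is not. A minimal sketch, assuming made-up data and a hypothetical helper called misses, counts how often each of the three measures fails to match a value in the data set exactly.

```python
import statistics

def misses(data, guess):
    """Number of values in the data set that differ from the guess."""
    return sum(1 for x in data if x != guess)

data = [2, 2, 2, 4, 6, 7, 12]      # made-up values

mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.fmean(data)

for name, guess in [("mode", mode), ("median", median), ("mean", mean)]:
    print(name, guess, "misses:", misses(data, guess))
```

In this data set the mean does not match any value at all, the median matches one, and the mode matches three, so the mode has the fewest misses.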

When you build a frequency distribution, you divide the set of data into classes of equal width and compare their frequencies. With quantiles, you do everything "contrariwise": you divide the set of data into groups of equal frequencies and compare their widths. You may also divide the set of data into groups of equal frequencies and compare their midpoints or averages.
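
The contrast between the two approaches can be sketched as follows. The data are made up, and the grouping is deliberately simplified: it just cuts the sorted list into four equal-sized pieces, whereas real quartile calculations usually interpolate the cut points.

```python
data = sorted([1, 2, 2, 3, 3, 4, 5, 6, 8, 11, 15, 23])   # made-up, skewed to the right

# Frequency distribution: classes of equal width (6); compare the frequencies.
width = 6
class_freqs = [0, 0, 0, 0]          # classes [0, 6), [6, 12), [12, 18), [18, 24)
for x in data:
    class_freqs[x // width] += 1

# Quantile-style grouping: groups of equal frequency (3); compare the widths.
size = len(data) // 4
groups = [data[i:i + size] for i in range(0, len(data), size)]
group_widths = [g[-1] - g[0] for g in groups]
group_means = [sum(g) / len(g) for g in groups]

print("equal-width class frequencies:", class_freqs)
print("equal-frequency groups:", groups)
print("group widths:", group_widths)
print("group means:", group_means)
```

Both views tell the same story about the right skew: the frequencies fall off quickly across the equal-width classes, and the last equal-frequency group is much wider than the first.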

The empirical rule comes from the normal distribution (we will study that distribution later in the course).
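
The percentages in the empirical rule can be reproduced from the normal distribution directly. This sketch relies on a standard fact: for a normal distribution, the proportion of values within k standard deviations of the mean equals erf(k / sqrt(2)), where erf is the error function. The function name share_within is invented for illustration.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
print(shares)   # {1: 0.6827, 2: 0.9545, 3: 0.9973}
```

These are the familiar 68%, 95%, and 99.7% figures. For data that are not approximately bell-shaped, the actual proportions can differ noticeably.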

The covariance does not tell you how strong the relationship between two variables is, because its size depends on the units in which the variables are measured. The coefficient of correlation does tell you how strong the relationship between two variables is.
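
A short sketch shows why the covariance cannot measure strength. The data and the function names are made up for illustration, and the covariance here divides by n - 1 (the sample version), which is an assumption; dividing by n would change the covariance but not the correlation. Re-expressing x in cents instead of dollars multiplies the covariance by 100, while the coefficient of correlation stays the same.

```python
import math

def covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Coefficient of correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x_dollars = [1, 2, 3, 4, 5]                 # made-up values
y = [2, 4, 5, 4, 5]
x_cents = [100 * v for v in x_dollars]      # same data, different units

cov_xy = covariance(x_dollars, y)
cov_cents = covariance(x_cents, y)
r = correlation(x_dollars, y)
r_cents = correlation(x_cents, y)

print("covariance (dollars):", cov_xy, " covariance (cents):", cov_cents)
print("correlation (dollars):", r, " correlation (cents):", r_cents)
```

The sign of the covariance (positive here) still gives the direction of the relationship. Only the correlation, which always lies between -1 and 1, can be compared across data sets.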

Read These

Chapter 3. Numerical Descriptive Measures in the textbook:

3.1 Central Tendency (pp. 102-106)

3.2 Variation and Shape (pp. 107-116)

3.3 Exploring Numerical Data (pp. 120-125)

3.4 Numerical Descriptive Measures for a Population (pp. 127-130)

3.5 The Covariance and the Coefficient of Correlation (pp. 131-135)

3.6 Descriptive Statistics: Pitfalls and Ethical Issues (pp. 137-138)

Watch This

Figure 0040.180. Maths Tutorial: Describing Statistical Distributions (Part 1 of 2).

Figure 0040.190. Maths Tutorial: Describing Statistical Distributions (Part 2 of 2).

Figure 0040.200. Maths Tutorial: Stats - the 68-95-99.7% Rule (Part 1 of 2).

Figure 0040.210. Maths Tutorial: Stats - the 68-95-99.7% Rule (Part 2 of 2).

Answer These

Describe the distribution in Figure 0040.150.

Describe the distribution in Figure 0040.170.

Describe the distribution in Figure 0040.080.

Does the scatter plot below indicate that an increase in one variable in the graph would increase the other variable? Explain.

a scatter plot with positive correlation

The table below lists the property taxes for the US states (and the District of Columbia [D.C.]). Use it to answer the following:

Compute the mean.

Compute the median.

Compute the quartiles.

Compute the range.

Compute the IQR.

Compute the variance.

Compute the standard deviation.

Compute the coefficient of variation.

Construct the boxplot.

What do you learn from the boxplot?

Is there an outlier? Explain.

State Property Taxes Per Capita ($)
Alabama 506
Alaska 1714
Arizona 1071
Arkansas 548
California 1458
Colorado 1253
Connecticut 2498
Delaware 714
D.C. 2985
Florida 1593
Georgia 1062
Hawaii 1016
Idaho 812
Illinois 1763
Indiana 1127
Iowa 1312
Kansas 1354
Kentucky 662
Louisiana 698
Maine 1655
Maryland 1206
Massachusetts 1845
Michigan 1445
Minnesota 1345
Mississippi 794
Missouri 922
Montana 1308
Nebraska 1443
Nevada 1331
New Hampshire 2424
New Jersey 2671
New Mexico 611
New York 2105
North Carolina 867
North Dakota 1191
Ohio 1133
Oklahoma 598
Oregon 1161
Pennsylvania 1230
Rhode Island 2020
South Carolina 970
South Dakota 1098
Tennessee 746
Texas 1461
Utah 834
Vermont 2065
Virginia 1430
Washington 1217
West Virginia 718
Wisconsin 1633
Wyoming 2321