Statistics 201 Introduction
Goals:
- Introduction to statistics
- Why study statistics?
- What is statistics?
- Data distributions
- Displaying distributions with graphs
Why Study Statistics?
- 1. Which of four calorie-reduced diets lowers long-term cholesterol
levels best?
- 2. What effect do differences in the U.S. and Canadian health care
systems have on how patients are treated for low back pain?
- 3. How likely is someone to require reoperation after spinal fusion
surgery? How does this reoperation rate compare to other types of spine
surgery?
- 4. What are the genetic and environmental factors that affect type I
(insulin dependent) diabetes?
- 5. Can we characterize the genetic diversity of HIV within an individual
at the time of transmission?
See also the research interests of the other faculty members
in the Statistics group at SFU.
Homework 1, Exercise 1:
What is your major? Give an example of a question in your field that can,
or has been, answered by collecting and analysing data.
What is Statistics?
Statistics is a data science.
- 1. Data analysis - Organizing and describing data using graphs and
numerical summaries.
- 2. Data production - Selecting samples, designing experiments,
collecting data.
- 3. Statistical inference - Using the data to draw conclusions about
some wider universe.
Data and Data Distributions
Problem: Characterize differences in the Canadian and U.S.
health care systems.
Are lengths of stay in the hospital longer in Canada than the U.S.?
Individuals: the objects described by a set of data.
E.g. individuals hospitalized for low back pain in Ontario or
Washington state in 1994.
Variables: the characteristics of the individuals that are of
interest (e.g. length of stay, age, gender, ``county'' of
residence).
Categorical Variable: one that indicates which of several
groups or categories an individual belongs to (e.g. gender, county).
Quantitative Variable: one that takes numerical values for
which it makes sense to do arithmetic (e.g. age, length of stay).
Displaying Distributions of Categorical Variables
Example: marital status for Americans age 18 and over (Figure 1.1 of
Moore, page 7).
Displaying Distributions of Quantitative Variables
Example: Senior citizens data from U.S. states (Figure 1.2 of Moore, page
10).
Patterns in Distributions
Symmetric distribution:
Symmetric distribution with outliers (e.g. Senior citizens by state -
Figure 1.2 of Moore):
Patterns in Distributions cont.
Distribution skewed right (e.g. income data):
Distribution skewed left (e.g. income data where high income
individuals tend to not report income):
Hysterectomy Example
Data from 15 male Swiss doctors. Numbers of
hysterectomies (surgeries to remove the uterus)
performed in one year are
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
Measuring centre with the mean
The mean is the ordinary arithmetic average of the observations.
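Notation: Let $x_1, x_2, \ldots, x_n$ denote the values of a variable measured
on n individuals. The mean, $\bar{x}$, is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$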
Hysterectomy Example: Data from 15 male Swiss doctors. Numbers of
hysterectomies performed in one year are
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28
with mean $\bar{x} = 620/15 \approx 41.3$.
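The arithmetic for the mean can be checked with a short Python sketch. The function name `mean` is only an illustrative choice; the data are the 15 counts listed above.

```python
# Hysterectomy counts for the 15 male Swiss doctors (from the example above).
male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]


def mean(values):
    """Ordinary arithmetic average: sum of the observations divided by n."""
    return sum(values) / len(values)


xbar = mean(male)
print(f"n = {len(male)}, sum = {sum(male)}, mean = {xbar:.1f}")
# prints: n = 15, sum = 620, mean = 41.3
```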
Measuring centre with the median
The median is the ``middle value'' of the variable.
Ordered hysterectomy counts from the 15 male Swiss doctors
20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86
The centre observation is the median. M=34.
Compare with counts from 10 female Swiss doctors
5, 7, 10, 14, 18, 19, 25, 29, 31, 33
The pair (18,19) is the centre pair and the median is
M=(18+19)/2=18.5.
The median represents a ``typical'' observation.
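The rule for the median (the centre observation when n is odd, the average of the centre pair when n is even) can be sketched in Python. The function below is a minimal illustration written for these notes, not a library routine.

```python
def median(values):
    """Middle value of the ordered data; average of the centre pair if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single centre observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the centre pair


male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(median(male))    # 34
print(median(female))  # 18.5
```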
Mean vs. Median
In the hysterectomy example the two outliers (85 and 86) pull the mean (41.3) above the median (34).
Symmetric distribution: mean and median close.
Right-skewed: mean larger than median.
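To see how much more the mean reacts to outliers than the median does, the sketch below recomputes both for the male doctors with and without the counts 85 and 86. It uses Python's standard `statistics` module; treating exactly those two values as the outliers follows the range example in these notes.

```python
from statistics import mean, median

male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
# Drop the two outlying counts, 85 and 86.
without_outliers = [x for x in male if x not in (85, 86)]

for label, data in (("all 15 doctors", male), ("without outliers", without_outliers)):
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
# prints:
# all 15 doctors: mean = 41.3, median = 34
# without outliers: mean = 34.5, median = 33
```

The mean drops by almost 7 when the outliers are removed, while the median moves by only 1.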
Measuring Spread with the Range
- The easiest measure of spread is the range: the difference
between the largest and smallest observation.
- Sensitive to outliers.
- For the hysterectomy example the range is 86-20=66.
- Compare to the range when the two outliers are excluded:
59-20=39.
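The two range calculations can be reproduced with a few lines of Python. The helper name `value_range` is hypothetical, chosen only to avoid shadowing the built-in `range`.

```python
male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]


def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


without_outliers = [x for x in male if x not in (85, 86)]

print(value_range(male))              # 66
print(value_range(without_outliers))  # 39
```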
Measuring Spread with the Quartiles
- The first quartile, Q1, marks the first quarter of the observations.
- The third quartile, Q3, marks the first three quarters of the
observations.
- The second quartile is the median.
Male doctors:
20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86
Female doctors:
5, 7, 10, 14, 18
19, 25, 29, 31, 33
A measure of spread is the inter-quartile range (IQR), the distance between the
first and third quartiles. E.g. for the male doctors the quartiles are 27 and 50,
so IQR=50-27=23. For the female doctors they are 10 and 29, so IQR=29-10=19.
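The quartiles quoted here are consistent with one common convention: take the median of the observations below the position of the overall median for the first quartile, and the median of those above it for the third. The sketch below assumes that convention (the function names are illustrative); statistical software often uses other interpolation rules and may give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: when n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations below the median position
    upper = ordered[half + n % 2:]    # observations above the median position
    return median(lower), median(upper)


male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

for label, data in (("male", male), ("female", female)):
    q1, q3 = quartiles(data)
    print(f"{label}: Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
# prints:
# male: Q1 = 27, Q3 = 50, IQR = 23
# female: Q1 = 10, Q3 = 29, IQR = 19
```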
The Five-Number Summary
The minimum, first quartile, median, third quartile, and maximum.
Boxplots
A boxplot is a graphical representation of the five-number summary.
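As a rough illustration, the five numbers that a boxplot draws can be computed for the two groups of doctors. The sketch assumes the quartile convention that reproduces the values in these notes (medians of the lower and upper halves, leaving out the overall median when n is odd); `five_number_summary` is a hypothetical helper name.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])           # median of the lower half
    q3 = median(ordered[half + n % 2:])   # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(five_number_summary(male))    # (20, 27, 34, 50, 86)
print(five_number_summary(female))  # (5, 10, 18.5, 29, 33)
```

In a boxplot the box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.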
Measuring Spread with the Standard Deviation
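The variance, $s^2$, is the average of the squares of the deviations
of each observation from the mean (the sum of squares is divided by n-1
rather than n). The standard deviation, s,
is the square root of the variance.
Notation: $x_1, x_2, \ldots, x_n$ are measurements
on n individuals with mean $\bar{x}$.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$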
Male doctors: s=20.6
Female doctors: s=10.1
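The two standard deviations can be reproduced with a short Python sketch. It assumes the usual sample formula with divisor n-1, which is the choice that matches the values 20.6 and 10.1 reported here; the function name `sample_sd` is illustrative.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation: square root of the variance with divisor n - 1."""
    n = len(values)
    xbar = sum(values) / n
    variance = sum((x - xbar) ** 2 for x in values) / (n - 1)
    return sqrt(variance)


male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(f"Male doctors:   s = {sample_sd(male):.1f}")    # 20.6
print(f"Female doctors: s = {sample_sd(female):.1f}")  # 10.1
```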
Properties of the Standard Deviation
- Measures spread about the mean so is most appropriate when
the centre is measured by the mean
- s=0 means no spread. Always $s \geq 0$.
- s is influenced by outliers
Male doctors: s=20.6
Male doctors excluding outliers: s=11.0
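These properties can be checked numerically with `statistics.stdev` from the Python standard library, which uses the n-1 divisor. The constant data set below is a made-up example to show the s=0 case.

```python
from statistics import stdev

male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
without_outliers = [x for x in male if x not in (85, 86)]
no_spread = [7, 7, 7, 7]  # hypothetical data: every observation equal

print(f"all 15 doctors:     s = {stdev(male):.1f}")              # 20.6
print(f"excluding outliers: s = {stdev(without_outliers):.1f}")  # 11.0
print(f"identical values:   s = {stdev(no_spread):.1f}")         # 0.0
```

Removing just two of the fifteen observations roughly halves s, which is what "influenced by outliers" means in practice.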
Densities versus Distributions
- Instead of frequency on the vertical axis, plot the relative frequency
per unit of bin width.
- Area of each bar = bar height $\times$ bin width = proportion of
observations in that bin
- Proportions sum to 1
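A small sketch of this construction, using the male doctors' counts and an assumed bin width of 10 (bins [20, 30), [30, 40), ..., [80, 90); the binning is chosen here for illustration only). Each bar's height is the relative frequency divided by the bin width, so that its area is the proportion of observations in the bin.

```python
male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]

width = 10                          # assumed bin width
left_edges = range(20, 90, width)   # 20, 30, ..., 80
n = len(male)

total_area = 0.0
for left in left_edges:
    count = sum(left <= x < left + width for x in male)
    proportion = count / n          # relative frequency of the bin
    height = proportion / width     # density scale: relative frequency per unit width
    area = height * width           # equals the proportion again
    total_area += area
    print(f"[{left}, {left + width}): count = {count}, height = {height:.4f}, area = {area:.3f}")

print(f"total area = {total_area:.3f}")
```

The bar areas add up to 1 (up to floating-point rounding), matching the statement that proportions sum to 1.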
Density Curves
- Area under the curve represents proportions
- Smooth curves represent idealizations of distributions.
- Population distribution versus sample distribution
Population vs. Sample
Mean, Median and Quartiles
- Median of a symmetric distribution = mean
- Q1 and Q3 mark 25% and 75% of the density
- Mean pulled toward heavier tail for a skewed density
Population Mean and Standard Deviation
- Sample from a population whose distribution is described by
a density curve
- As the sample size gets larger, the histogram looks
more like the density curve
- As the sample size gets larger, the sample mean gets closer to the
mean of the density (population mean).
- As the sample size gets larger, the sample standard deviation
gets closer to the standard deviation of the density (population s.d.).
- Population mean is denoted $\mu$
- Population standard deviation is denoted $\sigma$
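The convergence described in this list can be illustrated by simulation. The sketch below draws samples of increasing size from a normal population; the population mean 50 and s.d. 10 are hypothetical values chosen only for the demonstration, and the seed is arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(201)          # arbitrary seed so the run is repeatable
MU, SIGMA = 50.0, 10.0    # hypothetical population mean and s.d.


def sample_summary(n):
    """Draw n observations from the population and return (sample mean, sample s.d.)."""
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    return mean(sample), stdev(sample)


for n in (10, 1_000, 100_000):
    xbar, s = sample_summary(n)
    print(f"n = {n:>6}: sample mean = {xbar:.2f}, sample s.d. = {s:.2f}")
```

With larger n the printed values should settle near 50 and 10, though any single run can wobble, especially for small samples.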

Mean and S.D. of the Standard Normal
- Mean of the standard normal density is 0
- Standard deviation of the density is 1
- ``68-95-99.7 rule''
Normal Mean $\mu$, S.D. $\sigma$
- Notation: $N(\mu, \sigma)$
- ``68-95-99.7 rule''
- If we re-scale we get a standard normal
Example
The distribution of heights of young women
aged 18 to 24 is approximately normal with mean $\mu = 64.5$
inches and s.d. $\sigma = 2.5$ inches.
So 99.7% of these young women are between
$\mu - 3\sigma$ and $\mu + 3\sigma$,
or 57 inches (4'9") to 72 inches (6').
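The 68-95-99.7 rule can be checked numerically. The sketch assumes mean 64.5 inches and s.d. 2.5 inches for the heights, values consistent with the 57 to 72 inch interval quoted above, and uses the identity that the standard normal proportion within k s.d. of the mean equals erf(k/sqrt(2)).

```python
from math import erf, sqrt

MU, SIGMA = 64.5, 2.5   # assumed mean and s.d. of the heights, in inches


def within(k):
    """Return (low, high, proportion) for the interval mean +/- k standard deviations."""
    low, high = MU - k * SIGMA, MU + k * SIGMA
    proportion = erf(k / sqrt(2))   # P(-k < Z < k) for a standard normal Z
    return low, high, proportion


for k in (1, 2, 3):
    low, high, proportion = within(k)
    print(f"within {k} s.d.: {low:.1f} to {high:.1f} inches, proportion = {proportion:.4f}")
# prints:
# within 1 s.d.: 62.0 to 67.0 inches, proportion = 0.6827
# within 2 s.d.: 59.5 to 69.5 inches, proportion = 0.9545
# within 3 s.d.: 57.0 to 72.0 inches, proportion = 0.9973
```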
Standardized Observations
Notation: $x \sim N(\mu, \sigma)$ means the observed x's come from
a distribution that is described by the normal density curve with
mean $\mu$ and s.d. $\sigma$.
To standardize, subtract the mean and divide by the s.d.:
$$z = \frac{x - \mu}{\sigma}$$
Can compute z-scores for all observations.
If $x \sim N(\mu, \sigma)$, then $z \sim N(0, 1)$.
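As a small illustration of standardizing, the sketch below computes z-scores for a few heights, assuming the young women's heights have mean 64.5 inches and s.d. 2.5 inches. The heights themselves and the function name `z_score` are chosen here for illustration.

```python
def z_score(x, mu, sigma):
    """Standardize x: subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma


MU, SIGMA = 64.5, 2.5   # assumed mean and s.d. of the heights, in inches

for height in (60, 64.5, 68, 72):
    print(f"height {height} inches -> z = {z_score(height, MU, SIGMA):.1f}")
# prints:
# height 60 inches -> z = -1.8
# height 64.5 inches -> z = 0.0
# height 68 inches -> z = 1.4
# height 72 inches -> z = 3.0
```

A z-score says how many standard deviations an observation lies above (positive) or below (negative) the mean.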
Why? and why do we care?
Why?
- If you know X has a normal distribution, the mean and s.d. tell you which
one.
- Subtracting constants from a normal
still gives you a normal.
- By subtracting the mean from X we end up with a variable with
mean 0.
Why? (cont.)
- Dividing a normal
by a constant still gives you a normal.
- By dividing a normal distribution with mean 0 by the s.d. we get
a variable with s.d. 1.
- $X - \mu$ has mean 0, s.d. $\sigma$
- $(X - \mu)/\sigma$ has mean 0, s.d. 1
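The effect of the two steps on the mean and s.d. can be seen in a simulation. The sketch draws values from a normal distribution with assumed mean 64.5 and s.d. 2.5 (the seed and sample size are arbitrary), then centres and rescales them. It checks only the mean and s.d., not the normal shape itself.

```python
import random
from statistics import mean, stdev

random.seed(1)           # arbitrary seed so the run is repeatable
MU, SIGMA = 64.5, 2.5    # assumed mean and s.d.

x = [random.gauss(MU, SIGMA) for _ in range(20_000)]
centred = [xi - MU for xi in x]        # subtract the mean
z = [ci / SIGMA for ci in centred]     # then divide by the s.d.

for label, values in (("X", x), ("X - mu", centred), ("(X - mu)/sigma", z)):
    print(f"{label:>15}: mean = {mean(values):.2f}, s.d. = {stdev(values):.2f}")
```

Up to simulation noise, subtracting the mean moves the centre to 0 without changing the spread, and dividing by the s.d. then brings the spread to 1.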
Why Do We Care?
One answer: For calculating normal proportions for data sets
that have an approximately normal distribution.
If $x \sim N(\mu, \sigma)$ and
we want to know what proportion of observations fall between
$x_L$ and $x_U$ we can
- standardize $x_L$ and $x_U$ to get $z_L$ and $z_U$
- find the proportion of standard normal
observations that fall between $z_L$ and $z_U$
Conclusion:
- 1. The population mean and s.d. give a full summary of the
population density if the density is approximately normal.
- 2. The sample mean and s.d. give a very full summary of the data
if the data come from a population with an approximately normal density.
Example cont.
What proportion of young women are between 60 and 72 inches?
(Recall $\mu = 64.5$ and $\sigma = 2.5$.)
- Standardize 60 and 72 to get -1.8 and 3
- Find the proportion of standard normal
observations that fall between -1.8 and 3
- Area = 0.999 - 0.036 = 0.963
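The same calculation can be done without a table. The sketch below standardizes the two heights and evaluates the standard normal cumulative proportion through the error function, using the identity Phi(z) = (1 + erf(z/sqrt(2)))/2. The function names are illustrative, and the mean 64.5 and s.d. 2.5 are the values implied by the z-scores -1.8 and 3.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Proportion of standard normal observations falling below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def proportion_between(x_low, x_high, mu, sigma):
    """Proportion of N(mu, sigma) observations between x_low and x_high."""
    z_low = (x_low - mu) / sigma
    z_high = (x_high - mu) / sigma
    return normal_cdf(z_high) - normal_cdf(z_low)


p = proportion_between(60, 72, 64.5, 2.5)
print(f"proportion between 60 and 72 inches = {p:.3f}")
# prints: proportion between 60 and 72 inches = 0.963
```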
Chapter 1 Summary
- Introduction to data - individuals, variables, categorical variables,
quantitative variables
- Histograms - to graph the distribution of quantitative variables
- Examining distributions - look for shape and outliers, describe
the distribution with numbers, decide on measures of centre and spread
(mean and s.d. or five-number summary)
- Sample versus population distribution - density curves have area
1 under the curve
- Normal distributions - for approximately normal data the mean
and s.d. provide a good summary
Brad McNeney
2002-01-02