

Statistics 201 Introduction

Goals:


Why Study Statistics?

1. Which of four calorie-reduced diets lowers long term cholesterol levels best?
2. What effect do differences in the U.S. and Canadian health care systems have on how patients are treated for low back pain?
3. How likely is someone to require reoperation after spinal fusion surgery? How does this reoperation rate compare to other types of spine surgery?
4. What are the genetic and environmental factors that affect type I (insulin dependent) diabetes?
5. Can we characterize the genetic diversity of HIV within an individual at the time of transmission?

See also the research interests of the other faculty members in the Statistics group at SFU.

Homework 1, Exercise 1: What is your major? Give an example of a question in your field that can be, or has been, answered by collecting and analysing data.


What is Statistics?

Statistics is a data science.

1. Data analysis - Organizing and describing data using graphs and numerical summaries.
2. Data production - Selecting samples, designing experiments, collecting data.
3. Statistical inference - Using the data to draw conclusions about some wider universe.


Data and Data Distributions

Problem: Characterize differences in the Canadian and U.S. health care systems. Are lengths of stay in the hospital longer in Canada than the U.S.?

Individuals: the objects described by a set of data. E.g. individuals hospitalized for low back pain in Ontario or Washington state in 1994.

Variables: the characteristics of the individuals that are of interest (e.g. length of stay, age, gender, ``county'' of residence).

Categorical Variable: one that indicates which of several groups or categories an individual belongs to (e.g. gender, county).

Quantitative Variable: one that takes numerical values for which it makes sense to do arithmetic (e.g. age, length of stay).


Displaying Distributions of Categorical Variables

Example: marital status for Americans age 18 and over (Figure 1.1 of Moore, page 7).


Displaying Distributions of Quantitative Variables

Example: Senior citizens data from U.S. states (Figure 1.2 of Moore, page 10).


Patterns in Distributions

Symmetric distribution:
















Symmetric distribution with outliers (e.g. Senior citizens by state - Figure 1.2 of Moore):

















Patterns in Distributions cont.

Distribution skewed right (e.g. income data):
















Distribution skewed left (e.g. income data where high income individuals tend to not report income):

















Hysterectomy Example

Data from 15 male Swiss doctors. Numbers of hysterectomies (surgeries to remove the uterus) performed in one year are

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

\epsfig{file=hyst.ps, height=7.0in, angle=-90}


Measuring centre with the mean

The mean is the ordinary arithmetic average of the observations.

Notation: Let $x_1,x_2,\ldots,x_n$ denote the values of a variable measured on $n$ individuals. The mean, $\overline{x}$, is

\begin{displaymath}\overline{x} = \frac{x_1+x_2+\ldots+x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i
\end{displaymath}

Hysterectomy Example: Data from 15 male Swiss doctors. Numbers of hysterectomies performed in one year are

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

with mean

\begin{displaymath}\overline{x} = \frac{27+50+\ldots+28}{15} = \frac{620}{15} \approx 41.3
\end{displaymath}
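The arithmetic for the mean can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the original notes.

```python
# Hysterectomy counts for the 15 male Swiss doctors (from the example).
males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]

n = len(males)       # number of individuals
total = sum(males)   # x_1 + x_2 + ... + x_n
x_bar = total / n    # the mean: sum of the observations divided by n

print(n, total, round(x_bar, 1))
```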


Measuring centre with the median

The median is the ``middle value'' of the variable.

Ordered hysterectomy counts from the 15 male Swiss doctors

20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86

The centre observation is the median. M=34.

Compare with counts from 10 female Swiss doctors

5, 7, 10, 14, 18, 19, 25, 29, 31, 33

The pair (18,19) is the centre pair and the median is M=(18+19)/2=18.5.

The median represents a ``typical'' observation.
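The rule for finding the median (sort the data, then take the centre observation, or average the centre pair when $n$ is even) might be sketched in Python as follows. The function name `median_by_hand` is a hypothetical helper, not something from the notes.

```python
def median_by_hand(values):
    """Middle value of the ordered data; average of the centre pair if n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # odd n: single centre observation
    return (xs[mid - 1] + xs[mid]) / 2   # even n: average the centre pair

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
females = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

m_male = median_by_hand(males)
m_female = median_by_hand(females)
print(m_male, m_female)
```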


Mean vs. Median

\epsfig{file=hyst2.ps, height=7.0in, angle=-90}
$\bullet$ The two outliers pull the mean up.
$\bullet$ Symmetric distribution: mean and median close.
$\bullet$ Right-skewed: mean larger than median.
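To see how strongly the outliers affect each measure of centre, one can recompute both with and without them. The sketch below assumes that the ``two outliers'' are the two largest counts, 85 and 86; that identification is an assumption made for illustration.

```python
from statistics import mean, median

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]

# Assumption: the two outliers are the counts 85 and 86.
trimmed = [x for x in males if x < 85]

mean_all, median_all = mean(males), median(males)
mean_trim, median_trim = mean(trimmed), median(trimmed)

print("all data:        mean", round(mean_all, 1), "median", median_all)
print("without 85, 86:  mean", round(mean_trim, 1), "median", median_trim)
```

Under this assumption, dropping the two values lowers the mean by nearly 7 but the median by only 1, which is the sense in which the median resists outliers.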


Measuring Spread with the Range


Measuring Spread with the Quartiles

Male doctors:
20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86

Female doctors:
5, 7, 10, 14, 18 $\mid$ 19, 25, 29, 31, 33

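One measure of spread is the inter-quartile range (IQR), the distance between the first and third quartiles. E.g. for the male doctors the quartiles are 27 and 50, so IQR=50-27=23. For the female doctors the quartiles are 10 and 29, so IQR=29-10=19.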
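A possible Python sketch of the quartile calculation follows. It assumes the convention that each quartile is the median of one half of the ordered data, with the overall median left out of both halves when $n$ is odd. That convention reproduces the values quoted here, but other software may use different rules and give slightly different quartiles. The helper name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """First and third quartiles as medians of the lower and upper halves."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                              # observations below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]  # skip the median itself when n is odd
    return median(lower), median(upper)

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
females = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

q1_m, q3_m = quartiles(males)
q1_f, q3_f = quartiles(females)
iqr_m = q3_m - q1_m
iqr_f = q3_f - q1_f
print(q1_m, q3_m, iqr_m)
print(q1_f, q3_f, iqr_f)
```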


The Five-Number Summary

The minimum, first quartile, median, third quartile, and maximum.
\epsfig{file=hyst3.ps, height=7.0in, angle=-90}
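As a rough illustration, the five-number summary of the two hysterectomy data sets can be assembled in Python. The function `five_number_summary` is a hypothetical helper, and it assumes the median-of-halves convention for the quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]  # drop the median when n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
females = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

summary_m = five_number_summary(males)
summary_f = five_number_summary(females)
print("male:  ", summary_m)
print("female:", summary_f)
```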


Boxplots

A boxplot is a graphical representation of the five-number summary.
\epsfig{file=boxplot.ps, height=7.0in, angle=-90}


Measuring Spread with the Standard Deviation

The variance, $s^2$, is essentially the average of the squared deviations of the observations from the mean, except that the sum of squares is divided by $n-1$ rather than $n$. The standard deviation, $s$, is the square root of the variance.

Notation: $x_1,x_2,\ldots,x_n$ measurements on $n$ individuals with mean $\overline{x}$.

\begin{eqnarray*}s^2 & = & \frac{(x_1-\overline{x})^2+(x_2-\overline{x})^2+\ldots+(x_n-\overline{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^n (x_i-\overline{x})^2 \\
s & = & \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\overline{x})^2}
\end{eqnarray*}


Male doctors: s=20.6

Female doctors: s=10.1
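The two standard deviations can be reproduced directly from the definition. The sketch below uses a hypothetical helper `sample_sd` that divides by $n-1$, and compares it with `statistics.stdev` from the Python standard library, which uses the same divisor.

```python
from math import sqrt
from statistics import stdev

def sample_sd(values):
    """Standard deviation with divisor n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    s2 = sum((x - x_bar) ** 2 for x in values) / (n - 1)  # variance
    return sqrt(s2)

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
females = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

s_male = sample_sd(males)
s_female = sample_sd(females)
print(round(s_male, 1), round(s_female, 1))
print(round(stdev(males), 1), round(stdev(females), 1))  # library cross-check
```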


Properties of the Standard Deviation

Male doctors: s=20.6

Male doctors excluding outliers: s=11.0
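The sensitivity of $s$ to outliers can be checked numerically. This sketch assumes the outliers are the two largest counts, 85 and 86; the notes do not name them explicitly, so treat that choice as an assumption.

```python
from statistics import stdev

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]

# Assumption: the outliers are the two largest counts, 85 and 86.
without_outliers = [x for x in males if x < 85]

s_all = stdev(males)                 # all 15 doctors
s_trim = stdev(without_outliers)     # remaining 13 doctors
print(round(s_all, 1), round(s_trim, 1))
```

Removing just two of the fifteen observations roughly halves the standard deviation, so $s$, like the mean, is strongly influenced by outliers.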


Densities versus Distributions

\epsfig{file=hyst_prop.ps, height=7.0in, angle=-90}


Density Curves

\epsfig{file=normal_dens.ps, height=7.0in, angle=-90}


Population vs. Sample

\epsfig{file=normal_4dens.ps, height=6.5in, angle=-90}


Mean, Median and Quartiles

\epsfig{file=normal_qdens.ps, height=7.0in, angle=-90}


Population Mean and Standard Deviation


Mean and S.D. of the Standard Normal

\epsfig{file=normal_sigdens.ps, height=6.0in, angle=-90}


Normal Mean $\mu$, S.D. $\sigma$

\epsfig{file=normal_sig2dens.ps, height=6.0in, angle=-90}


Example

The distribution of heights of young women aged 18 to 24 is approximately normal with mean $\mu=64.5$ inches and s.d. $\sigma=2.5$ inches.


So 99.7% of these young women are between $\mu - 3\sigma$ and $\mu+3\sigma$, that is, between 57 inches (4'9") and 72 inches (6').
\epsfig{file=normal_htdens.ps, height=6.0in, angle=-90}
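The three-standard-deviation interval for the heights example can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch that treats the heights as exactly normal, whereas the notes only say approximately normal.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5            # mean and s.d. of heights, in inches
heights = NormalDist(mu, sigma)  # idealized normal model for the heights

low = mu - 3 * sigma             # lower end of the interval
high = mu + 3 * sigma            # upper end of the interval

# Proportion of the normal density lying between low and high.
within_3sd = heights.cdf(high) - heights.cdf(low)
print(low, high, round(within_3sd, 4))
```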


Standardized Observations

Notation: $X \sim N(\mu,\sigma)$ means the observed $x$'s come from a distribution that is described by the normal density curve with mean $\mu$ and s.d. $\sigma$.
To standardize, subtract the mean and divide by the s.d.

\begin{displaymath}Z= \frac{X-\mu}{\sigma}
\end{displaymath}

Can compute z-scores for all observations
$x_1, x_2,\ldots,x_n \quad \Rightarrow \quad z_1,z_2,\ldots,z_n$
If $X \sim N(\mu,\sigma)$, then

\begin{displaymath}Z=\frac{X-\mu}{\sigma} \sim N(0,1) \end{displaymath}
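As a small numerical illustration of standardizing, the sketch below computes z-scores for the male doctors' hysterectomy counts. It uses the sample mean $\overline{x}$ and sample s.d. $s$ in place of $\mu$ and $\sigma$, which is an assumption made only so that there is data to work with; these counts are not claimed to be normally distributed.

```python
from statistics import mean, stdev

males = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]

x_bar = mean(males)   # stands in for mu (assumption)
s = stdev(males)      # stands in for sigma (assumption)

# Standardize: subtract the mean, then divide by the s.d.
z_scores = [(x - x_bar) / s for x in males]

z_mean = mean(z_scores)
z_sd = stdev(z_scores)
print(round(z_mean, 6), round(z_sd, 6))
```

Whatever the original units, the standardized values have mean 0 and s.d. 1, up to floating-point rounding.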

Why? and why do we care?


Why?


Why? (cont.)


Why Do We Care?

One answer: For calculating normal proportions for data sets with an approximately normal distribution.


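If $X \sim N(\mu,\sigma)$ and we want to know what proportion of observations fall between $x_L$ and $x_U$, we can standardize both endpoints and find the proportion of $N(0,1)$ observations between $z_L=(x_L-\mu)/\sigma$ and $z_U=(x_U-\mu)/\sigma$.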

Conclusion:

1. The population mean and s.d. give a full summary of the population density if the density is approximately normal.
2. The sample mean and s.d. give a very full summary of the data if the data come from a population with an approximately normal density.


Example cont.

What proportion of young women are between 60 and 72 inches? (Recall $\mu=64.5$ and $\sigma=2.5$.)
\epsfig{file=normal_htadens.ps, height=6.0in, angle=-90}
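One way to work this example is to standardize the two endpoints and look up the corresponding standard normal proportions. The sketch below uses `statistics.NormalDist` from the Python standard library in place of a printed normal table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5          # mean and s.d. of heights, in inches
x_low, x_high = 60.0, 72.0     # endpoints of interest

# Standardize the endpoints: z = (x - mu) / sigma.
z_low = (x_low - mu) / sigma
z_high = (x_high - mu) / sigma

# Proportion of N(0,1) observations between the two z-scores.
std_normal = NormalDist(0, 1)
proportion = std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Cross-check without standardizing, using the N(mu, sigma) curve directly.
direct = NormalDist(mu, sigma).cdf(x_high) - NormalDist(mu, sigma).cdf(x_low)

print(round(z_low, 2), round(z_high, 2), round(proportion, 3))
```

The endpoints standardize to about $-1.8$ and $3$, and roughly 96% of the normal density lies between them.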


Chapter 1 Summary




 
Brad McNeney
2002-01-02