Collecting Your Data

Definitions

Categorical Variable: a variable whose values can be put into categories (groups).

Cluster Sampling: a researcher divides the population into clusters. Then, a simple random sample of the clusters is selected.

Cluster: a "natural" but relatively heterogeneous part of a population.

Convenience Sample is made up of people who are easy to reach.

Coverage Error: the frame is different from the population.

Expert Survey involves discussing a research problem with someone (or a group of people) with experience on a particular subject.

Frame: the set of values from which a sample is selected.

Interval Scale does imply order in the values and the differences between values are measurable and meaningful.

Measurement Error: the difference between a measured value of a quantity and its true value.

Nominal Scale does not imply any kind of order in the values. The data on a nominal scale simply categorize the objects that we study.

Non-Probability Sampling: sampling methods where we do not know the probability of an object or a particular sample being selected.

Non-Response Error: some units selected for a sample do not give information (e.g., some people you send a questionnaire to do not reply).

Ordinal Scale does imply an order in the values but leaves out the differences between values.

Parameter: anything that is calculated from the population values.

Population: all the values of interest.

Primary Data: the data collected for this project.

Probability Sampling: sampling methods where we know the probability of an object or a particular sample being selected.

Quantitative Variable: a variable whose values can be ordered and measured.

Quota Sampling: individuals are chosen out of a specific subgroup (this is usually non-random).

Ratio Scale is an interval scale that contains a true zero.

Sample: values selected for a study.

Sampling Error: the difference between a sample statistic and the population parameter that statistic is trying to estimate.

Secondary Data: the data collected for another project but used for this one.

Simple Random Sampling (SRS): each individual object is randomly picked from the population, so that every member of the population and every possible sample of a given size have the same likelihood of being selected.

Statistic: anything that is calculated from the sample values.

Stratified Sampling: a researcher divides the population into strata. Then, a probability sample is drawn from each group.

Stratum (plural: strata): a (relatively homogeneous) part of a population (i.e., all members of a stratum possess the same value for some variable).

Systematic Sampling: sample members from a population are selected according to a random starting point and a fixed, periodic interval.

True Zero: a value that implies the absence of the attribute whose value is zero.

Voluntary Response Survey is made up of people who self-select into the survey.
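The difference between stratified sampling and cluster sampling defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names, departments, and office floors are hypothetical, the helper functions `stratified_sample` and `cluster_sample` are made up for this sketch, and a simple random sample is used as the probability sample within each stratum.

```python
import random

# Hypothetical population: (name, department) pairs.
# The department is the variable that defines the strata.
population = [
    ("Ann", "Sales"), ("Bob", "Sales"), ("Cem", "Sales"), ("Dee", "Sales"),
    ("Eli", "IT"), ("Fay", "IT"), ("Gus", "IT"), ("Hal", "IT"),
    ("Ivy", "HR"), ("Jon", "HR"), ("Kim", "HR"), ("Lou", "HR"),
]

# Hypothetical clusters: office floors, each a heterogeneous mix of departments.
clusters = {
    "Floor 1": [("Ann", "Sales"), ("Eli", "IT"), ("Ivy", "HR")],
    "Floor 2": [("Bob", "Sales"), ("Fay", "IT"), ("Jon", "HR")],
    "Floor 3": [("Cem", "Sales"), ("Gus", "IT"), ("Kim", "HR")],
    "Floor 4": [("Dee", "Sales"), ("Hal", "IT"), ("Lou", "HR")],
}

def stratified_sample(units, key, per_stratum, rng):
    """Group units into strata by key, then draw an SRS from every stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

def cluster_sample(groups, n_clusters, rng):
    """Draw an SRS of whole clusters and keep every unit in the chosen clusters."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return [unit for name in chosen for unit in groups[name]]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
strat = stratified_sample(population, key=lambda u: u[1], per_stratum=2, rng=rng)
clus = cluster_sample(clusters, 2, rng)

print("Stratified:", strat)
print("Cluster:   ", clus)
```

In the stratified sample every department is represented by design, while the cluster sample contains everyone from the two floors that happened to be drawn.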

These are the illustrations I have used in class:

Here’s Why Gallup Won’t Poll the 2016 Election.

Short Form and Long Form for the 2016 Census of Population.

Figure 0020.030. 2015 Jeep Wrangler 4-door small overlap IIHS crash test.

Notes

In "Data Sources" (p. 18) the book tells you that the data are primary (or come from a primary data source) if you collect your own data for analysis, and that the data are secondary (or come from a secondary data source) if the data for your analysis have been collected by someone else. This is, in my opinion, quite misleading. It implies that who has collected the data determines whether the data are primary or secondary. It should really be defined with regard to the project for which the data are used. I will ask you to use the definition in line with Hox/Boeije:

You must remember, though, that (a) people are often sloppy, and (b) there is no law about "true" definitions: one can make any definition she wants as long as she is consistent while using that definition. People use the definitions of primary and secondary data just like the ones in the textbook (p. 18) all the time. Be aware of this, and do not assume they use the same definition as ours or Hox/Boeije's.

Systematic sampling is often used in a way that makes it non-probability sampling, to some extent. For example, a business owner may want to find out something about his customers. He may assign an employee to stand by the door and ask each 10th customer entering the store a question, starting with a randomly chosen customer between 9:15 and 9:30 AM, until the employee gets 20 answers. In this case, we do not know the probability of any customer being selected, because we do not know in advance how many customers will enter the store or which of them will arrive before the 20 answers are collected.
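For contrast, here is a minimal sketch of systematic sampling when the frame is known in advance, so the selection probability of each unit is known (1 in k). The frame of 200 customers and the helper `systematic_sample` are hypothetical and exist only for this illustration; the sketch also assumes the frame size is a multiple of the sample size.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start among the first k units, then take every k-th unit."""
    k = len(frame) // n        # fixed, periodic interval
    start = rng.randrange(k)   # random starting point: 0 .. k-1
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame: a complete list of 200 customers.
frame = [f"customer_{i:03d}" for i in range(1, 201)]

sample = systematic_sample(frame, 20, random.Random(7))
print(sample)
```

Because the start is drawn at random from the first 10 positions and the interval is fixed at 10, each customer on the list has a 1-in-10 chance of being selected. The store example has no such list, which is why the probability there is unknown.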

A simple random sample is, in the vast majority of situations, superior to any other sample because the tools of Probability Theory are readily available for generalizing from a simple random sample to the population. Indeed, when we study sampling distributions, confidence intervals, and hypothesis tests later in this class, all the techniques we use are for simple random samples. The only (common) good reason to use any other sampling method is the high cost of selecting an SRS.
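A short Python sketch may help connect SRS with the terms parameter, statistic, and sampling error from the definitions. The population values below are randomly generated and purely hypothetical (think of them as purchase amounts in dollars). In practice the parameter is usually unknown, so the sampling error cannot be computed directly like this.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 values.
population = [round(rng.uniform(5, 100), 2) for _ in range(1000)]
parameter = statistics.mean(population)   # calculated from the population values

# SRS of 50 values: random.sample draws without replacement,
# and every possible sample of size 50 is equally likely.
sample = rng.sample(population, 50)
statistic = statistics.mean(sample)       # calculated from the sample values

sampling_error = statistic - parameter

print(f"parameter = {parameter:.2f}")
print(f"statistic = {statistic:.2f}")
print(f"sampling error = {sampling_error:+.2f}")
```

Changing the seed gives a different sample and therefore a different sampling error, even though the population and the parameter stay the same.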

Read These

Chapter 1. Defining and Collecting Data in the textbook:

1.1 Defining Data (pp. 14-15)

1.2 Measurement Scales for Variables (pp. 15-17)

1.3 Collecting Data (pp. 18-19), "Data Sources" and "Population and Samples" only (you are not required to study "Data Formatting," "Data Cleaning," and "Recoding Variables").

1.4 Types of Sampling Methods (pp. 21-23)

1.5 Types of Survey Errors (pp. 24-26), "Coverage Error," "Nonresponse Error," "Sampling Error," and "Measurement Error" (you are not required to study "Ethical Issues about Surveys")

Watch These

Figure 0020.040. Survey errors and remedies.

Figure 0020.050. Measurement scales in statistics by Laura Swart. Laura gives you some mnemonics to remember the measurement scales well.

Figure 0020.060. Sampling 02: Simple Random Sampling.

Figure 0020.070. Sampling 03: Stratified Random Sampling.

Figure 0020.080. Sampling 04: Cluster Sampling.

Figure 0020.090. Sampling 05: Systematic Sampling.

Answer These

The video clip in Figure 0020.040 (Survey errors and remedies) gives an example of a sampling error in which "I only consider women" to study the relationship between educational attainment and earnings (3:45-4:15 min). That is actually misleading. Explain what kind of error that example is really about.

What is the difference between a population and a frame? Give an example that illustrates the difference.

What does SRS stand for?

What is the difference between a stratum and a cluster? Give an example that illustrates the difference.

Find out the meaning of the term "census" in Statistics. Explain it by comparing a census to a sample.

Do problem 1.1 (p. 17 in the textbook).

Do problem 1.11 (p. 18 in the textbook).

From the following list of names, select 8 to create an SRS. Write down your sample. Describe the steps you took to select the sample.

From the same list of names, select 6 to create a stratified sample. Describe the steps you took to select the sample.

From the same list of names, select 6 to create a systematic sample. Describe the steps you took to select the sample.

Do problem 1.25 (p. 24 in the textbook).

This one is more difficult: One popular book reports, "In a recent study published in the July 2006 issue of the journal Obesity, researchers examined more than 120,000 children under the age of six over a twenty-two year period and found that the prevalence of overweight among the children increased from 6.3 percent to 10 percent (a 59 percent increase);". What kind of a survey error have these researchers virtually guaranteed? Explain.