Fall 2019 - STAT 310 D100

Introduction to Data Science for the Social Sciences (2)

Class Number: 10314

Delivery Method: In Person

Overview

  • Course Times + Location:

    Mo 10:30 AM – 12:20 PM
    SECB 1010, Burnaby

  • Exam Times + Location:

    Dec 9, 2019
    12:00 PM – 3:00 PM
    AQ 5005, Burnaby

  • Prerequisites:

    60 units in subjects outside of the Faculties of Science and Applied Science and one of STAT 201, STAT 203, STAT 205, STAT 270, BUEC 232, or POL 201. Corequisite: STAT 311.

Description

CALENDAR DESCRIPTION:

An introduction to modern tools and methods for data acquisition, management, visualization, and machine learning, capable of scaling to Big Data. No prior computer programming experience required. Examples will draw from the social sciences. This course may not be used to satisfy the upper division requirements of the Statistics honours, major, or minor programs. Students who have taken STAT 240, STAT 440, or any 200-level or higher CMPT course first may not then take this course for further credit. Quantitative.

COURSE DETAILS:

STAT 310 is a concept-oriented course. Material will be taught using the minimum amount of mathematical formalism necessary.

Outline:
1. Review and extension of methods from STAT 203
A. Probability.
B. Some Common Distributions.
C. Measures of Central Tendency.
D. Statistics without Formulas:  Permutation Tests and Bootstrap Methods.

Labs:  sampling from distributions, creating plots, computing correlations and Kendall’s tau, bootstrapping to quantify uncertainty, and conducting permutation tests.

2. Introduction to Data Science
A. Common Tasks: Prediction, explanation, and exploration.
B. Basic Supervised Learning for continuous and categorical data:  Ordinary least squares, logistic regression, Poisson regression, and multinomial regression.
C. Machine Learning Methods for Supervised Learning:  Nearest Neighbors, Decision Trees, Random Forests and Gaussian Processes.
D. Unsupervised Learning:  Measures of similarity and clustering.

Labs:  Answering questions with data, practice with regression methods, nearest neighbors, fitting a random forest, and clustering.

3. Model Evaluation and Validation
A. Fitting the Data:  Residuals and Likelihood Ratios.
B. Cross-validation and Overfitting.
C. Measures of Prediction Error.

Labs:  Evaluating statistical models, K-fold cross-validation, and ROC curves.

4. Application:  Networks
A. Basic concepts of graphs.
B. Measures on Networks.
C. Community Detection.
D. Random Graph Models.

Labs:  Analyzing a network data set, creating network plots, designing your own random graph model.

5. Application:  Text Analysis
A. Words as Data.
B. Sentiment Analysis.
C. Tools from Natural Language Processing.

Labs:  Acquiring text data, performing sentiment analysis, and using word2vec.

6. Application:  Spatial Data
A. Coordinates as Data.
B. Some Techniques for Spatial Data Analysis.

Labs:  Acquiring and managing spatial data, software for spatial data.

About the Instructor:
Dr. Danielson holds degrees in Philosophy (BA Northwestern, AM University of Chicago), Policy Studies (MPP University of Chicago), Economics (MA New York University) and Statistics (Ph.D. UCLA).  Most of his research focuses on the development of statistical methodology for network data structures in the social and biological sciences.

Grading

  • Assignments 10%
  • Lab Write Ups 15%
  • Midterm 25%
  • Presentation 25%
  • Final Exam 25%

NOTES:

Students are required to complete assignments, which are graded for completion.  The lab component is treated much like labs in the physical and biological sciences:  students are expected to follow instructions and submit a write-up of the work done in the lab.  One multiple-choice midterm assesses the methodological topics taught in the first half of the course.  Before the final exam, students complete a project on an applied topic of their choice and present it to the class.  A mixed-response final exam covers all material introduced in the course.

The grading scheme above is subject to change.

Materials

REQUIRED READING:

Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text
Authors: Taylor Arnold and Lauren Defanti Tilton. Publisher: Springer
ISBN: 9783319207018

Book is available on-line for free through the SFU Library

RECOMMENDED READING:

Text Mining with R: A Tidy Approach
Authors: Julia Silge and David Robinson. Publisher: O'Reilly (Beijing, China)
ISBN: 1491981628

Book is available on-line for free through the SFU Library

Text Analysis with R for Students of Literature
Author: Matthew L. Jockers. Publisher: Springer
ISBN: 978-3-319-03163-7

Book is available on-line for free through the SFU Library

Department Undergraduate Notes:

Students with Disabilities:
Students requiring accommodations as a result of disability must contact the Centre for Accessible Learning at 778-782-3112 or csdo@sfu.ca.


Tutor Requests:
Students looking for a tutor should visit http://www.stat.sfu.ca/teaching/need-a-tutor-.html. We accept no responsibility for the consequences of any actions taken related to tutors.

Registrar Notes:

SFU’s Academic Integrity web site http://www.sfu.ca/students/academicintegrity.html is filled with information on what is meant by academic dishonesty, where you can find resources to help with your studies and the consequences of cheating.  Check out the site for more information and videos that help explain the issues in plain English.

Each student is responsible for his or her conduct as it affects the University community.  Academic dishonesty, in whatever form, is ultimately destructive of the values of the University. Furthermore, it is unfair and discouraging to the majority of students who pursue their studies honestly. Scholarly integrity is required of all members of the University. http://www.sfu.ca/policies/gazette/student/s10-01.html
