Fall 2020 - STAT 311 D100

Data Science Laboratory for the Social Sciences (2)

Class Number: 3864

Delivery Method: In Person

Overview

  • Course Times + Location:

    Sep 9 – Dec 8, 2020: Mon, 12:30–2:20 p.m.
    Burnaby

  • Instructor:

    Aaron Danielson

  • Prerequisites:

    60 units in subjects outside of the Faculties of Science and Applied Science and one of STAT 201, STAT 203, STAT 205, STAT 270, BUEC 232, or POL 201. Corequisite: STAT 310.

Description

CALENDAR DESCRIPTION:

A hands-on application of modern tools and methods for data acquisition, management, visualization, and machine learning, capable of scaling to Big Data. No prior computer programming experience required. Projects will draw from the social sciences and integrate application area insight into the analytic toolkit from STAT 310. This course may not be used to satisfy the upper division requirements of the Statistics honours, major, or minor programs. Students who have taken STAT 240, STAT 440, or any 200-level or higher CMPT course first may not then take this course for further credit. Quantitative.

COURSE DETAILS:

Course Description:

Welcome to STAT 310/311! This course will introduce you to essential concepts and methods in data science. Each week, new modules will be uploaded to the website. To practice the new material, you will work through laboratory exercises using the R statistical computing language. We will reinforce new concepts by regularly participating in Kaggle competitions. By the end of the semester, you will (1) be familiar with the R statistical language, (2) understand the basics of several modern methods in data science, and (3) be prepared to continue learning topics in data science.

Mode of teaching:

The course features asynchronous and synchronous components. You will be responsible for viewing the lectures on your own. The laboratory tutorials will also be delivered and completed asynchronously. Each week, we will have online office hours, which will be an opportunity to ask questions and meet your classmates. You are encouraged to form study groups and collaborate with one another. Finally, we will attempt to get everyone together to present a final project. To ensure everyone’s safety, we will book a very large room on campus. If this is not possible, we will present remotely via a recorded Zoom session.


Syllabus:

This syllabus is tentative and will be updated as we progress!

I. Topic One: Introduction to Data Science
A. Prediction, explanation and exploration
B. Supervised, Unsupervised, and Semi-supervised learning
C. Probability as a way to express uncertainty
D. Different Types of Data and Some Common Distributions

Readings:   (1) Chapter 2 ISL, (2) Chapter 2 IDS, (3) Chapter 1 ML

Lab:  Getting familiar with R. Load data.  Clean and restructure data.  Ask some basic questions.  Simulation and data types.

II. Topic Two: Getting Our Feet Wet
A. Nearest Neighbour Methods
    1. Regression
    2. Classification
    3. Density Estimation
    4. Similarity Networks
B. Resampling Methods to Visualize Distributions: Permutation and Bootstrap Methods
    1. Sampling with Replacement
    2. Standard Errors
    3. Bias and Variance
    4. Confidence Intervals
    5. Visualizing a Bootstrap Distribution
    6. Hypothesis Tests with the Bootstrap

Readings: (1) Chapter 5.2 ISL, (2) An Introduction to the Bootstrap (Roger W. Johnson), (3) An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression (Altman), (4) https://medium.com/analytics-vidhya/permutation-test-as-an-alternative-to-two-sample-t-test-using-r-9f5da921bc95

Labs:  Implement the Bootstrap and Nearest Neighbour methods using hand-written functions and packages.

Kaggle Competition One:  Nearest Neighbour Challenge

https://www.kaggle.com/c/nearest-neighbour-challenge/host

Crash data in the Vancouver metropolitan area.


III. Topic Three: Linear Regression Methods

A. Prediction, explanation and exploration
B. Bias/variance trade-off
C. Evaluating model performance: cross validation and measures of prediction error
D. Overfitting and Penalized Regression Methods

Lab:  Practice with regression methods.


IV. Topic Four: Classification Methods

A. KNN classification review
B. Logistic Regression
C. Multinomial Regression
D. Linear Discriminant Analysis
E. Evaluating model performance: ROC curves

Kaggle Competition Two:  No Labels Challenge

https://www.kaggle.com/c/no-labels

BC Ferries Data

V. Topic Five: Decision and Regression Trees

A. A different approach to learning
B. How to build a tree
C. The pros and cons of tree methods
D. Assessing variable importance for ML methods

VI. Topic Six: Ensemble Learning:  Bagging, Boosting and Stacking
A. Review of the Bootstrap
B. When does bagging help?
C. Random Forests
D. Boosting Methods
E. Gradient Boosted Trees
F. An Ensemble of Weak Learners

Readings:  https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

Kaggle Competition Three:  Strength in Numbers

VII. Topic Seven: Unsupervised Learning
A. Measures of Similarity
B. Clustering: k-means, mixture models and other options
C. Dimension Reduction: PCA, MDS, Isomap, and t-SNE

Labs:  Clustering

VIII. Topic Eight:  Text Data
A. Preprocessing Text Data
B. Useful Packages in R
C. Sentiment Analysis and Classification Methods
D. TF-IDF vectors and clustering documents
E. Neural Networks and Word-Embeddings

Readings: https://playground.tensorflow.org/#activation=sigmoid&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0.003&noise=0&networkShape=8,2,2,2&seed=0.43606&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=true&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

Kaggle Competition Four:  Everyone ‘Complanes’

IX. Topic Nine: Network Analysis
A. Network Data Structures
B. Sufficient Statistics for Graphs
C. Basic Models for Network Data
D. Community Detection

X. Topic Ten: Spatial Data Analysis
A. Data Visualization
B. Kernel Density Estimation
 

Grading

  • Lab write-ups, Kaggle competitions, and assignments 25%
  • Quiz 1 (Date: TBD) 25%
  • Quiz 2 (Date: TBD) 25%
  • Project Presentation and Writeup (In Person, Date: TBD) 25%

NOTES:

Students are required to complete assignments graded for completion.  The lab component is treated much like labs in the physical and biological sciences.  Students are expected to follow instructions and submit a writeup of the work done in the lab.  One multiple choice midterm assesses the methodological topics taught in the first half of the course.  Before the final exam, students complete and present a project to the class on an applied topic of their choice.  A mixed-response final exam tests on all material introduced in the course.

All above grading is subject to change.

Materials

MATERIALS + SUPPLIES:

Access to high-speed internet, webcam

REQUIRED READING:

We will use four different textbooks.  All of them can be obtained online or through SFU’s library.

I. Statistical Learning and Data Science in R:
  • (IDS) Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry. This is available at https://rafalab.github.io/dsbook/ (Required)
  • Introduction to Statistical Machine Learning by Masashi Sugiyama. Available through SFU library.
  • Machine Learning with R: Expert techniques for predictive modeling, 3rd Edition. Available through SFU library.
  • R for Data Science by Garrett Grolemund and Hadley Wickham. This is available at https://r4ds.had.co.nz

II. Text Analysis in R:
  • (PTA) Practical Text Analytics: Maximizing the Value of Text Data by Murugan Anandarajan, Chelsey Hill, and Thomas Nolan. Available through SFU. (Required)
  • Text Mining in Practice with R by Ted Kwartler. Available through SFU library.
  • Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. Available through SFU library.

Department Undergraduate Notes:

Students with Disabilities:
Students requiring accommodations as a result of disability must contact the Centre for Accessible Learning 778-782-3112 or csdo@sfu.ca


Tutor Requests:
Students looking for a tutor should visit http://www.stat.sfu.ca/teaching/need-a-tutor-.html. We accept no responsibility for the consequences of any actions taken related to tutors.

Registrar Notes:

ACADEMIC INTEGRITY: YOUR WORK, YOUR SUCCESS

SFU’s Academic Integrity web site http://www.sfu.ca/students/academicintegrity.html is filled with information on what is meant by academic dishonesty, where you can find resources to help with your studies and the consequences of cheating.  Check out the site for more information and videos that help explain the issues in plain English.

Each student is responsible for his or her conduct as it affects the University community.  Academic dishonesty, in whatever form, is ultimately destructive of the values of the University. Furthermore, it is unfair and discouraging to the majority of students who pursue their studies honestly. Scholarly integrity is required of all members of the University. http://www.sfu.ca/policies/gazette/student/s10-01.html

TEACHING AT SFU IN FALL 2020

Teaching at SFU in fall 2020 will be conducted primarily through remote methods. There will be in-person course components in a few exceptional cases where this is fundamental to the educational goals of the course. Such course components will be clearly identified at registration, as will course components that will be “live” (synchronous) vs. at your own pace (asynchronous). Enrollment acknowledges that remote study may entail different modes of learning, interaction with your instructor, and ways of getting feedback on your work than may be the case for in-person classes. To ensure you can access all course materials, we recommend you have access to a computer with a microphone and camera, and the internet. In some cases your instructor may use Zoom or other means requiring a camera and microphone to invigilate exams. If proctoring software will be used, this will be confirmed in the first week of class.

Students with hidden or visible disabilities who believe they may need class or exam accommodations, including in the current context of remote learning, are encouraged to register with the SFU Centre for Accessible Learning (caladmin@sfu.ca or 778-782-3112).