Fall 2022 - STAT 311 D100

Data Science Laboratory for the Social Sciences (2)

Class Number: 4708

Delivery Method: In Person

Overview

  • Course Times + Location:

    Sep 7 – Dec 6, 2022: Mon, 2:30–4:20 p.m.
    Burnaby

  • Prerequisites:

    60 units in subjects outside of the Faculties of Science and Applied Sciences and one of STAT 201, STAT 203, STAT 205, STAT 270, BUS 232, ECON 233, or POL 201, with a minimum grade of C-. Corequisite: STAT 310.

Description

CALENDAR DESCRIPTION:

A hands-on application of modern tools and methods for data acquisition, management, visualization, and machine learning, capable of scaling to Big Data. No prior computer programming experience required. Projects will draw from the social sciences and integrate application area insight into the analytic toolkit from STAT 310. This course may not be used to satisfy the upper division requirements of the statistics honours, major, or minor programs. Students who have taken STAT 240, STAT 440, or any 200-level or higher CMPT course first may not then take this course for further credit. Quantitative.

COURSE DETAILS:

Course Description:

Welcome to STAT 310/311! This course will introduce you to essential concepts and methods in data science. Each week, new modules will be uploaded to the website. To practice the new material, you will work through laboratory exercises using the R statistical computing language. We will reinforce new concepts by regularly participating in Kaggle competitions. By the end of the semester, you will (1) be familiar with the R statistical language, (2) understand the basics of several modern methods in data science, and (3) be prepared to continue learning topics in data science.

Mode of teaching:

The course will be delivered in person. The laboratory tutorials will be delivered and completed synchronously. Each week, office hours will be held by either the instructor or a TA. This will be an opportunity to ask questions and meet your classmates. You are encouraged to form study groups and collaborate with one another. At the end of the semester, you will also submit your project report along with a short presentation video about your project.


Syllabus:

This syllabus is tentative and will be updated as we progress!

I. Topic One: Introduction to Data Science
A. Prediction, explanation and exploration
B. Supervised, Unsupervised, and Semi-supervised learning
C. Probability as a way to express uncertainty
D. Different Types of Data and Some Common Distributions

Readings:   (1) Chapter 2 ISL, (2) Chapter 2 IDS, (3) Chapter 1 ML

Lab:  Getting familiar with R. Load data.  Clean and restructure data.  Ask some basic questions.  Simulation and data types.

II. Topic Two: Getting Our Feet Wet
A. Nearest Neighbour Methods
    1. Regression
    2. Classification
    3. Density Estimation
    4. Similarity Networks
B. Resampling Methods to Visualize Distributions: Permutation and Bootstrap Methods
    1. Sampling with Replacement
    2. Standard Errors
    3. Bias and Variance
    4. Confidence Intervals
    5. Visualizing a Bootstrap Distribution
    6. Hypothesis Tests with the Bootstrap

Readings: (1) Chapter 5.2 ISL, (2) An Introduction to the Bootstrap (Roger W. Johnson), (3) An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression (Altman), (4) https://medium.com/analytics-vidhya/permutation-test-as-an-alternative-to-two-sample-t-test-using-r-9f5da921bc95

Labs:  Implement the Bootstrap and Nearest Neighbour methods using hand-written functions and packages.

Kaggle Competition One:  Nearest Neighbour Challenge

https://www.kaggle.com/c/nearest-neighbour-challenge/host

Crash data in the Vancouver metropolitan area.


III. Topic Three: Linear Regression Methods

A. Prediction, explanation and exploration
B. Bias/variance trade-off
C. Evaluating model performance: cross validation and measures of prediction error
D. Overfitting and Penalized Regression Methods

Lab:  Practice with regression methods.


IV. Topic Four: Classification Methods

A. KNN classification review
B. Logistic Regression
C. Multinomial Regression
D. Linear Discriminant Analysis
E. Evaluating model performance: ROC curves

Kaggle Competition Two:  No Labels Challenge

https://www.kaggle.com/c/no-labels

BC Ferries Data

V. Topic Five: Decision and Regression Trees

A. A different approach to learning
B. How to build a tree
C. The pros and cons of tree methods
D. Assessing variable importance for ML methods

VI. Topic Six: Ensemble Learning:  Bagging, Boosting and Stacking
A. Review of the Bootstrap
B. When does bagging help?
C. Random Forests
D. Boosting Methods
E. Gradient Boosted Trees
F. An Ensemble of Weak Learners

Readings:  https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

Kaggle Competition Three:  Strength in Numbers

VII. Topic Seven: Unsupervised Learning
A. Measures of Similarity
B. Clustering: k-means, mixture models and other options
C. Dimension Reduction: PCA, MDS, Isomap, and t-SNE

Labs:  Clustering

VIII. Topic Eight:  Text Data
A. Preprocessing Text Data
B. Useful Packages in R
C. Sentiment Analysis and Classification Methods
D. TF-IDF vectors and clustering documents
E. Neural Networks and Word-Embeddings

Readings: https://playground.tensorflow.org/#activation=sigmoid&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0.003&noise=0&networkShape=8,2,2,2&seed=0.43606&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=true&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

Kaggle Competition Four:  Everyone ‘Complanes’

IX. Topic Nine: Network Analysis
A. Network Data Structures
B. Sufficient Statistics for Graphs
C. Basic Models for Network Data
D. Community Detection

X. Topic Ten: Spatial Data Analysis
A. Data Visualization
B. Kernel Density Estimation
 

Grading

  • Lab write-ups, Kaggle competitions, and assignments 25%
  • Quiz 1 25%
  • Quiz 2 25%
  • Project Presentation and Writeup 25%

NOTES:

Students are required to complete assignments graded for completion.  The lab component is treated much like labs in the physical and biological sciences.  Students are expected to follow instructions and submit a writeup of the work done in the lab.  One multiple-choice midterm assesses the methodological topics taught in the first half of the course.  Before the final exam, students complete a project on an applied topic of their choice and present it to the class.  A mixed-response final exam covers all material introduced in the course.

All grading above is subject to change.

Materials

REQUIRED READING:

Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text
Authors: Taylor Arnold and Lauren Defanti Tilton. Publisher: Springer
ISBN: 9783319207018

The book is available online for free through the SFU Library.


REQUIRED READING NOTES:

Your personalized Course Material list, including digital and physical textbooks, is available through the SFU Bookstore website by simply entering your Computing ID at: shop.sfu.ca/course-materials/my-personalized-course-materials.

Department Undergraduate Notes:

Students with Disabilities:
Students requiring accommodations as a result of disability must contact the Centre for Accessible Learning at 778-782-3112 or caladmin@sfu.ca.


Tutor Requests:
Students looking for a tutor should visit https://www.sfu.ca/stat-actsci/all-students/other-resources/tutoring.html. We accept no responsibility for the consequences of any actions taken related to tutors.

Registrar Notes:

ACADEMIC INTEGRITY: YOUR WORK, YOUR SUCCESS

SFU’s Academic Integrity website http://www.sfu.ca/students/academicintegrity.html is filled with information on what is meant by academic dishonesty, where you can find resources to help with your studies and the consequences of cheating. Check out the site for more information and videos that help explain the issues in plain English.

Each student is responsible for his or her conduct as it affects the university community. Academic dishonesty, in whatever form, is ultimately destructive of the values of the university. Furthermore, it is unfair and discouraging to the majority of students who pursue their studies honestly. Scholarly integrity is required of all members of the university. http://www.sfu.ca/policies/gazette/student/s10-01.html