Fall 2022 - STAT 311 D100
Data Science Laboratory for the Social Sciences (2)
Class Number: 4708
Delivery Method: In Person
Course Times + Location:
Mo 2:30 PM – 4:20 PM
AQ 3148.2, Burnaby
1 778 782-4489
Prerequisites: 60 units in subjects outside of the Faculties of Science and Applied Sciences and one of STAT 201, STAT 203, STAT 205, STAT 270, BUS 232, ECON 233, or POL 201, with a minimum grade of C-. Corequisite: STAT 310.
A hands-on application of modern tools and methods for data acquisition, management, visualization, and machine learning, capable of scaling to Big Data. No prior computer programming experience required. Projects will draw from the social sciences and integrate application area insight into the analytic toolkit from STAT 310. This course may not be used to satisfy the upper division requirements of the statistics honours, major, or minor programs. Students who have taken STAT 240, STAT 440, or any 200-level or higher CMPT course first may not then take this course for further credit. Quantitative.
Welcome to STAT 310/311! This course will introduce you to essential concepts and methods in data science. Each week, new modules will be uploaded to the website. To practice the new material, you will work through laboratory exercises using the R statistical computing language. We will reinforce new concepts by regularly participating in Kaggle competitions. By the end of the semester, you will (1) be familiar with the R statistical language, (2) understand the basics of several modern methods in data science, and (3) be prepared to continue learning topics in data science.
Mode of teaching:
The course will be delivered in person. The laboratory tutorials will be delivered and completed synchronously. Each week, office hours will be held by either the instructor or a TA. This will be an opportunity to ask questions and meet your classmates. You are encouraged to form study groups and collaborate with one another. At the end of the semester, you will also submit your project report along with a short presentation video about your project.
This syllabus is tentative and will be updated as we progress!
I. Topic One: Introduction to Data Science
Readings: (1) Chapter 2 ISL, (2) Chapter 2 IDS, (3) Chapter 1 ML
Lab: Getting familiar with R. Load data. Clean and restructure data. Ask some basic questions. Simulation and data types.
- Density Estimation
- Similarity Networks
Sampling with Replacement
Bias and Variance
Visualizing a Bootstrap Distribution
Hypothesis Tests with the Bootstrap
Readings: (1) Chapter 5.2 ISL, (2) An Introduction to the Bootstrap (Roger W. Johnson), (3) An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression (Altman), (4) https://medium.com/analytics-vidhya/permutation-test-as-an-alternative-to-two-sample-t-test-using-r-9f5da921bc95
Labs: Implement the Bootstrap and Nearest Neighbour methods using hand-written functions and packages.
Kaggle Competition One: Nearest Neighbour Challenge
Crash data in the Vancouver metropolitan area.
III. Topic Three: Linear Regression Methods
Lab: Practice with regression methods.
IV. Topic Four: Classification Methods
A. KNN classification review
B. Logistic Regression
C. Multinomial Regression
D. Linear Discriminant Analysis
E. Evaluating model performance: ROC curves
Kaggle Competition Two: No Labels Challenge
BC Ferries Data
A. A different approach to learning
B. How to build a tree
C. The pros and cons of tree methods
D. Assessing variable importance for ML methods
Kaggle Competition Three: Strength in Numbers
Kaggle Competition Four: Everyone ‘Complanes’
- Lab write-ups, Kaggle competitions, and assignments 25%
- Quiz 1 25%
- Quiz 2 25%
- Project Presentation and Writeup 25%
Students are required to complete assignments, which are graded for completion. The lab component is treated much like labs in the physical and biological sciences: students are expected to follow instructions and submit a write-up of the work done in the lab. One multiple-choice midterm assesses the methodological topics taught in the first half of the course. Before the final exam, students complete a project on an applied topic of their choice and present it to the class. A mixed-response final exam tests all material introduced in the course.
All grading above is subject to change.
Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text
Authors: Taylor Arnold and Lauren Defanti Tilton. Publisher: Springer
The book is available online for free through the SFU Library.
REQUIRED READING NOTES:
Your personalized Course Material list, including digital and physical textbooks, is available through the SFU Bookstore website by simply entering your Computing ID at: shop.sfu.ca/course-materials/my-personalized-course-materials.
Department Undergraduate Notes:
Students with Disabilities:
Students requiring accommodations as a result of disability must contact the Centre for Accessible Learning at 778-782-3112 or email@example.com.
Students looking for a tutor should visit https://www.sfu.ca/stat-actsci/all-students/other-resources/tutoring.html. We accept no responsibility for the consequences of any actions taken related to tutors.
ACADEMIC INTEGRITY: YOUR WORK, YOUR SUCCESS
SFU’s Academic Integrity website, http://www.sfu.ca/students/academicintegrity.html, is filled with information on what is meant by academic dishonesty, where you can find resources to help with your studies, and the consequences of cheating. Check out the site for more information and videos that help explain the issues in plain English.
Each student is responsible for his or her conduct as it affects the university community. Academic dishonesty, in whatever form, is ultimately destructive of the values of the university. Furthermore, it is unfair and discouraging to the majority of students who pursue their studies honestly. Scholarly integrity is required of all members of the university. http://www.sfu.ca/policies/gazette/student/s10-01.html