Warming-up#
Lecture outline#
Intro#
What does the term “data science” mean to you?
So we have two terms here:
Statistics
Machine learning (ML)
How are they different from each other? For example, what is the difference between a “statistical model” and a “machine learning model”? Or are they the same thing, just fancy words created to confuse people?
What is the difference between supervised and unsupervised learning?
Checking in on your answers to Q4 in the Pre-course Quiz.
Data skills#
Here are topics that are not the primary focus of this class but are essential for building the data skills needed for reproducible science.
Programming
Collaboration and version control (Git & GitHub)
Jupyter (incl. Google Colab)
Data import and wrangling
Visualization
Basics of statistics#
Random variable: a quantity whose value follows a probability distribution. Are the following quantities random variables?
The number of apples in a basket
Your height
The height of a person sitting here whom I pick at random
Now let’s revisit expectation, variance, and their mathematical expressions (continuing from Q1 in the Pre-course Quiz):
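As a reminder, the standard textbook definitions are given here. They assume a continuous random variable \(X\) with probability density \(f(x)\), a notation not used elsewhere in these notes; for a discrete variable, replace the integral with a sum over the probability mass function.

\[
E[X] = \mu = \int_{-\infty}^{\infty} x \, f(x) \, dx,
\qquad
\mathrm{Var}(X) = \sigma^2 = E\left[(X - \mu)^2\right] = E[X^2] - \mu^2
\]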
Gaussian distribution: sums and averages of many independent random variables converge to this distribution, thanks to the central limit theorem
How does this relate to Q2 in the Pre-course Quiz?
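The central limit theorem can be checked with a quick simulation. This is a minimal sketch, not part of the quiz: the Uniform(0, 1) source distribution, the sample sizes, and the helper name `sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Return n_samples means, each computed from n_per_sample Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(n_per_sample)])
        for _ in range(n_samples)
    ]

n_per_sample = 30
means = sample_means(n_per_sample=n_per_sample, n_samples=5000)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the CLT suggests the
# sample mean is approximately Gaussian with mean 1/2 and sd sqrt(1/12/n).
expected_sd = (1 / 12 / n_per_sample) ** 0.5
observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)

# For a Gaussian, about 68% of values fall within one sd of the mean.
within_1sd = sum(abs(m - 0.5) <= expected_sd for m in means) / len(means)

print(f"mean of sample means: {observed_mean:.4f}")
print(f"sd of sample means:   {observed_sd:.4f} (theory: {expected_sd:.4f})")
print(f"fraction within 1 sd: {within_1sd:.3f}")
```

Although each individual draw is uniform rather than bell-shaped, the sample means should cluster around 0.5 with roughly Gaussian spread.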
Statistical Tests#
How do statistical tests relate to machine learning?
A simple alternative (we don’t need to run an ML model for every question!)
Statistics-based inference is sometimes important
Five general steps of hypothesis testing
Set up \(H_0\) (null hypothesis)
Set up \(H_1\) (alternative hypothesis)
Choose a test statistic and a significance level (\(\alpha\))
Find the null distribution for the test statistic; calculate the \(p\)-value
Reject or fail to reject \(H_0\) by comparing \(p\) with \(\alpha\)
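The five steps above can be sketched with a two-sided one-sample z-test. Everything specific here is an assumption for illustration: the measurements are made up, the population standard deviation `sigma` is treated as known, and the choice of a z-test is only one of many possible tests.

```python
import math
from statistics import NormalDist, fmean

# Hypothetical measurements, made up for illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 5.0, 5.7, 5.4, 5.2]

# Step 1: H0 says the population mean equals mu_0.
mu_0 = 5.0
# Step 2: H1 says the population mean differs from mu_0 (two-sided).

# Step 3: test statistic (z) and significance level.
alpha = 0.05
sigma = 0.3  # assumed known population standard deviation
n = len(sample)
z = (fmean(sample) - mu_0) / (sigma / math.sqrt(n))

# Step 4: under H0, z follows the standard normal distribution, so the
# two-sided p-value is the probability of a value at least as extreme as |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: compare p with alpha.
reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p = {p_value:.5f}, reject H0: {reject_h0}")
```

When the population standard deviation is not known, a t-test would usually replace the z-test; the five steps stay the same.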
p-hacking#
What if we repeat the test above 100 times, each with a different sample, and report the test result with the minimum \(p\)?
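One way to explore this question is by simulation. The sketch below assumes a setting that is not in the notes: the null hypothesis is actually true (data drawn from a standard normal), each test is a two-sided z-test with known standard deviation, and `z_test_p_value` is a hypothetical helper.

```python
import math
import random
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu_0=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z-test with known sigma."""
    z = (fmean(sample) - mu_0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(42)
alpha = 0.05
n_tests = 100

# H0 is true in every test: each sample comes from N(0, 1).
p_values = [
    z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(20)])
    for _ in range(n_tests)
]

min_p = min(p_values)
n_false_positives = sum(p < alpha for p in p_values)

# If the tests are independent, the chance that at least one of them
# gives p < alpha under H0 is 1 - (1 - alpha)^n_tests.
chance_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"smallest p-value: {min_p:.4f}")
print(f"tests with p < alpha: {n_false_positives} of {n_tests}")
print(f"P(at least one p < alpha | H0): {chance_at_least_one:.3f}")
```

Reporting only the smallest \(p\) hides the other tests, so a "significant" result becomes almost guaranteed even when there is no effect.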
Confidence interval (CI)#
For estimating an unknown population parameter. It builds on the same concepts as hypothesis testing but does not require stating a hypothesis.
Significance level (\(\alpha\)) -> Confidence level (\(1 - \alpha\))
Result reporting:
xxx \(\pm\) yyy (zzz% CI)
[aaa, bbb] (zzz% CI)
Frequentist view of CI: how to interpret the interval?
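The frequentist interpretation can be illustrated by repeating the sampling many times and counting how often the interval covers the true parameter. This is a minimal sketch under assumed conditions: normally distributed data, a known standard deviation, made-up parameter values, and a hypothetical helper `z_interval`.

```python
import math
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, confidence=0.95):
    """Return (center, half_width) of a z-based CI for the population mean."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_crit * sigma / math.sqrt(len(sample))
    return fmean(sample), half_width

rng = random.Random(1)
true_mean, sigma, n = 10.0, 2.0, 25  # made-up population and sample size
n_repeats = 2000

covered = 0
for _ in range(n_repeats):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    center, half = z_interval(sample, sigma)
    if center - half <= true_mean <= center + half:
        covered += 1

# Fraction of the repeated intervals that contain the true mean.
coverage = covered / n_repeats
print(f"coverage over {n_repeats} repeats: {coverage:.3f}")

# The last interval, reported in the two formats listed above.
print(f"{center:.2f} ± {half:.2f} (95% CI)")
print(f"[{center - half:.2f}, {center + half:.2f}] (95% CI)")
```

The confidence level describes the procedure: about 95% of intervals built this way cover the true mean. It does not say that one particular interval contains the true mean with 95% probability.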
Bootstrapping: a resampling approach for situations in which deriving a CI for a population parameter analytically is difficult or impossible.
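To find the CI of any population parameter using bootstrapping (\(\theta_s\): the estimate from the original sample; \(\theta^{\ast}_{q}\): the \(q\)-quantile of the bootstrap estimates):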
Percentile method: CI = \([ \theta^{\ast}_{\alpha/2}, \theta^{\ast}_{(1 - \alpha/2)} ]\) (Do not use this!)
Basic method: CI = \([ 2\theta_s - \theta^{\ast}_{(1 - \alpha/2)}, 2\theta_s - \theta^{\ast}_{\alpha/2} ]\)
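Both intervals above can be computed from the same set of bootstrap estimates. The sketch below is an illustration under assumptions: the data are made up, the median is an arbitrary choice of statistic, the quantile uses a simple nearest-rank rule, and `bootstrap_cis` and `quantile` are hypothetical helper names.

```python
import random
from statistics import median

def quantile(sorted_values, q):
    """Nearest-rank approximation of the q-quantile of a sorted list."""
    idx = round(q * (len(sorted_values) - 1))
    return sorted_values[idx]

def bootstrap_cis(sample, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Return (theta_s, percentile_ci, basic_ci) for the given statistic."""
    rng = random.Random(seed)
    n = len(sample)
    theta_s = statistic(sample)  # estimate from the original sample

    # Resample with replacement and recompute the statistic each time.
    theta_star = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    lo = quantile(theta_star, alpha / 2)
    hi = quantile(theta_star, 1 - alpha / 2)

    percentile_ci = (lo, hi)
    # The basic method reflects the bootstrap quantiles around theta_s.
    basic_ci = (2 * theta_s - hi, 2 * theta_s - lo)
    return theta_s, percentile_ci, basic_ci

# Hypothetical right-skewed data, made up for illustration.
data = [1.2, 0.4, 3.1, 0.9, 7.5, 2.2, 0.7, 1.8, 4.6, 1.1, 0.5, 2.9]
theta_s, percentile_ci, basic_ci = bootstrap_cis(data, median)

print(f"sample median: {theta_s:.2f}")
print(f"percentile CI: [{percentile_ci[0]:.2f}, {percentile_ci[1]:.2f}] (95% CI)")
print(f"basic CI:      [{basic_ci[0]:.2f}, {basic_ci[1]:.2f}] (95% CI)")
```

For a skewed bootstrap distribution the two intervals differ, because the basic method mirrors the bootstrap quantiles around the sample estimate.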
Group discussion & Demos#
Discuss how you would reproduce the figure for Q3 in the Pre-course Quiz.
Get the data necessary for doing Exercise 5.8 in Hsieh’s book.