Warming-up#

2025.02.25

Lecture outline#

Intro#

What does the term “data science” mean to you?

So we have two terms here:

  • Statistics

  • Machine learning (ML)

How are they different from each other? For example, what is the difference between a “statistical model” and a “machine learning model”? Or are they the same thing, just fancy words created to confuse people?

What is the difference between supervised and unsupervised learning?

Let’s check your answers to Q4 in the Pre-course Quiz.

Data skills#

Here are topics that are not the primary focus of the class but are essential for building the data skills you need for reproducible science.

  • Programming

  • Collaboration and version control (Git & GitHub)

  • Jupyter (incl. Google Colab)

  • Data import and wrangling

  • Visualization

Basics of statistics#

Random variable: a quantity whose value follows a probability distribution. Are the following quantities random variables?

  • The number of apples in a basket

  • Your height

  • The height of a person sitting here whom I pick at random

Now let’s revisit expectation, variance, and their mathematical expressions (continued from Q1 in the Pre-course Quiz).
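For reference, the standard textbook definitions can be written as follows. This assumes a continuous random variable X with probability density function p(x); the symbols X, p, and μ are notation chosen here and may differ from what is used in the lecture or the quiz. For a discrete random variable, the integrals become sums over the possible values.

```latex
\begin{align}
\mathrm{E}[X] &= \mu = \int_{-\infty}^{\infty} x \, p(x) \, dx, \\
\mathrm{Var}(X) &= \mathrm{E}\big[(X - \mu)^2\big]
                 = \int_{-\infty}^{\infty} (x - \mu)^2 \, p(x) \, dx
                 = \mathrm{E}[X^2] - \mu^2 .
\end{align}
```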

More about probability distributions#

  • Gaussian: sums and averages of many independent random variables converge to this distribution, thanks to the central limit theorem

    • How does this relate to Q2 in the Pre-course Quiz?

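  • Student’s t-distribution: tied to the sample mean; it describes the standardized mean of a small sample when the variance has to be estimated from the data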

  • Beta and Gamma distributions: different supports for different kinds of environmental variables (Beta on the interval [0, 1], e.g. fractions or proportions; Gamma on positive values only)
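The points above can be checked numerically. The following is a minimal sketch using NumPy, not part of the lecture material: the seed, sample sizes, and distribution parameters are arbitrary assumptions chosen for illustration. It averages uniform random draws to show the central limit theorem at work, then draws from Beta and Gamma distributions to show their supports.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, an arbitrary choice for reproducibility

# Central limit theorem: average n_terms Uniform(0, 1) draws, repeated many times.
n_terms, n_repeats = 50, 20_000  # illustrative sizes, not taken from the lecture
sample_means = rng.uniform(0.0, 1.0, size=(n_repeats, n_terms)).mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of n_terms draws
# has expectation 1/2 and variance 1 / (12 * n_terms).
print("mean of sample means:    ", sample_means.mean())
print("variance of sample means:", sample_means.var(), "theory:", 1 / (12 * n_terms))

# For a Gaussian, about 68.3% of values lie within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(sample_means - sample_means.mean()) < sample_means.std())
print("fraction within 1 s.d.:  ", within_one_sd)

# Supports: Beta draws stay inside [0, 1]; Gamma draws are positive.
beta_draws = rng.beta(a=2.0, b=5.0, size=10_000)  # shape parameters are arbitrary
gamma_draws = rng.gamma(shape=2.0, scale=1.5, size=10_000)
print("Beta range: ", beta_draws.min(), beta_draws.max())
print("Gamma range:", gamma_draws.min(), gamma_draws.max())
```

Plotting a histogram of `sample_means` (for example with Matplotlib) should show the familiar bell shape even though each individual draw is uniform.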

Group discussion & Demos#

  1. Discuss how you would reproduce the figure for Q3 in the Pre-course Quiz.

  2. Get the data necessary for doing Exercise 4.5 in Hsieh’s book.