Updated on 2020-11-08.

There are two foundational ideas that are driving me to carve out some time to sit down and right this today:

  1. This idea that writing technical blog posts are in many ways writing documentation for you to reference in the future.
  2. It amazes me that in 2020 this wasn’t easier to find.

I recently started my MBA program, and one of the first required course is a statistics class. Coursework starts in probability and moves on to typical stats subjects. Almost half way in, we’ve covered probability, probability distributions, sampling, confidence intervals, and hypothesis testing.

My course, like most MBA courses, presents the theory behind these things before presenting how to do, and those presentations are done in Excel. Let me tell you- I haven’t used Excel since… 2015? Maybe? (When I took this class in undergrad, we only used R, but R is no longer my primary language.)

Excel is a tool, but it’s never the tool that I go for. The same is true for Sheets. I do use the Google Suite and I’ve used GSheet for work occasionally, but I still almost always find myself gravitating towards a Jupyter notebook or something better than a spreadsheet to do the thing whatever that may be.

I can see, though, why my professors proclaim Excel as the way to go. For non-engineering folks, you can:

  • Open software and use a GUI
  • Lots of internet tutorials and help videos
  • For most people, these are probably the tools they’ll be using at work to do the job

In that case, I definitely see the value, but for me, Excel brings up a series of bad memories:

  • No version control
  • Difficult to parse formulas
  • Poorly organized or presented information that is not easy to understand or read

I’m an engineer, give me engineering tools!

I also like the narrative ability that comes with moving from top to bottom as I move through a notebook. I can take a word problem, translate it to a bunch of comments, and use those to drive what comes next.

This also makes it easier to return to later on, if I want to revisit the problem. I don’t need to redecipher what the word problem is saying, especially because in my notes, I delineate between information that is given versus information that is calculated or assumed.

While my professor teaches how to do these things in Excel, if I wanted to learn how to do it in a notebook, I needed to figure out how to do it on my own. I knew I probably wanted to use SciPy, but I didn’t know much more.

Below my cheatsheet on converting Excel formulas into SciPy formulas.

Imports needed for a notebook

from math import sqrt, ceil
from scipy.stats import binom
from scipy.stats import norm
from scipy.stats import t
import numpy as np
import pandas as pd
    
## remember: \\ escapes dollar signs
## $H_0$ and $H_a$; μ ≠;

Probability

  • Complements: P!(A) = 1-P(A)
  • Addition: P(A or B) = P(A) + P(B) - P(A and B)
  • Mutually Exclusive: P(A and B) = 0
  • Conditional: P(A|B)= P(A and B) / P(B)
  • Multiplication: P(A and B) = P(A|B) * P(B)
  • Independence: P(A and B) = P(A) * P(B)

Probability Distributions

Binomial Distribution

  • Discrete
  • Success or failure
  • n = trials
  • π = probability of success
  • mean = n * π
  • standard deviation = √n*π(1-π)

Notes on Excel formulas:

  • r = target number, see Purpose
  • n = trials
  • π = probability of success

Notes for mapping SciPy to Excel:

  • k = r
  • n = n
  • p = π
Purpose Excel Formula SciPy (or Python)
Exact values: P(X=r) BINOM.DIST(r, n, π, FALSE) binom.pmf(k, n, p)
Cumulative: P(X ≤ r) (left or lower tail) BINOM.DIST(r, n, π, TRUE) binom.cdf(k, n, p)
Complement: P(X ≥ r) (right or upper tail) 1 – BINOM.DIST(r-1, n, π, TRUE) 1 - binom.cdf(k, n, p)

Normal Distribution

  • Bell shaped curve
  • µ = mean
  • σ = standard deviation

Notes on Excel formulas:

  • c (or d) = target number, see Purpose
  • µ = mean
  • σ = standard deviation

Notes for mapping SciPy to Excel:

  • x = c (or d)
  • loc = µ
  • scale = σ
  • q = prob
Purpose Excel Formula SciPy
P(X ≤ c) ; left or lower tail NORM.DIST(c, μ, σ, TRUE) norm.cdf(x,loc, scale)
P(X ≥ d) ; right or upper tail 1 – NORM.DIST(d, μ, σ, TRUE) 1- norm.cdf(x,loc, scale)
P(c ≤ X ≤ d); interval NORM.DIST(d, μ, σ, TRUE) - NORM.DIST(c, μ, σ, TRUE) norm.cdf(d,loc, scale) - norm.cdf(c,loc, scale)
P(X ≤ c) = prob; finding the number from a percent (inverse) NORM.INV(prob, μ, σ) norm.ppf(q, loc, scale)

Standard Normal Distribution is just a normal distribution where the mean = 0 and the standard deviation = 1.

Purpose Excel Formula SciPy
P(X ≤ c) – left or lower tail, where we know the mean is 0 and standard deviation is 1 NORM.S.DIST(x,TRUE) norm.cdf(x, loc=0, scale = 1)

Statistical Inference

Confidence Intervals (Estimation)

  • µ is unknown
  • use x̄ (aka xbar)
  • interval is about x̄

Hypothesis Testing

  • some belief about µ leads to the H_0 (aka null hypothesis)
  • observe x̄ and ask how unlikely it is

Confidence Intervals

When to use Parameter Statistic Standard Error Sampling Distribution Formula
When means and σ is unknown µ σ/√n normal x̄±z(σ/√n)
When means and σ is unknown µ s/√n (use when σ is unknown) t x̄±t(s/√n)
When proportions π p √p(1-p)/n normal, if (n*p) > 5 AND n(1-p) > 5 p±Z√(p(1-p))/n

For calculating the test statistic:

  • number to enter into the formula = (1-confidence interval / 2) + confidence interval
  • 95% Confidence Interval = NORM.S.INV(.975)
  • degrees of freedom = n - 1

Notes for mapping SciPy to Excel:

  • q = number to enter into the formula = (1-confidence interval / 2) + confidence interval
  • loc = µ
  • scale = σ
  • df = df
Calculating test statistic Excel formula SciPy
t T.INV(number to enter into the formula, df) t.ppf(q,df)
z NORM.S.INV(number to enter into the formula) norm.ppf(q, loc=0, scale=1)

Specified Margin of Error (MoE)

  • MoE = margin of error
  • n = number of trials needed
  • use π = .5 for most conservative estimate
MoE Formula
mean n=(zσ/MoE)^2
proportion n=π(1-π)(z/MoE)^2
## sigma = 12
## margin_of_error = 3
## confidence_interval = 0.9
enter_into_formula = ((1-confidence_interval) / 2) + confidence_interval
z_score = norm.ppf(enter_into_formula, loc=0, scale=1)
n = ceil((z_score*sigma/margin_of_error)**2)
print(n)

Hypothesis Testing

Parameter Statistic Standard Error Sampling Distribution
µ s/√n t
π p √π(1-π)/n normal, if (n*π) > 5 AND n(1-π) > 5

Steps:

  1. Determine H_0 and H_a
  2. Specify value of α
  3. Calculate the test statistic (zobs or tobs)
  4. Draw a picture
  5. Use the test statistic to calculate the p-value
  6. Compare the p-value to α; Reject H_0 if p-value < α
  7. Translate to words

Putting it together

Confidence interval for means when σ is known

[to come]

Confidence interval for means when σ is unknown
  • s = sample standard deviation
  • confidence_interval = confidence interval
  • n = trials or sample size
  • xbar = x̄ or mean of the sample
  • Sample conclusion: “There is a 98% probability that a confidence interval calculated in this way captures the true population average claims per employee.”

    t_score = t.ppf(((1-confidence_interval)/2)+confidence_interval, n-1)
    calc = t_score*s/sqrt(n)
    print(f"{xbar} plus or minus {calc}")
    print(f"Range is {xbar-calc} to {xbar+calc}")
    
Confidence interval for proportions
  • n = trials or sample size
  • confidence_interval = confidence interval
  • success = number correct
  • Sample conclusion: “”

    proportion = success/n
    z_score = norm.ppf(((1-confidence_interval) / 2) + confidence_interval, loc=0, scale=1)
    calc = z_score*sqrt((proportion*(1-proportion))/n)
        
    print(f"{proportion} plus or minus {calc}")
    print(f"Range is {proportion-calc} to {proportion+calc}")
    
Hypothesis testing for means
  • mu = mean of the population # what we’re testing
  • n = trials or sample size
  • xbar = x̄ or mean of the sample
  • s = sample standard deviation (used when σ unknown)
  • alpha = signficance (e.g. 0.05 alpha = 5% significance)

    df = n-1
    s_over_sqrt_n = s/sqrt(n)
    tobs = (xbar - mu)/s_over_sqrt_n
    p_value = 2 * (t.cdf(tobs, df))
    tobs, p_value, p_value > alpha ## if true, reject the null
        
    # Sample conclusion: At the X% level of significance, the mean weight of chocolate is signficiantly different from (the null).
    
Hypothesis testing for proportions
  • alpha = signficance (e.g. 0.05 alpha = 5% significance)
  • success = number of successes
  • n = trials or sample size
  • mu_p = mean of the proportion
  • sigma_p = standard deviation of the proportion
  • proportion = success/n

    alpha = 0.10
    n = 50
    success = 15
    proportion = success/n
    mu_p = 0.25
    pi = mu_p
    
    ## First, check that the sampling distribution is at least approximately normal.
    n*pi > 5, n*(1-pi)> 5
    
    sigma_p = sqrt((pi*(1-pi)/n))
    zobs = (proportion-pi)/sigma_p
    p_value = 1-norm.cdf(zobs, loc = 0, scale = 1)
    zobs, p_value, p_value > alpha ## if true, reject the null
    

Typing Shortcuts

  • π = option+p
  • √ = option+v
  • ∑ = option+w
  • µ = option+m
  • ≤ = option+,
  • ≥ = option+.
  • ± = shift+option+=
  • σ requires symbol font
  • x̄ requires symbol font
  • α requires symbol font