*Updated on 2020-11-08.*

There are two foundational ideas driving me to carve out some time to sit down and write this today:

- The idea that writing technical blog posts is, in many ways, writing documentation for yourself to reference in the future.
- It amazes me that in 2020 this wasn’t easier to find.

I recently started my MBA program, and one of the first required courses is a statistics class. Coursework starts with probability and moves on to typical stats subjects. Almost halfway in, we’ve covered probability, probability distributions, sampling, confidence intervals, and hypothesis testing.

My course, like most MBA courses, presents the theory behind these things before showing how to do them, and those demonstrations are done *in Excel.* Let me tell you: I haven’t used Excel since… 2015? Maybe? (When I took this class in undergrad, we only used R, but R is no longer my primary language.)

> Excel is just chaos engineering for data teams.
>
> — TA Murphy (@tayloramurphy) September 25, 2020

Excel is a tool, but it’s never the tool that I go for. The same is true for Sheets. I do use the Google Suite, and I’ve occasionally used Google Sheets for work, but I still almost always find myself gravitating toward a Jupyter notebook or something better than a spreadsheet to do the thing, whatever that may be.

I can see, though, why my professors proclaim Excel as the way to go. For non-engineering folks:

- You can open the software and use a GUI
- There are lots of internet tutorials and help videos
- For most people, these are probably the tools they’ll be using at work to do the job

In that case, I definitely see the value, but for me, Excel brings up a series of bad memories:

- No version control
- Difficult to parse formulas
- Information that is poorly organized or presented, and hard to read

I’m an engineer, give me engineering tools!

I also like the narrative ability that comes with moving from top to bottom as I move through a notebook. I can take a word problem, translate it to a bunch of comments, and use those to drive what comes next.

This also makes it easier to return to later on, if I want to revisit the problem. I don’t need to redecipher what the word problem is saying, especially because in my notes, I delineate between information that is given versus information that is calculated or assumed.

While my professor teaches how to do these things in Excel, if I wanted to learn how to do them in a notebook, I had to figure it out on my own. I knew I probably wanted to use SciPy, but I didn’t know much more.

Below is my cheatsheet on converting Excel formulas into SciPy formulas.

## Imports needed for a notebook

```python
from math import sqrt, ceil
from scipy.stats import binom, norm, t
import numpy as np
import pandas as pd

## remember: in markdown cells, \ escapes dollar signs
## $H_0$ and $H_a$ render as LaTeX; μ and ≠ can be typed directly
```

## Probability

- Complement: P(A′) = 1 - P(A)
- Addition: P(A or B) = P(A) + P(B) - P(A and B)
- Mutually Exclusive: P(A and B) = 0
- Conditional: P(A|B)= P(A and B) / P(B)
- Multiplication: P(A and B) = P(A|B) * P(B)
- Independence: P(A and B) = P(A) * P(B)
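As a quick sanity check, these rules can be verified numerically. The probabilities below are invented for illustration:

```python
# Hypothetical probabilities for illustration
p_a = 0.5        # P(A)
p_b = 0.4        # P(B)
p_a_and_b = 0.2  # P(A and B)

# Complement
p_not_a = 1 - p_a                 # P(A') = 0.5
# Addition rule
p_a_or_b = p_a + p_b - p_a_and_b  # P(A or B) = 0.7
# Conditional probability
p_a_given_b = p_a_and_b / p_b     # P(A|B) = 0.5
# Multiplication rule recovers P(A and B)
assert abs(p_a_given_b * p_b - p_a_and_b) < 1e-9
# Independence check: is P(A and B) == P(A) * P(B)?
independent = abs(p_a_and_b - p_a * p_b) < 1e-9  # True with these numbers
```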

### Probability Distributions

#### Binomial Distribution

- Discrete
- Success or failure
- n = trials
- π = probability of success
- mean = n * π
- standard deviation = √(n*π*(1-π))

Notes on Excel formulas:

- r = target number, see Purpose
- n = trials
- π = probability of success

Notes for mapping SciPy to Excel:

- k = r
- n = n
- p = π

Purpose | Excel Formula | SciPy (or Python)
---|---|---
Exact value: P(X = r) | `BINOM.DIST(r, n, π, FALSE)` | `binom.pmf(k, n, p)`
Cumulative: P(X ≤ r) (left or lower tail) | `BINOM.DIST(r, n, π, TRUE)` | `binom.cdf(k, n, p)`
Complement: P(X ≥ r) (right or upper tail) | `1 - BINOM.DIST(r-1, n, π, TRUE)` | `1 - binom.cdf(k-1, n, p)`
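A worked example (the numbers are invented): 10 trials with a 30% success rate, asking about 4 successes. Note that the right-tail case subtracts the CDF at `k - 1`, mirroring the `r-1` in the Excel formula:

```python
from scipy.stats import binom

n, p = 10, 0.3  # hypothetical: 10 trials, 30% success rate
k = 4

exact = binom.pmf(k, n, p)             # P(X = 4)
at_most = binom.cdf(k, n, p)           # P(X <= 4)
at_least = 1 - binom.cdf(k - 1, n, p)  # P(X >= 4); note the k-1, as in Excel

# Mean and standard deviation from the formulas above
mean = n * p                     # 3.0
sd = (n * p * (1 - p)) ** 0.5    # ~1.449
```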

#### Normal Distribution

- Bell shaped curve
- µ = mean
- σ = standard deviation

Notes on Excel formulas:

- c (or d) = target number, see Purpose
- µ = mean
- σ = standard deviation

Notes for mapping SciPy to Excel:

- x = c (or d)
- loc = µ
- scale = σ
- q = prob

Purpose | Excel Formula | SciPy
---|---|---
P(X ≤ c); left or lower tail | `NORM.DIST(c, μ, σ, TRUE)` | `norm.cdf(x, loc, scale)`
P(X ≥ d); right or upper tail | `1 - NORM.DIST(d, μ, σ, TRUE)` | `1 - norm.cdf(x, loc, scale)`
P(c ≤ X ≤ d); interval | `NORM.DIST(d, μ, σ, TRUE) - NORM.DIST(c, μ, σ, TRUE)` | `norm.cdf(d, loc, scale) - norm.cdf(c, loc, scale)`
P(X ≤ c) = prob; finding the number from a percent (inverse) | `NORM.INV(prob, μ, σ)` | `norm.ppf(q, loc, scale)`

**Standard Normal Distribution** is just a normal distribution where the mean = 0 and the standard deviation = 1.

Purpose | Excel Formula | SciPy
---|---|---
P(X ≤ c); left or lower tail, where we know the mean is 0 and standard deviation is 1 | `NORM.S.DIST(x, TRUE)` | `norm.cdf(x, loc=0, scale=1)`
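For example, with a hypothetical µ = 100 and σ = 15 (the values are made up for illustration):

```python
from scipy.stats import norm

mu, sigma = 100, 15  # hypothetical mean and standard deviation

left = norm.cdf(110, mu, sigma)        # P(X <= 110)
right = 1 - norm.cdf(120, mu, sigma)   # P(X >= 120)
interval = norm.cdf(120, mu, sigma) - norm.cdf(90, mu, sigma)  # P(90 <= X <= 120)
cutoff = norm.ppf(0.95, mu, sigma)     # the value below which 95% of X falls
```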

### Statistical Inference

**Confidence Intervals (Estimation)**

- µ is unknown
- use x̄ (aka xbar)
- interval is about x̄

**Hypothesis Testing**

- some belief about µ leads to the H_0 (aka null hypothesis)
- observe x̄ and ask how unlikely it is

#### Confidence Intervals

When to use | Parameter | Statistic | Standard Error | Sampling Distribution | Formula
---|---|---|---|---|---
Means, σ known | µ | x̄ | `σ/√n` | normal | `x̄ ± z(σ/√n)`
Means, σ unknown | µ | x̄ | `s/√n` | t | `x̄ ± t(s/√n)`
Proportions | π | p | `√(p(1-p)/n)` | normal, if `n*p > 5` AND `n(1-p) > 5` | `p ± z√(p(1-p)/n)`

For calculating the test statistic (the z or t multiplier):

- number to enter into the formula = ((1 - confidence level) / 2) + confidence level
- e.g. for a 95% confidence interval: `NORM.S.INV(0.975)`
- degrees of freedom = n - 1

Notes for mapping SciPy to Excel:

- q = number to enter into the formula = ((1 - confidence level) / 2) + confidence level
- loc = µ
- scale = σ
- df = df

Calculating test statistic | Excel Formula | SciPy
---|---|---
t | `T.INV(number to enter into the formula, df)` | `t.ppf(q, df)`
z | `NORM.S.INV(number to enter into the formula)` | `norm.ppf(q, loc=0, scale=1)`
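For instance, the multipliers for a 95% confidence interval (so q = 0.975), using a hypothetical sample size of n = 20 for the t case:

```python
from scipy.stats import norm, t

confidence_level = 0.95
q = ((1 - confidence_level) / 2) + confidence_level  # 0.975

z_crit = norm.ppf(q, loc=0, scale=1)  # ~1.96
t_crit = t.ppf(q, df=20 - 1)          # ~2.09, wider than z for small samples
```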

**Specified Margin of Error (MoE)**

- MoE = margin of error
- n = number of trials needed
- use π = .5 for most conservative estimate

MoE | Formula
---|---
mean | `n = (zσ/MoE)^2`
proportion | `n = π(1-π)(z/MoE)^2`

```python
sigma = 12
margin_of_error = 3
confidence_interval = 0.9
enter_into_formula = ((1 - confidence_interval) / 2) + confidence_interval
z_score = norm.ppf(enter_into_formula, loc=0, scale=1)
n = ceil((z_score * sigma / margin_of_error) ** 2)
print(n)  # minimum sample size needed
```

#### Hypothesis Testing

Parameter | Statistic | Standard Error | Sampling Distribution
---|---|---|---
µ | x̄ | `s/√n` | t
π | p | `√(π(1-π)/n)` | normal, if `n*π > 5` AND `n(1-π) > 5`

Steps:

- Determine H_0 and H_a
- Specify value of α
- Calculate the test statistic (zobs or tobs)
- Draw a picture
- Use the test statistic to calculate the p-value
- Compare the p-value to α; Reject H_0 if p-value < α
- Translate to words

#### Putting it together

##### Confidence interval for means when σ is known

[to come]
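In the meantime, here’s a minimal sketch using the `x̄ ± z(σ/√n)` formula from the confidence interval table above (the input values are invented for illustration):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical inputs
sigma = 10                  # known population standard deviation
n = 36                      # sample size
xbar = 50                   # sample mean
confidence_interval = 0.95  # confidence level

z_score = norm.ppf(((1 - confidence_interval) / 2) + confidence_interval, loc=0, scale=1)
calc = z_score * sigma / sqrt(n)
print(f"{xbar} plus or minus {calc}")
print(f"Range is {xbar - calc} to {xbar + calc}")
```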

##### Confidence interval for means when σ is unknown

- s = sample standard deviation
- confidence_interval = confidence interval
- n = trials or sample size
- xbar = x̄ or mean of the sample

Sample conclusion: “There is a 98% probability that a confidence interval calculated in this way captures the true population average claims per employee.”

```python
t_score = t.ppf(((1 - confidence_interval) / 2) + confidence_interval, n - 1)
calc = t_score * s / sqrt(n)
print(f"{xbar} plus or minus {calc}")
print(f"Range is {xbar - calc} to {xbar + calc}")
```

##### Confidence interval for proportions

- n = trials or sample size
- confidence_interval = confidence interval
- success = number correct

Sample conclusion: “”

```python
proportion = success / n
z_score = norm.ppf(((1 - confidence_interval) / 2) + confidence_interval, loc=0, scale=1)
calc = z_score * sqrt((proportion * (1 - proportion)) / n)
print(f"{proportion} plus or minus {calc}")
print(f"Range is {proportion - calc} to {proportion + calc}")
```

##### Hypothesis testing for means

- mu = mean of the population # what we’re testing
- n = trials or sample size
- xbar = x̄ or mean of the sample
- s = sample standard deviation (used when σ unknown)
- alpha = significance (e.g. alpha = 0.05 means 5% significance)

```python
df = n - 1
s_over_sqrt_n = s / sqrt(n)
tobs = (xbar - mu) / s_over_sqrt_n
p_value = 2 * t.cdf(-abs(tobs), df)  # two-tailed p-value
tobs, p_value, p_value < alpha  # if True, reject the null
```

Sample conclusion: “At the X% level of significance, the mean weight of chocolate is significantly different from (the null).”

##### Hypothesis testing for proportions

- alpha = significance (e.g. alpha = 0.05 means 5% significance)
- success = number of successes
- n = trials or sample size
- mu_p = mean of the proportion
- sigma_p = standard deviation of the proportion
- proportion = success/n

```python
alpha = 0.10
n = 50
success = 15
proportion = success / n
mu_p = 0.25
pi = mu_p

# First, check that the sampling distribution is at least approximately normal:
# n*pi > 5 and n*(1-pi) > 5
sigma_p = sqrt(pi * (1 - pi) / n)
zobs = (proportion - pi) / sigma_p
p_value = 1 - norm.cdf(zobs, loc=0, scale=1)
zobs, p_value, p_value < alpha  # if True, reject the null
```

## Typing Shortcuts (macOS)

- π = option+p
- √ = option+v
- ∑ = option+w
- µ = option+m
- ≤ = option+,
- ≥ = option+.
- ± = shift+option+=
- σ requires symbol font
- x̄ requires symbol font
- α requires symbol font