📊 Statistics Fundamentals
Master statistical concepts essential for data science, from descriptive statistics to hypothesis testing
1. Introduction to Statistics
Statistics is the science of collecting, analyzing, interpreting, and presenting data. It's fundamental to data science, enabling us to extract insights, make predictions, and support decision-making.
Why Statistics? To understand data patterns, make data-driven decisions, validate hypotheses, build predictive models, and quantify uncertainty.
Types of Statistics
- Descriptive Statistics - Summarize and describe data (mean, median, standard deviation)
- Inferential Statistics - Make predictions and inferences about populations from samples
Types of Data
- Quantitative (Numerical)
  - Discrete: Countable values (number of students, items sold)
  - Continuous: Measurable values (height, weight, temperature)
- Qualitative (Categorical)
  - Nominal: No inherent order (colors, gender, country)
  - Ordinal: Has an order (ratings, education level); see the short pandas sketch after this list
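As a quick illustration (a minimal sketch with made-up values, not part of the original examples), pandas stores quantitative data as numeric columns and categorical data with pd.Categorical; setting ordered=True captures the ranking of an ordinal variable.
import pandas as pd
# Hypothetical dataset mixing the four data types above
df_types = pd.DataFrame({
    'items_sold': [3, 5, 2, 8],                             # discrete (countable)
    'height_cm': [172.4, 168.0, 181.2, 175.5],              # continuous (measurable)
    'country': pd.Categorical(['US', 'IN', 'US', 'DE']),    # nominal: no order
    'education': pd.Categorical(['HS', 'BSc', 'MSc', 'BSc'],
                                categories=['HS', 'BSc', 'MSc'],
                                ordered=True)                # ordinal: ordered categories
})
print(df_types.dtypes)
print(df_types['education'].min())  # ordering is meaningful for ordinal data -> 'HS'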
2. Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset.
Key Concepts
import numpy as np
import pandas as pd
# Sample data
data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40]
# Basic descriptive statistics
print("Count:", len(data))
print("Sum:", sum(data))
print("Min:", min(data))
print("Max:", max(data))
print("Range:", max(data) - min(data))
# Using NumPy
data_np = np.array(data)
print("\nNumPy Statistics:")
print("Mean:", np.mean(data_np))
print("Median:", np.median(data_np))
print("Std Dev:", np.std(data_np))
print("Variance:", np.var(data_np))
# Using Pandas
df = pd.DataFrame({'values': data})
print("\nPandas describe():")
print(df.describe())
Frequency Distributions
import pandas as pd
# Categorical data
grades = ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'B', 'C', 'A']
# Frequency distribution
freq = pd.Series(grades).value_counts()
print("Frequency:\n", freq)
# Relative frequency (proportions)
rel_freq = pd.Series(grades).value_counts(normalize=True)
print("\nRelative Frequency:\n", rel_freq)
# Cumulative frequency
cum_freq = freq.sort_index().cumsum()
print("\nCumulative Frequency:\n", cum_freq)
3. Measures of Central Tendency
Measures that describe the center or typical value of a dataset.
Mean (Average)
Sum of all values divided by the number of values. Sensitive to outliers.
import numpy as np
import statistics as stats
data = [10, 20, 30, 40, 50]
# Mean calculation
mean = sum(data) / len(data)
print("Manual Mean:", mean)
# Using built-in functions
print("NumPy Mean:", np.mean(data))
print("Statistics Mean:", stats.mean(data))
# Weighted mean
values = [80, 85, 90]
weights = [0.3, 0.3, 0.4]
weighted_mean = sum(v * w for v, w in zip(values, weights))
print("Weighted Mean:", weighted_mean)
Median
Middle value when data is sorted. Robust to outliers.
data = [10, 20, 30, 40, 50]
# Median
print("Median:", np.median(data))
print("Median:", stats.median(data))
# With outlier
data_with_outlier = [10, 20, 30, 40, 1000]
print("\nWith outlier:")
print("Mean:", np.mean(data_with_outlier)) # 220 (affected)
print("Median:", np.median(data_with_outlier)) # 30 (not affected)
Mode
Most frequently occurring value. Can have multiple modes or no mode.
import pandas as pd
from scipy import stats as sp_stats
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Mode using scipy
mode_result = sp_stats.mode(data, keepdims=True)
print("Mode:", mode_result.mode[0])
print("Count:", mode_result.count[0])
# Mode for categorical data
grades = ['A', 'B', 'A', 'C', 'B', 'A']
mode = pd.Series(grades).mode()[0]
print("Most common grade:", mode)
When to Use Each Measure
- Mean: Symmetric distribution with no outliers
- Median: Skewed distribution or outliers present (see the sketch below)
- Mode: Categorical data or finding the most common value
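To see this rule in action, here is a small simulation sketch (arbitrary parameters, not from the original examples): on right-skewed data the mean is pulled toward the long tail, while the median stays near the bulk of the values.
import numpy as np
np.random.seed(0)
# Right-skewed data: exponential distribution with an arbitrary scale
skewed = np.random.exponential(scale=10, size=1000)
print("Mean:  ", np.mean(skewed))    # pulled upward by the long right tail
print("Median:", np.median(skewed))  # stays near the typical value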
4. Measures of Dispersion (Spread)
Measures that describe how spread out or varied the data is.
Range
data = [10, 20, 30, 40, 50]
# Range
data_range = max(data) - min(data)
print("Range:", data_range) # 40
# NumPy
print("Range:", np.ptp(data)) # Peak to peak
Variance
Average of squared deviations from the mean.
data = [10, 20, 30, 40, 50]
# Population variance
pop_variance = np.var(data)
print("Population Variance:", pop_variance)
# Sample variance (n-1 denominator)
sample_variance = np.var(data, ddof=1)
print("Sample Variance:", sample_variance)
# Manual calculation
mean = np.mean(data)
variance = sum((x - mean)**2 for x in data) / len(data)
print("Manual Variance:", variance)
Standard Deviation
Square root of variance. Same unit as original data.
# Standard deviation
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)
print("Population Std Dev:", pop_std)
print("Sample Std Dev:", sample_std)
# Interpretation: ~68% of data within 1 std dev of mean
# ~95% within 2 std devs, ~99.7% within 3 std devs (for normal distribution)
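As a quick empirical check (a simulation sketch with arbitrary parameters), we can verify the 68-95-99.7 rule by counting how many normally distributed samples fall within 1, 2, and 3 standard deviations of the mean.
import numpy as np
np.random.seed(0)
samples = np.random.normal(loc=0, scale=1, size=100_000)
for k in [1, 2, 3]:
    within = np.mean(np.abs(samples - samples.mean()) <= k * samples.std())
    print(f"Within {k} std dev: {within:.3f}")  # approximately 0.683, 0.954, 0.997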
Interquartile Range (IQR)
Range of middle 50% of data. Robust to outliers.
data = [10, 15, 20, 25, 30, 35, 40, 45, 50]
# Quartiles
q1 = np.percentile(data, 25) # 1st quartile
q2 = np.percentile(data, 50) # 2nd quartile (median)
q3 = np.percentile(data, 75) # 3rd quartile
# IQR
iqr = q3 - q1
print("Q1:", q1)
print("Q2 (Median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
# Outlier detection using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("\nOutlier bounds:", lower_bound, "to", upper_bound)
# Coefficient of Variation (CV)
cv = (np.std(data) / np.mean(data)) * 100
print("Coefficient of Variation:", cv, "%")
5. Probability Distributions
Normal Distribution (Gaussian)
Bell-shaped, symmetric distribution. Many natural phenomena follow this.
from scipy import stats
import matplotlib.pyplot as plt
# Generate normal distribution
mu, sigma = 0, 1 # mean and standard deviation
x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, mu, sigma)
# Properties
print("Mean:", mu)
print("Std Dev:", sigma)
# Probability calculations
# P(X < 1)
prob = stats.norm.cdf(1, mu, sigma)
print("P(X < 1):", prob)
# P(X > 1)
prob = 1 - stats.norm.cdf(1, mu, sigma)
print("P(X > 1):", prob)
# P(-1 < X < 1)
prob = stats.norm.cdf(1, mu, sigma) - stats.norm.cdf(-1, mu, sigma)
print("P(-1 < X < 1):", prob) # ~0.68
# Z-score (standardization)
value = 85
mean = 75
std = 10
z_score = (value - mean) / std
print("\nZ-score for 85:", z_score)
Binomial Distribution
Discrete distribution of the number of successes in n independent trials, each with the same success probability p.
# Binomial distribution
n = 10 # number of trials
p = 0.5 # probability of success
# Probability of exactly k successes
k = 6
prob = stats.binom.pmf(k, n, p)
print(f"P(X = {k}):", prob)
# Probability of at most k successes
prob_cumulative = stats.binom.cdf(k, n, p)
print(f"P(X ≤ {k}):", prob_cumulative)
# Mean and variance
mean = n * p
variance = n * p * (1 - p)
print("Mean:", mean)
print("Variance:", variance)
Poisson Distribution
Models number of events in fixed interval (calls per hour, defects per batch).
# Poisson distribution
lambda_param = 3 # average rate
# Probability of k events
k = 5
prob = stats.poisson.pmf(k, lambda_param)
print(f"P(X = {k}):", prob)
# Mean and variance (both equal to λ)
print("Mean:", lambda_param)
print("Variance:", lambda_param)
Exponential Distribution
Time between events in a Poisson process.
# Exponential distribution
lambda_param = 0.5 # rate parameter
# Probability density at x
x = 2
prob = stats.expon.pdf(x, scale=1/lambda_param)
print(f"PDF at x={x}:", prob)
# Probability X < x
prob_cdf = stats.expon.cdf(x, scale=1/lambda_param)
print(f"P(X < {x}):", prob_cdf)
6. Probability Basics
Fundamental Concepts
# Probability rules
# P(A) = favorable outcomes / total outcomes
# Example: Rolling a die
total_outcomes = 6
favorable = 1 # rolling a 6
prob = favorable / total_outcomes
print("P(rolling 6):", prob) # 0.1667
# Complement rule: P(A') = 1 - P(A)
prob_not_6 = 1 - prob
print("P(not 6):", prob_not_6) # 0.8333
Conditional Probability
P(A|B) = Probability of A given B has occurred
# Example: Cards
# P(King | Face card)
total_cards = 52
face_cards = 12
kings = 4
# P(King and Face card) = P(King) since all kings are face cards
p_king_and_face = kings / total_cards
# P(Face card)
p_face = face_cards / total_cards
# P(King | Face card)
p_king_given_face = p_king_and_face / p_face
print("P(King | Face card):", p_king_given_face) # 0.333
Independence
Events A and B are independent if P(A and B) = P(A) × P(B)
# Example: Coin flips
# P(Heads on flip 1 AND Heads on flip 2)
p_heads = 0.5
p_both_heads = p_heads * p_heads
print("P(HH):", p_both_heads) # 0.25
# Three independent events
p_three_heads = p_heads ** 3
print("P(HHH):", p_three_heads) # 0.125
Addition Rule
# P(A or B) = P(A) + P(B) - P(A and B)
# Example: Drawing a card
p_king = 4/52
p_heart = 13/52
p_king_of_hearts = 1/52
# P(King OR Heart)
p_king_or_heart = p_king + p_heart - p_king_of_hearts
print("P(King or Heart):", p_king_or_heart) # 0.308
7. Sampling Methods
Population vs Sample
- Population: Entire group of interest
- Sample: Subset of population used for analysis
- Sampling: Process of selecting samples
Sampling Techniques
import random
# Population
population = list(range(1, 101)) # 1 to 100
# 1. Simple Random Sampling
simple_sample = random.sample(population, 10)
print("Simple Random Sample:", simple_sample)
# 2. Systematic Sampling
k = 10 # every kth element
systematic_sample = population[::k]
print("Systematic Sample:", systematic_sample)
# 3. Stratified Sampling
# Divide into groups and sample from each
group1 = population[:50]
group2 = population[50:]
stratified_sample = random.sample(group1, 5) + random.sample(group2, 5)
print("Stratified Sample:", stratified_sample)
# 4. Using Pandas
df = pd.DataFrame({'value': population})
random_sample = df.sample(n=10) # 10 random rows
print("\nPandas Random Sample:")
print(random_sample)
Central Limit Theorem (CLT)
As the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution.
# Demonstrate CLT
np.random.seed(42)
# Non-normal population (uniform)
population = np.random.uniform(0, 100, 10000)
# Take many samples and calculate means
sample_means = []
for i in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))
# Sample means are approximately normal!
print("Mean of sample means:", np.mean(sample_means))
print("Std of sample means:", np.std(sample_means))
print("Population mean:", np.mean(population))
# Standard Error of Mean
sem = np.std(population) / np.sqrt(30)
print("Standard Error:", sem)
8. Hypothesis Testing
Statistical method to make decisions based on data.
Key Concepts
- Null Hypothesis (H₀): No effect or difference exists
- Alternative Hypothesis (H₁): Effect or difference exists
- p-value: Probability of observing data at least as extreme as the observed data, assuming H₀ is true
- Significance level (α): Threshold for rejecting H₀ (commonly 0.05); the simulation sketch after this list shows that α is also the Type I error rate
- Type I Error: Rejecting true H₀ (false positive)
- Type II Error: Failing to reject false H₀ (false negative)
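To make α concrete, here is a minimal simulation sketch (the population parameters and number of simulations are arbitrary): when H₀ is actually true, a test at α = 0.05 should wrongly reject it in roughly 5% of repeated samples, which is exactly the Type I error rate.
import numpy as np
from scipy import stats
np.random.seed(0)
alpha = 0.05
n_simulations = 2000
false_positives = 0
for _ in range(n_simulations):
    sample = np.random.normal(loc=50, scale=10, size=30)  # true mean really is 50
    _, p = stats.ttest_1samp(sample, 50)                  # test H₀: μ = 50
    if p < alpha:
        false_positives += 1
print("False positive rate:", false_positives / n_simulations)  # close to 0.05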
One-Sample t-Test
Test whether a sample mean differs from a hypothesized population mean.
from scipy import stats
# Sample data
sample = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
population_mean = 30
# H₀: μ = 30
# H₁: μ ≠ 30
# Perform t-test
t_statistic, p_value = stats.ttest_1samp(sample, population_mean)
print("t-statistic:", t_statistic)
print("p-value:", p_value)
# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject H₀: Mean is significantly different from 30")
else:
    print("Fail to reject H₀: No significant difference from 30")
Two-Sample t-Test
Compare means of two independent groups.
# Two groups
group1 = [23, 25, 27, 29, 31, 33, 35]
group2 = [30, 32, 34, 36, 38, 40, 42]
# H₀: μ₁ = μ₂
# H₁: μ₁ ≠ μ₂
# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
Paired t-Test
Compare means of the same group measured at two different times (e.g., before and after a treatment).
# Before and after measurements
before = [72, 75, 78, 80, 82, 85, 88]
after = [70, 73, 76, 78, 80, 83, 86]
# H₀: μ_diff = 0
# H₁: μ_diff ≠ 0
# Paired t-test
t_stat, p_value = stats.ttest_rel(before, after)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Significant change from before to after")
else:
    print("No significant change")
Z-Test
Use when the population standard deviation is known or the sample size is large (roughly n > 30).
from statsmodels.stats.weightstats import ztest
# Sample
sample = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
population_mean = 30
# Z-test
z_stat, p_value = ztest(sample, value=population_mean)
print("z-statistic:", z_stat)
print("p-value:", p_value)
9. Correlation Analysis
Measures strength and direction of relationship between variables.
Pearson Correlation
Measures linear relationship. Range: -1 to +1.
import numpy as np
from scipy import stats
# Two variables
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
# Pearson correlation
corr, p_value = stats.pearsonr(x, y)
print("Pearson r:", corr)
print("p-value:", p_value)
# Interpretation:
# r > 0.7: Strong positive correlation
# r ≈ 0: No linear correlation
# r < -0.7: Strong negative correlation
# Using NumPy
corr_matrix = np.corrcoef(x, y)
print("\nCorrelation matrix:\n", corr_matrix)
# Using Pandas
df = pd.DataFrame({'x': x, 'y': y})
print("\nPandas correlation:")
print(df.corr())
Spearman Correlation
Measures monotonic relationships, linear or not. Better suited to ordinal data or data with outliers.
# Spearman correlation
corr, p_value = stats.spearmanr(x, y)
print("Spearman rho:", corr)
print("p-value:", p_value)
Correlation vs Causation
Important: Correlation does NOT imply causation! Just because two variables are correlated doesn't mean one causes the other. There could be confounding variables or spurious correlations.
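A small simulation sketch (all variables here are made up for illustration) shows how a confounder can create a strong correlation between two quantities that have no causal link: both x and y are driven by the same hidden variable z.
import numpy as np
from scipy import stats
np.random.seed(0)
z = np.random.normal(size=500)           # hidden common cause (confounder)
x = 2 * z + np.random.normal(size=500)   # influenced by z, not by y
y = 3 * z + np.random.normal(size=500)   # influenced by z, not by x
corr, p_value = stats.pearsonr(x, y)
print("Correlation between x and y:", corr)  # strongly positive despite no causation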
10. Regression Analysis
Model relationship between dependent and independent variables.
Simple Linear Regression
Y = β₀ + β₁X + ε
from scipy import stats
import numpy as np
# Data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 10.3, 11.9, 14.2, 16.1, 17.9, 20.2])
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Slope (β₁):", slope)
print("Intercept (β₀):", intercept)
print("R-squared:", r_value**2)
print("p-value:", p_value)
# Make predictions
x_new = 11
y_pred = slope * x_new + intercept
print(f"\nPrediction for x={x_new}: {y_pred}")
# Predict for all x values
y_predicted = slope * x + intercept
print("Predicted values:", y_predicted)
Multiple Linear Regression
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Multiple predictors
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([3, 5, 7, 9, 11])
# Fit model
model = LinearRegression()
model.fit(X, y)
# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
# Make predictions
y_pred = model.predict(X)
print("Predictions:", y_pred)
# Model evaluation
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print("\nR-squared:", r2)
print("MSE:", mse)
print("RMSE:", rmse)
Assumptions of Linear Regression
- Linearity: Relationship between X and Y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of errors
- Normality: Errors are normally distributed
- No multicollinearity: Predictors are not highly correlated with each other (multiple regression); a residual-check sketch follows this list
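As a lightweight way to sanity-check some of these assumptions (a sketch rather than a full diagnostic workflow; it reuses the simple-regression data from above), we can fit the line, then test the residuals for approximate normality with a Shapiro-Wilk test and for autocorrelation with the Durbin-Watson statistic. Note that with only 10 points these tests have very little power.
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
# Same data as the simple linear regression example above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 10.3, 11.9, 14.2, 16.1, 17.9, 20.2])
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
residuals = y - (slope * x + intercept)
# Normality of errors: p > 0.05 suggests no strong departure from normality
stat, p_norm = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_norm)
# Independence of errors: a Durbin-Watson value near 2 suggests little autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))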
11. Analysis of Variance (ANOVA)
Test if means of three or more groups are significantly different.
One-Way ANOVA
Compare means across one categorical variable.
from scipy import stats
# Three groups
group1 = [23, 25, 27, 29, 31]
group2 = [30, 32, 34, 36, 38]
group3 = [35, 37, 39, 41, 43]
# H₀: μ₁ = μ₂ = μ₃
# H₁: At least one mean is different
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < 0.05:
    print("At least one group mean is significantly different")
else:
    print("No significant difference between group means")
Two-Way ANOVA
Examine effect of two categorical variables and their interaction.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Sample data
df = pd.DataFrame({
    'score': [23, 25, 27, 30, 32, 34, 35, 37, 39, 40, 42, 44],
    'method': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'],
    'gender': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F']
})
# Fit model
model = ols('score ~ C(method) + C(gender) + C(method):C(gender)', data=df).fit()
# ANOVA table
anova_table = anova_lm(model, typ=2)
print(anova_table)
Post-hoc Tests
After significant ANOVA, determine which groups differ.
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Data
data = [23, 25, 27, 29, 31, 30, 32, 34, 36, 38, 35, 37, 39, 41, 43]
groups = ['A']*5 + ['B']*5 + ['C']*5
# Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=data, groups=groups, alpha=0.05)
print(tukey)
12. Chi-Square Tests
Test relationships between categorical variables.
Chi-Square Goodness of Fit
Test if observed frequencies match expected distribution.
from scipy import stats
# Observed frequencies
observed = [30, 25, 20, 25]
# Expected frequencies (equal distribution)
expected = [25, 25, 25, 25]
# Chi-square test
chi2, p_value = stats.chisquare(observed, expected)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
if p_value < 0.05:
    print("Distribution significantly different from expected")
else:
    print("No significant difference from expected distribution")
Chi-Square Test of Independence
Test if two categorical variables are independent.
# Contingency table
# Rows: Gender (M/F), Columns: Preference (A/B/C)
observed = np.array([[30, 25, 20],
                     [20, 30, 25]])
# Chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Degrees of freedom:", dof)
print("\nExpected frequencies:\n", expected)
if p_value < 0.05:
    print("\nVariables are dependent (associated)")
else:
    print("\nVariables are independent (not associated)")
13. Bayesian Statistics
Update beliefs based on new evidence using Bayes' Theorem.
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
# Example: Medical test
# P(Disease) = 0.01 (1% of population has disease)
# P(Positive|Disease) = 0.95 (95% sensitivity)
# P(Positive|No Disease) = 0.05 (5% false positive rate)
p_disease = 0.01
p_no_disease = 1 - p_disease
p_pos_given_disease = 0.95
p_pos_given_no_disease = 0.05
# P(Positive)
p_positive = (p_pos_given_disease * p_disease +
              p_pos_given_no_disease * p_no_disease)
# P(Disease|Positive) using Bayes' Theorem
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
print("P(Disease|Positive Test):", p_disease_given_pos)
print(f"Only {p_disease_given_pos*100:.1f}% chance of having disease!")
# Prior and Posterior
print("\nPrior probability:", p_disease)
print("Posterior probability:", p_disease_given_pos)
Bayesian Updating
# Example: Coin flip bias estimation
from scipy.stats import beta
# Prior: Beta(1, 1) - uniform prior
a_prior, b_prior = 1, 1
# Observe 7 heads, 3 tails
heads, tails = 7, 3
# Posterior: Beta(a + heads, b + tails)
a_posterior = a_prior + heads
b_posterior = b_prior + tails
# Posterior mean (estimated probability of heads)
posterior_mean = a_posterior / (a_posterior + b_posterior)
print("Estimated P(Heads):", posterior_mean)
# 95% credible interval
lower = beta.ppf(0.025, a_posterior, b_posterior)
upper = beta.ppf(0.975, a_posterior, b_posterior)
print(f"95% Credible Interval: [{lower:.3f}, {upper:.3f}]")
14. Time Series Basics
Analyzing data points collected over time.
Components of Time Series
- Trend: Long-term increase or decrease
- Seasonality: Regular patterns that repeat at a fixed period
- Cyclicity: Longer-term fluctuations without a fixed period
- Noise: Random variation
import pandas as pd
import numpy as np
# Create time series
dates = pd.date_range('2023-01-01', periods=100, freq='D')
trend = np.linspace(10, 50, 100)
seasonality = 10 * np.sin(np.linspace(0, 4*np.pi, 100))
noise = np.random.normal(0, 2, 100)
values = trend + seasonality + noise
ts = pd.Series(values, index=dates)
# Moving average (smooth data)
ma_7 = ts.rolling(window=7).mean()
print("7-day moving average:\n", ma_7.tail())
# Exponential smoothing
ema = ts.ewm(span=7).mean()
print("\nExponential moving average:\n", ema.tail())
# Calculate returns (percent change)
returns = ts.pct_change()
print("\nDaily returns:\n", returns.tail())
Stationarity Testing
A stationary series has a constant mean and variance over time.
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test
result = adfuller(ts)
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:", result[4])
if result[1] < 0.05:
    print("\nSeries is stationary")
else:
    print("\nSeries is non-stationary")
# Make series stationary by differencing
ts_diff = ts.diff().dropna()
result_diff = adfuller(ts_diff)
print("\nAfter differencing p-value:", result_diff[1])
Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf
# Calculate autocorrelation
acf_values = acf(ts, nlags=20)
print("Autocorrelation:\n", acf_values)
# Note: Use plot_acf() and plot_pacf() for visualization
# (would require matplotlib)
15. Python for Statistics - Quick Reference
Essential Libraries
# Core libraries
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Statistical visualization
# Statistical libraries
from scipy import stats # Statistical functions
import statsmodels.api as sm # Statistical models
from sklearn.linear_model import LinearRegression # ML
# Install if needed:
# pip install numpy pandas scipy statsmodels scikit-learn matplotlib seaborn
Common Statistical Operations
# Generate sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000) # mean=100, std=15, n=1000
# Descriptive statistics
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Std Dev:", np.std(data))
print("Variance:", np.var(data))
print("Min:", np.min(data))
print("Max:", np.max(data))
print("25th percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))
# Pandas describe() - comprehensive summary
df = pd.DataFrame({'values': data})
print("\nPandas describe():")
print(df.describe())
# Skewness and Kurtosis
from scipy.stats import skew, kurtosis
print("\nSkewness:", skew(data))
print("Kurtosis:", kurtosis(data))
Statistical Tests Cheat Sheet
from scipy import stats
# Normality test
statistic, p = stats.shapiro(data) # Shapiro-Wilk test
print("Normal?", "Yes" if p > 0.05 else "No")
# t-tests
stats.ttest_1samp(sample, popmean) # One-sample
stats.ttest_ind(group1, group2) # Independent samples
stats.ttest_rel(before, after) # Paired samples
# Non-parametric tests
stats.mannwhitneyu(group1, group2) # Mann-Whitney U
stats.wilcoxon(before, after) # Wilcoxon signed-rank
stats.kruskal(group1, group2, group3) # Kruskal-Wallis
# Correlation
stats.pearsonr(x, y) # Pearson
stats.spearmanr(x, y) # Spearman
stats.kendalltau(x, y) # Kendall's tau
# Chi-square
stats.chisquare(observed, expected) # Goodness of fit
stats.chi2_contingency(contingency_table) # Independence
# ANOVA
stats.f_oneway(group1, group2, group3) # One-way ANOVA
Data Visualization for Statistics
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
# Histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution')
# plt.show()
# Box plot
plt.boxplot([group1, group2, group3])
plt.xticks([1, 2, 3], ['Group 1', 'Group 2', 'Group 3'])
# plt.show()
# Scatter plot with regression line
sns.regplot(x=x, y=y)
# plt.show()
# Correlation heatmap
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
# plt.show()
# Q-Q plot (check normality)
from scipy import stats
stats.probplot(data, dist="norm", plot=plt)
# plt.show()
Key Takeaways
- Descriptive stats summarize data; inferential stats make predictions
- Mean for symmetric data, median for skewed data or outliers
- Standard deviation measures spread in same units as data
- Normal distribution is bell-shaped; many stats assume normality
- p-value < 0.05 typically means statistically significant
- Correlation ≠ causation - always remember this!
- Sample size matters - larger samples give more reliable results
- Check assumptions before applying statistical tests
- Visualize first - plots reveal patterns tests might miss
- Context matters - statistical significance ≠ practical significance
Further Learning Resources
- Books: "The Art of Statistics" by David Spiegelhalter, "Think Stats" by Allen Downey
- Online: Khan Academy Statistics, StatQuest YouTube channel
- Practice: Kaggle datasets, UCI Machine Learning Repository
- Python Docs: SciPy stats documentation, Statsmodels documentation
🎉 Statistics Mastered!
You now have the statistical foundation for data science. Ready to apply these concepts?