📊 Statistics Fundamentals
Master statistical concepts essential for data science, from descriptive statistics to hypothesis testing
1. Introduction to Statistics
Statistics is the science of collecting, analyzing, interpreting, and presenting data. It's fundamental to data science, enabling us to extract insights, make predictions, and support decision-making.
Why Statistics? To understand data patterns, make data-driven decisions, validate hypotheses, build predictive models, and quantify uncertainty.
Types of Statistics
- Descriptive Statistics - Summarize and describe data (mean, median, standard deviation)
- Inferential Statistics - Make predictions and inferences about populations from samples
Types of Data
- Quantitative (Numerical)
  - Discrete: Countable values (number of students, items sold)
  - Continuous: Measurable values (height, weight, temperature)
- Qualitative (Categorical)
  - Nominal: No inherent order (colors, gender, country)
  - Ordinal: Has an order (ratings, education level); see the short pandas sketch after this list
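As a quick illustration (a minimal sketch with made-up values, not part of the original examples), pandas stores quantitative data as numeric columns and categorical data with pd.Categorical; setting ordered=True captures the ranking of an ordinal variable.
import pandas as pd
# Hypothetical dataset mixing the four data types above
df_types = pd.DataFrame({
    'items_sold': [3, 5, 2, 8],                             # discrete (countable)
    'height_cm': [172.4, 168.0, 181.2, 175.5],              # continuous (measurable)
    'country': pd.Categorical(['US', 'IN', 'US', 'DE']),    # nominal: no order
    'education': pd.Categorical(['HS', 'BSc', 'MSc', 'BSc'],
                                categories=['HS', 'BSc', 'MSc'],
                                ordered=True)                # ordinal: ordered categories
})
print(df_types.dtypes)
print(df_types['education'].min())  # ordering is meaningful for ordinal data -> 'HS'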
2. Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset.
Key Concepts
import numpy as np
import pandas as pd
# Sample data
data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40]
# Basic descriptive statistics
print("Count:", len(data))
print("Sum:", sum(data))
print("Min:", min(data))
print("Max:", max(data))
print("Range:", max(data) - min(data))
# Using NumPy
data_np = np.array(data)
print("\nNumPy Statistics:")
print("Mean:", np.mean(data_np))
print("Median:", np.median(data_np))
print("Std Dev:", np.std(data_np))
print("Variance:", np.var(data_np))
# Using Pandas
df = pd.DataFrame({'values': data})
print("\nPandas describe():")
print(df.describe())
Frequency Distributions
import pandas as pd
# Categorical data
grades = ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'B', 'C', 'A']
# Frequency distribution
freq = pd.Series(grades).value_counts()
print("Frequency:\n", freq)
# Relative frequency (proportions)
rel_freq = pd.Series(grades).value_counts(normalize=True)
print("\nRelative Frequency:\n", rel_freq)
# Cumulative frequency
cum_freq = freq.sort_index().cumsum()
print("\nCumulative Frequency:\n", cum_freq)
3. Measures of Central Tendency
Measures that describe the center or typical value of a dataset.
Mean (Average)
Sum of all values divided by the number of values. Sensitive to outliers.
import numpy as np
import statistics as stats
data = [10, 20, 30, 40, 50]
# Mean calculation
mean = sum(data) / len(data)
print("Manual Mean:", mean)
# Using built-in functions
print("NumPy Mean:", np.mean(data))
print("Statistics Mean:", stats.mean(data))
# Weighted mean
values = [80, 85, 90]
weights = [0.3, 0.3, 0.4]
weighted_mean = sum(v * w for v, w in zip(values, weights))
print("Weighted Mean:", weighted_mean)
Median
Middle value when data is sorted. Robust to outliers.
data = [10, 20, 30, 40, 50]
# Median
print("Median:", np.median(data))
print("Median:", stats.median(data))
# With outlier
data_with_outlier = [10, 20, 30, 40, 1000]
print("\nWith outlier:")
print("Mean:", np.mean(data_with_outlier)) # 220 (affected)
print("Median:", np.median(data_with_outlier)) # 30 (not affected)
Mode
Most frequently occurring value. Can have multiple modes or no mode.
import pandas as pd
from scipy import stats as sp_stats
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Mode using scipy
mode_result = sp_stats.mode(data, keepdims=True)
print("Mode:", mode_result.mode[0])
print("Count:", mode_result.count[0])
# Mode for categorical data
grades = ['A', 'B', 'A', 'C', 'B', 'A']
mode = pd.Series(grades).mode()[0]
print("Most common grade:", mode)
When to Use Each Measure
- Mean: Symmetric distribution with no outliers
- Median: Skewed distribution or outliers present (see the sketch below)
- Mode: Categorical data or finding the most common value
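To see this rule in action, here is a small simulation sketch (arbitrary parameters, not from the original examples): on right-skewed data the mean is pulled toward the long tail, while the median stays near the bulk of the values.
import numpy as np
np.random.seed(0)
# Right-skewed data: exponential distribution with an arbitrary scale
skewed = np.random.exponential(scale=10, size=1000)
print("Mean:  ", np.mean(skewed))    # pulled upward by the long right tail
print("Median:", np.median(skewed))  # stays near the typical value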
4. Measures of Dispersion (Spread)
Measures that describe how spread out or varied the data is.
Range
data = [10, 20, 30, 40, 50]
# Range
data_range = max(data) - min(data)
print("Range:", data_range) # 40
# NumPy
print("Range:", np.ptp(data)) # Peak to peak
Variance
Average of squared deviations from the mean.
data = [10, 20, 30, 40, 50]
# Population variance
pop_variance = np.var(data)
print("Population Variance:", pop_variance)
# Sample variance (n-1 denominator)
sample_variance = np.var(data, ddof=1)
print("Sample Variance:", sample_variance)
# Manual calculation
mean = np.mean(data)
variance = sum((x - mean)**2 for x in data) / len(data)
print("Manual Variance:", variance)
Standard Deviation
Square root of variance. Same unit as original data.
# Standard deviation
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)
print("Population Std Dev:", pop_std)
print("Sample Std Dev:", sample_std)
# Interpretation: ~68% of data within 1 std dev of mean
# ~95% within 2 std devs, ~99.7% within 3 std devs (for normal distribution)
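As a quick empirical check (a simulation sketch with arbitrary parameters), we can verify the 68-95-99.7 rule by counting how many normally distributed samples fall within 1, 2, and 3 standard deviations of the mean.
import numpy as np
np.random.seed(0)
samples = np.random.normal(loc=0, scale=1, size=100_000)
for k in [1, 2, 3]:
    within = np.mean(np.abs(samples - samples.mean()) <= k * samples.std())
    print(f"Within {k} std dev: {within:.3f}")  # approximately 0.683, 0.954, 0.997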
Interquartile Range (IQR)
Range of middle 50% of data. Robust to outliers.
data = [10, 15, 20, 25, 30, 35, 40, 45, 50]
# Quartiles
q1 = np.percentile(data, 25) # 1st quartile
q2 = np.percentile(data, 50) # 2nd quartile (median)
q3 = np.percentile(data, 75) # 3rd quartile
# IQR
iqr = q3 - q1
print("Q1:", q1)
print("Q2 (Median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
# Outlier detection using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("\nOutlier bounds:", lower_bound, "to", upper_bound)
# Coefficient of Variation (CV)
cv = (np.std(data) / np.mean(data)) * 100
print("Coefficient of Variation:", cv, "%")
5. Probability Distributions
Normal Distribution (Gaussian)
Bell-shaped, symmetric distribution. Many natural phenomena follow this.
from scipy import stats
import matplotlib.pyplot as plt
# Generate normal distribution
mu, sigma = 0, 1 # mean and standard deviation
x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, mu, sigma)
# Properties
print("Mean:", mu)
print("Std Dev:", sigma)
# Probability calculations
# P(X < 1)
prob = stats.norm.cdf(1, mu, sigma)
print("P(X < 1):", prob)
# P(X > 1)
prob = 1 - stats.norm.cdf(1, mu, sigma)
print("P(X > 1):", prob)
# P(-1 < X < 1)
prob = stats.norm.cdf(1, mu, sigma) - stats.norm.cdf(-1, mu, sigma)
print("P(-1 < X < 1):", prob) # ~0.68
# Z-score (standardization)
value = 85
mean = 75
std = 10
z_score = (value - mean) / std
print("\nZ-score for 85:", z_score)
Binomial Distribution
Discrete distribution of the number of successes in n independent trials, each with the same success probability p.
# Binomial distribution
n = 10 # number of trials
p = 0.5 # probability of success
# Probability of exactly k successes
k = 6
prob = stats.binom.pmf(k, n, p)
print(f"P(X = {k}):", prob)
# Probability of at most k successes
prob_cumulative = stats.binom.cdf(k, n, p)
print(f"P(X ≤ {k}):", prob_cumulative)
# Mean and variance
mean = n * p
variance = n * p * (1 - p)
print("Mean:", mean)
print("Variance:", variance)
Poisson Distribution
Models number of events in fixed interval (calls per hour, defects per batch).
# Poisson distribution
lambda_param = 3 # average rate
# Probability of k events
k = 5
prob = stats.poisson.pmf(k, lambda_param)
print(f"P(X = {k}):", prob)
# Mean and variance (both equal to λ)
print("Mean:", lambda_param)
print("Variance:", lambda_param)
Exponential Distribution
Time between events in a Poisson process.
# Exponential distribution
lambda_param = 0.5 # rate parameter
# Probability density at x
x = 2
prob = stats.expon.pdf(x, scale=1/lambda_param)
print(f"PDF at x={x}:", prob)
# Probability X < x
prob_cdf = stats.expon.cdf(x, scale=1/lambda_param)
print(f"P(X < {x}):", prob_cdf)
6. Probability Basics
Fundamental Concepts
# Probability rules
# P(A) = favorable outcomes / total outcomes
# Example: Rolling a die
total_outcomes = 6
favorable = 1 # rolling a 6
prob = favorable / total_outcomes
print("P(rolling 6):", prob) # 0.1667
# Complement rule: P(A') = 1 - P(A)
prob_not_6 = 1 - prob
print("P(not 6):", prob_not_6) # 0.8333
Conditional Probability
P(A|B) = Probability of A given B has occurred
# Example: Cards
# P(King | Face card)
total_cards = 52
face_cards = 12
kings = 4
# P(King and Face card) = P(King) since all kings are face cards
p_king_and_face = kings / total_cards
# P(Face card)
p_face = face_cards / total_cards
# P(King | Face card)
p_king_given_face = p_king_and_face / p_face
print("P(King | Face card):", p_king_given_face) # 0.333
Independence
Events A and B are independent if P(A and B) = P(A) × P(B)
# Example: Coin flips
# P(Heads on flip 1 AND Heads on flip 2)
p_heads = 0.5
p_both_heads = p_heads * p_heads
print("P(HH):", p_both_heads) # 0.25
# Three independent events
p_three_heads = p_heads ** 3
print("P(HHH):", p_three_heads) # 0.125
Addition Rule
# P(A or B) = P(A) + P(B) - P(A and B)
# Example: Drawing a card
p_king = 4/52
p_heart = 13/52
p_king_of_hearts = 1/52
# P(King OR Heart)
p_king_or_heart = p_king + p_heart - p_king_of_hearts
print("P(King or Heart):", p_king_or_heart) # 0.308
7. Sampling Methods
Population vs Sample
- Population: Entire group of interest
- Sample: Subset of population used for analysis
- Sampling: Process of selecting samples
Sampling Techniques
import random
# Population
population = list(range(1, 101)) # 1 to 100
# 1. Simple Random Sampling
simple_sample = random.sample(population, 10)
print("Simple Random Sample:", simple_sample)
# 2. Systematic Sampling
k = 10 # every kth element
systematic_sample = population[::k]
print("Systematic Sample:", systematic_sample)
# 3. Stratified Sampling
# Divide into groups and sample from each
group1 = population[:50]
group2 = population[50:]
stratified_sample = random.sample(group1, 5) + random.sample(group2, 5)
print("Stratified Sample:", stratified_sample)
# 4. Using Pandas
df = pd.DataFrame({'value': population})
random_sample = df.sample(n=10) # 10 random rows
print("\nPandas Random Sample:")
print(random_sample)
Central Limit Theorem (CLT)
As the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution.
# Demonstrate CLT
np.random.seed(42)
# Non-normal population (uniform)
population = np.random.uniform(0, 100, 10000)
# Take many samples and calculate means
sample_means = []
for i in range(1000):
    sample = np.random.choice(population, size=30)
    sample_means.append(np.mean(sample))
# Sample means are approximately normal!
print("Mean of sample means:", np.mean(sample_means))
print("Std of sample means:", np.std(sample_means))
print("Population mean:", np.mean(population))
# Standard Error of Mean
sem = np.std(population) / np.sqrt(30)
print("Standard Error:", sem)
8. Hypothesis Testing
Statistical method to make decisions based on data.
Key Concepts
- Null Hypothesis (H₀): No effect or difference exists
- Alternative Hypothesis (H₁): Effect or difference exists
- p-value: Probability of observing data at least as extreme as the observed data, assuming H₀ is true
- Significance level (α): Threshold for rejecting H₀ (commonly 0.05); the simulation sketch after this list shows that α is also the Type I error rate
- Type I Error: Rejecting true H₀ (false positive)
- Type II Error: Failing to reject false H₀ (false negative)
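To make α concrete, here is a minimal simulation sketch (the population parameters and number of simulations are arbitrary): when H₀ is actually true, a test at α = 0.05 should wrongly reject it in roughly 5% of repeated samples, which is exactly the Type I error rate.
import numpy as np
from scipy import stats
np.random.seed(0)
alpha = 0.05
n_simulations = 2000
false_positives = 0
for _ in range(n_simulations):
    sample = np.random.normal(loc=50, scale=10, size=30)  # true mean really is 50
    _, p = stats.ttest_1samp(sample, 50)                  # test H₀: μ = 50
    if p < alpha:
        false_positives += 1
print("False positive rate:", false_positives / n_simulations)  # close to 0.05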
One-Sample t-Test
Test whether a sample mean differs from a hypothesized population mean.
from scipy import stats
# Sample data
sample = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
population_mean = 30
# H₀: μ = 30
# H₁: μ ≠ 30
# Perform t-test
t_statistic, p_value = stats.ttest_1samp(sample, population_mean)
print("t-statistic:", t_statistic)
print("p-value:", p_value)
# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject H₀: Mean is significantly different from 30")
else:
    print("Fail to reject H₀: No significant difference from 30")
Two-Sample t-Test
Compare means of two independent groups.
# Two groups
group1 = [23, 25, 27, 29, 31, 33, 35]
group2 = [30, 32, 34, 36, 38, 40, 42]
# H₀: μ₁ = μ₂
# H₁: μ₁ ≠ μ₂
# Independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
Paired t-Test
Compare means of the same group measured at two different times (e.g., before and after a treatment).
# Before and after measurements
before = [72, 75, 78, 80, 82, 85, 88]
after = [70, 73, 76, 78, 80, 83, 86]
# H₀: μ_diff = 0
# H₁: μ_diff ≠ 0
# Paired t-test
t_stat, p_value = stats.ttest_rel(before, after)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Significant change from before to after")
else:
    print("No significant change")
Z-Test
Use when the population standard deviation is known or the sample size is large (roughly n > 30).
from statsmodels.stats.weightstats import ztest
# Sample
sample = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41]
population_mean = 30
# Z-test
z_stat, p_value = ztest(sample, value=population_mean)
print("z-statistic:", z_stat)
print("p-value:", p_value)
9. Correlation Analysis
Measures strength and direction of relationship between variables.
Pearson Correlation
Measures linear relationship. Range: -1 to +1.
import numpy as np
from scipy import stats
# Two variables
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
# Pearson correlation
corr, p_value = stats.pearsonr(x, y)
print("Pearson r:", corr)
print("p-value:", p_value)
# Interpretation:
# r > 0.7: Strong positive correlation
# r ≈ 0: No linear correlation
# r < -0.7: Strong negative correlation
# Using NumPy
corr_matrix = np.corrcoef(x, y)
print("\nCorrelation matrix:\n", corr_matrix)
# Using Pandas
df = pd.DataFrame({'x': x, 'y': y})
print("\nPandas correlation:")
print(df.corr())
Spearman Correlation
Measures monotonic relationships, linear or not. Better suited to ordinal data or data with outliers.
# Spearman correlation
corr, p_value = stats.spearmanr(x, y)
print("Spearman rho:", corr)
print("p-value:", p_value)
Correlation vs Causation
Important: Correlation does NOT imply causation! Just because two variables are correlated doesn't mean one causes the other. There could be confounding variables or spurious correlations.
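A small simulation sketch (all variables here are made up for illustration) shows how a confounder can create a strong correlation between two quantities that have no causal link: both x and y are driven by the same hidden variable z.
import numpy as np
from scipy import stats
np.random.seed(0)
z = np.random.normal(size=500)           # hidden common cause (confounder)
x = 2 * z + np.random.normal(size=500)   # influenced by z, not by y
y = 3 * z + np.random.normal(size=500)   # influenced by z, not by x
corr, p_value = stats.pearsonr(x, y)
print("Correlation between x and y:", corr)  # strongly positive despite no causation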
10. Regression Analysis
Model relationship between dependent and independent variables.
Simple Linear Regression
Y = β₀ + β₁X + ε
from scipy import stats
import numpy as np
# Data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 10.3, 11.9, 14.2, 16.1, 17.9, 20.2])
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Slope (β₁):", slope)
print("Intercept (β₀):", intercept)
print("R-squared:", r_value**2)
print("p-value:", p_value)
# Make predictions
x_new = 11
y_pred = slope * x_new + intercept
print(f"\nPrediction for x={x_new}: {y_pred}")
# Predict for all x values
y_predicted = slope * x + intercept
print("Predicted values:", y_predicted)
Multiple Linear Regression
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Multiple predictors
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([3, 5, 7, 9, 11])
# Fit model
model = LinearRegression()
model.fit(X, y)
# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
# Make predictions
y_pred = model.predict(X)
print("Predictions:", y_pred)
# Model evaluation
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print("\nR-squared:", r2)
print("MSE:", mse)
print("RMSE:", rmse)
Assumptions of Linear Regression
- Linearity: Relationship between X and Y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of errors
- Normality: Errors are normally distributed
- No multicollinearity: Predictors are not highly correlated with each other (multiple regression); a residual-check sketch follows this list
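As a lightweight way to sanity-check some of these assumptions (a sketch rather than a full diagnostic workflow; it reuses the simple-regression data from above), we can fit the line, then test the residuals for approximate normality with a Shapiro-Wilk test and for autocorrelation with the Durbin-Watson statistic. Note that with only 10 points these tests have very little power.
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
# Same data as the simple linear regression example above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 10.3, 11.9, 14.2, 16.1, 17.9, 20.2])
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
residuals = y - (slope * x + intercept)
# Normality of errors: p > 0.05 suggests no strong departure from normality
stat, p_norm = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_norm)
# Independence of errors: a Durbin-Watson value near 2 suggests little autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))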
11. Analysis of Variance (ANOVA)
Test if means of three or more groups are significantly different.
One-Way ANOVA
Compare means across one categorical variable.
from scipy import stats
# Three groups
group1 = [23, 25, 27, 29, 31]
group2 = [30, 32, 34, 36, 38]
group3 = [35, 37, 39, 41, 43]
# H₀: μ₁ = μ₂ = μ₃
# H₁: At least one mean is different
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < 0.05:
    print("At least one group mean is significantly different")
else:
    print("No significant difference between group means")
Two-Way ANOVA
Examine effect of two categorical variables and their interaction.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Sample data
df = pd.DataFrame({
    'score': [23, 25, 27, 30, 32, 34, 35, 37, 39, 40, 42, 44],
    'method': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'],
    'gender': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F']
})
# Fit model
model = ols('score ~ C(method) + C(gender) + C(method):C(gender)', data=df).fit()
# ANOVA table
anova_table = anova_lm(model, typ=2)
print(anova_table)
Post-hoc Tests
After significant ANOVA, determine which groups differ.
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Data
data = [23, 25, 27, 29, 31, 30, 32, 34, 36, 38, 35, 37, 39, 41, 43]
groups = ['A']*5 + ['B']*5 + ['C']*5
# Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=data, groups=groups, alpha=0.05)
print(tukey)
12. Chi-Square Tests
Test relationships between categorical variables.
Chi-Square Goodness of Fit
Test if observed frequencies match expected distribution.
from scipy import stats
# Observed frequencies
observed = [30, 25, 20, 25]
# Expected frequencies (equal distribution)
expected = [25, 25, 25, 25]
# Chi-square test
chi2, p_value = stats.chisquare(observed, expected)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
if p_value < 0.05:
    print("Distribution significantly different from expected")
else:
    print("No significant difference from expected distribution")
Chi-Square Test of Independence
Test if two categorical variables are independent.
# Contingency table
# Rows: Gender (M/F), Columns: Preference (A/B/C)
observed = np.array([[30, 25, 20],
                     [20, 30, 25]])
# Chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Degrees of freedom:", dof)
print("\nExpected frequencies:\n", expected)
if p_value < 0.05:
    print("\nVariables are dependent (associated)")
else:
    print("\nVariables are independent (not associated)")
13. Bayesian Statistics
Update beliefs based on new evidence using Bayes' Theorem.
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
# Example: Medical test
# P(Disease) = 0.01 (1% of population has disease)
# P(Positive|Disease) = 0.95 (95% sensitivity)
# P(Positive|No Disease) = 0.05 (5% false positive rate)
p_disease = 0.01
p_no_disease = 1 - p_disease
p_pos_given_disease = 0.95
p_pos_given_no_disease = 0.05
# P(Positive)
p_positive = (p_pos_given_disease * p_disease +
              p_pos_given_no_disease * p_no_disease)
# P(Disease|Positive) using Bayes' Theorem
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
print("P(Disease|Positive Test):", p_disease_given_pos)
print(f"Only {p_disease_given_pos*100:.1f}% chance of having disease!")
# Prior and Posterior
print("\nPrior probability:", p_disease)
print("Posterior probability:", p_disease_given_pos)
Bayesian Updating
# Example: Coin flip bias estimation
from scipy.stats import beta
# Prior: Beta(1, 1) - uniform prior
a_prior, b_prior = 1, 1
# Observe 7 heads, 3 tails
heads, tails = 7, 3
# Posterior: Beta(a + heads, b + tails)
a_posterior = a_prior + heads
b_posterior = b_prior + tails
# Posterior mean (estimated probability of heads)
posterior_mean = a_posterior / (a_posterior + b_posterior)
print("Estimated P(Heads):", posterior_mean)
# 95% credible interval
lower = beta.ppf(0.025, a_posterior, b_posterior)
upper = beta.ppf(0.975, a_posterior, b_posterior)
print(f"95% Credible Interval: [{lower:.3f}, {upper:.3f}]")
14. Time Series Basics
Analyzing data points collected over time.
Components of Time Series
- Trend: Long-term increase or decrease
- Seasonality: Regular patterns that repeat at a fixed period
- Cyclicity: Longer-term fluctuations without a fixed period
- Noise: Random variation
import pandas as pd
import numpy as np
# Create time series
dates = pd.date_range('2023-01-01', periods=100, freq='D')
trend = np.linspace(10, 50, 100)
seasonality = 10 * np.sin(np.linspace(0, 4*np.pi, 100))
noise = np.random.normal(0, 2, 100)
values = trend + seasonality + noise
ts = pd.Series(values, index=dates)
# Moving average (smooth data)
ma_7 = ts.rolling(window=7).mean()
print("7-day moving average:\n", ma_7.tail())
# Exponential smoothing
ema = ts.ewm(span=7).mean()
print("\nExponential moving average:\n", ema.tail())
# Calculate returns (percent change)
returns = ts.pct_change()
print("\nDaily returns:\n", returns.tail())
Stationarity Testing
A stationary series has a constant mean and variance over time.
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test
result = adfuller(ts)
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:", result[4])
if result[1] < 0.05:
    print("\nSeries is stationary")
else:
    print("\nSeries is non-stationary")
# Make series stationary by differencing
ts_diff = ts.diff().dropna()
result_diff = adfuller(ts_diff)
print("\nAfter differencing p-value:", result_diff[1])
Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf
# Calculate autocorrelation
acf_values = acf(ts, nlags=20)
print("Autocorrelation:\n", acf_values)
# Note: Use plot_acf() and plot_pacf() for visualization
# (would require matplotlib)
15. Python for Statistics - Quick Reference
Essential Libraries
# Core libraries
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Statistical visualization
# Statistical libraries
from scipy import stats # Statistical functions
import statsmodels.api as sm # Statistical models
from sklearn.linear_model import LinearRegression # ML
# Install if needed:
# pip install numpy pandas scipy statsmodels scikit-learn matplotlib seaborn
Common Statistical Operations
# Generate sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000) # mean=100, std=15, n=1000
# Descriptive statistics
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Std Dev:", np.std(data))
print("Variance:", np.var(data))
print("Min:", np.min(data))
print("Max:", np.max(data))
print("25th percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))
# Pandas describe() - comprehensive summary
df = pd.DataFrame({'values': data})
print("\nPandas describe():")
print(df.describe())
# Skewness and Kurtosis
from scipy.stats import skew, kurtosis
print("\nSkewness:", skew(data))
print("Kurtosis:", kurtosis(data))
Statistical Tests Cheat Sheet
from scipy import stats
# Normality test
statistic, p = stats.shapiro(data) # Shapiro-Wilk test
print("Normal?", "Yes" if p > 0.05 else "No")
# t-tests
stats.ttest_1samp(sample, popmean) # One-sample
stats.ttest_ind(group1, group2) # Independent samples
stats.ttest_rel(before, after) # Paired samples
# Non-parametric tests
stats.mannwhitneyu(group1, group2) # Mann-Whitney U
stats.wilcoxon(before, after) # Wilcoxon signed-rank
stats.kruskal(group1, group2, group3) # Kruskal-Wallis
# Correlation
stats.pearsonr(x, y) # Pearson
stats.spearmanr(x, y) # Spearman
stats.kendalltau(x, y) # Kendall's tau
# Chi-square
stats.chisquare(observed, expected) # Goodness of fit
stats.chi2_contingency(contingency_table) # Independence
# ANOVA
stats.f_oneway(group1, group2, group3) # One-way ANOVA
Data Visualization for Statistics
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
# Histogram
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution')
# plt.show()
# Box plot
plt.boxplot([group1, group2, group3])
plt.xticks([1, 2, 3], ['Group 1', 'Group 2', 'Group 3'])
# plt.show()
# Scatter plot with regression line
sns.regplot(x=x, y=y)
# plt.show()
# Correlation heatmap
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
# plt.show()
# Q-Q plot (check normality)
from scipy import stats
stats.probplot(data, dist="norm", plot=plt)
# plt.show()
Key Takeaways
- Descriptive stats summarize data; inferential stats make predictions
- Mean for symmetric data, median for skewed data or outliers
- Standard deviation measures spread in same units as data
- Normal distribution is bell-shaped; many stats assume normality
- p-value < 0.05 typically means statistically significant
- Correlation ≠ causation - always remember this!
- Sample size matters - larger samples give more reliable results
- Check assumptions before applying statistical tests
- Visualize first - plots reveal patterns tests might miss
- Context matters - statistical significance ≠ practical significance
Further Learning Resources
- Books: "The Art of Statistics" by David Spiegelhalter, "Think Stats" by Allen Downey
- Online: Khan Academy Statistics, StatQuest YouTube channel
- Practice: Kaggle datasets, UCI Machine Learning Repository
- Python Docs: SciPy stats documentation, Statsmodels documentation
🎉 Statistics Mastered!
You now have the statistical foundation for data science. Ready to apply these concepts?