Machine Learning
Encyclopedia
Mathematical foundations · Model architectures · Training dynamics · Interview mastery · System design
Types of Machine Learning
Supervised Learning
Learn a mapping $f: X \to Y$ from labeled pairs $(x_i, y_i)$. Minimize empirical risk: $\hat{f} = \arg\min_f \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$
Unsupervised Learning
Discover structure in unlabeled data $\{x_i\}$ by modeling the data distribution $p(x)$ — clustering, density estimation, dimensionality reduction.
Reinforcement Learning
Agent learns policy $\pi$ by maximizing expected cumulative reward: $J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$
Semi-Supervised
Combines a small labeled set $\mathcal{L}$ with a large unlabeled set $\mathcal{U}$; a supervised loss on $\mathcal{L}$ is typically augmented with a consistency or pseudo-label term on $\mathcal{U}$.
Self-Supervised
Create supervision from the data itself using pretext tasks. Contrastive (InfoNCE) loss: $\mathcal{L} = -\log\dfrac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$
Transfer Learning
Adapt a model trained on source domain $\mathcal{D}_S$ to target domain $\mathcal{D}_T$, typically by fine-tuning with a low learning rate.
Linear Models
Linear Regression
Model: $\hat{y} = \mathbf{w}^T\mathbf{x} + b$
Closed-form (Normal Equation): $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Geometric interpretation: project $\mathbf{y}$ onto column space of $\mathbf{X}$.
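The closed form is easy to check numerically; a minimal numpy sketch with synthetic data (variable names illustrative):

```python
import numpy as np

# Solve least squares via the normal equation: w_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# np.linalg.solve is more numerically stable than forming the explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```

In practice `np.linalg.lstsq` (SVD-based) is preferred when $\mathbf{X}^T\mathbf{X}$ is ill-conditioned.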
Logistic Regression
Sigmoid squashes the logit $z = \mathbf{w}^T\mathbf{x} + b$ to a probability: $\hat{p} = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
Log-likelihood (cross-entropy): $\mathcal{L} = -\sum_i \left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$
Support Vector Machine (SVM)
Find the maximum-margin hyperplane. Primal form: $\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2$ s.t. $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
Dual (kernel trick enabled): $\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$ s.t. $\alpha_i \geq 0$, $\sum_i \alpha_i y_i = 0$
RBF Kernel: $K(\mathbf{x},\mathbf{x}') = \exp\!\left(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\right)$
Activation Functions
Sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
Tanh: $\tanh'(x) = 1 - \tanh^2(x)$
ReLU: $f'(x) = \mathbb{1}[x > 0]$
GELU: $f(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF
Leaky ReLU: $f(x) = \max(\alpha x, x)$, $\alpha$ typically $0.01$
Softmax: outputs sum to 1; used for multi-class outputs
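A quick numpy sketch of these formulas, checking the sigmoid derivative against a finite difference (values illustrative):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-2.0, 2.0, 5)

d_sigmoid = sigmoid(x) * (1 - sigmoid(x))   # sigma'(x) = sigma(1 - sigma)
d_tanh = 1 - np.tanh(x) ** 2                # tanh'(x)
relu_grad = (x > 0).astype(float)           # 1[x > 0]

# Central finite difference agrees with the analytic derivative:
eps = 1e-6
num = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(d_sigmoid, num, atol=1e-5))  # True

# Softmax (shifted by the max for numerical stability) sums to 1:
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(softmax(x).sum())  # 1.0
```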
Feed-Forward Neural Network
Forward Pass
Layer $l$ computation: $\mathbf{a}^{(l)} = \phi\!\left(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)$
where $\phi$ is a non-linear activation. Stack $L$ layers: $f(\mathbf{x}) = \mathbf{W}^{(L)}\phi\!\left(\cdots\phi\!\left(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right)\cdots\right) + \mathbf{b}^{(L)}$
Universal Approximation Theorem
A feed-forward network with one hidden layer and a non-polynomial activation can approximate any continuous function on a compact set $K$ to arbitrary accuracy. Formally, $\forall \epsilon > 0$, $\exists$ network $f_\theta$ s.t. $\sup_{x \in K} |f(x) - f_\theta(x)| < \epsilon$
Parameter Count
For layers of sizes $[n_0, n_1, \ldots, n_L]$: $\text{params} = \sum_{l=1}^{L}\left(n_{l-1} n_l + n_l\right)$ (weights plus biases)
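The count is easy to compute directly; a small helper (function name illustrative):

```python
def mlp_param_count(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# e.g. a 784-256-10 MNIST-style classifier:
print(mlp_param_count([784, 256, 10]))  # 203530
```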
Recurrent Neural Network (RNN)
Core Equations
Hidden state update at time $t$: $\mathbf{h}_t = \tanh\!\left(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h\right)$
Output: $\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y$
BPTT (Backpropagation Through Time)
Unroll the recurrence over $T$ steps and apply the chain rule. Gradients accumulate products of Jacobians $\prod_t \partial\mathbf{h}_t/\partial\mathbf{h}_{t-1}$, which vanish or explode for long sequences.
Long Short-Term Memory (LSTM)
Forget gate $\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f\right)$ decides what to discard from the cell state. Output in $[0,1]$.
Element-wise operations preserve gradients through time — the "highway" for gradients.
Why LSTM Solves Vanishing Gradient
The cell state update $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$ creates an additive shortcut. Gradient flows through this path without repeated matrix multiplication: $\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_{t-1}} = \text{diag}(\mathbf{f}_t)$, so gradients survive as long as the forget gate stays near 1.
Gated Recurrent Unit (GRU)
GRU Equations (Cho et al., 2014)
Reset gate — how much past to forget: $\mathbf{r}_t = \sigma\!\left(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]\right)$
Update gate — interpolation coefficient: $\mathbf{z}_t = \sigma\!\left(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]\right)$
Candidate hidden state: $\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]\right)$
Final hidden state: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$
LSTM vs GRU Comparison
| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 (f, i, o) | 2 (r, z) |
| States | $\mathbf{h}_t, \mathbf{C}_t$ | $\mathbf{h}_t$ only |
| Parameters | $4(n_h^2 + n_h n_x)$ | $3(n_h^2 + n_h n_x)$ |
| Performance | Slightly better on long sequences | Faster, similar accuracy |
| Interpretability | Cell state explicit | Simpler structure |
Attention Mechanism
Scaled Dot-Product Attention
Given queries $\mathbf{Q}$, keys $\mathbf{K}$, values $\mathbf{V}$: $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\dfrac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$
Scaling by $\sqrt{d_k}$ prevents dot products from growing large in high dimensions, which would push softmax into regions with tiny gradients.
Intuition
Each query "looks up" relevant keys. High similarity → high attention weight → more value contribution.
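The formula in plain numpy — a sketch, not an optimized implementation (shapes illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=s) for s in [(4, 8), (6, 8), (6, 8)])
out, w = attention(Q, K, V)
print(out.shape)       # (4, 8): one output per query
print(w.sum(axis=-1))  # each query's attention weights sum to 1
```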
Multi-Head Attention
$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$, where $\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$. Each head attends to different representation subspaces — syntactic, semantic, positional relationships simultaneously. Total params: $4d_\text{model}^2$ (one $d_\text{model} \times d_\text{model}$ matrix each for $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V, \mathbf{W}^O$).
Transformer Architecture
Positional Encoding
Inject sequence order (no recurrence): $PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$
Feed-Forward Sub-Layer
$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$ — applied position-wise. $d_{ff}$ typically $4 \times d_\text{model}$.
Layer Normalization
$\text{LN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \dfrac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \boldsymbol{\beta}$ — normalizes across the feature dim. Enables stable deep training.
Complexity Comparison
| Model | Per-layer | Sequential |
|---|---|---|
| RNN | $O(nd^2)$ | $O(n)$ |
| LSTM | $O(nd^2)$ | $O(n)$ |
| Transformer | $O(n^2d)$ | $O(1)$ |
| Conv (k) | $O(knd^2)$ | $O(n/k)$ |
Embeddings
Word2Vec — Skip-Gram
Predict context from center word. Objective: $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$
GloVe — Global Vectors
Factorize co-occurrence matrix $X$: $\mathcal{L} = \sum_{i,j} f(X_{ij})\left(\mathbf{w}_i^T\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$
Weighting function $f(x) = (x/x_{\max})^\alpha$ if $x < x_{\max}$, else $1$. Combines local (Word2Vec) + global (co-occurrence) statistics.
Token Embeddings in Transformers
Input is a lookup table $\mathbf{E} \in \mathbb{R}^{|V| \times d}$: token $t_i$ selects row $\mathbf{E}_{t_i}$, to which the positional encoding is added.
Tied embeddings (input = output weight matrix) halve params and improve perplexity. Used in GPT, BERT.
Subword Tokenization (BPE)
Merge the most frequent byte pairs iteratively. Vocab size $V$ balances coverage against sequence length.
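A toy sketch of the merge loop on a three-word corpus (corpus and merge count are illustrative; real BPE also tracks word-end markers):

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {" ".join(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            syms = word.split()
            for pair in zip(syms, syms[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair
        merges.append(best)
        vocab = {w.replace(" ".join(best), "".join(best)): c
                 for w, c in vocab.items()}  # fuse the pair everywhere
    return merges

print(bpe_merges({"low": 5, "lower": 2, "lowest": 3}, 2))
```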
Loss Functions
Mean Squared Error (MSE)
Penalizes large errors quadratically. Assumes Gaussian noise. Gradient: $\frac{2}{n}(\hat{y}-y)$.
MAE / L1 Loss
$\mathcal{L} = \frac{1}{n}\sum_i |\hat{y}_i - y_i|$ — robust to outliers. Non-differentiable at 0 (use subgradient). Assumes Laplace noise.
Huber Loss
$\mathcal{L}_\delta = \frac{1}{2}(\hat{y}-y)^2$ if $|\hat{y}-y| \leq \delta$, else $\delta|\hat{y}-y| - \frac{1}{2}\delta^2$. MSE for small errors, MAE for large. Differentiable everywhere. Best of both worlds.
Cross-Entropy (Binary)
Maximizes log-likelihood of Bernoulli model. Gradient w.r.t. logit: $\hat{p} - y$ (clean!).
Categorical Cross-Entropy
KL divergence from true to predicted distribution: $\mathcal{L} = H(y, \hat{p}) = H(y) + D_{\text{KL}}(y\|\hat{p})$
KL Divergence
Asymmetric: $D_{\text{KL}}(P\|Q) \neq D_{\text{KL}}(Q\|P)$. Used in VAEs, knowledge distillation, RL (PPO).
Hinge Loss (SVM)
$\mathcal{L} = \max(0,\, 1 - y\hat{y})$ with $y \in \{-1, +1\}$. Only misclassified and margin-violating samples contribute. Correct and confident → 0 loss.
Focal Loss
Down-weights easy examples via $(1-p_t)^\gamma$. Designed for class imbalance (RetinaNet). $\gamma=2$ is common.
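The down-weighting is easy to see numerically; a minimal binary focal loss sketch (function name illustrative):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t), with p_t = p if y == 1 else 1 - p."""
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1):
print(float(focal_loss(0.9, 1)))  # ~0.001
print(float(focal_loss(0.1, 1)))  # ~1.87
```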
Contrastive / Triplet Loss
Triplet: $\mathcal{L} = \max(0,\, d(a,p) - d(a,n) + m)$ for anchor $a$, positive $p$, negative $n$. Margin $m$ ensures separation. Used in face recognition, embeddings.
ELBO (VAE Loss)
Reconstruction term + KL regularization. Tight lower bound on log-likelihood $\log p(x)$.
Optimizers
Gradient Descent Family
Stochastic Gradient Descent
Noisy but can escape local minima. Converges at $O(1/\sqrt{T})$ for convex.
Momentum
Accumulates velocity. Damps oscillations in shallow dimensions. $\beta=0.9$ standard.
Adaptive Gradient
Per-parameter learning rates. Rare features get larger updates. LR decays to 0 — problem!
Root Mean Square Propagation
Fixes AdaGrad's decaying LR with exponential moving average. $\rho=0.9$ typical.
Adam — Adaptive Moment Estimation
Default: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\eta=10^{-3}$. Most widely used optimizer.
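The update rule as a self-contained sketch, minimizing $f(w) = w^2$ (step count and starting point are illustrative):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: EMAs of gradient (m) and squared gradient (v),
    each bias-corrected by 1 - beta^t."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * w                  # gradient of f(w) = w^2
    w, m, v = adam_step(w, g, m, v, t)
print(w[0])  # moved from 5.0 toward the minimum at 0
```

Note the near-constant step size: with a consistent gradient, $\hat{m}/\sqrt{\hat{v}} \approx 1$, so Adam moves roughly $\eta$ per step regardless of gradient magnitude.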
Adam + Weight Decay (Decoupled)
Decouples L2 regularization from gradient scaling. Better generalization than Adam+L2. Default for LLMs (GPT, BERT fine-tuning).
Nesterov + Adam
Look-ahead gradient (Nesterov) + adaptive rates (Adam). Often converges slightly faster.
EvoLved Sign Momentum (2023)
Uses sign of gradient. Memory-efficient (1 state vs Adam's 2). Competitive with AdamW on large models.
Learning Rate Schedules
Cosine Annealing
$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$
Linear Warmup
$\eta_t = \eta_{\max} \cdot t / T_w$ for $t \leq T_w$ — prevents early divergence with adaptive optimizers.
Transformer Schedule
$\eta_t = d_\text{model}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot T_w^{-1.5}\right)$ — warmup then inverse-square-root decay.
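A common combination is linear warmup followed by cosine decay; a sketch with illustrative hyperparameters:

```python
import math

def lr_at(step, max_lr=3e-4, warmup=100, total=1000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(50))    # mid-warmup: half of max_lr
print(lr_at(100))   # peak at end of warmup
print(lr_at(1000))  # fully decayed to min_lr
```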
Regularization Techniques
L2 Regularization (Weight Decay)
Gradient adds $\lambda\mathbf{w}$ — pulls weights toward 0. Equivalent to a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$. Shrinks all weights proportionally.
L1 Regularization (Lasso)
Induces sparsity — some weights exactly 0. Feature selection. Equivalent to Laplace prior. Gradient: $\lambda\,\text{sign}(\mathbf{w})$.
Dropout
Scale at inference: $\hat{\mathbf{h}} = p\mathbf{h}$ (inverted dropout instead scales by $1/p$ at training time). Equivalent to training $2^n$ subnetworks. Acts as an ensemble. Reduces co-adaptation.
Batch Normalization
$\hat{x} = \dfrac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $y = \gamma\hat{x} + \beta$ — normalizes per-batch, per-feature. Enables higher LR, reduces sensitivity to initialization.
Data Augmentation
Implicitly increases training set. For images:
Mixup: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$, $\tilde{y} = \lambda y_i + (1-\lambda)y_j$
Early Stopping
Monitor validation loss. Stop when it stops decreasing for $k$ epochs (patience).
Equivalent to L2 regularization in linear models. Free — no additional compute cost.
Backpropagation
Chain Rule — The Core Idea
For composite $\mathcal{L} = f(g(\mathbf{w}))$: $\dfrac{\partial\mathcal{L}}{\partial\mathbf{w}} = \dfrac{\partial f}{\partial g}\dfrac{\partial g}{\partial\mathbf{w}}$
For a layer $l$ with pre-activations $\mathbf{z}^{(l)}$: $\boldsymbol{\delta}^{(l)} = \dfrac{\partial\mathcal{L}}{\partial\mathbf{z}^{(l)}} = \left(\mathbf{W}^{(l+1)T}\boldsymbol{\delta}^{(l+1)}\right)\odot\phi'(\mathbf{z}^{(l)})$
Computational Graph Perspective
Forward pass: compute and cache all intermediate values.
Backward pass: traverse graph in reverse, accumulate gradients via chain rule.
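The two passes for a tiny two-layer ReLU network, with one analytic gradient entry checked against a finite difference (shapes and names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 2))
W1 = 0.5 * rng.normal(size=(3, 4))
W2 = 0.5 * rng.normal(size=(4, 2))

def forward(W1, W2):
    h = np.maximum(0.0, x @ W1)           # cache h for the backward pass
    L = 0.5 * np.mean((h @ W2 - y) ** 2)  # MSE loss
    return L, h

# Backward pass: traverse in reverse, accumulate via the chain rule.
L, h = forward(W1, W2)
d_out = (h @ W2 - y) / y.size             # dL/d(output)
dW2 = h.T @ d_out
dh = d_out @ W2.T
dW1 = x.T @ (dh * (h > 0))                # ReLU gates the gradient

# Finite-difference check on one entry of W1:
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
num = (forward(Wp, W2)[0] - forward(Wm, W2)[0]) / (2 * eps)
print(abs(dW1[0, 0] - num))  # tiny
```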
Convolutional Neural Networks
Convolution Operation
Output size: $\lfloor\frac{W-F+2P}{S}\rfloor + 1$
where $W$=input, $F$=filter, $P$=padding, $S$=stride.
Parameters per filter: $F \times F \times C_{\text{in}}$ (+1 bias). Total: $(F \times F \times C_{\text{in}} + 1) \times C_{\text{out}}$.
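Both formulas as one-liners (helper names illustrative):

```python
def conv_out(W, F, P, S):
    """floor((W - F + 2P) / S) + 1"""
    return (W - F + 2 * P) // S + 1

def conv_params(F, C_in, C_out):
    """F*F*C_in weights per filter, C_out filters, plus C_out biases."""
    return F * F * C_in * C_out + C_out

print(conv_out(224, 7, 3, 2))   # 112 (e.g. a 7x7 stride-2 stem on a 224px input)
print(conv_params(3, 64, 128))  # 73856
```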
Receptive Field
Grows with depth: $r_l = r_{l-1} + (F_l - 1)\prod_{i=1}^{l-1} S_i$ — strided layers multiply how much input each later filter sees.
Pooling & Depth
Max Pooling: $y = \max_{(i,j) \in R} x_{ij}$ — translation invariance
Avg Pooling: $y = \frac{1}{|R|}\sum_{(i,j)\in R} x_{ij}$
Global Avg Pooling: Collapses spatial dims → classification.
Key Architectures
| Model | Year | Key Innovation |
|---|---|---|
| AlexNet | 2012 | ReLU, Dropout, GPU |
| VGG | 2014 | Deep 3×3 filters |
| ResNet | 2015 | Skip connections |
| EfficientNet | 2019 | Compound scaling |
| ConvNeXt | 2022 | Modernized CNN |
ResNet Skip Connection
Gradient flows directly: $\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}\right)$. The $+1$ prevents vanishing gradient, enabling 1000+ layer networks.
Generative Adversarial Networks
Minimax Game
$\min_G \max_D\; \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
Discriminator: maximize log-prob of real vs fake
Generator: minimize log-prob of being caught (or maximize $\log D(G(z))$ — non-saturating)
Variants
| Model | Key Idea |
|---|---|
| WGAN | Wasserstein distance, Lipschitz constraint |
| cGAN | Conditional on class label $y$ |
| CycleGAN | Unpaired image translation |
| StyleGAN | Style-based generator, disentanglement |
WGAN Loss
$\min_G \max_{D \in 1\text{-Lipschitz}}\; \mathbb{E}_{x \sim p_\text{data}}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]$
Variational Autoencoders
Encoder + Decoder
Encoder: $q_\phi(z|x)$ — posterior approximation
Decoder: $p_\theta(x|z)$ — likelihood
Prior: $p(z) = \mathcal{N}(0, I)$
ELBO Objective
$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)$
KL term forces latent to match prior (smooth, continuous latent space).
Reparameterization Trick
Cannot backprop through sampling. Instead: $z = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
Moves randomness outside gradient path. Gradient flows through $\mu$, $\sigma$.
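A numerical sketch: with $z = \mu + \sigma\epsilon$ the sample is a deterministic, differentiable function of $(\mu, \sigma)$, and its statistics match the target Gaussian (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 1.5, np.log(0.25)        # target N(1.5, 0.5^2)
eps = rng.normal(size=100_000)         # randomness lives here, outside the model
z = mu + np.exp(0.5 * log_var) * eps   # differentiable in mu and log_var

print(z.mean())  # ~1.5
print(z.std())   # ~0.5
```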
KL for Gaussian
$D_{\text{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right) = -\frac{1}{2}\sum_j\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$
Analytical! No Monte Carlo needed for this term.
Reinforcement Learning
Bellman Equation
Value function satisfies recursive self-consistency: $V^\pi(s) = \mathbb{E}_\pi\!\left[r + \gamma V^\pi(s')\right]$. Optimal: $Q^*(s,a) = \mathbb{E}\!\left[r + \gamma\max_{a'}Q^*(s',a')\right]$
Policy Gradient (REINFORCE)
$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[\nabla_\theta \log\pi_\theta(a_t|s_t)\,(G_t - b(s_t))\right]$, where $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}$ (return). The estimator has high variance — the baseline $b(s)$ reduces it without adding bias.
PPO (Proximal Policy Optimization)
$\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)A_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t\right)\right]$, with $r_t(\theta) = \pi_\theta(a|s)/\pi_{\theta_\text{old}}(a|s)$. Clips the policy update to prevent large steps. RLHF for LLMs uses PPO.
DQN Key Ideas
Experience Replay: $\mathcal{D} = \{(s,a,r,s')\}$ — breaks correlations
Target Network: $\theta^-$ updated every $C$ steps — stabilizes training
Interview Questions & Answers
ML System Design
Framework: ORCAS
- Objective: Define the ML problem. What metric matters? Business KPI → proxy metric. E.g., "maximize watch time" → CTR + completion rate.
- Requirements: Latency (p99), throughput (QPS), accuracy threshold, freshness, training frequency, data constraints, budget.
- Collection (data): Data sources, labeling strategy (human/weak supervision/self-supervised), class imbalance handling, feature store.
- Architecture: Model choice — start simple (logistic reg → tree → DNN). Feature engineering, embedding strategy, serving architecture.
- Serving: Online vs batch inference, model compression (quantization, distillation, pruning), A/B testing, monitoring.
Recommendation System
Two-tower model: score $s(u, i) = \langle f_\text{user}(u),\, g_\text{item}(i)\rangle$
User tower + Item tower trained with in-batch negatives. Serve via FAISS/ScaNN for ANN retrieval.
Search Ranking
Learning-to-Rank: Pointwise (regression), Pairwise (RankNet: $P_{ij} = \sigma(s_i - s_j)$), Listwise (LambdaRank).
Features: BM25, semantic similarity, CTR, freshness, authority.
Real-Time Fraud Detection
Constraints: <50ms latency, high precision (false positive = customer friction).
Architecture: Feature store → gradient boosted trees (fast inference) + async DNN for enrichment.
Challenge: Concept drift — retrain daily on recent data + monitor PSI.
Model Compression
Quantization: FP32 → INT8. $4\times$ smaller, $2-4\times$ faster. PTQ vs QAT.
Pruning: Remove weights with $|w| < \epsilon$. Structured (whole neurons) vs unstructured.
Distillation: Student learns soft targets from the teacher: $\mathcal{L} = \alpha\,\tau^2\, D_{\text{KL}}\!\left(p_T^{(\tau)} \,\|\, p_S^{(\tau)}\right) + (1-\alpha)\,\text{CE}(y, p_S)$, with logits softened by temperature $\tau$.
LoRA: $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, $r \ll \min(d,k)$.
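The parameter savings follow directly from the shapes; a quick sketch (dimensions illustrative of a large attention projection):

```python
import numpy as np

d, k, r = 4096, 4096, 8
full = d * k                 # dense Delta W: ~16.8M params
lora = d * r + r * k         # B (d x r) plus A (r x k): ~65.5K params
print(full // lora)          # 256x fewer trainable parameters

# The update's rank is bounded by r (small numeric check):
rng = np.random.default_rng(0)
B, A = rng.normal(size=(64, 4)), rng.normal(size=(4, 64))
print(np.linalg.matrix_rank(B @ A))  # 4
```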
Glossary of Key Terms
Quick-Reference Cheat Sheet
Complexity & Scaling Guide
Time & Space Complexity Comparison
| Architecture | Train Time / Step | Inference | Memory (params) | Sequence Ops |
|---|---|---|---|---|
| Linear / Logistic | $O(nd)$ | $O(d)$ | $O(d)$ | — |
| MLP (L layers) | $O(nd^2L)$ | $O(d^2L)$ | $O(d^2L)$ | — |
| CNN (k×k filter) | $O(nk^2d^2L)$ | $O(k^2d^2L)$ | $O(k^2d^2L)$ | $O(n/k)$ |
| RNN | $O(nd_h^2)$ | $O(d_h^2)$ | $O(d_h^2)$ | $O(n)$ |
| LSTM | $O(4nd_h^2)$ | $O(4d_h^2)$ | $O(4d_h(d_h+d_x))$ | $O(n)$ |
| Transformer | $O(n^2d)$ | $O(n^2d)$ | $O(4d^2)$ per layer | $O(1)$ |
| Linear Attn. | $O(nd^2)$ | $O(d^2)$ | $O(4d^2)$ per layer | $O(1)$ |
Scaling Laws (Chinchilla)
Optimal token count for a model of $N$ parameters: $D^* \approx 20N$ (roughly 20 tokens per parameter)
Loss as a function of compute $C = 6ND$: $L(C) \propto C^{-\alpha_C}$
$\alpha_C \approx 0.057$. Both model size and data scale equally — compute-optimal training is the Chinchilla regime. GPT-3 (175B, 300B tokens) was severely undertrained.
GPU Memory Estimation
For a model with $N$ parameters, full training memory: $\approx 16N$ bytes (excluding activations)
Breakdown (mixed precision + Adam): fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4) + Adam $m$ (4) + Adam $v$ (4) = 16 bytes/param
e.g., 7B model ≈ 112 GB → needs 2× A100 80GB minimum. Activation checkpointing trades compute for memory: $O(\sqrt{L})$ vs $O(L)$.
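The 16-bytes-per-parameter rule as a helper (activation memory excluded; names illustrative):

```python
def train_mem_gb(n_params, bytes_per_param=16):
    """fp16 weights (2) + fp16 grads (2) + fp32 master (4) + Adam m, v (4 + 4)."""
    return n_params * bytes_per_param / 1e9

print(train_mem_gb(7e9))    # 112.0 GB for a 7B model
print(train_mem_gb(175e9))  # 2800.0 GB at GPT-3 scale
```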
Weight Initialization
Why Initialization Matters
Bad initialization → vanishing or exploding activations/gradients from layer 1. Consider $L$ layers with weights $W^{(l)} \in \mathbb{R}^{n \times n}$. Variance of activations: $\text{Var}(a^{(L)}) \approx \left[n\,\text{Var}(w)\right]^L \text{Var}(x)$
This blows up if $n\,\text{Var}(w) > 1$ and vanishes if $< 1$. Initialization schemes choose $\text{Var}(w)$ to keep variance $\approx 1$ through all layers.
Xavier / Glorot (2010)
Designed for tanh / sigmoid activations. Balances forward and backward signal: $W \sim \mathcal{U}\!\left[-\sqrt{\tfrac{6}{n_\text{in}+n_\text{out}}},\; \sqrt{\tfrac{6}{n_\text{in}+n_\text{out}}}\right]$
Or equivalently: $\text{Var}(W) = \dfrac{2}{n_{\text{in}}+n_{\text{out}}}$
Derivation: choose variance so that $\text{Var}(y_i) = \text{Var}(x_i)$ through both forward and backward pass simultaneously.
He / Kaiming (2015)
Designed for ReLU / Leaky ReLU. Accounts for ReLU zeroing half the neurons (halves effective fan-in): $\text{Var}(W) = \dfrac{2}{n_\text{in}}$
For Leaky ReLU with slope $a$: $\text{Var}(W) = \dfrac{2}{(1 + a^2)\,n_\text{in}}$
Default in PyTorch for Conv and Linear layers. Always prefer over Xavier when using ReLU variants.
Orthogonal Init
Initialize the weight matrix as a random orthogonal matrix ($W^TW = I$), e.g. via the QR decomposition of a random Gaussian matrix.
Preserves gradient norms exactly at initialization. Especially useful for RNNs (prevents vanishing/exploding in recurrent weights). Also used in deep linear networks research.
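The effect is easy to verify by pushing random data through a deep ReLU stack; with He scaling the activation scale stays order-1 (depth and width illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 30
x = rng.normal(size=(1000, n))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # He: Var(w) = 2 / n_in
    x = np.maximum(0.0, x @ W)
print(x.std())  # order 1 after 30 layers: neither vanished nor exploded
```

Replacing the He scale with, say, `np.sqrt(1.0 / n)` makes the activations shrink by roughly $\sqrt{1/2}$ per layer and collapse toward zero by layer 30.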
Practical Initialization Rules
| Activation | Use | Gain |
|---|---|---|
| Linear | Xavier Normal | $1$ |
| tanh | Xavier Normal | $5/3$ |
| Sigmoid | Xavier Normal | $1$ |
| ReLU | He / Kaiming | $\sqrt{2}$ |
| Leaky ReLU | He / Kaiming | $\sqrt{2/(1+a^2)}$ |
| SELU | LeCun Normal | $1$ |