// COMPREHENSIVE REFERENCE

Machine Learning
Encyclopedia

Mathematical foundations · Model architectures · Training dynamics · Interview mastery · System design

40+ Topics
100+ Equations
60+ Interview Q&A

Types of Machine Learning

👁️

Supervised Learning

Learn a mapping $f: X \to Y$ from labeled pairs $(x_i, y_i)$. Minimize empirical risk:

$$\hat{f} = \arg\min_f \frac{1}{n}\sum_{i=1}^n \mathcal{L}(f(x_i), y_i)$$
Classification Regression Object Detection
🔍

Unsupervised Learning

Discover structure in unlabeled data $\{x_i\}$. Learn the data distribution $p(x)$:

$$\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$
Clustering Density Estimation Dimensionality Reduction
🎮

Reinforcement Learning

Agent learns policy $\pi$ by maximizing expected cumulative reward:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^T \gamma^t r_t\right]$$
Policy Gradient Q-Learning Actor-Critic
🤝

Semi-Supervised

Combines small labeled set $\mathcal{L}$ and large unlabeled set $\mathcal{U}$:

$$\mathcal{L}_{total} = \mathcal{L}_{sup} + \lambda \mathcal{L}_{unsup}$$
Self-Training Pseudo-Labels Consistency Reg.
🧠

Self-Supervised

Create supervision from data itself using pretext tasks. Contrastive loss:

$$\mathcal{L} = -\log\frac{e^{s(z_i,z_j^+)/\tau}}{\sum_k e^{s(z_i,z_k)/\tau}}$$
SimCLR BERT MLM MAE
🔄

Transfer Learning

Adapt model trained on source $\mathcal{D}_S$ to target $\mathcal{D}_T$. Fine-tune with low LR:

$$\theta_T = \theta_S - \eta \nabla_\theta \mathcal{L}_T(\theta_S)$$
Fine-tuning LoRA Domain Adaptation

Linear Models

Linear Regression

Model: $\hat{y} = \mathbf{w}^T\mathbf{x} + b$

Closed-form (Normal Equation):

$$\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Geometric interpretation: project $\mathbf{y}$ onto column space of $\mathbf{X}$.

```python
import numpy as np

# Gradient: ∂MSE/∂w = -2 X^T (y - Xw) / n
w = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation; solve() is more stable than inv()
```

Logistic Regression

Sigmoid squashes logit to probability:

$$\sigma(z) = \frac{1}{1+e^{-z}}, \quad \hat{p} = \sigma(\mathbf{w}^T\mathbf{x}+b)$$

Log-likelihood (cross-entropy):

$$\mathcal{L} = -\frac{1}{n}\sum_i[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$$
💡 Decision boundary is linear in feature space: $\mathbf{w}^T\mathbf{x}+b=0$
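The sigmoid, cross-entropy loss, and its clean gradient $(\hat{p}-y)\,\mathbf{x}$ can be exercised end-to-end in a few lines. A minimal NumPy sketch on hypothetical, linearly separable toy data:

```python
import numpy as np

def sigmoid(z):
    # Numerically stable sigmoid (avoids overflow for large |z|)
    return np.where(z >= 0, 1 / (1 + np.exp(-np.abs(z))),
                    np.exp(-np.abs(z)) / (1 + np.exp(-np.abs(z))))

def bce_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # separable labels
w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(200):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y) / len(y))      # gradient: (p - y) x
    b -= lr * np.mean(p - y)

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

The gradient step uses exactly the $\hat{p}-y$ form noted above; no autodiff needed.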

Support Vector Machine (SVM)

Find maximum-margin hyperplane. Primal form:

$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1$$

Dual (kernel trick enabled):

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j)$$

RBF Kernel: $K(\mathbf{x},\mathbf{x}') = \exp\!\left(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\right)$

Activation Functions

Sigmoid
$$\sigma(x)=\frac{1}{1+e^{-x}}$$

$\sigma'(x) = \sigma(x)(1-\sigma(x))$

✓ Probabilistic output ✗ Vanishing gradient
Tanh
$$\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$

$\tanh'(x) = 1 - \tanh^2(x)$

✓ Zero-centered ✗ Still saturates
ReLU
$$\text{ReLU}(x)=\max(0,x)$$

$f'(x) = \mathbb{1}[x > 0]$

✓ No vanishing gradient ✗ Dying ReLU
GELU
$$\text{GELU}(x)=x\,\Phi(x)$$

$\Phi$: standard normal CDF

✓ Used in Transformers ✗ Computationally expensive
Leaky ReLU
$$f(x) = \begin{cases} x & x>0 \\ \alpha x & x\leq 0\end{cases}$$

$\alpha$ typically $0.01$

✓ Fixes dying ReLU ✗ Not always better
Softmax
$$\text{softmax}(z_i)=\frac{e^{z_i}}{\sum_j e^{z_j}}$$

Outputs sum to 1; for multi-class

✓ Probability distribution ✗ Not for hidden layers
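The softmax above is shift-invariant, so subtracting the row max before exponentiating gives the same result without overflow. A minimal NumPy sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    # softmax(z) == softmax(z - c): subtract the max for numerical stability
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

probs = softmax(np.array([1.0, 2.0, 3.0]))
big = softmax(np.array([1000.0, 1000.0]))   # naive exp(1000) would overflow
```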

Feed-Forward Neural Network

Forward Pass

Layer $l$ computation:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = \phi\!\left(\mathbf{z}^{(l)}\right)$$

where $\phi$ is a non-linear activation. Stack $L$ layers:

$$f(\mathbf{x}) = \mathbf{W}^{(L)}\phi(\mathbf{W}^{(L-1)}\cdots\phi(\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)})\cdots)$$

Universal Approximation Theorem

A 1-hidden-layer network with enough neurons can approximate any continuous function on a compact set to arbitrary precision.

Formally, $\forall \epsilon > 0$, $\exists$ network $f_\theta$ s.t. $\sup_{x \in K} |f(x) - f_\theta(x)| < \epsilon$

Parameter Count

For layers of sizes $[n_0, n_1, \ldots, n_L]$:

$$\text{params} = \sum_{l=1}^L (n_{l-1} \cdot n_l + n_l)$$
```python
layer_sizes = [784, 256, 128, 10]
params = sum(l1 * l2 + l2 for l1, l2 in zip(layer_sizes, layer_sizes[1:]))
# = 784*256+256 + 256*128+128 + 128*10+10
# = 235,146 parameters
```

Recurrent Neural Network (RNN)

Core Equations

Hidden state update at time $t$:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$

Output:

$$\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y$$
⚠️ Suffers from vanishing/exploding gradients over long sequences. Gradient scales as $\prod_{t} \partial \mathbf{h}_t/\partial \mathbf{h}_{t-1}$.
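The hidden-state update is one matrix-vector product plus a tanh; unrolling it over a sequence is a simple loop. A minimal NumPy sketch with illustrative dimensions:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
d_h, d_x, T = 4, 3, 5
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
W_xh = rng.normal(scale=0.5, size=(d_h, d_x))
b_h = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(T):                 # unroll over a length-T sequence
    h = rnn_step(h, rng.normal(size=d_x), W_hh, W_xh, b_h)
```

The repeated multiplication by `W_hh` in this loop is exactly the product term that makes gradients vanish or explode.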

BPTT (Backpropagation Through Time)

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \sum_t \frac{\partial \mathcal{L}_t}{\partial \mathbf{W}}$$
$$\frac{\partial \mathcal{L}_t}{\partial \mathbf{W}} = \sum_{k=1}^t \frac{\partial \mathcal{L}_t}{\partial \mathbf{y}_t}\frac{\partial \mathbf{y}_t}{\partial \mathbf{h}_t}\left(\prod_{j=k}^{t-1}\frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{h}_j}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}}$$

Long Short-Term Memory (LSTM)

🚪 Forget Gate
$$\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$

Decides what to discard from cell state. Output in $[0,1]$.

➕ Input Gate
$$\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_i)$$
$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_C[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_C)$$
🔋 Cell State Update
$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$$

Element-wise operations preserve gradients through time — the "highway" for gradients.

📤 Output Gate
$$\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_o)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$$

Why LSTM Solves Vanishing Gradient

The cell state update $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$ creates an additive shortcut. Gradient flows through this path without repeated matrix multiplication:

$$\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_{t-1}} = \mathbf{f}_t \quad \text{(element-wise, no matrix multiply)}$$
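The four gate equations above can be sketched as a single step function. This is a minimal NumPy version with the gate weights stacked into one matrix (a common implementation trick; layout is an assumption here):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. W has shape (4*d_h, d_h + d_x): rows are the
    forget, input, candidate, and output blocks stacked in that order."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*d_h:1*d_h])        # forget gate
    i = sigmoid(z[1*d_h:2*d_h])        # input gate
    g = np.tanh(z[2*d_h:3*d_h])        # candidate cell
    o = sigmoid(z[3*d_h:4*d_h])        # output gate
    c = f * c_prev + i * g             # additive cell update (gradient highway)
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W = rng.normal(scale=0.1, size=(4*d_h, d_h + d_x))
b = np.zeros(4*d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(10):
    h, c = lstm_step(h, c, rng.normal(size=d_x), W, b)
```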

Gated Recurrent Unit (GRU)

GRU Equations (Cho et al., 2014)

Reset gate — how much past to forget:

$$\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t])$$

Update gate — interpolation coefficient:

$$\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])$$

Candidate hidden state:

$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}[\mathbf{r}_t\odot\mathbf{h}_{t-1}, \mathbf{x}_t])$$

Final hidden state:

$$\mathbf{h}_t = (1-\mathbf{z}_t)\odot\mathbf{h}_{t-1} + \mathbf{z}_t\odot\tilde{\mathbf{h}}_t$$

LSTM vs GRU Comparison

| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 (f, i, o) | 2 (r, z) |
| States | $\mathbf{h}_t, \mathbf{C}_t$ | $\mathbf{h}_t$ only |
| Parameters | $4(n_h^2 + n_h n_x)$ | $3(n_h^2 + n_h n_x)$ |
| Performance | Slightly better on long sequences | Faster, similar accuracy |
| Interpretability | Explicit cell state | Simpler structure |
💡 GRU merges forget and input into a single update gate. If $\mathbf{z}_t \approx 0$, keep old state; if $\mathbf{z}_t \approx 1$, use new candidate.

Attention Mechanism

Scaled Dot-Product Attention

Given queries $\mathbf{Q}$, keys $\mathbf{K}$, values $\mathbf{V}$:

$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

Scaling by $\sqrt{d_k}$ prevents dot products from growing large in high dimensions, which would push softmax into regions with tiny gradients.

$$\text{score}(q,k) = \frac{q \cdot k}{\sqrt{d_k}}$$

Intuition

Each query "looks up" relevant keys. High similarity → high attention weight → more value contribution.
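The full attention formula is a few lines of NumPy. A minimal sketch on random toy tensors (shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
```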

Multi-Head Attention

$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
$$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\mathbf{W}^O$$

Each head attends to different representation subspaces — syntactic, semantic, positional relationships simultaneously. Total params: $4d_\text{model}^2$.

Transformer Architecture

Positional Encoding

Inject sequence order (no recurrence):

$$\text{PE}_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$
$$\text{PE}_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$
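Both sine and cosine terms vectorize cleanly over positions and dimensions. A minimal NumPy sketch of the table of encodings:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sin
    pe[:, 1::2] = np.cos(angles)             # odd dims: cos
    return pe

pe = positional_encoding(max_len=50, d_model=16)
```

Each row is the encoding added to the token embedding at that position.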

Feed-Forward Sub-Layer

$$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1+\mathbf{b}_1)\mathbf{W}_2+\mathbf{b}_2$$

Applied position-wise. $d_{ff}$ typically $4 \times d_\text{model}$.

Layer Normalization

$$\text{LN}(\mathbf{x}) = \gamma \frac{\mathbf{x}-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta$$

Normalizes across feature dim. Enables stable deep training.

Complexity Comparison

| Model | Per-layer | Sequential |
|---|---|---|
| RNN | $O(nd^2)$ | $O(n)$ |
| LSTM | $O(nd^2)$ | $O(n)$ |
| Transformer | $O(n^2d)$ | $O(1)$ |
| Conv (k) | $O(knd^2)$ | $O(n/k)$ |

Embeddings

Word2Vec — Skip-Gram

Predict context from center word. Objective:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^T\sum_{-c\leq j\leq c, j\neq 0}\log P(w_{t+j}|w_t)$$
$$P(w_O|w_I) = \frac{\exp(\mathbf{v}_{w_O}^{\prime T}\mathbf{v}_{w_I})}{\sum_{w=1}^W \exp(\mathbf{v}_w^{\prime T}\mathbf{v}_{w_I})}$$
✨ Famous: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$

GloVe — Global Vectors

Factorize co-occurrence matrix $X$:

$$\mathcal{L} = \sum_{i,j=1}^V f(X_{ij})(\mathbf{w}_i^T\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$

Weighting function $f(x) = (x/x_{\max})^\alpha$ if $x < x_{\max}$, else $1$. Combines local (Word2Vec) + global (co-occurrence) statistics.

Token Embeddings in Transformers

Input is a lookup table $\mathbf{E} \in \mathbb{R}^{|V| \times d}$:

$$\mathbf{x}_t = \mathbf{E}[w_t] + \text{PE}(t)$$

Tied embeddings (sharing the input embedding and output projection matrix) save a $|V| \times d$ parameter matrix and often improve perplexity. Used in GPT, BERT.

Subword Tokenization (BPE)

Merge most frequent byte pairs iteratively. Vocab size $V$ balances coverage vs. sequence length:

$$\text{fertility} = \frac{\text{tokens}}{\text{words}} \approx 1.2\text{–}1.5 \text{ for English}$$
```python
# "unhappiness" → ["un", "happi", "ness"]  (exact merges depend on the learned vocab)
# Unknown words degrade gracefully into known subwords
```

Loss Functions

Mean Squared Error (MSE)

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$

Penalizes large errors quadratically. Assumes Gaussian noise. Gradient: $\frac{2}{n}(\hat{y}-y)$.

📊 Regression

MAE / L1 Loss

$$\mathcal{L}_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$$

Robust to outliers. Non-differentiable at 0 (use subgradient). Assumes Laplace noise.

📊 Robust Regression

Huber Loss

$$\mathcal{L}_\delta = \begin{cases}\frac{1}{2}(y-\hat{y})^2 & |y-\hat{y}| \leq \delta \\ \delta|y-\hat{y}| - \frac{\delta^2}{2} & \text{otherwise}\end{cases}$$

MSE for small errors, MAE for large. Differentiable everywhere. Best of both worlds.

📊 Robust Regression
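The piecewise Huber definition maps directly onto `np.where`. A minimal NumPy sketch with $\delta = 1$:

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quad = 0.5 * err**2                     # MSE branch: |err| <= delta
    lin = delta * err - 0.5 * delta**2      # MAE branch: |err| > delta
    return np.where(err <= delta, quad, lin)

y = np.array([0.0, 0.0, 0.0])
y_hat = np.array([0.5, 1.0, 10.0])
losses = huber_loss(y, y_hat)   # quadratic, boundary, and linear cases
```

The branches agree at $|err| = \delta$ (both give $\delta^2/2$), which is what makes the loss differentiable everywhere.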

Cross-Entropy (Binary)

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_i[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$$

Maximizes log-likelihood of Bernoulli model. Gradient w.r.t. logit: $\hat{p} - y$ (clean!).

🔵 Binary Classification

Categorical Cross-Entropy

$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^C y_c \log\hat{p}_c$$

KL divergence from true to predicted distribution: $\mathcal{L} = H(y, \hat{p}) = H(y) + D_{\text{KL}}(y\|\hat{p})$

🔵 Multi-class

KL Divergence

$$D_{\text{KL}}(P\|Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}$$

Asymmetric: $D_{\text{KL}}(P\|Q) \neq D_{\text{KL}}(Q\|P)$. Used in VAEs, knowledge distillation, RL (PPO).

🧮 Distribution Matching

Hinge Loss (SVM)

$$\mathcal{L} = \frac{1}{n}\sum_i \max(0, 1 - y_i\hat{y}_i)$$

Only misclassified and margin-violating samples contribute. Correct and confident → 0 loss.

🔵 SVM / Margin

Focal Loss

$$\mathcal{L}_\text{FL} = -\alpha_t(1-p_t)^\gamma \log(p_t)$$

Down-weights easy examples via $(1-p_t)^\gamma$. Designed for class imbalance (RetinaNet). $\gamma=2$ is common.

🎯 Object Detection

Contrastive / Triplet Loss

$$\mathcal{L} = \max(0,\, d(a,p) - d(a,n) + m)$$

Anchor $a$, positive $p$, negative $n$. Margin $m$ ensures separation. Used in face recognition, embeddings.

🔗 Metric Learning

ELBO (VAE Loss)

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x)\|p(z))$$

Reconstruction term + KL regularization. A lower bound on the log-likelihood $\log p(x)$; tight exactly when $q_\phi(z|x)$ equals the true posterior $p_\theta(z|x)$.

🧮 Generative Models

Optimizers

Gradient Descent Family

SGD

Stochastic Gradient Descent

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t; x^{(i)}, y^{(i)})$$

Noisy but can escape local minima. Converges at $O(1/\sqrt{T})$ for convex.

SGD+M

Momentum

$$\mathbf{v}_{t+1} = \beta \mathbf{v}_t + (1-\beta)\nabla_\theta\mathcal{L}$$
$$\theta_{t+1} = \theta_t - \eta \mathbf{v}_{t+1}$$

Accumulates velocity. Damps oscillations in shallow dimensions. $\beta=0.9$ standard.

AdaGrad

Adaptive Gradient

$$G_t = G_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t+\epsilon}}g_t$$

Per-parameter learning rates. Rare features get larger updates. LR decays to 0 — problem!

RMSProp

Root Mean Square Propagation

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t$$

Fixes AdaGrad's decaying LR with exponential moving average. $\rho=0.9$ typical.

Adam

Adam — Adaptive Moment Estimation

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \quad \text{(1st moment)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \quad \text{(2nd moment)}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$$

Default: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\eta=10^{-3}$. Most widely used optimizer.
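The four update equations translate line-for-line into code. A minimal NumPy sketch of one Adam step, applied to the toy objective $f(\theta)=\theta^2$ (the objective and hyperparameters here are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                 # 1st-moment EMA
    v = b2 * v + (1 - b2) * g**2              # 2nd-moment EMA
    m_hat = m / (1 - b1**t)                   # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m = v = np.zeros(1)
for t in range(1, 2001):                      # minimize theta^2, gradient 2*theta
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
```

Note the bias correction: at $t=1$ the first step has magnitude $\approx \eta$ regardless of the gradient's scale.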

AdamW

Adam + Weight Decay (Decoupled)

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} + \lambda\theta_t\right)$$

Decouples L2 regularization from gradient scaling. Better generalization than Adam+L2. Default for LLMs (GPT, BERT fine-tuning).

Nadam

Nesterov + Adam

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1\hat{m}_t + \frac{(1-\beta_1)g_t}{1-\beta_1^t}\right)$$

Look-ahead gradient (Nesterov) + adaptive rates (Adam). Often converges slightly faster.

Lion

EvoLved Sign Momentum (2023)

$$\theta_{t+1} = \theta_t - \eta(\text{sign}(\beta_1 m_{t-1} + (1-\beta_1)g_t) + \lambda\theta_t)$$

Uses sign of gradient. Memory-efficient (1 state vs Adam's 2). Competitive with AdamW on large models.

Learning Rate Schedules

Cosine Annealing

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{t\pi}{T}\right)$$

Linear Warmup

$$\eta_t = \eta_{\max}\cdot\frac{t}{t_{\text{warmup}}} \quad (t \leq t_w)$$

Transformer Schedule

$$\eta = d_\text{model}^{-0.5}\cdot\min(t^{-0.5},\, t\cdot t_w^{-1.5})$$
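The three schedules above are pure functions of the step count. A minimal Python sketch (hyperparameter defaults are illustrative):

```python
import math

def cosine_annealing(t, T, eta_max=1e-3, eta_min=0.0):
    # eta_max at t=0, decays to eta_min at t=T
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(t * math.pi / T))

def linear_warmup(t, t_warmup, eta_max=1e-3):
    # Ramp linearly to eta_max over t_warmup steps, then hold
    return eta_max * min(1.0, t / t_warmup)

def transformer_schedule(t, d_model=512, t_warmup=4000):
    # Linear warmup, then inverse-sqrt decay; peak at t = t_warmup
    return d_model ** -0.5 * min(t ** -0.5, t * t_warmup ** -1.5)
```

The transformer schedule's two branches intersect exactly at $t = t_w$, which is where the peak learning rate occurs.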

Regularization Techniques

L2 Regularization (Weight Decay)

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$

Gradient adds $\lambda\mathbf{w}$ — pulls weights toward 0. Equivalent to Gaussian prior $p(\mathbf{w}) = \mathcal{N}(0, 1/\lambda)$. Shrinks all weights proportionally.

L1 Regularization (Lasso)

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda\|\mathbf{w}\|_1$$

Induces sparsity — some weights exactly 0. Feature selection. Equivalent to Laplace prior. Gradient: $\lambda\,\text{sign}(\mathbf{w})$.

Dropout

$$\mathbf{h}_{\text{drop}} = \mathbf{h} \odot \mathbf{m}, \quad m_i \sim \text{Bernoulli}(p)$$

Scale at inference: $\hat{\mathbf{h}} = p\mathbf{h}$ (here $p$ is the keep probability; frameworks typically use inverted dropout instead, scaling by $1/p$ at train time so inference is unchanged). Equivalent to training an ensemble of $2^n$ subnetworks. Reduces co-adaptation.

⚠️ Don't use dropout before LayerNorm (Transformers use it after).
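A minimal NumPy sketch of inverted dropout, the train-time-scaled variant, so the expected activation matches the undropped value and inference needs no rescaling:

```python
import numpy as np

def dropout(h, keep_prob=0.8, training=True, rng=None):
    """Inverted dropout: scale kept units by 1/keep_prob at train time
    so E[output] == h and inference is the identity."""
    if not training:
        return h
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((10000,))
out = dropout(h, keep_prob=0.8, training=True, rng=rng)
```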

Batch Normalization

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m x_i, \quad \sigma^2_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m(x_i-\mu_\mathcal{B})^2$$
$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B}+\epsilon}}, \quad y_i = \gamma\hat{x}_i + \beta$$

Normalizes per-batch, per-feature. Enables higher LR, reduces sensitivity to initialization.

Data Augmentation

Implicitly increases training set. For images:

$$\tilde{x} = T(x), \quad T \sim \{\text{crop, flip, rotate, color-jitter}\}$$

Mixup: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$, $\tilde{y} = \lambda y_i + (1-\lambda)y_j$

Early Stopping

Monitor validation loss. Stop when it stops decreasing for $k$ epochs (patience).

$$\text{stop if } \mathcal{L}_{\text{val}}(t) > \min_{t' < t} \mathcal{L}_{\text{val}}(t') \text{ for } k \text{ epochs}$$

Equivalent to L2 regularization in linear models. Free — no additional compute cost.

Backpropagation

Chain Rule — The Core Idea

For composite $\mathcal{L} = f(g(\mathbf{w}))$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial g} \cdot \frac{\partial g}{\partial \mathbf{w}}$$

For a layer $l$ with pre-activations $\mathbf{z}^{(l)}$:

$$\boldsymbol{\delta}^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} = \left(\mathbf{W}^{(l+1)T}\boldsymbol{\delta}^{(l+1)}\right) \odot \phi'(\mathbf{z}^{(l)})$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)}\mathbf{a}^{(l-1)T}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
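The three layer equations above can be checked by hand on a tiny two-layer network, validating the analytic gradient against a finite difference. A minimal NumPy sketch (shapes and targets are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                # input
y = np.array([1.0])                      # target
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(1, 4))

# Forward pass (cache intermediates)
z1 = W1 @ x
a1 = np.tanh(z1)
y_hat = W2 @ a1
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass (chain rule, output to input)
delta2 = y_hat - y                        # dL/dz2 for linear output + MSE
grad_W2 = np.outer(delta2, a1)
delta1 = (W2.T @ delta2) * (1 - a1**2)    # tanh'(z) = 1 - tanh^2(z)
grad_W1 = np.outer(delta1, x)

# Finite-difference check of one entry of grad_W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)
fd = (loss_p - loss) / eps
```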

Computational Graph Perspective

Forward pass: compute and cache all intermediate values.

Backward pass: traverse graph in reverse, accumulate gradients via chain rule.

🔑 Key insight: share computation via dynamic programming. $O(n)$ ops for $n$-parameter gradient, same as one forward pass.
```python
# PyTorch autograd
loss.backward()        # ← backprop
optimizer.step()       # ← update
optimizer.zero_grad()  # ← reset
```

Convolutional Neural Networks

Convolution Operation

$$(I * K)[i,j] = \sum_m\sum_n I[i+m, j+n]\cdot K[m,n]$$

Output size: $\lfloor\frac{W-F+2P}{S}\rfloor + 1$

where $W$=input, $F$=filter, $P$=padding, $S$=stride.

Parameters per filter: $F \times F \times C_{\text{in}}$. Total: $F \times F \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}$ (bias).

Receptive Field

$$r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1}s_i$$

Pooling & Depth

Max Pooling: $y = \max_{(i,j) \in R} x_{ij}$ — translation invariance

Avg Pooling: $y = \frac{1}{|R|}\sum_{(i,j)\in R} x_{ij}$

Global Avg Pooling: Collapses spatial dims → classification.

Key Architectures

| Model | Year | Key Innovation |
|---|---|---|
| AlexNet | 2012 | ReLU, Dropout, GPU training |
| VGG | 2014 | Deep stacks of 3×3 filters |
| ResNet | 2015 | Skip connections |
| EfficientNet | 2019 | Compound scaling |
| ConvNeXt | 2022 | Modernized CNN design |

ResNet Skip Connection

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

Gradient flows directly: $\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}\right)$. The $+1$ prevents vanishing gradient, enabling 1000+ layer networks.

Generative Adversarial Networks

Minimax Game

$$\min_G \max_D \, V(D,G) = \mathbb{E}_{x\sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1-D(G(z)))]$$

Discriminator: maximize log-prob of real vs fake

Generator: minimize log-prob of being caught (or maximize $\log D(G(z))$ — non-saturating)

🎯 Optimal: $D^*(x) = \frac{p_\text{data}(x)}{p_\text{data}(x)+p_g(x)}$. At equilibrium: $p_g = p_\text{data}$, $D^* = 1/2$ everywhere.

Variants

| Model | Key Idea |
|---|---|
| WGAN | Wasserstein distance, Lipschitz constraint |
| cGAN | Conditional on class label $y$ |
| CycleGAN | Unpaired image translation |
| StyleGAN | Style-based generator, disentanglement |

WGAN Loss

$$\mathcal{L} = \mathbb{E}[D(x_\text{real})] - \mathbb{E}[D(G(z))]$$

Variational Autoencoders

Encoder + Decoder

Encoder: $q_\phi(z|x)$ — posterior approximation

Decoder: $p_\theta(x|z)$ — likelihood

Prior: $p(z) = \mathcal{N}(0, I)$

ELBO Objective

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_\text{KL}(q_\phi(z|x)\|p(z))}_{\text{regularization}}$$

KL term forces latent to match prior (smooth, continuous latent space).

Reparameterization Trick

Cannot backprop through sampling. Instead:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Moves randomness outside gradient path. Gradient flows through $\mu$, $\sigma$.

KL for Gaussian

$$D_\text{KL} = -\frac{1}{2}\sum_j\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

Analytical! No Monte Carlo needed for this term.
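Both the reparameterized sample and the closed-form Gaussian KL are one-liners. A minimal NumPy sketch (the log-variance parameterization is the standard implementation choice):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps, eps ~ N(0, I): randomness outside the gradient path
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.5])
log_var = np.array([0.0, 0.0])    # sigma = 1
z = reparameterize(mu, log_var, rng)
kl = gaussian_kl(mu, log_var)
```

The KL is zero exactly when the posterior matches the prior ($\mu = 0$, $\sigma = 1$) and grows as it drifts away.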

Reinforcement Learning

Bellman Equation

$$V^\pi(s) = \mathbb{E}_\pi\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t=s\right]$$
$$Q^\pi(s,a) = r + \gamma\mathbb{E}_{s'}[V^\pi(s')]$$

Value function satisfies recursive self-consistency. Optimal: $Q^*(s,a) = r + \gamma\max_{a'}Q^*(s',a')$

Policy Gradient (REINFORCE)

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\nabla_\theta\log\pi_\theta(a|s) \cdot G_t\right]$$

$G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}$ (return). High variance — use baseline $b(s)$:

$$\nabla_\theta J = \mathbb{E}[\nabla_\theta\log\pi_\theta(a|s)(G_t - b(s))]$$

PPO (Proximal Policy Optimization)

$$\mathcal{L}^\text{CLIP} = \mathbb{E}\left[\min\!\left(r_t(\theta)\hat{A}_t,\, \text{clip}(r_t,1\!-\!\epsilon,1\!+\!\epsilon)\hat{A}_t\right)\right]$$

$r_t(\theta) = \pi_\theta(a|s)/\pi_{\theta_\text{old}}(a|s)$. Clips policy update to prevent large steps. RLHF for LLMs uses PPO.

DQN Key Ideas

$$\mathcal{L} = \mathbb{E}\left[(r + \gamma\max_{a'}Q(s',a';\theta^-) - Q(s,a;\theta))^2\right]$$

Experience Replay: $\mathcal{D} = \{(s,a,r,s')\}$ — breaks correlations

Target Network: $\theta^-$ updated every $C$ steps — stabilizes training

Interview Questions & Answers

ML System Design

Framework: ORCAS

O · Objective

Define the ML problem. What metric matters? Business KPI → proxy metric. E.g., "maximize watch time" → CTR + completion rate.

R · Requirements

Latency (p99), throughput (QPS), accuracy threshold, freshness, training frequency, data constraints, budget.

C · Collection

Data sources, labeling strategy (human / weak supervision / self-supervised), class-imbalance handling, feature store.

A · Architecture

Model choice: start simple (logistic regression → trees → DNN). Feature engineering, embedding strategy, serving architecture.

S · Serving & Scaling

Online vs. batch inference, model compression (quantization, distillation, pruning), A/B testing, monitoring.

Recommendation System

🔍 Retrieval (ANN) → ⚡ Ranking (DNN) → 📋 Re-ranking

Two-tower model:

$$\text{score}(u,i) = \mathbf{e}_u \cdot \mathbf{e}_i$$

User tower + Item tower trained with in-batch negatives. Serve via FAISS/ScaNN for ANN retrieval.

Search Ranking

$$\text{score} = f_\theta(\text{query}, \text{doc}, \text{context})$$

Learning-to-Rank: Pointwise (regression), Pairwise (RankNet: $P_{ij} = \sigma(s_i - s_j)$), Listwise (LambdaRank).

Features: BM25, semantic similarity, CTR, freshness, authority.

Real-Time Fraud Detection

Constraints: <50ms latency, high precision (false positive = customer friction).

Architecture: Feature store → gradient boosted trees (fast inference) + async DNN for enrichment.

Challenge: Concept drift — retrain daily on recent data + monitor PSI.

$$\text{PSI} = \sum\left(P_i - Q_i\right)\ln\frac{P_i}{Q_i}$$
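The PSI formula above bins a baseline sample and a live sample and compares the bin frequencies. A minimal NumPy sketch, assuming decile bins from the baseline (a common but not mandated choice):

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample (expected)
    and a live sample (actual), using baseline quantile bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf             # catch out-of-range values
    p = np.histogram(expected, edges)[0] / len(expected) + eps
    q = np.histogram(actual, edges)[0] / len(actual) + eps
    return np.sum((p - q) * np.log(p / q))

rng = np.random.default_rng(0)
base = rng.normal(size=10000)
same = rng.normal(size=10000)               # same distribution: PSI near 0
shifted = rng.normal(loc=1.0, size=10000)   # 1-sigma drift: PSI large
```

A common rule of thumb: PSI < 0.1 is stable, 0.1–0.25 warrants investigation, > 0.25 indicates significant drift.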

Model Compression

Quantization: FP32 → INT8. $4\times$ smaller, $2-4\times$ faster. PTQ vs QAT.

Pruning: Remove weights with $|w| < \epsilon$. Structured (whole neurons) vs unstructured.

Distillation: Student learns soft targets from teacher:

$$\mathcal{L} = \alpha\mathcal{L}_\text{CE}(y, p_s) + (1-\alpha)\mathcal{L}_\text{KL}(p_t, p_s)$$

LoRA: $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, $r \ll \min(d,k)$.


Complexity & Scaling Guide

Time & Space Complexity Comparison

| Architecture | Train Time / Step | Inference Memory | Params | Sequential Ops |
|---|---|---|---|---|
| Linear / Logistic | $O(nd)$ | $O(d)$ | $O(d)$ | $O(1)$ |
| MLP ($L$ layers) | $O(nd^2L)$ | $O(d^2L)$ | $O(d^2L)$ | $O(1)$ |
| CNN ($k{\times}k$ filter) | $O(nk^2d^2L)$ | $O(k^2d^2L)$ | $O(k^2d^2L)$ | $O(n/k)$ |
| RNN | $O(nd_h^2)$ | $O(d_h^2)$ | $O(d_h^2)$ | $O(n)$ |
| LSTM | $O(4nd_h^2)$ | $O(4d_h^2)$ | $O(4d_h(d_h+d_x))$ | $O(n)$ |
| Transformer | $O(n^2d)$ | $O(n^2d)$ | $O(4d^2)$ per layer | $O(1)$ |
| Linear Attn. | $O(nd^2)$ | $O(d^2)$ | $O(4d^2)$ per layer | $O(1)$ |

Scaling Laws (Chinchilla)

Optimal token count for a model of $N$ parameters:

$$D_{\text{opt}} \approx 20 \cdot N$$

Loss as a function of compute $C = 6ND$:

$$L(C) = \left(\frac{C_{\min}}{C}\right)^{\alpha_C} + L_\infty$$

$\alpha_C \approx 0.057$. Model size and data should be scaled in equal proportion with compute; this compute-optimal setting is the Chinchilla regime. GPT-3 (175B params, 300B tokens) was substantially undertrained by this criterion.

GPU Memory Estimation

For a model with $N$ parameters, full training memory:

$$M_{\text{total}} \approx 16N \text{ bytes}$$

Breakdown (mixed precision + Adam):

$$M = \underbrace{2N}_{\text{fp16 params}} + \underbrace{2N}_{\text{gradients}} + \underbrace{12N}_{\text{Adam states (fp32)}}$$

e.g., 7B model ≈ 112 GB → needs 2× A100 80GB minimum. Activation checkpointing trades compute for memory: $O(\sqrt{L})$ vs $O(L)$.

```python
params = 7e9
bytes_per_param = 16                          # full training (mixed precision + Adam)
gpu_mem_gb = params * bytes_per_param / 1e9   # = 112 GB
```

Weight Initialization

Why Initialization Matters

Bad initialization → vanishing or exploding activations/gradients from layer 1. Consider $L$ layers with weights $W^{(l)} \in \mathbb{R}^{n \times n}$. Variance of activations:

$$\text{Var}(a^{(l)}) = \left(n \cdot \text{Var}(w)\right)^l \cdot \text{Var}(x)$$

This blows up if $n\,\text{Var}(w) > 1$ and vanishes if $< 1$. Initialization schemes choose $\text{Var}(w)$ to keep variance $\approx 1$ through all layers.
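The variance recursion can be demonstrated empirically by pushing a random vector through a stack of linear layers. A minimal NumPy sketch (activations omitted to isolate the $n\,\text{Var}(w)$ factor; sizes are illustrative):

```python
import numpy as np

def forward_variance(w_std, n=256, depth=20, rng=None):
    """Push a random input through `depth` linear layers with weight std
    `w_std` and return the activation std after each layer."""
    rng = rng or np.random.default_rng(0)
    x = rng.normal(size=n)
    stds = []
    for _ in range(depth):
        W = rng.normal(scale=w_std, size=(n, n))
        x = W @ x                      # per-layer variance factor: n * w_std^2
        stds.append(x.std())
    return stds

good = forward_variance(w_std=1 / np.sqrt(256))   # n*Var(w) = 1: stable
bad = forward_variance(w_std=2 / np.sqrt(256))    # n*Var(w) = 4: explodes
```

With $n\,\text{Var}(w)=1$ the activation scale stays near 1 across all 20 layers; quadrupling the variance multiplies the scale by roughly 2 per layer.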

Xavier / Glorot (2010)

Designed for tanh / sigmoid activations. Balances forward and backward signal:

$$W \sim \mathcal{U}\!\left[-\frac{\sqrt{6}}{\sqrt{n_{\text{in}}+n_{\text{out}}}},\; \frac{\sqrt{6}}{\sqrt{n_{\text{in}}+n_{\text{out}}}}\right]$$

Or equivalently: $\text{Var}(W) = \dfrac{2}{n_{\text{in}}+n_{\text{out}}}$

Derivation: choose variance so that $\text{Var}(y_i) = \text{Var}(x_i)$ through both forward and backward pass simultaneously.

```python
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)
```

He / Kaiming (2015)

Designed for ReLU / Leaky ReLU. Accounts for ReLU zeroing half the neurons (halves effective fan-in):

$$W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}}}\right)$$

For Leaky ReLU with slope $a$:

$$\text{Var}(W) = \frac{2}{(1+a^2)\,n_{\text{in}}}$$

Default in PyTorch for Conv and Linear layers. Always prefer over Xavier when using ReLU variants.

```python
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
```

Orthogonal Init

Initialize weight matrix as a random orthogonal matrix $W^TW = I$:

$$W = UV^T, \quad \text{where } M = U\Sigma V^T \text{ (SVD of random matrix)}$$

Preserves gradient norms exactly at initialization. Especially useful for RNNs (prevents vanishing/exploding in recurrent weights). Also used in deep linear networks research.

```python
nn.init.orthogonal_(layer.weight, gain=1.0)
```

Practical Initialization Rules

| Activation | Use | Gain |
|---|---|---|
| Linear | Xavier Normal | $1$ |
| Tanh | Xavier Normal | $5/3$ |
| Sigmoid | Xavier Normal | $1$ |
| ReLU | He / Kaiming | $\sqrt{2}$ |
| Leaky ReLU | He / Kaiming | $\sqrt{2/(1+a^2)}$ |
| SELU | LeCun Normal | $1$ |
💡 Biases: always init to 0. BatchNorm γ=1, β=0. Embedding: $\mathcal{N}(0, d^{-0.5})$. Final classifier: $\mathcal{N}(0, 0.01)$ for stable softmax at start.