Machine Learning
Encyclopedia
Mathematical foundations · Model architectures · Training dynamics · Interview mastery · System design
Types of Machine Learning
Supervised Learning
Learn a mapping $f: X \to Y$ from labeled pairs $(x_i, y_i)$. Minimize empirical risk: $\hat{f} = \arg\min_f \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$
Unsupervised Learning
Discover structure in unlabeled data $\{x_i\}$ by modeling the data distribution $p(x)$ — clustering, density estimation, dimensionality reduction.
Reinforcement Learning
Agent learns policy $\pi$ by maximizing expected cumulative reward: $J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$
Semi-Supervised
Combines a small labeled set $\mathcal{L}$ with a large unlabeled set $\mathcal{U}$; a supervised loss on $\mathcal{L}$ is typically augmented with a consistency or pseudo-label term on $\mathcal{U}$.
Self-Supervised
Create supervision from the data itself using pretext tasks. Contrastive (InfoNCE) loss: $\mathcal{L} = -\log\dfrac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i}\exp(\text{sim}(z_i, z_k)/\tau)}$
Transfer Learning
Adapt a model trained on source domain $\mathcal{D}_S$ to target domain $\mathcal{D}_T$, typically by fine-tuning with a low learning rate.
Linear Models
Linear Regression
Model: $\hat{y} = \mathbf{w}^T\mathbf{x} + b$
Closed-form (Normal Equation): $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Geometric interpretation: project $\mathbf{y}$ onto column space of $\mathbf{X}$.
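The closed form is easy to check numerically; a minimal numpy sketch with synthetic data (variable names illustrative):

```python
import numpy as np

# Solve least squares via the normal equation: w_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# np.linalg.solve is more numerically stable than forming the explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```

In practice `np.linalg.lstsq` (SVD-based) is preferred when $\mathbf{X}^T\mathbf{X}$ is ill-conditioned.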
Logistic Regression
Sigmoid squashes the logit $z = \mathbf{w}^T\mathbf{x} + b$ to a probability: $\hat{p} = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
Log-likelihood (cross-entropy): $\mathcal{L} = -\sum_i \left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$
Support Vector Machine (SVM)
Find the maximum-margin hyperplane. Primal form: $\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2$ s.t. $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
Dual (kernel trick enabled): $\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$ s.t. $\alpha_i \geq 0$, $\sum_i \alpha_i y_i = 0$
RBF Kernel: $K(\mathbf{x},\mathbf{x}') = \exp\!\left(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\right)$
Activation Functions
Sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
Tanh: $\tanh'(x) = 1 - \tanh^2(x)$
ReLU: $f'(x) = \mathbb{1}[x > 0]$
GELU: $f(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF
Leaky ReLU: $f(x) = \max(\alpha x, x)$, $\alpha$ typically $0.01$
Softmax: outputs sum to 1; used for multi-class outputs
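A quick numpy sketch of these formulas, checking the sigmoid derivative against a finite difference (values illustrative):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-2.0, 2.0, 5)

d_sigmoid = sigmoid(x) * (1 - sigmoid(x))   # sigma'(x) = sigma(1 - sigma)
d_tanh = 1 - np.tanh(x) ** 2                # tanh'(x)
relu_grad = (x > 0).astype(float)           # 1[x > 0]

# Central finite difference agrees with the analytic derivative:
eps = 1e-6
num = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(d_sigmoid, num, atol=1e-5))  # True

# Softmax (shifted by the max for numerical stability) sums to 1:
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(softmax(x).sum())  # 1.0
```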
Feed-Forward Neural Network
Forward Pass
Layer $l$ computation: $\mathbf{a}^{(l)} = \phi\!\left(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)$
where $\phi$ is a non-linear activation. Stack $L$ layers: $f(\mathbf{x}) = \mathbf{W}^{(L)}\phi\!\left(\cdots\phi\!\left(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right)\cdots\right) + \mathbf{b}^{(L)}$
Universal Approximation Theorem
A feed-forward network with one hidden layer and a non-polynomial activation can approximate any continuous function on a compact set $K$ to arbitrary accuracy. Formally, $\forall \epsilon > 0$, $\exists$ network $f_\theta$ s.t. $\sup_{x \in K} |f(x) - f_\theta(x)| < \epsilon$
Parameter Count
For layers of sizes $[n_0, n_1, \ldots, n_L]$: $\text{params} = \sum_{l=1}^{L}\left(n_{l-1} n_l + n_l\right)$ (weights plus biases)
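The count is easy to compute directly; a small helper (function name illustrative):

```python
def mlp_param_count(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# e.g. a 784-256-10 MNIST-style classifier:
print(mlp_param_count([784, 256, 10]))  # 203530
```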
Recurrent Neural Network (RNN)
Core Equations
Hidden state update at time $t$: $\mathbf{h}_t = \tanh\!\left(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h\right)$
Output: $\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y$
BPTT (Backpropagation Through Time)
Unroll the recurrence over $T$ steps and apply the chain rule. Gradients accumulate products of Jacobians $\prod_t \partial\mathbf{h}_t/\partial\mathbf{h}_{t-1}$, which vanish or explode for long sequences.
Long Short-Term Memory (LSTM)
Forget gate $\mathbf{f}_t = \sigma\!\left(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f\right)$ decides what to discard from the cell state. Output in $[0,1]$.
Element-wise operations preserve gradients through time — the "highway" for gradients.
Why LSTM Solves Vanishing Gradient
The cell state update $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$ creates an additive shortcut. Gradient flows through this path without repeated matrix multiplication: $\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_{t-1}} = \text{diag}(\mathbf{f}_t)$, so gradients survive as long as the forget gate stays near 1.
Gated Recurrent Unit (GRU)
GRU Equations (Cho et al., 2014)
Reset gate — how much past to forget: $\mathbf{r}_t = \sigma\!\left(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]\right)$
Update gate — interpolation coefficient: $\mathbf{z}_t = \sigma\!\left(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]\right)$
Candidate hidden state: $\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]\right)$
Final hidden state: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$
LSTM vs GRU Comparison
| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 (f, i, o) | 2 (r, z) |
| States | $\mathbf{h}_t, \mathbf{C}_t$ | $\mathbf{h}_t$ only |
| Parameters | $4(n_h^2 + n_h n_x)$ | $3(n_h^2 + n_h n_x)$ |
| Performance | Slightly better on long sequences | Faster, similar accuracy |
| Interpretability | Cell state explicit | Simpler structure |
Attention Mechanism
Scaled Dot-Product Attention
Given queries $\mathbf{Q}$, keys $\mathbf{K}$, values $\mathbf{V}$: $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\dfrac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$
Scaling by $\sqrt{d_k}$ prevents dot products from growing large in high dimensions, which would push softmax into regions with tiny gradients.
Intuition
Each query "looks up" relevant keys. High similarity → high attention weight → more value contribution.
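The formula in plain numpy — a sketch, not an optimized implementation (shapes illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=s) for s in [(4, 8), (6, 8), (6, 8)])
out, w = attention(Q, K, V)
print(out.shape)       # (4, 8): one output per query
print(w.sum(axis=-1))  # each query's attention weights sum to 1
```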
Multi-Head Attention
$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$, where $\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$. Each head attends to different representation subspaces — syntactic, semantic, positional relationships simultaneously. Total params: $4d_\text{model}^2$ (one $d_\text{model} \times d_\text{model}$ matrix each for $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V, \mathbf{W}^O$).
Transformer Architecture
Positional Encoding
Inject sequence order (no recurrence): $PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$
Feed-Forward Sub-Layer
$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$ — applied position-wise. $d_{ff}$ typically $4 \times d_\text{model}$.
Layer Normalization
$\text{LN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \dfrac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \boldsymbol{\beta}$ — normalizes across the feature dim. Enables stable deep training.
Complexity Comparison
| Model | Per-layer | Sequential |
|---|---|---|
| RNN | $O(nd^2)$ | $O(n)$ |
| LSTM | $O(nd^2)$ | $O(n)$ |
| Transformer | $O(n^2d)$ | $O(1)$ |
| Conv (k) | $O(knd^2)$ | $O(n/k)$ |
Embeddings
Word2Vec — Skip-Gram
Predict context from center word. Objective: $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$
GloVe — Global Vectors
Factorize co-occurrence matrix $X$: $\mathcal{L} = \sum_{i,j} f(X_{ij})\left(\mathbf{w}_i^T\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$
Weighting function $f(x) = (x/x_{\max})^\alpha$ if $x < x_{\max}$, else $1$. Combines local (Word2Vec) + global (co-occurrence) statistics.
Token Embeddings in Transformers
Input is a lookup table $\mathbf{E} \in \mathbb{R}^{|V| \times d}$: token $t_i$ selects row $\mathbf{E}_{t_i}$, to which the positional encoding is added.
Tied embeddings (input = output weight matrix) halve params and improve perplexity. Used in GPT, BERT.
Subword Tokenization (BPE)
Merge the most frequent byte pairs iteratively. Vocab size $V$ balances coverage against sequence length.
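A toy sketch of the merge loop on a three-word corpus (corpus and merge count are illustrative; real BPE also tracks word-end markers):

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {" ".join(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            syms = word.split()
            for pair in zip(syms, syms[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair
        merges.append(best)
        vocab = {w.replace(" ".join(best), "".join(best)): c
                 for w, c in vocab.items()}  # fuse the pair everywhere
    return merges

print(bpe_merges({"low": 5, "lower": 2, "lowest": 3}, 2))
```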
Loss Functions
Mean Squared Error (MSE)
Penalizes large errors quadratically. Assumes Gaussian noise. Gradient: $\frac{2}{n}(\hat{y}-y)$.
MAE / L1 Loss
$\mathcal{L} = \frac{1}{n}\sum_i |\hat{y}_i - y_i|$ — robust to outliers. Non-differentiable at 0 (use subgradient). Assumes Laplace noise.
Huber Loss
$\mathcal{L}_\delta = \frac{1}{2}(\hat{y}-y)^2$ if $|\hat{y}-y| \leq \delta$, else $\delta|\hat{y}-y| - \frac{1}{2}\delta^2$. MSE for small errors, MAE for large. Differentiable everywhere. Best of both worlds.
Cross-Entropy (Binary)
Maximizes log-likelihood of Bernoulli model. Gradient w.r.t. logit: $\hat{p} - y$ (clean!).
Categorical Cross-Entropy
KL divergence from true to predicted distribution: $\mathcal{L} = H(y, \hat{p}) = H(y) + D_{\text{KL}}(y\|\hat{p})$
KL Divergence
Asymmetric: $D_{\text{KL}}(P\|Q) \neq D_{\text{KL}}(Q\|P)$. Used in VAEs, knowledge distillation, RL (PPO).
Hinge Loss (SVM)
$\mathcal{L} = \max(0,\, 1 - y\hat{y})$ with $y \in \{-1, +1\}$. Only misclassified and margin-violating samples contribute. Correct and confident → 0 loss.
Focal Loss
Down-weights easy examples via $(1-p_t)^\gamma$. Designed for class imbalance (RetinaNet). $\gamma=2$ is common.
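The down-weighting is easy to see numerically; a minimal binary focal loss sketch (function name illustrative):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t), with p_t = p if y == 1 else 1 - p."""
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1):
print(float(focal_loss(0.9, 1)))  # ~0.001
print(float(focal_loss(0.1, 1)))  # ~1.87
```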
Contrastive / Triplet Loss
Triplet: $\mathcal{L} = \max(0,\, d(a,p) - d(a,n) + m)$ for anchor $a$, positive $p$, negative $n$. Margin $m$ ensures separation. Used in face recognition, embeddings.
ELBO (VAE Loss)
Reconstruction term + KL regularization. Tight lower bound on log-likelihood $\log p(x)$.
Optimizers
Gradient Descent Family
Stochastic Gradient Descent
Noisy but can escape local minima. Converges at $O(1/\sqrt{T})$ for convex.
Momentum
Accumulates velocity. Damps oscillations in shallow dimensions. $\beta=0.9$ standard.
Adaptive Gradient
Per-parameter learning rates. Rare features get larger updates. LR decays to 0 — problem!
Root Mean Square Propagation
Fixes AdaGrad's decaying LR with exponential moving average. $\rho=0.9$ typical.
Adam — Adaptive Moment Estimation
Default: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\eta=10^{-3}$. Most widely used optimizer.
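The update rule as a self-contained sketch, minimizing $f(w) = w^2$ (step count and starting point are illustrative):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: EMAs of gradient (m) and squared gradient (v),
    each bias-corrected by 1 - beta^t."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * w                  # gradient of f(w) = w^2
    w, m, v = adam_step(w, g, m, v, t)
print(w[0])  # moved from 5.0 toward the minimum at 0
```

Note the near-constant step size: with a consistent gradient, $\hat{m}/\sqrt{\hat{v}} \approx 1$, so Adam moves roughly $\eta$ per step regardless of gradient magnitude.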
Adam + Weight Decay (Decoupled)
Decouples L2 regularization from gradient scaling. Better generalization than Adam+L2. Default for LLMs (GPT, BERT fine-tuning).
Nesterov + Adam
Look-ahead gradient (Nesterov) + adaptive rates (Adam). Often converges slightly faster.
EvoLved Sign Momentum (2023)
Uses sign of gradient. Memory-efficient (1 state vs Adam's 2). Competitive with AdamW on large models.
Learning Rate Schedules
Cosine Annealing
$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$
Linear Warmup
$\eta_t = \eta_{\max} \cdot t / T_w$ for $t \leq T_w$ — prevents early divergence with adaptive optimizers.
Transformer Schedule
$\eta_t = d_\text{model}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot T_w^{-1.5}\right)$ — warmup then inverse-square-root decay.
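A common combination is linear warmup followed by cosine decay; a sketch with illustrative hyperparameters:

```python
import math

def lr_at(step, max_lr=3e-4, warmup=100, total=1000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(50))    # mid-warmup: half of max_lr
print(lr_at(100))   # peak at end of warmup
print(lr_at(1000))  # fully decayed to min_lr
```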
Regularization Techniques
L2 Regularization (Weight Decay)
Gradient adds $\lambda\mathbf{w}$ — pulls weights toward 0. Equivalent to a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$. Shrinks all weights proportionally.
L1 Regularization (Lasso)
Induces sparsity — some weights exactly 0. Feature selection. Equivalent to Laplace prior. Gradient: $\lambda\,\text{sign}(\mathbf{w})$.
Dropout
Scale at inference: $\hat{\mathbf{h}} = p\mathbf{h}$ (inverted dropout instead scales by $1/p$ at training time). Equivalent to training $2^n$ subnetworks. Acts as an ensemble. Reduces co-adaptation.
Batch Normalization
$\hat{x} = \dfrac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $y = \gamma\hat{x} + \beta$ — normalizes per-batch, per-feature. Enables higher LR, reduces sensitivity to initialization.
Data Augmentation
Implicitly increases training set. For images:
Mixup: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$, $\tilde{y} = \lambda y_i + (1-\lambda)y_j$
Early Stopping
Monitor validation loss. Stop when it stops decreasing for $k$ epochs (patience).
Equivalent to L2 regularization in linear models. Free — no additional compute cost.
Backpropagation
Chain Rule — The Core Idea
For composite $\mathcal{L} = f(g(\mathbf{w}))$: $\dfrac{\partial\mathcal{L}}{\partial\mathbf{w}} = \dfrac{\partial f}{\partial g}\dfrac{\partial g}{\partial\mathbf{w}}$
For a layer $l$ with pre-activations $\mathbf{z}^{(l)}$: $\boldsymbol{\delta}^{(l)} = \dfrac{\partial\mathcal{L}}{\partial\mathbf{z}^{(l)}} = \left(\mathbf{W}^{(l+1)T}\boldsymbol{\delta}^{(l+1)}\right)\odot\phi'(\mathbf{z}^{(l)})$
Computational Graph Perspective
Forward pass: compute and cache all intermediate values.
Backward pass: traverse graph in reverse, accumulate gradients via chain rule.
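The two passes for a tiny two-layer ReLU network, with one analytic gradient entry checked against a finite difference (shapes and names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 2))
W1 = 0.5 * rng.normal(size=(3, 4))
W2 = 0.5 * rng.normal(size=(4, 2))

def forward(W1, W2):
    h = np.maximum(0.0, x @ W1)           # cache h for the backward pass
    L = 0.5 * np.mean((h @ W2 - y) ** 2)  # MSE loss
    return L, h

# Backward pass: traverse in reverse, accumulate via the chain rule.
L, h = forward(W1, W2)
d_out = (h @ W2 - y) / y.size             # dL/d(output)
dW2 = h.T @ d_out
dh = d_out @ W2.T
dW1 = x.T @ (dh * (h > 0))                # ReLU gates the gradient

# Finite-difference check on one entry of W1:
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
num = (forward(Wp, W2)[0] - forward(Wm, W2)[0]) / (2 * eps)
print(abs(dW1[0, 0] - num))  # tiny
```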
Convolutional Neural Networks
Convolution Operation
Output size: $\lfloor\frac{W-F+2P}{S}\rfloor + 1$
where $W$=input, $F$=filter, $P$=padding, $S$=stride.
Parameters per filter: $F \times F \times C_{\text{in}}$ (+1 bias). Total: $(F \times F \times C_{\text{in}} + 1) \times C_{\text{out}}$.
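Both formulas as one-liners (helper names illustrative):

```python
def conv_out(W, F, P, S):
    """floor((W - F + 2P) / S) + 1"""
    return (W - F + 2 * P) // S + 1

def conv_params(F, C_in, C_out):
    """F*F*C_in weights per filter, C_out filters, plus C_out biases."""
    return F * F * C_in * C_out + C_out

print(conv_out(224, 7, 3, 2))   # 112 (e.g. a 7x7 stride-2 stem on a 224px input)
print(conv_params(3, 64, 128))  # 73856
```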
Receptive Field
Grows with depth: $r_l = r_{l-1} + (F_l - 1)\prod_{i=1}^{l-1} S_i$ — strided layers multiply how much input each later filter sees.
Pooling & Depth
Max Pooling: $y = \max_{(i,j) \in R} x_{ij}$ — translation invariance
Avg Pooling: $y = \frac{1}{|R|}\sum_{(i,j)\in R} x_{ij}$
Global Avg Pooling: Collapses spatial dims → classification.
Key Architectures
| Model | Year | Key Innovation |
|---|---|---|
| AlexNet | 2012 | ReLU, Dropout, GPU |
| VGG | 2014 | Deep 3×3 filters |
| ResNet | 2015 | Skip connections |
| EfficientNet | 2019 | Compound scaling |
| ConvNeXt | 2022 | Modernized CNN |
ResNet Skip Connection
Gradient flows directly: $\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}\right)$. The $+1$ prevents vanishing gradient, enabling 1000+ layer networks.
Generative Adversarial Networks
Minimax Game
$\min_G \max_D\; \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
Discriminator: maximize log-prob of real vs fake
Generator: minimize log-prob of being caught (or maximize $\log D(G(z))$ — non-saturating)
Variants
| Model | Key Idea |
|---|---|
| WGAN | Wasserstein distance, Lipschitz constraint |
| cGAN | Conditional on class label $y$ |
| CycleGAN | Unpaired image translation |
| StyleGAN | Style-based generator, disentanglement |
WGAN Loss
$\min_G \max_{D \in 1\text{-Lipschitz}}\; \mathbb{E}_{x \sim p_\text{data}}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]$
Variational Autoencoders
Encoder + Decoder
Encoder: $q_\phi(z|x)$ — posterior approximation
Decoder: $p_\theta(x|z)$ — likelihood
Prior: $p(z) = \mathcal{N}(0, I)$
ELBO Objective
$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)$
KL term forces latent to match prior (smooth, continuous latent space).
Reparameterization Trick
Cannot backprop through sampling. Instead: $z = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
Moves randomness outside gradient path. Gradient flows through $\mu$, $\sigma$.
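A numerical sketch: with $z = \mu + \sigma\epsilon$ the sample is a deterministic, differentiable function of $(\mu, \sigma)$, and its statistics match the target Gaussian (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 1.5, np.log(0.25)        # target N(1.5, 0.5^2)
eps = rng.normal(size=100_000)         # randomness lives here, outside the model
z = mu + np.exp(0.5 * log_var) * eps   # differentiable in mu and log_var

print(z.mean())  # ~1.5
print(z.std())   # ~0.5
```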
KL for Gaussian
$D_{\text{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right) = -\frac{1}{2}\sum_j\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$
Analytical! No Monte Carlo needed for this term.
Reinforcement Learning
Bellman Equation
Value function satisfies recursive self-consistency: $V^\pi(s) = \mathbb{E}_\pi\!\left[r + \gamma V^\pi(s')\right]$. Optimal: $Q^*(s,a) = \mathbb{E}\!\left[r + \gamma\max_{a'}Q^*(s',a')\right]$
Policy Gradient (REINFORCE)
$\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[\nabla_\theta \log\pi_\theta(a_t|s_t)\,(G_t - b(s_t))\right]$, where $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}$ (return). The estimator has high variance — the baseline $b(s)$ reduces it without adding bias.
PPO (Proximal Policy Optimization)
$\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)A_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t\right)\right]$, with $r_t(\theta) = \pi_\theta(a|s)/\pi_{\theta_\text{old}}(a|s)$. Clips the policy update to prevent large steps. RLHF for LLMs uses PPO.
DQN Key Ideas
Experience Replay: $\mathcal{D} = \{(s,a,r,s')\}$ — breaks correlations
Target Network: $\theta^-$ updated every $C$ steps — stabilizes training
Interview Questions & Answers
ML System Design
Framework: ORCAS
- Objective: Define the ML problem. What metric matters? Business KPI → proxy metric. E.g., "maximize watch time" → CTR + completion rate.
- Requirements: Latency (p99), throughput (QPS), accuracy threshold, freshness, training frequency, data constraints, budget.
- Collection (data): Data sources, labeling strategy (human/weak supervision/self-supervised), class imbalance handling, feature store.
- Architecture: Model choice — start simple (logistic reg → tree → DNN). Feature engineering, embedding strategy, serving architecture.
- Serving: Online vs batch inference, model compression (quantization, distillation, pruning), A/B testing, monitoring.
Recommendation System
Two-tower model: score $s(u, i) = \langle f_\text{user}(u),\, g_\text{item}(i)\rangle$
User tower + Item tower trained with in-batch negatives. Serve via FAISS/ScaNN for ANN retrieval.
Search Ranking
Learning-to-Rank: Pointwise (regression), Pairwise (RankNet: $P_{ij} = \sigma(s_i - s_j)$), Listwise (LambdaRank).
Features: BM25, semantic similarity, CTR, freshness, authority.
Real-Time Fraud Detection
Constraints: <50ms latency, high precision (false positive = customer friction).
Architecture: Feature store → gradient boosted trees (fast inference) + async DNN for enrichment.
Challenge: Concept drift — retrain daily on recent data + monitor PSI.
Model Compression
Quantization: FP32 → INT8. $4\times$ smaller, $2-4\times$ faster. PTQ vs QAT.
Pruning: Remove weights with $|w| < \epsilon$. Structured (whole neurons) vs unstructured.
Distillation: Student learns soft targets from the teacher: $\mathcal{L} = \alpha\,\tau^2\, D_{\text{KL}}\!\left(p_T^{(\tau)} \,\|\, p_S^{(\tau)}\right) + (1-\alpha)\,\text{CE}(y, p_S)$, with logits softened by temperature $\tau$.
LoRA: $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, $r \ll \min(d,k)$.
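The parameter savings follow directly from the shapes; a quick sketch (dimensions illustrative of a large attention projection):

```python
import numpy as np

d, k, r = 4096, 4096, 8
full = d * k                 # dense Delta W: ~16.8M params
lora = d * r + r * k         # B (d x r) plus A (r x k): ~65.5K params
print(full // lora)          # 256x fewer trainable parameters

# The update's rank is bounded by r (small numeric check):
rng = np.random.default_rng(0)
B, A = rng.normal(size=(64, 4)), rng.normal(size=(4, 64))
print(np.linalg.matrix_rank(B @ A))  # 4
```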
Glossary of Key Terms
Quick-Reference Cheat Sheet
Complexity & Scaling Guide
Time & Space Complexity Comparison
| Architecture | Train Time / Step | Inference | Memory (params) | Sequence Ops |
|---|---|---|---|---|
| Linear / Logistic | $O(nd)$ | $O(d)$ | $O(d)$ | — |
| MLP (L layers) | $O(nd^2L)$ | $O(d^2L)$ | $O(d^2L)$ | — |
| CNN (k×k filter) | $O(nk^2d^2L)$ | $O(k^2d^2L)$ | $O(k^2d^2L)$ | $O(n/k)$ |
| RNN | $O(nd_h^2)$ | $O(d_h^2)$ | $O(d_h^2)$ | $O(n)$ |
| LSTM | $O(4nd_h^2)$ | $O(4d_h^2)$ | $O(4d_h(d_h+d_x))$ | $O(n)$ |
| Transformer | $O(n^2d)$ | $O(n^2d)$ | $O(4d^2)$ per layer | $O(1)$ |
| Linear Attn. | $O(nd^2)$ | $O(d^2)$ | $O(4d^2)$ per layer | $O(1)$ |
Scaling Laws (Chinchilla)
Optimal token count for a model of $N$ parameters: $D^* \approx 20N$ (roughly 20 tokens per parameter)
Loss as a function of compute $C = 6ND$: $L(C) \propto C^{-\alpha_C}$
$\alpha_C \approx 0.057$. Both model size and data scale equally — compute-optimal training is the Chinchilla regime. GPT-3 (175B, 300B tokens) was severely undertrained.
GPU Memory Estimation
For a model with $N$ parameters, full training memory: $\approx 16N$ bytes (excluding activations)
Breakdown (mixed precision + Adam): fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4) + Adam $m$ (4) + Adam $v$ (4) = 16 bytes/param
e.g., 7B model ≈ 112 GB → needs 2× A100 80GB minimum. Activation checkpointing trades compute for memory: $O(\sqrt{L})$ vs $O(L)$.
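The 16-bytes-per-parameter rule as a helper (activation memory excluded; names illustrative):

```python
def train_mem_gb(n_params, bytes_per_param=16):
    """fp16 weights (2) + fp16 grads (2) + fp32 master (4) + Adam m, v (4 + 4)."""
    return n_params * bytes_per_param / 1e9

print(train_mem_gb(7e9))    # 112.0 GB for a 7B model
print(train_mem_gb(175e9))  # 2800.0 GB at GPT-3 scale
```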
Weight Initialization
Why Initialization Matters
Bad initialization → vanishing or exploding activations/gradients from layer 1. Consider $L$ layers with weights $W^{(l)} \in \mathbb{R}^{n \times n}$. Variance of activations: $\text{Var}(a^{(L)}) \approx \left[n\,\text{Var}(w)\right]^L \text{Var}(x)$
This blows up if $n\,\text{Var}(w) > 1$ and vanishes if $< 1$. Initialization schemes choose $\text{Var}(w)$ to keep variance $\approx 1$ through all layers.
Xavier / Glorot (2010)
Designed for tanh / sigmoid activations. Balances forward and backward signal: $W \sim \mathcal{U}\!\left[-\sqrt{\tfrac{6}{n_\text{in}+n_\text{out}}},\; \sqrt{\tfrac{6}{n_\text{in}+n_\text{out}}}\right]$
Or equivalently: $\text{Var}(W) = \dfrac{2}{n_{\text{in}}+n_{\text{out}}}$
Derivation: choose variance so that $\text{Var}(y_i) = \text{Var}(x_i)$ through both forward and backward pass simultaneously.
He / Kaiming (2015)
Designed for ReLU / Leaky ReLU. Accounts for ReLU zeroing half the neurons (halves effective fan-in): $\text{Var}(W) = \dfrac{2}{n_\text{in}}$
For Leaky ReLU with slope $a$: $\text{Var}(W) = \dfrac{2}{(1 + a^2)\,n_\text{in}}$
Default in PyTorch for Conv and Linear layers. Always prefer over Xavier when using ReLU variants.
Orthogonal Init
Initialize the weight matrix as a random orthogonal matrix ($W^TW = I$), e.g. via the QR decomposition of a random Gaussian matrix.
Preserves gradient norms exactly at initialization. Especially useful for RNNs (prevents vanishing/exploding in recurrent weights). Also used in deep linear networks research.
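The effect is easy to verify by pushing random data through a deep ReLU stack; with He scaling the activation scale stays order-1 (depth and width illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 30
x = rng.normal(size=(1000, n))
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # He: Var(w) = 2 / n_in
    x = np.maximum(0.0, x @ W)
print(x.std())  # order 1 after 30 layers: neither vanished nor exploded
```

Replacing the He scale with, say, `np.sqrt(1.0 / n)` makes the activations shrink by roughly $\sqrt{1/2}$ per layer and collapse toward zero by layer 30.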
Practical Initialization Rules
| Activation | Use | Gain |
|---|---|---|
| Linear | Xavier Normal | $1$ |
| tanh | Xavier Normal | $5/3$ |
| Sigmoid | Xavier Normal | $1$ |
| ReLU | He / Kaiming | $\sqrt{2}$ |
| Leaky ReLU | He / Kaiming | $\sqrt{2/(1+a^2)}$ |
| SELU | LeCun Normal | $1$ |