Some common definitions you will come across.
For a supervised learning problem:
We aim to find a function \(f(X; \theta)\), with parameters \(\theta\), such that:
\[y \approx f(X; \theta)\]
This means we want our model's predictions \(f(X; \theta)\) to be as close as possible to the true values \(y\).
\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2\]
Use case: Regression tasks. Measures how far the predicted values ( \(\hat{y_i}\) ) are from the actual values ( \(y_i\) ) by averaging the square of the differences. The larger the difference, the higher the penalty, as squaring the difference emphasizes bigger errors.
\[H(p,q) = -\sum_{x}p(x)\log q(x)\]
Use case: Classification tasks. Compares the true probability distribution ( \(p(x)\) ) (often 0 or 1 for classification) with the predicted probability ( \(q(x)\) ). If the predicted probability is far from the true label, it gives a higher penalty. Logarithms are used to give more emphasis to confident but wrong predictions.
```python
import numpy as np

# Actual and predicted values
y = np.array([3.5, 2.1, 4.0, 5.5, 6.1, 7.3, 3.9, 4.4, 5.0, 6.7])
y_hat = np.array([3.8, 2.0, 4.2, 5.0, 6.0, 7.5, 3.5, 4.2, 5.1, 6.8])

# Calculate MSE
n = len(y)
mse = (1 / n) * np.sum((y - y_hat) ** 2)
```
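As a quick sanity check, the same value can be computed with `np.mean`; the optional cross-check against scikit-learn is only a suggestion and assumes that library is installed.

```python
print(f"MSE: {mse:.4f}")

# Equivalent computation with np.mean
assert np.isclose(mse, np.mean((y - y_hat) ** 2))

# Optional cross-check, assuming scikit-learn is available:
# from sklearn.metrics import mean_squared_error
# assert np.isclose(mse, mean_squared_error(y, y_hat))
```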
```python
import numpy as np

# True labels (one-hot encoded) for 3 classes
p = np.array([
    [1, 0, 0],  # Class 1
    [0, 1, 0],  # Class 2
    [0, 0, 1],  # Class 3
    [1, 0, 0],  # Class 1
    [0, 1, 0],  # Class 2
])

# Predicted probabilities for Classes 1, 2, 3 (each row should sum to 1)
q = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
])

# Clip values to avoid log(0)
q = np.clip(q, 1e-12, 1 - 1e-12)

# Total cross-entropy, summed over all samples
cross_entropy = -np.sum(p * np.log(q))
```
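A small follow-up sketch using the arrays defined above: computing the cross-entropy per sample shows which predictions are penalized most, and dividing the total by the number of samples gives the mean loss that libraries typically report.

```python
# Cross-entropy per sample: -sum over classes of p * log(q)
per_sample = -np.sum(p * np.log(q), axis=1)
print(per_sample)  # samples 4 and 5 (true class predicted at only 0.6) cost the most
print("Mean cross-entropy:", per_sample.mean())
```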
Bias: Error from overly simplistic models (underfitting)
Variance: Error from models too sensitive to training data (overfitting)
Total Error = Bias² + Variance + Irreducible Error
Goal: Find the optimal balance between bias and variance for the lowest possible total error.
For a given point \(x\), the expected prediction error is:
\[E[(y - \hat{f}(x))^2] = \text{Var}(\hat{f}(x)) + [\text{Bias}(\hat{f}(x))]^2 + \text{Var}(\epsilon)\]
Where:
- \(\text{Var}(\hat{f}(x))\) is the variance
- \([\text{Bias}(\hat{f}(x))]^2\) is the squared bias
- \(\text{Var}(\epsilon)\) is the irreducible error
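To make the decomposition concrete, here is a minimal simulation sketch (the sine ground truth, polynomial degrees, sample sizes, and noise level are illustrative assumptions, not from the original): repeatedly refit a simple and a flexible model on fresh training sets and estimate the bias² and variance of the prediction at a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function for the simulation
    return np.sin(2 * np.pi * x)

x0 = 0.3                  # the single point x at which we decompose the error
n_sims, n_train, noise_sd = 500, 30, 0.3

for degree in (1, 9):     # degree 1: high bias; degree 9: high variance
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)   # fit a polynomial of the given degree
        preds[s] = np.polyval(coefs, x0)   # predict at the fixed point x0
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")

print(f"irreducible error (noise variance): {noise_sd ** 2:.4f}")
```

The simple model's error is dominated by bias², the flexible model's by variance; the noise variance is the floor neither can get below.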
Linear Regression \[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]
Logistic Regression (for binary classification) \[P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \dots + \beta_nx_n)}}\]
Elastic Net (adds a penalty that mixes the L1 and L2 terms) \[\lambda \sum_{p=1}^{P} \left[(1 - \alpha)\,|\beta_p| + \alpha\,\beta_p^2\right]\]
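A brief illustrative sketch of all three models, assuming scikit-learn is available; note that scikit-learn's `ElasticNet` parameterizes the penalty with `alpha` (overall strength) and `l1_ratio` (L1/L2 mix), which does not map one-to-one onto the \(\lambda\) and \(\alpha\) in the formula above. The synthetic data and coefficient values are made up for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coefs = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coefs + rng.normal(0, 0.1, size=200)

# Ordinary least squares
lin = LinearRegression().fit(X, y)
print("Linear regression coefficients:", lin.coef_.round(2))

# Elastic Net: alpha is the overall penalty strength, l1_ratio the L1/L2 mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net coefficients:      ", enet.coef_.round(2))  # shrunk, some near zero

# Logistic regression on a binarized version of the target
y_bin = (y > 0).astype(int)
logit = LogisticRegression().fit(X, y_bin)
print("P(y=1) for first 3 rows:", logit.predict_proba(X[:3])[:, 1].round(2))
```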
Categorical encoding transforms categorical features into a numerical format that models can understand.
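For example, one-hot encoding with pandas (a minimal sketch; the `color` and `size` columns are an assumed toy example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size": [1, 2, 3, 2]})

# Each category becomes its own indicator column (0/1 or boolean, depending on pandas version)
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```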
Example: Grid Search with cross-validation
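A minimal sketch of grid search with cross-validation, assuming scikit-learn; the estimator, parameter grid, and synthetic data are illustrative choices rather than anything prescribed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic classification data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Try several values of the inverse regularization strength C with 5-fold CV
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", round(grid.best_score_, 3))
```

`GridSearchCV` simply refits the model for every parameter combination on each fold and keeps the setting with the best average validation score.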