Classification Methods

## MST0052 — Lecture 5

### Classification Methods

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1–3 | Foundations |
| **4–7** | **Core methods — you are here** |
| 8–13 | Going further |
| 14–16 | Wrapping up |

---

## Today's plan

- From regression to classification — what changes?
- **Logistic regression**
- **k-nearest neighbours**
- **Naive Bayes**
- Evaluation: confusion matrix, precision, recall, F1

---

## Regression vs classification

- **Regression:** predict a continuous number (house price, temperature)
- **Classification:** predict a **category** (churn/no churn, spam/not spam, species)

Same pipeline pattern from L3. Different model, different metrics.

---

## Binary vs multi-class

- **Binary:** two classes (positive/negative, 0/1, yes/no) — most common in practice
- **Multi-class:** three or more classes (species, product category, digit)

Today's methods handle both. We focus on binary for clarity.

---

## What does a classifier actually output?

- A **predicted class label** (0 or 1)
- Often also a **predicted probability** — how confident is the model?

The probability is more useful than the label:

> "This customer has a 73% chance of churning" is more informative than "this customer will churn."

---

## The baseline: majority class predictor

- Always predict the most common class
- If 70% of customers don't churn, this baseline gets **70% accuracy** — for free
- **Every classifier must beat this**
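
As a minimal sketch of this baseline, the snippet below uses made-up churn labels (the 70/30 split and the variable names are illustrative assumptions, not course data) to show that the "model" only has to remember one class.

```python
from collections import Counter

# Hypothetical labels: 1 = churn, 0 = no churn (70% of customers do not churn).
y_train = [0] * 70 + [1] * 30
y_test = [0] * 14 + [1] * 6

# "Fit": remember the most common class in the training labels.
majority_class = Counter(y_train).most_common(1)[0][0]

# "Predict": return that class for every test observation.
y_pred = [majority_class] * len(y_test)

baseline_accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

scikit-learn's `DummyClassifier(strategy='most_frequent')` provides the same baseline as an estimator, so it can go through the same cross-validation code as the real models.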

---

## From linear regression to logistic regression

Linear regression predicts a number. For classification, we need a probability in [0, 1].

**Logistic regression** maps a linear score through the sigmoid function:

$$\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + x^\top\beta)}}$$

Same weighted sum as linear regression, squeezed through a curve.

---

## The sigmoid function

![Sigmoid curve mapping linear scores to probabilities](/figures/sigmoid.svg)

Score 0 → probability 0.5. Large positive → near 1.0. Large negative → near 0.0. Default threshold 0.5 sets the class label.
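
The mapping from score to probability to label can be sketched in a few lines of plain Python. The example scores are arbitrary illustrations.

```python
import math


def sigmoid(score: float) -> float:
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


for score in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    prob = sigmoid(score)
    label = int(prob >= 0.5)  # default threshold of 0.5
    print(f"score {score:+.1f} -> probability {prob:.3f} -> label {label}")
```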

---

## Interpreting the coefficients

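- Coefficients work on the **log-odds** scale: a one-unit increase in $x_j$ adds $\beta_j$ to the log-odds, i.e. multiplies the odds by $e^{\beta_j}$
- $\beta_j > 0$: increasing $x_j$ increases the probability of class 1
- $\beta_j < 0$: increasing $x_j$ decreases the probability
- The magnitude tells you how strongly the feature pushes the prediction, but magnitudes are only comparable across features on the same scale (e.g. after standardising)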
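
To make the log-odds scale concrete, here is a small numeric check with hypothetical coefficients (`beta_0` and `beta_j` are invented for illustration). Moving the feature by one unit should multiply the odds by `exp(beta_j)`, whatever the starting point.

```python
import math


def sigmoid(score: float) -> float:
    return 1.0 / (1.0 + math.exp(-score))


def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical intercept and a single coefficient.
beta_0, beta_j = -1.0, 0.8

p_before = sigmoid(beta_0 + beta_j * 2.0)  # feature value x_j = 2
p_after = sigmoid(beta_0 + beta_j * 3.0)   # feature value x_j = 3

odds_ratio = odds(p_after) / odds(p_before)
print(f"P(y=1) moves from {p_before:.3f} to {p_after:.3f}")
print(f"Odds ratio: {odds_ratio:.3f}, exp(beta_j): {math.exp(beta_j):.3f}")
```

The probability change depends on where you start on the curve, but the odds ratio does not, which is why coefficients are usually reported as odds ratios.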

---

## Logistic regression in scikit-learn

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
y_prob = pipe.predict_proba(X_test)[:, 1]
```

- `predict` → class labels
- `predict_proba` → probabilities

---

## Regularisation in logistic regression

- scikit-learn's `LogisticRegression` is **regularised by default** (L2, `C=1.0`)
- `C` is the inverse of $\lambda$: smaller `C` = stronger penalty
- Same idea as ridge (L4): prevents overfitting with many features
- Can switch to L1 (`penalty='l1'`) for feature selection; this needs a solver that supports it, such as `solver='liblinear'` or `solver='saga'`
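
A rough sketch of the effect of `C` under an L1 penalty, on synthetic data from `make_classification` (the dataset, the two `C` values and the `n_zero` dictionary are illustrative assumptions). It assumes a scikit-learn version that accepts `penalty='l1'` with the `liblinear` solver; newer releases may prefer a different spelling of the penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 4 of which carry signal.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

n_zero = {}
for C in [0.01, 1.0]:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty='l1', solver='liblinear', C=C),
    )
    pipe.fit(X, y)
    coef = pipe[-1].coef_.ravel()          # coefficients of the final step
    n_zero[C] = int(np.sum(coef == 0.0))   # features dropped by the penalty
    print(f"C={C}: {n_zero[C]} of {coef.size} coefficients are exactly zero")
```

Smaller `C` means a stronger penalty, so more coefficients should be pushed to exactly zero.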

---

## When to use logistic regression

- Strong baseline for **any** binary classification problem
- Transparent, fast, probabilistic
- Works best when the relationship is roughly linear in log-odds
- **Always try it first** — even if you plan to use something fancier

---

## k-nearest neighbours: let neighbours vote

For a new point:

1. Find the $k$ closest training observations
2. Let them **vote** — the majority class wins

- Essentially no training phase: `fit` just stores the training data, and the real work happens at prediction time
- Simple and intuitive — no hidden machinery
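
The two steps above can be sketched from scratch in plain Python. The toy points and the helper name `knn_predict` are hypothetical; in practice you would use `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: distance from x_new to every training point, sorted ascending.
    distances = sorted(
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    )
    # Step 2: the k closest labels vote.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two small, well-separated groups (made-up data).
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
y_train = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, (1.1, 1.0), k=3))
print(knn_predict(X_train, y_train, (2.9, 3.1), k=3))
```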

---

## k is the tuning parameter

- **Small k** (e.g., 1): very flexible, follows noise, high variance
- **Large k** (e.g., 50): very smooth, may miss local patterns, high bias
- Choose $k$ by **cross-validation** — same `GridSearchCV` pattern from L4

k-NN is a bias-variance lesson in one parameter. (More on this in L6.)

---

## k-NN requires scaling

k-NN uses **distances** (usually Euclidean) to find neighbours.

If features are on different scales, the large-scale feature **dominates**.

**Always standardise before k-NN.**
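
A small illustration of why this matters, using invented (income, age) points. On the raw features, income differences swamp age differences, so the nearest neighbour changes once both features are standardised.

```python
import math
import statistics

# Hypothetical training points: (annual income, age).
X_train = [(50_000, 62), (53_000, 31), (40_000, 45), (60_000, 50)]
x_new = (51_000, 30)


def nearest_index(points, query):
    """Index of the point closest to the query (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))


# Per-feature mean and population standard deviation, from training data only.
columns = list(zip(*X_train))
means = [statistics.fmean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]


def standardise(point):
    return tuple((v - m) / s for v, m, s in zip(point, means, stds))


raw_nearest = nearest_index(X_train, x_new)
scaled_nearest = nearest_index([standardise(p) for p in X_train],
                               standardise(x_new))

print(f"Nearest on raw features:          point {raw_nearest}")
print(f"Nearest on standardised features: point {scaled_nearest}")
```

On the raw scale the 62-year-old with a similar income is "closest"; after standardising, the 31-year-old is.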

---

## k-NN in scikit-learn

```python
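from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling lives inside the pipeline, so each CV fold is standardised
# using only its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
# make_pipeline names each step after its lowercased class name.
param_grid = {'kneighborsclassifier__n_neighbors': [3, 5, 11, 21]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['kneighborsclassifier__n_neighbors']}")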
```

---

## When to use k-NN

- Good for small datasets with clear cluster structure
- No assumptions about the shape of the decision boundary
- Struggles with **high-dimensional data** (curse of dimensionality)
- Slow at prediction time for large datasets

Best role: a non-parametric sanity check alongside logistic regression.

---

## Naive Bayes: Bayes' rule with independence

$$P(c \mid x) \propto P(c) \prod_{j=1}^{p} P(x_j \mid c)$$

- Assumes each feature contributes **independently** to the class probability
- Almost never literally true — but often works surprisingly well
- "Naive" refers to the independence assumption
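
The formula can be evaluated by hand for a Gaussian naive Bayes model. In this sketch the priors and the per-class means and standard deviations are invented; a fitted `GaussianNB` would estimate them from the training data.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as P(x_j | c)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical model: class priors and (mean, std) per feature and class.
priors = {0: 0.7, 1: 0.3}
params = {
    0: [(2.0, 1.0), (5.0, 2.0)],
    1: [(4.0, 1.0), (9.0, 2.0)],
}
x_new = [3.5, 8.0]

# P(c) times the product of per-feature likelihoods (the independence assumption).
unnormalised = {}
for c, prior in priors.items():
    likelihood = math.prod(
        gaussian_pdf(x, mean, std) for x, (mean, std) in zip(x_new, params[c])
    )
    unnormalised[c] = prior * likelihood

# Normalise so the class probabilities sum to 1.
total = sum(unnormalised.values())
posterior = {c: value / total for c, value in unnormalised.items()}
predicted = max(posterior, key=posterior.get)

print({c: round(p, 3) for c, p in posterior.items()}, "->", predicted)
```

Class 1 wins here despite its smaller prior, because both feature values sit much closer to its means.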

---

## Variants of naive Bayes

| Variant | Assumes | Good for |
|---------|---------|----------|
| **GaussianNB** | Features are normally distributed | Continuous tabular data |
| **MultinomialNB** | Count data | Text (word counts, TF-IDF) |
| **BernoulliNB** | Binary features | Binary text features |

In this course, `GaussianNB` is the default for tabular data.

---

## Naive Bayes in scikit-learn

```python
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
```

- **No scaling needed** (each feature modelled independently)
- **Almost no hyperparameters** to tune (`GaussianNB` has only `var_smoothing`, which rarely needs changing)
- Very fast: fitting only needs a mean and a variance per feature and class

---

## When to use naive Bayes

- Strong for **text classification** (spam, sentiment, topic)
- Good as a fast, cheap baseline for any classification problem
- Probabilities are often **poorly calibrated** (too confident)
- Works surprisingly well even when the independence assumption is violated

---

## Decision boundaries

![Decision boundaries](/figures/decision-boundary.svg)

Same data, three different ways to carve up the space.

---

## What the boundaries tell you

- **Logistic regression:** linear boundary — simple, can miss curves
- **k-NN:** adapts locally — flexible, can overfit
- **Naive Bayes:** smooth curved boundaries from feature distributions

The method that wins **depends on your data.** That's why you compare.

---

## Why accuracy is not enough

If 95% of patients don't have the disease, a model that always predicts "healthy" gets **95% accuracy.**

But it **misses every sick patient.** That's useless.

Accuracy is misleading whenever classes are **imbalanced.**

---

## The confusion matrix

|  | Predicted positive | Predicted negative |
|--|---|---|
| **Actually positive** | True positive (TP) | False negative (FN) |
| **Actually negative** | False positive (FP) | True negative (TN) |

Every evaluation metric is a function of these four numbers.
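
A minimal sketch of how the four cells are counted, using ten made-up labels and predictions:

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives we caught
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives we missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
tn = sum(t == 0 and p == 0 for t, p in pairs)  # negatives we left alone

accuracy = (tp + tn) / len(pairs)
print(f"TP={tp} FN={fn} FP={fp} TN={tn} accuracy={accuracy:.2f}")
```

One thing to watch: scikit-learn's `confusion_matrix` orders rows and columns by sorted label, so with 0/1 labels it returns `[[TN, FP], [FN, TP]]`, the reverse of the layout in the table above.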

---

## Precision and recall

- **Precision:** of predicted positives, how many are correct?

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

- **Recall (sensitivity):** of actual positives, how many did we find?

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

Raising the threshold usually increases precision and never increases recall; lowering it does the reverse.
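
The trade-off can be seen on a toy example. The labels, probabilities and the helper `precision_recall` below are hypothetical.

```python
# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]


def precision_recall(y_true, y_prob, threshold):
    """Precision and recall after thresholding the probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in [0.25, 0.5, 0.75]:
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```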

---

## F1 score

Harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

- A single number that balances both concerns
- For imbalanced classes, F1 is almost always more informative than accuracy
- **Default classification metric in this course** unless the problem clearly favours precision or recall
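
A quick numeric check of why the harmonic mean is used, with invented precision and recall values. A model that is very precise but finds almost nothing gets a low F1, whereas the ordinary average would look respectable.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical lopsided model: high precision, very low recall.
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1:              {f1:.2f}")
```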

---

## ROC curve and AUC

- **ROC curve:** true positive rate vs false positive rate at every threshold
- **AUC** (area under the curve): 1.0 = perfect, 0.5 = random
- Measures the model's ability to **rank** positives above negatives
- Good for comparing models when you haven't chosen a threshold yet
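
The ranking interpretation can be checked directly: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counted as half. The sketch below uses made-up scores and a hypothetical helper `pairwise_auc`.

```python
def pairwise_auc(y_true, y_score):
    """Share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(y_true, y_prob)
print(f"AUC = {auc:.3f}")
```

No threshold appears anywhere in the calculation, which is why AUC is useful before a threshold has been chosen. `sklearn.metrics.roc_auc_score(y_true, y_prob)` computes the same quantity.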

---

## classification\_report in scikit-learn

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = pipe.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

- Gives precision, recall, F1 **per class** plus macro/weighted averages
- Always print the confusion matrix alongside it

---

## Choosing the right metric

| Situation | Recommended metric |
|-----------|--------------------|
| Balanced classes | Accuracy or F1 |
| Imbalanced classes | F1, precision-recall, or AUC |
| Cost asymmetry (e.g., missing fraud is worse) | Favour recall |
| Ranking matters (e.g., prioritise likely churners) | AUC |

**State your choice and justify it in the report.**

---

## Worked example: breast cancer dataset

- Built into scikit-learn: `load_breast_cancer()`
- **Target:** malignant (0) vs benign (1). scikit-learn encodes benign as 1, so benign is the default positive class for `scoring='f1'`
- 30 numeric features (cell nucleus measurements)
- 569 observations. Slightly imbalanced (~63% benign).
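
A minimal sketch of loading the data and creating the `X_train`/`X_test` split that the later snippets rely on. The 80/20 split, `stratify=y` and `random_state=42` are assumptions chosen for illustration. Printing `target_names` is a quick way to confirm which integer label corresponds to which diagnosis.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# Check the shape and how the classes are encoded before modelling.
class_counts = {str(name): int((y == i).sum())
                for i, name in enumerate(data.target_names)}
print(X.shape)
print(class_counts)

# Stratify so both parts keep roughly the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```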

---

## Build three pipelines

```python
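from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    # Scaled inputs; max_iter raised so the solver has room to converge
    'Logistic': make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=5000)),
    # Distance-based, so scaling is required
    'k-NN':     make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=5)),
    # Models each feature separately, so no scaler is needed
    'NaiveBayes': GaussianNB(),
}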
```

Same data, same splits, same metric — **fair comparison.**

---

## Compare with cross-validation

```python
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train,
                             cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")
```

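Expected: all three perform well (~0.95+ F1). Logistic regression slightly ahead. Note that `scoring='f1'` scores label 1 as the positive class, which is benign in scikit-learn's encoding of this dataset.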

---

## Evaluate the best on the test set

```python
best = models['Logistic']
best.fit(X_train, y_train)
y_pred = best.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

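Walk through the confusion matrix: **how many malignant cases did we miss?** In scikit-learn's encoding malignant is label 0, so missed malignant cases sit in the first row, second column (actual 0, predicted 1).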

---

## Lessons from the comparison

- All three classifiers are competitive on clean, well-separated data
- **Logistic regression** is the most stable and interpretable
- **k-NN** is sensitive to $k$ and feature scaling
- **Naive Bayes** is the fastest but probabilities are poorly calibrated

The method that wins **depends on your data.** That's why you compare.

---

## Summary

- **Logistic regression:** transparent, probabilistic, regularised by default. Always try first.
- **k-NN:** simple, non-parametric, needs scaling and tuned $k$
- **Naive Bayes:** fast, probabilistic, good for text
- Evaluate with **confusion matrix, precision, recall, F1** — not just accuracy
- Compare models under the **same resampling rule**

---

## Before Lecture 6

- Try all three classifiers on **your own dataset** (if it's classification)
- If your project is regression, try the L4 pipeline instead
- Read ahead: Lecture 6 is the **bias-variance tradeoff** — the most important concept in the course

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## Class imbalance strategies

- `class_weight='balanced'` in logistic regression — adjusts loss by class frequency
- **Oversampling** (e.g. SMOTE, from the separate `imbalanced-learn` package): generate synthetic minority examples
- **Undersampling:** randomly drop majority examples
- **Threshold tuning:** adjust the decision threshold instead of resampling

Start with `class_weight='balanced'` — it's one parameter change.
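
To see what `'balanced'` does, the weights can be reproduced by hand with the formula `n_samples / (n_classes * class_count)`. The 90/10 label split below is a made-up example.

```python
from collections import Counter

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_train = [0] * 90 + [1] * 10

counts = Counter(y_train)
n_samples, n_classes = len(y_train), len(counts)

# Rare classes get proportionally larger weights in the loss.
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print(class_weight)
```

With these weights each class contributes the same total weight to the loss, so the minority class is no longer drowned out.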

## The precision-recall curve

- An alternative to the ROC curve for **heavily imbalanced** datasets
- Plots precision vs recall at every threshold
- More informative than ROC when the positive class is rare

```python
from sklearn.metrics import PrecisionRecallDisplay
PrecisionRecallDisplay.from_estimator(pipe, X_test, y_test)
```

## Multi-class classification

- scikit-learn handles multi-class **automatically** for most classifiers
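- Some classifiers are multi-class by nature (k-NN, naive Bayes, multinomial logistic regression); others are wrapped in **one-vs-rest** (OvR) or **one-vs-one** (OvO)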
- `classification_report` shows metrics **per class**
- Use `scoring='f1_macro'` or `scoring='f1_weighted'` in cross-validation

## Probability calibration

Some classifiers output probabilities that don't match observed frequencies:

- **Naive Bayes** — independence assumption pushes probabilities toward 0 or 1
- **Tree-based models** — discrete leaves produce coarse probabilities
- **SVM** — `decision_function` is a margin, not a probability

If you care about the **probability**, not just the label, calibrate:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

calibrated = CalibratedClassifierCV(GaussianNB(),
                                    method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
y_prob = calibrated.predict_proba(X_test)[:, 1]
```

Check with a **reliability diagram** (`CalibrationDisplay`): does a predicted 0.7 actually correspond to ~70% positives?

## The decision threshold

The default threshold is 0.5 — but it's not always the right choice.

```python
y_prob = pipe.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower threshold = more positives
y_pred_custom = (y_prob >= threshold).astype(int)
```

Choose the threshold based on the **cost of errors** in your problem.
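
One way to make "cost of errors" operational is to attach a cost to each error type and pick the threshold with the lowest total. The costs, candidate thresholds and data below are all hypothetical. In a real project the scan would be run on validation data, not on the test set.

```python
# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

# Assumed costs: a missed positive is five times worse than a false alarm.
COST_FN, COST_FP = 5.0, 1.0


def total_cost(threshold):
    """Total cost of the errors made at a given threshold."""
    fn = sum(t == 1 and p < threshold for t, p in zip(y_true, y_prob))
    fp = sum(t == 0 and p >= threshold for t, p in zip(y_true, y_prob))
    return COST_FN * fn + COST_FP * fp


candidates = [0.25, 0.5, 0.75]
costs = {threshold: total_cost(threshold) for threshold in candidates}
best_threshold = min(costs, key=costs.get)

print(costs)
print(f"Lowest-cost threshold: {best_threshold}")
```

Because missed positives are expensive here, the lowest threshold wins even though it produces more false alarms.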

---

## What's next

**Lecture 6:** Bias-variance tradeoff

- Why training error is not enough
- The U-shaped test error curve
- How model complexity relates to generalisation