MST0052
## MST0052: Lecture 5

### Classification Methods

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1–3 | Foundations |
| **4–7** | **Core methods (you are here)** |
| 8–13 | Going further |
| 14–16 | Wrapping up |

---

## Today's plan

- From regression to classification: what changes?
- **Logistic regression**
- **k-nearest neighbours**
- **Naive Bayes**
- Evaluation: confusion matrix, precision, recall, F1

---

## Regression vs classification

- **Regression:** predict a continuous number (house price, temperature)
- **Classification:** predict a **category** (churn/no churn, spam/not spam, species)

Same pipeline pattern from L3. Different model, different metrics.

---

## Binary vs multi-class

- **Binary:** two classes (positive/negative, 0/1, yes/no), the most common case in practice
- **Multi-class:** three or more classes (species, product category, digit)

Today's methods handle both. We focus on binary for clarity.

---

## What does a classifier actually output?

- A **predicted class label** (0 or 1)
- Often also a **predicted probability**: how confident is the model?

The probability is more useful than the label:

> "This customer has a 73% chance of churning" is more informative than "this customer will churn."

---

## The baseline: majority class predictor

- Always predict the most common class
- If 70% of customers don't churn, this baseline gets **70% accuracy** for free
- **Every classifier must beat this**

---

## From linear regression to logistic regression

Linear regression predicts a number. For classification, we need a probability in [0, 1].

**Logistic regression** maps a linear score through the sigmoid function:

$$\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta\_0 + x^\top\beta)}}$$

Same weighted sum as linear regression, squeezed through a curve.

---

## The sigmoid function

Score 0 → probability 0.5. Large positive → near 1.0. Large negative → near 0.0. Default threshold 0.5 sets the class label.
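
A minimal sketch of this mapping in plain Python. The coefficients `beta_0` and `beta` and the input `x` are made-up values for illustration, not fitted to any data.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and input (illustrative only, not fitted)
beta_0 = -1.0
beta = [0.8, -0.5]
x = [2.0, 1.0]

# Linear score: the same weighted sum as in linear regression
score = beta_0 + sum(b * x_j for b, x_j in zip(beta, x))
prob = sigmoid(score)        # Pr(y = 1 | x)
label = int(prob >= 0.5)     # default threshold of 0.5

print(f"score={score:.2f}, prob={prob:.3f}, label={label}")
```

A score just above 0 gives a probability just above 0.5, so the default threshold assigns class 1 here.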

---

## Interpreting the coefficients

- Coefficients work on the **log-odds** scale
- $\beta\_j > 0$: increasing $x\_j$ increases the probability of class 1
- $\beta\_j < 0$: increasing $x\_j$ decreases the probability
- The magnitude tells you how strongly the feature pushes the prediction (comparable across features only when they share a scale, e.g. after standardising)

---

## Logistic regression in scikit-learn

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale first, then fit the (L2-regularised) logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)              # class labels
y_prob = pipe.predict_proba(X_test)[:, 1]  # probability of class 1
```

- `predict` → class labels
- `predict_proba` → probabilities

---

## Regularisation in logistic regression

- scikit-learn's `LogisticRegression` is **regularised by default** (L2, `C=1.0`)
- `C` is the inverse of $\lambda$: smaller `C` = stronger penalty
- Same idea as ridge (L4): prevents overfitting with many features
- Can switch to L1 (`penalty='l1'`, which needs `solver='liblinear'` or `solver='saga'`) for feature selection

---

## When to use logistic regression

- Strong baseline for **any** binary classification problem
- Transparent, fast, probabilistic
- Works best when the relationship is roughly linear in log-odds
- **Always try it first**, even if you plan to use something fancier

---

## k-nearest neighbours: let neighbours vote

For a new point:

1. Find the $k$ closest training observations
2. Let them **vote**: the majority class wins

- No real training phase (the model just stores the data): all work happens at prediction time
- Simple and intuitive, with no hidden machinery

---

## k is the tuning parameter

- **Small k** (e.g., 1): very flexible, follows noise, high variance
- **Large k** (e.g., 50): very smooth, may miss local patterns, high bias
- Choose $k$ by **cross-validation**, using the same `GridSearchCV` pattern from L4

k-NN is a bias-variance lesson in one parameter. (More on this in L6.)

---

## k-NN requires scaling

k-NN uses **distances** (usually Euclidean) to find neighbours.
If features are on different scales, the large-scale feature **dominates**.

**Always standardise before k-NN.**

---

## k-NN in scikit-learn

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Step names in make_pipeline are the lowercased class names
param_grid = {'kneighborsclassifier__n_neighbors': [3, 5, 11, 21]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_}")
```

---

## When to use k-NN

- Good for small datasets with clear cluster structure
- No assumptions about the shape of the decision boundary
- Struggles with **high-dimensional data** (curse of dimensionality)
- Slow at prediction time for large datasets

Best role: a non-parametric sanity check alongside logistic regression.

---

## Naive Bayes: Bayes' rule with independence

$$P(c \mid x) \propto P(c) \prod\_{j=1}^{p} P(x\_j \mid c)$$

- Assumes each feature contributes **independently** to the class probability
- Almost never literally true, but often works surprisingly well
- "Naive" refers to the independence assumption

---

## Variants of naive Bayes

| Variant | Assumes | Good for |
|---------|---------|----------|
| **GaussianNB** | Features are normally distributed | Continuous tabular data |
| **MultinomialNB** | Count data | Text (word counts, TF-IDF) |
| **BernoulliNB** | Binary features | Binary text features |

In this course, `GaussianNB` is the default for tabular data.
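
The product rule behind the Gaussian variant can be checked by hand. A minimal sketch, assuming two classes and two features; the priors and the per-class means and standard deviations in `priors` and `params` are made-up numbers, not estimates from any dataset.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical class priors and per-feature (mean, std) pairs
priors = {0: 0.7, 1: 0.3}
params = {
    0: [(1.0, 1.0), (5.0, 2.0)],
    1: [(3.0, 1.0), (9.0, 2.0)],
}

def posterior(x):
    """P(c | x) under the independence assumption, normalised to sum to 1."""
    unnormalised = {}
    for c, prior in priors.items():
        likelihood = 1.0
        for x_j, (mean, std) in zip(x, params[c]):
            likelihood *= gaussian_pdf(x_j, mean, std)  # product over features
        unnormalised[c] = prior * likelihood
    total = sum(unnormalised.values())
    return {c: value / total for c, value in unnormalised.items()}

post = posterior([2.5, 8.0])
print(post)
```

In this example class 1 ends up more probable despite its smaller prior, because both feature values sit closer to its means.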

---

## Naive Bayes in scikit-learn

```python
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # class labels
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
```

- **No scaling needed** (each feature modelled independently)
- **Essentially no hyperparameters** to tune (`GaussianNB` only exposes `priors` and `var_smoothing`, rarely changed)
- Very fast: fits and predicts in milliseconds

---

## When to use naive Bayes

- Strong for **text classification** (spam, sentiment, topic)
- Good as a fast, cheap baseline for any classification problem
- Probabilities are often **poorly calibrated** (too confident)
- Often works surprisingly well even when the independence assumption is violated

---

## Decision boundaries

Same data, three different ways to carve up the space.

---

## What the boundaries tell you

- **Logistic regression:** linear boundary. Simple, can miss curves
- **k-NN:** adapts locally. Flexible, can overfit
- **Naive Bayes:** smooth curved boundaries from feature distributions

The method that wins **depends on your data.** That's why you compare.

---

## Why accuracy is not enough

If 95% of patients don't have the disease, a model that always predicts "healthy" gets **95% accuracy.**

But it **misses every sick patient.** That's useless.

Accuracy is misleading whenever classes are **imbalanced.**

---

## The confusion matrix

| | Predicted positive | Predicted negative |
|--|---|---|
| **Actually positive** | True positive (TP) | False negative (FN) |
| **Actually negative** | False positive (FP) | True negative (TN) |

Every threshold-based metric in this lecture (accuracy, precision, recall, F1) is a function of these four numbers.

---

## Precision and recall

- **Precision:** of predicted positives, how many are correct?

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

- **Recall (sensitivity):** of actual positives, how many did we find?

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

Raising the threshold typically increases precision and decreases recall (recall can never go up).
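
The two formulas and the threshold trade-off can be sketched in plain Python. The labels and probabilities below are made up for illustration, and returning 0.0 for an empty denominator is a convention assumed for this sketch.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for the positive class (label 1) at a threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for eight observations
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.35, 0.7, 0.4, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.75):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, moving the threshold from 0.3 to 0.75 trades recall for precision at every step.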

---

## F1 score

Harmonic mean of precision and recall:

$$F\_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

- A single number that balances both concerns
- For imbalanced classes, F1 is almost always more informative than accuracy
- **Default classification metric in this course** unless the problem clearly favours precision or recall

---

## ROC curve and AUC

- **ROC curve:** true positive rate vs false positive rate at every threshold
- **AUC** (area under the curve): 1.0 = perfect, 0.5 = random
- Measures the model's ability to **rank** positives above negatives
- Good for comparing models when you haven't chosen a threshold yet

---

## classification\_report in scikit-learn

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = pipe.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows = true class, columns = predicted
print(classification_report(y_test, y_pred))
```

- Gives precision, recall, F1 **per class** plus macro/weighted averages
- Always print the confusion matrix alongside it

---

## Choosing the right metric

| Situation | Recommended metric |
|-----------|--------------------|
| Balanced classes | Accuracy or F1 |
| Imbalanced classes | F1, precision-recall, or AUC |
| Cost asymmetry (e.g., missing fraud is worse) | Favour recall |
| Ranking matters (e.g., prioritise likely churners) | AUC |

**State your choice and justify it in the report.**

---

## Worked example: breast cancer dataset

- Built into scikit-learn: `load_breast_cancer()`
- **Target:** malignant (0) vs benign (1) in scikit-learn's encoding
- With this encoding, `scoring='f1'` treats benign (label 1) as the positive class
- 30 numeric features (cell nucleus measurements)
- 569 observations. Slightly imbalanced (~63% benign).
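
The later slides use `X_train`, `X_test`, `y_train`, and `y_test` without showing where they come from. One way to create them is sketched below; the `test_size`, `stratify`, and `random_state` settings are assumptions for illustration, not necessarily the split used in the course.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# Class names in label order (index 0 is label 0, index 1 is label 1)
print(data.target_names)

# Stratify so both splits keep roughly the same class balance;
# test_size and random_state are arbitrary choices for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Printing `target_names` before modelling is a cheap way to confirm which label is the positive class for metrics such as F1.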

---

## Build three pipelines

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling lives inside the pipeline so it is refit on each CV training fold
models = {
    'Logistic': make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    'k-NN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    'NaiveBayes': GaussianNB(),
}
```

Same data, same splits, same metric: **fair comparison.**

---

## Compare with cross-validation

```python
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")
```

Expected: all three perform well (~0.95+ F1). Logistic regression slightly ahead.

---

## Evaluate the best on the test set

```python
from sklearn.metrics import classification_report, confusion_matrix

best = models['Logistic']
best.fit(X_train, y_train)   # refit on the full training set
y_pred = best.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Walk through the confusion matrix: **how many malignant cases did we miss?** Rows are true classes in sorted label order, so check the dataset's `target_names` to see which row is malignant.

---

## Lessons from the comparison

- All three classifiers are competitive on clean, well-separated data
- **Logistic regression** is the most stable and interpretable
- **k-NN** is sensitive to $k$ and feature scaling
- **Naive Bayes** is the fastest but probabilities are poorly calibrated

The method that wins **depends on your data.** That's why you compare.

---

## Summary

- **Logistic regression:** transparent, probabilistic, regularised by default. Always try first.
- **k-NN:** simple, non-parametric, needs scaling and tuned $k$
- **Naive Bayes:** fast, probabilistic, good for text
- Evaluate with **confusion matrix, precision, recall, F1**, not just accuracy
- Compare models under the **same resampling rule**

---

## Before Lecture 6

- Try all three classifiers on **your own dataset** (if it's classification)
- If your project is regression, try the L4 pipeline instead
- Read ahead: Lecture 6 is the **bias-variance tradeoff**, the most important concept in the course

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## Class imbalance strategies

- `class_weight='balanced'` in logistic regression: adjusts loss by class frequency
- **Oversampling** (SMOTE, from the separate `imbalanced-learn` package): generate synthetic minority examples
- **Undersampling:** randomly drop majority examples
- **Threshold tuning:** adjust the decision threshold instead of resampling

Start with `class_weight='balanced'`. It's a one-parameter change.

--

## The precision-recall curve

- An alternative to the ROC curve for **heavily imbalanced** datasets
- Plots precision vs recall at every threshold
- More informative than ROC when the positive class is rare

```python
from sklearn.metrics import PrecisionRecallDisplay

# Uses the fitted pipeline's scores on the test set
PrecisionRecallDisplay.from_estimator(pipe, X_test, y_test)
```

--

## Multi-class classification

- scikit-learn handles multi-class **automatically** for most classifiers
- Under the hood: some models are natively multi-class (k-NN, naive Bayes, multinomial logistic regression); others use **one-vs-rest** (OvR) or **one-vs-one** (OvO)
- `classification_report` shows metrics **per class**
- Use `scoring='f1_macro'` or `scoring='f1_weighted'` in cross-validation

--

## Probability calibration

Some classifiers output probabilities that don't match observed frequencies:

- **Naive Bayes:** independence assumption pushes probabilities toward 0 or 1
- **Tree-based models:** discrete leaves produce coarse probabilities
- **SVM:** `decision_function` is a margin, not a probability

If you care about the **probability**, not just the label,
calibrate:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Wrap the classifier; calibration is fit on held-out folds
calibrated = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
y_prob = calibrated.predict_proba(X_test)[:, 1]
```

Check with a **reliability diagram** (`CalibrationDisplay`): does a predicted 0.7 actually correspond to ~70% positives?

--

## The decision threshold

The default threshold is 0.5, but it's not always the right choice.

```python
y_prob = pipe.predict_proba(X_test)[:, 1]

threshold = 0.3  # lower threshold = more positives
y_pred_custom = (y_prob >= threshold).astype(int)
```

Choose the threshold based on the **cost of errors** in your problem.

---

## What's next

**Lecture 6:** Bias-variance tradeoff

- Why training error is not enough
- The U-shaped test error curve
- How model complexity relates to generalisation