Ensemble Methods

<p class="kicker">MST0052 · Lecture 10 · Fall 2026</p>

## Ensemble Methods

### Vegard H. Larsen — BI Norwegian Business School

---

## Where we are

L9 found structure without labels. Today: back to a **target** — with the course's first **nonlinear** model family.

![Semester timeline with four phases](../figures/semester-timeline.svg)

---

A single deep tree **memorises**. Three hundred of them, averaged, **generalise**.

Lecture 6's tradeoff, cashed out in practice.

---

## Today's plan

<p class="kicker">90 minutes · one model family · three workflows</p>

- The **decision tree** — one rule at a time
- Why one tree is the textbook **high-variance** learner
- **Bagging** and **random forests** — averaging done right
- **Worked example:** tree → bagging → forest, with real numbers

---

## One tree

---

## A tree is a sequence of questions

![A depth-2 decision tree fitted on the wine dataset](../figures/decision-tree-example.svg)

Follow the answers down; predict the **leaf** you land in. No equations, no scaling.

---

## How splits are chosen

At each node: the (feature, threshold) pair that makes the children **purest**.

$$\text{Gini}(S) = 1 - \sum_c p_c^2$$

Growth is **greedy** — best split now, no look-ahead.

<p class="muted">Regression: same algorithm, within-node variance; leaves predict the mean.</p>

---
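
## Sketch: a greedy Gini split by hand

A minimal sketch of the split search on a toy one-feature dataset. Here $p_c$ is the share of class $c$ in a node, and each candidate threshold is scored by the size-weighted Gini of its two children. The helper names `gini` and `best_split` and the data are illustrative assumptions, not scikit-learn internals.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2, with p_c the share of class c."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Greedy search over one feature: try each midpoint between sorted
    distinct values and keep the threshold with the lowest weighted Gini."""
    values = sorted(set(xs))
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data (made up): the classes separate cleanly between 3.0 and 4.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

parent_gini = gini(ys)
threshold, child_gini = best_split(xs, ys)
print(f"parent Gini {parent_gini:.2f}; split at x <= {threshold} gives {child_gini:.2f}")
```

A real tree repeats this search over every feature at every node, then recurses into the two children.

---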

## Why a tree overfits

<div class="cols">
<div class="callout">
<h3>On the training data</h3>
<p>Keeps splitting until every leaf is pure. Error: zero.</p>
</div>
<div class="callout warn">
<h3>On a resample</h3>
<p>A slightly different sample grows a wildly different tree.</p>
</div>
</div>

The textbook **high-variance** learner from L6.

---

## Taming one tree

**Depth**   `max_depth`

**Pruning**   `ccp_alpha`

**Node size**   `min_samples_split` `min_samples_leaf`


All trade variance for bias — L6's dial in tree clothing.

Today's better answer: don't tame the tree — **average** it.

---

## Bagging

---

## The averaging idea

<div class="cols">
<div class="callout">
<h3>Bias</h3>
<p>Averaging inherits it — unchanged.</p>
</div>
<div class="callout">
<h3>Variance</h3>
<p>Averaging slashes it.</p>
</div>
</div>

We'd need many **independent** training sets. We have **one**.

---

Average $B$ independent trees: bias **unchanged**, variance **divided by $B$**.

Bootstrap trees aren't independent. The direction still holds.

---
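
## Sketch: variance divided by B

A small simulation of the independent case, assuming each "tree" is just the true value 0 plus unit-variance noise. The stand-in `noisy_prediction` and the sizes are illustrative choices; real bootstrap trees are correlated, so they fall short of this ideal.

```python
import random
import statistics

rng = random.Random(0)
B = 25            # ensemble size
n_repeats = 4000  # independent repetitions of the experiment


def noisy_prediction():
    """Stand-in for one tree: the true value 0 plus unit-variance noise."""
    return rng.gauss(0.0, 1.0)


single = [noisy_prediction() for _ in range(n_repeats)]
averaged = [
    statistics.fmean(noisy_prediction() for _ in range(B))
    for _ in range(n_repeats)
]

var_single = statistics.pvariance(single)
var_avg = statistics.pvariance(averaged)
print(f"one predictor:  variance {var_single:.3f}")
print(f"average of {B}: variance {var_avg:.4f} (1/B = {1 / B:.4f})")
print(f"mean of the averages: {statistics.fmean(averaged):+.3f} (bias unchanged)")
```

---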

## The bootstrap

<div class="cols">
<div class="stat stat--blue">
<span class="stat-value">63%</span>
<span class="stat-label">of the distinct rows appear in<br>each bootstrap sample</span>
</div>
<div class="stat">
<span class="stat-value">37%</span>
<span class="stat-label">left out —<br>out-of-bag (OOB)</span>
</div>
</div>

Sample $n$ rows **with replacement** — fake many datasets from the one you have.

---
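
## Sketch: where 63% and 37% come from

A quick check of the two numbers, assuming a hypothetical training set of 10,000 rows. The chance that a given row is missed by all $n$ draws is $(1 - 1/n)^n$, which tends to $1/e \approx 0.368$.

```python
import random

rng = random.Random(42)
n = 10_000  # rows in the (hypothetical) training set

# One bootstrap sample: n row indices drawn with replacement.
sample = [rng.randrange(n) for _ in range(n)]

in_bag = len(set(sample)) / n      # share of distinct rows that were drawn
oob = 1 - in_bag                   # share never drawn: out-of-bag
theory_oob = (1 - 1 / n) ** n      # tends to 1/e as n grows

print(f"in-bag {in_bag:.3f}, out-of-bag {oob:.3f}, theory {theory_oob:.3f}")
print(f"expected OOB trees per row in a 300-tree ensemble: {300 * theory_oob:.0f}")
```

---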

## Bagging in one slide

<p class="kicker">The recipe · Bootstrap AGGregating</p>

1. Draw $B$ bootstrap samples
2. Fit an **unpruned** tree to each — low bias
3. **Aggregate** — average (regression), majority vote (classification)

---
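
## Sketch: the recipe by hand

The three steps above, written out as a loop. This is a teaching sketch on synthetic data from `make_classification`, not how `BaggingClassifier` is implemented; the ensemble size of 51 is an arbitrary odd number chosen to avoid tied votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, a stand-in for any real dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B = 51
n = len(X_train)
trees = []
for _ in range(B):
    rows = rng.integers(0, n, size=n)               # 1. bootstrap sample
    tree = DecisionTreeClassifier(random_state=0)   # 2. unpruned tree
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# 3. aggregate: majority vote over the B trees (labels are 0/1)
votes = np.stack([tree.predict(X_test) for tree in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)

bagged_acc = (y_pred == y_test).mean()
single_acc = (trees[0].predict(X_test) == y_test).mean()
print(f"single bootstrap tree: {single_acc:.3f}, bagged vote: {bagged_acc:.3f}")
```

---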

## Out-of-bag — free validation

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

Every row has ~110 trees that **never saw it** — average their votes: a generalisation estimate with **no refitting**.

---

## Variance reduction, visible

![Decision boundary of a single deep tree vs a bagged ensemble](../figures/bagging-variance-reduction.svg)

Same data, same model family — one boundary **chases noise**, the other is **calm**.

---

## Random forests

---

## Bagging's weakness

<div class="cols">
<div class="callout">
<h3>Different rows</h3>
<p>Each tree gets its own bootstrap sample.</p>
</div>
<div class="callout warn">
<h3>Same splits</h3>
<p>The strongest features still top every tree.</p>
</div>
</div>

Correlated trees make correlated errors — and correlated errors **don't average away**.

---
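
## Sketch: the variance floor

A short derivation of why correlated errors do not average away. The notation is an assumption introduced here: $T_b$ is the prediction of tree $b$, every tree has variance $\sigma^2$, and every pair of trees has correlation $\rho$.

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \frac{1}{B^2}\left(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

With $\rho = 0$ this is the independent case, $\sigma^2 / B$. With $\rho > 0$, adding trees shrinks only the second term; the first term $\rho\,\sigma^2$ stays, so lowering $\rho$ is the only way to lower the floor.

---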

## The recipe — one extra coin flip

At every split: draw $m$ of the $p$ features at random — best split chosen **only among those $m$**.

<div class="cols">
<div class="callout">
<h3>Classification</h3>
<p>m = √p</p>
</div>
<div class="callout">
<h3>Regression</h3>
<p>m = p / 3</p>
</div>
</div>

Random feature subsets are the **only** difference from bagging.

---
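
## Sketch: the coin flip at each split

A minimal sketch of the feature draw, using $p = 30$ as in the breast-cancer data. Rounding $\sqrt{p}$ down is an assumption about rounding, and `candidate_features` is a hypothetical helper, not a scikit-learn function.

```python
import math
import random

p = 30                                   # number of features
m_classification = int(math.sqrt(p))     # rule of thumb sqrt(p), rounded down
m_regression = p // 3                    # rule of thumb p / 3

rng = random.Random(0)


def candidate_features(p, m):
    """Indices of the m features one split is allowed to consider."""
    return sorted(rng.sample(range(p), m))


print(f"m for classification: {m_classification}, for regression: {m_regression}")

# A fresh draw at every split, so consecutive splits see different subsets.
for split in range(3):
    print(f"split {split}: candidates {candidate_features(p, m_classification)}")
```

---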

## Why it works

<div class="cols">
<div class="callout">
<h3>Each tree</h3>
<p>Slightly worse — fewer features to pick from.</p>
</div>
<div class="callout">
<h3>The ensemble</h3>
<p>Meaningfully better — decorrelated errors average away.</p>
</div>
</div>

---

## In scikit-learn

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
```

<p class="muted">No scaler — trees don't care about units. The shortest pipeline in the course.</p>

---

## Key hyperparameters

<p class="kicker">Reference — screenshot this</p>

| Parameter | Effect | Typical range |
|-----------|--------|---------------|
| `n_estimators` | more trees → lower variance, diminishing past a few hundred | 100–1000 |
| `max_features` | lower → more decorrelation, higher per-tree bias | `'sqrt'`, `'log2'`, fractions |
| `max_depth` | caps individual tree complexity | `None` or 5–30 |
| `min_samples_leaf` | floor on leaf size — larger = more regularisation | 1–20 |

<p class="muted">Tune with <code>GridSearchCV</code> — L7's machinery, unchanged.</p>

---
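
## Sketch: tuning with GridSearchCV

One way the table could feed a grid search. The grid values are illustrative and deliberately tiny so the search runs quickly, and the split mirrors the 80/20 stratified protocol used later in the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Small illustrative grid: 2 x 2 = 4 candidates, each scored by 5-fold CV.
param_grid = {
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)   # the test set stays untouched

print(search.best_params_)
print(f"best CV f1: {search.best_score_:.3f}")
```

---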

## Reading a forest

---

## Impurity importance

```python
import pandas as pd

importances = pd.Series(rf.feature_importances_,
                        index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```

Which features **reduced impurity** during training — a free first pass.

<p class="muted">Biased toward high-cardinality features — and toward whichever correlated twin got picked first.</p>

---

## Permutation importance

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=20, random_state=0, scoring='f1'
)
```

Shuffle one column, refit nothing — how much does the **score** drop?

Measures what matters at **prediction time**, not what was used in training.

---
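
## Sketch: reading the result

A self-contained version that also prints the ranking. The smaller forest and `n_repeats=10` are choices made here to keep the run short; `importances_mean` and `importances_std` summarise the score drop over the repeated shuffles of each column.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=0, scoring="f1"
)

# Rank features by mean score drop, largest first, and show the top five.
order = result.importances_mean.argsort()[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} "
          f"{result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}")
```

---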

## What a forest won't give you

<ul class="checklist">
<li class="no">A <strong>coefficient table</strong> to sign-and-interpret</li>
<li class="no"><strong>Smooth</strong> functional forms — boundaries are staircases</li>
<li class="no"><strong>Extrapolation</strong> — predictions are flat outside the training range</li>
<li class="no"><strong>Calibrated probabilities</strong>: <code>predict_proba</code> averages the trees' leaf class fractions (the share of trees voting yes when leaves are pure)</li>
</ul>

---

## Pitfall 1 — OOB as the headline

<p class="kicker">Pitfall 1 · OOB is a sanity check</p>

<div class="cols">
<div class="callout">
<h3>OOB</h3>
<p>Free — report it alongside CV.</p>
</div>
<div class="callout warn">
<h3>The test set</h3>
<p>Still touched exactly once (L7).</p>
</div>
</div>

---

## Pitfall 2 — Cranking n_estimators

<p class="kicker">Pitfall 2 · More trees never overfit — they just cost time</p>

![Out-of-bag error flattens as the number of trees grows](../figures/oob-error-vs-trees.svg)

Flat past **~100** — at 5,000 trees you bought compute, not accuracy. Default: **200–500**.

---

## Pitfall 3 — Importance splits between twins

<p class="kicker">Pitfall 3 · Correlated features split the credit</p>

<div class="cols">
<div class="callout">
<h3>Two clones of one signal</h3>
<p>Each looks half as important.</p>
</div>
<div class="callout warn">
<h3>The model is fine</h3>
<p>The story the plot tells is wrong.</p>
</div>
</div>

Defence: permutation importance **plus** domain knowledge.

---

## Your project — the ensemble recipe

<p class="kicker">The pattern · memorise this</p>

1. **Baseline** — logistic / ridge from L4–L5
2. **Forest** with the course defaults: `n_estimators=300`, `max_features='sqrt'` (scikit-learn's own default is 100 trees)
3. **CV** under the same protocol (L7)
4. **Compare** — same metric, gaps vs fold std
5. **Interpret** — permutation importance

A forest that doesn't beat the baseline is a **finding**, not a failure.

---

## Worked example — tree vs bagging vs forest

---

## The protocol

<p class="kicker">load_breast_cancer() · 80/20 stratified · random_state=42 · 5-fold CV · f1</p>

<div class="cols-3">
<div class="stat stat--blue">
<span class="stat-value">569</span>
<span class="stat-label">rows</span>
</div>
<div class="stat stat--green">
<span class="stat-value">30</span>
<span class="stat-label">numeric features</span>
</div>
<div class="stat">
<span class="stat-value">2</span>
<span class="stat-label">classes · ~63% benign</span>
</div>
</div>

Same dataset as L5 — you already know what good looks like. Today: **does averaging show up in the numbers?**

---
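
## Sketch: setting up the protocol

A possible setup for `X_train`, `y_train`, and `cv`, which the three workflows use. Shuffling the folds with a fixed seed is an assumption, since the protocol line only says "5-fold CV", so scores from this sketch need not match the slide numbers exactly. In this dataset label 1 is benign, so `scoring='f1'` scores the benign class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 80/20 stratified split; the test set is set aside until the final comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One CV object shared by all three workflows, so the folds are identical.
# shuffle=True with a fixed seed is an assumption, not stated on the slide.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(X.shape, X_train.shape, X_test.shape)
print(f"share of class 1 (benign): {y.mean():.3f}")
```

---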

## Workflow 1 — a single tree

<p class="kicker">Workflow 1 · One unpruned tree</p>

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(tree, X_train, y_train, cv=cv, scoring='f1')
print(f"Tree CV f1: {scores.mean():.3f} ± {scores.std():.3f}")
```

CV f1 = **0.932 ± 0.014** — the baseline to beat.

---

## Workflow 2 — bagging

<p class="kicker">Workflow 2 · 300 bootstrap copies of the same tree</p>

```python
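from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    # `estimator` needs scikit-learn >= 1.2 (older releases: `base_estimator`)
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=300, n_jobs=-1, random_state=42,
)
scores = cross_val_score(bag, X_train, y_train, cv=cv, scoring='f1')
print(f"Bagging CV f1: {scores.mean():.3f} ± {scores.std():.3f}")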
```

CV f1 = **0.966 ± 0.017** — up **0.034**, about **2.4 fold-stds**. Averaging did real work.

---

## Workflow 3 — random forest

<p class="kicker">Workflow 3 · Bagging + random feature subsets</p>

```python
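from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(
    n_estimators=300, max_features='sqrt',
    n_jobs=-1, random_state=42,
)
scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='f1')
print(f"Forest CV f1: {scores.mean():.3f} ± {scores.std():.3f}")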
```

CV f1 = **0.972 ± 0.014** — up **0.006** over bagging. **Inside one std: a tie.**

---

## Read the numbers

<p class="kicker">Test f1 · CV mean ± std underneath</p>

<div class="cols-3">
<div class="stat stat--red">
<span class="stat-value">0.929</span>
<span class="stat-label">single tree<br>CV 0.932 ± 0.014</span>
</div>
<div class="stat stat--blue">
<span class="stat-value">0.958</span>
<span class="stat-label">bagging<br>CV 0.966 ± 0.017</span>
</div>
<div class="stat stat--green">
<span class="stat-value">0.958</span>
<span class="stat-label">random forest<br>CV 0.972 ± 0.014</span>
</div>
</div>

Tree → bagging is **real**. Bagging → forest is a **tie** — on this data.

---
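
## Sketch: gaps in fold-std units

The arithmetic behind "real" versus "tie", using the CV numbers from the three workflows. Dividing by the baseline's fold std is the convention assumed here; it is a rough yardstick, not a formal significance test.

```python
# CV results copied from the slides: (mean f1, fold std)
cv_results = {
    "single tree": (0.932, 0.014),
    "bagging": (0.966, 0.017),
    "random forest": (0.972, 0.014),
}


def gap_in_stds(better, baseline):
    """Gap in mean CV score, in units of the baseline's fold std."""
    mean_b, _ = cv_results[better]
    mean_a, std_a = cv_results[baseline]
    return (mean_b - mean_a) / std_a


tree_to_bag = gap_in_stds("bagging", "single tree")
bag_to_forest = gap_in_stds("random forest", "bagging")
print(f"tree -> bagging:   {tree_to_bag:.1f} fold-stds")
print(f"bagging -> forest: {bag_to_forest:.1f} fold-stds")
```

---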

## Which one would you ship?

<ul class="checklist">
<li>Bagging and forest <strong>tie</strong> → ship the <strong>forest</strong> — same cost, the field's standard, decorrelation is free insurance</li>
<li class="no">The single tree — a <strong>2+ std</strong> gap is a real loss</li>
<li class="no">Tuning further against the test set — it was touched <strong>once</strong></li>
</ul>

---

## Pitfall 3, live

<p class="kicker">Interpretation · two importances, two stories</p>

<div class="cols">
<div class="stat stat--yellow">
<span class="stat-value">0.137</span>
<span class="stat-label">worst perimeter<br>impurity importance</span>
</div>
<div class="stat stat--yellow">
<span class="stat-value">0.137</span>
<span class="stat-label">worst area — the same<br>measurement, credit split</span>
</div>
</div>

Permutation importance: near-zero for **29 of 30** features — shuffled columns get **covered by their twins**.

---

## Wrap-up

---

Ensembles don't make trees smarter — they make their mistakes **cancel**.

Bias stays. Variance averages away. That is the whole trick.

---

## Three words for this lecture

<div class="cols-3">
<div class="stat stat--word stat--blue">
<span class="stat-value">Bootstrap</span>
<span class="stat-label">fake many datasets<br>from the one you have</span>
</div>
<div class="stat stat--word stat--green">
<span class="stat-value">Average</span>
<span class="stat-label">variance down,<br>bias unchanged</span>
</div>
<div class="stat stat--word">
<span class="stat-value">Decorrelate</span>
<span class="stat-label">random features make<br>the averaging work</span>
</div>
</div>

---

## Before Lecture 11

- **Run** today's tree → bagging → forest comparison on your machine
- Project: add a forest as a **second-family comparison** under your L7 protocol
- Read ahead: **L11 is SVMs** — a very different way to draw a boundary


<p class="muted">L12 returns to ensembles: boosting swaps averaging for <strong>sequential correction</strong>.</p>

---

## Questions

Backup slides below — press down ↓

## Extra trees

<p class="kicker">Even more randomness · ExtraTreesClassifier</p>

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_estimators=300, n_jobs=-1, random_state=42)
```

- Split **thresholds** picked at random, not optimised
- Faster than a forest, sometimes competitive, less stable

## Calibrating forest probabilities

<p class="kicker">Probabilities · CalibratedClassifierCV</p>

```python
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(rf, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
y_prob = calibrated.predict_proba(X_test)[:, 1]
```

`predict_proba` averages each tree's leaf class fractions (a **vote fraction** when leaves are pure), so it is rarely well calibrated on its own.

## Class imbalance

<p class="kicker">Imbalance · class_weight='balanced_subsample'</p>

```python
rf = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',
    random_state=42,
)
```

Recomputes class weights **per bootstrap sample**, so each tree trains on classes that carry roughly equal total weight.

## Partial dependence

<p class="kicker">Interpretation · one feature's effect</p>

```python
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    rf, X_train, features=['mean radius'],
    feature_names=feature_names,  # needed when X_train is an array, not a DataFrame
)
```

Vary one feature, average the prediction over everything else → a **curve**.

## Why not one massive tree?

<p class="kicker">Design question · depth vs breadth</p>

- A deep tree is **unstable** — one row changes the top split, the whole structure shifts
- Many cheap, parallel, individually unstable trees **average out** that instability

Exactly the conditions where averaging pays.

## Multi-output regression

<p class="kicker">Regression · many targets, one forest</p>

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=300, random_state=42)
rf.fit(X_train, Y_train)   # Y_train shape: (n, k)
```

One forest, $k$ correlated targets — splits score impurity across **all** of them.

---

## What's next

**Lecture 11:** Support vector machines

- Maximum-margin classification
- The kernel trick
- When SVMs vs forests