Foundations and Preprocessing Pipelines

## MST0052 — Lecture 3

### Foundations and Preprocessing Pipelines

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| **1–3** | **Foundations — you are here** |
| 4–7 | Core methods |
| 8–13 | Going further |
| 14–16 | Wrapping up |

---

## Today's plan

- Why preprocessing is **part of the model**
- The split-first rule and data leakage
- scikit-learn `Pipeline` and `ColumnTransformer`
- Worked example: Titanic survival prediction

---

## Project check-in

Quick show of hands:

- **Who has a candidate dataset?**
- **Who has a rough problem statement?**

If you don't have either yet, come talk to me after class or revisit the dataset pointers from Lecture 1.

---

## Preprocessing is part of the model

Raw data is rarely ready for modelling:

- Missing values, categorical strings, inconsistent scales
- These all change how a model behaves

Preprocessing choices are **modelling assumptions**, not cosmetic cleanup.

---

## What can go wrong with raw data

| Problem | Example |
|---------|---------|
| Missing values | `age` is NaN for 20% of rows |
| Categorical strings | `sex = "male"` — model expects numbers |
| Different scales | Age in years vs income in thousands |
| Outliers | A single fare of 500 when the median is 15 |
| High cardinality | A `city` column with 500 unique values |
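
A quick audit along the following lines can surface most of the problems in the table before any modelling starts. This is only a sketch on a small made-up DataFrame; the column names and values are invented for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "sex": ["male", "female", "female", "male", "male"],
    "fare": [7.25, 15.0, 8.05, 500.0, 13.0],
    "city": ["Oslo", "Bergen", "Hamar", "Moss", "Bodo"],
})

# Fraction of missing values per column
missing_share = df.isna().mean()

# Non-numeric columns need encoding; their cardinality tells you how hard that is
string_cols = df.select_dtypes(exclude="number").columns.tolist()
n_unique = df[string_cols].nunique()

# Min / median / max reveal scale differences and extreme values
numeric_summary = df.describe()

print(missing_share)
print(n_unique)
print(numeric_summary.loc[["min", "50%", "max"]])
```

None of these checks decides anything for you; they only tell you which rows of the table apply to your dataset.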

---

## Common preprocessing tasks

| Task | What it solves | Example |
|------|---------------|---------|
| **Imputation** | Missing values | Replace NaN with median |
| **Encoding** | Categorical variables | One-hot, ordinal |
| **Scaling** | Different units | StandardScaler, MinMaxScaler |
| **Outlier handling** | Extreme values | Clip, Winsorise, flag |
| **Feature engineering** | Raw features not informative enough | Log-transform, interactions |

There is no universal checklist — justify each step for your dataset and model.

---

## Why model choice affects preprocessing

| Model | Preprocessing needs |
|-------|-------------------|
| **k-NN** | Sensitive to scale (distances drive prediction) |
| **Logistic regression** | Needs numeric inputs, benefits from scaling |
| **Tree-based models** | Less sensitive to scaling, still need encoding |
| **PCA / k-means** | Dominated by unit differences without scaling |

There is no one-size-fits-all recipe.
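
To make the k-NN row concrete, here is a small sketch on synthetic data (the data-generating setup is an assumption chosen for illustration): one informative feature on a small scale and one pure-noise feature on a large scale. Without scaling, distances are dominated by the noise feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)                 # informative, small scale
noise = rng.normal(scale=1000.0, size=n)    # uninformative, large scale
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)                # label depends on the signal only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same classifier, with and without scaling in front of it
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

acc_raw = knn_raw.score(X_test, y_test)
acc_scaled = knn_scaled.score(X_test, y_test)
print(f"k-NN without scaling: {acc_raw:.2f}")
print(f"k-NN with scaling:    {acc_scaled:.2f}")
```

A tree-based model fitted to the same data would be largely unaffected by the rescaling, which is the point of the table.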

---

## The pipeline as a contract

A pipeline says:

> "These are the exact transformations applied to any new data before prediction."

Without a pipeline, preprocessing is a series of manual steps that are easy to forget or to apply differently at prediction time.

**Reproducibility is a project requirement.**

---

## The most important rule: split first, then fit

**Split before** you fit anything that learns from data:

- Imputers (estimate replacement values)
- Encoders (learn category sets)
- Scalers (estimate means and standard deviations)
- PCA (learns variance directions)

If you fit on the full dataset before splitting, you **leak** test information into training.

---

## What is data leakage?

![Data leakage: wrong vs right](../figures/data-leakage.svg)

---

## Leakage example: scaling before splitting

**Wrong — leakage:**

```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)       # sees ALL data
X_train, X_test = train_test_split(X_scaled, ...)
```

**Right — no leakage:**

```python
X_train, X_test = train_test_split(X, ...)
pipe = Pipeline([('scaler', StandardScaler()), ('model', ...)])
pipe.fit(X_train, y_train)               # sees only training
```

---

## Leakage example: imputation before splitting

```python
# Wrong: imputer learns median from full dataset
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
X_train, X_test = train_test_split(X_imputed, ...)
```

The median used to fill missing values was partly computed **from the test set**.

Fix: put the imputer inside the pipeline.
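
A minimal sketch of that fix, using synthetic data with values knocked out at random (the data and variable names are assumptions for illustration). The check at the end shows that the imputer's stored median comes from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(loc=30.0, scale=10.0, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 60).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan       # knock out roughly 20% of the entries

# Right: split first, then let the pipeline fit the imputer
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

learned = pipe.named_steps["imputer"].statistics_
print("Median learned by the imputer:", learned)
print("Median of training rows only: ", np.nanmedian(X_train, axis=0))
print("Median of the full dataset:   ", np.nanmedian(X, axis=0))
```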

---

## The fix: everything inside the pipeline

1. `train_test_split` → get `X_train`, `X_test`
2. Build a `Pipeline` that includes **all** preprocessing + the model
3. `pipe.fit(X_train, y_train)` — only training data seen
4. `pipe.predict(X_test)` — preprocessing automatically applied correctly

As long as only training data is passed to `fit`, the pipeline **guarantees** that every preprocessing step learns its parameters from training data alone.

---

## Pipeline basics

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```

Steps run in order. The last step is the model. Everything before it is preprocessing.

---

## What `fit` and `transform` mean

| Method | What it does | When |
|--------|-------------|------|
| `fit` | Learn parameters from data (mean, std, categories) | Training |
| `transform` | Apply learned parameters to data | Training and prediction |
| `fit_transform` | Do both at once | Training (convenience) |

At prediction time, only `transform` is called — using parameters learned during `fit`.

**This is the mechanism that prevents leakage.**
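
The distinction can be seen on a tiny example. The numbers below are made up; the sketch only illustrates that `transform` reuses what `fit` stored and estimates nothing from the new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_new = np.array([[40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # learns mean_ and scale_ from training data
print(scaler.mean_, scaler.scale_)

# transform applies the stored parameters; X_new does not influence them
X_new_scaled = scaler.transform(X_new)
manual = (X_new - scaler.mean_) / scaler.scale_
print(X_new_scaled, manual)
```

The scaled value of the new point is far from zero because it is measured against the training mean, not its own.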

---

## The problem with mixed-type data

Real datasets have both numeric and categorical columns:

- You can't pass a string column through `StandardScaler`
- One-hot encoding a continuous numeric column is rarely meaningful (one new column per distinct value)

**Solution:** `ColumnTransformer` — apply different transformations to different columns.

---

## ColumnTransformer

![ColumnTransformer flow](../figures/column-transformer-flow.svg)

---

## ColumnTransformer in code

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_cols = ['age', 'fare']
cat_cols = ['pclass', 'sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), num_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])
```

---

## Full pipeline with ColumnTransformer

```python
pipe = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression(max_iter=200))
])

pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

One object that handles **everything**: imputation, encoding, scaling, modelling.

Reproducible. Portable. No manual steps.

---

## Shorthand: make_pipeline and make_column_transformer

```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='median'),
                   StandardScaler()), num_cols),
    (make_pipeline(SimpleImputer(strategy='most_frequent'),
                   OneHotEncoder(handle_unknown='ignore')), cat_cols),
)

pipe = make_pipeline(preprocess, LogisticRegression(max_iter=200))
```

Same behaviour, less typing. The only difference is that step names are generated automatically from the class names. Use whichever you prefer.
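
One practical consequence of the shorthand is the naming of steps, which matters when you inspect or tune the pipeline later. A short sketch, assuming the same estimators as above:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=200),
)

# make_pipeline names each step after its lowercased class name
step_names = list(pipe.named_steps)
print(step_names)

# Parameters are addressed as <step name>__<parameter>
pipe.set_params(logisticregression__C=0.5)
print(pipe.named_steps["logisticregression"].C)
```

With an explicit `Pipeline([...])` you choose the names yourself, which can be easier to read in a report.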

---

## Preprocessing components reference

| Component | What it does | When to use |
|-----------|-------------|-------------|
| `SimpleImputer` | Fill missing values | Whenever NaNs are present and the model cannot handle them natively |
| `StandardScaler` | Zero mean, unit variance | Linear models, k-NN, SVM, PCA |
| `MinMaxScaler` | Scale to [0, 1] | When you need bounded features |
| `OneHotEncoder` | Dummy variables | Low/medium cardinality categories |
| `OrdinalEncoder` | Integer encoding | Ordered categories, tree models |
| `FunctionTransformer` | Custom transform (e.g., log) | Feature engineering |

---

## Worked example: Titanic

- Classic dataset: predict survival (binary classification)
- Features: passenger class, sex, age, fare, embarkation port
- Mixed types, missing values, categorical strings

A good preprocessing exercise.

---

## Step 1: Load and inspect

```python
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
print(df.info())
print(df.isnull().sum())
```

![Titanic missing data pattern](../figures/titanic-missing.svg)

**Decision:** keep `pclass`, `sex`, `age`, `fare`, `embarked`. Drop `deck` (77% missing).

---

## Step 2: Split first

```python
from sklearn.model_selection import train_test_split

features = ['pclass', 'sex', 'age', 'fare', 'embarked']
X = df[features]
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Split **before** any preprocessing. `stratify=y` keeps the survival ratio the same in the training and test sets.

---

## Step 3: Build the pipeline

```python
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['age', 'fare']),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['pclass', 'sex', 'embarked']),
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression(max_iter=200))
])
```

---

## Step 4: Fit, predict, evaluate

```python
from sklearn.metrics import classification_report

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```

Is ~79% accuracy good? **Compared to what?**

A majority-class predictor (always predict "did not survive") gets ~62%. So 79% is a real improvement.
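
A baseline like this does not need to be computed by hand. The sketch below uses scikit-learn's `DummyClassifier` on stand-in labels with roughly the class balance quoted above; the labels and the all-zero feature matrix are placeholders, not the actual Titanic split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Stand-in labels: about 62% zeros ("did not survive"), 38% ones.
# A majority-class baseline ignores the features, so zeros suffice for X.
y = np.array([0] * 549 + [1] * 342)
X = np.zeros((len(y), 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

On the real data you would pass the same `X_train`, `y_train`, `X_test`, `y_test` as in Step 2 and report the baseline next to the pipeline's score.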

---

## Step 5: Cross-validate

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X_train, y_train,
                         cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

- Cross-validate on the **training set**
- The pipeline is re-fitted on each fold's training portion, so no information leaks across folds
- Compare CV accuracy to test accuracy — are they close?
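
To see the per-fold refitting, `cross_validate` can return the fitted pipeline from each fold. The sketch below uses synthetic one-feature data (an assumption for illustration) and prints the mean that each fold's scaler learned.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=50.0, scale=15.0, size=(100, 1))
y_train = (X_train[:, 0] > 50).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# return_estimator=True keeps the pipeline fitted in each fold
result = cross_validate(pipe, X_train, y_train, cv=5, return_estimator=True)

# Each scaler was fitted on a different training portion, so the means differ
fold_means = [est.named_steps["scaler"].mean_[0] for est in result["estimator"]]
print(np.round(fold_means, 2))
print(result["test_score"])
```

If the scaler had been fitted once on all of `X_train` beforehand, every fold would share the same mean and the held-out part of each fold would have influenced it.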

---

## Beyond the basics: feature engineering

Preprocessing gets data into a form the model can consume. Feature engineering creates features that make the model's job **easier**.

- Log-transforming a skewed feature (income, price)
- Creating interaction terms (age × class)
- Extracting components from dates (day of week, month)
- Binning continuous variables (age groups)
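
Each of the four ideas above is a one-liner in pandas. The sketch below uses a tiny made-up DataFrame; the `booked` date column and the age-group boundaries are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "fare": [7.25, 71.28, 0.0, 512.33],
    "age": [22, 38, 4, 61],
    "pclass": [3, 1, 3, 1],
    "booked": pd.to_datetime(
        ["1912-04-01", "1912-04-05", "1912-04-08", "1912-04-10"]
    ),
})

df["log_fare"] = np.log1p(df["fare"])            # log-transform a skewed feature
df["age_x_class"] = df["age"] * df["pclass"]     # interaction term
df["booked_dow"] = df["booked"].dt.dayofweek     # date component: day of week
df["booked_month"] = df["booked"].dt.month       # date component: month
df["age_group"] = pd.cut(                        # binning (right-closed intervals)
    df["age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

print(df[["log_fare", "age_x_class", "booked_dow", "booked_month", "age_group"]])
```

These transforms use no statistics learned from the data, so they are safe to apply before splitting; anything that estimates a quantity (such as quantile-based bins) belongs inside the pipeline.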

---

## Feature engineering in a pipeline

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transform = FunctionTransformer(np.log1p, validate=True)

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='median'),
                   log_transform,
                   StandardScaler()), ['fare']),
    # ... other columns as before
)
```

Custom transforms fit into the same pipeline pattern. `log1p` = log(1 + x), safe for zeros.

---

## When to engineer features

- **Start simple** — raw features + standard preprocessing
- Add engineered features when you understand the data and the model's weaknesses
- Every engineered feature is a **claim about the data** — be ready to defend it

---

## Documenting your preprocessing

In your project report, describe:

- Which variables were **kept, dropped, or grouped** — and why
- How **missing values** were handled — and why that strategy
- How **categorical variables** were encoded
- Whether **scaling** was applied and which models required it
- How preprocessing was **tied to the pipeline**

---

## Common mistakes in project reports

- *"I standardised all features"* — but you're using a random forest
- *"I dropped all rows with missing values"* — that was 30% of your data
- *"I one-hot encoded city"* — city has 500 unique values

The fix: **justify** each choice in terms of the data and the model.

---

## Summary

- Preprocessing is a **modelling decision**, not a separate step
- **Split first**, then fit preprocessing
- Use `Pipeline` and `ColumnTransformer` for reproducibility
- Different models need different preprocessing
- Always report and justify your choices

---

## Before Lecture 4

- Run today's Titanic pipeline on **your own machine**
- Start applying the pipeline pattern to **your own dataset**
- Read ahead: next lecture is **linear models** (OLS, ridge, lasso)

---

## Questions

Open the floor.

---

## Backup slides

Use if questions arise or time allows.

---

## Ordinal encoding vs one-hot encoding

| Approach | When to use |
|----------|-------------|
| **One-hot** | Nominal categories (no order): colour, city, sex |
| **Ordinal** | Ordered categories: low/medium/high, education level |

**Watch out:** one-hot encoding a column with 500 categories creates 500 new columns. For high cardinality, consider target encoding or grouping rare categories.

---

## Handling outliers

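- **Clipping:** cap values at chosen lower and upper bounds (e.g., the 1st and 99th percentiles)
- **Winsorising:** the same idea with percentile-based bounds: extremes are replaced with the boundary value rather than removed
- **IQR rule:** flag values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR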

When to leave outliers alone: when they're real and informative (e.g., a genuinely expensive apartment).
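
A minimal NumPy sketch of percentile clipping and the IQR rule on made-up fares. In a real project the bounds should be computed from the training set only and then applied to the test set, like any other learned preprocessing parameter.

```python
import numpy as np

# Toy fares, invented for illustration: one extreme value among ordinary ones
fares = np.array([7.0, 8.0, 10.0, 13.0, 15.0, 15.0, 20.0, 26.0, 30.0, 500.0])

# Percentile clipping (winsorising): replace extremes with the boundary values
lo, hi = np.percentile(fares, [1, 99])
clipped = np.clip(fares, lo, hi)

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1
is_outlier = (fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)

print(clipped)
print(is_outlier)
```

Flagging keeps the original value and adds information; clipping changes the value. Which one is appropriate depends on whether the extreme value is an error or a genuine observation.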

---

## Missing data patterns

| Pattern | Meaning | Implication |
|---------|---------|-------------|
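| **MCAR** | Missing completely at random | Dropping or imputing introduces no bias, though dropping loses data |
| **MAR** | Missingness depends on observed data | Imputation can work if it uses the related observed variables |
| **MNAR** | Missingness depends on the missing value itself | Standard imputation is likely to be biased |

In practice, most project datasets can reasonably be treated as MAR or MCAR, but MAR and MNAR cannot be told apart from the observed data alone. If you suspect MNAR, discuss it in your report.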
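
A small simulation shows why the distinction matters. The setup is entirely synthetic and the missingness mechanisms are assumptions chosen for illustration: under MCAR the observed mean stays close to the truth, while under MNAR (large values more likely to be missing) it is pulled downwards, and so is any mean or median imputation based on it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
income = rng.normal(loc=50.0, scale=10.0, size=n)   # synthetic "true" values

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.3

# MNAR: the larger the value, the more likely it is to be missing
p_missing = 1.0 / (1.0 + np.exp(-(income - 50.0) / 5.0))
mnar_missing = rng.random(n) < p_missing

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()
mnar_mean = income[~mnar_missing].mean()

print(f"True mean:            {true_mean:.2f}")
print(f"Observed mean (MCAR): {mcar_mean:.2f}")
print(f"Observed mean (MNAR): {mnar_mean:.2f}")
```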

---

## Second worked example: Ames Housing

A regression analog of the Titanic pipeline — same pattern, different target.

- **Target:** house sale price (continuous)
- **Features:** numeric (square footage, year built), ordinal (overall quality 1–10), nominal (neighborhood, roof style)

```python
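from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# num_cols, ord_cols and cat_cols are lists of column names that you define
# for the Ames data (numeric, ordered and nominal features respectively).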
preprocessor = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='median'),
                          StandardScaler()), num_cols),
    ('ord', OrdinalEncoder(), ord_cols),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                          OneHotEncoder(handle_unknown='ignore')),
     cat_cols),
])

pipe = make_pipeline(preprocessor, Ridge())
```

The pipeline pattern from Titanic generalises directly: only the model and the column groupings change.

---

## Inspecting pipeline internals

```python
# Access named steps
pipe.named_steps['prep']

# Transform training data without predicting
X_transformed = pipe[:-1].transform(X_train)
print(X_transformed.shape)

# Get feature names after encoding
pipe.named_steps['prep'].get_feature_names_out()
```

Useful for debugging: check what the model actually sees after preprocessing.

---

## What's next

**Lecture 4:** Linear models for prediction

- Ordinary least squares as your first baseline
- Ridge and lasso regularisation
- When to penalise