MST0052
## MST0052: Lecture 3

### Foundations and Preprocessing Pipelines

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| **1–3** | **Foundations (you are here)** |
| 4–7 | Core methods |
| 8–13 | Going further |
| 14–16 | Wrapping up |

---

## Today's plan

- Why preprocessing is **part of the model**
- The split-first rule and data leakage
- scikit-learn `Pipeline` and `ColumnTransformer`
- Worked example: Titanic survival prediction

---

## Project check-in

Quick show of hands:

- **Who has a candidate dataset?**
- **Who has a rough problem statement?**

If you don't have one yet, come talk to me after class or revisit the dataset pointers from Lecture 1.

---

## Preprocessing is part of the model

Raw data is rarely ready for modelling:

- Missing values, categorical strings, inconsistent scales
- These all change how a model behaves

Preprocessing choices are **modelling assumptions**, not cosmetic cleanup.

---

## What can go wrong with raw data

| Problem | Example |
|---------|---------|
| Missing values | `age` is NaN for 20% of rows |
| Categorical strings | `sex = "male"`, but the model expects numbers |
| Different scales | Age in years vs income in thousands |
| Outliers | A single fare of 500 when the median is 15 |
| High cardinality | A `city` column with 500 unique values |

---

## Common preprocessing tasks

| Task | What it solves | Example |
|------|---------------|---------|
| **Imputation** | Missing values | Replace NaN with median |
| **Encoding** | Categorical variables | One-hot, ordinal |
| **Scaling** | Different units | StandardScaler, MinMaxScaler |
| **Outlier handling** | Extreme values | Clip, Winsorise, flag |
| **Feature engineering** | Raw features not informative enough | Log-transform, interactions |

There is no universal checklist: justify each step for your dataset and model.

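---

## Sketch: auditing a raw table

The problems listed on the raw-data slide can be checked in a few lines before any modelling. This is a minimal sketch on a made-up five-row table; the column names and values are invented for illustration and are not course data.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "sex": ["male", "female", "female", "male", "male"],
    "fare": [7.25, 71.28, 8.05, 512.33, 13.0],
})

missing_share = df.isna().mean()                     # fraction of NaN per column
n_categories = df["sex"].nunique()                   # cardinality of a categorical column
fare_ratio = df["fare"].max() / df["fare"].median()  # crude signal of an extreme value

print(missing_share)
print(n_categories, round(fare_ratio, 1))
```

A large missing share, a high category count, or a maximum far above the median each point to one of the preprocessing tasks in the table.
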
---

## Why model choice affects preprocessing

| Model | Preprocessing needs |
|-------|-------------------|
| **k-NN** | Sensitive to scale (distances drive prediction) |
| **Logistic regression** | Needs numeric inputs, benefits from scaling |
| **Tree-based models** | Less sensitive to scaling, still need encoding |
| **PCA / k-means** | Dominated by unit differences without scaling |

There is no one-size-fits-all recipe.

---

## The pipeline as a contract

A pipeline says:

> "These are the exact transformations applied to any new data before prediction."

Without a pipeline, you're doing manual steps that are easy to skip or misapply at prediction time.

**Reproducibility is a project requirement.**

---

## The most important rule: split first, then fit

Split **before** you fit anything that learns from data:

- Imputers (estimate replacement values)
- Encoders (learn category sets)
- Scalers (estimate means and standard deviations)
- PCA (learns variance directions)

If you fit on the full dataset before splitting, you **leak** test information into training.

---

## What is data leakage?

Leakage means information from outside the training set (here, the test set) influences how the model or its preprocessing is fitted, so the test score looks better than it should.

---

## Leakage example: scaling before splitting

**Wrong (leakage):**

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # sees ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled, ...)
```

**Right (no leakage):**

```python
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
pipe = Pipeline([('scaler', StandardScaler()), ('model', ...)])
pipe.fit(X_train, y_train)  # scaler statistics come from training rows only
```

---

## Leakage example: imputation before splitting

```python
from sklearn.impute import SimpleImputer

# Wrong: imputer learns median from full dataset
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
X_train, X_test = train_test_split(X_imputed, ...)
```

The median used to fill missing values was partly computed **from the test set**.

Fix: put the imputer inside the pipeline.

---

## The fix: everything inside the pipeline

1. `train_test_split` → get `X_train`, `X_test`
2. Build a `Pipeline` that includes **all** preprocessing + the model
3. `pipe.fit(X_train, y_train)`: only training data is seen
4. `pipe.predict(X_test)`: the fitted preprocessing is applied to the test data without refitting

Because only the training split is passed to `fit`, every preprocessing step learns its parameters from training data alone.

---

## Pipeline basics

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```

Steps run in order. The last step is the model. Everything before it is preprocessing.

---

## What `fit` and `transform` mean

| Method | What it does | When |
|--------|-------------|------|
| `fit` | Learn parameters from data (mean, std, categories) | Training |
| `transform` | Apply learned parameters to data | Training and prediction |
| `fit_transform` | Do both at once | Training (convenience) |

At prediction time, only `transform` is called, using parameters learned during `fit`.

**This is the mechanism that prevents leakage.**

---

## The problem with mixed-type data

Real datasets have both numeric and categorical columns:

- You can't pass a string column through `StandardScaler`
- It rarely makes sense to one-hot encode a continuous numeric column

**Solution:** `ColumnTransformer` applies different transformations to different columns.

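---

## Sketch: what `fit` stores

The `fit` / `transform` split described in the table can be checked by hand. This is a minimal sketch on invented single-feature data: the scaler is fitted on three training rows, and the test row is then scaled with those stored training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, invented for illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # learns mean and std from training rows only
print(scaler.mean_, scaler.scale_)  # parameters stored on the fitted object

# transform reuses the stored training statistics; nothing is re-estimated
X_test_scaled = scaler.transform(X_test)
manual = (X_test - X_train.mean()) / X_train.std()
print(X_test_scaled, manual)
```

The two printed values agree, which shows that the test row never contributes to the mean or standard deviation.
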
---

## ColumnTransformer

Each group of columns goes through its own transformer, and the outputs are concatenated into one feature matrix.

---

## ColumnTransformer in code

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_cols = ['age', 'fare']
cat_cols = ['pclass', 'sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), num_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])
```

---

## Full pipeline with ColumnTransformer

```python
pipe = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression(max_iter=200))
])

pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

One object that handles **everything**: imputation, encoding, scaling, modelling.

Reproducible. Portable. No manual steps.

---

## Shorthand: make_pipeline and make_column_transformer

```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), num_cols),
    (make_pipeline(SimpleImputer(strategy='most_frequent'),
                   OneHotEncoder(handle_unknown='ignore')), cat_cols),
)

pipe = make_pipeline(preprocess, LogisticRegression(max_iter=200))
```

Same result, less typing. Use whichever you prefer.

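---

## Sketch: step names in the shorthand

One practical difference between the two styles is how steps are named. This minimal sketch builds the same three-step pipeline both ways and prints the step names. With `make_pipeline` the names are generated from the lowercased class names, which matters when you address parameters later.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

explicit = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=200)),
])
short = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=200),
)

explicit_names = [name for name, _ in explicit.steps]
short_names = [name for name, _ in short.steps]
print(explicit_names)
print(short_names)

# Parameters are addressed as <step name>__<parameter>
explicit.set_params(model__C=0.5)
short.set_params(logisticregression__C=0.5)
```

If you plan to tune hyperparameters, explicit names such as `model` tend to be easier to read than the generated ones.
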
---

## Preprocessing components reference

| Component | What it does | When to use |
|-----------|-------------|-------------|
| `SimpleImputer` | Fill missing values | Whenever NaNs are present |
| `StandardScaler` | Zero mean, unit variance | Linear models, k-NN, SVM, PCA |
| `MinMaxScaler` | Scale to [0, 1] | When you need bounded features |
| `OneHotEncoder` | Dummy variables | Low/medium cardinality categories |
| `OrdinalEncoder` | Integer encoding | Ordered categories, tree models |
| `FunctionTransformer` | Custom transform (e.g., log) | Feature engineering |

---

## Worked example: Titanic

- Classic dataset: predict survival (binary classification)
- Features: passenger class, sex, age, fare, embarkation port
- Mixed types, missing values, categorical strings

A good preprocessing exercise.

---

## Step 1: Load and inspect

```python
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
df.info()  # prints its summary directly; it returns None
print(df.isnull().sum())
```

**Decision:** keep `pclass`, `sex`, `age`, `fare`, `embarked`. Drop `deck` (77% missing).

---

## Step 2: Split first

```python
from sklearn.model_selection import train_test_split

features = ['pclass', 'sex', 'age', 'fare', 'embarked']
X = df[features]
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Split **before** any preprocessing. `stratify=y` keeps the survival ratio the same in the training and test sets.

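---

## Sketch: checking the stratified split

The effect of `stratify=y` can be verified directly. This minimal sketch uses synthetic labels with 38% positives and a placeholder feature, both invented for illustration, so it runs without downloading the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 38% positives, invented for illustration.
y = np.array([1] * 380 + [0] * 620)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification the positive rate should match across all three
print(f"overall: {y.mean():.3f}")
print(f"train:   {y_train.mean():.3f}")
print(f"test:    {y_test.mean():.3f}")
```

The same three-line check works on the real `y`, `y_train`, and `y_test` from Step 2.
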
---

## Step 3: Build the pipeline

```python
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['age', 'fare']),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['pclass', 'sex', 'embarked']),
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('model', LogisticRegression(max_iter=200))
])
```

---

## Step 4: Fit, predict, evaluate

```python
from sklearn.metrics import classification_report

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```

Is ~79% accuracy good? **Compared to what?**

A majority-class predictor (always predict "did not survive") gets ~62%. So 79% is a real improvement.

---

## Step 5: Cross-validate

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

- Cross-validate on the **training set**
- The pipeline ensures no leakage across folds
- Compare CV accuracy to test accuracy: are they close?

---

## Beyond the basics: feature engineering

Preprocessing gets data into a form the model can consume.

Feature engineering creates features that make the model's job **easier**.

- Log-transforming a skewed feature (income, price)
- Creating interaction terms (age × class)
- Extracting components from dates (day of week, month)
- Binning continuous variables (age groups)

---

## Feature engineering in a pipeline

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transform = FunctionTransformer(np.log1p, validate=True)

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='median'),
                   log_transform,
                   StandardScaler()), ['fare']),
    # ... other columns as before
)
```

Custom transforms fit into the same pipeline pattern.
`log1p(x)` = log(1 + x), so zero values are safe (it still requires x > −1).

---

## When to engineer features

- **Start simple:** raw features + standard preprocessing
- Add engineered features when you understand the data and the model's weaknesses
- Every engineered feature is a **claim about the data**, so be ready to defend it

---

## Documenting your preprocessing

In your project report, describe:

- Which variables were **kept, dropped, or grouped**, and why
- How **missing values** were handled, and why that strategy
- How **categorical variables** were encoded
- Whether **scaling** was applied and which models required it
- How preprocessing was **tied to the pipeline**

---

## Common mistakes in project reports

- *"I standardised all features"*, but you're using a random forest
- *"I dropped all rows with missing values"*, and that was 30% of your data
- *"I one-hot encoded city"*, and city has 500 unique values

The fix: **justify** each choice in terms of the data and the model.

---

## Summary

- Preprocessing is a **modelling decision**, not a separate step
- **Split first**, then fit preprocessing
- Use `Pipeline` and `ColumnTransformer` for reproducibility
- Different models need different preprocessing
- Always report and justify your choices

---

## Before Lecture 4

- Run today's Titanic pipeline on **your own machine**
- Start applying the pipeline pattern to **your own dataset**
- Read ahead: next lecture is **linear models** (OLS, ridge, lasso)

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## Ordinal encoding vs one-hot encoding

| Approach | When to use |
|----------|-------------|
| **One-hot** | Nominal categories (no order): colour, city, sex |
| **Ordinal** | Ordered categories: low/medium/high, education level |

**Watch out:** one-hot encoding a column with 500 categories creates 500 new columns. For high cardinality, consider target encoding or grouping rare categories.

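--

## Sketch: the two encoders side by side

The difference between the two encoders is easiest to see on tiny inputs. This minimal sketch uses two invented columns. The category order passed to `OrdinalEncoder` is an assumption about what "low < medium < high" should mean in the data; left to its default, the encoder sorts categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy columns invented for illustration.
levels = np.array([["low"], ["high"], ["medium"], ["low"]])
cities = np.array([["Oslo"], ["Lima"], ["Oslo"], ["Pune"]])

# Pass the order explicitly; the default is alphabetical (high < low < medium).
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
level_codes = ordinal.fit_transform(levels)

# One-hot creates one column per distinct category seen during fit.
onehot = OneHotEncoder(handle_unknown="ignore")
city_matrix = onehot.fit_transform(cities)

print(level_codes.ravel())
print(city_matrix.shape)
```

Three distinct cities give three columns, and 500 distinct cities would give 500, which is why high-cardinality columns need a different strategy.
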
--

## Handling outliers

- **Clipping:** cap values at a percentile (e.g., 1st and 99th)
- **Winsorising:** replace extremes with the boundary value
- **IQR filtering:** flag values more than 1.5 × IQR below the first quartile or above the third quartile

When to leave outliers alone: when they're real and informative (e.g., a genuinely expensive apartment).

--

## Missing data patterns

| Pattern | Meaning | Implication |
|---------|---------|-------------|
| **MCAR** | Missing completely at random | Safe to impute or drop |
| **MAR** | Missingness depends on observed data | Imputation can work, be careful |
| **MNAR** | Missingness depends on the missing value itself | Imputation is biased |

For most project datasets, assuming MCAR or MAR is a reasonable starting point. If you suspect MNAR, discuss it in your report.

--

## Second worked example: Ames Housing

A regression analog of the Titanic pipeline: same pattern, different target.

- **Target:** house sale price (continuous)
- **Features:** numeric (square footage, year built), ordinal (overall quality 1–10), nominal (neighborhood, roof style)

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import Ridge

# num_cols, ord_cols, cat_cols: lists of column names for your copy of the data
preprocessor = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), num_cols),
    ('ord', OrdinalEncoder(), ord_cols),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                          OneHotEncoder(handle_unknown='ignore')), cat_cols),
])

pipe = make_pipeline(preprocessor, Ridge())
```

The pipeline pattern from Titanic generalises directly: only the model and the column groupings change.

--

## Inspecting pipeline internals

```python
# Assumes the fitted Titanic pipeline, whose first step is named 'prep'

# Access named steps
pipe.named_steps['prep']

# Transform training data without predicting
X_transformed = pipe[:-1].transform(X_train)
print(X_transformed.shape)

# Get feature names after encoding
pipe.named_steps['prep'].get_feature_names_out()
```

Useful for debugging: check what the model actually sees after preprocessing.

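--

## Sketch: clipping and the IQR rule

The outlier options from the backup slide on handling outliers can be written in a few lines of NumPy. This is a minimal sketch on invented fares, not a recommendation for any particular threshold. In a real project the percentiles and quartiles would be estimated on the training split only, for the same leakage reasons as any other fitted step.

```python
import numpy as np

# Toy fares invented for illustration; one value is far from the rest.
fares = np.array([7.0, 8.0, 10.0, 13.0, 15.0, 20.0, 26.0, 30.0, 500.0])

# Clipping at the 1st and 99th percentiles
low, high = np.percentile(fares, [1, 99])
clipped = np.clip(fares, low, high)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1
is_outlier = (fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)

print(clipped.max(), fares[is_outlier])
```

Clipping changes the value but keeps the row, while the IQR rule only flags the row, leaving the decision about what to do with it to you.
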
---

## What's next

**Lecture 4:** Linear models for prediction

- Ordinary least squares as your first baseline
- Ridge and lasso regularisation
- When to penalise