This module introduces foundational techniques in data preprocessing and unsupervised learning using the Titanic dataset. Learners practice cleaning real-world data, preparing features for modeling, and applying K-Means clustering to uncover patterns.
Dataset: Kaggle Titanic Dataset
Objectives:
- Preprocess and explore tabular data.
- Apply K-Means clustering.
- Interpret cluster characteristics and patterns.
Students load the Titanic dataset and perform standard cleaning steps:
- Inspect data structure and basic statistics
- Handle missing values (Age, Embarked, Fare)
- Drop high-cardinality or unused columns (Name, Ticket, Cabin)
- Encode categorical variables (Sex, Embarked)
- Scale continuous features (Age, Fare) to the [0, 1] range using MinMaxScaler
This prepares the dataset for clustering: every remaining feature is numeric, the continuous features share a common scale, and no values are missing.
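The cleaning steps above can be sketched as follows. This is a minimal illustration, not the module's reference solution: the six-row `raw` frame is a hypothetical stand-in for Kaggle's `train.csv`, the `preprocess` helper is an illustrative name, and median/mode imputation and one-hot encoding of Embarked are assumed choices that the module does not prescribe.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the Kaggle train.csv (same column names, made-up rows).
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, 54.0, None],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Ticket": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "Fare": [7.25, 71.28, 7.92, 53.10, None, 13.00],
    "Cabin": [None, "C85", None, "C123", None, None],
    "Embarked": ["S", "C", "S", None, "S", "Q"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a numeric, scaled copy of df with no missing values."""
    df = df.copy()

    # Assumed imputation strategy: median for numeric, mode for categorical.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

    # Drop high-cardinality or unused text columns.
    df = df.drop(columns=["Name", "Ticket", "Cabin"])

    # Encode categoricals: binary map for Sex, one-hot columns for Embarked.
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df = pd.get_dummies(df, columns=["Embarked"], dtype=int)

    # Rescale the continuous features to the [0, 1] range.
    df[["Age", "Fare"]] = MinMaxScaler().fit_transform(df[["Age", "Fare"]])
    return df


clean = preprocess(raw)
print(clean.head())
```

With the real dataset, `raw` would instead come from `pd.read_csv` on the downloaded file; `raw.info()` and `raw.describe()` cover the inspection step.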
Learners:
- Select relevant features
- Apply K-Means with 3 clusters
- Visualize clusters (Age vs. Fare)
- Compute mean feature values per cluster to understand typical passenger profiles
- Explore feature distributions using box plots segmented by cluster
These steps show how unsupervised learning groups similar passengers by socioeconomic indicators, demographics, and travel details.
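A minimal sketch of the clustering and profiling steps is shown below. The `features` frame is randomly generated, hypothetical data that only mimics the shape of the preprocessed Titanic features, and the feature selection, `n_init`, and `random_state` values are assumptions chosen for reproducibility rather than settings taken from the module.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical preprocessed features (random values, not real passengers).
rng = np.random.default_rng(0)
n = 90
features = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),   # 1, 2, or 3
    "Sex": rng.integers(0, 2, size=n),      # already encoded as 0/1
    "Age": rng.random(n),                   # already scaled to [0, 1]
    "Fare": rng.random(n),                  # already scaled to [0, 1]
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
})

# Fit K-Means with 3 clusters and store each row's cluster label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
features["Cluster"] = kmeans.fit_predict(features)

# Mean feature value per cluster: one row per "typical passenger profile".
profiles = features.groupby("Cluster").mean()
print(profiles.round(2))
```

For the plots, `sns.scatterplot(data=features, x="Age", y="Fare", hue="Cluster")` and `sns.boxplot(data=features, x="Cluster", y="Fare")` are one way to draw the Age vs. Fare view and the per-cluster box plots. Because K-Means relies on Euclidean distance, columns left on a wider range (here Pclass, SibSp, Parch) weigh more heavily than the [0, 1]-scaled ones.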
Students analyze:
- How clusters differ in class, age, fare, family size, and encoded attributes
- Whether clusters align with survival patterns
- How additional features or alternative techniques (e.g., hierarchical clustering, PCA) might yield clearer insights
Example themes include socioeconomic patterns, demographic groupings, and relationships between cluster membership and survival outcomes.
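One way to approach these questions is sketched below, again on randomly generated, hypothetical data, so the printed numbers carry no meaning about the real passengers. Survived is deliberately kept out of the clustering input and only compared with the labels afterwards; the two-component PCA projection is an assumed choice for visualization.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical features and survival labels (random, for illustration only).
rng = np.random.default_rng(1)
n = 120
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.random(n),
    "Fare": rng.random(n),
})
survived = pd.Series(rng.integers(0, 2, size=n), name="Survived")

# Cluster on the features only; Survived is not part of the input.
labels = pd.Series(
    KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
    name="Cluster",
)

# Share of survivors and raw counts within each cluster.
survival_rate = survived.groupby(labels).mean()
counts = pd.crosstab(labels, survived)
print(survival_rate.round(2))
print(counts)

# Project the features onto two principal components for a 2-D view.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(pca.explained_variance_ratio_.round(2))
```

Survival rates that differ clearly between clusters would suggest the grouping captures something related to survival, while similar rates would suggest it does not. The `coords` array can be plotted with the cluster labels as colors to check how well separated the groups look.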
- Python 3.x
- pandas
- matplotlib
- seaborn
- scikit-learn
Column descriptions and dataset details:
https://www.kaggle.com/competitions/titanic/data