One-hour coding activity developed for the NSF I-GUIDE Convergence Curriculum for Geospatial Data Science module on Data Mining.

vavramusser/ccdatamining


NSF I-GUIDE Convergence Curriculum – Geospatial Data Science

Module: Intro to Data Preprocessing and Clustering (Titanic Dataset)

This module introduces foundational techniques in data preprocessing and unsupervised learning using the Titanic dataset. Learners practice cleaning real-world data, preparing features for modeling, and applying K-Means clustering to uncover patterns.

Overview

Dataset: Kaggle Titanic Dataset
Objectives:

  1. Preprocess and explore tabular data.
  2. Apply K-Means clustering.
  3. Interpret cluster characteristics and patterns.

Tasks

Task 1 — Data Preprocessing

Students load the Titanic dataset and perform standard cleaning steps:

  • Inspect data structure and basic statistics
  • Handle missing values (Age, Embarked, Fare)
  • Drop high-cardinality or unused columns (Name, Ticket, Cabin)
  • Encode categorical variables (Sex, Embarked)
  • Normalize numerical features (Age, Fare) using MinMaxScaler

This prepares the dataset for clustering by ensuring all features are numeric, scaled, and free of missing values.
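The cleaning steps above might look roughly like the following sketch. The six rows are hypothetical stand-in data that only borrow the Kaggle column names, so the block runs without the real file. The imputation choices (median for Age and Fare, most frequent value for Embarked) and the encodings (0/1 for Sex, one-hot for Embarked) are assumptions; the activity notebook may use different ones.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in rows with Titanic column names.
# With the real data you would load the file instead, e.g. pd.read_csv("train.csv").
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Name":     ["A", "B", "C", "D", "E", "F"],
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Age":      [22.0, 38.0, np.nan, 54.0, 2.0, 27.0],
    "SibSp":    [1, 1, 0, 0, 3, 0],
    "Parch":    [0, 0, 0, 0, 1, 2],
    "Ticket":   ["t1", "t2", "t3", "t4", "t5", "t6"],
    "Fare":     [7.25, 71.28, 7.92, np.nan, 21.08, 11.13],
    "Cabin":    [np.nan, "C85", np.nan, "E46", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S", "S"],
})

# Inspect structure, basic statistics, and missing-value counts.
df.info()
print(df.describe())
print(df.isna().sum())

# Handle missing values (assumed strategy: median / most frequent value).
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Drop high-cardinality or unused columns.
df = df.drop(columns=["Name", "Ticket", "Cabin"])

# Encode categorical variables (assumed encodings).
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)

# Rescale Age and Fare to the [0, 1] range.
df[["Age", "Fare"]] = MinMaxScaler().fit_transform(df[["Age", "Fare"]])

print(df.head())
```

After these steps every column is numeric and contains no missing values, which is what K-Means requires.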

Task 2 — Clustering

Learners:

  • Select relevant features
  • Apply K-Means with 3 clusters
  • Visualize clusters (Age vs. Fare)
  • Compute mean feature values per cluster to understand typical passenger profiles
  • Explore feature distributions using box plots segmented by cluster

These steps show how unsupervised learning groups similar passengers based on socioeconomic indicators, demographics, and travel details.
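A minimal sketch of the clustering steps is shown here. It uses randomly generated stand-in values in place of the preprocessed Titanic table, so the resulting clusters carry no real meaning. The feature list, `random_state`, and `n_init` are assumptions, and pandas' built-in `boxplot` is used to keep the imports short even though the requirements also list seaborn.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical stand-in for the preprocessed table (Age and Fare already in [0, 1]).
rng = np.random.default_rng(0)
n = 90
data = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.random(n),
    "Fare": rng.random(n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
})

# Select the features to cluster on (assumed selection).
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch"]

# Apply K-Means with 3 clusters and store the label of each passenger.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
data["Cluster"] = kmeans.fit_predict(data[features])

# Visualize clusters in the Age vs. Fare plane.
fig, ax = plt.subplots()
ax.scatter(data["Age"], data["Fare"], c=data["Cluster"], cmap="viridis")
ax.set_xlabel("Age (scaled)")
ax.set_ylabel("Fare (scaled)")
ax.set_title("K-Means clusters")

# Mean feature values per cluster describe a typical passenger profile.
profile = data.groupby("Cluster")[features].mean()
print(profile)

# Box plot of one feature, segmented by cluster.
data.boxplot(column="Fare", by="Cluster")
# In an interactive session, call plt.show() to display the figures.
```

K-Means relies on Euclidean distance, so a feature with a wider numeric range (here Pclass, 1 to 3) weighs more heavily than Age or Fare scaled to [0, 1]. Keep this in mind when reading the cluster profiles.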

Task 3 — Interpretation

Students analyze:

  • How clusters differ in class, age, fare, family size, and encoded attributes
  • Whether clusters align with survival patterns
  • How additional features or alternative techniques (e.g., hierarchical clustering, PCA) might improve insights

Example themes include socioeconomic patterns, demographic groupings, and relationships between cluster membership and survival outcomes.
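One possible way to start the interpretation is sketched below: compare survival rates across clusters, project the features with PCA, and cross-check against hierarchical clustering. The data are again random stand-in values, so the printed numbers are meaningless. The feature list and all parameter choices are assumptions rather than the activity's prescribed settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for the preprocessed table, including the Survived label.
rng = np.random.default_rng(0)
n = 90
data = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.random(n),
    "Fare": rng.random(n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
    "Survived": rng.integers(0, 2, size=n),
})
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch"]

# Survived is left out of the clustering so it can be compared afterwards.
data["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(
    data[features]
)

# Share of non-survivors and survivors within each cluster (rows sum to 1).
survival_table = pd.crosstab(data["Cluster"], data["Survived"], normalize="index")
print(survival_table)

# PCA: project the features onto two components for a compact view.
pca = PCA(n_components=2)
coords = pca.fit_transform(data[features])
print(pca.explained_variance_ratio_)

# Hierarchical (agglomerative) clustering as an alternative grouping.
hier = pd.Series(
    AgglomerativeClustering(n_clusters=3).fit_predict(data[features]),
    index=data.index,
    name="Hierarchical",
)

# Cross-tabulate the two labelings to see how far they agree.
agreement = pd.crosstab(data["Cluster"], hier)
print(agreement)
```

If a cluster's survival share differs clearly from the overall rate, its feature profile is a natural place to look for an explanation.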

Requirements

  • Python 3.x
  • pandas
  • matplotlib
  • seaborn
  • scikit-learn

Dataset Reference

Column descriptions and dataset details:
https://www.kaggle.com/competitions/titanic/data
