diff --git a/README.md b/README.md index c3138ed..8aaeca6 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,291 @@ # Python-Data-Science-Onboarding -Coming Soon + +Welcome to the WMU DSC/Developer Club!
+This repository is designed to help new members get familiar with the tools and workflows commonly used in our data science projects. +
+
+
+
+
+## πŸš€ Who is this for?
+
+This tutorial assumes you already have *basic Python and data-science experience*, including:
+
+- Using NumPy and pandas for data handling
+- Knowing what a `.ipynb` Jupyter Notebook file is
+- Using scikit-learn to build simple machine learning models
+
+<details>
+<summary>❓Don't know Python yet? No problem!❓</summary>
+ +> **Start with the resources below before continuing:**
+>   [W3Schools Python Tutorial](https://www.w3schools.com/python/)
+>   [Google's Python Class](https://developers.google.com/edu/python)
+>   [Python for Beginners (YouTube)](https://www.youtube.com/watch?v=K5KVEU3aaeQ&t=56s)
+
+</details>
+
+
+<details>
+<summary>❓Python Installation Guide For Beginners❓</summary>
+ +> ### To follow along with the notebooks in this repository, you need Python installed on your machine. +> ### πŸŽ₯ How to Install Python +>    [For macOS](https://www.youtube.com/watch?v=nhv82tvFfkM) +>    [For Windows](https://www.youtube.com/watch?v=YagM_FuPLQU)

+> πŸ“Œ *Important*: During installation, make sure to check:
+> *β€œAdd Python to PATH”*
+
+### Verify Your Installation
+
+After installing, open a terminal (or Command Prompt on Windows) and run:
+
+```bash
+python --version
+pip --version
+```
+
+If the `python` command isn't found, try `python3 --version` instead; on macOS and Linux the interpreter is often installed as `python3`.
+
+</details>
+
+
+
+
+
+## πŸ“¦ Recommended Libraries
+
+In Python, you install a package by running (replace `<package-name>` with the library you want):
+```bash
+pip install <package-name>
+```
+
+Before you dive into the notebooks, make sure you have the core data-science libraries installed. You can install them all at once via pip:
+
+```bash
+pip install \
+  numpy \
+  pandas \
+  matplotlib \
+  seaborn \
+  scikit-learn \
+  notebook
+```
+
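+Once the installation finishes, a quick way to confirm that everything imports cleanly (a minimal sanity check, not part of the notebooks) is to run the following in a Python shell:
+
+```python
+# If any of these imports fail, re-run the pip install command above.
+import matplotlib
+import numpy
+import pandas
+import seaborn
+import sklearn
+
+print("numpy:", numpy.__version__)
+print("pandas:", pandas.__version__)
+print("scikit-learn:", sklearn.__version__)
+```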
+
+
+
+
+## πŸ“˜ Core Topics
+
+<details>
+<summary>πŸ”₯Understanding Jupyter Notebooks (.ipynb)πŸ”₯</summary>
+
+What Markdown (text) and code cells are, how to run them, and best practices for documenting your analysis (see the short example at the end of this section).
+
+# πŸ“ Jupyter Notebook Quickstart Guide
+
+This guide introduces Jupyter Notebook: what it is, how to install and use it locally or in the cloud, and how to work with cells, Markdown, and sharing.
+
+---
+
+## πŸ” What Is Jupyter Notebook?
+
+Jupyter Notebook is an interactive computing environment where you can combine live code, equations, visualizations, and narrative text in a single document (`.ipynb`). It’s widely used for data analysis, teaching, and rapid prototyping.
+
+- **Key Features**
+  - Interactive code execution
+  - Rich text via Markdown (headings, lists, LaTeX)
+  - Inline data visualizations
+  - Easy sharing and reproducibility
+
+---
+
+## βš™οΈ Installation & Access
+
+### 1. Install Locally
+
+You’ll need Python installed first. Then:
+
+```bash
+# Install Jupyter Notebook via pip
+pip install notebook
+```
+Or, if you use Conda:
+```bash
+conda install -c conda-forge notebook
+```
+After installation, launch the notebook server:
+```bash
+jupyter notebook
+```
+Your default browser will open at `http://localhost:8888`, showing the notebook dashboard.
+
+### 2. Use JupyterLab (Optional)
+For a more full-featured interface:
+
+```bash
+pip install jupyterlab
+jupyter lab
+```
+### 3. Cloud / Web Options
+**Google Colab**
+
+1. Go to [colab.research.google.com](https://colab.research.google.com)
+2. Sign in with your Google account
+3. Open or upload any `.ipynb` file
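+
+To get a feel for the two cell types, here is a minimal pair you could try in a fresh notebook (the variable names are just illustrative):
+
+```python
+# A code cell holds Python; run it with Shift+Enter.
+import numpy as np
+
+data = np.random.rand(5)   # imports and variables persist across cells once run
+data.mean()                # a cell displays the value of its last expression
+```
+
+A Markdown (text) cell, by contrast, holds formatted notes (e.g. `# Heading`, `**bold**`, bullet lists) and renders as rich text when you run it.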
+
+</details>
+
+
+<details>
+<summary>πŸ”₯Data Handling with NumPy & PandasπŸ”₯</summary>
+
+Learn how to load, clean, and manipulate data using NumPy arrays and Pandas DataFrames.
+
+## πŸ” Library Overview
+
+Before we dive in, here's a quick intro to the two core libraries we’ll use:
+
+### NumPy
+- **The fundamental package for numerical computing in Python.**
+- **Key features:**
+  - **Arrays:** Homogeneous, N-dimensional arrays (faster and more memory-efficient than Python lists)
+  - **Vectorized ops:** Element-wise arithmetic without explicit loops
+  - **Linear algebra & random:** Built-in support for matrix operations and pseudo-random number generation
+
+### Pandas
+- **A powerful data analysis and manipulation library built on top of NumPy.**
+- **Key features:**
+  - **DataFrame:** 2D tabular data structure with labeled axes (rows & columns)
+  - **IO tools:** Read/write CSV, Excel, SQL, JSON, and more
+  - **Series:** 1D labeled array, great for time series and single-column tables
+  - **Grouping & aggregation:** Split-apply-combine workflows for summarizing data
+
+A short pandas example follows the NumPy walkthrough below.
+
+### 1. What
+> **What you will learn in this section.**
+> By the end of this notebook, you will be able to:
+> - Create and manipulate NumPy arrays of different shapes and dtypes
+> - Perform element-wise arithmetic and apply universal functions
+> - Index, slice, and reshape arrays for efficient computation
+
+---
+
+### 2. Why
+> **Why this topic matters.**
+> NumPy arrays are the foundation of nearly all scientific computing in Python.
+> They provide:
+> - **Speed:** Vectorized operations run much faster than Python loops
+> - **Memory efficiency:** Compact storage of homogeneous data
+> - **Interoperability:** A common data structure for libraries like Pandas, SciPy, and scikit-learn
+
+---
+
+### 3. How
+> **How to do it.**
+> Follow these step-by-step examples:
+
+```python
+import numpy as np
+
+# 1) Create arrays
+a = np.array([1, 2, 3, 4])
+b = np.arange(0, 10, 2)          # [0, 2, 4, 6, 8]
+c = np.zeros((2, 3), dtype=int)  # 2Γ—3 array of zeros
+
+# 2) Element-wise arithmetic
+sum_ab = a + b[:4]   # adds element by element
+prod_ab = a * b[:4]  # multiplies element by element
+
+# 3) Universal functions
+sqrt_b = np.sqrt(b)  # square root of each element
+exp_a = np.exp(a)    # eᡃ for each element
+
+# 4) Indexing & slicing
+row = b[2:5]         # slice a subarray: [4, 6, 8]
+c[0, :] = row        # assign it to the first row of c
+
+# 5) Reshape & combine
+d = np.linspace(0, 1, 6).reshape(2, 3)
+stacked = np.vstack([c, d])      # stack two 2Γ—3 arrays into one 4Γ—3 array
+```
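+
+The pandas features from the overview deserve the same treatment. Here is a short, self-contained sketch (the cities and temperatures are invented for illustration) covering DataFrame creation, boolean filtering, a computed column, and a groupby:
+
+```python
+import pandas as pd
+
+# 1) Build a DataFrame from a dict of columns
+df = pd.DataFrame({
+    'city': ['Kalamazoo', 'Detroit', 'Lansing', 'Kalamazoo'],
+    'temp_f': [77, 59, 82, 68],
+})
+
+# 2) Filter rows with a boolean mask
+warm = df[df['temp_f'] > 70]
+
+# 3) Add a computed column (vectorized, no loop needed)
+df['temp_c'] = (df['temp_f'] - 32) * 5 / 9
+
+# 4) Split-apply-combine: mean Celsius temperature per city
+print(df.groupby('city')['temp_c'].mean())
+```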
+
+</details>
+
+
+<details>
+<summary>πŸ”₯Basic Machine Learning with scikit-learnπŸ”₯</summary>
+
+Build your first regression and classification models, split data, and evaluate performance.
+
+## πŸ” Library Overview
+scikit-learn is one of the most widely used ML libraries in Python.
+It provides simple APIs for preprocessing, training models, and evaluating performance.
+
+### ✨ Key Features
+- Large collection of supervised & unsupervised algorithms
+- Easy dataset splitting, scaling, and pipelines
+- Built-in metrics for evaluation
+- Works seamlessly with NumPy & pandas
+
+---
+
+### 1. What
+> **What you will learn in this section.**
+> By the end of this notebook, you will be able to:
+> - Split data into train/test sets
+> - Train a simple regression model
+> - Train a classification model
+> - Evaluate predictions using accuracy and error metrics
+
+---
+
+### 2. Why
+> **Why this topic matters.**
+> - Machine Learning is the core of many data science projects.
+> - scikit-learn offers a consistent interface to try many models quickly.
+> - Understanding the ML workflow (split β†’ train β†’ predict β†’ evaluate) is essential.
+
+---
+
+### 3. How
+> **How to do it.**
+> Follow these hands-on examples:
+
+```python
+from sklearn.datasets import load_iris, make_regression
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression, LogisticRegression
+from sklearn.metrics import mean_squared_error, accuracy_score
+
+# --- Regression Example ---
+# Generate synthetic data
+X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
+
+# Train/test split
+X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
+
+# Fit linear regression
+reg = LinearRegression()
+reg.fit(X_train, y_train)
+
+# Predict and evaluate
+y_pred = reg.predict(X_test)
+print("MSE (Regression):", mean_squared_error(y_test, y_pred))
+
+# --- Classification Example ---
+iris = load_iris()
+X_clf, y_clf = iris.data, iris.target
+
+X_train, X_test, y_train, y_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
+
+clf = LogisticRegression(max_iter=200)
+clf.fit(X_train, y_train)
+
+y_pred = clf.predict(X_test)
+print("Accuracy (Classification):", accuracy_score(y_test, y_pred))
+```
+
+</details>
+ + diff --git a/checkpoints/01_numpy_basics.ipynb b/checkpoints/01_numpy_basics.ipynb new file mode 100644 index 0000000..d237f78 --- /dev/null +++ b/checkpoints/01_numpy_basics.ipynb @@ -0,0 +1,137 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# βœ… Checkpoint 01 β€” NumPy Basics\n\n", + "**Goal**\n", + "- Create/reshape arrays, vectorized ops, boolean masking\n\n", + "**Rules**\n", + "- Fill only where marked as `# TODO`\n", + "- Do not change test cells (πŸ”’)\n", + "- Run all cells before submitting\n\n", + "**References**\n", + "- NumPy docs: https://numpy.org/doc/\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# πŸ”§ Setup\n", + "import numpy as np\n", + "import pandas as pd\n", + "from utils.grader import check_array, check_value\n\n", + "np.random.seed(42)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q1) Create a 3x3 array with values 0..8 (row-major)\n", + "# TODO: assign to variable 'A'\n", + "A = ... # TODO\n\n", + "# πŸ”’ Test\n", + "check_array(A, shape=(3,3), dtype=np.integer)\n", + "check_value(A.sum(), 36)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q2) From A, create a boolean mask selecting even numbers\n", + "# TODO: assign to variable 'mask_even'\n", + "mask_even = ... # TODO\n\n", + "# πŸ”’ Test\n", + "check_array(mask_even, shape=(3,3), dtype=bool)\n", + "check_value(int(mask_even.sum()), 5) # number of evens in 0..8\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q3) Reshape, stack, and compute row-wise means β†’ 'means'\n", + "# TODO: assign to variable 'means' (1D array length 3)\n", + "B = ... # TODO\n", + "means = ... # TODO\n\n", + "# πŸ”’ Test\n", + "check_array(means, shape=(3,))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q4) Broadcasting: A (3x3) and v (1x3) β†’ 'C'\n", + "v = np.array([10, 0, -10])\n", + "C = ... # TODO\n\n", + "# πŸ”’ Test\n", + "check_array(C, shape=(3,3), dtype=np.integer)\n", + "check_value(int(C[0,0] + C[2,2]), (A[0,0]+10) + (A[2,2]-10))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q5) Fancy indexing / boolean masking\n", + "# Extract odd numbers β‰₯ 3 from A β†’ 'odd_ge3'\n", + "odd_ge3 = ... 
# TODO\n\n", + "# πŸ”’ Test\n", + "check_array(\n", + " odd_ge3,\n", + " shape=(np.count_nonzero((A>=3)&(A%2==1)),),\n", + " dtype=np.integer,\n", + " allow_int_any=True\n", + ")\n", + "check_value(int(odd_ge3.min()), 3)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### βœ… Submit\n", + "- All tests above passed\n", + "- Save notebook and commit to your repo\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.8", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/checkpoints/02_pandas_basics.ipynb b/checkpoints/02_pandas_basics.ipynb new file mode 100644 index 0000000..839870c --- /dev/null +++ b/checkpoints/02_pandas_basics.ipynb @@ -0,0 +1,167 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# βœ… Checkpoint 02 β€” pandas Basics\n", + "\n", + "**Goal**\n", + "- Load/create DataFrames, filter & sort, add computed columns, groupby/aggregate, and merge.\n", + "\n", + "**Rules**\n", + "- Fill only where marked as `# TODO`\n", + "- Do not change test cells (πŸ”’)\n", + "- Run all cells before submitting\n", + "\n", + "**References**\n", + "- pandas docs: https://pandas.pydata.org/docs/\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# πŸ”§ Setup\n", + "import numpy as np\n", + "import pandas as pd\n", + "from utils.grader import (\n", + " check_array, check_value, check_dataframe_columns,\n", + " check_series_index_values, check_len\n", + ")\n", + "np.random.seed(42)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Small in-memory data we'll use throughout\n", + "data = {\n", + " 'city': ['Ann Arbor','Kalamazoo','Detroit','Grand Rapids','Lansing'],\n", + " 'temp_f': [68, 77, 59, 90, 82],\n", + " 'rain': [False, True, False, False, True],\n", + " 'date': pd.to_datetime(['2025-08-20','2025-08-20','2025-08-20','2025-08-20','2025-08-20'])\n", + "}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q1) Create a DataFrame 'df' from the dict 'data' with columns in order: city, temp_f, rain, date\n", + "# TODO: assign to variable 'df'\n", + "df = ... # TODO\n", + "\n", + "# πŸ”’ Test\n", + "check_dataframe_columns(df, ['city','temp_f','rain','date'])\n", + "check_value(df.iloc[0]['city'], 'Ann Arbor')\n", + "check_len(df, 5)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q2) Filter rows where rain == False, sort by temp_f descending, reset index β†’ 'df_dry'\n", + "# TODO: assign to variable 'df_dry'\n", + "df_dry = ... 
# TODO\n", + "\n", + "# πŸ”’ Test\n", + "check_len(df_dry, 3)\n", + "check_value(df_dry.iloc[0]['temp_f'], 90)\n", + "check_dataframe_columns(df_dry, ['city','temp_f','rain','date'])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q3) Add a Celsius column: temp_c = round((temp_f - 32) * 5/9, 1)\n", + "# TODO: create 'temp_c' column on df\n", + "...\n", + "\n", + "# πŸ”’ Test\n", + "check_value(float(df.loc[df['city']=='Grand Rapids','temp_c'].iloc[0]), round((90-32)*5/9,1))\n", + "check_dataframe_columns(df, ['city','temp_f','rain','date','temp_c'])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q4) Group by 'rain' and compute mean temp_c β†’ 'avg_temp_by_rain' (Series indexed by rain boolean)\n", + "# TODO: assign to variable 'avg_temp_by_rain'\n", + "avg_temp_by_rain = ... # TODO\n", + "\n", + "# πŸ”’ Test (values checked approximately)\n", + "check_series_index_values(avg_temp_by_rain, {False, True})\n", + "mean_false = avg_temp_by_rain.loc[False]\n", + "mean_true = avg_temp_by_rain.loc[True]\n", + "check_value(round(float(mean_false),1), round(((68-32)*5/9 + (59-32)*5/9 + (90-32)*5/9)/3, 1))\n", + "check_value(round(float(mean_true),1), round(((77-32)*5/9 + (82-32)*5/9)/2, 1))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Q5) Merge: create a DataFrame 'city_region' with columns city and region, then left-merge onto df β†’ 'df_merged'\n", + "city_region = pd.DataFrame({\n", + " 'city': ['Ann Arbor','Kalamazoo','Detroit','Grand Rapids','Lansing'],\n", + " 'region': ['SE','SW','SE','W','C']\n", + "})\n", + "# TODO: left-merge on 'city' to produce df_merged\n", + "df_merged = ... 
# TODO\n",
+    "\n",
+    "# πŸ”’ Test\n",
+    "check_dataframe_columns(df_merged, ['city','temp_f','rain','date','temp_c','region'])\n",
+    "check_value(set(df_merged['region']), {'SE','SW','W','C'})\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### βœ… Submit\n",
+    "- All tests above passed\n",
+    "- Save notebook and commit to your repo\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11",
+   "mimetype": "text/x-python",
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "pygments_lexer": "ipython3",
+   "nbconvert_exporter": "python",
+   "file_extension": ".py"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/checkpoints/03_matplotlib_seaborn.ipynb b/checkpoints/03_matplotlib_seaborn.ipynb
new file mode 100644
index 0000000..d87d586
--- /dev/null
+++ b/checkpoints/03_matplotlib_seaborn.ipynb
@@ -0,0 +1,237 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# βœ… Checkpoint 03 β€” Matplotlib & Seaborn\n",
+    "\n",
+    "**Goal**\n",
+    "- Create basic plots with Matplotlib & Seaborn: scatter, histogram, boxplot, and aggregated barplot.\n",
+    "- Set titles/labels properly and export figures as files.\n",
+    "\n",
+    "**Rules**\n",
+    "- Fill only where marked as `# TODO`\n",
+    "- Do not change test cells (πŸ”’)\n",
+    "- Run all cells in order before submitting\n",
+    "\n",
+    "**References**\n",
+    "- Matplotlib docs: https://matplotlib.org/stable/\n",
+    "- Seaborn docs: https://seaborn.pydata.org/\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# πŸ”§ Setup\n",
+    "import os\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "from utils.grader import (\n",
+    "    check_value, check_len, check_file_exists,\n",
+    "    check_axes_instance, check_xlabel, check_ylabel, check_title_contains,\n",
+    "    check_num_lines, check_num_collections, check_num_patches\n",
+    ")\n",
+    "np.random.seed(42)\n",
+    "\n",
+    "# ensure output dir\n",
+    "os.makedirs('outputs', exist_ok=True)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# small synthetic dataset (deterministic)\n",
+    "n = 120\n",
+    "days = np.random.choice(['Thur','Fri','Sat','Sun'], size=n, p=[0.25,0.2,0.3,0.25])\n",
+    "sex = np.random.choice(['Male','Female'], size=n)\n",
+    "smoker = np.random.choice(['Yes','No'], size=n, p=[0.3,0.7])\n",
+    "total_bill = np.round(np.random.normal(loc=24, scale=8, size=n).clip(5, 80), 2)\n",
+    "tip = np.round((total_bill * np.random.uniform(0.08, 0.22, size=n)), 2)\n",
+    "\n",
+    "df = pd.DataFrame({\n",
+    "    'day': days,\n",
+    "    'sex': sex,\n",
+    "    'smoker': smoker,\n",
+    "    'total_bill': total_bill,\n",
+    "    'tip': tip\n",
+    "})\n",
+    "df.head()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Q1) Matplotlib Scatter\n",
+    "Create a scatter plot of `total_bill` (x) vs `tip` (y) using **Matplotlib**.\n",
+    "- Put the **x label**: `Total Bill ($)`\n",
+    "- Put the **y label**: `Tip ($)`\n",
+    "- Title should contain the word **\"Scatter\"**\n",
+    "- Save the fig object in a variable named **`fig1`**, axes in **`ax1`**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO: create fig1, ax1, draw scatter, 
set labels and title\n", + "fig1, ax1 = ... # TODO\n", + "\n", + "# πŸ”’ Test\n", + "check_axes_instance(ax1)\n", + "check_xlabel(ax1, 'Total Bill ($)')\n", + "check_ylabel(ax1, 'Tip ($)')\n", + "check_title_contains(ax1, 'Scatter')\n", + "check_num_collections(ax1, 1) # one scatter collection\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q2) Seaborn Boxplot\n", + "Using **Seaborn**, create a **boxplot** of `tip` by `day` (x=`day`, y=`tip`).\n", + "- Store the Axes in a variable named **`ax2`**\n", + "- x label must be `Day`, y label must be `Tip ($)`\n", + "- Title should contain the word **\"Box\"**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: create ax2 using seaborn.boxplot\n", + "ax2 = ... # TODO\n", + "...\n", + "\n", + "# πŸ”’ Test\n", + "check_axes_instance(ax2)\n", + "check_xlabel(ax2, 'Day')\n", + "check_ylabel(ax2, 'Tip ($)')\n", + "check_title_contains(ax2, 'Box')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q3) Matplotlib Histogram\n", + "Create a **histogram** of `total_bill` with **10 bins** using Matplotlib.\n", + "- Save fig as **`fig3`**, axes as **`ax3`**\n", + "- Title should contain **\"Histogram\"**\n", + "- x label `Total Bill ($)`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: histogram with 10 bins\n", + "fig3, ax3 = ... # TODO\n", + "...\n", + "\n", + "# πŸ”’ Test\n", + "check_axes_instance(ax3)\n", + "check_title_contains(ax3, 'Histogram')\n", + "check_xlabel(ax3, 'Total Bill ($)')\n", + "check_num_patches(ax3, 10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q4) Seaborn Aggregated Barplot\n", + "Add a computed column `tip_pct = tip / total_bill * 100`. Then plot the **mean tip % by day** using Seaborn (barplot).\n", + "- Store the Axes in **`ax4`**\n", + "- There should be one bar per unique day in `df['day']`\n", + "- y label should contain the `%` sign\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: add tip_pct column and make barplot of mean tip_pct by day\n", + "...\n", + "ax4 = ... 
# TODO\n", + "...\n", + "\n", + "# πŸ”’ Test\n", + "check_axes_instance(ax4)\n", + "unique_days = sorted(df['day'].unique().tolist())\n", + "check_len(ax4.patches, len(unique_days))\n", + "check_ylabel(ax4, '%') # contains percent sign\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q5) Save Figure to File\n", + "Save the Q1 scatter figure to `outputs/fig_scatter.png` using `fig1.savefig(...)`.\n", + "- The path must be exactly `outputs/fig_scatter.png`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: save fig1 to outputs/fig_scatter.png\n", + "...\n", + "\n", + "# πŸ”’ Test\n", + "check_file_exists('outputs/fig_scatter.png')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### βœ… Submit\n", + "- All tests above passed\n", + "- Save notebook and commit to your repo\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/checkpoints/04_plotly_intro.ipynb b/checkpoints/04_plotly_intro.ipynb new file mode 100644 index 0000000..6912439 --- /dev/null +++ b/checkpoints/04_plotly_intro.ipynb @@ -0,0 +1,237 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# βœ… Checkpoint 04 β€” Plotly Intro\n", + "\n", + "**Goal**\n", + "- Build interactive charts with Plotly (scatter, histogram, bar) using both Express and Graph Objects.\n", + "- Set titles/axis labels, count traces, and export figures to HTML.\n", + "\n", + "**Rules**\n", + "- Fill only where marked as `# TODO`.\n", + "- Do not change test cells (πŸ”’).\n", + "- Run all cells in order before submitting.\n", + "\n", + "**References**\n", + "- Plotly docs: https://plotly.com/python/\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# πŸ”§ Setup\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "import plotly.express as px\n", + "import plotly.graph_objects as go\n", + "from utils.grader import (\n", + " check_file_exists,\n", + " check_figure, check_trace_count,\n", + " check_axis_title, check_layout_title_contains,\n", + " check_bar_count, check_trace_modes\n", + ")\n", + "np.random.seed(42)\n", + "os.makedirs('outputs', exist_ok=True)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Small deterministic dataset (similar to 'tips')\n", + "n = 120\n", + "days = np.random.choice(['Thur','Fri','Sat','Sun'], size=n, p=[0.25,0.2,0.3,0.25])\n", + "sex = np.random.choice(['Male','Female'], size=n)\n", + "smoker = np.random.choice(['Yes','No'], size=n, p=[0.3,0.7])\n", + "total_bill = np.round(np.random.normal(loc=24, scale=8, size=n).clip(5, 80), 2)\n", + "tip = np.round((total_bill * np.random.uniform(0.08, 0.22, size=n)), 2)\n", + "df = pd.DataFrame({\n", + " 'day': days,\n", + " 'sex': sex,\n", + " 'smoker': smoker,\n", + " 'total_bill': total_bill,\n", + " 'tip': tip\n", + "})\n", + "df.head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q1) Plotly Express β€” Scatter\n", + "Create a scatter plot of 
`total_bill` (x) vs `tip` (y) using **Plotly Express**.\n", + "- Color by `day` (optional but encouraged).\n", + "- Title should contain **\"Scatter\"**.\n", + "- x-axis title: `Total Bill ($)`; y-axis title: `Tip ($)`.\n", + "- Store the figure in **`fig1`**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: create fig1 with px.scatter\n", + "fig1 = ... # TODO\n", + "# Example (for reference):\n", + "# fig1 = px.scatter(df, x='total_bill', y='tip', color='day', title='Scatter: Tip vs Total Bill')\n", + "# fig1.update_layout(xaxis_title='Total Bill ($)', yaxis_title='Tip ($)')\n", + "\n", + "# πŸ”’ Test\n", + "check_figure(fig1)\n", + "check_trace_count(fig1, expected_min=1) # at least 1 trace (color may create >1)\n", + "check_layout_title_contains(fig1, 'Scatter')\n", + "check_axis_title(fig1, axis='x', expected='Total Bill ($)')\n", + "check_axis_title(fig1, axis='y', expected='Tip ($)')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q2) Plotly Express β€” Histogram\n", + "Create a histogram of `total_bill` with **10 bins**.\n", + "- Title should contain **\"Histogram\"**.\n", + "- x-axis title: `Total Bill ($)`.\n", + "- Store the figure in **`fig2`**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: create fig2 with px.histogram and nbins=10\n", + "fig2 = ... # TODO\n", + "# Example:\n", + "# fig2 = px.histogram(df, x='total_bill', nbins=10, title='Histogram: Total Bill')\n", + "# fig2.update_layout(xaxis_title='Total Bill ($)')\n", + "\n", + "# πŸ”’ Test\n", + "check_figure(fig2)\n", + "check_trace_count(fig2, expected_min=1)\n", + "check_layout_title_contains(fig2, 'Histogram')\n", + "check_axis_title(fig2, axis='x', expected='Total Bill ($)')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q3) Plotly Express β€” Bar (mean tip%)\n", + "Add a computed column `tip_pct = tip / total_bill * 100`. Then plot the **mean tip % by day** as a bar chart.\n", + "- One bar per unique `day`.\n", + "- y-axis title should contain `%`.\n", + "- Store the figure in **`fig3`**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: compute tip_pct and create fig3\n", + "...\n", + "fig3 = ... # TODO\n", + "\n", + "# πŸ”’ Test\n", + "check_figure(fig3)\n", + "unique_days = sorted(df['day'].unique().tolist())\n", + "check_bar_count(fig3, expected=len(unique_days))\n", + "check_axis_title(fig3, axis='y', expected='%') # contains percent sign\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q4) Graph Objects β€” Line (running mean of tip)\n", + "Using **plotly.graph_objects**, build a line chart of the running mean of `tip` over row index.\n", + "- Use `go.Figure` with a single `go.Scatter` trace in `'lines'` mode.\n", + "- Title should contain **\"Running Mean\"**.\n", + "- Store the figure in **`fig4`**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: create running mean and fig4 with go.Figure\n", + "...\n", + "fig4 = ... 
# TODO\n", + "\n", + "# πŸ”’ Test\n", + "check_figure(fig4)\n", + "check_trace_count(fig4, expected_min=1, expected_max=1)\n", + "check_trace_modes(fig4, must_include='lines')\n", + "check_layout_title_contains(fig4, 'Running Mean')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Q5) Export to HTML\n", + "Save the Q1 scatter figure to **`outputs/fig_scatter.html`** using `fig1.write_html(...)`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: export fig1 to outputs/fig_scatter.html\n", + "...\n", + "\n", + "# πŸ”’ Test\n", + "check_file_exists('outputs/fig_scatter.html')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### βœ… Submit\n", + "- All tests above passed\n", + "- Save notebook and commit to your repo\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/checkpoints/utils/grader.py b/checkpoints/utils/grader.py new file mode 100644 index 0000000..4ac8796 --- /dev/null +++ b/checkpoints/utils/grader.py @@ -0,0 +1,168 @@ +# utils/grader.py +import os +import numpy as np +import pandas as pd +import matplotlib +import matplotlib.pyplot as plt + +def _fail(msg): + raise AssertionError(msg) + +# Generic / NumPy / pandas +def check_array(arr, shape=None, dtype=None, allow_int_any=False): + if not isinstance(arr, np.ndarray): + _fail(f"❌ Expected numpy.ndarray, got {type(arr)}") + if shape is not None and arr.shape != shape: + _fail(f"❌ Wrong shape: expected {shape}, got {arr.shape}") + if dtype is not None: + if allow_int_any and np.issubdtype(arr.dtype, np.integer): + pass + elif not np.issubdtype(arr.dtype, dtype): + _fail(f"❌ Wrong dtype: expected {dtype}, got {arr.dtype}") + print("βœ… Array check passed.") + +def check_value(val, expected, tol=1e-8): + if isinstance(val, (float, np.floating)) or isinstance(expected, (float, np.floating)): + if abs(float(val) - float(expected)) > tol: + _fail(f"❌ Wrong value: expected {expected}, got {val}") + else: + if val != expected: + _fail(f"❌ Wrong value: expected {expected}, got {val}") + print("βœ… Value check passed.") + +def check_dataframe_columns(df, expected_cols): + if not isinstance(df, pd.DataFrame): + _fail(f"❌ Expected pandas.DataFrame, got {type(df)}") + missing = [c for c in expected_cols if c not in df.columns] + if missing: + _fail(f"❌ Missing columns: {missing}") + print("βœ… DataFrame columns check passed.") + +def check_series_index_values(s, expected_index_set): + if not isinstance(s, pd.Series): + _fail(f"❌ Expected pandas.Series, got {type(s)}") + if set(list(s.index)) != set(list(expected_index_set)): + _fail(f"❌ Unexpected index: got {list(s.index)}, expected set {list(expected_index_set)}") + print("βœ… Series index check passed.") + +def check_len(obj, expected_len): + try: + n = len(obj) + except Exception as e: + _fail(f"❌ Object has no len(): {e}") + if n != expected_len: + _fail(f"❌ Wrong length: expected {expected_len}, got {n}") + print("βœ… Length check passed.") + +def check_file_exists(path): + if not os.path.exists(path): + _fail(f"❌ File not found: {path}") + print("βœ… File exists.") + 
+# Matplotlib / Seaborn helpers +def check_axes_instance(ax): + if not hasattr(ax, "get_xlabel") or not hasattr(ax, "get_ylabel"): + _fail(f"❌ Expected a Matplotlib Axes-like object, got {type(ax)}") + print("βœ… Axes instance check passed.") + +def check_xlabel(ax, expected): + label = ax.get_xlabel() + if label != expected and expected not in label: + _fail(f"❌ X label mismatch. Got '{label}', expected '{expected}' (or containing it).") + print("βœ… X label ok.") + +def check_ylabel(ax, expected): + label = ax.get_ylabel() + if label != expected and expected not in label: + _fail(f"❌ Y label mismatch. Got '{label}', expected '{expected}' (or containing it).") + print("βœ… Y label ok.") + +def check_title_contains(ax, keyword): + title = ax.get_title() + if keyword not in title: + _fail(f"❌ Title does not contain '{keyword}'. Got '{title}'") + print("βœ… Title contains keyword.") + +def check_num_lines(ax, expected_n): + n = len(ax.lines) + if n != expected_n: + _fail(f"❌ Expected {expected_n} line(s), got {n}") + print("βœ… Number of lines ok.") + +def check_num_collections(ax, expected_n): + n = len(ax.collections) + if n != expected_n: + _fail(f"❌ Expected {expected_n} collection(s), got {n}") + print("βœ… Number of collections ok.") + +def check_num_patches(ax, expected_n): + n = len(ax.patches) + if n != expected_n: + _fail(f"❌ Expected {expected_n} patch(es), got {n}") + print("βœ… Number of patches ok.") + +# Plotly helpers +def check_figure(fig): + try: + import plotly.graph_objects as go + except Exception as e: + _fail(f"❌ Plotly not installed: {e}") + if not isinstance(fig, go.Figure): + _fail(f"❌ Expected plotly.graph_objects.Figure, got {type(fig)}") + print("βœ… Figure instance ok.") + +def check_trace_count(fig, expected_min=None, expected_max=None): + n = len(fig.data) + if expected_min is not None and n < expected_min: + _fail(f"❌ Too few traces: got {n}, expected >= {expected_min}") + if expected_max is not None and n > expected_max: + _fail(f"❌ Too many traces: got {n}, expected <= {expected_max}") + print("βœ… Trace count ok.") + +def _get_axis(fig, axis): + if axis == 'x': + return fig.layout.xaxis + elif axis == 'y': + return fig.layout.yaxis + else: + _fail("❌ axis must be 'x' or 'y'") + +def check_axis_title(fig, axis='x', expected=None): + ax = _get_axis(fig, axis) + title = getattr(ax.title, "text", "") if ax.title else "" + if expected is None: + _fail("❌ expected title text is None") + if expected != title and (expected not in title): + _fail(f"❌ {axis}-axis title mismatch. Got '{title}', expected '{expected}' (or containing it).") + print(f"βœ… {axis.upper()} axis title ok.") + +def check_layout_title_contains(fig, keyword): + title = getattr(fig.layout.title, "text", "") if fig.layout.title else "" + if keyword not in title: + _fail(f"❌ Layout title does not contain '{keyword}'. 
Got '{title}'") + print("βœ… Layout title contains keyword.") + +def check_bar_count(fig, expected): + if len(fig.data) == 0: + _fail("❌ No traces in figure.") + trace = fig.data[0] + xs = getattr(trace, "x", None) + if xs is None: + _fail("❌ Bar trace has no x values.") + n = len(xs) + if n != expected: + _fail(f"❌ Expected {expected} bars, got {n}") + print("βœ… Bar count ok.") + +def check_trace_modes(fig, must_include='lines'): + if len(fig.data) == 0: + _fail("❌ No traces in figure.") + modes = [] + for t in fig.data: + mode = getattr(t, "mode", None) + if mode: + modes.append(mode) + joined = ",".join(modes) + if must_include not in joined: + _fail(f"❌ Required mode '{must_include}' not found in traces. Got modes: {modes}") + print("βœ… Trace mode ok.") diff --git a/libraries.md b/libraries.md new file mode 100644 index 0000000..6744b24 --- /dev/null +++ b/libraries.md @@ -0,0 +1,51 @@ +# πŸ“š Top 26 Python Libraries for Data Science + + + +## Staple Python Libraries for Data Science +1. **NumPy** – Core numerical computing library in Python, offering fast operations on multi-dimensional arrays and matrices, essential for scientific computing and linear algebra. +2. **pandas** – Powerful data analysis/manipulation tool providing DataFrame structures, easy I/O with multiple file formats, and advanced indexing, grouping, and time series functionality. +3. **Matplotlib** – Fundamental plotting library for creating static, interactive, and animated visualizations with full customization. +4. **Seaborn** – High-level statistical visualization library built on Matplotlib, offering attractive and informative default styles for complex plots. +5. **Plotly** – Interactive graphing library for web-based visualizations, supporting 3D charts and dashboards via Dash. +6. **scikit-learn** – Comprehensive machine learning library for classification, regression, clustering, and preprocessing, with a consistent API. + + +
+ +## Machine Learning Python Libraries +7. **LightGBM** – Gradient boosting framework optimized for speed, memory efficiency, and accuracy, supporting large-scale and GPU-based learning. +8. **XGBoost** – Widely used gradient boosting library known for performance in Kaggle competitions, supporting distributed training and multiple platforms. +9. **CatBoost** – High-performance gradient boosting library with strong categorical feature handling and excellent CPU/GPU support. +10. **Statsmodels** – Statistical modeling library for regression, hypothesis testing, and time series analysis, with an R-like interface. +11. **RAPIDS cuDF/cuML** – NVIDIA GPU-accelerated libraries for DataFrame manipulation and machine learning with pandas- and scikit-learn-like APIs. +12. **Optuna** – Hyperparameter optimization framework with efficient algorithms, pruning, and visualization tools. + +
+ + +## Automated Machine Learning Python Libraries +13. **PyCaret** – Low-code machine learning library automating the end-to-end ML workflow for rapid experimentation. +14. **H2O** – Scalable ML platform for big data, supporting distributed computing and AutoML. +15. **TPOT** – AutoML tool using genetic programming to optimize ML pipelines automatically. +16. **auto-sklearn** – Automated model selection and hyperparameter tuning built on scikit-learn with Bayesian optimization. +17. **FLAML** – Lightweight AutoML library focused on finding accurate models quickly with minimal computational cost. + +
+ + +## Deep Learning Python Libraries +18. **TensorFlow** – Google’s open-source ML framework for scalable deep learning, offering APIs for building, training, and deploying models. +19. **PyTorch** – Facebook’s deep learning framework known for dynamic computation graphs, ease of use, and strong research-to-production transition. +20. **FastAI** – High-level deep learning library on PyTorch with concise APIs for state-of-the-art results. +21. **Keras** – User-friendly deep learning API integrated with TensorFlow, designed for quick prototyping and experimentation. +22. **PyTorch Lightning** – Lightweight wrapper for PyTorch that organizes code for reproducibility and scalability. + + +
+ +## Natural Language Processing Python Libraries +23. **NLTK** – Comprehensive NLP toolkit for tokenization, parsing, and linguistic processing with access to corpora like WordNet. +24. **spaCy** – Industrial-strength NLP library for large-scale text processing, supporting deep learning integration and 60+ languages. +25. **Gensim** – Topic modeling and vector space modeling library optimized for large corpora and memory efficiency. +26. **Hugging Face Transformers** – Library for state-of-the-art transformer-based models for text, vision, audio, and multimodal tasks, supporting PyTorch, TensorFlow, and JAX.