This project demonstrates big data processing with PySpark on the BigMart Sales dataset. The analysis covers data preprocessing, aggregation, feature engineering, and a machine learning model that predicts sales.
Technologies used:
- Python
- PySpark
- Spark MLlib
- Jupyter Notebook
The BigMart Sales dataset includes the following fields:
- Item Identifier
- Item Type
- Item MRP
- Outlet Type
- Location Tier
- Item Outlet Sales
The analysis covers the following steps:
- Data Cleaning
- Missing Value Handling
- Aggregation & GroupBy Operations
- Sales Trend Analysis
- Linear Regression Model
Key findings:
- Supermarket Type outlets generate higher revenue
- Tier 3 cities show strong sales trends
- Item MRP significantly impacts sales
- PySpark efficiently processes large-scale data
This project demonstrates scalable data processing using distributed computing with PySpark, suitable for large datasets in real-world business environments.