Big Data Analysis using PySpark on BigMart Dataset with ML model and business insights.

pavithralanalytics/Big-Data-Analysis-PySpark

Big Data Analysis Using PySpark

📌 Project Overview

This project demonstrates big data processing with PySpark on the BigMart Sales dataset. The analysis covers data preprocessing, aggregation, feature engineering, and a linear regression model built with Spark MLlib to predict item outlet sales.

🛠 Tools Used

  • Python
  • PySpark
  • Spark MLlib
  • Jupyter Notebook

📊 Dataset

The BigMart Sales dataset, including the following fields:

  • Item Identifier
  • Item Type
  • Item MRP
  • Outlet Type
  • Location Tier
  • Item Outlet Sales

🔍 Analysis Performed

  • Data Cleaning
  • Missing Value Handling
  • Aggregation & GroupBy Operations
  • Sales Trend Analysis
  • Linear Regression Model

📈 Key Insights

  • Supermarket-type outlets generate higher revenue than the other outlet types
  • Tier 3 cities show strong sales trends
  • Item MRP significantly impacts sales
  • PySpark efficiently processes large-scale data

🎯 Conclusion

This project demonstrates scalable, distributed data processing with PySpark, an approach suited to large datasets in real-world business environments.
