This project performs Exploratory Data Analysis (EDA) on an Amazon product dataset.
The dataset contains product details, prices, discounts, ratings, reviews, and user information.
The goal of this analysis is to:
- Understand the structure and quality of the dataset.
- Identify trends in pricing, discounting, and ratings.
- Explore customer review patterns.
- Detect potential issues like missing values, duplicates, or imbalances.
The dataset includes the following key columns:
| Column | Description |
|---|---|
product_id |
Unique identifier for each product |
product_name |
Name/description of the product |
category |
Product category (e.g., Electronics, Accessories) |
discounted_price |
Selling price after discount |
actual_price |
Original price before discount |
discount_percentage |
Percentage discount offered |
rating |
Customer rating (out of 5) |
rating_count |
Number of ratings |
about_product |
Short description/features |
user_id |
Unique ID of reviewer |
user_name |
Name of reviewer |
review_id |
Unique ID of review |
review_title |
Title of review |
review_content |
Full review text |
img_link |
Product image link |
product_link |
Product page link |
- Used
.info()to check data types, null values, and dataset size. - Found that most columns are complete, with very few missing values.
.describe()applied to both numeric and categorical columns.- Found mean ≈ median in prices → data is fairly symmetric.
- Ratings cluster around 4.1, showing positive bias.
- Computed correlation matrix for numeric features.
- Observed strong negative correlation between
discount_percentageanddiscounted_price. - Weak/no correlation between
ratingand price → ratings are not price-driven.
- Bar Chart: Average rating per category.
- Boxplot: Discount % distribution across categories.
- Scatterplot: Discounted price vs rating.
- Word Cloud: Most frequent terms in reviews.
- Heatmap: Correlations between numeric features.
- Found duplicate product IDs (same product reviewed multiple times).
- Prices and discounts stored as strings (
₹,%) → cleaned and converted to numeric.
- Many products receive 4★ or higher → customer reviews skew positive.
- Discounts are widely offered (~50% most frequent).
- Certain categories dominate the dataset (e.g., Electronics & Accessories).
- Some reviews and users appear multiple times → dataset contains duplicate/overlapping entries.
- Python 3
- Pandas → data cleaning & manipulation
- NumPy → numerical operations
- Matplotlib / Seaborn → data visualization
- WordCloud → review text analysis
- Clone the repository:
git clone https://github.com/HarshitWaldia/Exploratory-Data-Analysis.git cd Exploratory-Data-Analysis - Install required libraries:
pip install -r requirements.txt
3.Open the Jupyter Notebook:
jupyter notebook Amazon_EDA.ipynb
- Run the cells step by step to reproduce the analysis.
-
Build a recommendation system using ratings & categories.
-
Perform sentiment analysis on review text.
-
Use ML models to predict product ratings based on price & discount.
Harshit Waldia