Python | Jupyter Notebook | Regression
Our objective is to predict hourly bike rentals to ensure supply can meet future demand.
In addition, we will investigate the following business problems:
- Is there a day of the week when bikes are rented more than others?
- Is there an hour when bikes are rented most?
- Is there a season when bikes are rented more than others?
- Is there a temperature when most bikes are rented?
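A minimal sketch of how the day, hour, and season questions could be answered with pandas, by grouping on each attribute and comparing mean hourly rentals. The sample rows and the column names (`date`, `hour`, `season`, `rented_bikes`) are made up for illustration and are not the dataset's real headers; `busiest` is a hypothetical helper.

```python
import pandas as pd

# Tiny made-up sample; column names are illustrative, not the dataset's real headers.
rentals = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-01", "2018-06-02", "2018-12-03"]),
    "hour": [8, 18, 18, 8],
    "season": ["Summer", "Summer", "Summer", "Winter"],
    "rented_bikes": [900, 1500, 1300, 200],
})

# Derive the day of the week from the date column.
rentals["day_of_week"] = rentals["date"].dt.day_name()


def busiest(df, column):
    """Return the value of `column` with the highest mean hourly rentals."""
    return df.groupby(column)["rented_bikes"].mean().idxmax()


for column in ["day_of_week", "hour", "season"]:
    print(column, "->", busiest(rentals, column))
```

Comparing means rather than totals keeps groups with more rows (for example, a season with more operating days) from winning by volume alone.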
Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, pandas_profiling, math
Supervised learning approach: Regression
IDE: Jupyter Notebook
Data: The 'Seoul Bike Sharing' dataset was obtained from the UCI Machine Learning Repository.
The dataset comprises one year of hourly bike rental records with the following attributes:
- Date (Dec 2017 – Nov 2018)
- Hour of Day
- Number of bikes rented hourly (0 – 3,556)
- Hourly Weather Conditions (temp, humidity, rain, snow, wind, etc.)
- Seasons (Spring, Summer, Fall, Winter)
- Holiday (Holiday/Non-Holiday)
- Functional Day (Closed/Open)
- Frame the business problem
- Obtain the data
- Preprocessing
- Exploratory Data Analysis (EDA)
- Perform modeling
- Communicate results
I started by converting the categorical variables (Seasons, Holiday, Functional Day) into numeric variables.
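One way this conversion could look, as a sketch only: the notebook's actual encoding is not stated here, so the choice of 0/1 flags for the two-level columns and one-hot indicators for the seasons is an assumption, as are the column names.

```python
import pandas as pd

# Made-up rows; column names are illustrative, category labels follow the attribute list.
df = pd.DataFrame({
    "season": ["Winter", "Spring", "Summer", "Fall"],
    "holiday": ["Non-Holiday", "Holiday", "Non-Holiday", "Non-Holiday"],
    "functional_day": ["Open", "Open", "Closed", "Open"],
})

# Two-level columns become 0/1 flags.
df["holiday"] = (df["holiday"] == "Holiday").astype(int)
df["functional_day"] = (df["functional_day"] == "Open").astype(int)

# The four-level season column becomes one indicator column per season.
encoded = pd.get_dummies(df, columns=["season"], dtype=int)
print(encoded)
```

One-hot indicators avoid implying an order between seasons, which matters for Linear Regression and Support Vector Machine more than for tree-based models.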
Then I created training and test sets, holding out 25% of the data for testing.
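A sketch of the split using scikit-learn's `train_test_split`. The arrays below are synthetic stand-ins for the real features and rental counts, and `random_state=42` is an assumed seed added only so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and hourly rental counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3557, size=200)

# test_size=0.25 holds out a quarter of the rows; random_state is an assumed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```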
For the first model, I used all attributes as features. I tried three algorithms (Random Forest, Linear Regression, Support Vector Machine) and found that Random Forest performed best.
Random Forest Outcomes:
R Squared: 0.929
RMSE: 168.621
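The comparison of the three algorithms can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the hyperparameters (scikit-learn defaults, `n_estimators=100`, the seeds) are assumptions, and the printed scores will not match the figures reported above. RMSE is taken as the square root of the mean squared error.

```python
from math import sqrt

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data with a nonlinear signal, standing in for the rental data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 1000 * np.sin(6 * X[:, 0]) ** 2 + 500 * X[:, 1] + rng.normal(0, 20, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Assumed settings; the notebook's hyperparameters are not stated.
models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Linear Regression": LinearRegression(),
    "Support Vector Machine": SVR(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2 = r2_score(y_test, predictions)
    rmse = sqrt(mean_squared_error(y_test, predictions))
    scores[name] = {"r2": r2, "rmse": rmse}
    print(f"{name}: R Squared = {r2:.3f}, RMSE = {rmse:.3f}")
```

Both metrics are computed on the held-out test set, so they estimate performance on hours the model has not seen.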
The first model showed that temperature and hour of the day were the most important features. With this insight, I discretized the temperature data and built a second model using the binned temperature as a new feature. I tried the same three algorithms (Random Forest, Linear Regression, Support Vector Machine) and again found that Random Forest performed best.
Random Forest Outcomes (with discretized temperature):
R Squared: 0.925
RMSE: 173.585
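The two steps behind the second model, reading feature importances from a fitted Random Forest and binning temperature, might look like the sketch below. The data is synthetic, the column names are illustrative, and the bin edges and labels are hypothetical because the notebook's actual bins are not stated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and target; column names are illustrative.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "hour": rng.integers(0, 24, size=300),
    "temperature_c": rng.uniform(-15, 35, size=300),
    "humidity_pct": rng.uniform(10, 95, size=300),
})
target = (
    40 * features["hour"]
    + 25 * features["temperature_c"]
    + rng.normal(0, 30, size=300)
)

# Impurity-based importances from a fitted forest; they sum to 1.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(features, target)
importances = pd.Series(forest.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))

# Hypothetical bin edges and labels; each bin includes its upper edge.
edges = [-20, 0, 10, 20, 30, 40]
labels = ["0 or below", "0 to 10", "10 to 20", "20 to 30", "30 to 40"]
features["temperature_bin"] = pd.cut(
    features["temperature_c"], bins=edges, labels=labels
)
print(features["temperature_bin"].value_counts().sort_index())
```

The per-bin counts also give a direct way to look at which temperature range sees the most rentals once the rental counts are grouped by `temperature_bin`.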