- Repository: `challenge-regression`
- Type of Challenge: Consolidation
- Duration: 5 days
- Deadline: 06/05/2025 17:00
- Solo Challenge
- Be able to preprocess data for machine learning.
- Be able to apply a regression in a real context.
- Be able to understand some core machine learning concepts.
The real estate company "ImmoEliza" asks you to create a machine learning model to predict the prices of real estate sales in Belgium.
You have collected your data, you have cleaned and analyzed it a first time! So it's time to do some machine learning with it!
Preprocess the data to be used with machine learning.
- You have to handle NaNs.
- You have to handle categorical data.
- You have to select features.
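The three preprocessing steps above can be sketched with `pandas` as follows. This is a minimal illustration, not a prescribed approach: the toy data and the column names (`price`, `living_area`, `property_type`, `listing_id`) are hypothetical, and median imputation and one-hot encoding are only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real column names depend on your own dataset.
df = pd.DataFrame({
    "price": [250000, 320000, 180000, 410000],
    "living_area": [90.0, np.nan, 65.0, 150.0],
    "property_type": ["HOUSE", "APARTMENT", None, "HOUSE"],
    "listing_id": [1, 2, 3, 4],
})

# 1. Handle NaNs: median for the numeric column,
#    an explicit "UNKNOWN" category for the text column.
df["living_area"] = df["living_area"].fillna(df["living_area"].median())
df["property_type"] = df["property_type"].fillna("UNKNOWN")

# 2. Handle categorical data: one-hot encode the text column.
df = pd.get_dummies(df, columns=["property_type"], dtype=int)

# 3. Select features: drop the target and a column that carries no signal.
features = df.drop(columns=["price", "listing_id"])
target = df["price"]

print(features.head())
```

Note that computing the median on the full dataset lets information from future test rows leak into training; in a stricter setup the imputation values would be computed on the training split only.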
Now that the dataset is ready, you have to format it for machine learning:
- Divide your dataset for training and testing. (`X_train`, `y_train`, `X_test`, `y_test`)
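One common way to produce these four objects is `train_test_split` from `sklearn`. The sketch below uses synthetic arrays; the 80/20 ratio and the `random_state` value are assumptions for illustration, not requirements of the challenge.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the price column.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; fix the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Keep the test set untouched until the evaluation step so that the reported error reflects data the model has never seen.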
The dataset is ready. Now let's select a model.
Look at which models make the most sense according to your data.
Apply your model on your data:
- Train your model (on the train dataset)
- Check for predictions (on single lines or the test dataset)
- Once this works, look into `sklearn`'s `Pipeline` object to make things clean and reusable.
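As a rough sketch of what such a pipeline could look like, the example below chains imputation, one-hot encoding and a linear regression. The toy data and column names are hypothetical, and `LinearRegression` is only a placeholder for whichever model suits your data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; replace with your own X_train / y_train.
X_train = pd.DataFrame({
    "living_area": [90.0, np.nan, 65.0, 150.0, 120.0, 80.0],
    "property_type": ["HOUSE", "APARTMENT", "APARTMENT",
                      "HOUSE", "HOUSE", "APARTMENT"],
})
y_train = pd.Series([250000, 200000, 180000, 410000, 330000, 210000])

# Preprocessing is fitted on the training data only, then reused at predict time.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["living_area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["property_type"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(X_train, y_train)

# Predict on a single line.
new_listing = pd.DataFrame({"living_area": [100.0], "property_type": ["HOUSE"]})
prediction = model.predict(new_listing)
print(prediction)
```

Because preprocessing lives inside the pipeline, the same object can be passed to cross-validation or saved to disk without re-implementing the cleaning steps.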
Let's evaluate your model. The metric we are interested in is the MAE (Mean Absolute Error). Make sure you understand it well. Try to answer these questions:
- How could you improve this result?
- Which part of the process has the most impact on the results?
- Are there other metrics which would make more sense to evaluate your model?
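To make the MAE concrete: it is the average of the absolute differences between the true and predicted prices. The sketch below computes it by hand with `numpy` on made-up numbers and compares it with the RMSE, one possible alternative metric that penalizes large errors more heavily.

```python
import numpy as np

# Made-up prices, purely for illustration.
y_true = np.array([200000.0, 300000.0, 400000.0])
y_pred = np.array([210000.0, 280000.0, 460000.0])

errors = y_true - y_pred

# MAE: mean of |error|, here (10000 + 20000 + 60000) / 3 = 30000.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; the 60000 miss dominates it.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
```

The MAE is expressed in the same unit as the target (euros here), which makes it easy to explain; `sklearn.metrics.mean_absolute_error` returns the same value.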
You may go back a couple of steps if you want to try other types of approaches.
I know some of you will get to a viable model really quickly and will get bored of going back and forth between filtering out outliers and selecting features. The truth is that, when playing with ML, you only truly understand it when you do it yourself. Here is what you can do:
- Watch what most ML models do to make a prediction
- Select one which you find elegant
- Implement it from scratch using, at most, `numpy`.
Note that some are easier to implement than others.
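As one example of a model that is relatively easy to implement, here is a minimal sketch of ordinary least squares linear regression using only `numpy`. The class name `NumpyLinearRegression` is hypothetical, and solving with `np.linalg.lstsq` is just one option; gradient descent would be another.

```python
from typing import Optional

import numpy as np


class NumpyLinearRegression:
    """Ordinary least squares regression implemented with numpy only."""

    def __init__(self) -> None:
        # Intercept followed by one weight per feature, set by fit().
        self.weights: Optional[np.ndarray] = None

    @staticmethod
    def _add_intercept(X: np.ndarray) -> np.ndarray:
        """Prepend a column of ones so the first weight acts as the intercept."""
        return np.column_stack([np.ones(len(X)), X])

    def fit(self, X: np.ndarray, y: np.ndarray) -> "NumpyLinearRegression":
        """Find the weights that minimize the squared error on (X, y)."""
        self.weights, *_ = np.linalg.lstsq(self._add_intercept(X), y, rcond=None)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Return predictions for each row of X."""
        if self.weights is None:
            raise RuntimeError("Call fit() before predict().")
        return self._add_intercept(X) @ self.weights


# Toy data following y = 3 + 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])

model = NumpyLinearRegression().fit(X, y)
print(model.weights)                      # intercept and slope
print(model.predict(np.array([[5.0]])))   # prediction for x = 5
```

Comparing the output of such a class with `sklearn`'s `LinearRegression` on the same data is a simple way to check the implementation.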
Present your results in front of the group.
- You have to make a nice presentation with a professional design.
- You have 5 minutes to present (without Q&A). You can't use more time, you can't use less time.
- You CAN'T show code or a Jupyter notebook during the presentation.
- Each function or class has to be typed
- Each function or class has to contain a docstring
- Your code should be commented when necessary.
- Your code should be cleaned of any unused code.
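As an illustration of the code style rules above, here is a small typed function with a docstring. The function itself (`price_per_square_meter`) is a hypothetical helper, not something the challenge requires.

```python
def price_per_square_meter(price: float, living_area: float) -> float:
    """Return the price per square meter of a property.

    Args:
        price: Sale price of the property, in euros.
        living_area: Living area of the property, in square meters.

    Returns:
        The price divided by the living area.

    Raises:
        ValueError: If living_area is not strictly positive.
    """
    if living_area <= 0:
        raise ValueError("living_area must be strictly positive")
    return price / living_area


print(price_per_square_meter(300000, 100))
```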
- Pimp up the README file:
- Description
- Installation
- Usage
- (Visuals)
- (Contributors)
- (Timeline)
- (Personal situation)
- Present your results in front of the group in 5 minutes max.
- Create the repository
- Study the request (What & Why?)
- Identify technical challenges (How?)
| Criteria | Indicator | Yes/No |
|---|---|---|
| 1. Is complete | Know how to answer all the above questions. | [ ] |
| | `pandas` and `matplotlib`/`seaborn` are used. | [ ] |
| | All the above steps were followed. | [ ] |
| | A nice README is available. | [ ] |
| | Your model is able to predict something. | [ ] |
| 2. Is good | You used typing and docstring. | [ ] |
| | Your code is formatted (PEP8 compliant). | [ ] |
| | No unused file/code is present. | [ ] |
“The lottery is a tax on people who don't understand the statistics.” - Anonymous
