Prediction of March Madness 2018 Men's Tournament
The detailed project report can be viewed here.
- Install Apache Spark and Apache Hadoop
- Download and install R v3.4.3
- Download and install RStudio v1.1.383
- In case you have never used the libraries I used in my project, open the R console in RStudio and type the following lines:
  install.packages("caret")
  install.packages("SparkR")
  install.packages("dplyr")
  install.packages("magrittr")
  install.packages("tidyr")
  install.packages("ggplot2")
- Open FullTeamData.rmd in RStudio
- Go to line 11 (it will contain: currentYear <- 2016)
- Set currentYear to the value of 2010 (it should look like: currentYear <- 2010)
- Knit the file; this will create a FullTeamData.html file in the current directory that shows the results and a FullTeamData2010.csv file in the Data folder
- Repeat the previous two steps (set currentYear, then knit) for the values 2011, 2012, 2013, 2014, 2015, 2016, 2017, and 2018
- Open Testing.rmd in RStudio
- Knit the file (this will take about 20 minutes to complete); this will create a Testing.html file that shows the results
- Open Submission.rmd in RStudio
- Knit the file; this will create a Submission.html file that shows the results, along with submission_v1.csv, submission_v1_forBracket.csv, submission_v2.csv, and submission_v2_forBracket.csv in the current directory
  - submission_v1.csv and submission_v2.csv were the files submitted to the Kaggle competition; submission_v1_forBracket.csv and submission_v2_forBracket.csv were used to create brackets for the NCAA March Madness Bracket Challenge
- To view the results of the three R scripts (FullTeamData.rmd, Testing.rmd, and Submission.rmd), open the respective HTML files that were created in any browser
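
Editing currentYear and knitting FullTeamData.rmd nine times by hand is repetitive, so the loop could also be scripted. The following is only a sketch under some assumptions: the helper names set_current_year and render_year are hypothetical (they are not part of the project), the rmarkdown package is assumed to be installed, and the script is assumed to run from the directory that contains FullTeamData.rmd. Like the manual steps, it overwrites the currentYear line in the file itself.

```r
# Hypothetical helper: replace the hard-coded `currentYear <- <number>`
# assignment in the lines of an R Markdown file with the given year.
set_current_year <- function(lines, year) {
  pattern <- "currentYear\\s*<-\\s*[0-9]+"
  hits <- grepl(pattern, lines)
  if (!any(hits)) {
    stop("no `currentYear <- <number>` line found")
  }
  lines[hits] <- sub(pattern, paste("currentYear <-", year), lines[hits])
  lines
}

# Hypothetical helper: patch the file in place, then knit it.
# rmarkdown::render() is the function RStudio's Knit button calls.
render_year <- function(year, rmd = "FullTeamData.rmd") {
  writeLines(set_current_year(readLines(rmd), year), rmd)
  rmarkdown::render(rmd)
}

# Only run the loop when the file is actually present.
if (file.exists("FullTeamData.rmd")) {
  for (year in 2010:2018) {
    render_year(year)
  }
}
```

FullTeamData.html is overwritten on every pass, just as it is when knitting manually; the per-year CSV files are the outputs that accumulate.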
FullTeamData.rmd creates 9 datasets (FullTeamData2010.csv, ..., FullTeamData2018.csv), one per year, each combining the data from the 52 datasets given by Kaggle. Testing.rmd tests six different logistic regression models on every year's data, using the datasets created by FullTeamData.rmd. Submission.rmd applies the two best logistic regression models to the 2018 data to create the output files used in the two competitions (Kaggle and the NCAA Bracket Challenge).
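
To give a rough idea of what comparing several logistic regression models looks like, here is a minimal sketch in base R. It is not the code from Testing.rmd: the data is synthetic, the column names (seed_diff, win_pct_diff, team1_won) are hypothetical, only two candidate models are shown instead of six, and log loss is used as one possible comparison metric, which may differ from what the project actually uses.

```r
set.seed(42)
n <- 400

# Synthetic stand-in for one season of matchups; all columns are hypothetical.
games <- data.frame(
  seed_diff = rnorm(n),
  win_pct_diff = rnorm(n)
)
games$team1_won <- rbinom(
  n, 1, plogis(0.8 * games$seed_diff + 0.5 * games$win_pct_diff)
)

train <- games[1:300, ]
test <- games[301:400, ]

# Mean negative log-likelihood of the predicted probabilities (lower is better).
log_loss <- function(actual, predicted) {
  eps <- 1e-15
  p <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

# Candidate models, each defined by a different feature set.
formulas <- list(
  seed_only = team1_won ~ seed_diff,
  seed_and_win_pct = team1_won ~ seed_diff + win_pct_diff
)

# Fit each model on the training rows and score it on the held-out rows.
scores <- sapply(formulas, function(f) {
  fit <- glm(f, data = train, family = binomial())
  log_loss(test$team1_won, predict(fit, newdata = test, type = "response"))
})
print(scores)
```

The model with the lowest held-out score would be the one carried forward, which mirrors the idea of picking the two best models for the submission step.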