I've previously prompt-engineered a Mistral Large 7B model to answer mathematical questions. Now I want to evaluate that model's mathematical capabilities.
I've used mathematics_dataset as the test dataset. It is not the best choice because it is open source, and there is a good chance that Mistral used it when training their models. But it will do for now, and it works to show my chain of thought. If I actually want to test my Math LLM properly, one option would be to create my own dataset.
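For reference, here is a minimal sketch of how the question/answer pairs can be read from the pre-generated mathematics_dataset release, assuming the v1.0 text files where each file alternates a question line with its answer line. The file path and module name in the example are placeholders, not the exact split I used.

```python
from pathlib import Path

def load_qa_pairs(txt_path: Path) -> list[tuple[str, str]]:
    """Read one pre-generated mathematics_dataset file.

    The v1.0 text files alternate lines: question, then answer.
    """
    lines = txt_path.read_text(encoding="utf-8").splitlines()
    # Pair every question line with the answer line that follows it.
    return [(lines[i].strip(), lines[i + 1].strip()) for i in range(0, len(lines) - 1, 2)]

# Example (placeholder path and module):
# pairs = load_qa_pairs(Path("mathematics_dataset-v1.0/train-easy/arithmetic__add_or_sub.txt"))
```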
So far, I've tried:
- Using different smaller open-source LLMs as judges to compare the answer the Math LLM gives against the correct answer in the dataset.
- These judge models were too unreliable: there was roughly a 50/50 chance that their verdict was correct.
- I prompt-engineered these models but did not fine-tune them.
- I concluded that it is better to do an exact match on the final answer in the Math LLM's output (see the sketch after this list).
- Formatting is the difficult part, because the extracted answer needs to match the reference answer in mathematics_dataset exactly.
- Exact match only compares the final answer, so the Math LLM's reasoning can be wrong even when the final answer is correct.
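Below is a minimal sketch of the exact-match check. The extraction and normalization steps here are my own assumptions about what the formatting mismatches usually look like (a "Final answer:" line in the prompt template, stray whitespace, trailing periods, case), not a canonical recipe.

```python
import re

def extract_final_answer(model_output: str) -> str:
    """Pull the final answer out of the Math LLM's response.

    Assumes the prompt asks the model to end with a line like
    'Final answer: <value>'; falls back to the last non-empty line.
    """
    match = re.search(r"final answer\s*[:=]\s*(.+)", model_output, re.IGNORECASE)
    if match:
        return match.group(1)
    non_empty = [line for line in model_output.splitlines() if line.strip()]
    return non_empty[-1] if non_empty else ""

def normalize(answer: str) -> str:
    """Light normalization so trivial formatting differences don't count as misses."""
    answer = answer.strip().rstrip(".")
    answer = re.sub(r"\s+", " ", answer)
    return answer.lower()

def exact_match(model_output: str, reference: str) -> bool:
    return normalize(extract_final_answer(model_output)) == normalize(reference)

# Example:
# exact_match("...reasoning...\nFinal answer: -3/4", "-3/4")  # -> True
```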
Next steps:
- Continue with the exact-match evaluation.
- Try Math Verify (a sketch of what that might look like is below).
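Math Verify compares answers by mathematical equivalence rather than string equality, which should help with the formatting problem above. The sketch below is based on my reading of the math-verify package's documented parse/verify API; treat the exact calls and behavior as something to confirm before relying on it.

```python
# pip install math-verify
from math_verify import parse, verify

# Parse both the reference answer and the model's final answer into
# symbolic form, then check mathematical equivalence instead of
# comparing raw strings.
gold = parse("1/2")
prediction = parse("0.5")

# Expected to be True if the library recognizes the two as equivalent.
print(verify(gold, prediction))
```

If this works as documented, it would replace the normalize/exact_match comparison above while keeping the same final-answer extraction step.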