Identification/disambiguation of mathematical definitions by semantic similarity

This repository contains the data, source code, and experimental results of our two publications (see the Citation section below).

Problem formulation

Given two definitions, the system must decide whether they are equivalent or different.

Research questions

  • How well can contextualized word embeddings help disambiguate mathematical terms? (V1 & V2)
  • Which architecture/pretraining strategy best captures the similarity of mathematical statements? (V1 & V2)
  • How do models trained in the classical pre-train + fine-tune paradigm compare with state-of-the-art instruction-tuned large language models? (V2)

Step 0. Building a ground truth from ProofWiki disambiguation pages

00.ExtractProofWiki.ipynb collects and parses disambiguation pages in ProofWiki.
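
For illustration, a minimal sketch of the collection step (the API endpoint path and category name are assumptions; the actual scraping lives in 00.ExtractProofWiki.ipynb):

```python
# Pull disambiguation page titles via the MediaWiki API.
# NOTE: the endpoint path and category name below are assumptions.
import requests

API = "https://proofwiki.org/w/api.php"  # assumed MediaWiki endpoint

def list_disambiguation_pages(limit: int = 50) -> list[str]:
    """Return up to `limit` titles from the (assumed) disambiguation category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Disambiguation",  # assumed category name
        "cmlimit": limit,
        "format": "json",
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return [m["title"] for m in resp.json()["query"]["categorymembers"]]
```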

We store disambiguation page titles, ambiguous terms, definitions in the LaTeX source, definition page titles, the categories of each definition, and definitions in plain text in parsed_disambiguation_list_without===.csv.

01.Proofwiki_vs_ArXiv_Def.ipynb shows the overlap of ambiguous terms between ProofWiki and arXiv papers. TODO: extract more definitions from arXiv papers.

Step 1. Syntactic analysis

Unsupervised: how different are these ProofWiki definitions?

10.ExtractHypernyms.ipynb shows that the Word Class Lattice (WCL) classifier can extract only very few hypernyms of mathematical definienda. Most WCL-identifiable definitions match the "is a" pattern.
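
As a toy illustration of that dominant pattern (this is not the WCL classifier, only the surface pattern it mostly recovers):

```python
# Naive "is a" hypernym extraction: take the first token after
# "is a"/"is an". Modifiers are ignored ("is a finite set" yields
# "finite"); the WCL classifier is far more structured than this.
import re

IS_A = re.compile(r"\bis an?\s+(\w+)", re.IGNORECASE)

def naive_hypernym(definition: str) -> str | None:
    match = IS_A.search(definition)
    return match.group(1) if match else None

print(naive_hypernym("A group is a set equipped with a binary operation."))  # set
```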

Step 2. Linking different definitions to different entities

20.SentencePairClassifier.ipynb shows our study of how pre-trained language models can help differentiate mathematical definitions (Approaches 1 and 2, plus evaluation).

Approach 1. Supervised NSP-like classifier

Inspired by GLADIS, we build a supervised NSP-like sentence-pair classifier to link definitions to their page titles in ProofWiki. Every pair of a definition and a title (term, domain) with the matching ambiguous term in ProofWiki constitutes an input to the Next Sentence Prediction (NSP) task. The language model produces a score for each candidate, and we select the one with the highest score as the final prediction.
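
A minimal sketch of this candidate-ranking step, assuming a BERT checkpoint with an NSP head (the fine-tuning itself is in the notebook):

```python
# Rank candidate titles for one definition with BERT's NSP head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def rank_titles(definition: str, candidate_titles: list[str]) -> str:
    """Score each (definition, title) pair with the NSP head and
    return the title with the highest 'is-next' probability."""
    scores = []
    for title in candidate_titles:
        inputs = tokenizer(definition, title, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Class 0 = "sentence B follows sentence A" in BERT's NSP head.
        scores.append(torch.softmax(logits, dim=-1)[0, 0].item())
    return candidate_titles[max(range(len(scores)), key=scores.__getitem__)]
```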

Approach 2. Prediction based on cosine similarity

To obtain a faster solution, we also explore sentence embeddings of the definitions and titles. We compute an embedding for the definition and for each candidate title with the matching ambiguous term in ProofWiki, and select the title whose embedding has the highest cosine similarity to the definition's embedding as the final prediction.
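
A minimal sketch of this ranking, assuming a sentence-transformers checkpoint (the model name is a placeholder; the notebook explores several encoders):

```python
# Rank candidate titles by cosine similarity of sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

def rank_by_cosine(definition: str, candidate_titles: list[str]) -> str:
    """Embed the definition and each candidate title, then return the
    title whose embedding has the highest cosine similarity."""
    def_emb = model.encode(definition, convert_to_tensor=True)
    title_embs = model.encode(candidate_titles, convert_to_tensor=True)
    sims = util.cos_sim(def_emb, title_embs)[0]  # shape: (num_candidates,)
    return candidate_titles[int(sims.argmax())]
```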

Approach 3. Zero-shot LLM (7B models)

23. LLM test.ipynb

24. LLM train.ipynb

25. LLM tests Llama.ipynb
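
A hypothetical zero-shot prompt for this setup (the model id, prompt wording, and answer parsing below are all assumptions; the actual experiments are in the notebooks above):

```python
# Zero-shot candidate selection with an instruction-tuned 7B LLM.
# NOTE: model id and prompt are illustrative, not the paper's exact setup.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def zero_shot_pick(definition: str, candidate_titles: list[str]) -> str:
    """Ask the model which candidate title matches the definition."""
    options = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(candidate_titles))
    prompt = (
        "Which candidate title best matches this mathematical definition?\n"
        f"Definition: {definition}\n"
        f"Candidates:\n{options}\n"
        "Answer with the number of the best candidate only."
    )
    out = generator(prompt, max_new_tokens=8, do_sample=False)
    answer = out[0]["generated_text"][len(prompt):].strip()
    idx = int(answer.split()[0].rstrip(".")) - 1  # naive answer parsing
    return candidate_titles[idx]
```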

Evaluation:

V1:

Train set: df_flattened_train_disam_list.csv with 275 ambiguous terms and 1436 (definition, title) pairs

Test set: df_flattened_test_disam_list.csv with 68 ambiguous terms and 433 (definition, title) pairs

To make the task harder, we split the train and test sets by ambiguous term, so no term appears in both sets.
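
A sketch of such a term-disjoint split (the column name is an assumption based on the CSV description above):

```python
# Split so that every ambiguous term lands entirely in train or test.
# NOTE: "ambiguous_term" is an assumed column name.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("parsed_disambiguation_list_without===.csv")
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["ambiguous_term"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
```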

We evaluate both approaches on the training set and the test set.

Results are in data/res-V1/, the notebook, and NSP_logs.txt.

Further question: does a better NSP model produce more similar embeddings for matching (definition, title) pairs? No.

V2:

Tested on (1) unseen ambiguous terms and (2) new candidate titles for seen ambiguous terms.

Added LLM-based approaches.

Pulled newly added items from ProofWiki.

Train set: df_flattened_train_disam_list.csv with 297 ambiguous terms and 1158 (definition, title) pairs

Test set (1): df_flattened_test_new_term_disam_list.csv with 68 ambiguous terms and 364 (definition, title) pairs

Test set (2): df_flattened_test_new_candi_disam_list.csv with 68 ambiguous terms and 462 (definition, title) pairs

Preliminary results: data/res

V2.1:

Added 5-fold cross-validation.

Data: data/SP_CLS-5fold

Results (presented in the full paper): data/res-5fold
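
A sketch of term-grouped 5-fold cross-validation in the same spirit as the split above (column name again assumed):

```python
# 5-fold CV where each ambiguous term appears in exactly one test fold.
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.read_csv("parsed_disambiguation_list_without===.csv")
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(df, groups=df["ambiguous_term"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test pairs")
```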

V3: Future work

TODO: evaluate the generalizability of the different approaches on definitions extracted from papers. If the results hold up, build a demo.

Citation

If you find our work useful and would like to cite it, please use the following BibTeX entry:

V1:

@inproceedings{mathd2v1,
  title = {{Towards Disambiguation of Mathematical Terms based on Semantic Representations}},
  author = {Jiang, Shufan and Tan, Mary Ann and Sack, Harald},
  url = {https://hal.science/hal-05028993},
  booktitle = {{SCOLIA 2025 - First International Workshop on Scholarly Information Access}},
  address = {Lucca, Italy},
  year = {2025},
  month = apr,
}

V2:

@inproceedings{mathd2v2,
    title = "{M}ath{D}2: Towards Disambiguation of Mathematical Terms",
    author = "Jiang, Shufan and Tan, Mary Ann  and Sack, Harald",
    booktitle = "Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.sdp-1.3/",
    doi = "10.18653/v1/2025.sdp-1.3",
    pages = "17--30",
    ISBN = "979-8-89176-265-7",
}
