Description
Hey guys,
There was one issue I ran into when using your tool for Spotlight: in Wikipedia, only the first occurrence of a surface form (SF) within an article is linked. For training, however, you want all occurrences within each article (subsequent occurrences of an SF are assumed to link to the same page as the first occurrence). pignlproc introduces these extra occurrences artificially into the training data. Your tool does not do this yet, so you're missing a lot of tokenCounts and the sfCounts are incorrect.
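To make the idea concrete, here is a minimal sketch in Python of propagating the first link of each surface form to its later, unlinked occurrences in the same article. It is not the code from pignlproc or from our fork; the function name `expand_links` and the `(start, surface_form, target)` tuple layout are made up for illustration, and word-boundary matching via regex is just one simple way to find the occurrences.

```python
import re
from collections import Counter


def expand_links(text, links):
    """Add annotations for unlinked repeats of already-linked surface forms.

    `links` is a list of (start_offset, surface_form, target_page) tuples
    for the explicit wiki links of one article (hypothetical layout).
    """
    first_target = {}  # surface form -> target of its first link
    first_pos = {}     # surface form -> offset of its first link
    for start, sf, target in sorted(links):
        if sf not in first_target:
            first_target[sf] = target
            first_pos[sf] = start

    # Character spans that already carry a link; never annotate over them.
    occupied = [(start, start + len(sf)) for start, sf, _ in links]
    expanded = list(links)

    # Longest surface forms first, so "New York City" wins over "New York".
    for sf in sorted(first_target, key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(sf) + r"(?!\w)")
        for match in pattern.finditer(text):
            begin, end = match.start(), match.end()
            if begin <= first_pos[sf]:
                continue  # only occurrences after the first link
            if any(begin < e and s < end for s, e in occupied):
                continue  # overlaps an existing annotation
            occupied.append((begin, end))
            expanded.append((begin, sf, first_target[sf]))

    return sorted(expanded)


text = ("Berlin is the capital of Germany. "
        "Berlin has 3.7 million people. Germany borders Poland.")
links = [(text.find("Berlin"), "Berlin", "Berlin"),
         (text.find("Germany"), "Germany", "Germany")]
sf_counts = Counter(sf for _, sf, _ in expand_links(text, links))
print(sf_counts)
```

With only the explicit links, each surface form in this toy article would be counted once; after expansion both "Berlin" and "Germany" are counted twice, which is the kind of difference that shows up in the sfCounts.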
I fixed this in our fork, but I also introduced quite a few Spotlight-related changes. I can send a separate pull request if you want. It makes the extraction a bit slower, but there is plenty of room for improvement if you want to make it faster/more memory-friendly.
Here's the relevant commit.
Jo