Missing repeating article links #2

@jodaiber

Hey guys,

I ran into one issue when using your tool for Spotlight: in Wikipedia, only the first occurrence of a surface form (SF) within an article is linked. For training, however, you want all occurrences within each article to be annotated (subsequent occurrences of an SF are assumed to link to the same page as the first occurrence). In pignlproc, these links are artificially added to the training data. Your tool does not add them yet, so it misses a lot of tokenCounts and the sfCounts are incorrect.
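To illustrate the idea, here is a minimal Python sketch of how repeated occurrences could be annotated. It is not the code from the fork or from pignlproc: the function name `propagate_links`, the `(start, end, target)` tuple layout, and the word-boundary matching are all assumptions made for this example.

```python
import re


def propagate_links(text, links):
    """Annotate later occurrences of already-linked surface forms.

    `links` is a list of (start, end, target) character spans for the
    explicit wiki links in `text` (a hypothetical input format).
    Returns all explicit plus inferred spans, sorted by position.
    """
    # Remember, per surface form, where it was first linked and to what.
    first_seen = {}
    for start, end, target in sorted(links):
        surface_form = text[start:end]
        if surface_form:
            first_seen.setdefault(surface_form, (end, target))

    annotated = sorted(links)
    occupied = [(s, e) for s, e, _ in annotated]

    # Try longer surface forms first so they win over their substrings.
    for surface_form in sorted(first_seen, key=len, reverse=True):
        first_end, target = first_seen[surface_form]
        # Assumption: match whole words only, to avoid partial-word hits.
        pattern = re.compile(r"(?<!\w)" + re.escape(surface_form) + r"(?!\w)")
        for match in pattern.finditer(text):
            s, e = match.span()
            if s < first_end:
                continue  # only occurrences after the first explicit link
            if any(s < oe and os_ < e for os_, oe in occupied):
                continue  # already covered by another annotation
            annotated.append((s, e, target))
            occupied.append((s, e))

    return sorted(annotated)


if __name__ == "__main__":
    article = "Berlin is the capital of Germany. Berlin is large; Berlin is old."
    explicit = [(0, 6, "Berlin")]
    for span in propagate_links(article, explicit):
        print(span)
```

With only the explicit link, the surface form "Berlin" would be counted as annotated once in this article; after propagation it is counted three times, which is the kind of difference that affects the sfCounts.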

I fixed this in our fork, but I also introduced quite a few Spotlight-related changes there. I can send a separate pull request if you want. The fix makes the extraction a bit slower, but there is plenty of room for improvement if you want to make it faster or more memory-friendly.

Here's the relevant commit.

Jo
