Hey guys,
there was one issue I ran into when using your tool for Spotlight: In Wikipedia, only the first occurrence of a surface form (SF) within an article is linked. However, for training, you want all occurrences within each article (subsequent occurrences of an SF are assumed to link to the same page as the first occurrence). In pignlproc, these are artificially introduced into the training data. Your tool is missing these so far, so a lot of tokenCounts are missing and the sfCounts are incorrect.
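To make the idea concrete, here is a minimal sketch of propagating the first link of each surface form to its later, unlinked occurrences in the same article. It is not the code from the fork or from pignlproc; the `Link` type, the `propagate_links` function, and the word-boundary matching are all assumptions made for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class Link(NamedTuple):
    """Hypothetical annotation: character span plus the linked page title."""
    start: int
    end: int
    target: str


def propagate_links(text: str, links: List[Link]) -> List[Link]:
    """Add links for later occurrences of already-linked surface forms."""
    # Remember the first linked occurrence of every surface form.
    first: Dict[str, Link] = {}
    for link in sorted(links):
        first.setdefault(text[link.start:link.end], link)

    result = list(links)
    # Longer surface forms first, so "New York City" wins over "New York".
    for sf in sorted(first, key=len, reverse=True):
        origin = first[sf]
        # Assumption: match whole words only, case-sensitively.
        pattern = r"(?<!\w)" + re.escape(sf) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            if m.start() < origin.end:
                continue  # only occurrences after the first link
            overlaps = any(m.start() < l.end and l.start < m.end() for l in result)
            if not overlaps:
                result.append(Link(m.start(), m.end(), origin.target))
    return sorted(result)
```

Counting surface forms over the returned list instead of the original links is what would bring the sfCounts closer to what the training expects. The overlap check scans all links for every match, which is simple but slow on long articles.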
I fixed this in our fork, but I also introduced quite a few Spotlight-related changes. I can send a separate pull request if you want. It makes the extraction a bit slower, but there is plenty of room for improvement if you want to make it faster/more memory-friendly.
Here's the relevant commit.
Jo