Missing repeating article links #2

Open
jodaiber opened this issue Jan 20, 2016 · 2 comments

Comments

@jodaiber

Hey guys,

there was one issue I ran into when using your tool for Spotlight: in Wikipedia, only the first occurrence of a surface form (SF) within an article is linked. For training, however, you want all occurrences within each article (subsequent occurrences of an SF are assumed to link to the same page as the first occurrence). In pignlproc, these are artificially added to the training data. Your tool doesn't do this yet, so a lot of tokenCounts are missing and the sfCounts are incorrect.
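
To make the idea concrete, here is a rough sketch of the propagation step. It is not the code from our fork or from pignlproc: the class `RepeatedLinkSketch`, the `Occurrence` type and the `propagateLinks` method are made-up names, and it assumes the article is already plain text with a map from each linked surface form to the page its first link points to. Matching is a naive string search with a simple word-boundary check, so a real extractor would probably want proper tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: annotate every occurrence of an already-linked surface form. */
final class RepeatedLinkSketch {

    /** One annotated occurrence of a surface form (illustrative type, not from the tool). */
    static final class Occurrence {
        final String surfaceForm;
        final String target;
        final int offset;

        Occurrence(String surfaceForm, String target, int offset) {
            this.surfaceForm = surfaceForm;
            this.target = target;
            this.offset = offset;
        }
    }

    /**
     * For each surface form that is linked somewhere in the article, emit an
     * occurrence for every whole-word match in the text, all pointing to the
     * target of the original link.
     */
    static List<Occurrence> propagateLinks(String text, Map<String, String> firstLinks) {
        List<Occurrence> result = new ArrayList<>();
        for (Map.Entry<String, String> link : firstLinks.entrySet()) {
            String sf = link.getKey();
            if (sf.isEmpty()) {
                continue; // an empty surface form would match everywhere
            }
            int from = 0;
            while (true) {
                int start = text.indexOf(sf, from);
                if (start < 0) {
                    break;
                }
                int end = start + sf.length();
                if (isBoundary(text, start - 1) && isBoundary(text, end)) {
                    result.add(new Occurrence(sf, link.getValue(), start));
                    from = end;
                } else {
                    from = start + 1; // partial-word hit, e.g. "Berlin" inside "Berliners"
                }
            }
        }
        return result;
    }

    /** True if index i is outside the text or not a letter/digit. */
    private static boolean isBoundary(String text, int i) {
        return i < 0 || i >= text.length() || !Character.isLetterOrDigit(text.charAt(i));
    }
}
```

Counting surface forms and tokens over the output of something like `propagateLinks`, instead of over the explicit wiki links only, is what brings the counts closer to what pignlproc produces.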

I fixed this in our fork, but I also introduced quite a few Spotlight-related changes. I can send a separate pull request if you want. It makes the extraction a bit slower, but there is plenty of room for improvement if you want to make it faster/more memory-friendly.

Here's the relevant commit.

Jo

@samhumeau
Contributor

It looks good.

I will try it later today to see if it is fast enough, and then merge if you want to.

@jodaiber
Author

Cool! Keep in mind, though, that I also adapted/butchered Launcher.java to make it easier to use with our index_db.sh.
