Issues on English Wikipedia #1
Hi tgalery, sorry to answer only now. I notice that the first and second phases (out of 3) have taken quite a bit of time (1600s + 2852s). Out of curiosity (I don't think it matters), what kind of machine is it? How many cores? Is it an AWS/GCE/Azure instance (those are usually weaker than real cores)? Are you working on an SSD? I tested it today with the latest Wikipedia dump currently available on https://dumps.wikimedia.org/enwiki/latest/ (dating from the 3rd of December), and it worked fine with 14GB.
Here is my output for the latest Wikipedia dump, obtained by following exactly the instructions of https://github.com/diffbot/wikistatsextractor. Have you done anything differently?

```
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 621446ms
3685939 elements in the interesting sf
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 801467ms
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 478843ms
storage took 16080
start parsing the dump: data/tmp/tmp_paragraphes
data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
Parsed 20100000 pagesdone in 61597ms
building voc 61884
start parsing the dump: data/tmp/tmp_referencess
data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
Parsed 1600000 pagesdone in 63598ms
last step 125483
all done in 2128 seconds
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 35:29 min
[INFO] Finished at: 2016-01-13T06:29:49-08:00
[INFO] Final Memory: 61M/10547M
[INFO] ------------------------------------------------------------------------
```
Hi @Cerbal, I ran it on an i3 laptop without an SSD (though it was using a hybrid HD with 8 gigs of SSD cache). Is an SSD a must?
Truth be told, I have never tested it on a non-SSD device. I made this software to exploit the speed of SSDs and thereby avoid having to deploy a Hadoop cluster to extract info from something as small as Wikipedia (60GB isn't that much). It shouldn't matter, though; it should just be slower. That definitely explains why the first 2 phases were slow. I still don't get why it gets stuck, but the 3rd pass is the most intense in terms of RAM. How much RAM do you have on that machine?
I have 16 gigs and ran it with 14 gigs reserved via Maven, but I might run this on a proper cloud provider with an SSD and report back.
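As a side note, since the heap size is handed down through Maven, it is easy for the intended -Xmx value not to reach the forked JVM. A quick generic sanity check (plain Java, not part of this project) is to print the maximum heap the running JVM actually sees:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM will try to use (i.e. the effective -Xmx).
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f GB%n",
                maxBytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```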
Ok, so I tried again using a 6-core server with 32 GB of RAM. I'm running it under screen, and I made sure to allocate enough RAM to the process. I still get stuck.
To add to what @tgalery said, I ran it with -Xmx28G and an SSD, and the same thing happens. It produced pairCounts, sfAndTotalCounts and uriCounts but gets stuck on the token counts, I presume?
@tgalery Building the voc is a step of the third pass (the third pass loads 3 files). I used the same version you did during my test. That is very strange; I still need to be able to reproduce your problem, since the execution works fine for me both on my laptop and on a server. Has either of you used an AWS/GCE/Azure instance, so that I can replicate the exact configuration? Also, can you tell me whether the problem appears on a smaller dump (say the French one, which is about 5 times smaller, or the first 10M lines of the English Wikipedia dump)? That would make it easier to debug.
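For anyone who wants to try the "first 10M lines" route suggested above, a throwaway snippet along these lines (plain Java; the input and output paths are placeholders passed as arguments) can produce a truncated copy of an uncompressed dump:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TruncateDump {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the uncompressed dump, args[1]: truncated output file.
        long maxLines = 10_000_000L;
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            String line;
            long written = 0;
            while (written < maxLines && (line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
                written++;
            }
        }
    }
}
```

The result will end mid-page, so it is not well-formed XML; whether the extractor tolerates that is worth keeping in mind when interpreting failures on the truncated file.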
Hey @Cerbal, I just tried with French. It ran through but tokenCounts is empty. Here's the stderr:
Is there a way to increase the logging level?

Jo
Ok, in the last run it didn't find the French Lucene analyzer. I fixed that and re-ran everything, but it still does not produce any tokens.
and stderr:
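As an aside, one quick way to rule out the analyzer itself (a generic Lucene sketch, not the project's code; the exact constructor depends on the Lucene version in use) is to run the French analyzer on a sample sentence and check whether it emits tokens at all:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FrenchAnalyzerCheck {
    public static void main(String[] args) throws IOException {
        // Recent Lucene versions have a no-arg constructor; older 4.x releases
        // take a Version argument instead.
        Analyzer analyzer = new FrenchAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("text",
                new StringReader("Les chats mangent les souris."))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // lowercased, stemmed French tokens
            }
            ts.end();
        }
        analyzer.close();
    }
}
```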
What Java version are you using? It's odd; sampling shows it's spending most of the time in

There was another issue for French with this line:

This prefix only works for English. I fixed that in my branch, but the problem still remains.
@Cerbal do you know if there is any specific config for OS-level stuff, like the max number of open files in
So, here's the resolution to the problem. For some reason my Java decides not to write out the queue for a long time (pretty much until the end). I am guessing this is an optimisation, because I gave it plenty of Xmx. Whatever the reason it does that, it means that the workers get stuck here. I am not sure if the

After removing
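For illustration only (this is not the project's actual code), the situation described above is the classic producer/consumer shape where workers push results onto a bounded queue and block as soon as the consumer that should drain it never runs:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StuckWorkersSketch {
    public static void main(String[] args) {
        // Bounded hand-off between the parsing workers and the output writer.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Runnable worker = () -> {
            try {
                for (int i = 0; ; i++) {
                    // put() blocks once the queue is full; if nothing ever
                    // drains the queue, every worker ends up parked right here.
                    queue.put("extracted line " + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        for (int i = 0; i < 4; i++) {
            new Thread(worker, "worker-" + i).start();
        }
        // With no consumer running, the program fills the queue and then hangs,
        // which is exactly what a "stuck third pass" looks like from the outside.
    }
}
```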
Thanks Jo Daiber for your investigation, I think it led me to the solution. Bad code: I had used a ThreadPool whose number of threads was indexed on the number of available cores. On fewer than 8 cores, what was happening is that the thread responsible for writing the output was executed after the others (while of course it should run at the same time). It should work now; can one of you check on their machine so that I can close the issue?
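A minimal sketch of that scheduling bug and an obvious fix (again, illustrative code rather than the project's actual classes): if the writer task is submitted to the same fixed-size pool as the workers, and the pool is sized by the core count, then on a machine with few cores the writer only starts once a worker finishes, which never happens because the workers are blocked on the full queue. Giving the writer its own thread avoids that:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WriterSchedulingSketch {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        int cores = Runtime.getRuntime().availableProcessors();

        Runnable worker = () -> {
            try {
                for (int i = 0; i < 1_000; i++) {
                    queue.put("result " + i); // blocks whenever the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        Runnable writer = () -> {
            try {
                while (true) {
                    System.out.println(queue.take()); // stand-in for writing to disk
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        // Buggy shape: submit the writer to the SAME fixed pool as the workers.
        // With a pool of `cores` threads and `cores` workers submitted first,
        // the writer is queued behind them, the workers fill the queue, and
        // everything deadlocks on machines with few cores:
        //
        //   ExecutorService pool = Executors.newFixedThreadPool(cores);
        //   for (int i = 0; i < cores; i++) pool.submit(worker);
        //   pool.submit(writer);   // starved -> never drains the queue
        //
        // Safer shape: dedicate a thread to the writer so it always runs
        // concurrently with the workers.
        Thread writerThread = new Thread(writer, "output-writer");
        writerThread.setDaemon(true);
        writerThread.start();

        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores; i++) {
            pool.submit(worker);
        }
        pool.shutdown();
    }
}
```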
@jodaiber Your solution was putting the whole output in memory and writing it out afterward.
I'd be able to check it soon enough :-)
Tested this tool for the first time with a small fraction of a German Wikipedia dump:

```
reinhard@linux-rmha:~/entity_linking/dbpedia_spotlight/wikistatsextractor> ls -lR data
data:
insgesamt 24
drwxr-xr-x 2 reinhard users 4096  1. Mär 11:09 output
-rw-r--r-- 1 reinhard users  777  1. Mär 10:57 stopwords.de.list
-rw-r--r-- 1 reinhard users 6047  1. Mär 10:40 stopwords.en.list
-rw-r--r-- 1 reinhard users 2855  1. Mär 10:40 stopwords.fr.list
drwxr-xr-x 2 reinhard users 4096  1. Mär 11:09 tmp

data/output:
insgesamt 16
-rw-r--r-- 1 reinhard users 2818  1. Mär 11:09 pairCounts_de
-rw-r--r-- 1 reinhard users 1718  1. Mär 11:09 sfAndTotalCounts_de
-rw-r--r-- 1 reinhard users   18  1. Mär 10:40 timestamp
-rw-r--r-- 1 reinhard users    0  1. Mär 11:09 tokenCounts_de
-rw-r--r-- 1 reinhard users 1777  1. Mär 11:09 uriCounts_de

data/tmp:
insgesamt 12
-rw-r--r-- 1 reinhard users  18  1. Mär 10:40 timestamp
-rw-r--r-- 1 reinhard users   0  1. Mär 11:09 tmp_paragraphes
-rw-r--r-- 1 reinhard users 406  1. Mär 11:09 tmp_redirections_de
-rw-r--r-- 1 reinhard users   0  1. Mär 11:09 tmp_referencess
-rw-r--r-- 1 reinhard users 765  1. Mär 11:09 tmp_surface_form_counts_de
```
I may have fixed the issue with the empty tokenCounts. See #3
Hi @rschwab68, these issues are fixed in this fork: https://github.com/jodaiber/wikistatsextractor

Best,
Ah ok, thanks @jodaiber for your effort.

`String uri = split[0].split("/", 5)[4];`

Best regards,
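In case it helps anyone reading along, here is what that line does on a DBpedia-style resource URI (the example value below is made up) and where it can fail:

```java
public class SplitExample {
    public static void main(String[] args) {
        // Hypothetical input, just to show what split("/", 5)[4] extracts.
        String[] split = { "http://de.dbpedia.org/resource/Beispiel" };

        // "http://de.dbpedia.org/resource/Beispiel".split("/", 5) yields
        // ["http:", "", "de.dbpedia.org", "resource", "Beispiel"],
        // so index 4 is the resource name.
        String uri = split[0].split("/", 5)[4];
        System.out.println(uri); // prints: Beispiel

        // Any value with fewer than four '/' characters yields a shorter array,
        // and indexing [4] then throws ArrayIndexOutOfBoundsException.
    }
}
```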
Hi there, and thanks for coming up with this tool. I tried to run the extraction on a decompressed dump of the English Wikipedia, and the process got stuck on the third parse of the dump for a few hours.
Here is the output
I specified 14GB of RAM. Is there anything I might be doing wrong?