Issues on English Wikipedia #1

Open
tgalery opened this issue Jan 5, 2016 · 19 comments

@tgalery

tgalery commented Jan 5, 2016

Hi there, and thanks for coming up with the tool. I tried to run the extraction on a decompressed dump of the English Wikipedia, and the process got stuck on the third parse of the dump for a few hours.

Here is the output:

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pagesdone in 1601421ms
too many redirections: KWV,,KWV  Koöperatieve Wijnbouwers Vereniging van Zuid-Afrika Bpkt,,KWV,,13
3512031 elements in the interesting sf
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pagesdone in 2852276ms
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages

I specified 14 GB of RAM. Is there anything I might be doing wrong?

@samhumeau
Contributor

Hi tgalery, sorry for answering only now.

I notice that the first and second phases (out of 3) took quite a bit of time (1600 s + 2852 s). Out of curiosity (I don't think it matters), what kind of machine is it? How many cores? Is it an AWS/GCE/Azure instance (those are usually weaker than real cores)? Are you working on an SSD?

I tested it today with the latest Wikipedia dump currently available on https://dumps.wikimedia.org/enwiki/latest/ (dating from the 3rd of December), and it worked fine with 14 GB.

  • Since 14 GB should be at the edge of what is needed, have you tried with 15 GB or more of RAM?
  • Have you checked that your system has 14 GB of RAM actually available? (If you force the system to use swap, things can get stuck.)

Here is my output for the latest Wikipedia dump, obtained by following exactly the instructions at https://github.com/diffbot/wikistatsextractor. Have you done anything differently?

start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 621446ms
3685939 elements in the interesting sf
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 801467ms
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 478843ms
storage took 16080
start parsing the dump: data/tmp/tmp_paragraphes
data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
Parsed 20100000 pagesdone in 61597ms
building voc 61884
start parsing the dump: data/tmp/tmp_referencess
data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
Parsed 1600000 pagesdone in 63598ms
last step 125483
all done in 2128 seconds
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 35:29 min
[INFO] Finished at: 2016-01-13T06:29:49-08:00
[INFO] Final Memory: 61M/10547M
[INFO] ------------------------------------------------------------------------

@tgalery
Author

tgalery commented Jan 13, 2016

Hi @Cerbal, I ran it on an i3 laptop without an SSD (but I was using a hybrid HD with 8 GB of SSD cache). Is an SSD a must?

@samhumeau
Contributor

Truth be told, I have never tested it on a non-SSD device. I made this software to exploit the speed of SSDs and thereby avoid having to deploy a Hadoop cluster to extract info from something as small as Wikipedia (60 GB isn't that much). It shouldn't matter, though; it should just be slower.

But that definitely explains why the first 2 phases were slow. I still don't get why it gets stuck, though; the 3rd pass is the most intense in terms of RAM. How much RAM do you have on that machine?

@tgalery
Author

tgalery commented Jan 13, 2016

I have 16 GB and ran it with 14 GB reserved via Maven, but I might run this on a proper cloud provider with an SSD and report back.

@tgalery
Author

tgalery commented Jan 15, 2016

OK, so I tried again using a 6-core server with 32 GB of RAM. I'm running it under screen, but I made sure to allocate enough RAM to the process. It still gets stuck:

start parsing the dump: /data/ssd/wikipedia/enwiki-20151201-pages-articles.xml

/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 711739ms
3685987 elements in the interesting sf
start parsing the dump: /data/ssd/wikipedia/enwiki-20151201-pages-articles.xml

/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 988640ms
start parsing the dump: /data/ssd/wikipedia/enwiki-20151201-pages-articles.xml

/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not bzip commpressed, trying decompressed file

htop indicates that I have 18 GB reserved for the process, but memory consumption doesn't surpass 9 GB. Comparing my output to yours, it seems that you have an extra line after the second phase: building voc 61884. I wonder whether we are using different versions of the tool.
My git log says the last commit is:

commit bed428a4e94e6707cd070e639a71730fd99b9830
Author: Dru Wynings <[email protected]>
Date:   Mon Nov 16 18:12:40 2015 -0800

    Update README.md

@jodaiber

To add to what @tgalery said, I ran it with -Xmx28G and an SSD, and the same thing happens. It produced pairCounts, sfAndTotalCounts and uriCounts, but it gets stuck on the token counts, I presume?

@samhumeau
Contributor

@tgalery Building the voc is a step of the third pass (the third pass loads 3 files).
@jodaiber Yeah, it seems that both of you get stuck at the very beginning of the token count computation.

I used the same version as you during my test. That is very strange; I still need to be able to reproduce your problem, as the execution works well for me both on my laptop and on a server.

Has one of you used an instance from AWS/GCE/Azure, so that I can use the exact same configuration?

Also, can you tell me if the problem appears on a smaller dump (say the French one, which is about 5 times smaller, or the first 10M lines of the English wiki dump)? That way it would be easier to debug.

@jodaiber

Hey @Cerbal,

I just tried with French. It ran through but tokenCounts is empty. Here's the stderr:

/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages
Parsed 200000 pages
too many redirections: fusil/pistolet mitrailleur automatique
Parsed 300000 pages ... Parsed 3800000 pages
done in 172357ms
storage took 13
start parsing the dump: data/tmp/tmp_paragraphes

data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
done in 6ms
building voc 1463
start parsing the dump: data/tmp/tmp_referencess

data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
done in 104ms
last step 1568
all done in 775 seconds
$ head data/tmp/tmp_paragraphes 
$
$ head data/output/tokenCounts_fr 
$ 
$ head data/tmp/tmp_referencess 
$ 

Is there a way to increase the logging level?

Jo

@jodaiber

OK, in the last run it didn't find the French Lucene analyzer. I fixed that and re-ran everything, but it still does not produce any tokens.

$ wc -l data/*/*
   1636498 data/output/pairCounts_fr
   2557492 data/output/sfAndTotalCounts_fr
         0 data/output/tokenCounts_fr
   1081815 data/output/uriCounts_fr
         0 data/tmp/tmp_paragraphes
   1431829 data/tmp/tmp_redirections_fr
         0 data/tmp/tmp_referencess
   1401076 data/tmp/tmp_surface_form_counts_fr

and stderr:

/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages
Parsed 200000 pages
too many redirections: fusil/pistolet mitrailleur automatique
Parsed 300000 pages ... Parsed 3800000 pages
done in 170523ms
storage took 10
start parsing the dump: data/tmp/tmp_paragraphes

data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
done in 7ms
building voc 1457
start parsing the dump: data/tmp/tmp_referencess

data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
done in 104ms
last step 1562
all done in 631 seconds

@jodaiber

What Java version are you using? It's odd: sampling shows it's spending most of the time in DumpParser$Woker.writeInOutput(), but nothing is written (the file data/tmp/tmp_paragraphes isn't even created).

There was another issue for French with this line: this prefix only works for English. I fixed that in my branch, but the problem still remains.

@tgalery
Author

tgalery commented Jan 18, 2016

@Cerbal, do you know if there is any specific configuration for OS-level stuff, like the max number of open files in sysctl or things like that?

@jodaiber

So, here's the resolution to the problem.

For some reason my Java decides not to write out the queue for a long time (pretty much until the end). I am guessing this is an optimisation, because I gave it plenty of Xmx. Whatever the reason it is doing that, it means that the workers get stuck here. I am not sure whether the element_written_in_output--; after this section is a mistake or not, but it basically means the sleep() will be repeated with the next call to writeInOutput!

After removing element_written_in_output--;, everything works fine!
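For illustration, here is a rough sketch of the failure mode as I read it. Apart from writeInOutput and element_written_in_output, every name and number below is invented, and the real DumpParser code is organised differently:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Rough reconstruction of the failure mode; everything except the names
// writeInOutput and element_written_in_output is invented here.
public class ThrottleSketch {
    private final Queue<String> buffer = new ConcurrentLinkedQueue<>();
    private volatile long elementsQueued = 0;           // hypothetical counter
    volatile long element_written_in_output = 0;        // bumped by the writer thread
    private static final long MAX_BACKLOG = 10_000;     // hypothetical limit

    // Called by each worker; throttles when the writer thread falls behind.
    void writeInOutput(String line) throws InterruptedException {
        buffer.add(line);
        elementsQueued++;
        if (elementsQueued - element_written_in_output > MAX_BACKLOG) {
            Thread.sleep(1000);
            element_written_in_output--;  // BUG: makes the writer look even further behind,
                                          // so the very next call sleeps again; if the writer
                                          // thread never runs at all, the workers just keep
                                          // sleeping and nothing is ever flushed.
        }
    }
}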

@samhumeau
Contributor

Thanks Jo Daiber for your investigation, I think it led me to the solution.
I had never tested this code on a machine with fewer than 8 cores/threads. (I guess it is a first-world problem.)

Bad code on my part: I had used a thread pool whose number of threads was indexed on the number of cores. On fewer than 8 cores, what was happening is that the thread responsible for writing the output was executed after the others (while of course it should run at the same time).
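Roughly, the pattern looked like the sketch below; the task bodies and the worker count are illustrative, not the actual code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the scheduling pitfall (illustrative, not the real wikistatsextractor code):
// one fixed pool, sized from the core count, runs the parse workers AND the single output writer.
public class PoolSizingSketch {
    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();  // e.g. 4 on an i3 laptop
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < 7; i++) {              // long-running parse workers (count is illustrative)
            final int id = i;
            pool.execute(() -> {
                while (true) {                     // stands in for "parse pages until the dump is done"
                    System.out.println("worker " + id + " parsing...");
                    try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                }
            });
        }
        // Queued behind the workers: with fewer than 8 threads in the pool this task never starts,
        // so nothing is ever written and the throttled workers wait forever.
        pool.execute(() -> System.out.println("writer started"));
    }
}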

It should work now. Can one of you check on your machine so that I can close the issue?

@samhumeau
Contributor

@jodaiber Your solution was putting the whole output in memory and writing it out afterward.

@tgalery
Author

tgalery commented Jan 20, 2016

I'd be able to check it soon enough :-)


@rschwab68

I tested this tool for the first time with a small fraction of a German Wikipedia dump.
I have the same results as Jo Daiber: no tokenCounts.

reinhard@linux-rmha:~/entity_linking/dbpedia_spotlight/wikistatsextractor> ls -lR data
data:
insgesamt 24
drwxr-xr-x 2 reinhard users 4096  1. Mär 11:09 output
-rw-r--r-- 1 reinhard users  777  1. Mär 10:57 stopwords.de.list
-rw-r--r-- 1 reinhard users 6047  1. Mär 10:40 stopwords.en.list
-rw-r--r-- 1 reinhard users 2855  1. Mär 10:40 stopwords.fr.list
drwxr-xr-x 2 reinhard users 4096  1. Mär 11:09 tmp
data/output:
insgesamt 16
-rw-r--r-- 1 reinhard users 2818  1. Mär 11:09 pairCounts_de
-rw-r--r-- 1 reinhard users 1718  1. Mär 11:09 sfAndTotalCounts_de
-rw-r--r-- 1 reinhard users   18  1. Mär 10:40 timestamp
-rw-r--r-- 1 reinhard users    0  1. Mär 11:09 tokenCounts_de
-rw-r--r-- 1 reinhard users 1777  1. Mär 11:09 uriCounts_de
data/tmp:
insgesamt 12
-rw-r--r-- 1 reinhard users  18  1. Mär 10:40 timestamp
-rw-r--r-- 1 reinhard users   0  1. Mär 11:09 tmp_paragraphes
-rw-r--r-- 1 reinhard users 406  1. Mär 11:09 tmp_redirections_de
-rw-r--r-- 1 reinhard users   0  1. Mär 11:09 tmp_referencess
-rw-r--r-- 1 reinhard users 765  1. Mär 11:09 tmp_surface_form_counts_de

@rschwab68

I may have fixed the issue with no tokenCounts. See #3.

@jodaiber

jodaiber commented Mar 1, 2016

Hi @rschwab68,

These issues are fixed in this fork: https://github.com/jodaiber/wikistatsextractor.
An example execution is here, and the data produced by this version can be found here.

Best,
Jo

@rschwab68

Ah, OK. Thanks @jodaiber for your effort.
I see that you have avoided a string prefix by using

String uri = split[0].split("/", 5)[4];
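
For anyone reading later, a quick illustration of what that split does, assuming the first column holds a standard DBpedia resource URI (the example value is made up):

// split("/", 5) keeps everything after the 4th slash in one piece, so the language
// prefix in the host name no longer matters.
public class UriSplitExample {
    public static void main(String[] args) {
        String[] split = { "http://fr.dbpedia.org/resource/Tour_Eiffel" };
        // ["http:", "", "fr.dbpedia.org", "resource", "Tour_Eiffel"]
        String uri = split[0].split("/", 5)[4];
        System.out.println(uri);   // prints Tour_Eiffel, whatever the language edition
    }
}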

best regards,
Reinhard

akms17 pushed a commit to akms17/wikistatsextractor that referenced this issue Jun 2, 2016
fixed bug in spotting known surface forms