Issues on English Wikipedia #1

Open
tgalery opened this issue Jan 5, 2016 · 19 comments

@tgalery

tgalery commented Jan 5, 2016

Hi there, and thanks for coming up with the tool. I tried to run the extraction on a decompressed dump of the English Wikipedia, and the process got stuck on the third parse of the dump for a few hours.

Here is the output:

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pagesdone in 1601421ms
too many redirections: KWV,,KWV  Koöperatieve Wijnbouwers Vereniging van Zuid-Afrika Bpkt,,KWV,,13
3512031 elements in the interesting sf
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 15200000 pagesdone in 2852276ms
start parsing the dump: /home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml

/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not gzipped, trying bzip input stream..
/home/intruder/datasets/wikipedia/en/enwiki-latest-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages

I specified 14 GB of RAM. Is there anything I might be doing wrong?

@samhumeau
Contributor

Hi tgalery, sorry for answering only now.

I notice that the first and second phases (out of 3) took quite a bit of time (1600 s + 2852 s). Out of curiosity (I don't think it matters), what kind of machine is it? How many cores? Is it an AWS/GCE/Azure instance (those are usually weaker than real cores)? Are you working on an SSD?

I tested it today with the latest Wikipedia dump currently available on https://dumps.wikimedia.org/enwiki/latest/ (dating from the 3rd of December), and it worked fine with 14 GB.

  • Since 14 GB should be at the edge of what is needed, have you tried with 15 GB or more of RAM?
  • Have you checked that your system has 14 GB of RAM actually available? (If you force the system to use swap, things can get stuck.)

Here is my output for the latest Wikipedia dump, obtained by following exactly the instructions at https://github.com/diffbot/wikistatsextractor. Have you done anything differently?

start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 621446ms
3685939 elements in the interesting sf
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 801467ms
start parsing the dump: /mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not gzipped, trying bzip input stream..
/mnt/hd3/Thoth/data/bleeding_edge/en/wikidump/dump is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 478843ms
storage took 16080
start parsing the dump: data/tmp/tmp_paragraphes
data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
Parsed 20100000 pagesdone in 61597ms
building voc 61884
start parsing the dump: data/tmp/tmp_referencess
data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
Parsed 1600000 pagesdone in 63598ms
last step 125483
all done in 2128 seconds
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 35:29 min
[INFO] Finished at: 2016-01-13T06:29:49-08:00
[INFO] Final Memory: 61M/10547M
[INFO] ------------------------------------------------------------------------

@tgalery
Author

tgalery commented Jan 13, 2016

Hi @Cerbal, I ran it on an i3 laptop without an SSD (but I was using a hybrid HD with 8 GB of SSD cache). Is an SSD a must?

@samhumeau
Contributor

Truth be told, I have never tested it on a non-SSD device. I made this software to exploit the speed of SSDs and thereby avoid having to deploy a Hadoop cluster to extract info from something as small as Wikipedia (60 GB isn't that much). It shouldn't matter, though; it should just be slower.

But that definitely explains why the first 2 phases were slow. I still don't get why it gets stuck, though; the 3rd pass is the most intense in terms of RAM. How much RAM do you have on that machine?

@tgalery
Author

tgalery commented Jan 13, 2016

I have 16 GB and ran it with 14 GB reserved via Maven, but I might run this on a proper cloud provider with an SSD and report back.

@tgalery
Author

tgalery commented Jan 15, 2016

OK, so I tried again using a 6-core server with 32 GB of RAM. I'm running it under screen, but I made sure to allocate enough RAM to the process. It still gets stuck:

start parsing the dump: /data/ssd/wikipedia/enwiki-20151201-pages-articles.xml

/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 711739ms
3685987 elements in the interesting sf
start parsing the dump: /data/ssd/wikipedia/enwiki-20151201-pages-articles.xml

/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 16100000 pagesdone in 988640ms
start parsing the dump: /data/ssd/wikipedia/enwiki-20151201-pages-articles.xml

/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/enwiki-20151201-pages-articles.xml is not bzip commpressed, trying decompressed file

htop indicates that I have 18 GB reserved for the process, but memory consumption doesn't surpass 9 GB. Comparing my output to yours, it seems that you have an extra line after the second phase: building voc 61884. I wonder whether we are using different versions of the tool.
My git log says the last commit is:

commit bed428a4e94e6707cd070e639a71730fd99b9830
Author: Dru Wynings <[email protected]>
Date:   Mon Nov 16 18:12:40 2015 -0800

    Update README.md

@jodaiber

To add to what @tgalery said, I ran it with -Xmx28G and an SSD, and the same thing happens. It produced pairCounts, sfAndTotalCounts and uriCounts, but it gets stuck on the token counts, I presume?

@samhumeau
Contributor

@tgalery Building the voc is a step of the third pass (the third pass loads 3 files).
@jodaiber Yeah, it seems that both of you get stuck at the very beginning of the token count computation.

I used the same version as you during my test. That is very strange; I still need to be able to reproduce your problem, as the execution works well for me both on my laptop and on a server.

Has one of you used an instance from AWS/GCE/Azure, so that I can use the exact same configuration?

Also, can you tell me if the problem appears on a smaller dump (say the French one, which is about 5 times smaller, or the first 10M lines of the English wiki dump)? That way it would be easier to debug.

@jodaiber

Hey @Cerbal,

I just tried with French. It ran through but tokenCounts is empty. Here's the stderr:

/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages
Parsed 200000 pages
too many redirections: fusil/pistolet mitrailleur automatique
Parsed 300000 pages ... Parsed 3800000 pages
done in 172357ms
storage took 13
start parsing the dump: data/tmp/tmp_paragraphes

data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
done in 6ms
building voc 1463
start parsing the dump: data/tmp/tmp_referencess

data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
done in 104ms
last step 1568
all done in 775 seconds
$ head data/tmp/tmp_paragraphes 
$
$ head data/output/tokenCounts_fr 
$ 
$ head data/tmp/tmp_referencess 
$ 

Is there a way to increase the logging level?

Jo

@jodaiber

OK, in the last run it didn't find the French Lucene analyzer. I fixed that and re-ran everything, but it still does not produce any tokens.

$ wc -l data/*/*
   1636498 data/output/pairCounts_fr
   2557492 data/output/sfAndTotalCounts_fr
         0 data/output/tokenCounts_fr
   1081815 data/output/uriCounts_fr
         0 data/tmp/tmp_paragraphes
   1431829 data/tmp/tmp_redirections_fr
         0 data/tmp/tmp_referencess
   1401076 data/tmp/tmp_surface_form_counts_fr

and stderr:

/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not gzipped, trying bzip input stream..
/data/ssd/wikipedia/frwiki-20160111-pages-articles.xml is not bzip commpressed, trying decompressed file
Parsed 100000 pages
Parsed 200000 pages
too many redirections: fusil/pistolet mitrailleur automatique
Parsed 300000 pages ... Parsed 3800000 pages
done in 170523ms
storage took 10
start parsing the dump: data/tmp/tmp_paragraphes

data/tmp/tmp_paragraphes is not gzipped, trying bzip input stream..
data/tmp/tmp_paragraphes is not bzip commpressed, trying decompressed file
done in 7ms
building voc 1457
start parsing the dump: data/tmp/tmp_referencess

data/tmp/tmp_referencess is not gzipped, trying bzip input stream..
data/tmp/tmp_referencess is not bzip commpressed, trying decompressed file
done in 104ms
last step 1562
all done in 631 seconds

@jodaiber

What Java version are you using? It's odd: sampling shows it's spending most of the time in DumpParser$Woker.writeInOutput(), but nothing is written (the file data/tmp/tmp_paragraphes isn't even created).

There was another issue for French with this line: this prefix only works for English. I fixed that in my branch, but the problem still remains.

@tgalery
Author

tgalery commented Jan 18, 2016

@Cerbal, do you know if there is any specific configuration for OS-level stuff, like the max number of open files in sysctl or things like that?

@jodaiber

So, here's the resolution to the problem.

For some reason my Java decides not to write out the queue for a long time (pretty much until the end). I am guessing this is an optimisation, because I gave it plenty of Xmx. Whatever the reason it is doing that, it means that the workers get stuck here. I am not sure whether the element_written_in_output--; after this section is a mistake or not, but it basically means the sleep() will be repeated with the next call to writeInOutput!

After removing element_written_in_output--;, everything works fine!
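For illustration, here is a rough sketch of the failure mode as I read it. Apart from writeInOutput and element_written_in_output, every name and number below is invented, and the real DumpParser code is organised differently:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Rough reconstruction of the failure mode; everything except the names
// writeInOutput and element_written_in_output is invented here.
public class ThrottleSketch {
    private final Queue<String> buffer = new ConcurrentLinkedQueue<>();
    private volatile long elementsQueued = 0;           // hypothetical counter
    volatile long element_written_in_output = 0;        // bumped by the writer thread
    private static final long MAX_BACKLOG = 10_000;     // hypothetical limit

    // Called by each worker; throttles when the writer thread falls behind.
    void writeInOutput(String line) throws InterruptedException {
        buffer.add(line);
        elementsQueued++;
        if (elementsQueued - element_written_in_output > MAX_BACKLOG) {
            Thread.sleep(1000);
            element_written_in_output--;  // BUG: makes the writer look even further behind,
                                          // so the very next call sleeps again; if the writer
                                          // thread never runs at all, the workers just keep
                                          // sleeping and nothing is ever flushed.
        }
    }
}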

@samhumeau
Contributor

Thanks Jo Daiber for your investigation, I think it led me to the solution.
I had never tested this code on a machine with fewer than 8 cores/threads. (I guess it is a first-world problem.)

Bad code on my part: I had used a thread pool whose number of threads was indexed on the number of cores. On fewer than 8 cores, what was happening is that the thread responsible for writing the output was executed after the others (while of course it should run at the same time).
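Roughly, the pattern looked like the sketch below; the task bodies and the worker count are illustrative, not the actual code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the scheduling pitfall (illustrative, not the real wikistatsextractor code):
// one fixed pool, sized from the core count, runs the parse workers AND the single output writer.
public class PoolSizingSketch {
    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();  // e.g. 4 on an i3 laptop
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < 7; i++) {              // long-running parse workers (count is illustrative)
            final int id = i;
            pool.execute(() -> {
                while (true) {                     // stands in for "parse pages until the dump is done"
                    System.out.println("worker " + id + " parsing...");
                    try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                }
            });
        }
        // Queued behind the workers: with fewer than 8 threads in the pool this task never starts,
        // so nothing is ever written and the throttled workers wait forever.
        pool.execute(() -> System.out.println("writer started"));
    }
}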

It should work now. Can one of you check on your machine so that I can close the issue?

@samhumeau
Contributor

@jodaiber Your solution was putting the whole output in memory and writing it out afterward.

@tgalery
Author

tgalery commented Jan 20, 2016

I'd be able to check it soon enough :-)


@rschwab68

I tested this tool for the first time with a small fraction of a German Wikipedia dump.
I have the same results as Jo Daiber: no tokenCounts.

reinhard@linux-rmha:~/entity_linking/dbpedia_spotlight/wikistatsextractor> ls -lR data
data:
insgesamt 24
drwxr-xr-x 2 reinhard users 4096  1. Mär 11:09 output
-rw-r--r-- 1 reinhard users  777  1. Mär 10:57 stopwords.de.list
-rw-r--r-- 1 reinhard users 6047  1. Mär 10:40 stopwords.en.list
-rw-r--r-- 1 reinhard users 2855  1. Mär 10:40 stopwords.fr.list
drwxr-xr-x 2 reinhard users 4096  1. Mär 11:09 tmp
data/output:
insgesamt 16
-rw-r--r-- 1 reinhard users 2818  1. Mär 11:09 pairCounts_de
-rw-r--r-- 1 reinhard users 1718  1. Mär 11:09 sfAndTotalCounts_de
-rw-r--r-- 1 reinhard users   18  1. Mär 10:40 timestamp
-rw-r--r-- 1 reinhard users    0  1. Mär 11:09 tokenCounts_de
-rw-r--r-- 1 reinhard users 1777  1. Mär 11:09 uriCounts_de
data/tmp:
insgesamt 12
-rw-r--r-- 1 reinhard users  18  1. Mär 10:40 timestamp
-rw-r--r-- 1 reinhard users   0  1. Mär 11:09 tmp_paragraphes
-rw-r--r-- 1 reinhard users 406  1. Mär 11:09 tmp_redirections_de
-rw-r--r-- 1 reinhard users   0  1. Mär 11:09 tmp_referencess
-rw-r--r-- 1 reinhard users 765  1. Mär 11:09 tmp_surface_form_counts_de

@rschwab68

I may have fixed the issue with no tokenCounts. See #3.

@jodaiber

jodaiber commented Mar 1, 2016

Hi @rschwab68,

These issues are fixed in this fork: https://github.com/jodaiber/wikistatsextractor.
An example execution is here, and the data produced by this version can be found here.

Best,
Jo

@rschwab68

Ah, OK. Thanks @jodaiber for your effort.
I see that you have avoided a string prefix by using

String uri = split[0].split("/", 5)[4];
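
For anyone reading later, a quick illustration of what that split does, assuming the first column holds a standard DBpedia resource URI (the example value is made up):

// split("/", 5) keeps everything after the 4th slash in one piece, so the language
// prefix in the host name no longer matters.
public class UriSplitExample {
    public static void main(String[] args) {
        String[] split = { "http://fr.dbpedia.org/resource/Tour_Eiffel" };
        // ["http:", "", "fr.dbpedia.org", "resource", "Tour_Eiffel"]
        String uri = split[0].split("/", 5)[4];
        System.out.println(uri);   // prints Tour_Eiffel, whatever the language edition
    }
}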

best regards,
Reinhard

akms17 pushed a commit to akms17/wikistatsextractor that referenced this issue Jun 2, 2016
fixed bug in spotting known surface forms