Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parseIt is not threadsafe (was: Probabilities printed out of order for multi-threaded LM parser) #16

Open
syllog1sm opened this issue Nov 19, 2013 · 14 comments
Labels

Comments

@syllog1sm
Copy link

in first-stage/PARSE/parseIt.C , the probabilities are written out directly, instead of being pushed to the print stack. This means that they appear out of order in multi-threaded LM parsing.

@dmcc
Copy link
Member

dmcc commented Nov 19, 2013

Thanks for the report! That's certainly an issue, but I'm afraid parseIt may have much larger issues.

At least in parsing mode, I don't think that parseIt is actually thread safe -- I've gotten different outputs (or crashes) when using multithreaded vs. single threaded so I don't recommend it. This is a long time overdue, so I'm finally updating the README and help menu to discourage people from using multiple threads with parseIt for now (b050590)

Of course, if anyone wants to fix this printing issue or the larger multithreading problems, I wouldn't object :)

@KantanLabs
Copy link

Dear dmcc,

I don't know if there is any progress on the bllip parser since 2013, but currently I am using it for a large project -- where a lot of sentences are needed to be processed in parallel -- and I was wondering if the multi-threading issue is resolved.

Have there been some solutions to allow the bllip-parser to run multi-threaded and being thread-safe?

Thanks,
Dimitar.

@dmcc
Copy link
Member

dmcc commented May 23, 2017

Unfortunately, I'm not aware of any progress on BLLIP in this area. In the past, I've run multiple instances of the parser as a workaround. I still do not recommend using the -t option.

@KantanLabs
Copy link

Hello dmcc,

Thank you for the quick response. That's pity. I have built a server/wrapper around the parser and it works great on a single core. However, it is rather slow when comparing to , e.g., Stanford which runs multiple threads. On a single thread the BLLIP is actually faster and give a better accuracy.

I am willing to take a look at the code. Do you have any suggestions what needs to be fixed?

Cheers,
Dimitar.

@dmcc
Copy link
Member

dmcc commented May 24, 2017

Hi Dimitar,

That would be great if you're willing to take a look at it. It's been a while so my memory is pretty fuzzy on this, but I believe the issue is a race condition in the first stage parser which we never really narrowed down. If you repeatedly parse a small set of long enough sentences with -t > 1, you should be able to see some cases where you get different parses for the same sentence. Most of the code uses arrays to separate structures used in different threads (https://github.com/BLLIP/bllip-parser/search?utf8=%E2%9C%93&q=MAXNUMTHREADS&type=), but my guess is that there's some global structure that should have got this treatment, but didn't.

Good luck!
David

@KantanLabs
Copy link

Hi David,

So, I did some look into the multithreading today. Something is definitely fishy. So what I did is just run some tests and some diagnostics. Here is a summary. I will continue digging and will keep posting.

  • Upgraded to gcc 7 and compiled with -fsanitize=thread -fPIE -pie -g -> no problems
    after running parseIt the tool hanged and couldn't be terminated (except with a kill -s 9)
  • Clean compile and run parseIt with -t64, -t1, -t4, -t2, -t24 for different datasets
    -t2 is faster than -t1; the rest are slower.
  • Ran valgrind with helgrind.
    For t2:
    ==23834== ERROR SUMMARY: 51313 errors from 23 contexts (suppressed: 919 from 32)
    For single thread
    ==23867== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)

Cheers,
Dimitar.

@dmcc
Copy link
Member

dmcc commented May 26, 2017

Thanks for the update -- please keep them coming! Yeah, that sounds (vaguely) familiar. I think I ran helgrind and it found issues, but I forget how much deeper I dug than that.

@KantanLabs
Copy link

KantanLabs commented May 26, 2017

OK, So, two things.

First, in the parseIt/mainLoop there is a print loop:

	while(!printStack.empty())
	{
		sleep(SLEEPTIME);
		workOnPrintStack(&printStack);
	}

which is outside of the reading/parsing loop. This loop and the sleep delay the printing process, even if the parse is finished. To be honest, I don't get why is this loop there since there is a print workOnPrintStack(&printStack); call inside the main loop (see line 289 in parseIt.C).

Second, helgrind detects a racing condition somewhere in MeChart::bestParse. It complains about complains about functions from fnSubFns,C which make access to the FullHist*. I assume that the issue arrises there - reading and writing something to FullHist in an uncontrolled way.

I haven't looked more in details today. If I have time next week I will dig more into it.

Cheers,
Dimitar.

@KantanLabs
Copy link

Hi David,

I did some look around and to be honest didn't figure out what exactly causes helgrade to complain. I modified some things in the parse method.

First, switched the global sentence count from int to unsigned long long int. If I am not wrong, for more than 32,767 sentences there would be errors.

Second, the printing function I took out of the main loop - this way they should work on parallel (added one more semaphore). If you want, I will send the code for review.

But I wanted to ask - how do I set the option for multithreading (the -t option) with the python wrapper?

Thanks,
Dimitar.

@dmcc
Copy link
Member

dmcc commented Jun 18, 2017 via email

@KantanLabs
Copy link

Hello David,

Sorry for my late reply - being busy with lots of stuff.

Once I have a bit of free time I will submit the change for the long long int.

Regarding the race condition, it is possible that the history (FullHist) is not thread safe... but not yet verified.

I have, however, two questions:

  1. If I submit more code changes, will you review it? (maybe that sounds like a silly question, but to be honest I haven't worked on a github project like this before.)
  2. The whole idea for fixing the multithreading arose from the fact that I have a python rest-server which I want to connect to the parser and have it running as fast and as distributed as possible. This, however, will not be solved with having the parser multithreaded, because the python wrapper by itself parses the sentences one-by-one (sending and waiting for a response). So, having a python rest server will not help. So, my second question - can parseIt.C be extended with REST capabilities? to listen to a port, receive a request and then send the parse back (in, e.g., json format). I have been looking into pistache for such a solution, but still need to work on it. If you have any idea/suggestion/comment on this, that would be great.

Thanks,
Kind regards,
Dimitar.

@dmcc
Copy link
Member

dmcc commented Jun 21, 2017 via email

@KantanLabs
Copy link

Hi David,

After long search through the BLLIP parser code I am sort of giving up. I reached to a point where I think (one) problem comes from the "edgeSubFns". There is this array of functions that is accessed by each thread. I tried to work on this, but couldn't make it work.

Anyway, I tried a lot of other things (following gdb and valgrind's info) but coudln't make it work properly for multiple threads.

I would like to share my last version and maybe you or somebody can take over. Maybe I am on the right track and just missing something small (or maybe I am shoooting tooo far away).

Cheers,
Dimitar.

@dmcc
Copy link
Member

dmcc commented Aug 5, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants