Retry "org.hbase.async.RemoteException: Call queue is full on" RPCs #135

Open
manolama opened this issue Apr 25, 2016 · 12 comments

@manolama
Member

HBase 1.x and later return an exception when the call queue is full. The native client retries these calls as if it were a recoverable exception. AsyncHBase should do the same.
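
Until something like this lands in the library, a minimal application-side sketch of the requested behavior might look like the following. It is not part of asynchbase; the helper name, the retry policy, and the assumption that the server-side message still contains "Call queue is full" are all illustrative. It simply re-issues the RPC with a small backoff instead of treating the exception as fatal:

```java
import java.util.ArrayList;
import java.util.concurrent.Callable;

import com.stumbleupon.async.Deferred;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;
import org.hbase.async.RemoteException;

public class CallQueueRetry {

  /** Re-runs the RPC while the region server reports a full call queue. */
  static <T> T withRetry(final Callable<Deferred<T>> rpc,
                         final int maxAttempts,
                         final long backoffMs) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return rpc.call().join();  // blocking only to keep the example short
      } catch (RemoteException e) {
        final String msg = e.getMessage();
        // Assumption: the server-side message still says "Call queue is full".
        final boolean queueFull = msg != null && msg.contains("Call queue is full");
        if (!queueFull || attempt >= maxAttempts) {
          throw e;  // some other failure, or we ran out of attempts
        }
        Thread.sleep(backoffMs * attempt);  // crude linear backoff
      }
    }
  }

  public static void main(final String[] args) throws Exception {
    final HBaseClient client = new HBaseClient("localhost");  // ZK quorum is made up
    // Build a fresh GetRequest per attempt rather than re-sending a failed RPC object.
    final ArrayList<KeyValue> row = withRetry(
        () -> client.get(new GetRequest("mytable", "myrow")), 5, 200L);
    System.out.println("cells: " + row.size());
    client.shutdown().join();
  }
}
```

Blocking with join() is only to keep the example short; a fully asynchronous caller would re-issue the RPC from an errback and complete its own Deferred instead.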

@vitaliyf

vitaliyf commented May 9, 2016

Hi,

Do you know what the plan is for fixing this?

@devaudio

Is there a workaround or anything?

@vitaliyf

@manolama, can you glance at this PR - does it match your plan for this issue? These changes don't solve the issue for us yet, though they do catch these exceptions, so presumably we need further handling to retry them.

@manolama
Member Author

manolama commented Sep 7, 2016

@vitaliyf @stlava That's a good start and should help, but it invalidates the region cache, which is something we likely don't want. If you want to issue a PR for this part we can get it in. Thanks.

@manolama manolama added the bug label Sep 17, 2016
@dsimmie

dsimmie commented Sep 23, 2016

I am getting this error using Cloudera CDH5.7.2 which comes with HBase 1.2. I am using v1.7.0 of the asynchbase client library in Scala.

I have been able to work around it for some Get requests by increasing hbase.regionserver.handler.count and limiting the number of outstanding requests before collecting results. However, I have some large Get requests that hit memory limits with so many concurrent threads. At least that is what I think is happening; they are timing out and I'm not sure why. I had increased hbase.regionserver.handler.count to 128, which worked fine until I started working with larger Get requests.
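
As a rough illustration of the throttling part of that workaround (not code from this thread), here is a Java sketch that caps the number of in-flight Gets with a semaphore so a large batch doesn't flood the region servers' call queues. The limit of 64 and the table/row names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

import com.stumbleupon.async.Callback;
import com.stumbleupon.async.Deferred;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;

public class ThrottledGets {
  public static void main(final String[] args) throws Exception {
    final HBaseClient client = new HBaseClient("localhost");  // ZK quorum is made up
    final Semaphore inFlight = new Semaphore(64);             // at most 64 outstanding Gets

    final List<Deferred<ArrayList<KeyValue>>> pending = new ArrayList<>();
    for (int i = 0; i < 500_000; i++) {
      inFlight.acquire();  // block until a slot frees up
      final Deferred<ArrayList<KeyValue>> d =
          client.get(new GetRequest("mytable", "row-" + i));
      d.addBoth(new Callback<Object, Object>() {
        public Object call(final Object result) {
          inFlight.release();  // free the slot whether the Get succeeded or failed
          return result;
        }
      });
      pending.add(d);
    }
    Deferred.group(pending).join();  // throws if any Get ultimately failed
    client.shutdown().join();
  }
}
```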

My typical use case is querying for 500k-1m random Gets (out of a total of 25m rows) stored on 9 region servers, hosting 100 pre-split regions. 1m row keys roughly equate to 8GB of data.

Are there any other workarounds or advice for dealing with the call queue full issue?

@devaudio

All I keep doing is resizing/splitting regions until they are smaller and smaller. I have 40 region servers and currently 258 regions.

@vitaliyf

Our workaround (on same CDH5.7.x) was to set tsd.core.meta.enable_realtime_ts = false.

@dsimmie

dsimmie commented Sep 26, 2016

@vitaliyf from my reading, the setting tsd.core.meta.enable_realtime_ts seems to be related to OpenTSDB (see the entry on metadata here), and I'm not using OpenTSDB. I cannot see why it would apply to the asynchbase client itself. Is it used in asynchbase, and if so, where do I set it? I have added that entry to my config file and nothing has changed.

@mikhail-antonov

I seem to have lost track here; what is the current status? Apache HBase 1.3 was released this January, so I would think this issue is resolved?

The proper behavior should be not to bail out but to retry on that kind of exception, while avoiding clearing the location cache, since it's likely a temporary overload and not a permanent failure.

@manolama
Member Author

manolama commented Feb 5, 2018

@mikhail-antonov This can still happen in 1.3; it simply has to do with a region server being unable to handle the request load. We can add code to AsyncHBase that would buffer and retry requests with a delay, but that only makes sense for buffered writes. For reads, it makes more sense to fail the RPC and let the application figure out what to do, I think.

@dsimmie You're correct, that has no effect on AsyncHBase.

@stannie42

Thanks @manolama for the update. How does the native client behave with GetRequests? Doesn't it retry as it does for PutRequests? Is anyone working on this bug? We are using asynchbase outside OpenTSDB and are highly affected by this.

@manolama
Member Author

manolama commented Apr 4, 2018

@stannie42 Not too sure yet regarding the native client, but we just upgraded internally to 1.3 and hit the issue when the HBase config changed and merged the read and write queues. We're separating them again, and if that solves it I'd suggest you try it as well.
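
For reference, the region-server settings that control whether reads and writes share call queues are hbase.ipc.server.callqueue.read.ratio (0 keeps a single shared set of queues; anything above 0 splits them into separate write and read queues) and hbase.ipc.server.callqueue.handler.factor. A sketch of an hbase-site.xml fragment, with illustrative values only (tune them to the cluster):

```xml
<!-- Illustrative only; set on the region servers and restart them to apply. -->
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.1</value> <!-- number of call queues as a fraction of the handler count -->
</property>
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.5</value> <!-- split the queues roughly evenly between writes and reads -->
</property>
```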
