When a scanner lease expires, the client endlessly retries the request against the same RegionServer. RS too busy! #198
Comments
In addition, I use asynchbase 1.8.2 and HBase 2.1.0.
Yeah, let's increment the count and change it to NonRecoverableException. Could you issue a PR please?
This looks like a severe bug in OpenTSDB, and I am using the latest version. I will downgrade OpenTSDB for now.
The fix for this one will be a bit more involved if we want it to be recoverable/retry-able.
@manolama We are also hitting this issue. Basically, in the HBase RS we start getting this
printed a lot until the RegionServer dies out of memory because of these requests. If I understand correctly, this is because OpenTSDB keeps retrying the same requests. Sadly, this causes our whole OpenTSDB/HBase cluster to die, because it eventually happens to the RegionServer hosting the .META region. Can you please advise us how to downgrade, and to what version? We use HBase 2.0.0.
One more comment: I think having this merge request merged is better than the current state. It is better to fail the user request than to loop forever on retries that have no chance of success and take down the whole cluster.
We've also encountered this on HBase 2.1.2. We've tried running with this patch, and it is certainly an improvement: the tsdb servers no longer kill our entire HBase cluster. However, the tsdb servers do stop processing requests and have to be restarted manually. I've also tried downgrading to asynchbase 1.8.1, but that does not work with HBase 2.
From reading the code I think the
is a good idea anyway, since we never want infinite retries. @manolama Can you give us some guidance on how to build OpenTSDB with a custom patched version of asynchbase?
@manolama We are using HBase 0.98 and saw the UnknownScannerException problem when running tsdb 2.4 with asynchbase 1.8.2. We then switched back to tsdb 2.2 and asynchbase 1.7.1, and the problem went away. Since we still needed some features of tsdb 2.4, we tried running tsdb 2.4 with the older asynchbase 1.7.1; however, we still see some UnknownScannerException errors.
Just wanted to provide some feedback on this one: we were having major issues for a while. After long investigations and trying a few other things (including adding health-checks and auto-restarts by moving to K8; it helped as a work-around, but we were still getting outages), we ended up applying the patch in PR 202, and that basically solved our issues. Our OpenTSDBs are now much healthier.
May I ask where the class for this source code is located?
I too seem to be seeing this bug with OpenTSDB 2.4 and AsyncHBase 1.8.2. I have tried, unsuccessfully, to compile AsyncHBase with this change. I believe some operations in the code are incompatible with my JDK version (11), and I am not able to downgrade. Can someone please provide the jar file with this modification? Thank you.
Seems related to this change: 061ec3.
Is this project dead? Why is this problem still not fixed?
Issue description:
When an RS has a hot region, tsdb's scanner leases may expire. Once many scanners have expired, on the RegionServer side we see too many handlers busy handling scan requests, and they throw "UnknownScannerException" with "missing scanner" logs like this:
Furthermore, the scanner with the same scanner_id keeps retrying its RPC to the RS forever. The RS gets even busier handling these endless "missing scanner" requests.
These are the debug logs on the tsdb side:
Moreover, this bug can be reproduced reliably by adding "Thread.sleep(61000)" to the Scanner class's nextRow() function.
This is my bug fix:
Simply increment the RPC's attempt count by 1 before invoking sendRpc(). Once the RPC's retry count exceeds hbase.client.retries.number, it gives up.
One more thing: is it OK to change the UnknownScannerException to a NonRecoverableException?
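To make the idea of the fix concrete, here is a minimal, self-contained sketch of a retry loop with an attempt budget. It is not the actual asynchbase code or the patch itself: the `Rpc` class, `TooManyAttemptsException`, and `sendWithBudget` are hypothetical names invented for illustration, and `maxRetries` stands in for the value of hbase.client.retries.number.

```java
import java.util.function.Supplier;

public class BoundedRetrySketch {
    /** Hypothetical stand-in for an RPC that carries its own attempt counter. */
    static final class Rpc {
        int attempt = 0;
    }

    /** Hypothetical stand-in for a non-recoverable client-side error. */
    static final class TooManyAttemptsException extends RuntimeException {
        TooManyAttemptsException(String message) {
            super(message);
        }
    }

    /**
     * Keeps sending the RPC until it succeeds or the attempt budget is spent.
     * The supplier returns true on success and false when the server answers
     * with a retryable error such as a missing scanner.
     * Returns the number of attempts used on success.
     */
    static int sendWithBudget(Rpc rpc, int maxRetries, Supplier<Boolean> send) {
        while (true) {
            // Count the attempt BEFORE sending, so every retry is accounted for.
            rpc.attempt++;
            if (rpc.attempt > maxRetries) {
                // Fail the request instead of hammering the server forever.
                throw new TooManyAttemptsException(
                    "giving up, retry budget of " + maxRetries + " exhausted");
            }
            if (send.get()) {
                return rpc.attempt;
            }
        }
    }

    public static void main(String[] args) {
        Rpc rpc = new Rpc();
        try {
            // Simulate a server that always reports the scanner as missing.
            sendWithBudget(rpc, 10, () -> false);
        } catch (TooManyAttemptsException e) {
            System.out.println("request failed: " + e.getMessage());
        }
    }
}
```

Without the increment, the counter never reaches the limit and the loop never terminates, which matches the endless retries described above. Whether the final failure should surface as a NonRecoverableException is the open question raised in this issue.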