When a scanner lease expires, the client endlessly retries the request against the same RegionServer, making the RS too busy #198

xuming01 · 2018-11-02T09:54:36Z

Issue description:
When a RegionServer (RS) has a hot region, TSDB's scanner leases may expire. Once many scanners have expired, many handlers on the RegionServer side end up handling scan requests, each of which throws an "UnknownScannerException" and logs a "missing scanner" warning like this:

2018-11-02 16:46:40,580 WARN  [RpcServer.default.RWQ.Fifo.scan.handler=380,queue=38,port=60020] regionserver.RSRpcServices: Client tried to access missing scanner 5816065332938628527

Furthermore, the client keeps resending the RPC for the same scanner_id to the RS indefinitely. The RS then becomes even busier handling these endless "missing scanner" requests.

These are the debug logs on the TSDB side:

17:27:10.561 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - ------------------>> ENTERING DECODE >>------------------
17:27:20.561 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - rpcid=1335, response size=1126 bytes, 0 readable bytes left, rpc=CloseScannerRequest(scanner_id=0x00B6D09B00455DAF, attempt=0)
17:27:30.980 DEBUG [AsyncHBase Timer HBaseClient #1] [RegionClient.encode] - [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] Sending RPC #1336, payload=BigEndianHeapChannelBuffer(ridx=11, widx=42, cap=42) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 11, 8, -72, 10, 26, 4, 83, 99, 97, 110, 32, 1, 14, 24, -81, -69, -107, -126, -80, -109, -76, -37, 80, 32, 0, 40, 1]
17:27:30.982 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] WRITTEN_AMOUNT: 31
17:27:30.983 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=1126, cap=1126)
17:27:30.987 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - ------------------>> ENTERING DECODE >>------------------
17:27:40.988 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.decode] - rpcid=1336, response size=1126 bytes, 0 readable bytes left, rpc=CloseScannerRequest(scanner_id=0x00B6D09B00455DAF, attempt=0)
17:27:51.400 DEBUG [AsyncHBase Timer HBaseClient #1] [RegionClient.encode] - [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] Sending RPC #1337, payload=BigEndianHeapChannelBuffer(ridx=11, widx=42, cap=42) [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 11, 8, -71, 10, 26, 4, 83, 99, 97, 110, 32, 1, 14, 24, -81, -69, -107, -126, -80, -109, -76, -37, 80, 32, 0, 40, 1]
17:27:51.401 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] WRITTEN_AMOUNT: 31
17:27:51.402 DEBUG [AsyncHBase I/O Worker #2] [RegionClient.handleUpstream] - handleUpstream [id: 0xfbba7e26, /xxx:33732 => /xxx:60020] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=1126, cap=1126)

Moreover, this bug can be reproduced reliably by adding "Thread.sleep(61000)" to the nextRow() function of the Scanner class.
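
To illustrate why a 61-second pause is enough to trigger this, here is a minimal sketch of server-side scanner leases. It is a hypothetical model, not HBase code: the class and method names are made up, and the 60-second lease period is an assumption (this thread does not state the configured value). Time is passed in explicitly so the example runs instantly.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of server-side scanner leases; not HBase code. */
final class ScannerLeaseSketch {

  /** Stand-in for the server's "missing scanner" error. */
  static class UnknownScanner extends RuntimeException {
    UnknownScanner(long id) { super("missing scanner " + id); }
  }

  private final long leaseMillis;
  private final Map<Long, Long> expiryByScanner = new HashMap<>();
  private long nextId = 1;

  ScannerLeaseSketch(long leaseMillis) { this.leaseMillis = leaseMillis; }

  /** Opens a scanner at time nowMillis and returns its id. */
  long open(long nowMillis) {
    long id = nextId++;
    expiryByScanner.put(id, nowMillis + leaseMillis);
    return id;
  }

  /** Serves one next() call and renews the lease, or fails if it expired. */
  void next(long id, long nowMillis) {
    Long expiry = expiryByScanner.get(id);
    if (expiry == null || nowMillis > expiry) {
      expiryByScanner.remove(id);  // the server forgets the scanner
      throw new UnknownScanner(id);
    }
    expiryByScanner.put(id, nowMillis + leaseMillis);
  }

  public static void main(String[] args) {
    ScannerLeaseSketch rs = new ScannerLeaseSketch(60_000);  // assumed 60 s lease
    long id = rs.open(0);
    rs.next(id, 30_000);              // within the lease: served and renewed
    try {
      rs.next(id, 30_000 + 61_000);   // 61 s pause, like Thread.sleep(61000)
    } catch (UnknownScanner e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Once the scanner is forgotten, every later call with the same id fails the same way, which is why retrying the same request can never succeed.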

This is my bug fix:
Increment the RPC's attempt counter before invoking sendRpc(). Once the RPC's retry count exceeds hbase.client.retries.number, it stops retrying.

diff --git a/src/RegionClient.java b/src/RegionClient.java
index ad83aa1..59c0d8e 100644
--- a/src/RegionClient.java
+++ b/src/RegionClient.java
@@ -1547,6 +1547,7 @@ final class RegionClient extends ReplayingDecoder<VoidEnum> {
       final class RetryTimer implements TimerTask {
         public void run(final Timeout timeout) {
           if (isAlive()) {
+            rpc.attempt++;
             sendRpc(rpc);
           } else {
             if (rpc instanceof MultiAction) {

Another thought: would it be OK to change UnknownScannerException to a NonRecoverableException?
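
For readers unfamiliar with the retry path, the two ideas in this comment (count every retry, and stop retrying errors that cannot succeed) can be sketched as follows. This is a standalone illustration under stated assumptions, not asynchbase code: every type and method here (Rpc, Server, sendWithRetries, the two error classes) is a hypothetical stand-in, and maxRetries plays the role that hbase.client.retries.number plays in the description above.

```java
/** Hypothetical stand-ins; none of these types are asynchbase classes. */
final class BoundedRetrySketch {

  /** Marks errors that should fail the request instead of being retried. */
  static class NonRecoverableError extends RuntimeException {
    NonRecoverableError(String msg) { super(msg); }
  }

  /** Simulates the server answer for a scanner whose lease already expired. */
  static class UnknownScannerError extends NonRecoverableError {
    UnknownScannerError(long scannerId) { super("missing scanner " + scannerId); }
  }

  /** Minimal RPC holder with an attempt counter. */
  static final class Rpc {
    final long scannerId;
    int attempt = 0;
    Rpc(long scannerId) { this.scannerId = scannerId; }
  }

  /** A fake server call; transient failures are plain RuntimeExceptions. */
  interface Server {
    String call(Rpc rpc);
  }

  /**
   * Sends the RPC, retrying transient failures at most maxRetries times.
   * Non-recoverable errors are rethrown immediately, without any retry.
   */
  static String sendWithRetries(Server server, Rpc rpc, int maxRetries) {
    while (true) {
      try {
        return server.call(rpc);
      } catch (NonRecoverableError e) {
        throw e;  // retrying cannot bring an expired scanner back
      } catch (RuntimeException e) {
        if (rpc.attempt >= maxRetries) {
          throw new NonRecoverableError("gave up after " + rpc.attempt + " retries");
        }
        rpc.attempt++;  // without this increment the loop would never end
      }
    }
  }

  public static void main(String[] args) {
    Rpc rpc = new Rpc(42L);
    try {
      sendWithRetries(r -> { throw new RuntimeException("busy"); }, rpc, 3);
    } catch (NonRecoverableError e) {
      System.out.println(e.getMessage() + ", attempts=" + rpc.attempt);
    }
  }
}
```

The trade-off is the one discussed in this thread: failing fast protects the RegionServer, but the caller then has to reopen the scan itself if it wants the query to recover.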


xuming01 · 2018-11-02T09:59:46Z

In addition, I am using asynchbase 1.8.2 and HBase 2.1.0.

manolama · 2018-11-30T01:56:38Z

Yeah let's increment the count and change it to NonRecoverableException. Could you issue a PR please?

xuming01 · 2018-12-14T06:50:41Z

OK, I will do that.

seanlook · 2019-01-15T09:19:42Z

This looks like a severe bug in OpenTSDB; I hit it on the latest version. I will downgrade OpenTSDB for now.

manolama · 2019-01-27T21:11:49Z

The fix for this one will be a bit more involved if we want it to be recoverable/retry-able.
@seanlook you can also just downgrade the HBase client for OpenTSDB without downgrading the entire package if you like.

NikolaBorisov · 2019-02-15T12:17:45Z

@manolama We are also hitting this issue. Basically, in the HBase RS log we start getting this:

2019-02-14 22:33:29,653 WARN org.apache.hadoop.hbase.regionserver.RSRpcServices: Client tried to access missing scanner 0

printed many times, until the RegionServer runs out of memory and dies because of these requests. If I understand correctly, this is because OpenTSDB keeps retrying the same requests. Sadly, this takes down our whole OpenTSDB/HBase cluster, because it eventually happens to the RegionServer hosting the .META region. Can you please advise us how to downgrade, and to what version? We use HBase 2.0.0.

NikolaBorisov · 2019-02-15T12:21:08Z

One more comment: I think having this merge request merged is better than the current state. It is better to fail the user request than to loop through infinite retries with no chance of success and take down the whole cluster.

clinta · 2019-02-15T13:13:45Z

We've also encountered this on HBase 2.1.2. We've tried running with this patch, and it is certainly an improvement: the tsdb servers no longer kill our entire HBase cluster. However, the tsdb servers do stop processing requests and have to be restarted manually.

I've also tried downgrading to asynchbase 1.8.1, but that does not work with HBase 2.

NikolaBorisov · 2019-02-15T22:11:57Z

From reading the code, I think the

rpc.attempt++;

change is a good idea anyway, since we never want infinite retries. @manolama, can you give us some guidance on how to build OpenTSDB with a custom patched version of asynchbase?

openaphid · 2019-02-28T13:06:17Z

1. Clone the OpenTSDB code.
2. Place the custom patched asynchbase jar on a web server.
3. Create a file with the md5 signature of asynchbase..jar and put it under third_party/hbase/.
4. Modify third_party/hbase/include.mk, updating ASYNCHBASE_VERSION and ASYNCHBASE_BASE_URL.
5. Run ./build.sh.
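
For the md5 step, a minimal sketch of computing the hex MD5 digest of a jar with the standard Java library follows. The class and method names are hypothetical, and the exact file name and layout the build expects for the signature file are not stated in this thread, so compare against the existing files under third_party/hbase/ before relying on it.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: prints the hex MD5 digest of a file. */
final class Md5Sketch {

  /** Returns the lowercase hex MD5 digest of the given bytes. */
  static String md5Hex(byte[] data) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(data);
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: java Md5Sketch <path-to-jar>");
      return;
    }
    System.out.println(md5Hex(Files.readAllBytes(Paths.get(args[0]))));
  }
}
```

Run it with the path to the patched jar as the only argument and redirect the output into the signature file.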

dilip-devaraj · 2019-03-04T23:47:20Z

@manolama
Are there any issues with the two changes in this PR, given that it has still not been merged?

We are using HBase 0.98 and saw the UnknownScannerException problem when running tsdb 2.4 with asynchbase 1.8.2. We then switched back to tsdb 2.2 and asynchbase 1.7.1, and the problem went away. Since we still needed some features of tsdb 2.4, we tried running tsdb 2.4 with the older asynchbase 1.7.1; however, we still see some UnknownScannerException errors.
Is it safe to use tsdb 2.4 with asynchbase 1.8.2 and the custom patch above?

tgwk · 2019-10-17T08:45:08Z

Just wanted to provide some feedback on this one: we were having major issues for a while. After long investigations and trying a few other things (including adding health-checks and auto-restarts by moving to K8 -- it helped as a work-around, but we were still getting outages), we ended up applying the patch in PR 202 and that basically solved our issues. Our OpenTSDBs are now much healthier.
(we run HBase 1.1.2 at the moment)

1256040466zy · 2019-10-21T07:43:35Z

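From reading the code, I think
rpc.attempt++;
is a good idea anyway, since we never want infinite retries. @manolama Can you give us some guidance on how to build opentsdb with a custom patched version of asynchbase.

Could you tell me where the class for this source code is located?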

joshnorell · 2020-05-02T21:05:58Z

I too seem to be hitting this bug with OpenTSDB 2.4 and AsyncHBase 1.8.2. I have tried, unsuccessfully, to compile AsyncHBase with this change. I believe some operations in the code are incompatible with my JDK version (11), and I am not able to downgrade. Can someone please provide the jar file with this modification? Thank you.

iamgd67 · 2022-04-02T07:31:42Z

This seems related to change 061ec3.
Prior to that change, UnknownScannerException was not retried, so 1.8.1 should be good.

RuralHunter · 2022-04-22T07:05:46Z

Is this project dead? Why is this problem still not fixed?

manolama added the bug label Nov 30, 2018

xuming01 mentioned this issue Dec 27, 2018

change UnknownScannerException to NonRecoverableException #202

Open
