
org.hbase.async.RemoteException with CDH 5.7 #783

Closed
devaudio opened this issue Apr 22, 2016 · 17 comments

@devaudio

Similar to the last comment in #772, since upgrading to CDH 5.7 I am getting this constantly in my logs:

org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
Caused by: org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?

I ran it with DEBUG and, most of the time, especially for writes, I see it selecting the correct region servers:
id: 0x231245fe, /10.10.7.101:33714 => /10.10.7.36:60020] Sending RPC #26

but I can't figure out why it's trying to go to 0.0.0.0, and this only cropped up when I upgraded to CDH 5.7 :/

@manolama
Member

Hmm, HBase has an RPC call queue that it throttles on to prevent too many calls from backing up and OOMing or GCing the region server. Take a look at the queue size setting; it may be that you need to bump it up a bit or drop the write rate from your TSDs. (We've faced a similar issue where too many requests sent region servers into GC.) I'm not exactly sure what the setting is; we'll have to look it up.

As for the 0.0.0.0, that's just the region server saying it's listening on all IP addresses on port 60020 and it's busy. (Some folks run multiple region servers on the same host over different ports.)
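
For reference, the queue size setting being described is most likely hbase.ipc.server.max.callqueue.length, which defaults to a multiple of the handler count. A minimal hbase-site.xml sketch follows; the value is only an example, so verify the property and its default against your HBase/CDH version:

<property>
  <!-- Maximum number of calls allowed to sit in a region server call queue;
       requests beyond this are rejected with a "Call queue is full" error. -->
  <name>hbase.ipc.server.max.callqueue.length</name>
  <value>300</value>
</property>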

@devaudio
Author

Strange, though: I changed no configs, and it worked fine before the update. I did upgrade to CDH 5.7, and between this and the Solr issues I am having, I am not liking it.

@kev009
Contributor

kev009 commented Apr 25, 2016

If your heap size is decent and you have a modern number of CPU cores, try setting the handler count higher

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>64</value>
</property>

@devaudio
Author

It's 30 now. But I wonder why it cropped up when upgrading (it seems to be a CDH 5.7 issue, not an OpenTSDB issue, so this can probably be closed). Sigh.

@johann8384
Member

johann8384 commented Apr 25, 2016

Does increasing that value improve the problem? We don't mind helping you solve it; you probably won't be the last OpenTSDB user to upgrade CDH. :)

@kev009
Contributor

kev009 commented Apr 26, 2016

Not sure what CDH 5.7 looks like inside, but there was a pretty bad bug in the procedure system in HBase 1.2.0 that caused a ton of CPU spin and garbage creation: HBASE-15422.

@devaudio
Author

That fix is listed in their changelog as applied to CDH 5.7.

@devaudio
Author

So I can confirm that upping the region server handler count resolved the issue. OpenTSDB is again quick to respond. Since it was fine before updating to CDH 5.7, I'm going to chalk it up as an HBase 1.2/CDH 5.7 issue, but this workaround minimizes it.

@devaudio devaudio reopened this May 16, 2016
@devaudio
Author

devaudio commented May 16, 2016

Hmm... so I have hbase.regionserver.handler.count = 150 on my 40-node cluster, and I am still seeing two issues:

Bad Request on /api/query: Call queue is full on /0.0.0.0:60020, too many items queued ?
(this is still popping up)

and /api/query/last is not returning the last data points. Although this all appeared to start with CDH 5.7, I am wondering if my opentsdb.conf is also set up wrong for what I want to do (full config below; a sample /last call is sketched after it).
Any thoughts on the full call queue or /last?

tsd.core.auto_create_metrics=true
tsd.core.auto_create_tagks=true
tsd.core.auto_create_tagvs=true
tsd.core.meta.enable_realtime_uid=true
tsd.core.meta.enable_tsuid_tracking=true
tsd.core.meta.enable_tsuid_incrementing=true
tsd.core.meta.enable_realtime_ts=true
tsd.core.plugin_path=/usr/share/opentsdb/plugins
tsd.core.preload_uid_cache=true
tsd.core.preload_uid_cache.max_entries=300000
tsd.core.socket.timeout=0
tsd.core.storage_exception_handler.enable=false
tsd.core.tree.enable_processing=false
tsd.core.uid.random_metrics=true
tsd.http.cachedir=/tmp/opentsdb
tsd.http.query.allow_delete=true
tsd.http.request.cors_domains=*
tsd.http.request.enable_chunked=true
#tsd.http.request.max_chunk=16384
# changed to 8ish MB
tsd.http.request.max_chunk=8291456
tsd.http.show_stack_trace=true
tsd.http.staticroot=/usr/share/opentsdb/static/
tsd.mode=rw
tsd.network.async_io=false
tsd.network.bind=0.0.0.0
tsd.network.keep_alive=true
tsd.network.keepalive=true
tsd.network.port=4242
tsd.network.reuse_address=true
tsd.network.reuseaddress=true
tsd.network.backlog=30768
tsd.network.tcp_no_delay=true
tsd.network.worker_threads=24
tsd.no_diediedie=false
tsd.query.allow_simultaneous_duplicates=true
tsd.query.filter.expansion_limit=4096
tsd.query.skip_unresolved_tagvs=false
tsd.query.timeout=60000
tsd.rtpublisher.enable=false
tsd.rtpublisher.plugin=
tsd.search.enable=false
tsd.search.plugin=
tsd.stats.canonical=false
tsd.storage.compaction.flush_interval=10
tsd.storage.compaction.flush_speed=2
tsd.storage.compaction.max_concurrent_flushes=10000
tsd.storage.compaction.min_flush_threshold=100
tsd.storage.enable_appends=true
tsd.storage.enable_compaction=true
tsd.storage.fix_duplicate=true
tsd.storage.fix_duplicates=false
tsd.storage.flush_interval=1000
tsd.storage.hbase.data_table=tsdb
tsd.storage.hbase.meta_table=tsdb-meta
tsd.storage.hbase.prefetch_meta=true
tsd.storage.hbase.tree_table=tsdb-tree
tsd.storage.hbase.uid_table=tsdb-uid
tsd.storage.hbase.zk_basedir=/hbase
tsd.storage.repair_appends=true
tsd.storage.salt.buckets=36
tsd.storage.salt.width=1
tsd.storage.uid.width.metric=3
tsd.storage.uid.width.tagk=3
tsd.storage.uid.width.tagv=4
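
For reference, a minimal sketch of the /last call being discussed, using a hypothetical metric and tag and assuming the documented timeseries and back_scan parameters (back_scan asks the TSD to scan up to that many hours back instead of relying only on the tsdb-meta counters; adjust host and port for your TSD):

# Fetch the most recent data point for a hypothetical time series.
curl 'http://localhost:4242/api/query/last?timeseries=sys.cpu.user{host=web01}&back_scan=24'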

@vitaliyf
Contributor

There is OpenTSDB/asynchbase#135 tracking this issue.

@devaudio
Author

The full call queue doesn't explain why /last isn't working, though.

@devaudio
Author

I get an empty array back for everything.

@vitaliyf
Contributor

We increased handler.count to 200 and have been watching the stats for queue usage over time. There are occasional spikes that lead to this exception, and I'm fairly confident the spikes come from requests to the tsdb-meta table. We only have OpenTSDB on this cluster and the tsdb table is spread evenly across all region servers, but tsdb-meta has only 4 regions, and there is an exact correlation between the queue spikes and the region servers that host tsdb-meta regions.

Any thoughts on what would cause spikes in tsdb-meta RPC volumes?

We do have tsd.core.meta.enable_realtime_ts=true, but all other .meta. settings are left as default.
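
If the real-time TSMeta tracking does turn out to be the source of the load, disabling it is a one-line opentsdb.conf change; a minimal sketch:

# Stop writing real-time TSMeta entries to the tsdb-meta table on ingest;
# restart the TSDs for the change to take effect.
tsd.core.meta.enable_realtime_ts=false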

@vitaliyf
Contributor

So yeah, turning off tsd.core.meta.enable_realtime_ts lowered our ipc.numActiveHandler usage from spikes of 100-200 down to 0-1, and this problem is gone.
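
For anyone else chasing this, one way to watch ipc.numActiveHandler and the queue depth is the region server's JMX JSON servlet; a sketch assuming the default HBase 1.x info port of 16030 (older releases used 60030) and a hypothetical hostname:

# Dump the region server IPC metrics (numActiveHandler,
# numCallsInGeneralQueue, etc.) as JSON.
curl 'http://rs1.example.com:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=IPC'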

@devaudio
Author

Also, splitting the tsdb-meta table into more regions was a big win.
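
For reference, splits can be triggered from the HBase shell; the explicit split key below is purely a hypothetical example (without a key, HBase splits at the region midpoint):

# Ask HBase to split the tsdb-meta regions at their midpoints.
split 'tsdb-meta'
# Or split at an explicit (hypothetical) row key.
split 'tsdb-meta', "\x00\x00\x01"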

@vitaliyf
Contributor

@devaudio mind if I ask how many tsdb-meta regions you ended up with? I suspected this was necessary, so we went from 1 to 2 to 4 regions, but that only helped marginally.

@devaudio
Author

devaudio commented Jun 15, 2016

I have 174 regions for tsdb-meta, with ~3,800 metrics, ~155 billion rows
