
org.hbase.async.RemoteException with CDH 5.7 #783

Closed
devaudio opened this issue Apr 22, 2016 · 17 comments

@devaudio

Similar to the last comment in #772, since upgrading to CDH 5.7 I am getting this constantly in my logs:

org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
Caused by: org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?

I ran it with DEBUG and, most of the time, especially for writes, I see it selecting the correct region servers:
id: 0x231245fe, /10.10.7.101:33714 => /10.10.7.36:60020] Sending RPC #26

but I can't figure out why it's trying to go to 0.0.0.0, and this only cropped up when I upgraded to CDH 5.7 :/

@manolama
Member

Hmm, HBase has an RPC call queue that it throttles on to prevent too many calls from backing up and OOMing or GCing the region server. Take a look at the queue size setting; it may be that you need to bump it up a bit or drop the write rate from your TSDs. (We've faced a similar issue where too many requests sent region servers into GC.) I'm not exactly sure what the setting is; we'll have to look it up.

As for the 0.0.0.0, that's just the region server saying it's listening on all IP addresses on port 60020 and it's busy. (Some folks run multiple region servers on the same host over different ports.)
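
For reference, the queue size setting being described is most likely hbase.ipc.server.max.callqueue.length, which defaults to a multiple of the handler count. A minimal hbase-site.xml sketch follows; the value is only an example, so verify the property and its default against your HBase/CDH version:

<property>
  <!-- Maximum number of calls allowed to sit in a region server call queue;
       requests beyond this are rejected with a "Call queue is full" error. -->
  <name>hbase.ipc.server.max.callqueue.length</name>
  <value>300</value>
</property>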

@devaudio
Author

Strange, though: I changed no configs, and it worked fine before the update. I did upgrade to CDH 5.7, and between this and the Solr issues I am having, I am not liking it.

@kev009
Contributor

kev009 commented Apr 25, 2016

If your heap size is decent and you have a modern number of CPU cores, try setting the handler count higher

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>64</value>
</property>

@devaudio
Author

It's 30 now. But I wonder why it cropped up when upgrading (it seems to be a CDH 5.7 issue, not an OpenTSDB issue, so this can probably be closed). Sigh.

@johann8384
Member

johann8384 commented Apr 25, 2016

Does increasing that value improve the problem? We don't mind helping you solve it; you probably won't be the last OpenTSDB user to upgrade CDH. :)

@kev009
Contributor

kev009 commented Apr 26, 2016

Not sure what CDH 5.7 looks like inside, but there was a pretty bad bug in the procedure system in HBase 1.2.0 that caused a ton of CPU spin and garbage creation: HBASE-15422.

@devaudio
Author

That fix is listed in their changelog as applied to CDH 5.7.

@devaudio
Author

So I can confirm that upping the region server handler count resolved the issue. OpenTSDB is again quick to respond. Since it was fine before updating to CDH 5.7, I'm going to chalk it up as an HBase 1.2/CDH 5.7 issue, but this workaround minimizes it.

@devaudio devaudio reopened this May 16, 2016
@devaudio
Author

devaudio commented May 16, 2016

Hmm... so I have hbase.regionserver.handler.count = 150 on my 40-node cluster, and I am still seeing two issues:

Bad Request on /api/query: Call queue is full on /0.0.0.0:60020, too many items queued ?
(this is still popping up)

and /api/query/last is not returning the last data points. Although this all appeared to start with CDH 5.7, I am wondering if my opentsdb.conf is also set up wrong for what I want to do (full config below; a sample /last call is sketched after it).
Any thoughts on the full call queue or /last?

tsd.core.auto_create_metrics=true
tsd.core.auto_create_tagks=true
tsd.core.auto_create_tagvs=true
tsd.core.meta.enable_realtime_uid=true
tsd.core.meta.enable_tsuid_tracking=true
tsd.core.meta.enable_tsuid_incrementing=true
tsd.core.meta.enable_realtime_ts=true
tsd.core.plugin_path=/usr/share/opentsdb/plugins
tsd.core.preload_uid_cache=true
tsd.core.preload_uid_cache.max_entries=300000
tsd.core.socket.timeout=0
tsd.core.storage_exception_handler.enable=false
tsd.core.tree.enable_processing=false
tsd.core.uid.random_metrics=true
tsd.http.cachedir=/tmp/opentsdb
tsd.http.query.allow_delete=true
tsd.http.request.cors_domains=*
tsd.http.request.enable_chunked=true
#tsd.http.request.max_chunk=16384
# changed to 8ish MB
tsd.http.request.max_chunk=8291456
tsd.http.show_stack_trace=true
tsd.http.staticroot=/usr/share/opentsdb/static/
tsd.mode=rw
tsd.network.async_io=false
tsd.network.bind=0.0.0.0
tsd.network.keep_alive=true
tsd.network.keepalive=true
tsd.network.port=4242
tsd.network.reuse_address=true
tsd.network.reuseaddress=true
tsd.network.backlog=30768
tsd.network.tcp_no_delay=true
tsd.network.worker_threads=24
tsd.no_diediedie=false
tsd.query.allow_simultaneous_duplicates=true
tsd.query.filter.expansion_limit=4096
tsd.query.skip_unresolved_tagvs=false
tsd.query.timeout=60000
tsd.rtpublisher.enable=false
tsd.rtpublisher.plugin=
tsd.search.enable=false
tsd.search.plugin=
tsd.stats.canonical=false
tsd.storage.compaction.flush_interval=10
tsd.storage.compaction.flush_speed=2
tsd.storage.compaction.max_concurrent_flushes=10000
tsd.storage.compaction.min_flush_threshold=100
tsd.storage.enable_appends=true
tsd.storage.enable_compaction=true
tsd.storage.fix_duplicate=true
tsd.storage.fix_duplicates=false
tsd.storage.flush_interval=1000
tsd.storage.hbase.data_table=tsdb
tsd.storage.hbase.meta_table=tsdb-meta
tsd.storage.hbase.prefetch_meta=true
tsd.storage.hbase.tree_table=tsdb-tree
tsd.storage.hbase.uid_table=tsdb-uid
tsd.storage.hbase.zk_basedir=/hbase
tsd.storage.repair_appends=true
tsd.storage.salt.buckets=36
tsd.storage.salt.width=1
tsd.storage.uid.width.metric=3
tsd.storage.uid.width.tagk=3
tsd.storage.uid.width.tagv=4
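
For reference, a minimal sketch of the /last call being discussed, using a hypothetical metric and tag and assuming the documented timeseries and back_scan parameters (back_scan asks the TSD to scan up to that many hours back instead of relying only on the tsdb-meta counters; adjust host and port for your TSD):

# Fetch the most recent data point for a hypothetical time series.
curl 'http://localhost:4242/api/query/last?timeseries=sys.cpu.user{host=web01}&back_scan=24'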

@vitaliyf
Contributor

There is OpenTSDB/asynchbase#135 tracking this issue.

@devaudio
Author

The full call queue doesn't explain why /last isn't working, though.

@devaudio
Author

I get an empty array back for everything.

@vitaliyf
Contributor

We increased handler.count to 200 and have been watching the stats for queue usage over time. There are occasional spikes that lead to this exception, and I'm fairly confident the spikes come from requests to the tsdb-meta table. We only have OpenTSDB on this cluster and the tsdb table is spread evenly across all region servers, but tsdb-meta has only 4 regions, and there is an exact correlation between the queue spikes and the region servers that host tsdb-meta regions.

Any thoughts on what would cause spikes in tsdb-meta RPC volumes?

We do have tsd.core.meta.enable_realtime_ts=true, but all other .meta. settings are left as default.
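
If the real-time TSMeta tracking does turn out to be the source of the load, disabling it is a one-line opentsdb.conf change; a minimal sketch:

# Stop writing real-time TSMeta entries to the tsdb-meta table on ingest;
# restart the TSDs for the change to take effect.
tsd.core.meta.enable_realtime_ts=false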

@vitaliyf
Contributor

So yeah, turning off tsd.core.meta.enable_realtime_ts lowered our ipc.numActiveHandler usage from spikes of 100-200 down to 0-1, and this problem is gone.
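
For anyone else chasing this, one way to watch ipc.numActiveHandler and the queue depth is the region server's JMX JSON servlet; a sketch assuming the default HBase 1.x info port of 16030 (older releases used 60030) and a hypothetical hostname:

# Dump the region server IPC metrics (numActiveHandler,
# numCallsInGeneralQueue, etc.) as JSON.
curl 'http://rs1.example.com:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=IPC'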

@devaudio
Author

Also, splitting the tsdb-meta table into more regions was a big win.
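
For reference, splits can be triggered from the HBase shell; the explicit split key below is purely a hypothetical example (without a key, HBase splits at the region midpoint):

# Ask HBase to split the tsdb-meta regions at their midpoints.
split 'tsdb-meta'
# Or split at an explicit (hypothetical) row key.
split 'tsdb-meta', "\x00\x00\x01"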

@vitaliyf
Contributor

@devaudio mind if I ask how many tsdb-meta regions you ended up with? I suspected this was necessary, so we went from 1 to 2 to 4 regions, but that only helped marginally.

@devaudio
Author

devaudio commented Jun 15, 2016

I have 174 regions for tsdb-meta, with ~3,800 metrics, ~155 billion rows
