Describe the bug

When a compactor requests a compaction job from the manager, the following things happen.

- The request runs in the general client thrift thread pool (note: this thread pool slowly grows automatically).
- The request reads tablet metadata and then writes a conditional mutation to add the compaction to the tablet metadata (see the sketch below).

When there are lots of compactors and there is a problem writing to the metadata table, these threads will grow in an unbounded manner. This problem was observed and was probably caused by not having the fixes for #5155 and #5168; however, there are many other potential causes that could leave threads stuck or slow.
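For reference, below is a minimal sketch of what the reservation write conceptually looks like, using the public ConditionalWriter API as a stand-in for the manager's internal metadata code; the table name, row, column family/qualifier, and values are purely illustrative and not what the manager actually writes.

```java
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;

// Illustrative only: a conditional write that succeeds only if the compaction
// column is not already present, which is roughly the "reserve the job" step.
public class ReservationWriteSketch {
  public static void main(String[] args) throws Exception {
    try (AccumuloClient client = Accumulo.newClient().from("client.properties").build()) {
      ConditionalWriter writer =
          client.createConditionalWriter("accumulo.metadata", new ConditionalWriterConfig());
      try {
        ConditionalMutation cm = new ConditionalMutation("4<"); // hypothetical tablet row
        cm.addCondition(new Condition("ecomp", "ECID-1234"));   // require the column to be absent
        cm.put("ecomp", "ECID-1234", "serialized job info");    // then claim the job
        ConditionalWriter.Result result = writer.write(cm);
        System.out.println("reservation status: " + result.getStatus());
      } finally {
        writer.close();
      }
    }
  }
}
```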
Expected behavior
The number of threads concurrently executing compaction reservations should somehow be constrained. This must be done w/o blocking other manager functionality. A simple way to achieve this would be to add a semaphore around this functionality; however, that would cause the general thrift threads that execute all manager functionality to block, which could cause other problems. Maybe that is acceptable, though, since the manager thread pool always grows.
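For concreteness, here is a minimal sketch of the semaphore option; the limit and method names are made up and this is not the manager's actual code.

```java
import java.util.concurrent.Semaphore;

// Cap how many thrift handler threads can be inside the reservation code at once.
public class BoundedReservation {
  private static final int MAX_CONCURRENT_RESERVATIONS = 16; // assumed limit
  private final Semaphore permits = new Semaphore(MAX_CONCURRENT_RESERVATIONS);

  public String reserveCompactionJob(String compactorAddress) throws InterruptedException {
    permits.acquire(); // blocks the general thrift thread once the limit is hit
    try {
      return doReservation(compactorAddress); // metadata read + conditional mutation would go here
    } finally {
      permits.release();
    }
  }

  private String doReservation(String compactorAddress) {
    return "job-for-" + compactorAddress; // placeholder
  }
}
```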
One possible way to achieve this goal would be to use #5018 and have the async code that executes compaction reservations run in a limited thread pool. #5018 was created for performance reasons, but it could also easily satisfy this goal of protecting manager memory.
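A rough sketch of how that could look, assuming the handler can hand back a future as in the #5018 approach; the pool size and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// The handler returns quickly and the reservation work runs on a small dedicated
// pool, so only that pool's threads are ever tied up by slow metadata writes.
public class AsyncReservation {
  private final ExecutorService reservationPool = Executors.newFixedThreadPool(8); // assumed size

  public CompletableFuture<String> reserveCompactionJob(String compactorAddress) {
    return CompletableFuture.supplyAsync(() -> doReservation(compactorAddress), reservationPool);
  }

  private String doReservation(String compactorAddress) {
    return "job-for-" + compactorAddress; // placeholder for metadata read + conditional mutation
  }
}
```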
Another possible way to achieve this goal would be to run another thrift server w/ its own port for getting compaction jobs and limit that server's thread pool size.
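A rough sketch of that option using plain Thrift classes; the port, pool sizes, and processor wiring are placeholders, since Accumulo builds its servers through its own utilities.

```java
import org.apache.thrift.TProcessor;
import org.apache.thrift.server.TThreadPoolServer;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TTransportException;

// A second thrift server on its own port whose worker pool is capped, so job
// requests can never consume more than maxWorkerThreads threads in the manager.
public class CompactionJobServerSketch {
  public static TThreadPoolServer startJobServer(TProcessor jobProcessor)
      throws TTransportException {
    TServerSocket socket = new TServerSocket(9997); // hypothetical dedicated port
    TThreadPoolServer.Args serverArgs = new TThreadPoolServer.Args(socket)
        .processor(jobProcessor)
        .minWorkerThreads(4)   // assumed
        .maxWorkerThreads(32); // hard cap on concurrent reservation requests
    TThreadPoolServer server = new TThreadPoolServer(serverArgs);
    new Thread(server::serve, "compaction-job-thrift-server").start();
    return server;
  }
}
```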
The Jetty solution (or another async solution like gRPC) would be interesting to try because you'd get the performance benefits as well. As you pointed out, we'd need to make sure to limit things by passing in a thread pool when there is a job available here, and that should prevent trying to reserve too many jobs at once.
Another thing that would help solve this problem is #4978: if those changes were made, the compaction coordinator would no longer handle the job reservations, as that responsibility would move to the compactor itself.