SAMZA-2582: Add a metric to track container failure tracking metric for Samza #1417

Sanil15 · 2020-08-14T18:18:03Z

Changes: Added a metric, <containerId>-failure-count, to track the failure count of a single container

API Changes: None

Tests: Tested the change by deploying a job on YARN

Upgrade Instructions: None

Usage Instructions: None

cameronlee314 · 2020-08-17T18:59:23Z

Can you please update the PR description to include what issue/symptom you are fixing with this? It's unclear why you need a container failure metric for each container specifically instead of an aggregate container failure metric.

f3flight · 2020-08-18T01:47:55Z

@cameronlee314 this is needed to track individual container health issues and make informed ops decisions based on that data. It is useful for containers with host affinity, to detect unstable hosts, and for containers without affinity, to detect issues caused by partitioning (i.e. when specific traffic goes to certain containers and causes instability from time to time).
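
As a rough illustration of the ops decision described above, per-container counts could be scanned for outliers. This is a minimal sketch, not part of the PR: `UnstableContainerDetector`, its method name, and the threshold are all hypothetical, and the input map stands in for values read from the `<containerId>-failure-count` metrics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Samza): flags containers whose
// failure count has reached a given threshold.
final class UnstableContainerDetector {
  private UnstableContainerDetector() { }

  static List<String> containersAtOrAbove(Map<String, Integer> failureCounts, int threshold) {
    List<String> flagged = new ArrayList<>();
    // Copy into a TreeMap so the result is ordered by container ID.
    for (Map.Entry<String, Integer> entry : new TreeMap<>(failureCounts).entrySet()) {
      if (entry.getValue() >= threshold) {
        flagged.add(entry.getKey());
      }
    }
    return flagged;
  }
}

final class UnstableContainerDemo {
  public static void main(String[] args) {
    // Assumed sample data: container ID -> observed failure count.
    Map<String, Integer> counts = new TreeMap<>();
    counts.put("0", 0);
    counts.put("1", 5);
    counts.put("2", 1);
    // Only container "1" has failed at least 3 times.
    System.out.println(UnstableContainerDetector.containersAtOrAbove(counts, 3));
  }
}
```

An aggregate failure count would report 6 for this sample and could not distinguish one container failing repeatedly from failures spread evenly across all containers.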

rmatharu-zz · 2020-08-19T17:48:16Z

docs/learn/documentation/versioned/operations/monitoring.md

@@ -369,6 +369,7 @@ All \<system\>, \<stream\>, \<partition\>, \<store-name\>, \<topic\>, are popula
 | | expired-preferred-host-requests | Number of expired resource-requests-for -preferred-host received by the cluster manager. |
 | | expired-any-host-requests | Number of expired resource-requests-for -any-host received by the cluster manager. |
 | | host-affinity-match-pct | Percentage of non-expired preferred host requests. This measures the % of resource-requests for which host-affinity provided the preferred host. |
+| | \<containerId\>-failure-count | Number of times a container identified by containerId has failed |


I believe we decided to use "processorId" for 0,1,2..

Sanil15 (Aug 21, 2020): That terminology is used internally in the code as the naming convention for javadocs. This is the public-facing metrics page, where readers do not need the distinction between processorId and containerId.

rmatharu-zz · 2020-08-19T18:08:25Z

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java

@@ -472,6 +479,9 @@ void onResourceCompletedWithUnknownStatus(SamzaResourceStatus resourceStatus, St
    LOG.info("Container ID: {} for Processor ID: {} failed with exit code: {}.", containerId, processorId, exitStatus);
    Instant now = Instant.now();
    state.failedContainers.incrementAndGet();
+    if (state.perProcessorFailureCount.get(processorId) != null) {
+      state.perProcessorFailureCount.get(processorId).incrementAndGet();
+    }


else {
Log.error("Unknown/orphan container") ??
}

Sanil15 (Aug 21, 2020): This method is a helper invoked from onResourceCompleted(...), which already checks that the processorId is legitimate. Remember that we also get redundant notifications, so we cannot declare a container orphan/unknown here. We need more testing before we can deem callback scenarios as orphans, and that work is beyond the scope of this change.
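
The guarded increment discussed in this thread can be sketched in isolation as follows. This is an assumption-laden illustration, not the PR's code: `FailureCounts`, `track`, and `recordFailure` are hypothetical names, and only the `perProcessorFailureCount` field mirrors the diff. IDs that were never registered are skipped silently instead of being logged as orphans, matching the reasoning above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for the relevant slice of application state;
// this is not Samza's SamzaApplicationState.
final class FailureCounts {
  final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>();

  /** Starts tracking a processor at zero failures (hypothetical helper). */
  void track(String processorId) {
    perProcessorFailureCount.putIfAbsent(processorId, new AtomicInteger(0));
  }

  /**
   * Increments the count for a tracked processor. Unknown IDs are ignored
   * rather than reported, since a redundant or late notification is not
   * necessarily an orphan. Returns true when a counter was updated.
   */
  boolean recordFailure(String processorId) {
    // Single lookup, so the null check and the increment see the same entry.
    AtomicInteger counter = perProcessorFailureCount.get(processorId);
    if (counter == null) {
      return false;
    }
    counter.incrementAndGet();
    return true;
  }
}
```

Returning a boolean is a design choice of this sketch only; it lets a caller decide later whether an untracked ID deserves a log line without changing the counting logic.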

rmatharu-zz · 2020-08-19T18:10:36Z

samza-core/src/main/java/org/apache/samza/clustermanager/SamzaApplicationState.java

+
+  /**
+   *  Map of the Samza processor ID to the count of failed attempts
+   *  Modified by AMRMCallbackThread
+   */
+  public final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>(0);
+


If this information is only useful for metric-emission, does it need to be stored in "state" ?

Sanil15 (Aug 21, 2020): That is correct. We could wire the metrics registry directly into ContainerManager and ContainerAllocator and instantiate new gauges and counters in that code. However, all metrics related to the AM live in the ContainerProcessManagerMetrics class, which holds the MetricsRegistry and SamzaApplicationState, so one does not need to wire the MetricsRegistry individually into each AM class (ContainerManager, ContainerAllocator). That is the justification for maintaining this state variable to wire metrics; I feel this approach is cleaner.

Sanil15 (Aug 21, 2020): Check SamzaApplicationState; most of the state there is used only for metric emission.
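
The design argued for here, counters kept in shared state and a single metrics class exposing them, might look roughly like the sketch below. Every type in it is a hypothetical stand-in: `SimpleGaugeRegistry` is not Samza's MetricsRegistry, and `AppState` and `ManagerMetrics` only imitate the roles of SamzaApplicationState and ContainerProcessManagerMetrics. The metric names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical minimal registry: a gauge is a name plus a value supplier.
final class SimpleGaugeRegistry {
  private final Map<String, Supplier<Integer>> gauges = new LinkedHashMap<>();

  void register(String name, Supplier<Integer> valueSupplier) {
    gauges.put(name, valueSupplier);
  }

  /** Reads the current value of every registered gauge. */
  Map<String, Integer> snapshot() {
    Map<String, Integer> values = new LinkedHashMap<>();
    gauges.forEach((name, supplier) -> values.put(name, supplier.get()));
    return values;
  }
}

// Stand-in for shared application state: plain counters with no
// dependency on any metrics API.
final class AppState {
  final AtomicInteger failedContainers = new AtomicInteger(0);
  final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>();
}

// The only class that touches the registry; other components just
// update AppState.
final class ManagerMetrics {
  private final AppState state;
  private final SimpleGaugeRegistry registry;

  ManagerMetrics(AppState state, SimpleGaugeRegistry registry) {
    this.state = state;
    this.registry = registry;
  }

  void start() {
    registry.register("failed-containers", state.failedContainers::get);
    // One gauge per processor that is already present in the state map.
    state.perProcessorFailureCount.forEach(
        (processorId, count) -> registry.register(processorId + "-failure-count", count::get));
  }
}
```

Because the gauges read the counters lazily, any component that increments a counter in `AppState` after `start()` is reflected in the next snapshot without holding a reference to the registry. The trade-off raised in the review still applies: the state object accumulates fields that exist only for metric emission.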

rmatharu-zz · 2020-08-19T18:11:16Z

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java

@@ -236,6 +237,12 @@ public void start() {
    Map<String, String> processorToHostMapping = state.jobModelManager.jobModel().getAllContainerLocality();
    containerAllocator.requestResources(processorToHostMapping);

+    // Initialize the per processor failure count to be 0
+    processorToHostMapping.keySet().forEach(processorId -> {


See the comment below on how/why this information isn't really "state".

Sanil15 (Aug 21, 2020): Replied there.

rmatharu-zz

Took a pass; this likely requires some simplification.

Per container failure tracking metric for Samza

09a030e

Add teardown to the tests to prevent memory leaks

4189ae8

rmatharu-zz reviewed Aug 19, 2020


Sanil15 added 3 commits September 8, 2020 17:17

Address Rays comments

f6d6214

Rebasing with Master

1fe3c09

Fix the shutdown sequence for container process manager test

5ebf385


