Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline overview shows wrong times and wrong color #362

Closed
galindro opened this issue Mar 12, 2024 · 29 comments
Closed

Pipeline overview shows wrong times and wrong color #362

galindro opened this issue Mar 12, 2024 · 29 comments
Labels
bug Something isn't working

Comments

@galindro
Copy link

Jenkins and plugins versions report

Environment
Jenkins version: 2.440.1
pipeline-graph-view-plugin version: 230.vcd408a_1308dc

What Operating System are you using (both controller, and any agents involved in the problem)?

RHEL8

Reproduction steps

Execute any pipeline with multiple steps within multiple parallel stages.

Expected Results

The counter of previous steps shouldn't increment and the undone steps should be blue, not green.

Actual Results

This weird behavior:

output

Anything else?

No response

Are you interested in contributing a fix?

No response

@stuartrowe
Copy link
Contributor

@timbrown5 @timja I just recently fixed in-progress steps showing blue instead of green (#345). I'm now seeing this too after taking the latest version. I'm assuming #285 may have changed the status being returned for nodes in this situation? I'll try to debug this later tonight.

@stuartrowe
Copy link
Contributor

@timbrown5 NodeRelationship.java#L124 is the cause of running step nodes being reported as complete. Removing this check and letting StatusAndTiming.computeChunkStatus2 handle all cases (NodeRelationship.java#L132) solves the issue for a simple repro pipeline:

stage("sleep") {
    sleep(60)
}

Unfortunately with that change alone there are a few test regressions that I need to dig into.

@timbrown5
Copy link
Contributor

Ahh, that was added to avoid cases where finished steps that were part of an unfinished block were reported as running (i.e. there was no BlockEndNode for the after parameter).
If `stepAtomNodes should be excluded from this check, you might be able to simple do:

if (this.start != this.end && this.after == null && !isRunning(run)) {

(as for steps, relations have the start and end to the same StepAtomNode).
or:

if (this.end instanceof BlockEndNode && this.after == null && !isRunning(run)) {

should make this check only trigger for blocks.

My thinking (maybe wrong) is that if a Block has an end node, it should be finished (by definition?). The StatusAndTimingApi checks for an after node, which we can't assign until the parent block has finished. This causes problems for stages blocks with long running siblings, as it will be reported running until those siblings have finished.

@timbrown5
Copy link
Contributor

timbrown5 commented Mar 13, 2024

For the incorrect timing, I think it's similar to the issue I found with the getting the status -if there is no after it is assumed to still be running.
https://github.com/jenkinsci/pipeline-graph-analysis-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/pipelinegraphanalysis/StatusAndTiming.java#L376
This could be a bug in how TreeScanner finds after nodes. Where step B follows step A in a block, step A's after should be step B - maybe this isn't the case.

I did consider assigning the after stage to be the same as end in cases wee we cannot find after, but I think this could end with the step total time being set to 0 (as total time = endTime-startTIme - where endTime is likely the after start time and after==end==start).
https://github.com/jenkinsci/pipeline-graph-analysis-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/pipelinegraphanalysis/StatusAndTiming.java#L394

@timbrown5
Copy link
Contributor

Also, do you have a simple Pipeline to reproduce it?

@stuartrowe
Copy link
Contributor

Also, do you have a simple Pipeline to reproduce it?

stage("sleep") {
    sleep(60)
}

The sleep step will appear to be completed even though it's still running.

@galindro
Copy link
Author

galindro commented Mar 13, 2024

Please, test with parallel stages as well. I have identified another weird behavior that perhaps can be solved by a fix for this ticket:

The stage build:export:lib got duplicated and one has marked as skipped. However, it should show only one stage called build:export:lib per flow without skipped status.

image

@galindro
Copy link
Author

galindro commented Mar 13, 2024

Right after I wrote the previous message, it a 3rd build:export:lib stage was created and marked as skipped in one flow and as completed on the other. It's totally random and nonsense:

image

@galindro
Copy link
Author

BTW, this isn't a nested parallel stages. Just a common parallel stages.

@timja
Copy link
Member

timja commented Mar 13, 2024

Might be fixed by jenkinsci/pipeline-model-definition-plugin#700

@galindro
Copy link
Author

but it doesn't fix the wrong color and timing, correct?

@timja
Copy link
Member

timja commented Mar 13, 2024

No thats just around skipped stages

@stuartrowe
Copy link
Contributor

@timbrown5 both suggested solutions had the same issue with the incorrect completed status of stages in a parallel block.
Screen Recording 2024-03-13 at 11 20 43 PM

Stage Sleep B changes to complete as soon as Sleep A actually completes. The sleep step within Sleep B will still run for another 15s.

Here's the pipeline I used:


stage("Build") {
    parallel([
        "A": { 
            stage("Sleep A") {
                sleep(15)
            }
            stage("Build A") {
                node {
                    sh("echo building A")
                }
            }
        },
        "B": {
            stage("Sleep B") {
                sleep(30)
            }
            stage("Build B") {
                node {
                    sh("echo building B")
                }
            }
            stage("Test B") {
                node {
                    sh("echo testing B")
                }
            }
        }
    ])
}

@stuartrowe
Copy link
Contributor

Relying on StatusAndTiming.computeChunkStatus2 only introduces a another set of issues:

Screen Recording 2024-03-13 at 11 34 49 PM

Stage Sleep B now completes (incorrectly) while Sleep A is still shown as in progress (incorrectly).

@timbrown5
Copy link
Contributor

Relying on StatusAndTiming.computeChunkStatus2 only introduces a another set of issues:
The StatusAndTiming API make a lot of unhelpful assumptions (in our case at least) and isn't well documented - the help docs literally say to read the code to figure out how to use it 😩. We have been advised not to use it (BlueOcean often prefers to generate it's own status/timing before replying on it):
https://issues.jenkins.io/browse/JENKINS-72771?focusedId=445879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-445879

The check you mention is a hedge - if we didn't find an after node, but know the node/block isn't running, then calculate the status ourselves. We might be able to remove it if we are able to ensure we always get an after node when a step/block is finished (should be possible if the graph is well behaved in this instace - and it's often not).
I think, as you say, the error comes from:

if (this.after == null && !isRunning(run)) {
    return new NodeRunStatus(getFinishNodeStatus());
}

This should only fire when the node is not running, but obviously this step is running.

@timbrown5
Copy link
Contributor

@stuartrowe your issue should be fixed by #364 (if I've understood correctly). If you are able to check, that would be great :).

@galindro are you able to share a pipeline which reproduces the timing bug?
I think we are missing a condition here, but it would be good to check:
https://github.com/jenkinsci/pipeline-graph-view-plugin/blob/main/src/main/java/io/jenkins/plugins/pipelinegraphview/treescanner/NodeRelationshipFinder.java#L91
(if we hit a step block inside a set of steps (e.g. a retry), then we might need to set the after node of the step before the block to the start of the step block).

@galindro
Copy link
Author

galindro commented Mar 14, 2024

My pipeline is quite complex. It is based on a library that dynamically creates stages and then executes them in parallel. I can't share it due to IP (intelectual property).

However, I am sure you can simulate it with just some parallel stages like the one @stuartrowe used because mine is very similar. Put some more sleeps on first stages: e.g.: a time greater than 2 minutes on the first and second and a time greater than 30 minutes on the last.

The point here is that it seems to be something very random. The behaviors aren't the same. For every execution they flip.

@stuartrowe
Copy link
Contributor

stuartrowe commented Mar 14, 2024

@timbrown5 no, unfortunately the stage node is still being marked complete prematurely. Here's the same job running after checking out your branch. Note that the sleep 30 step node is still in progress, but its parent stage Sleep B is marked complete as soon as the sleep 15 step from its sibling stage Sleep A completes.

Screen Recording 2024-03-14 at 12 13 57 PM

@galindro
Copy link
Author

This sounds to me a typical issue with memory sharing... Perhaps the same object is being used twice?

@stuartrowe
Copy link
Contributor

stuartrowe commented Mar 15, 2024

@timbrown5 I tested against your latest commit (225b856) and the stage / step statuses are all correct now in my sample pipeline 🎉

The only minor issue I see left is that the parallel branch A seems to inherit the timing info from branch B when B finally completes. Branch A should take roughly 15s because it contains a sleep 15 step, while Branch B should take roughly 30s because it contains a sleep 30 step. The timing info of all the stages within both branches is correct.

I appreciate all of your hard work on these issues! The console view is amazing now 👍

image

@timbrown5
Copy link
Contributor

timbrown5 commented Mar 15, 2024

Hi @stuartrowe,
Glad it fixes most of the stats/timing issues you were seeing, and thanks for testing.

I think the parallel timing behaviour is due to the StatusAndTiming API:
We call it here:
https://github.com/jenkinsci/pipeline-graph-view-plugin/blob/main/src/main/java/io/jenkins/plugins/pipelinegraphview/treescanner/ParallelBlockRelationship.java#L132C46-L132C74
It calls computeChunkTiming for each start/end block pairing but sets after to the parallel block end:
https://github.com/jenkinsci/pipeline-graph-analysis-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/pipelinegraphanalysis/StatusAndTiming.java#L463
If we wanted different behaviour (which I think we do) then we would probably want to copy that function and pass the after that we recorded instead - which #364 will hopefully set reliably.

I'm won't be able to look at this for 10 days as I am going away, but happy for someone else to make changes (or to look when I get back).

@galindro WRT the duplicate steps issue, could you raise another issue please? It doesn't seem related to the Stating/Timing issues and TBH I'm not even sure where to start with it - maybe you can give copy of the debug logs to help narrow it down?
(If you set up a logger for io.jenkins.plugins.pipelinegraphview.treescanner.* and level All that might get us further - assuming the Jenkins log listener supports wildcards 🤔 ).

==Update==
Actually, the above suggestion might not work as I'm not sure what the after branch would be if it's not the parallel end node. Does Blue Ocean behave differently?

@timja
Copy link
Member

timja commented Mar 15, 2024

(If you set up a logger for io.jenkins.plugins.pipelinegraphview.treescanner.* and level All that might get us further - assuming the Jenkins log listener supports wildcards 🤔 ).

Just do io.jenkins.plugins.pipelinegraphview.treescanner as the log recorder

@galindro
Copy link
Author

Yes sir!

@timbrown5
Copy link
Contributor

@galindro I have created #367 for your duplicate step issue. If we move the conversation to that but then we might be able to close this one.

@galindro
Copy link
Author

Just sent the logs there @timbrown5 / @timja

@timja
Copy link
Member

timja commented Mar 18, 2024

Closing as split into other issues

@timja timja closed this as not planned Won't fix, can't repro, duplicate, stale Mar 18, 2024
@galindro
Copy link
Author

#367 - Duplicate stages

And what about the timing issue? Was it fixed by #364 ?

@timja
Copy link
Member

timja commented Mar 18, 2024

Yes

@timbrown5
Copy link
Contributor

Or tracked as #368

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants