
Large output in shell-script breaks pipelines in chaining-macro (since 0.9.4) #135

Closed
sroidl opened this issue Oct 5, 2016 · 4 comments

@sroidl

sroidl commented Oct 5, 2016

While trying to update to a recent release, we encountered an issue in a pipeline that uses the `stepsupport/always-chaining` macro:

If a step produces a lot of output at once, this regularly breaks the pipeline.
The visible effects are: the output stops in the middle of the step; the step itself never finishes (it appears to be deadlocked); the UI is still responsive, but no new builds can be triggered, making the pipeline unusable until a restart.
After various tests, it seems that a shell script producing a lot of output at the same time is what breaks the pipeline.

The bug can be reproduced in (probably) all versions since 0.9.4; more specifically, commit cee9694 seems to have introduced the problem.

Steps to reproduce:

  1. Check out lambdacd at version 0.9.4 (6b09355).
  2. Apply the attached patch to version 0.9.4 (it also applies to newer commits, but needs a short merge).
    The patch adds two shell scripts to the dev-resources directory and changes the example pipeline to contain a chaining step that executes those scripts (see the sketch right after this list). The first script calls the second (to resemble our real-life setup) and the second script produces a large chunk of output.
  3. ./go setup
  4. ./go serve
  5. Open the UI and trigger a build.
  6. The step _chaining-long-output_ should run. After a while it should start printing lines but then stop in the middle of its task (the goal is to print 3000 lines).
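
A rough sketch of what the patched example step looks like (illustrative only, not the exact patch; the namespace, script paths and the `always-chaining`/`shell/bash` forms follow what I believe is the current LambdaCD API):

```clojure
(ns todopipeline.steps
  (:require [lambdacd.steps.shell :as shell]
            [lambdacd.steps.support :refer [always-chaining]]))

;; Sketch of the step added by the patch: a chaining step running a wrapper
;; script that calls a second script printing ~3000 lines in one go
;; (e.g. `for i in $(seq 1 3000); do echo "line $i"; done`).
(defn chaining-long-output [args ctx]
  (always-chaining args ctx
    (shell/bash injected-ctx "./dev-resources" "./call-long-output.sh")))
```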

Disclaimer: Unfortunately the bug is not 100% reproducible. In some cases, especially when the machine running the pipeline has been started fresh, the step does not get stuck. After a few restarts of the pipeline it should break and show the described behavior. This might indicate problems with memory, the number of threads, etc.

@sroidl
Author

sroidl commented Oct 5, 2016

A workaround seems to be avoiding the always-chaining macro in your pipeline.
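
In our case that meant running the script through a plain step instead of wrapping it in a chaining block. A minimal sketch (illustrative, using the names from the repro patch):

```clojure
(ns todopipeline.steps
  (:require [lambdacd.steps.shell :as shell]))

;; Workaround sketch: run the script as an ordinary step instead of inside
;; an always-chaining block, so no step-result inheritance is involved.
(defn call-long-output [args ctx]
  (shell/bash ctx "./dev-resources" "./call-long-output.sh"))
```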

@flosell flosell added the bug label Oct 6, 2016
@flosell
Owner

flosell commented Oct 6, 2016

Thanks for the very detailed bug report! I'll try to reproduce this the next chance I get and get back to you as soon as possible. I refactored this part of the codebase a lot recently so there is a chance something went wrong there.

@flosell
Owner

flosell commented Oct 10, 2016

Just to give an update:

I reproduced the issue and started investigating but haven't found the root cause yet. My current guess is an issue with very large and frequent step results in general. The commit you pointed out (cee9694) most likely just increased the likelihood of the problem surfacing because it makes step results available per chaining-step as well as overall, essentially duplicating the data.
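
To illustrate what I mean by duplicating: roughly (a hypothetical shape, not the exact structure), a chaining step's result now carries both the merged output and each child's own result, so a large :out string exists more than once:

```clojure
;; Hypothetical result map of a chaining step after cee9694: the merged
;; output and the per-child results both contain the large :out string.
{:status  :running
 :out     "line 1\nline 2\n... line 3000"
 :outputs {[1 1] {:status :success
                  :out    "line 1\nline 2\n... line 3000"}}}
```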

I'll keep you posted...

@flosell
Owner

flosell commented Oct 16, 2016

Ok, finally figured out what the issue was and fixed it: the size of the step results doesn't matter, it's the frequent updates being sent that fill up the event-bus. Normally this isn't an issue because the updates get read by event-bus subscribers and life goes on.

Unfortunately, this doesn't work when step-result inheritance comes into play (which is what happens under the covers of the chaining macro to merge the outputs of the different steps in a chain): the parent listens on the event-bus for updates from the child and sends an update of its own whenever the child updates. When the child sends lots of updates, the event-bus saturates; the parent then blocks while trying to send its own updates, which means it also stops consuming updates, and the whole thing deadlocks.

It's now fixed by adding a sliding buffer to the inheritance: if the event-bus is saturated, we compress events and just send the most up-to-date version of the inherited state once the event-bus is unblocked again.
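
The principle, sketched with plain core.async rather than the actual LambdaCD code:

```clojure
(require '[clojure.core.async :as async])

;; With a fixed-size buffer, a producer that outpaces the consumer eventually
;; blocks on >!!; in the inheritance case the parent is both producer and
;; consumer, so blocking here is what deadlocked everything.
(def fixed-ch (async/chan 10))

;; With a sliding buffer, old values are dropped instead of blocking the put,
;; so only the most recent inherited state is left for the consumer.
(def sliding-ch (async/chan (async/sliding-buffer 1)))

(doseq [i (range 3000)]
  (async/>!! sliding-ch {:inherited-line i}))   ; puts never block

(async/<!! sliding-ch)   ; => {:inherited-line 2999}, the latest update only
```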

Also, while debugging this I noticed that every update also flushes to disk, which is probably a bit too much. I'll address that in #137.

flosell added a commit that referenced this issue Oct 16, 2016
flosell added a commit that referenced this issue Oct 23, 2016
flosell added a commit that referenced this issue Oct 23, 2016
flosell added a commit that referenced this issue Feb 5, 2017
flosell added a commit that referenced this issue Feb 5, 2017