
Large output in shell-script breaks pipelines in chaining-macro (since 0.9.4) #135

Closed
sroidl opened this issue Oct 5, 2016 · 4 comments

@sroidl

sroidl commented Oct 5, 2016

While trying to update to a recent release, we encountered an issue in a pipeline that uses the `stepsupport/always-chaining` macro:

If a step produces a lot of output at once, this regularly breaks the pipeline.
The visible effects are: the output stops in the middle of the step; the step itself never finishes (it appears to be deadlocked); the UI is still responsive, but no new builds can be triggered, making the pipeline unusable until a restart.
After various tests, it seems that a shell script producing a lot of output at the same time is what breaks the pipeline.

The bug can be reproduced in (probably) all versions since 0.9.4; more specifically, commit cee9694 seems to have introduced the problem.

Steps to reproduce:

  1. Check out lambdacd at version 0.9.4 (6b09355).
  2. Apply the attached patch to version 0.9.4 (it also applies to newer commits, but needs a short merge).
    The patch adds two shell scripts to the dev-resources directory and changes the example pipeline to contain a chaining step that executes those scripts (see the sketch right after this list). The first script calls the second (to resemble our real-life setup) and the second script produces a large chunk of output.
  3. ./go setup
  4. ./go serve
  5. Open the UI and trigger a build.
  6. The step _chaining-long-output_ should run. After a while it should start printing lines but then stop in the middle of its task (the goal is to print 3000 lines).
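
A rough sketch of what the patched example step looks like (illustrative only, not the exact patch; the namespace, script paths and the `always-chaining`/`shell/bash` forms follow what I believe is the current LambdaCD API):

```clojure
(ns todopipeline.steps
  (:require [lambdacd.steps.shell :as shell]
            [lambdacd.steps.support :refer [always-chaining]]))

;; Sketch of the step added by the patch: a chaining step running a wrapper
;; script that calls a second script printing ~3000 lines in one go
;; (e.g. `for i in $(seq 1 3000); do echo "line $i"; done`).
(defn chaining-long-output [args ctx]
  (always-chaining args ctx
    (shell/bash injected-ctx "./dev-resources" "./call-long-output.sh")))
```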

Disclaimer: Unfortunately the bug is not 100% reproducible. In some cases, especially when the machine running the pipeline has been started fresh, the step does not get stuck. After a few restarts of the pipeline it should break and show the described behavior. This might indicate problems with memory, the number of threads, etc.

@sroidl
Author

sroidl commented Oct 5, 2016

A workaround seems to be avoiding the always-chaining macro in your pipeline.
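
In our case that meant running the script through a plain step instead of wrapping it in a chaining block. A minimal sketch (illustrative, using the names from the repro patch):

```clojure
(ns todopipeline.steps
  (:require [lambdacd.steps.shell :as shell]))

;; Workaround sketch: run the script as an ordinary step instead of inside
;; an always-chaining block, so no step-result inheritance is involved.
(defn call-long-output [args ctx]
  (shell/bash ctx "./dev-resources" "./call-long-output.sh"))
```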

@flosell flosell added the bug label Oct 6, 2016
@flosell
Owner

flosell commented Oct 6, 2016

Thanks for the very detailed bug report! I'll try to reproduce this the next chance I get and get back to you as soon as possible. I refactored this part of the codebase a lot recently so there is a chance something went wrong there.

@flosell
Owner

flosell commented Oct 10, 2016

Just to give an update:

I reproduced the issue and started investigating but haven't found the root cause yet. My current guess is an issue with very large and frequent step results in general. The commit you pointed out (cee9694) most likely just increased the likelihood of the problem surfacing because it makes step results available per chaining-step as well as overall, essentially duplicating the data.
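
To illustrate what I mean by duplicating: roughly (a hypothetical shape, not the exact structure), a chaining step's result now carries both the merged output and each child's own result, so a large :out string exists more than once:

```clojure
;; Hypothetical result map of a chaining step after cee9694: the merged
;; output and the per-child results both contain the large :out string.
{:status  :running
 :out     "line 1\nline 2\n... line 3000"
 :outputs {[1 1] {:status :success
                  :out    "line 1\nline 2\n... line 3000"}}}
```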

I'll keep you posted...

@flosell
Owner

flosell commented Oct 16, 2016

Ok, finally figured out what the issue was and fixed it: the size of the step results doesn't matter, it's the frequent updates being sent that fill up the event-bus. Normally this isn't an issue because the updates get read by event-bus subscribers and life goes on.

Unfortunately, this doesn't work when step-result inheritance comes into play (which is what happens under the covers of the chaining macro to merge the outputs of the different steps in a chain): the parent listens on the event-bus for updates from the child and sends an update of its own whenever the child updates. When the child sends lots of updates, the event-bus saturates; the parent then blocks while trying to send its own updates, which means it also stops consuming updates, and the whole thing deadlocks.

It's now fixed by adding a sliding buffer to the inheritance: if the event-bus is saturated, we compress events and just send the most up-to-date version of the inherited state once the event-bus is unblocked again.
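
The principle, sketched with plain core.async rather than the actual LambdaCD code:

```clojure
(require '[clojure.core.async :as async])

;; With a fixed-size buffer, a producer that outpaces the consumer eventually
;; blocks on >!!; in the inheritance case the parent is both producer and
;; consumer, so blocking here is what deadlocked everything.
(def fixed-ch (async/chan 10))

;; With a sliding buffer, old values are dropped instead of blocking the put,
;; so only the most recent inherited state is left for the consumer.
(def sliding-ch (async/chan (async/sliding-buffer 1)))

(doseq [i (range 3000)]
  (async/>!! sliding-ch {:inherited-line i}))   ; puts never block

(async/<!! sliding-ch)   ; => {:inherited-line 2999}, the latest update only
```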

Also, while debugging this I noticed that every update also flushes to disk, which is probably a bit too much. I'll address that in #137.

flosell added a commit that referenced this issue Oct 16, 2016
flosell added a commit that referenced this issue Oct 23, 2016
flosell added a commit that referenced this issue Oct 23, 2016
flosell added a commit that referenced this issue Feb 5, 2017
flosell added a commit that referenced this issue Feb 5, 2017