Ideas for structuring consumer definitions and isolating independent components #156
This is a functioning partial implementation of the idea: https://github.com/getsentry/sentry/blob/replays-use-run-task-strategy/src/sentry/replays/consumers/recording/factory.py

```python
return Pipeline(
    steps=[
        # Catch and log any exceptions that occur during processing.
        Partial(LogExceptionStep, message="Invalid recording specified."),
        # Deserialize the msgpack payload.
        Apply(deserialize),
        # Initialize a sentry transaction.
        Apply(init_context),
        # Cache chunk messages.
        Apply(cache_chunks),
        # Remove chunk messages from the pipeline. They should never be committed.
        Filter(filter_chunks),
        # Run the capstone messages in a thread-pool.
        Partial(
            RunTaskInThreads,
            processing_function=store,
            concurrency=16,
            max_pending_futures=32,
        ),
    ],
    # Batch capstone messages and commit when called.
    next_pipeline=CommitOffsets(commit),
)
```
And this is the supporting library code:

```python
import functools
from typing import Any, Callable, List, Optional, TypeVar

# Assumed arroyo imports (module paths may differ by version):
# from arroyo.types import Message
# from arroyo.processing.strategies import ProcessingStrategy
# from arroyo.processing.strategies.filter import FilterStep
# from arroyo.processing.strategies.transform import TransformStep

TPayload = TypeVar("TPayload")
TReplaced = TypeVar("TReplaced")


def Apply(
    function: Callable[[Message[TPayload]], TReplaced]
) -> Callable[[ProcessingStrategy[TReplaced]], TransformStep[TPayload]]:
    return lambda next_step: TransformStep(function=function, next_step=next_step)


def Filter(
    function: Callable[[Message[TPayload]], bool]
) -> Callable[[ProcessingStrategy[TPayload]], FilterStep[TPayload]]:
    return lambda next_step: FilterStep(function=function, next_step=next_step)


def Partial(
    strategy: Callable[[ProcessingStrategy[TReplaced]], ProcessingStrategy[TPayload]],
    **kwargs: Any,
) -> Callable[[ProcessingStrategy[TReplaced]], ProcessingStrategy[TPayload]]:
    return lambda next_step: strategy(next_step=next_step, **kwargs)


def Pipeline(
    steps: List[Callable[[ProcessingStrategy[TPayload]], ProcessingStrategy[TReplaced]]],
    next_pipeline: Optional[ProcessingStrategy[TPayload]] = None,
) -> ProcessingStrategy[TPayload]:
    if not steps:
        raise ValueError("Pipeline misconfigured. Missing required step functions.")
    # reduce takes its function, iterable, and initializer positionally.
    return functools.reduce(
        lambda prev_step, step_fn: step_fn(prev_step),
        reversed(steps),
        next_pipeline,
    )
```

I'm currying the strategy constructors so each step can be declared before its `next_step` exists. Having done this I think we can simplify the ask in the original issue significantly. This system can be supported without library code if we allow the `next_step` property to be defined after initialization. Then we could initialize these classes in the `steps` list directly.
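For illustration, a minimal sketch of that idea, assuming the strategy constructors were changed to make `next_step` optional and assignable after construction (they are not today):

```python
# Hypothetical: construct steps in execution order without a next_step,
# then wire each step to its successor afterwards.
steps = [
    TransformStep(function=deserialize),
    FilterStep(function=filter_chunks),
    RunTaskInThreads(processing_function=store, concurrency=16, max_pending_futures=32),
]
successors = steps[1:] + [CommitOffsets(commit)]
for step, next_step in zip(steps, successors):
    step.next_step = next_step  # assumes next_step becomes a settable property

consumer_strategy = steps[0]
```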
I'm curious how this could be made type-safe so that you can have strategies transform T1 into T2, then ensure that the next step actually accepts T2 as payload. That step potentially returns a T3. In your examples, every payload passed to each step appears to share the same type. An API like this would probably work:

```python
return (
    Pipeline()
    .and_then(lambda next_step: FilterStep(next_step, function=filter_chunks))
    .and_then(lambda next_step: RunTask(next_step, function=preprocess_chunks))
    .and_then(lambda next_step: RunTask(next_step, function=process_chunks))
    .finish(CommitOffsets(commit))
)
```

You want to somehow ensure that the first step accepts the pipeline's input type and that each later step accepts the previous step's output. I am not even sure if type inference works properly even if you use the builder pattern. The closures/lambdas are probably necessary because, when forwarding `next_step`, a step cannot be constructed until its successor exists.
Also @cmanallen, I see that the file in your OP no longer exists. Have you stopped pursuing this style altogether, and for what reason?
@untitaker I adopted the existing formula and eventually the consumer was so simplified that the pipeline concept was not needed.
@untitaker I have stopped pursuing this. In general, the less code, frameworks, and boilerplate there is to learn, the better. I'd much rather duplicate simple work than have to learn highly abstracted library code. Having lots of abstractions between the developer and Kafka is just going to slow people down.
Arroyo Ideas
Intro
I’m thinking of this as more of a discussion point and not a definitive answer to the question: “how should we structure pipelines in our consumer?” If you like it – great; if you want changes – great; if you think it’s useless – that’s fine too! I’m happy with the current Arroyo.
Goal
Users of the library should rarely define strategies. Not because it’s bad but because it’s unnecessary. We should be able to expose 99% of the range of behavior as primitives. The user can then mix and match primitives to achieve a result.
Another goal is to remove all thinking about Kafka: allow the maintainers of Arroyo primitives (in Sentry or Arroyo) to define sane interactions with the consumer that can be blindly reused by junior developers.
Problems with the current implementation
“Problem” is used loosely here. The current implementation is really good. This refactor changes how steps are composed and normalizes some naming to “XYZ Step” or “XYZ Pipeline”. I think doing so gives better reuse and clearer semantics for new users trying to implement their consumers.
Naming can be updated. These are just examples. Any naming outside of the “Step” and “Pipeline” suffixes should be ignored. The definition language is the most important aspect.
Thoughts About the Data Model
The current implementation uses the decorator pattern: behavior encloses other behavior. A pipeline, by contrast, is generally thought of as "my output is your input": behavior is not enclosed, and components are distinct.
I will illustrate this. Consider a pipeline of three functions A, B, and C. The way a developer would mentally conceive of this pipeline is illustrated below:
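Roughly, data flows forward and each step's output feeds the next step's input:

```
A -> B -> C
```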
However, Arroyo defines pipelines that generate this call stack:
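Roughly, each strategy's `submit` invokes the strategy it encloses (with `a` and `b` as the functions applied by steps A and B):

```
A.submit(message)
└── B.submit(a(message))
    └── C.submit(b(a(message)))
```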
This difference can be described in code. You can see the inversion of the function ordering. This is required to satisfy the decorator pattern. A component must be defined after and execute before its dependency.
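A sketch of the two orderings, with hypothetical `StepA`/`StepB`/`StepC` classes standing in for concrete strategies:

```python
# Mental model: steps listed in execution order.
pipeline = [step_a, step_b, step_c]

# Decorator pattern: C must be constructed first so B can enclose it, and B
# before A. Definition order is the inverse of execution order.
pipeline = StepA(next_step=StepB(next_step=StepC(next_step=None)))
```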
This is not a problem per se. It could cause confusion in certain circumstances but I think with the right interface much of that confusion could be limited without changes to the underlying data model.
Desired implementation
Consider the following pipeline. A consumer receives a payload containing type and value fields. We want to multiply the message’s value field by 2 if it is of the multiply type. Otherwise we drop the message.
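A sketch of what the step list could look like; `FilterStep` and `ApplyStep` here follow the hypothetical API described below:

```python
# Keep only "multiply" messages, then double their value.
steps = [
    FilterStep(function=lambda message: message.payload["type"] == "multiply"),
    ApplyStep(function=lambda message: message.payload["value"] * 2),
]
```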
The steps are defined as a list and not a linked-list. The steps function identically to the current steps and implement the `submit`, `poll` workflow. `FilterStep` removes items from processing. `ApplyStep` applies a function to a message and returns the output. These are provided by Arroyo. The `lambda`s are user-defined. Notice the steps do not enclose one another. They are totally independent and can be composed in any order.

However, this is not a complete pipeline. How are the steps called? For this we need another layer.
The `SynchronousPipeline` would be provided by Arroyo in this hypothetical. It works how you would think: it calls `submit` and `poll` on all of its steps. At this layer we can do things like catch exceptions. We can manage backpressure.

Almost done but not quite. We need to commit the message.
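A sketch of the full composition under this hypothetical API, reusing the steps from above:

```python
pipeline = SynchronousPipeline(
    steps=[
        FilterStep(function=lambda message: message.payload["type"] == "multiply"),
        ApplyStep(function=lambda message: message.payload["value"] * 2),
    ],
    # The linked next step receives the final output and commits it.
    next_step=CommitStep(commit),
)
```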
Pipelines implement a linked list pattern for accessing next steps. They function almost identically to the current Arroyo implementation. Pipelines decorate their next steps. Pipeline steps (the sub-steps defined as a list) do not decorate one another and are independent components.
Why isn’t `CommitStep` in the `steps` list? Well, it can be, but it makes sense to call it as a linked `next_step`. This will become more clear in the next example. `next_step` could also be a nested pipeline.

Real world example (Replays)
Replays is like attachments: chunked messages come in, then a capstone comes in, and the capstone commits. If the capstone fails, the chunks can be retried because they never commit. If a message raises an exception, it should log but NOT commit.
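A sketch of how this could read in the proposed style; `TryCatchPipeline` and its `on_error` hook are illustrative names for the try/catch layer described above, not real Arroyo APIs:

```python
pipeline = SynchronousPipeline(
    steps=[
        TryCatchPipeline(
            steps=[
                ApplyStep(function=deserialize),
                ApplyStep(function=cache_chunks),
                # Chunks are cached above and dropped here; they never commit.
                FilterStep(function=filter_chunks),
                # Only capstone messages reach the store step.
                ApplyStep(function=store),
            ],
            # Exceptions are logged and the message is NOT committed.
            on_error=log_exception,
        ),
    ],
    next_step=CommitStep(commit),
)
```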
With the `CommitStep` defined as a next step, it exists outside of the try/catch pipeline and is unimpacted by its behavior. In other words, the `CommitStep` is free to raise.

Compare to the current Arroyo implementation:
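For contrast, a sketch of the same flow in today's decorator style, where the last strategy is constructed first (`LogExceptionStep` follows the signature used in the snippet at the top of this issue):

```python
# Current Arroyo style: each strategy encloses its successor, so the
# pipeline is written inside-out.
strategy = LogExceptionStep(
    message="Invalid recording specified.",
    next_step=TransformStep(
        function=deserialize,
        next_step=TransformStep(
            function=cache_chunks,
            next_step=FilterStep(
                function=filter_chunks,
                next_step=RunTaskInThreads(
                    processing_function=store,
                    concurrency=16,
                    max_pending_futures=32,
                    next_step=CommitOffsets(commit),
                ),
            ),
        ),
    ),
)
```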
It’s less clear how things are composed and how steps interact with one another. By refactoring the steps to be more isolated, we can clearly understand how all of the components interact. The proposed implementation also has the benefit of presenting the pipeline steps in order.
Internal of SynchronousPipeline
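A sketch of what the internals might look like, assuming the hypothetical change that a step's `submit` returns its transformed message (or `None` to drop it):

```python
class SynchronousPipeline:
    """Hypothetical: drives independent steps in order, then a linked next_step."""

    def __init__(self, steps, next_step=None):
        self.steps = steps
        self.next_step = next_step

    def submit(self, message):
        for step in self.steps:
            message = step.submit(message)
            if message is None:
                # A FilterStep dropped the message; stop processing it.
                return
        if self.next_step is not None:
            self.next_step.submit(message)

    def poll(self):
        for step in self.steps:
            step.poll()
        if self.next_step is not None:
            self.next_step.poll()
```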
Random Notes
`Step` types do not have a `next_step` property (they do, but it’s always `None`). Only pipelines can move to a next step.