Improve performance of data collection #2056

Open
vvmruder opened this issue Sep 23, 2024 · 1 comment
Labels: discussion, usergroup work

@vvmruder
Collaborator

vvmruder commented Sep 23, 2024

Intro

Coming back to #1544, we can see that many redundant steps are executed. They were mainly introduced to make the server configurable at runtime, meaning the data integrator can change content in the database without restarting the server. It turned out that this use case is rare; it is probably not used at all. Instead, new data is provided by a regular deployment, in which the underlying data and the server are set up again from scratch.

Initialising things once

One of the most interesting points, which was not touched in the recent refactorings and performance improvements, is the initialisation of the processor. It is initialised every time an ÖREB-related endpoint is called. If we agree with the statement made in the intro, one of the most effective performance gains would be to refactor pyramid_oereb so that the processor is initialised only once, at boot time. This would remove the whole initialisation process that is currently done in this method.

=> This has to be discussed, as it is an organizational decision to have pyramid_oereb recognize the configuration and data sources only at boot time and not on every request.
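To make the idea concrete, here is a minimal sketch of a boot-time initialisation, assuming a module-level holder. `Processor`, `init_processor`, `get_processor` and `extract_view` are hypothetical names for illustration and not the actual pyramid_oereb API.

```python
class Processor:
    """Hypothetical stand-in for the real processor object."""

    instances = 0  # counts constructions, only to make the effect visible

    def __init__(self, settings):
        Processor.instances += 1
        self.settings = settings

    def process(self, egrid):
        return f"extract for {egrid}"


_PROCESSOR = None


def init_processor(settings):
    """Build the processor once; meant to be called at boot time."""
    global _PROCESSOR
    if _PROCESSOR is None:
        _PROCESSOR = Processor(settings)
    return _PROCESSOR


def get_processor():
    """Return the shared processor; fail loudly if boot was skipped."""
    if _PROCESSOR is None:
        raise RuntimeError("processor was not initialised at boot time")
    return _PROCESSOR


def extract_view(egrid):
    """Hypothetical request handler: no per-request initialisation."""
    return get_processor().process(egrid)
```

With this layout a change of configuration or data sources only takes effect after a restart, which is exactly the trade-off that needs to be agreed on.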

Parallelisation

I see one potential place where we could hook in to take proper advantage of parallel processing:
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/readers/extract.py#L51-L104

Here all iterative querying of the sources is bundled, and here we could take action. However, we should discuss which technique we want to use and whether this should be configurable.

asyncio

Since this project builds on recent Python versions, we are able to use asyncio in combination with SQLAlchemy. This is probably the best solution in terms of a future-proof setup. However, it comes with some downsides. Asyncio is not fully supported across the Python stack and the libraries we may depend on, so a major task would be to research where we might be blocked from using it.
The biggest upside of this solution is its scalability and its economical use of resources.
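A rough sketch of what concurrent source reading could look like with asyncio, using only the standard library. `read_source` is a hypothetical stand-in that simulates a query with `asyncio.sleep`; in a real setup it would await an asynchronous database call instead.

```python
import asyncio


async def read_source(name, delay):
    """Hypothetical stand-in for one asynchronous source query."""
    await asyncio.sleep(delay)  # simulates waiting for the database
    return name, f"records of {name}"


async def read_all(sources):
    """Query all sources concurrently and collect the results by name."""
    tasks = [read_source(name, delay) for name, delay in sources.items()]
    results = await asyncio.gather(*tasks)
    return dict(results)


if __name__ == "__main__":
    sources = {"land_use_plans": 0.2, "forest_perimeters": 0.2, "noise": 0.2}
    print(asyncio.run(read_all(sources)))
```

Because the waits overlap, the total time is close to the slowest single query rather than the sum of all queries, and everything runs in one thread.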

multiprocessing / threading

A well-known way of implementing iterative parallel tasks. We could easily implement that. The main disadvantage here is the forking. Threads in one solution, or processes in the other, may introduce much more load on the bare-metal server in the end. So we should discuss how we can avoid overloading either the database or our servers. In my opinion, we could avoid that with some additional configuration where one can set the number of threads or processes allowed.
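As an illustration of the configurable limit, here is a small sketch using a thread pool from the standard library. `read_source` and the `max_workers` setting are assumptions for the example; the real reader code and configuration keys may look different.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_source(name):
    """Hypothetical stand-in for one blocking source query."""
    time.sleep(0.05)  # simulates waiting for the database
    return f"records of {name}"


def read_all(source_names, max_workers=4):
    """Query the sources in parallel with at most max_workers threads."""
    names = list(source_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so zip pairs names and results.
        results = pool.map(read_source, names)
        return dict(zip(names, results))


if __name__ == "__main__":
    print(read_all(["land_use_plans", "forest_perimeters"], max_workers=2))
```

The `max_workers` value caps how many queries hit the database at the same time per request, which is the knob that would need to be exposed in the configuration.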

SQLAlchemy Session Management

A thing we also need to research is the way we currently implement our session sharing:
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/adapter.py#L12-L73

It is a homemade way to keep long-running servers from collecting too many open DB sessions. Currently I am not aware of the influence this would have in a parallel context, whether for threading, multiprocessing, or asyncio.
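One pattern worth comparing against during that research is one session per thread, which is what SQLAlchemy's `scoped_session` provides by default through a thread-local registry. The sketch below shows only the pattern, with the standard library; `FakeSession` and `get_session` are hypothetical names, and it makes no claim about how the current adapter behaves.

```python
import threading


class FakeSession:
    """Hypothetical stand-in for a database session."""


_local = threading.local()


def get_session():
    """Return the session bound to the calling thread, creating it once."""
    session = getattr(_local, "session", None)
    if session is None:
        session = FakeSession()
        _local.session = session
    return session
```

Each thread gets its own session and never shares it, so the number of open sessions grows with the number of worker threads. That is one more reason to keep the thread count configurable.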

@voisardf
Collaborator

@vvmruder Thanks for the work, we will study and discuss the point in the PSC
@michmuel @svamaa
