
Transient 500 for /things endpoint #11

Open
relu91 opened this issue Aug 1, 2022 · 16 comments

@relu91
Contributor

relu91 commented Aug 1, 2022

I received a 500 when accessing the /things endpoint. Here is the Zion log:

Error: select * from "thing_description" - Connection terminated unexpectedly
    at Connection.<anonymous> (/home/node/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:641:28)
    at Connection.emit (node:events:527:28)
    at Socket.<anonymous> (/home/node/app/node_modules/pg/lib/connection.js:57:12)
    at Socket.emit (node:events:527:28)
    at TCP.<anonymous> (node:net:709:12)
Connection Error: Connection ended unexpectedly

The error disappeared after a second try.

@relu91
Contributor Author

relu91 commented Aug 1, 2022

We concluded that it is related to knex/knex#3523. There is no real solution yet, just a small workaround: setting knex's minimum pool size to 0.
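
For reference, a minimal sketch of the workaround in a knex configuration (the connection settings below are placeholders, not Zion's actual configuration):

```ts
import knex from 'knex';

// Workaround for knex/knex#3523: with min set to 0, knex does not keep idle
// connections permanently in the pool, so a connection silently dropped by the
// database or an intermediate proxy is not handed back to the application.
const db = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder connection settings
  pool: {
    min: 0, // the workaround: no permanently held idle connections
    max: 10,
  },
});

export default db;
```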

@hyperloris
Contributor

Reopen to keep track of new solutions.

@hyperloris hyperloris reopened this Aug 1, 2022
@hyperloris hyperloris added type: bug Something isn't working priority: high (2) High-priority issue that should be resolved as soon as possible scope: persistance type: 3rd-party labels Aug 16, 2022
@hyperloris
Contributor

I ran some tests and, unfortunately, it seems that the workaround is not working. After 20-30 minutes of inactivity, the connection to the database is terminated and the next call returns a 500.

We need to investigate further and find a solution, even if it is temporary.

@relu91 relu91 added priority: critical (1) Critical priority issue that must be resolved immediately and removed priority: high (2) High-priority issue that should be resolved as soon as possible labels Sep 5, 2022
@relu91
Contributor Author

relu91 commented Sep 5, 2022

@ivanzy We decided to increase the priority of this issue because it is damaging the user experience. When you finish #18, I'd ask you to focus on this.

@ivanzy
Contributor

ivanzy commented Sep 5, 2022

Great! Can we have a short technical call so I can better understand the issue? It would be great if you could reproduce this error so I can have a full picture of the problem (which I do not, so far, haha).

@ivanzy
Contributor

ivanzy commented Sep 12, 2022

Hey guys, so I tried to reproduce the error. First, I tried it manually, but it did not work (I never got a 500). Then, I created some scripts to automate the process. Initially, the script creates a TD and, after a random amount of time (between 15 and 30 min), performs a GET to retrieve the created TD. Again, I never got a 500 error. I improved the script to perform other operations, such as PUT and POST, chosen randomly. Yet again, I never got the 500 error.
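
For illustration, a rough sketch of what such a reproduction script could look like (the base URL, the TD payload, and Node 18+ global fetch are assumptions; the actual scripts are the ones linked below):

```ts
// Hypothetical reproduction sketch: create a TD, stay idle for a random
// 15-30 minutes, then query /things again and check for a transient 500.
const BASE_URL = process.env.ZION_URL ?? 'http://localhost:3000'; // assumed

async function reproduce(): Promise<void> {
  // Create a Thing Description (minimal placeholder payload).
  const created = await fetch(`${BASE_URL}/things`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ title: 'reproduction-test-thing' }),
  });
  console.log('POST /things ->', created.status);

  // Stay idle for a random period between 15 and 30 minutes.
  const idleMinutes = 15 + Math.random() * 15;
  await new Promise((resolve) => setTimeout(resolve, idleMinutes * 60 * 1000));

  // Read the collection back; a 500 here would reproduce the issue.
  const response = await fetch(`${BASE_URL}/things`);
  console.log(`GET /things after ${idleMinutes.toFixed(1)} min ->`, response.status);
}

reproduce().catch((err) => {
  console.error(err);
  process.exit(1);
});
```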

Zion was deployed as a docker container in all the attempts, using the repo docker-compose.

The logs of some of the trials are here, and you can try the script yourselves. I wrote a very simple README.

Am I missing something?
Does a specific request trigger the error? Or is it already solved?

@relu91
Contributor Author

relu91 commented Sep 12, 2022

Am I missing something?
Does a specific request trigger the error? Or is it already solved?

I think I found the problem... I am afraid that @hyperloris's commit is newer than the released version. Therefore, I would bet that the issue was indeed resolved on the 1st of August, and in the rush of solving other DESMO-LD issues we didn't notice this discrepancy.

The good news is that now we have a benchmark, and we can investigate alternative solutions to 8a13a30.

My next to-do item is to set up a CD pipeline, as I initially planned.

@ivanzy
Contributor

ivanzy commented Sep 12, 2022

I guess that is what happened. I put some effort into this, since the traditional testing methods did not work for this error. As you mentioned, at least we now have a reliable way to identify problems of this kind.

I suggest closing this issue without merging the branch 11-transient-500-for-things-endpoint into main.

@relu91
Contributor Author

relu91 commented Sep 12, 2022

I suggest closing this issue

see

Reopen to keep track of new solutions.

As I stated above, I think there is still value in https://github.com/vaimee/zion/tree/11-transient-500-for-things-endpoint . It should provide a benchmark to test new solutions. The only thing left to understand is whether the benchmark you created is able to spot the problem if you revert the commits back to https://github.com/vaimee/zion/releases/tag/v1.0.0-alpha.0

@hyperloris
Contributor

The TDDs deployed for the project demo have the minimum pool size set to 0 (we are not using v1.0.0-alpha.0). Unfortunately, after some downtime, the problem reappears.
May I suggest increasing the waiting time in the tests? Perhaps 15-30 min is too little.

@ivanzy
Contributor

ivanzy commented Sep 13, 2022

Taking into account @relu91 and @hyperloris's comments, I will do the following:

  • Run more tests with the current version, considering longer downtimes (30-60 min?)
  • Test the same scripts on the older version, before the workaround was added in 8a13a30.

It would be helpful if you could share more details on how you triggered the error, such as:

  • Where and how was Zion deployed (docker or no docker)?
  • Which exact request triggered the error?
  • What was the server uptime when the error occurred?
  • Is there a rough estimate of the inactivity period?

@relu91
Contributor Author

relu91 commented Sep 13, 2022

Where and how was Zion deployed (docker or no docker)?

Docker, behind a proxy (Traefik; I don't think it is the problem, but just to be crystal clear about the setup).

Which exact request triggered the error?

An HTTP GET to the /things endpoint.

What was the server uptime when the error occurred?

Sorry, I don't know, but I believe the server has never gone down.

Is there a rough estimate of the inactivity period?

I tested today at 10:20 a.m. and everything was fine. I tested again now at 12:43 p.m. and got a 500. So we know for sure that the error happens after at least about 2 hours of inactivity (I know it is a lot 🤣). But I think @hyperloris is right: 30 min might be a reasonable minimum.

@ivanzy
Contributor

ivanzy commented Sep 13, 2022

Guys,

What you told me is very strange, because I cannot replicate this error!
I tried to replicate it with various inactivity times, from 15 min to 2 h, and the error did not manifest in any of them (I also tried manually, in case my script is not working properly).

  • Do you have any clue why it is not happening in my setup? Maybe performing the requests on the same machine where the Docker containers are running has some impact (but I cannot imagine why).
  • Another potential difference is the docker-compose used to deploy in production: are the environment variables identical?
  • Is the DB version the same as well?

Do you have a complete log of the production-deployed Zion, to better understand when those errors occurred? It would also help to calculate the exact time between requests.

I am currently running tests on the pre-8a13a30 commit. I will update here when I get the results.

@hyperloris
Contributor

  • Do you have any clue why it is not happening in my setup? Maybe performing the requests on the same machine where the Docker containers are running has some impact (but I cannot imagine why).

Unfortunately, I have no additional clues, but I can add more details about the deployment we are using for the demo. We have 4 TDDs, each with its own Postgres database, everything deployed inside a Docker Swarm cluster within the same Docker stack.

  • Another potential difference is the docker-compose used to deploy in production: are the environment variables identical?

Yes, they are the same.

  • Is the DB version the same as well?

postgres:14.3-alpine

Do you have a complete log of the production-deployed Zion, to better understand when those errors occurred? It would also help to calculate the exact time between requests.

I cannot retrieve it right now, but I will try to get it for you asap.

@relu91
Contributor Author

relu91 commented Sep 19, 2022

After additional investigation, we have found that the problem depends on our production environment. However, the implications of 8a13a30 are still not clear, nor whether it fixed the error or just degraded the performance of the system.

@relu91 relu91 added type: potential issue and removed type: bug Something isn't working labels Sep 19, 2022
@relu91
Contributor Author

relu91 commented Feb 17, 2023

Currently, we can bypass the VIP/IPVS issues using the workaround described at point 3 in moby/moby#37466 (comment). Note that even if, in practice, it resolves the transient 500 errors, bypassing VIP/IPVS does not work very well for scalability. Therefore, we are considering alternative solutions to replace knex.
