
Transient 500 for /things endpoint #11

Open
relu91 opened this issue Aug 1, 2022 · 16 comments

@relu91
Contributor

relu91 commented Aug 1, 2022

I received a 500 when accessing the /things endpoint. Here is the Zion log:

Error: select * from "thing_description" - Connection terminated unexpectedly
    at Connection.<anonymous> (/home/node/app/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:641:28)
    at Connection.emit (node:events:527:28)
    at Socket.<anonymous> (/home/node/app/node_modules/pg/lib/connection.js:57:12)
    at Socket.emit (node:events:527:28)
    at TCP.<anonymous> (node:net:709:12)
Connection Error: Connection ended unexpectedly

The error disappeared after a second try.

@relu91
Contributor Author

relu91 commented Aug 1, 2022

We concluded that it is related to knex/knex#3523. There is no real solution yet, just a small workaround: setting knex's minimum pool size to 0.
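
For reference, a minimal sketch of the workaround in a knex configuration (the connection settings below are placeholders, not Zion's actual configuration):

```ts
import knex from 'knex';

// Workaround for knex/knex#3523: with min set to 0, knex does not keep idle
// connections permanently in the pool, so a connection silently dropped by the
// database or an intermediate proxy is not handed back to the application.
const db = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL, // placeholder connection settings
  pool: {
    min: 0, // the workaround: no permanently held idle connections
    max: 10,
  },
});

export default db;
```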

@hyperloris
Contributor

Reopen to keep track of new solutions.

@hyperloris hyperloris reopened this Aug 1, 2022
@hyperloris hyperloris added type: bug Something isn't working priority: high (2) High-priority issue that should be resolved as soon as possible scope: persistance type: 3rd-party labels Aug 16, 2022
@hyperloris
Contributor

I ran some tests and, unfortunately, it seems that the workaround is not working. After 20-30 minutes of inactivity, the connection to the database is terminated and the next call returns a 500.

We need to investigate further and find a solution, even if it is temporary.

@relu91 relu91 added priority: critical (1) Critical priority issue that must be resolved immediately and removed priority: high (2) High-priority issue that should be resolved as soon as possible labels Sep 5, 2022
@relu91
Contributor Author

relu91 commented Sep 5, 2022

@ivanzy We decided to increase the priority of this issue because it is damaging the user experience. When you finish #18, I'd ask you to focus on this.

@ivanzy
Contributor

ivanzy commented Sep 5, 2022

Great! Can we have a short technical call so I can better understand the issue? It would be great if you could reproduce this error so I can have a full picture of the problem (which I do not, so far, haha).

@ivanzy
Contributor

ivanzy commented Sep 12, 2022

Hey guys, so I tried to reproduce the error. First, I tried it manually, but it did not work (I never got a 500). Then, I created some scripts to automate the process. Initially, the script creates a TD and, after a random amount of time (between 15 and 30 min), performs a GET to retrieve the created TD. Again, I never got a 500 error. I improved the script to perform other operations, such as PUT and POST, chosen randomly. Yet again, I never got the 500 error.
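
For illustration, a rough sketch of what such a reproduction script could look like (the base URL, the TD payload, and Node 18+ global fetch are assumptions; the actual scripts are the ones linked below):

```ts
// Hypothetical reproduction sketch: create a TD, stay idle for a random
// 15-30 minutes, then query /things again and check for a transient 500.
const BASE_URL = process.env.ZION_URL ?? 'http://localhost:3000'; // assumed

async function reproduce(): Promise<void> {
  // Create a Thing Description (minimal placeholder payload).
  const created = await fetch(`${BASE_URL}/things`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ title: 'reproduction-test-thing' }),
  });
  console.log('POST /things ->', created.status);

  // Stay idle for a random period between 15 and 30 minutes.
  const idleMinutes = 15 + Math.random() * 15;
  await new Promise((resolve) => setTimeout(resolve, idleMinutes * 60 * 1000));

  // Read the collection back; a 500 here would reproduce the issue.
  const response = await fetch(`${BASE_URL}/things`);
  console.log(`GET /things after ${idleMinutes.toFixed(1)} min ->`, response.status);
}

reproduce().catch((err) => {
  console.error(err);
  process.exit(1);
});
```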

Zion was deployed as a docker container in all the attempts, using the repo docker-compose.

The logs of some of the trials are here, and you can try the script yourselves. I wrote a very simple README.

Am I missing something?
Does a specific request trigger the error? Or is it already solved?

@relu91
Contributor Author

relu91 commented Sep 12, 2022

Am I missing something?
Does a specific request trigger the error? Or is it already solved?

I think I found the problem... I am afraid that @hyperloris's commit is newer than the released version. Therefore, I would bet that the issue was indeed resolved on the 1st of August, and in the rush of solving other DESMO-LD issues we didn't notice this discrepancy.

The good news is that now we have a benchmark, and we can investigate alternative solutions to 8a13a30.

My next to-do item is to set up a CD pipeline, as I initially planned.

@ivanzy
Contributor

ivanzy commented Sep 12, 2022

I guess that is what happened. I put some effort into this, since the traditional testing methods did not work for this error. As you mentioned, at least we now have a reliable way to identify problems of this kind.

I suggest closing this issue without merging the branch 11-transient-500-for-things-endpoint into main.

@relu91
Contributor Author

relu91 commented Sep 12, 2022

I suggest closing this issue

see

Reopen to keep track of new solutions.

As I stated above, I think there is still value in https://github.com/vaimee/zion/tree/11-transient-500-for-things-endpoint . It should provide a benchmark to test new solutions. The only thing left to understand is whether the benchmark you created is able to spot the problem if you revert the commits back to https://github.com/vaimee/zion/releases/tag/v1.0.0-alpha.0

@hyperloris
Contributor

The TDDs deployed for the project demo have the minimum pool size set to 0 (we are not using v1.0.0-alpha.0). Unfortunately, after some downtime, the problem reappears.
May I suggest increasing the waiting time in the tests? Perhaps 15-30 min is too little.

@ivanzy
Contributor

ivanzy commented Sep 13, 2022

Taking into account @relu91 and @hyperloris's comments, I will do the following:

  • Run more tests with the current version, considering longer downtimes (30-60 min?)
  • Test the same scripts on the older version, before the workaround was added in 8a13a30.

It would be helpful if you could share more details on how you triggered the error, such as:

  • Where and how was Zion deployed (docker or no docker)?
  • Which exact request triggered the error?
  • What was the server uptime when the error occurred?
  • Is there a rough estimate of the inactivity period?

@relu91
Contributor Author

relu91 commented Sep 13, 2022

Where and how was Zion deployed (docker or no docker)?

Docker, behind a proxy (Traefik; I don't think it is the problem, but just to be crystal clear about the setup).

Which exact request triggered the error?

An HTTP GET to the /things endpoint.

What was the server uptime when the error occurred?

Sorry, I don't know, but I believe the server has never gone down.

Is there a rough estimate of the inactivity period?

I tested today at 10:20 a.m. and everything was fine. I tested again now at 12:43 p.m. and got a 500. So we know for sure that the error happens after at least about 2 hours of inactivity (I know it is a lot 🤣). But I think @hyperloris is right: 30 min might be a reasonable minimum.

@ivanzy
Contributor

ivanzy commented Sep 13, 2022

Guys,

What you told me is very strange, because I cannot replicate this error!
I tried to replicate it with various inactivity times, from 15 min to 2 h, and the error did not manifest in any of them (I also tried manually, in case my script is not working properly).

  • Do you have any clue why it is not happening in my setup? Maybe performing the requests on the same machine where the Docker containers are running has some impact (but I cannot imagine why).
  • Another potential difference is the docker-compose used to deploy in production: are the environment variables identical?
  • Is the DB version the same as well?

Do you have a complete log of the production-deployed Zion, to better understand when those errors occurred? It would also help to calculate the exact time between requests.

I am currently running tests on the pre-8a13a30 commit. I will update here when I get the results.

@hyperloris
Contributor

  • Do you have any clue why it is not happening in my setup? Maybe performing the requests on the same machine where the Docker containers are running has some impact (but I cannot imagine why).

Unfortunately, I have no additional clues, but I can add more details about the deployment we are using for the demo. We have 4 TDDs, each with its own Postgres database, everything deployed inside a Docker Swarm cluster within the same Docker stack.

  • Another potential difference is the docker-compose used to deploy in production: are the environment variables identical?

Yes, they are the same.

  • Is the DB version the same as well?

postgres:14.3-alpine

Do you have a complete log of the production-deployed Zion, to better understand when those errors occurred? It would also help to calculate the exact time between requests.

I cannot retrieve it right now, but I will try to get it for you asap.

@relu91
Contributor Author

relu91 commented Sep 19, 2022

After additional investigation, we have found that the problem depends on our production environment. However, the implications of 8a13a30 are still not clear, nor whether it fixed the error or just degraded the performance of the system.

@relu91 relu91 added type: potential issue and removed type: bug Something isn't working labels Sep 19, 2022
@relu91
Contributor Author

relu91 commented Feb 17, 2023

Currently, we can bypass the VIP/IPVS issues using the workaround described at point 3 in moby/moby#37466 (comment). Note that even if, in practice, it resolves the transient 500 errors, bypassing VIP/IPVS does not work very well for scalability. Therefore, we are considering alternative solutions to replace knex.
