Transient 500 for /things endpoint #11
Comments
We concluded that it is related to knex/knex#3523. There is no real solution yet, just a small workaround: setting the min pool size of
Reopening to keep track of new solutions.
I ran some tests and, unfortunately, it seems that the workaround is not working. After 20-30 minutes of inactivity, the connection to the database is terminated and the next call returns a 500. We need to investigate further and find a solution, even if it is temporary.
Great! Can we have a short technical call so I can better understand the issue? It would be great if you could reproduce this error so I can get a full picture of the problem (which I do not have so far, haha).
Hey guys, so I tried to reproduce the error. First, I tried it manually, but it did not work (I never got a 500). Then, I created some scripts to automate the process. Initially, the script creates a TD and, after a random amount of time (between 15 and 30 min), performs a GET to retrieve the created TD. Again, I never got a 500 error. I improved the script to perform other operations as well, such as PUT and POST, and to choose among them randomly. Yet, again, I never got the 500 error. Zion was deployed as a Docker container in all the attempts, using the repo's docker-compose. The logs of some of the trials are here, and you guys can try the script yourselves. I wrote a very simple README. Am I missing something?
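For readers who want to picture the reproduction loop described above, here is a minimal sketch in TypeScript. It is not the actual script from the branch: the names `randomIdleMs`, `pickOperation`, `runTrial`, and `TrialDeps` are hypothetical, and the HTTP call and the sleep are injected so that the timing logic can be checked without waiting 15 to 30 minutes.

```typescript
// Hypothetical sketch of one reproduction trial; all names are illustrative.
type Operation = "GET" | "PUT" | "POST";

const OPERATIONS: readonly Operation[] = ["GET", "PUT", "POST"];

/** Idle time in milliseconds, uniform in [minMinutes, maxMinutes). */
function randomIdleMs(
  minMinutes: number,
  maxMinutes: number,
  rng: () => number = Math.random,
): number {
  const minutes = minMinutes + rng() * (maxMinutes - minMinutes);
  return Math.floor(minutes * 60 * 1000);
}

/** Picks one of GET, PUT, POST with equal probability. */
function pickOperation(rng: () => number = Math.random): Operation {
  return OPERATIONS[Math.floor(rng() * OPERATIONS.length)] ?? "GET";
}

interface TrialDeps {
  /** Performs the operation against the directory; resolves to the HTTP status. */
  request: (op: Operation) => Promise<number>;
  /** Waits for the given number of milliseconds. */
  sleep: (ms: number) => Promise<void>;
  rng?: () => number;
}

interface TrialResult {
  op: Operation;
  idleMs: number;
  status: number;
  failed: boolean;
}

/** Creates a TD, stays idle for 15 to 30 minutes, then runs one random operation. */
async function runTrial(deps: TrialDeps): Promise<TrialResult> {
  const rng = deps.rng ?? Math.random;
  await deps.request("POST"); // create the TD first
  const idleMs = randomIdleMs(15, 30, rng);
  await deps.sleep(idleMs); // simulate inactivity
  const op = pickOperation(rng);
  const status = await deps.request(op);
  return { op, idleMs, status, failed: status >= 500 };
}
```

In a real run, `request` would wrap an HTTP client pointed at the deployed Zion instance and `sleep` would wrap `setTimeout`; both are left abstract here because the thread does not show how the original script does it.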
I think I found the problem... I am afraid that @hyperloris's commit is newer than the released version. Therefore, I would bet that the issue was indeed resolved on the 1st of August and, in the rush of solving other DESMO-LD issues, we didn't notice this discrepancy. The good news is that we now have a benchmark and we can investigate alternatives to 8a13a30. My next to-do item is to set up a CD, as I initially planned to do.
I guess that's what happened. I put some effort into this, since the traditional testing methods did not work for this error. As you mentioned, at least we now have a reliable way to identify problems of this kind. I suggest closing this issue without merging the branch 11-transient-500-for-things-endpoint into main.
see
As I stated above, I think there is still value in https://github.com/vaimee/zion/tree/11-transient-500-for-things-endpoint . It should provide a benchmark to test new solutions. The only thing left to understand is whether the benchmark you created is able to spot the problem if you revert the commits back to https://github.com/vaimee/zion/releases/tag/v1.0.0-alpha.0
The TDDs deployed for the project demo have the minimum pool size set to 0 (we are not using v1.0.0-alpha.0). Unfortunately, after some downtime, the problem reappears.
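To make the setting under discussion concrete, here is a rough sketch of a knex-style configuration with the pool minimum at 0. The `pool.min` and `pool.max` options are part of knex's configuration object; everything else is an assumption for illustration: the `buildDbConfig` helper, the environment variable names, the fallback values, and `max: 10` are not taken from Zion's code.

```typescript
// Local types mirroring only the knex config fields used in this sketch.
interface PoolSettings {
  min: number;
  max: number;
}

interface DbConfig {
  client: string;
  connection: {
    host: string;
    port: number;
    user: string;
    password: string;
    database: string;
  };
  pool: PoolSettings;
}

// Hypothetical helper: the variable names and defaults are assumptions.
function buildDbConfig(env: Record<string, string | undefined>): DbConfig {
  return {
    client: "pg",
    connection: {
      host: env.DB_HOST ?? "localhost",
      port: Number(env.DB_PORT ?? "5432"),
      user: env.DB_USER ?? "postgres",
      password: env.DB_PASSWORD ?? "",
      database: env.DB_NAME ?? "postgres",
    },
    // min: 0 lets the pool release every idle connection instead of
    // keeping some open that may have gone stale during inactivity.
    pool: { min: 0, max: 10 },
  };
}
```

The resulting object would be passed to `knex(...)`. As reported above, this setting alone did not prevent the error from reappearing in the demo deployment.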
Taking into account @relu91 and @hyperloris's comments, I will do the following:
It would be helpful if you could share more details on how you triggered the error. Like:
Docker, behind a proxy (Traefik; I don't think it is the problem, but just to be crystal clear about the setup).
A GET HTTP operation to
Sorry, I don't know, but I believe the server has never gone down.
Tested today at 10:20 a.m. and everything was fine. Tested again now at 12:43 p.m. and got a 500. So we know for sure that the error happens after about 2 hours of inactivity (I know it is a lot 🤣). But I think @hyperloris is right: 30 min might be a reasonable minimum.
Guys, what you told me is very strange, because I cannot replicate this error!
Do you have a complete log of the production deployment of Zion, so we can better understand when those errors occurred? Also, it would be helpful to calculate the exact time between requests. I am currently running tests against the code before commit 8a13a30. I will post an update here when I get the results.
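As a small illustration of calculating the time between requests, the sketch below takes request timestamps (for example, extracted from a log) and returns the idle gaps in minutes. The function name `gapsInMinutes` and the ISO 8601 input format are assumptions; the actual Zion log format is not shown in this thread.

```typescript
/**
 * Returns the gaps, in minutes, between consecutive request timestamps.
 * Timestamps are ISO 8601 strings (an assumed format); order does not matter.
 */
function gapsInMinutes(timestamps: readonly string[]): number[] {
  const times = timestamps.map((t) => {
    const ms = Date.parse(t);
    if (Number.isNaN(ms)) {
      throw new Error(`Unparseable timestamp: ${t}`);
    }
    return ms;
  });
  const sorted = [...times].sort((a, b) => a - b);

  const gaps: number[] = [];
  let previous: number | undefined;
  for (const current of sorted) {
    if (previous !== undefined) {
      gaps.push((current - previous) / (60 * 1000));
    }
    previous = current;
  }
  return gaps;
}
```

For the two manual checks mentioned earlier (10:20 and 12:43 on the same day), the single gap comes out at 143 minutes. Comparing such gaps against the times at which a 500 was returned would show whether there is a consistent inactivity threshold.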
Unfortunately, I have no additional clues, but I can add more details about the deployment we are using for the demo. We have 4 TDDs, each with its own Postgres database, everything deployed inside a Docker Swarm cluster within the same Docker stack.
Yes, they are the same.
postgres:14.3-alpine
I cannot retrieve it right now, but I will try to get it for you asap.
After additional investigation, we found that the issue depends on aspects of our production environment. However, the implications of 8a13a30 are still not clear: it may have fixed the error, or it may just have degraded the performance of the system.
Currently, we can bypass the VIP/IPVS issues using the workaround described at point 3 in moby/moby#37466 (comment). Note that, even though bypassing VIP/IPVS resolves the transient 500 errors in practice, it does not work very well for scalability. Therefore, we are considering alternative solutions to replace knex.
I received a 500 when accessing the /things endpoint. Here is the Zion log:
The error disappeared after a second try.