Unknown condition cause new event definition workflow to fail executing search #20925

Open
drewmiranda-gl opened this issue Nov 8, 2024 · 15 comments · May be fixed by #20953

Comments

@drewmiranda-gl
Member

drewmiranda-gl commented Nov 8, 2024

@mcdowellster first reported a strange bug: when creating an event definition with the Filter & Aggregation condition type, the search always fails. I can replicate this in both sandbox.graylog.cloud and demo.graylog.cloud.

I cannot replicate this in my own environment and suspect it has something to do with MongoDB replica sets.

See video:
https://github.com/user-attachments/assets/014e7f33-cf42-4a17-b520-2ca8ecf681a1

XHR requests:

POST /api/views/search/672e8980078d854e21b15434/execute (HTTP 201 Created)

GET /api/views/search/status/672e89805842237c9efb2434 (HTTP 200 OK)
(for some reason a second subsequent...)

GET /api/views/search/status/672e89805842237c9efb2434 (HTTP 404 Not Found)

{"type":"ApiError","message":"HTTP 404 Not Found"}

Expected Behavior

When creating an event, the search should work

Current Behavior

Search for creating an event does not work (see video)

Possible Solution

Steps to Reproduce (for bugs)

  1. Go to Alerts
  2. Click on Event Definitions "tab"(??)
  3. Create event definition
  4. Go to condition "tab"(??)
  5. Change condition type to "Filter & Aggregation"
  6. Observe error

Context

Your Environment

  • Graylog Version: 6.1.1, 6.1.2
  • Java Version: Bundled
  • OpenSearch Version: N/A
  • MongoDB Version: 5.0.28, 6.0.19 (whatever graylog cloud uses)
@kmerz
Member

kmerz commented Nov 11, 2024

Are there any entries in the logs for that?

@kmerz kmerz added the triaged label Nov 11, 2024
@mcdowellster

mcdowellster commented Nov 11, 2024 via email

@drewmiranda-gl
Member Author

Seems the same as #20474 ?

@damianharouff

damianharouff commented Nov 12, 2024

Seems the same as #20474 ?

Further to this: at the time I opened #20474 I was running a multi-node Graylog/MongoDB/OpenSearch cluster.

I've since rebuilt into an AIO install, and can confirm that #20474 no longer happens in my environment.

This can also be replicated in the GLC sandbox.

@mcdowellster

I can confirm this is affecting live customer environments too. On a call with a customer right now, I cannot set up a new event at all.

@luk-kaminski
Contributor

In the video from @damianharouff I can see a 404 response code for the ...search\status... endpoint.
It looks like, in a multi-node environment, ProxiedResource fails to reach the proper GL node to retrieve the async search status.

@luk-kaminski
Contributor

Similar issue: #20922
It was declared to be a k8s misconfiguration on the client side.

@luk-kaminski
Contributor

luk-kaminski commented Nov 13, 2024

@drewmiranda-gl - Does it sometimes happen with regular search, on the search page, as well? (the way it happened with #20922)

@luk-kaminski luk-kaminski self-assigned this Nov 13, 2024
@damianharouff

Keeping everything together here in the issue: for me it did not happen with regular search; it only seems to happen on the second page of Event Definition -> Filter & Aggregation. I haven’t noted any issues with regular search.

@mcdowellster

I can confirm the same behavior as Drew. Regular search is normal. Only the filter and aggregation search in events appears affected.

@luk-kaminski
Contributor

@damianharouff @mcdowellster Thanks a lot for the confirmation!

At this point it looks like an FE issue.
Inside event definitions, we seem to use the wrong call to search status: it passes only jobId and not executing_node_id.
On the search page, we use it properly, passing both params. I guess it breaks the event definition page on all installations with multiple GL nodes, as the status call cannot be passed to the proper executing node.
I'm calling FE cavalry for help.
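
To illustrate the suspected failure mode, here is a minimal toy model in Python. It is not Graylog code: the `Node` and `Cluster` classes, the round-robin routing, and the in-memory job registry are all assumptions made for illustration. It only shows why a status call that carries jobId alone can land on a node that has never heard of the job, while a call that also carries executing_node_id can be forwarded to the owner.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Optional


@dataclass
class Node:
    """Hypothetical node; async search jobs live only in this node's memory."""
    node_id: str
    jobs: dict = field(default_factory=dict)

    def execute(self, job_id: str) -> dict:
        self.jobs[job_id] = "RUNNING"
        return {"job_id": job_id, "executing_node_id": self.node_id}


class Cluster:
    """Toy load-balanced cluster: each request lands on the next node in turn."""

    def __init__(self, node_ids):
        self.nodes = {n: Node(n) for n in node_ids}
        self._round_robin = cycle(node_ids)

    def _next_node(self) -> Node:
        return self.nodes[next(self._round_robin)]

    def execute(self, job_id: str) -> dict:
        return self._next_node().execute(job_id)

    def status(self, job_id: str, executing_node_id: Optional[str] = None) -> int:
        receiving = self._next_node()
        # With a node id, the receiving node can forward the call to the owner;
        # without one, it can only look in its own (possibly empty) registry.
        target = self.nodes[executing_node_id] if executing_node_id else receiving
        return 200 if job_id in target.jobs else 404


cluster = Cluster(["node-a", "node-b", "node-c"])
job = cluster.execute("job-1")                # handled by node-a
by_job_only = cluster.status("job-1")         # handled by node-b: unknown job
with_node = cluster.status("job-1", job["executing_node_id"])
print(by_job_only, with_node)                 # prints "404 200"
```

In this model the jobId-only call fails whenever the balancer picks a node other than the one that ran the search, which would also explain why single-node installs are unaffected.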

@mcdowellster

Was there a change in this process for 6.1? This has been fine for years on my cluster until I upgraded to 6.1

@luk-kaminski
Contributor

Yes, more emphasis was placed on asynchronous calls, which means separate status calls are directed to the node executing the search asynchronously. It seems that an improper status call is made here.

@mcdowellster

For now, nginx users can work around this by adding ip_hash to the upstream block. It keeps requests from the same client on the same host unless that host goes down while the session is live.

upstream graylog {
    ip_hash;
    server 192.168.1.231:9000 max_fails=3 fail_timeout=30s;
    server 192.168.99.254:9000 max_fails=3 fail_timeout=30s;
    server 192.168.99.253:9000 max_fails=3 fail_timeout=30s;
}
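
As a rough sketch of why this helps, the Python snippet below contrasts round-robin selection with a selection keyed on the client IP. The hash used here (SHA-256 over the full address) is only an illustrative stand-in and is not nginx's actual ip_hash algorithm; the client address 203.0.113.7 and the function name `pick_by_ip` are made up for the example, and failover when a backend goes down is not modeled.

```python
import hashlib
from itertools import cycle

BACKENDS = ["192.168.1.231:9000", "192.168.99.254:9000", "192.168.99.253:9000"]


def pick_by_ip(client_ip: str, backends=BACKENDS) -> str:
    """Map a client IP to a backend deterministically (illustrative hash only)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


# Round robin: three consecutive requests from one client hit three backends.
round_robin = cycle(BACKENDS)
rr_choices = [next(round_robin) for _ in range(3)]

# IP-keyed selection: the same client always reaches the same backend.
sticky_choices = [pick_by_ip("203.0.113.7") for _ in range(3)]

print(len(set(rr_choices)), len(set(sticky_choices)))  # prints "3 1"
```

With the sticky choice, the execute request and the follow-up status requests from one browser reach the same node, so the node that receives the status call is also the one that holds the job.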

@grownuphacker

Just to clarify the workaround as I understand it, so it's not specific to Nginx:

  • Configure the Load Balancer / ADC in front of Graylog to use a balancing algorithm other than Round Robin or Least Connections (source-IP-based distribution is recommended) so that there is a significantly more stable client-to-node session relationship.

This reduces the remaining risk to the node going down during the event calculation.

@maxiadlovskii maxiadlovskii linked a pull request Nov 14, 2024 that will close this issue
9 tasks