Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inclusion of a broadcast option to notify other processes of RAS events #24

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

alexandrelimassantana
Copy link

@alexandrelimassantana alexandrelimassantana commented Sep 28, 2020

This patch includes the addition of a RAS server using connection-oriented local sockets for inter-process communication.
The broadcast feature allows RASdaemon to assist other applications to be fault-aware while retaining its original purpose.
This initial implementation does not define an intricate protocol for communication.
It is simple enough so that user-level applications can receive RAS events just by issuing a socket connection.

The RAS server is initialized through a new function, namely ras_server_start, if both conditions apply:

  • rasdaemon is compiled with the broadcast option;
  • rasdaemon is executed with the --broadcast option.

The initialization spawns a new thread to manage socket connections using poll.

Whenever RASdaemon detects a new event, a call to ras_server_broadcast is issued in the main thread.
This function will broadcast the events to the connected client sockets.
A mutex is used to synchronize both threads when writing to the file descriptors.

The server is closed with a call to ras_server_stop.
This will cancel the server thread and de-allocate the server structures.
A new call to ras_server_start should get the server back up again.

Signed-off-by: Alexandre de Limas Santana [email protected]

A simple server implementation using UNIX domain sockets.
The server is initialized on start-up upon a call to ras_server_start().
The initialization spawns a new thread to manage socket connections.
The main thread broadcasts RAS events upon detection to client processes using ras_server_broadcast().
The broadcast uses a mutex to synchronize with the server thread and avoid reading wrong file descriptors.
The Server can be closed with ras_server_stop(), which cancels the server thread.
Copy link
Owner

@mchehab mchehab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the comments for some parts of the code, it would be nice to also have at least an example code about how to use the report server feature. Perhaps you could create a "contrib/" dir and place a simple client to receive the messages there.

AC_DEFINE(HAVE_BROADCAST,1,"broadcast RAS events")
AC_SUBST([WITH_BROADCAST])
])

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this feature ("broadcast") doesn't look nice, as this term is more used for TCP/IP special messages that are sent to a range of machines.

The way this feature is mapped, it is actually adding an Unix socket for communication with other process(es).

@@ -0,0 +1,363 @@
#include <poll.h>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int ras_server_start(void) {
struct sockaddr_un addr;

server.socketfd = socket(AF_UNIX, SOCK_STREAM, 0);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be made a little more generic, also allowing reporting issues to other machines via TCP/IP.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mchehab, a remote report was one of my motivations for this functionality.
However, I chose to keep this out of RASdaemon as I thought it would not fit RASdaemon scope.

I envision that a local server feature is enough to enable a wide range of agents for handling the RAS event stream.
These agents can perform event pre-processing, reliability introspective analysis, and others without being tightly-coupled with RASdaemon.
In my opinion, the remote report feature is best suited for other software pieces that can determine their own protocol and/or the preferred system bus to do so.

What do you think?

if(server.socketfd == -1) {
log(ALL, LOG_WARNING, "Can't create local socket for broadcasting\n");
goto create_err;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Rasdaemon more or less follow Kernel coding style. We use:
foo(bar)
for functions, and:
if (foo)
(with a whitespace) for instructions like if, for, while, switch,...

ev->rwbs,
ev->cmd);
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit worried about maintainership with the above. As newer messages and types are added at rasdaemon, the above will become outdated with time. IMO, the better here would be to have those feature-specific functions inside the code where they are added, in a way that, if something changes there, such change will be reflected at the broadcasting code. Granted, ras-report.c has the same issue, but it is at least a single place where we have those things. Now that a new report daemon is added, we should unify, in order to make easier to maintain it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a couple of functions to create strings from events be enough?
Something in the lines of:

// Serialize a RAS event field by field, separated by 'sep'
void ras_to_str(int type, void* ev, char sep, char* buf);
// Get the name of a RAS event type
const char* ras_event_name(int type);

Is an error message on the default clause of a switch statement enough to warn developers of the need to serialize new event types?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants