APP-5248 use NetAppender in agent #26
Conversation
Looking good overall, and glad this is turning out to be mostly straightforward so far. I've put one main request inline (move some stuff into dial() so it simplifies and works with fast_start), but also wanted to address your question:
I checked that this doesn't double-log, but why doesn't it? the matching logger? I thought that rdk logs showed up in journalctl. is that via subprocess inheriting stderr, not via the logger?
Look at the logger.Write() (logging.go:60). Two things here. First, matches can be marked as "masking" meaning the output ONLY goes to the match channel, and is hidden from normal logging (this is currently only used for health checks.)
The second bit is that it looks for a specific timestamp format coming in. If it's already a "log" line with a specific timestamp format in it, then it just writes directly to stdout/stderr, instead of the logger. (And actually, if something doesn't have a timestamp, it logs it with a warning about it being "unstructured output") The intent was to avoid double-timestamps on logger-formatted lines coming from subsystems, but this will need to be reworked somehow to make sure these ARE now passed into the regular logger, in order to make it to the cloud, but we don't want to "nest" them with double-timestamps. I'm not sure if the best solution is to somehow parse and forward the original (more accurate) timestamp, or to just strip it all and re-log the original message, while letting the new logger timestamp as it sees fit. I expect milliseconds of difference at most in normal cases, but haven't fully thought through if there are edge cases.
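The timestamp gate described above can be sketched roughly as follows. This is an illustration, not the agent's actual code: `classifyLine`, `reStamp`, and the regex are hypothetical stand-ins (the real logger.go uses its own `dateRegex` and write path).

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// timestampRegex is an assumption for illustration: like the check described
// above, it looks for an RFC3339-style timestamp at the start of a line.
var timestampRegex = regexp.MustCompile(
	`^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?`)

// classifyLine sketches the decision above: lines that already carry a
// timestamp are treated as logger-formatted output; anything else is
// "unstructured output" that should be re-logged with a warning.
func classifyLine(line string) string {
	if timestampRegex.MatchString(line) {
		return "structured"
	}
	return "unstructured"
}

// reStamp sketches the "strip it all and re-log" option: drop the original
// timestamp and let the new logger stamp the message as it sees fit,
// accepting milliseconds of drift in the normal case.
func reStamp(line string) string {
	if m := timestampRegex.FindString(line); m != "" {
		return time.Now().UTC().Format(time.RFC3339) + line[len(m):]
	}
	return line
}

func main() {
	fmt.Println(classifyLine("2024-06-24T20:19:46Z INFO netlogger ready"))
	fmt.Println(classifyLine("panic: runtime error"))
	_ = reStamp("2024-06-24T20:19:46Z INFO netlogger ready")
}
```

The alternative mentioned above (parse and forward the original, more accurate timestamp) would keep the matched prefix instead of discarding it.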
cmd/viam-agent/main.go
Outdated
@@ -164,6 +164,16 @@ func main() {
	}
}

netAppender, err := manager.NetAppender(ctx)
This likely needs to be after the "fast start" section just below. Fast start is supposed to work entirely offline and as quickly as possible and this leads to dialing (and timeouts) vastly slowing that offline behavior I believe.
Might be best for this to live within the dial() function? E.g. it'll set up the network appender whenever anything first dials, regardless of source.
felt initially weird to put it in dial, then realized it has to be dial because we don't know when we will come online
done in ec27bc3
should I do anything to protect against a second netappender getting added on reconnect? I think this doesn't happen now, but could see it accidentally happening if we refactor in the future
(for example, do SetNetAppender instead of AddNetAppender)
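The "SetNetAppender instead of AddNetAppender" idea can be sketched as an idempotent setter. `Manager` and `netAppender` below are trimmed, hypothetical stand-ins for the real types, just to illustrate how a reconnect cannot register a second appender:

```go
package main

import (
	"errors"
	"fmt"
)

// netAppender stands in for logging.NetAppender; Manager is a trimmed
// stand-in for the agent Manager. Both are assumptions for illustration.
type netAppender struct{}

type Manager struct {
	netAppender *netAppender
}

// setNetAppender is idempotent: a re-dial after reconnect keeps the
// existing appender rather than accidentally attaching a duplicate.
func (m *Manager) setNetAppender(na *netAppender) error {
	if na == nil {
		return errors.New("nil appender")
	}
	if m.netAppender != nil {
		// already attached; keep the existing appender
		return nil
	}
	m.netAppender = na
	return nil
}

func main() {
	m := &Manager{}
	first := &netAppender{}
	_ = m.setNetAppender(first)
	_ = m.setNetAppender(&netAppender{}) // simulated reconnect: no-op
	fmt.Println(m.netAppender == first)
}
```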
manager.go
Outdated
func (m *Manager) NetAppender(ctx context.Context) (*logging.NetAppender, error) {
	if err := m.dial(ctx); err != nil {
		return nil, err
	}
	if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
		return nil, ErrMissingCloudCreds
	}
	return logging.NewNetAppender(
		&logging.CloudConfig{
			AppAddress: m.cloudAddr,
			ID:         m.partID,
			Secret:     m.cloudSecret,
		},
		m.conn,
	)
}
As noted in the other comment, I think this could be rolled into dial() to remove some repetition here. E.g. the check for valid cloud configs is already done by LoadConfig(), so no need to duplicate that again here (once it's in an internal function that can't get called out of order).
done in ec27bc3
kept the validity check because no harm in extra asserts
So some nits inline, and maybe a little bit of confusion about intent when I mentioned putting things in dial().
Big issue is that the (network) logging from subsystems still hasn't been dealt with. Or am I wrong and that's somehow working as intended?
// todo: ideally we would create the NetAppender ASAP (so logs are captured) and only *connect* it here.
if err := m.attachNetAppender(); err != nil {
	m.logger.Errorw("error attaching NetAppender", "err", err)
It feels like this should be returning the error here, not just logging it. What's the scenario you're expecting where this is non-fatal? From looking at the code, I think the ONLY possible error this could ever log would be if it failed to get the hostname (from the os.Hostname() call inside NewNetAppender()) which I'd consider a pretty critical failure. All the others are impossible to hit per my above comment.
this is an unattended device -- if it can possibly start up, it should
os.Hostname() could fail if procfs is unreadable https://cs.opensource.google/go/go/+/refs/tags/go1.22.4:src/os/sys_linux.go;l=40
(or the implementation of netappender could change such that it returns errors in more cases)
Failure to dial doesn't stop it from starting up. If it fails to dial, it will try to dial again in a minute, where we have a chance of netappender then also working as intended. If you just ignore the error, then we never get netappender (and its cloud logging) for the lifetime of the process. I feel the retry is preferable behavior.
if we get to this part of the function, m.client is non-nil and a reattempt would short-circuit anyway (in the m.client != nil guard above)
I think this gets easier once we do NetAppender delayed start in https://viam.atlassian.net/browse/APP-5330 -- okay to keep as-is for now, then improve when the NetAppender change is done?
Yeah, again, as long as it's done before we do OTHER new projects in agent (though no rush until then), I'm fine with that (and agree it'll be easier/cleaner.) I just don't want to let tech debt pile up.
yup agree, bumped 5330 to high and will submit a netappender PR sometime this week
manager.go
Outdated
// create and attach a NetAppender.
func (m *Manager) attachNetAppender() error {
	if m.conn == nil {
		return errors.New("NetAppender requires non-null conn")
	}
	if m.netAppender != nil {
		return errors.New("Manager already has non-nil netAppender")
	}
	if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
		return ErrMissingCloudCreds
	}
Does this need to be a separate function? When I suggested adding this to dial, I meant to simplify things. At the point this is called (bottom of dial()), all these error checks are impossible to hit, as dial() would've already returned an error.
Mostly questions, and a couple (opinionated) nits from me. (I'm not a fan of mutable globals :-) )
Main concern is that this still isn't logging early output from viam-server, which means we can't catch failures there, and similarly, log line parsing failures mean potentially losing information too.
Some open questions as well. Feel free to poke me on slack to discuss.
logger.go
Outdated
var dateRegex = regexp.MustCompile(`^[0-9]{4}-[0-9]{2}-[0-9]{2}T`)

// globalNetAppender receives matching logger writes if non-nil.
var globalNetAppender *logging.NetAppender
Can we avoid the global here and attach this to the actual logger object, just like we attach the logger itself?
Also, can we not do something like logger.Core().Write(parsedEntry) like you can with normal zap loggers? Or is that another thing we lost when we wrapped it?
Would/could it be easier to just update the logger code to allow a Write() function that takes a direct entry? Then everything would be a lot cleaner here.
oh you're right this is potentially way simpler than I'm making it
didn't realize the l.logger here was the globalLogger passed down from Manager
rdk logger does the appender stuff itself, so Core().Write() isn't an option, but we could make the Logger.log(*LogEntry) method public and get the behavior we need
if we do this, I'd have to replace the if dateMatched { os.Stdout.Write(p) } behavior above with the parseLog().entry() stuff -- are you okay with that?
Okay, after discussion, fine with this as long as APP-5330 is considered priority (for agent) before whatever next (significant) change we do here.
logger.go
Outdated
// TODO(RSDK-7895): these lines are sometimes multi-line.
if l.upload {
	if parsed := parseLog(p); parsed.valid() && globalNetAppender != nil {
		globalNetAppender.Write(parsed.entry(), nil) //nolint:errcheck,gosec
Why not check the error return here?
was concerned that if I logged the error, it would end up in the same function and hang forever on the lock at L87 above
options:
- create a non-netappender logger for this kind of error? it wouldn't be logged to network, but it would end up in journalctl
- delayed call in goroutine with some kind of recursion limit so it won't get into 'an error logging an error' situations
^ ignore that -- just realized MatchingLogger has a logging.Logger internally that it can use
done in 5c81cbf
logger.go
Outdated
var levels = map[string]zapcore.Level{
	"DEBUG":  zapcore.DebugLevel,
	"INFO":   zapcore.InfoLevel,
	"WARN":   zapcore.WarnLevel,
	"ERROR":  zapcore.ErrorLevel,
	"DPANIC": zapcore.DPanicLevel,
	"PANIC":  zapcore.PanicLevel,
	"FATAL":  zapcore.FatalLevel,
}

var colorCodeRegexp = regexp.MustCompile(`\x1b\[\d+m`)
Can these, (and any other globals) please go at the top in a var() or const() block for consistency? Or if they're only used once, just put them directly in the function?
done in bc17cb7
stdio := agent.NewMatchingLogger(s.logger, false, false)
stderr := agent.NewMatchingLogger(s.logger, true, false)
Doesn't this mean it'll never upload logs for viam-server? Or am I missing something?
To be clear, I think it's vital that we catch early logs here to diagnose startup problems, but then we STOP uploading logs after it's passing health checks (and should be uploading its own.)
done in 1190ea5 but if we get the logger.Core().Write() stuff above working, this will change again, so don't look until we've made a design decision
note that if those early logs happen in the pre-online section, they will get missed anyway
So this now has the opposite problem. It no longer LOCALLY logs viam-server output at all. I just tested this with a quick build on my own pi.
Local logging needs to always happen (as it'd never be duplicated.) It's only the CLOUD logging where we need to log early startup of viam-server (in case it fails), then stop once it's online and doing its own cloud logging.
// parsedLog is a lightweight log structure we parse from subsystem logs.
// Another approach for capturing logs from subsystems is to pass around
// LogEntry or opentelemetry structs.
Ugh, this feels so... "wrong" to have to do it this way. Sorry, not saying you did it wrong, just, there should be a cleaner way in our logging system to handle this. A zap-style log line parser just doesn't feel like it should be living directly in the agent code. Also shouldn't be having to manually write to netappender either.
Do you think some of this would be better in the logging package itself? (Along with adding a Write(zapcore.Entry) function?)
yeah agree re kludge
it's not a completely new problem; would imagine stackdriver, splunk, fluentd et al have to deal with this at some point as aggregators. wonder if there are standard tools for this task
no preference about where it lives. if you want to move it can that be a fast-follow rather than a blocker?
I feel it needs more thought, but it's more important to get this out... so yeah, doesn't have to be a blocker on this PR.
logger.go
Outdated
func (p parsedLog) entry() zapcore.Entry {
	level, ok := levels[string(p.level)]
	if !ok {
		level = zapcore.DebugLevel
An unknown level means the parsing has gone wrong, doesn't it? Feels like that should be Warn or higher to me so it's not masked.
done in 1cf86b7
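The agreed-on fix (unknown level means the parse went wrong, so don't hide it at debug) can be sketched as below. The real code maps to zapcore.Level; this stand-alone sketch uses plain strings so it is self-contained, and the fallback value is the point:

```go
package main

import "fmt"

// levels mirrors the zapcore level-name set from the code above,
// using strings here only to keep the sketch dependency-free.
var levels = map[string]string{
	"DEBUG": "debug", "INFO": "info", "WARN": "warn", "ERROR": "error",
	"DPANIC": "dpanic", "PANIC": "panic", "FATAL": "fatal",
}

// levelFor applies the review fix: an unknown level name means parsing
// went wrong, so fall back to warn (not debug) so the line isn't masked
// by level filtering.
func levelFor(name string) string {
	if lvl, ok := levels[name]; ok {
		return lvl
	}
	return "warn" // surface parse problems instead of hiding them
}

func main() {
	fmt.Println(levelFor("INFO"))
	fmt.Println(levelFor("GARBAGE"))
}
```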
	return nil
}

func parseLog(line []byte) *parsedLog {
I feel this needs to return an error instead of using the getIndexOrNil() stuff. If we can't parse the line, I think it's important to consider that an error and write the WHOLE line to the log. That is, revert to the "Unstructured Log Output" behavior that previously existed. Otherwise, a parsing failure means we could lose critical information when it's likely the most critical to know it (e.g. something unexpected is happening).
done in eb09411 + added tests to make sure the parser fallback behaves as expected
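The error-returning parser plus whole-line fallback can be sketched like this. `parsedLog` here is a simplified stand-in (the assumed tab-separated timestamp/level/name/message layout is an illustration, not the real parser's format):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parsedLog is a simplified stand-in for the real parsed structure.
type parsedLog struct {
	ts, level, name, msg string
}

// parseLine returns an error for any line that doesn't split into the
// expected fields; the caller then falls back to logging the WHOLE raw
// line as unstructured output, so nothing is lost on parse failure.
func parseLine(line string) (*parsedLog, error) {
	parts := strings.SplitN(line, "\t", 4)
	if len(parts) < 4 {
		return nil, errors.New("unstructured log output")
	}
	return &parsedLog{ts: parts[0], level: parts[1], name: parts[2], msg: parts[3]}, nil
}

func main() {
	if p, err := parseLine("2024-06-24T20:19:46Z\tINFO\tnetlogger\tready"); err == nil {
		fmt.Println(p.level, p.msg)
	}
	if _, err := parseLine("panic: oops"); err != nil {
		// fallback path: keep the raw line intact
		fmt.Println("fallback:", "panic: oops")
	}
}
```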
Just in case this was waiting on me
manager.go
Outdated
if err := m.dial(ctx); err != nil {
	return nil, err
}
if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
Not the expert here, but I'd expect if m.cloudAddr was nil, dialing would fail?
Also, if dialing succeeds, but this conditional is hit, does the connection have to be closed?
I think the code is correct this way, but reversing the if-checks would have less states to consider.
not waiting on you -- was in a holding pattern until we decided what to do about Logger.Write PR, now rehabbing
okay I think this now has all desired log types (see screenshot):
this is kind of ugly -- Logger.Write didn't fully solve the problem we wanted it to. I opened a ticket APP-5330 to add 'delayed start' functionality, which should allow us to go back to the old way + simplify a lot of this
retested after cleanup in 667a0c7; confirmed output still has the expected log types.
After discussions, I'm okay with postponing some of the "correctness" things until the rdk logger is improved to allow delays and stuff.
The only remaining blocker IMHO is that viam-server logs are NEVER written locally now. These need to always be logged locally.
// we check uploadAll because some subprocesses have their own netlogger which will
// upload structured logs. (But won't upload unmatched logs).
entry := parseLog(p).entry()
l.logger.Write(&logging.LogEntry{Entry: entry})
This needs to ALWAYS happen, not just when uploadAll is set. The logger is writing the local log too, and that local log needs to always be written.
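The routing being asked for here can be sketched with a small helper. `writeLine`, `localLog`, and `cloudLog` are hypothetical stand-ins for the wrapped logger and the NetAppender path; the point is that the local write is unconditional while cloud upload is gated:

```go
package main

import "fmt"

// writeLine sketches the requested behavior: local logging ALWAYS happens
// (it's never duplicated), while cloud upload is conditional because
// subsystems with their own netlogger already upload structured lines.
func writeLine(line string, uploadAll bool, localLog, cloudLog func(string)) {
	localLog(line) // unconditional: the local log must always be written
	if uploadAll {
		cloudLog(line) // conditional: only during early startup, etc.
	}
}

func main() {
	local, cloud := 0, 0
	logLocal := func(string) { local++ }
	logCloud := func(string) { cloud++ }
	writeLine("structured subsystem line", false, logLocal, logCloud)
	writeLine("early startup line", true, logLocal, logCloud)
	fmt.Println(local, cloud)
}
```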
Cool, the global is gone, and what you're doing is cleaner. Two small comment things inline, but the one block is a nil pointer panic.
logger.go
Outdated
matchers     map[string]matcher
defaultError bool
// if uploadAll is false, only send unstructured log lines to globalNetAppender.
[nit] globalNetAppender was removed
edited in 2137934
manager.go
Outdated
@@ -231,6 +259,8 @@ func (m *Manager) CloseAll() {
	m.connMu.Lock()
	defer m.connMu.Unlock()

	m.netAppender.Close()
This causes a panic when offline.
Need to check for nil before Close() (see the if m.conn != nil block below.)
Also, need to SET it to nil after it's closed (for correctness, even though we currently never restart.)
thanks for catch -- ran into this while removing netappender and did not think about close case
zooming out, would be good to have 'can it start and stop cleanly without network connection' as a CI test case
this is done in 2137934
tested that it can close with wifi disconnected: netappender takes like 20 seconds to fully exit (that's ticketed here), but no crash
2024-06-24T20:19:46.828Z INFO netlogger logging/net_appender.go:233 error logging to network: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2024-06-24T20:19:56.834Z INFO netlogger logging/net_appender.go:233 error logging to network: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2024-06-24T20:20:01.838Z INFO netlogger logging/net_appender.go:233 error logging to network: rpc error: code = DeadlineExceeded desc = context deadline exceeded
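The nil-guard fix discussed above can be sketched as follows. `Manager`, `conn`, and `netAppender` are trimmed stand-ins, not the real types; only the fields CloseAll touches are shown:

```go
package main

import "fmt"

type conn struct{}

type netAppender struct{ closed bool }

func (n *netAppender) Close() { n.closed = true }

// Manager is a trimmed stand-in showing only what CloseAll needs.
type Manager struct {
	conn        *conn
	netAppender *netAppender
}

// CloseAll guards Close() with a nil check (an offline device never
// attached an appender) and nils the field afterward so a future
// restart could re-attach cleanly.
func (m *Manager) CloseAll() {
	if m.netAppender != nil {
		m.netAppender.Close()
		m.netAppender = nil
	}
	if m.conn != nil {
		m.conn = nil
	}
}

func main() {
	offline := &Manager{} // both fields nil: must not panic
	offline.CloseAll()

	online := &Manager{conn: &conn{}, netAppender: &netAppender{}}
	online.CloseAll()
	fmt.Println(online.netAppender == nil)
}
```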
if m.conn == nil {
	return errors.New("NetAppender requires non-null conn")
}
if m.netAppender != nil {
	return errors.New("Manager already has non-nil netAppender")
}
Theoretically this needs a way to reattach after we re-dial in various scenarios. E.g. a machine is offline, and comes back online with a different IP.
That said, no point in bothering YET because our rpc dialer itself doesn't detect that scenario, and maintains the old/dead connection (nearly) indefinitely. (I think the kernel will eventually kill it after 2+ hours.)
Can you put a TODO here linking to the following ticket?
https://viam.atlassian.net/browse/RSDK-7936
done in 2137934
LGTM! Sorry this turned into such a large amount of back and forth!
Checklist

What changed
- globalNetAppender for MatchingLogger + subsystem log parse / upload

Why
- Make agent logs visible in webapp.

Reviewer questions
- I checked that this doesn't double-log, but why doesn't it? the matching logger? I thought that rdk logs showed up in journalctl. is that via subprocess inheriting stderr, not via the logger? -> I get it now

Follow-up changes
- NetAppender.Close takes a long time (10 second timeout). Agent + RDK are both doing this, so offline shutoff now takes 20+ seconds. Ticketed RSDK-7866