
APP-5248 use NetAppender in agent #26

Merged
merged 24 commits into main from netappender-2 on Jun 24, 2024

Conversation

@abe-winter (Member) commented Jun 10, 2024

Checklist

What changed

  • replace golog + zap with rdk/logging everywhere
  • do NetAppender startup in Manager.dial
  • globalNetAppender for MatchingLogger + subsystem log parse / upload

Why

Make agent logs visible in webapp.

Reviewer questions

  • I checked that this doesn't double-log, but why doesn't it? the matching logger? I thought that rdk logs showed up in journalctl. is that via subprocess inheriting stderr, not via the logger? -> I get it now

Follow-up changes

  • add netappender delayed start APP-5330 which will let us undo the uglier changes here
  • When the device is offline, NetAppender.Close takes a long time (10 second timeout). Agent + RDK are both doing this, so offline shutoff now takes 20+ seconds. Ticketed RSDK-7866
  • should cloud logging be configurable? (i.e. turn it off). where?
  • rate limiting

@Otterverse (Member) left a comment

Looking good overall, and glad this is turning out to be mostly straightforward so far. I've put one main request inline (move some stuff into dial() so it simplifies and works with fast_start), but also wanted to address your question:

I checked that this doesn't double-log, but why doesn't it? the matching logger? I thought that rdk logs showed up in journalctl. is that via subprocess inheriting stderr, not via the logger?

Look at the logger.Write() (logging.go:60). Two things here. First, matches can be marked as "masking" meaning the output ONLY goes to the match channel, and is hidden from normal logging (this is currently only used for health checks.)

The second bit is that it looks for a specific timestamp format coming in. If it's already a "log" line with that timestamp format in it, then it just writes directly to stdout/stderr instead of the logger. (And if something doesn't have a timestamp, it logs it with a warning about it being "unstructured output".) The intent was to avoid double-timestamps on logger-formatted lines coming from subsystems. This will need to be reworked somehow so those lines ARE now passed into the regular logger (in order to make it to the cloud) without "nesting" them under double-timestamps. I'm not sure if the best solution is to somehow parse and forward the original (more accurate) timestamp, or to just strip it all and re-log the original message, letting the new logger timestamp it as it sees fit. I expect milliseconds of difference at most in normal cases, but haven't fully thought through whether there are edge cases.
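
For orientation, a rough, self-contained sketch of the Write behavior described above (masking matchers plus timestamp detection). The matcher type and field names are illustrative stand-ins, not the agent's exact code:

package main

import (
	"fmt"
	"os"
	"regexp"
)

var dateRegex = regexp.MustCompile(`^[0-9]{4}-[0-9]{2}-[0-9]{2}T`)

// matcher is an illustrative stand-in: a masking match goes ONLY to its channel
// and is hidden from normal logging (currently used for health checks).
type matcher struct {
	regex   *regexp.Regexp
	channel chan []byte
	masking bool
}

type matchingLogger struct {
	matchers map[string]matcher
}

func (l *matchingLogger) Write(p []byte) (int, error) {
	for _, m := range l.matchers {
		if m.regex.Match(p) {
			m.channel <- p // delivery simplified for the sketch
			if m.masking {
				return len(p), nil // hidden from normal output
			}
		}
	}
	// Lines that already carry a log timestamp pass through verbatim to avoid
	// double-timestamping; anything else is re-logged as unstructured output.
	if dateRegex.Match(p) {
		return os.Stdout.Write(p)
	}
	fmt.Printf("WARN unstructured output: %s", p)
	return len(p), nil
}

func main() {
	health := make(chan []byte, 1)
	l := &matchingLogger{matchers: map[string]matcher{
		"health": {regex: regexp.MustCompile(`healthy`), channel: health, masking: true},
	}}
	l.Write([]byte("2024-06-10T12:00:00Z\tINFO\trobot_server\tstartup complete\n")) // passed through
	l.Write([]byte("stray stderr line from a subprocess\n"))                        // re-logged as unstructured
	l.Write([]byte("still healthy\n"))                                              // masked: channel only
	fmt.Printf("masked line: %q\n", <-health)
}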

@@ -164,6 +164,16 @@ func main() {
}
}

netAppender, err := manager.NetAppender(ctx)
@Otterverse (Member), Jun 10, 2024

This likely needs to be after the "fast start" section just below. Fast start is supposed to work entirely offline and as quickly as possible, and I believe this leads to dialing (and timeouts) vastly slowing that offline behavior.

Might be best for this to live within the dial() function? E.g. it'll set up the network appender whenever anything first dials, regardless of source.

Member Author

felt initially weird to put it in dial, then realized it has to be dial because we don't know when we will come online

done in ec27bc3

should I do anything to protect against a second netappender getting added on reconnect? I think this doesn't happen now, but could see it accidentally happening if we refactor in the future

(for example, do SetNetAppender instead of AddNetAppender)
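
A self-contained sketch of the attach-on-dial shape discussed here, with the double-attach guard that answers the SetNetAppender question above. The Manager is a minimal stand-in; the real gRPC dial and credential handling are elided:

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

type netAppender struct{}

type Manager struct {
	connMu      sync.Mutex
	conn        any
	netAppender *netAppender
}

// attachNetAppender refuses to run twice, so a reconnect can never add a
// second appender even if dial() is called again in the future.
func (m *Manager) attachNetAppender() error {
	if m.conn == nil {
		return errors.New("NetAppender requires non-nil conn")
	}
	if m.netAppender != nil {
		return errors.New("Manager already has non-nil netAppender")
	}
	m.netAppender = &netAppender{}
	return nil
}

func (m *Manager) dial(ctx context.Context) error {
	m.connMu.Lock()
	defer m.connMu.Unlock()
	if m.conn != nil {
		return nil // already online: short-circuits before a second attach
	}
	m.conn = struct{}{} // placeholder for the real gRPC dial
	if err := m.attachNetAppender(); err != nil {
		// non-fatal on an unattended device: keep starting up, just note the failure
		fmt.Println("error attaching NetAppender:", err)
	}
	return nil
}

func main() {
	m := &Manager{}
	_ = m.dial(context.Background())
	_ = m.dial(context.Background()) // second dial is a no-op, so only one appender exists
}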

manager.go Outdated
Comment on lines 100 to 115
func (m *Manager) NetAppender(ctx context.Context) (*logging.NetAppender, error) {
	if err := m.dial(ctx); err != nil {
		return nil, err
	}
	if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
		return nil, ErrMissingCloudCreds
	}
	return logging.NewNetAppender(
		&logging.CloudConfig{
			AppAddress: m.cloudAddr,
			ID:         m.partID,
			Secret:     m.cloudSecret,
		},
		m.conn,
	)
}
@Otterverse (Member), Jun 10, 2024

As noted in the other comment, I think this could be rolled into dial() to remove some repetition here. E.g. the check for valid cloud configs is already done by LoadConfig(), so no need to duplicate that again here (once it's in an internal function that can't get called out of order).

Member Author

done in ec27bc3

kept the validity check because no harm in extra asserts

subsystems/viamserver/viamserver.go (outdated comment, resolved)
@Otterverse (Member) left a comment

So some nits inline, and maybe a little bit of confusion about intent when I mentioned putting things in dial().

Big issue is that the (network) logging from subsystems still hasn't been dealt with. Or am I wrong and that's somehow working as intended?

manager.go (comment resolved)
manager.go (outdated comment, resolved)

// todo: ideally we would create the NetAppender ASAP (so logs are captured) and only *connect* it here.
if err := m.attachNetAppender(); err != nil {
m.logger.Errorw("error attaching NetAppender", "err", err)
Member

It feels like this should be returning the error here, not just logging it. What's the scenario you're expecting where this is non-fatal? From looking at the code, I think the ONLY possible error this could ever log would be if it failed to get the hostname (from the os.Hostname() call inside NewNetAppender()), which I'd consider a pretty critical failure. All the others are impossible to hit per my above comment.

Member Author

this is an unattended device -- if it can possibly start up, it should

os.Hostname could fail if procfs is unreadable https://cs.opensource.google/go/go/+/refs/tags/go1.22.4:src/os/sys_linux.go;l=40

(or the implementation of netappender could change such that it returns errors in more cases)

Member

Failure to dial doesn't stop it from starting up. If it fails to dial, it will try to dial again in a minute, where we have a chance of netappender then also working as intended. If you just ignore the error, then we never get netappender (and its cloud logging) for the lifetime of the process. I feel the retry is preferable behavior.

Member Author

if we get to this part of the function, m.client is non-nil and a reattempt would short-circuit anyway (in the m.client != nil guard above)

I think this gets easier once we do NetAppender delayed start in https://viam.atlassian.net/browse/APP-5330 -- okay to keep as-is for now, then improve when the NetAppender change is done?

Member

Yeah, again, as long as it's done before we do OTHER new projects in agent (though no rush until then), I'm fine with that (and agree it'll be easier/cleaner.) I just don't want to let tech debt pile up.

Member Author

yup agree, bumped 5330 to high and will submit a netappender PR sometime this week

manager.go (outdated comment, resolved)
subsystems/viamserver/viamserver.go (outdated comment, resolved)
manager.go Outdated
Comment on lines 101 to 111
// create and attach a NetAppender.
func (m *Manager) attachNetAppender() error {
	if m.conn == nil {
		return errors.New("NetAppender requires non-null conn")
	}
	if m.netAppender != nil {
		return errors.New("Manager already has non-nil netAppender")
	}
	if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
		return ErrMissingCloudCreds
	}
Member

Does this need to be a separate function? When I suggested adding this to dial, I meant to simplify things. At the point this is called (bottom of dial()), all these error checks are impossible to hit, as dial() would've already returned an error.

@abe-winter (Member, Author)

screenshot of subsystem logs flowing through


@abe-winter changed the title from "RSDK-7786 use NetAppender in agent" to "APP-5248 use NetAppender in agent" on Jun 12, 2024
@Otterverse (Member) left a comment

Mostly questions, and a couple (opinionated) nits from me. (I'm not a fan of mutable globals :-) )

Main concern is that this still isn't logging early output from viam-server, which means we can't catch failures there, and similarly, log line parsing failures mean potentially losing information too.

Some open questions as well. Feel free to poke me on slack to discuss.

logger.go Outdated
var dateRegex = regexp.MustCompile(`^[0-9]{4}-[0-9]{2}-[0-9]{2}T`)

// globalNetAppender receives matching logger writes if non-nil.
var globalNetAppender *logging.NetAppender
Member

Can we avoid the global here and attach this to the actual logger object, just like we attach the logger itself?

Also, can we not do something like logger.Core().Write(parsedEntry) like you can with normal zap loggers? Or is that another thing we lost when we wrapped it?

Would/could it be easier to just update the logger code to allow a Write() function that takes a direct entry? Then everything would be a lot cleaner here.

@abe-winter (Member, Author), Jun 17, 2024

oh you're right this is potentially way simpler than I'm making it

didn't realize the l.logger here was the globalLogger passed down from Manager

rdk logger does the appender stuff itself, so Core().Write() isn't an option, but we could make the Logger.log(*LogEntry) method public and get the behavior we need

if we do this, I'd have to replace the if dateMatched { os.Stdout.Write(p) } behavior above with the parseLog().entry() stuff -- are you okay with that?

Member

Okay, after discussion, fine with this as long as APP-5330 is considered priority (for agent) before whatever next (significant) change we do here.
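
A sketch of the direction settled on here: hand the parsed line back to the logger that already owns the appenders, instead of writing to a package-level NetAppender. It assumes logging.LogEntry wraps a zapcore.Entry and the logger exposes a Write method, as the diff later in this thread does; the helper name is hypothetical:

package agent

import (
	"time"

	"go.uber.org/zap/zapcore"
	"go.viam.com/rdk/logging"
)

// forwardParsedLine (hypothetical) re-logs a line parsed from a subsystem's
// output through the rdk logger, whose own appenders (including the
// NetAppender) then handle local and cloud delivery.
func forwardParsedLine(logger logging.Logger, ts time.Time, level zapcore.Level, name, msg string) {
	logger.Write(&logging.LogEntry{Entry: zapcore.Entry{
		Time:       ts,
		Level:      level,
		LoggerName: name,
		Message:    msg,
	}})
}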

logger.go Outdated
	// TODO(RSDK-7895): these lines are sometimes multi-line.
	if l.upload {
		if parsed := parseLog(p); parsed.valid() && globalNetAppender != nil {
			globalNetAppender.Write(parsed.entry(), nil) //nolint:errcheck,gosec
Member

Why not check the error return here?

@abe-winter (Member, Author), Jun 17, 2024

was concerned that if I logged the error, it would end up in the same function and hang forever on the lock at L87 above

options:

  • create a non-netappender logger for this kind of error? it wouldn't be logged to network, but it would end up in journalctl
  • delayed call in goroutine with some kind of recursion limit so it won't get into 'an error logging an error' situations

^ ignore that -- just realized MatchingLogger has a logging.Logger internally that it can use

Member Author

done in 5c81cbf

logger.go Outdated
Comment on lines 131 to 141
var levels = map[string]zapcore.Level{
	"DEBUG":  zapcore.DebugLevel,
	"INFO":   zapcore.InfoLevel,
	"WARN":   zapcore.WarnLevel,
	"ERROR":  zapcore.ErrorLevel,
	"DPANIC": zapcore.DPanicLevel,
	"PANIC":  zapcore.PanicLevel,
	"FATAL":  zapcore.FatalLevel,
}

var colorCodeRegexp = regexp.MustCompile(`\x1b\[\d+m`)
Member

Can these, (and any other globals) please go at the top in a var() or const() block for consistency? Or if they're only used once, just put them directly in the function?

Member Author

done in bc17cb7

Comment on lines +80 to +81
stdio := agent.NewMatchingLogger(s.logger, false, false)
stderr := agent.NewMatchingLogger(s.logger, true, false)
Member

Doesn't this mean it'll never upload logs for viam-server? Or am I missing something?

To be clear, I think it's vital that we catch early logs here to diagnose startup problems, but then we STOP uploading logs after it's passing health checks (and should be uploading its own.)

Member Author

done in 1190ea5 but if we get the logger.Core().Write() stuff above working, this will change again, so don't look until we've made a design decision

note that if those early logs happen in the pre-online section, they will get missed anyway

Member

So this now has the opposite problem. It no longer LOCALLY logs viam-server output at all. I just tested this with a quick build on my own pi.

Local logging needs to always happen (as it'd never be duplicated.) It's only the CLOUD logging where we need to log early startup of viam-server (in case it fails), then stop once it's online and doing its own cloud logging.



Comment on lines +116 to +118
// parsedLog is a lightweight log structure we parse from subsystem logs.
// Another approach for capturing logs from subsystems is to pass around
// LogEntry or opentelemetry structs.
Member

Ugh, this feels so... "wrong" to have to do it this way. Sorry, not saying you did it wrong, just, there should be a cleaner way in our logging system to handle this. A zap-style log line parser just doesn't feel like it should be living directly in the agent code. Also shouldn't be having to manually write to netappender either.

Do you think some of this would be better in the logging package itself? (Along with adding a Write(zapcore.Entry) function?)

Member Author

yeah agree re kludge

it's not a completely new problem; would imagine stackdriver, splunk, fluentd et al have to deal with this at some point as aggregators. wonder if there are standard tools for this task

no preference about where it lives. if you want to move it can that be a fast-follow rather than a blocker?

Member

I feel it needs more thought, but it's more important to get this out... so yeah, doesn't have to be a blocker on this PR.

logger.go Outdated
func (p parsedLog) entry() zapcore.Entry {
	level, ok := levels[string(p.level)]
	if !ok {
		level = zapcore.DebugLevel
Member

An unknown level means the parsing has gone wrong, doesn't it? Feels like that should be Warn or higher to me so it's not masked.

Member Author

done in 1cf86b7
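
The adjusted fallback might look like this excerpt from entry(), with the rest of the function unchanged:

	level, ok := levels[string(p.level)]
	if !ok {
		// an unknown level means parsing went wrong; surface it at Warn instead of Debug
		level = zapcore.WarnLevel
	}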

return nil
}

func parseLog(line []byte) *parsedLog {
Member

I feel this needs to return an error instead of using the getIndexOrNil() stuff. If we can't parse the line, I think it's important to consider that an error and write the WHOLE line to the log (that is, revert to the "Unstructured Log Output" behavior that previously existed). Otherwise, a parsing failure means we could lose critical information when it's likely the most critical to know it (e.g. something unexpected is happening).

Member Author

done in eb09411 + added tests to make sure the parser fallback behaves as expected
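
A self-contained sketch of that fallback behavior: a parser that returns an error instead of partial fields, with the caller reverting to unstructured output on failure. The field layout follows the tab-separated log lines shown later in this PR; the names are illustrative rather than the committed code:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"

	"go.uber.org/zap/zapcore"
)

type parsedLog struct {
	when       time.Time
	level      zapcore.Level
	loggerName string
	caller     string
	msg        string
}

var levels = map[string]zapcore.Level{
	"DEBUG": zapcore.DebugLevel, "INFO": zapcore.InfoLevel, "WARN": zapcore.WarnLevel,
	"ERROR": zapcore.ErrorLevel, "DPANIC": zapcore.DPanicLevel,
	"PANIC": zapcore.PanicLevel, "FATAL": zapcore.FatalLevel,
}

// parseLog expects "timestamp\tLEVEL\tlogger\tcaller\tmessage" and fails
// loudly on anything else, so the caller can keep the whole line.
func parseLog(line string) (*parsedLog, error) {
	parts := strings.SplitN(strings.TrimRight(line, "\n"), "\t", 5)
	if len(parts) < 5 {
		return nil, errors.New("expected 5 tab-separated fields")
	}
	when, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return nil, err
	}
	level, ok := levels[parts[1]]
	if !ok {
		level = zapcore.WarnLevel // unknown level: don't mask it at Debug
	}
	return &parsedLog{when: when, level: level, loggerName: parts[2], caller: parts[3], msg: parts[4]}, nil
}

func main() {
	lines := []string{
		"2024-06-24T20:19:46.828Z\tINFO\tnetlogger\tlogging/net_appender.go:233\terror logging to network\n",
		"ALSA lib pulse.c:242:(pulse_connect) PulseAudio: Unable to connect: Connection refused\n",
	}
	for _, line := range lines {
		parsed, err := parseLog(line)
		if err != nil {
			// parsing failed: keep the WHOLE line rather than losing information
			fmt.Printf("unstructured output: %s", line)
			continue
		}
		fmt.Printf("%s %s %s %s\n", parsed.when.Format(time.RFC3339), parsed.level.CapitalString(), parsed.loggerName, parsed.msg)
	}
}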

@dgottlieb (Member) left a comment

Just in case this was waiting on me

manager.go Outdated
	if err := m.dial(ctx); err != nil {
		return nil, err
	}
	if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
Member

Not the expert here, but I'd expect if m.cloudAddr was nil, dialing would fail?

Also, if dialing succeeds, but this conditional is hit, does the connection have to be closed?

I think the code is correct this way, but reversing the if-checks would leave fewer states to consider.
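
For illustration, the reordered version might look like the snippet below: validate the cloud config before dialing, so a successful dial never needs to be unwound. This is the same NetAppender() code quoted earlier in the conversation, with the two checks swapped:

func (m *Manager) NetAppender(ctx context.Context) (*logging.NetAppender, error) {
	if m.cloudAddr == "" || m.partID == "" || m.cloudSecret == "" {
		return nil, ErrMissingCloudCreds
	}
	if err := m.dial(ctx); err != nil {
		return nil, err
	}
	return logging.NewNetAppender(
		&logging.CloudConfig{
			AppAddress: m.cloudAddr,
			ID:         m.partID,
			Secret:     m.cloudSecret,
		},
		m.conn,
	)
}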

@abe-winter (Member, Author) commented Jun 24, 2024 via email

@abe-winter (Member, Author) commented Jun 24, 2024

okay I think this now has all desired log types (see screenshot):

  1. agent's own logs
  2. subsystem logs
  3. viam logs appear + are not doubled
  4. unstructured errors

this is kind of ugly -- Logger.Write didn't fully solve the problem we wanted it to. I opened a ticket APP-5330 to add 'delayed start' functionality, which should allow us to go back to the old way + simplify a lot of this

logging screenshot

@abe-winter (Member, Author) commented Jun 24, 2024

retested after cleanup in 667a0c7. confirmed output still has:

  • both types of subsystem: subsystem.go + viamserver.go
  • subsystem logs re-logged from subprocess
  • non-doubled viam-server logs
  • unstructured output

log:

6/24/2024, 2:43:58 PM info viam-agent agent/subsystem.go:483 agent-syscfg started
6/24/2024, 2:43:58 PM info agent-syscfg viam-agent-syscfg/main.go:77 agent-syscfg startup complete
6/24/2024, 2:43:53 PM info robot_server impl/local_robot.go:1230 Robot (re)configured
6/24/2024, 2:43:53 PM error viam-agent unstructured output: ALSA lib pulse.c:242:(pulse_connect) PulseAudio: Unable to connect: Connection refused
6/24/2024, 2:43:50 PM info viam-agent viamserver/viamserver.go:76 Starting viam-server 

@Otterverse (Member) left a comment

After discussions, I'm okay with postponing some of the "correctness" things until the rdk logger is improved to allow delays and stuff.

The only remaining blocker IMHO is that viam-server logs are NEVER written locally now. These need to always be logged locally.

// we check uploadAll because some subprocesses have their own netlogger which will
// upload structured logs. (But won't upload unmatched logs).
entry := parseLog(p).entry()
l.logger.Write(&logging.LogEntry{Entry: entry})
Member

This needs to ALWAYS happen, not just when uploadAll is set. The logger is writing the local log too, and that local log needs to always be written.
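
A self-contained sketch of that policy: local logging always happens, and cloud upload is skipped only for structured lines from subsystems that already run their own NetAppender. The types and the uploadToCloud hook are illustrative, not the agent's API:

package main

import (
	"fmt"
	"regexp"
)

var dateRegex = regexp.MustCompile(`^[0-9]{4}-[0-9]{2}-[0-9]{2}T`)

type matchingLogger struct {
	uploadAll bool // true for subsystems without their own cloud logger
}

func (l *matchingLogger) logLocally(line []byte)    { fmt.Printf("local: %s", line) }
func (l *matchingLogger) uploadToCloud(line []byte) { fmt.Printf("cloud: %s", line) }

func (l *matchingLogger) Write(p []byte) (int, error) {
	// Local logging always happens; the subsystem never duplicates it.
	l.logLocally(p)

	// Cloud upload is conditional: structured lines from a subsystem that runs
	// its own NetAppender (viam-server once it is online) are skipped so they
	// are not doubled, while unstructured lines are always worth uploading.
	structured := dateRegex.Match(p)
	if l.uploadAll || !structured {
		l.uploadToCloud(p)
	}
	return len(p), nil
}

func main() {
	viamserver := &matchingLogger{uploadAll: false}
	viamserver.Write([]byte("2024-06-24T14:43:53Z\tINFO\trobot_server\tRobot (re)configured\n")) // local only
	viamserver.Write([]byte("ALSA lib pulse.c: unstructured complaint\n"))                       // local + cloud
}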

@Otterverse (Member) left a comment

Cool, the global is gone, and what you're doing is cleaner. Two small comment things inline, but the one block is a nil pointer panic.

logger.go Outdated
matchers map[string]matcher
defaultError bool
// if uploadAll is false, only send unstructured log lines to globalNetAppender.
Member

[nit] globalNetAppender was removed

@abe-winter (Member, Author), Jun 24, 2024

edited in 2137934

manager.go Outdated
@@ -231,6 +259,8 @@ func (m *Manager) CloseAll() {
m.connMu.Lock()
defer m.connMu.Unlock()

m.netAppender.Close()
Member

This causes a panic when offline.

Need to check for nil before Close() (see if m.conn != nil block below.)

Also, need to SET it to nil after it's closed (for correctness, even though we currently never restart.)

@abe-winter (Member, Author), Jun 24, 2024

thanks for the catch -- ran into this while removing netappender and did not think about the close case

zooming out, would be good to have 'can it start and stop cleanly without network connection' as a CI test case

this is done in 2137934

tested that it can close with wifi disconnected: netappender takes like 20 seconds to fully exit (that's ticketed here), but no crash

2024-06-24T20:19:46.828Z	INFO	netlogger	logging/net_appender.go:233	error logging to network: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2024-06-24T20:19:56.834Z	INFO	netlogger	logging/net_appender.go:233	error logging to network: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2024-06-24T20:20:01.838Z	INFO	netlogger	logging/net_appender.go:233	error logging to network: rpc error: code = DeadlineExceeded desc = context deadline exceeded
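
The shutdown fix described above might look like this excerpt from CloseAll(), mirroring the existing if m.conn != nil pattern:

	if m.netAppender != nil {
		m.netAppender.Close()
		// set to nil after closing, for correctness even though we currently never restart
		m.netAppender = nil
	}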

Comment on lines +101 to +106
	if m.conn == nil {
		return errors.New("NetAppender requires non-null conn")
	}
	if m.netAppender != nil {
		return errors.New("Manager already has non-nil netAppender")
	}
Member

Theoretically this needs a way to reattach after we re-dial in various scenarios. E.g. a machine is offline, and comes back online with a different IP.

That said, no point in bothering YET because our rpc dialer itself doesn't detect that scenario, and maintains the old/dead connection (nearly) indefinitely. (I think the kernel will eventually kill it after 2+ hours.)

Can you put a TODO here linking to the following ticket?
https://viam.atlassian.net/browse/RSDK-7936

@abe-winter (Member, Author), Jun 24, 2024

done in 2137934

@Otterverse (Member) left a comment

LGTM! Sorry this turned into such a large amount of back and forth!

@abe-winter merged commit a7d7136 into main on Jun 24, 2024 (1 check passed)
@abe-winter deleted the netappender-2 branch on June 24, 2024