Skip to content

Commit

Permalink
[APP-4862] Improve lockfile, add faststart mode, add synfs calls, cle…
Browse files Browse the repository at this point in the history
…anup (#21)
  • Loading branch information
Otterverse authored May 20, 2024
1 parent 4f45203 commit ee21754
Show file tree
Hide file tree
Showing 13 changed files with 385 additions and 187 deletions.
41 changes: 41 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,47 @@ To remove the agent and all viam configuration completely, run the following scr
sudo /bin/sh -c "$(curl -fsSL https://storage.googleapis.com/packages.viam.com/apps/viam-agent/uninstall.sh)"
```

## Configuration
Configuration is maintained via the "agent" section of the device's config in the viam App. Below is an example config section:
```
{
"agent": {
"viam-agent": {
"release_channel": "stable",
"pin_version": "1.2.3",
"pin_url": "http://example/test.binary",
"disable_subsystem": false
},
"viam-server": {
"attributes": {
"fast_start": true
}
},
"agent-provisioning": {
"release_channel": "stable"
},
"agent-syscfg": {
"release_channel": "stable"
}
}
}
```
Above there are (currently) four subsystems, `viam-agent` (the main agent program itself), `viam-server` (the core of the robot/device), `agent-provisioning` (provides early setup and network management. [Provisioning Details](https://github.com/viamrobotics/agent-provisioning) ), and `agent-syscfg` (provides various OS/system configuration tweaks [Syscfg Details](https://github.com/viamrobotics/agent-syscfg))

Each section primarily controls updates for that subsystem, using one of three settings, `pin_url` (highest priority), `pin_version` (checked/used only if pin_url is unset or empty), and `release_channel` (used by default, and defaults to stable, but only if `pin_url` and `pin_version` are both unset.) The example above gives all three for visual clarity, but is not actually needed. In this case, only `pin_url` would be used.

For release channel, "stable" generally means semantically versioned releases that are tested before release, and are relatively infrequent, but will automatically upgrade when a new version is released. Using `pin_version` allows one to "lock" the subsystem to an explcit version (as provided by the release channel) no automatic upgrades will be performed until the setting is updated to a new version (or removed to revert to the release channel.) Lastly, `pin_url` can be used to point to a specific binary. Typically this is only used for testing/troubleshooting.

The `disable_subsystem` setting can be set to true if you don't wish to use/start a particular subsystem.

Note that only sections/settings you explicitly want to modify need to be included in the config. By default, all subsystems will use the `stable` release channel, so no configuration at all is needed to get that behavior. E.g. in the example above, viam-server will still get stable releases, as none of the update-related values are being modified, but it will ALSO use the fast_start behavior detailed below. For another example, the `agent-provisioning` or `agent-syscfg` sections could be left out entirely, and the device will use a default config for those subsystems anyway. To actually disable one, the section can be added, and `disable_subsystem` to to `true`


## FastStart Mode
This bypasses the normal network/online wait and update checks during inital startup, and executes viam-server as quickly as possible. Useful if you have a device that often starts when offline or on a slow connection, and having the latest version immediately after start isn't required. Note that normal, periodic update checks will continue to run afterwards. This only affects initial startup sequencing.

To use it, set `"fast_start": true` in the attributes for the viam-server subsystem. Alternately, set `VIAM_AGENT_FASTSTART=1` in your environment.

## Development

### Makefile Targets
Expand Down
132 changes: 101 additions & 31 deletions cmd/viam-agent/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ import (
"fmt"
"io/fs"
"os"
"os/exec"
"os/signal"
"os/user"
"path/filepath"
Expand Down Expand Up @@ -39,7 +40,8 @@ func main() {

var opts struct {
Config string `default:"/etc/viam.json" description:"Path to config file" long:"config" short:"c"`
Debug bool `description:"Enable debug logging (for agent only)" long:"debug" short:"d"`
Debug bool `description:"Enable debug logging (for agent only)" env:"VIAM_AGENT_DEBUG" long:"debug" short:"d"`
Fast bool `description:"Enable fast start mode" env:"VIAM_AGENT_FAST_START" long:"fast" short:"f"`
Help bool `description:"Show this help message" long:"help" short:"h"`
Version bool `description:"Show version" long:"version" short:"v"`
Install bool `description:"Install systemd service" long:"install"`
Expand Down Expand Up @@ -81,10 +83,7 @@ func main() {
}

if opts.Install {
err := viamagent.Install(globalLogger)
if err != nil {
globalLogger.Error(err)
}
exitIfError(viamagent.Install(globalLogger))
return
}

Expand All @@ -104,23 +103,11 @@ func main() {
exitIfError(agent.InitPaths())

// use a lockfile to prevent running two agents on the same machine
pidFile, err := lockfile.New(filepath.Join(agent.ViamDirs["run"], "viam-agent.pid"))
exitIfError(errors.Wrap(err, "cannot init lock file"))
if err = pidFile.TryLock(); err != nil {
globalLogger.Error(errors.Wrapf(err, "cannot lock %s", pidFile))
if errors.Is(err, lockfile.ErrBusy) {
globalLogger.Debug("Retrying to lock file")

time.Sleep(2 * time.Second)

if err = pidFile.TryLock(); err != nil {
globalLogger.Fatal("Please terminate any other copies of viam-agent and try again.")
}
}
}
pidFile, err := getLock()
exitIfError(err)
defer func() {
if err := pidFile.Unlock(); err != nil {
exitIfError(errors.Wrapf(err, "cannot unlock %s", pidFile))
globalLogger.Error(errors.Wrapf(err, "unlocking %s", pidFile))
}
}()

Expand All @@ -144,7 +131,7 @@ func main() {
if !errors.Is(err, fs.ErrNotExist) {
if err := os.Rename(absConfigPath, absConfigPath+".old"); err != nil {
// if we can't rename the file, we're up a creek, and it's fatal
globalLogger.Error(errors.Wrapf(err, "cannot remove invalid config file %s", absConfigPath))
globalLogger.Error(errors.Wrapf(err, "removing invalid config file %s", absConfigPath))
globalLogger.Error("unable to continue with provisioning, exiting")
manager.CloseAll()
return
Expand All @@ -159,8 +146,8 @@ func main() {
if errors.Is(err, agent.ErrSubsystemDisabled) {
globalLogger.Warn("provisioning subsystem disabled, please manually update /etc/viam.json and connect to internet")
} else {
globalLogger.Error(errors.Wrapf(err, "could not start provisioning subsystem"))
globalLogger.Error("please manually update /etc/viam.json and connect to internet")
globalLogger.Error(errors.Wrapf(err,
"could not start provisioning subsystem, please manually update /etc/viam.json and connect to internet"))
}
}

Expand All @@ -177,20 +164,51 @@ func main() {
}
}

// Start viam server as soon as possible. Then, start other subsystems and check for updates
if err := manager.StartSubsystem(ctx, viamserver.SubsysName); err != nil {
if errors.Is(err, agent.ErrSubsystemDisabled) {
globalLogger.Warn("viam-server subsystem disabled, please manually update /etc/viam.json and connect to internet")
// if FastStart is set, skip updates and start viam-server immediately, then proceed as normal
var fastSuccess bool
if opts.Fast || viamserver.FastStart.Load() {
if err := manager.StartSubsystem(ctx, viamserver.SubsysName); err != nil {
globalLogger.Error(err)
} else {
globalLogger.Errorf("could not start viam-server subsystem: %s", err)
fastSuccess = true
}
}

globalLogger.Debug("==== Starting background checks =====")
manager.StartBackgroundChecks(ctx)
if !fastSuccess {
// wait to be online
timeoutCtx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()
for {
cmd := exec.CommandContext(timeoutCtx, "systemctl", "is-active", "network-online.target")
_, err := cmd.CombinedOutput()

<-ctx.Done()
if err == nil {
break
}

if errors.As(err, &exec.ExitError{}) {
// if it's not an ExitError, that means it didn't even start, so bail out
globalLogger.Error(errors.Wrap(err, "running 'systemctl is-active network-online.target'"))
break
}
if !utils.SelectContextOrWait(timeoutCtx, time.Second) {
break
}
}

// Check for self-update and restart if needed.
needRestart, err := manager.SelfUpdate(ctx)
if err != nil {
globalLogger.Error(err)
}
if needRestart {
globalLogger.Info("updated self, exiting to await restart with new version")
return
}
}

manager.StartBackgroundChecks(ctx)
<-ctx.Done()
manager.CloseAll()

activeBackgroundWorkers.Wait()
Expand Down Expand Up @@ -249,3 +267,55 @@ func exitIfError(err error) {
globalLogger.Fatal(err)
}
}

func getLock() (lockfile.Lockfile, error) {
pidFile, err := lockfile.New(filepath.Join(agent.ViamDirs["tmp"], "viam-agent.pid"))
if err != nil {
return "", errors.Wrap(err, "init lockfile")
}
err = pidFile.TryLock()
if err == nil {
return pidFile, nil
}

globalLogger.Warn(errors.Wrapf(err, "locking %s", pidFile))

// if it's a potentially temporary error, retry
if errors.Is(err, lockfile.ErrBusy) || errors.Is(err, lockfile.ErrNotExist) {
time.Sleep(2 * time.Second)
globalLogger.Warn("retrying lock")
err = pidFile.TryLock()
if err == nil {
return pidFile, nil
}

// if (still) busy, validate that the PID in question is actually viam-agent
// some systems use sequential, low numbered PIDs that can easily repeat after a reboot or crash
// this could result some other valid/running process that matches a leftover lockfile PID
if errors.Is(err, lockfile.ErrBusy) {
var staleFile bool
proc, err := pidFile.GetOwner()
if err != nil {
globalLogger.Error(errors.Wrap(err, "getting lockfile owner"))
staleFile = true
}
runPath, err := filepath.EvalSymlinks(fmt.Sprintf("/proc/%d/exe", proc.Pid))
if err != nil {
globalLogger.Error(errors.Wrap(err, "cannot get info on lockfile owner"))
staleFile = true
} else if !strings.Contains(runPath, agent.SubsystemName) {
globalLogger.Warnf("lockfile owner isn't %s", agent.SubsystemName)
staleFile = true
}
if staleFile {
globalLogger.Warnf("deleting lockfile %s", pidFile)
if err := os.RemoveAll(string(pidFile)); err != nil {
return "", errors.Wrap(err, "removing lockfile")
}
return pidFile, pidFile.TryLock()
}
return "", errors.Errorf("other instance of viam-agent is already running with PID: %d", proc.Pid)
}
}
return "", err
}
48 changes: 22 additions & 26 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,12 @@ require (
github.com/edaniels/golog v0.0.0-20230215213219-28954395e8d0
github.com/jessevdk/go-flags v1.5.0
github.com/nightlyone/lockfile v1.0.0
go.uber.org/zap v1.26.0
go.viam.com/api v0.1.260
go.viam.com/utils v0.1.59
github.com/pkg/errors v0.9.1
github.com/ulikunitz/xz v0.5.12
go.uber.org/zap v1.27.0
go.viam.com/api v0.1.295
go.viam.com/utils v0.1.75
golang.org/x/sys v0.20.0
)

require (
Expand Down Expand Up @@ -39,36 +42,39 @@ require (
github.com/montanaflynn/stats v0.7.1 // indirect
github.com/pion/datachannel v1.5.5 // indirect
github.com/pion/dtls/v2 v2.2.7 // indirect
github.com/pion/ice/v2 v2.3.11 // indirect
github.com/pion/interceptor v0.1.24 // indirect
github.com/pion/ice/v2 v2.3.13 // indirect
github.com/pion/interceptor v0.1.25 // indirect
github.com/pion/logging v0.2.2 // indirect
github.com/pion/mdns v0.0.9 // indirect
github.com/pion/mdns v0.0.12 // indirect
github.com/pion/randutil v0.1.0 // indirect
github.com/pion/rtcp v1.2.10 // indirect
github.com/pion/rtp v1.8.2 // indirect
github.com/pion/sctp v1.8.9 // indirect
github.com/pion/sdp/v3 v3.0.6 // indirect
github.com/pion/srtp/v2 v2.0.17 // indirect
github.com/pion/rtcp v1.2.12 // indirect
github.com/pion/rtp v1.8.5 // indirect
github.com/pion/sctp v1.8.14 // indirect
github.com/pion/sdp/v3 v3.0.9 // indirect
github.com/pion/srtp/v2 v2.0.18 // indirect
github.com/pion/stun v0.6.1 // indirect
github.com/pion/transport/v2 v2.2.4 // indirect
github.com/pion/turn/v2 v2.1.4 // indirect
github.com/pion/webrtc/v3 v3.2.21 // indirect
github.com/pion/webrtc/v3 v3.2.36 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/rs/cors v1.10.1 // indirect
github.com/srikrsna/protoc-gen-gotag v0.6.2 // indirect
github.com/stretchr/testify v1.9.0 // indirect
github.com/xdg-go/pbkdf2 v1.0.0 // indirect
github.com/xdg-go/scram v1.1.2 // indirect
github.com/xdg-go/stringprep v1.0.4 // indirect
github.com/youmark/pkcs8 v0.0.0-20201027041543-1326539a0a0a // indirect
github.com/zitadel/oidc v1.13.5 // indirect
go.mongodb.org/mongo-driver v1.12.1 // indirect
go.opencensus.io v0.24.0 // indirect
golang.org/x/crypto v0.21.0 // indirect
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/goleak v1.3.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/crypto v0.23.0 // indirect
golang.org/x/mod v0.13.0 // indirect
golang.org/x/net v0.21.0 // indirect
golang.org/x/net v0.25.0 // indirect
golang.org/x/oauth2 v0.13.0 // indirect
golang.org/x/sync v0.4.0 // indirect
golang.org/x/text v0.14.0 // indirect
golang.org/x/text v0.15.0 // indirect
golang.org/x/tools v0.14.0 // indirect
google.golang.org/appengine v1.6.8 // indirect
google.golang.org/genproto v0.0.0-20231016165738-49dd2c1f3d0b // indirect
Expand All @@ -80,13 +86,3 @@ require (
gopkg.in/yaml.v3 v3.0.1 // indirect
nhooyr.io/websocket v1.8.9 // indirect
)

require (
github.com/pkg/errors v0.9.1
github.com/stretchr/testify v1.9.0 // indirect
github.com/ulikunitz/xz v0.5.11
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/goleak v1.2.1 // indirect
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/sys v0.18.0 // indirect
)
Loading

0 comments on commit ee21754

Please sign in to comment.