RSDK-4829 arm32 build #2790
Conversation
Wow, this is a huge accomplishment getting this into PR, well done! A couple of broad comments relating to a large PR, and one that touches the deployment lifecycle:

- Are we running vendor lint on every PR and push to main now? Do we intend to release 32-bit builds on the same cadence as everything else now? @Otterverse is there any concern about increased resource consumption? I know we've had issues recently with buildjet machines slowing down under load, causing false positive test failures.
- For posterity, was there something specific to arm32 builds that these helped with? I assume linting will catch some overflow and alignment-type bugs? Did we only find problems in dependencies in our repositories, or did we find some in other dependencies that would be more of a risk/cost to address expediently if a new one pops up? And does this run in CI, and will it block patches? What's one to do, and whom should one reach out to, if a "necessary" library upgrade fails linting? Will this run only when the …
- Similar to the load concerns: how broadly is this being adopted, and if it's causing too much workload, is turning it off safe, or will we be seeing "too slow" failures creeping up elsewhere?

I'll leave some more targeted comments in the diff.
@@ -0,0 +1,4 @@
bin/
Are these things generated by our `make` commands that a user/github action is expected to have created now? Or were these artifacts of your personal workflow?
note: this was already blocked in our .gitignore -- I added it to .dockerignore to make my docker builds start faster
these are in theory nice-to-haves for anyone doing local docker iterations, but they're not critical and I can remove them if you want
Sounds good to me, though I'm rather unfamiliar. I'll defer to @Otterverse if there are any concerns.
I don't see .direnv anywhere in .gitignore, so I'm not sure where that came from. Ignoring .git is good though; it saves some code size for most builds. There's other stuff in .gitignore that could be here too... especially node_modules and .artifact, I think.
Also, why ignore **/Dockerfile?
@@ -30,9 +30,6 @@ func init() {
// Encoder keeps track of a motor position using a rotary incremental encoder.
type Encoder struct {
	resource.Named

	positionType encoder.PositionType
For posterity:

I'm guessing the problem here is that `sync.Mutex` needs to be on some byte boundary (4? 8?), and this change shuffles the ordering so that we satisfy that requirement.

I think this is going to be a common sharp edge for rdk developers. It's a reasonable practice to place mutexes near the variables they are intended to protect and (in Go at least) not worry about runtime constraints for correctness.

So in lieu of having some "gofmt-like" program that can rewrite our structures to be 32-bit mutex compliant, we need to start leaving more documentation around our `sync.Mutex` variables about why they are placed where they are. Maybe in the future we'll fall into a convention that can avoid some of the documentation.

A concrete ask: leave a comment here saying `"sync" and "atomic" primitives must be 4-byte aligned for 32-bit builds.` and move activeBackgroundWorkers below this mutex. Mostly so we can write one comment that applies to all the important parts, and so we start leaving good examples that people can crib off of.
I'll leave a comment on the other locations changed in this PR so you don't have to hope you caught all of the ones I'm seeing. But please do not audit the entire code-base to make these changes. I just want to get the ball rolling on communicating this subtle requirement to the broader set of developers at viam.
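As a rough illustration of the requested convention, here is a hypothetical sketch (not actual rdk code; field names other than activeBackgroundWorkers are made up, and the alignment comment is worded per Go's sync/atomic documentation rather than verbatim from the ask):

```go
package example

import "sync"

// Hypothetical component illustrating the convention being asked for.
type component struct {
	// 64-bit values accessed via "sync/atomic" must be 64-bit aligned on
	// 32-bit builds; keeping them first relies on Go's guarantee that the
	// first word of an allocated struct is 64-bit aligned.
	position int64

	// The mutex and related fields follow, with activeBackgroundWorkers
	// moved below the mutex per the review ask.
	mu                      sync.Mutex
	activeBackgroundWorkers sync.WaitGroup
}
```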
Offline discussion: the problem isn't with mutexes but rather with atomic accesses to the `position`, `pRaw`, and `pState` variables (e.g. here). Consider just changing them to `atomic.Int64`.
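For reference, a minimal sketch of that change (simplified, not the actual Encoder type). Since Go 1.19, `atomic.Int64` guarantees its own 8-byte alignment, so field ordering no longer matters:

```go
package example

import "sync/atomic"

// Simplified sketch, not the real rdk Encoder.
type encoderState struct {
	position atomic.Int64
	pRaw     atomic.Int64
	pState   atomic.Int64
}

func (s *encoderState) tick() {
	// Method calls replace the free functions, e.g. atomic.AddInt64(&s.pRaw, 1).
	s.pRaw.Add(1)
	s.position.Store(s.pRaw.Load())
}
```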
@@ -173,6 +173,7 @@ func newEncodedMotor(

// EncodedMotor is a motor that utilizes an encoder to track its position.
type EncodedMotor struct {
	rpmMonitorCalls int64
Offline discussion: consider changing this to `atomic.Int64`.
@@ -7,8 +7,8 @@ import "sync/atomic"
// used for statistical purposes only due to the use of
// atomics and not mutexes.
type RollingAverage struct {
	data []int64
	pos  int64
Also consider making these `atomic.Int64` and `[]atomic.Int64`.
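A sketch of the shape of that change, assuming Go 1.19+ (the Add method is illustrative, not a copy of the existing implementation):

```go
package example

import "sync/atomic"

// Illustrative only. atomic.Int64 fields need no manual alignment on 32-bit
// platforms, and each element of a []atomic.Int64 is independently safe for
// concurrent Store/Load.
type RollingAverage struct {
	data []atomic.Int64
	pos  atomic.Int64
}

// Add records a value in the ring buffer; as noted above, this is approximate,
// statistics-only bookkeeping guarded by atomics rather than a mutex.
func (ra *RollingAverage) Add(value int64) {
	pos := ra.pos.Add(1) - 1
	ra.data[pos%int64(len(ra.data))].Store(value)
}
```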
Re: etc/splitter concerns, the benefits of splitter:
This feels like it should've probably been four different PRs: The splitter testing tool, the vendorlint tool, the 32-bit workflow changes, and then the actual 32-bit fixes. I'm not saying they need to be split up at this point, but I will kind of treat them as different "reviews" in my responses below. I've put a lot of inline notes as well, but I'm going to write a larger overview here that may repeat some of them.
Also, my apologies for how long this is. Lots to address, and a lot of my critiques are things that probably need discussion, so keep in mind I'm wording this as what I think is best, not just saying "do it my way!"
Also-also… some of the inline comments may be out of date. You pushed several updates while I was in the process of reviewing, so some could be obsolete.
32-bit fixes
So first off, great work hunting down all the 64/32 bit and alignment issues. As Dan noted, I'm sure that was a pain to figure out. All that part of this PR looks great to me!
Splitter
The splitter code I have a few questions about. First off, its purpose wasn't really clear to me, since go test already runs things in parallel where possible, and if it's a matter of horsepower, we have access to up to 32-core runners. I'm presuming it's more about being able to retry failures more quickly? I'm also curious whether you looked into existing tools for this, such as https://github.com/hashicorp-forge/go-test-split-action, and if so, why they weren't viable.
I also have a slight concern about splitting tests onto different machines actually changing the behavior. We've caught quite a few bugs because tests WERE on the same machine, and processes had file or socket races against each other. At the far extreme (if every test got its own job) that'd mean we never catch issues like that. A handful of chunks doesn't seem too bad if it saves us on retries though.
Lastly, I'm not a huge fan of a Go-specific testing utility being written in python instead of Go itself (or very basic bash). It means we now have to involve two languages in all our build/test environments. It also means anyone debugging issues or looking to improve this now needs python expertise in addition to Go expertise. This is a pretty complicated utility, and I actually struggled to understand parts of it, as my python skills are not as advanced as yours. I'm not saying (necessarily) that this can't remain python, but it is a discussion we should have, and likely with other people involved. The python code itself looks well thought out though, and the idea of speeding things up this way is great.
Vendorlint
First, I think the normal 32-bit lint should be split into a separate makefile target from the vendorlint run. I'm hoping vendorlint only needs to be run rarely, not on every lint check. I don't mind having an extra target there for manual runs/checks when needed, but automating it (with a full go mod vendor run) on every lint check feels like a big resource/time hog. I understand why this was needed to initially hunt down the 32-bit issues, but can you clarify why and when this is expected to be needed going forward?
Secondly, running "go mod vendor" changes how all future work in someone's checkout behaves, which can disrupt IDEs that use LSP: they were using the modules under ~/go/ for reference, so they may need reloads or may point at the wrong files. Also, and more importantly, vendoring breaks builds. Once vendored, the pion/mediadevices module can no longer find the opus.h headers, and thus viam-server can no longer compile. Likewise, make lint-go fails due to issues compiling tools.
I definitely think this needs to be a target that only gets run manually when we need to investigate a 32-bit issue. It's too intensive to run on every PR update (I just timed it on your arm image and it took 20 minutes), and on developer machines the vendoring is disruptive. The latter can be overlooked if it's not part of the normal workflow though; just document that a "clean-all" is needed to get back to a normal working state.
The same notes I had about splitter being written in python apply here as well.
Workflows/Docker
The actual build process stuff needs a rework I believe, and should follow the pattern used by all our other workflows (the ones that work using the canon docker images.)
The big thing is that the lint, test, and (server) build steps need to be well separated from the creation of the docker image itself, and should be integrated into the existing workflows.
For the docker image build itself... There should be a process to build a docker image that contains a canon-compatible toolchain/env. (The canon utility doesn't have 32 bit arm support yet, but I can add that next week pretty easily. And it's not needed for workflow/CI use.) I know we talked briefly the other day about bespoke being fine for now, but seeing how simple this dockerfile currently is (and how similar the end result is to canon), it might actually be worth it to just go ahead and add this properly as a canon image variant. (e.g. arm instead of arm64) Look at https://github.com/viamrobotics/SysOps/tree/main/docker and you can probably just add a couple build targets to the Makefile there, and tweak the canon/Dockerfile with any arm32 specific conditionals you need. Builds for this shouldn't need to be made/pushed any more frequently than canon itself, so no real need to automate I think (though I may do that anyway next month when updating canon like I talked about.) I think this is additionally a good thing to do now, as the current image being built isn't actually 32-bit.
For the workflow(s) that are USING the built docker images, I'd really like to see them integrated into existing workflows (e.g. being called by pull-request-trusted.yml, main.yml, or sub-workflows of those) where needed. We definitely don't want this all to just run on every push like it's currently set to do. I'd also like to see them use the workflows' native "container" option rather than manually constructing docker run commands. Lastly, everything should be using make targets as much as possible, rather than putting complex commands in the workflow files themselves. It's MUCH easier to debug/update/modify/test these later in the Makefile than in the workflow. It should be pretty easy to automate the arm32 detection and enable the couple of extra build flags you've added.
@@ -85,7 +85,7 @@ require (
	go.uber.org/zap v1.24.0
	go.viam.com/api v0.1.186
	go.viam.com/test v1.1.1-0.20220913152726-5da9916c08a2
-	go.viam.com/utils v0.1.43
+	go.viam.com/utils v0.1.44-0.20230905200840-300b46e8f195
Please make sure to use a directly tagged/real version number (without the date/time stamp) when this actually publishes. (I know the 1.44 tag doesn't exist yet.) It makes things easier to parse for some workflow automation (we try to sync these versions to match in app as well) and easier for humans to look up/compare changes.
on:
  push:
We don't want to be building a new docker image every single push. This needs to be a called workflow, hooked into the proper "top-level" workflows (e.g. pull-request-trusted.yml and/or main.yml)
- uses: docker/build-push-action@v4
  with:
    file: etc/packaging/32bit/Dockerfile
    tags: ghcr.io/viamrobotics/viam-server:armv7
The name of the docker image probably shouldn't be the same as our binary, but it's fine for the moment.
- uses: docker/login-action@v2
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}
I don't believe you need this when using the image (only when publishing.) The docker images should be public.
Makefile

go mod vendor
# parallelism is 1 for now because golangci-lint detects parallel runs and complains
etc/vendorlint.py all --linter $(abspath $(TOOL_BIN)/golangci-lint) --parallel 1
Is vendorlint ALWAYS necessary? Or can it be its own "vendorlint" target that we only run manually/occasionally? In either case, I think this should be split into two separate targets, so people working locally can run just the "quick" lint when they're changing their own files (and not modifying vendor stuff).
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v4
This doesn't appear to be building a 32-bit image. The image currently at ghcr.io/viamrobotics/viam-server:armv7 is actually 64-bit arm when I inspect it as well:
$ docker image inspect ghcr.io/viamrobotics/viam-server:armv7 | grep Arch
"Architecture": "arm64",
uses: google-github-actions/[email protected]
with:
  headers: "cache-control: no-cache"
  path: viam-server-latest-armv7l
I don't believe this is a distributable binary. There's a whole process to do a (mostly) statically linked binary and/or appimages. This appears to just be a raw binary, and with the current docker image, a 64-bit one at that.
$(TOOL_BIN)/golangci-lint run -v --build-tags no_tflite,no_pigpio --tests=false --disable-all --enable staticcheck --timeout 30m

vendorlint: tool-install
	go mod vendor
This is an expensive (bandwidth/time) operation. It also leaves the vendor directory in place permanently, which breaks future builds for regular use. To be specific, `make server` throws errors about missing opus.h after I run this target.
refactoring to split out fixes from CI / container changes
what

high-level changes

This is a huge PR and if reviewers want, I can split it up:
- tflite + pigpio
- overview of sync/atomic linting
- GOARCH=arm