Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-9195] ota async task #350

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

mattjperez
Copy link
Member

@mattjperez mattjperez commented Nov 14, 2024

The following PR:

  • RSDK-9195: moves the ota logic to an async task attached to viam-server (this will enable retries/incremental-download/etc)
  • RSDK-9198: uses async hyper::http2 instead of the blocking EspHttpClient

The errors will get handled better as part of RSDK-9214

@mattjperez mattjperez self-assigned this Nov 14, 2024
@mattjperez mattjperez requested a review from a team as a code owner November 14, 2024 17:18
Comment on lines +89 to +90
bincode = "2.0.0-rc.3"

Copy link
Member Author

@mattjperez mattjperez Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this an optional dependency for ota feature moving forward. That way it doesn't get into the regular build

will rm ws

@@ -622,6 +600,42 @@ where
}
}

#[cfg(feature = "ota")]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the whole ota section to after certs have been retrieved and stored

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need certs? I thought the certs were for us serving

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I am using the same certs for the hyper client's https connection. should I not be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try the empirical answer: does it still work if you don't use them?

Comment on lines +83 to +103
struct EspAppDesc {
//TODO verify this is doing what I think it's doing, validate results
/// ESP_APP_DESC_MAGIC_WORD (0xABCD5432)
magic_word: u32,
secure_version: u32,
reserv1: [u32; 2],
/// application version
version: [u8; 32],
project_name: [u8; 32],
/// compile time
time: [u8; 16],
/// compile date
date: [u8; 16],
idf_ver: [u8; 32],
app_elf_sha256: [u8; 32],
/// minimal eFuse block revision supported by image, in format: major * 100 + minor
min_efuse_blk_rev_full: u16,
/// maximal eFuse block revision supported by image, in format: major * 100 + minor
max_efuse_blk_rev_full: u16,
reserv2: [u32; 19],
}
Copy link
Member Author

@mattjperez mattjperez Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this for testing in native, however we can maybe use it in place of esp_app_desc_t entirely.
Or if we want to just use the c ffi for esp_app_desc_t we can do an unsafe transmute on the first 256 bytes of an image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, let's make a ticket to revisit. I'd prefer we ultimately use the C FFI one when possible, and declare our own for native. Ideally with some sanity checks on esp32 that the native one seems to line up with the FFI one.

let io = self
.connector
.connect_to(&uri)
.unwrap()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change to map_err in follow-up commit

}
let file_len = headers[hyper::header::CONTENT_LENGTH]
.to_str()
.expect("failed to get reference to content length")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

EspAppDesc,
bincode::config::Configuration,
>(
&data[..256], bincode::config::standard()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change this to size_of::<EspAppDesc>

.await
.map_err("failed to write data to partition".to_string())?;
// TODO change back to 'n' after impl async writer
nwritten += data.len();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will probably take the length before writing, but seems to still work

Copy link
Member

@gvaradarajan gvaradarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mostly LGTM actually, just a few questions

let mut loader = EspFirmwareInfoLoader::new();
loader.load(&data).map_err(OtaError::EspError)?;
let new_fw = loader.get_info().map_err(OtaError::EspError)?;
log::info!("current firmware: {:?}", running_fw_info);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log::debug instead (for both statements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall though, I do feel like this method should probably have some well thought out user facing logs that provide context about what the OTA system is doing.

}

let mut auth = uri.authority().expect("str").to_string();
auth.push_str(":443");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this represent a port? What determines this value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this represents a port. This is specifically in the case that the scheme is https and no port is specified. 443 is the known port for https (connection attempt will fail without it). hyper doesn't make any assumptions for you so you need to handle the case that a port isn't included.

If you include a port in the url (ex https://my_url.com:9000) it will use that port instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good

let (mut sender, conn) = {
http2::Builder::new(self.exec.clone())
.keep_alive_interval(Some(std::time::Duration::from_secs(120)))
.keep_alive_timeout(std::time::Duration::from_secs(10))
Copy link
Member

@gvaradarajan gvaradarajan Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the minimum amount of time we can use for this timeout? If a smaller timeout works, I think it would be preferred

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a lower limit you have in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I figured we could arrive at this by lowering the duration until it no longer works on a relatively unstable network (probably do something more like binary search)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that we know we need to adjust these from the defaults?

));
}

drop(conn);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we drop connection before aborting the OTA? It seems to me that we'd like to keep the connection alive for as little time as possible and the abort might take a variable amount of time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I can move this up a bit to before the nwritten != file_len check.


#[cfg(feature = "esp32")]
type OtaConnector = crate::esp32::tcp::Esp32H2Connector;
#[cfg(not(feature = "esp32"))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I forgot from the previous PR, but why are we allowing cases where we're not on an esp32? Isn't OTA meaningless otherwise?

Copy link
Member Author

@mattjperez mattjperez Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

faster iteration when testing, specifically around hyper, config checks, and inspecting the binary to be downloaded (recommended by nico)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair for development purposes maybe, but isn't this confusing for merged-in code? It took me a while to see that OTA compiles for native but makes http calls that essentially do nothing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by nothing. The calls are for downloading the OTA binary and inspecting it. This would also be useful for verifying future work we talked about for editing the binary signatures directly.

Should I put it back and take it out between every PR for this feature? Or clean it up at the end if it doesn't prove useful beyond the planned work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess to that point, what is the purpose of having a native build?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the native build still tests the connection infrastructure abstraction and acts as a server. In native "OTA", there's no updating happening so it doesn't feel like a fair comparison. But we can leave it in and revisit at the end of the epic

{
self.ota_service_task
.replace(self.executor.spawn(async move {
if let Err(e) = ota.update().await {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should update be called periodically? Otherwise, it looks to me like this will do one check for an available OTA update on startup, but never again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, this only happens once on startup.

The train of thought was, when a user changes the robot config (to update ota settings or anything else) the device will restart, performing the ota check. This also means we can drop the task once the device is up-to-date, freeing up resources to be used elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

part of why we made this a separate async task was so we can do backoff and retries while still letting the rest of the program move forward, but the idea was for the task to eventually terminate.


// TODO(RSDK-9200): set according to active partition scheme
const OTA_MAX_IMAGE_SIZE: usize = 1024 * 1024 * 4; // 4MB
pub const OTA_MODEL_TYPE: &str = "ota_service";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just be a slice of OTA_MODEL_TRIPLET so they can't get out of phase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried doing that initially, but was having issues with using a const str to define another const str, will look into it again.

Comment on lines +83 to +103
struct EspAppDesc {
//TODO verify this is doing what I think it's doing, validate results
/// ESP_APP_DESC_MAGIC_WORD (0xABCD5432)
magic_word: u32,
secure_version: u32,
reserv1: [u32; 2],
/// application version
version: [u8; 32],
project_name: [u8; 32],
/// compile time
time: [u8; 16],
/// compile date
date: [u8; 16],
idf_ver: [u8; 32],
app_elf_sha256: [u8; 32],
/// minimal eFuse block revision supported by image, in format: major * 100 + minor
min_efuse_blk_rev_full: u16,
/// maximal eFuse block revision supported by image, in format: major * 100 + minor
max_efuse_blk_rev_full: u16,
reserv2: [u32; 19],
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, let's make a ticket to revisit. I'd prefer we ultimately use the C FFI one when possible, and declare our own for native. Ideally with some sanity checks on esp32 that the native one seems to line up with the FFI one.

reserv2: [u32; 19],
}

// TODO(RSDK-9214)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When TODO(TICKET) please also include a little explanation of what is TODO. It can be brief, but that way the reader can get a sense without needing to jump over to the ticket.

cert: TlsCertificate,
exec: Executor,
) -> Result<Self, OtaError> {
// TODO(RSDK-9205)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto more details on comment

let mut total_downloaded: usize = 0;
let mut got_info = false;

while let Some(next) = response.frame().await {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any ability to exercise control over the max size of the frames we will get?

Copy link
Member Author

@mattjperez mattjperez Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears the closest thing would be the range header, to request a specific range of bytes to be served (could be useful for resumable download). this would turn into mutiple requests though, one for each segment of our max size.

However, there's no guarantee that the server supports ranges and that would have to be verified and handled.

if total_downloaded < core::mem::size_of::<EspAppDesc>() {
log::error!("initial frame too small to retrieve esp_app_desc_t");
} else {
log::info!("data length {}", data.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a debug log, at best, or better removed

let mut loader = EspFirmwareInfoLoader::new();
loader.load(&data).map_err(OtaError::EspError)?;
let new_fw = loader.get_info().map_err(OtaError::EspError)?;
log::info!("current firmware: {:?}", running_fw_info);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall though, I do feel like this method should probably have some well thought out user facing logs that provide context about what the OTA system is doing.

log::info!("new firmware: {:?}", new_fw);
if let Some(ref running_fw) = running_fw_info {
if running_fw.version == new_fw.version
&& running_fw.released == new_fw.released
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is released here and what does it mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the compiled timestamp we talked about before

@@ -622,6 +600,42 @@ where
}
}

#[cfg(feature = "ota")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need certs? I thought the certs were for us serving

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants