-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RSDK-9195] ota async task #350
base: main
Are you sure you want to change the base?
Conversation
bincode = "2.0.0-rc.3" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make this an optional dependency for ota
feature moving forward. That way it doesn't get into the regular build
will rm ws
@@ -622,6 +600,42 @@ where | |||
} | |||
} | |||
|
|||
#[cfg(feature = "ota")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved the whole ota section to after certs have been retrieved and stored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need certs? I thought the certs were for us serving
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I am using the same certs for the hyper client's https connection. should I not be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try the empirical answer: does it still work if you don't use them?
struct EspAppDesc { | ||
//TODO verify this is doing what I think it's doing, validate results | ||
/// ESP_APP_DESC_MAGIC_WORD (0xABCD5432) | ||
magic_word: u32, | ||
secure_version: u32, | ||
reserv1: [u32; 2], | ||
/// application version | ||
version: [u8; 32], | ||
project_name: [u8; 32], | ||
/// compile time | ||
time: [u8; 16], | ||
/// compile date | ||
date: [u8; 16], | ||
idf_ver: [u8; 32], | ||
app_elf_sha256: [u8; 32], | ||
/// minimal eFuse block revision supported by image, in format: major * 100 + minor | ||
min_efuse_blk_rev_full: u16, | ||
/// maximal eFuse block revision supported by image, in format: major * 100 + minor | ||
max_efuse_blk_rev_full: u16, | ||
reserv2: [u32; 19], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this for testing in native
, however we can maybe use it in place of esp_app_desc_t
entirely.
Or if we want to just use the c ffi for esp_app_desc_t
we can do an unsafe transmute on the first 256 bytes of an image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now, let's make a ticket to revisit. I'd prefer we ultimately use the C FFI one when possible, and declare our own for native. Ideally with some sanity checks on esp32 that the native one seems to line up with the FFI one.
let io = self | ||
.connector | ||
.connect_to(&uri) | ||
.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change to map_err in follow-up commit
} | ||
let file_len = headers[hyper::header::CONTENT_LENGTH] | ||
.to_str() | ||
.expect("failed to get reference to content length") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
EspAppDesc, | ||
bincode::config::Configuration, | ||
>( | ||
&data[..256], bincode::config::standard() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change this to size_of::<EspAppDesc>
.await | ||
.map_err("failed to write data to partition".to_string())?; | ||
// TODO change back to 'n' after impl async writer | ||
nwritten += data.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will probably take the length before writing, but seems to still work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It mostly LGTM actually, just a few questions
let mut loader = EspFirmwareInfoLoader::new(); | ||
loader.load(&data).map_err(OtaError::EspError)?; | ||
let new_fw = loader.get_info().map_err(OtaError::EspError)?; | ||
log::info!("current firmware: {:?}", running_fw_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log::debug
instead (for both statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall though, I do feel like this method should probably have some well thought out user facing logs that provide context about what the OTA system is doing.
} | ||
|
||
let mut auth = uri.authority().expect("str").to_string(); | ||
auth.push_str(":443"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this represent a port? What determines this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this represents a port. This is specifically in the case that the scheme is https
and no port is specified. 443
is the known port for https (connection attempt will fail without it). hyper doesn't make any assumptions for you so you need to handle the case that a port isn't included.
If you include a port in the url (ex https://my_url.com:9000
) it will use that port instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, sounds good
let (mut sender, conn) = { | ||
http2::Builder::new(self.exec.clone()) | ||
.keep_alive_interval(Some(std::time::Duration::from_secs(120))) | ||
.keep_alive_timeout(std::time::Duration::from_secs(10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the minimum amount of time we can use for this timeout? If a smaller timeout works, I think it would be preferred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a lower limit you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I figured we could arrive at this by lowering the duration until it no longer works on a relatively unstable network (probably do something more like binary search)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that we know we need to adjust these from the defaults?
)); | ||
} | ||
|
||
drop(conn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we drop connection before aborting the OTA? It seems to me that we'd like to keep the connection alive for as little time as possible and the abort might take a variable amount of time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I can move this up a bit to before the nwritten != file_len
check.
|
||
#[cfg(feature = "esp32")] | ||
type OtaConnector = crate::esp32::tcp::Esp32H2Connector; | ||
#[cfg(not(feature = "esp32"))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I forgot from the previous PR, but why are we allowing cases where we're not on an esp32? Isn't OTA meaningless otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
faster iteration when testing, specifically around hyper, config checks, and inspecting the binary to be downloaded (recommended by nico)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair for development purposes maybe, but isn't this confusing for merged-in code? It took me a while to see that OTA compiles for native but makes http calls that essentially do nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by nothing. The calls are for downloading the OTA binary and inspecting it. This would also be useful for verifying future work we talked about for editing the binary signatures directly.
Should I put it back and take it out between every PR for this feature? Or clean it up at the end if it doesn't prove useful beyond the planned work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess to that point, what is the purpose of having a native build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the native build still tests the connection infrastructure abstraction and acts as a server. In native "OTA", there's no updating happening so it doesn't feel like a fair comparison. But we can leave it in and revisit at the end of the epic
{ | ||
self.ota_service_task | ||
.replace(self.executor.spawn(async move { | ||
if let Err(e) = ota.update().await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should update
be called periodically? Otherwise, it looks to me like this will do one check for an available OTA update on startup, but never again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, this only happens once on startup.
The train of thought was, when a user changes the robot config (to update ota settings or anything else) the device will restart, performing the ota check. This also means we can drop the task once the device is up-to-date, freeing up resources to be used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
part of why we made this a separate async task was so we can do backoff and retries while still letting the rest of the program move forward, but the idea was for the task to eventually terminate.
|
||
// TODO(RSDK-9200): set according to active partition scheme | ||
const OTA_MAX_IMAGE_SIZE: usize = 1024 * 1024 * 4; // 4MB | ||
pub const OTA_MODEL_TYPE: &str = "ota_service"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be a slice of OTA_MODEL_TRIPLET so they can't get out of phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried doing that initially, but was having issues with using a const str to define another const str, will look into it again.
struct EspAppDesc { | ||
//TODO verify this is doing what I think it's doing, validate results | ||
/// ESP_APP_DESC_MAGIC_WORD (0xABCD5432) | ||
magic_word: u32, | ||
secure_version: u32, | ||
reserv1: [u32; 2], | ||
/// application version | ||
version: [u8; 32], | ||
project_name: [u8; 32], | ||
/// compile time | ||
time: [u8; 16], | ||
/// compile date | ||
date: [u8; 16], | ||
idf_ver: [u8; 32], | ||
app_elf_sha256: [u8; 32], | ||
/// minimal eFuse block revision supported by image, in format: major * 100 + minor | ||
min_efuse_blk_rev_full: u16, | ||
/// maximal eFuse block revision supported by image, in format: major * 100 + minor | ||
max_efuse_blk_rev_full: u16, | ||
reserv2: [u32; 19], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now, let's make a ticket to revisit. I'd prefer we ultimately use the C FFI one when possible, and declare our own for native. Ideally with some sanity checks on esp32 that the native one seems to line up with the FFI one.
reserv2: [u32; 19], | ||
} | ||
|
||
// TODO(RSDK-9214) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When TODO(TICKET)
please also include a little explanation of what is TODO. It can be brief, but that way the reader can get a sense without needing to jump over to the ticket.
cert: TlsCertificate, | ||
exec: Executor, | ||
) -> Result<Self, OtaError> { | ||
// TODO(RSDK-9205) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto more details on comment
let mut total_downloaded: usize = 0; | ||
let mut got_info = false; | ||
|
||
while let Some(next) = response.frame().await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any ability to exercise control over the max size of the frames we will get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears the closest thing would be the range
header, to request a specific range of bytes to be served (could be useful for resumable download). this would turn into mutiple requests though, one for each segment of our max size.
However, there's no guarantee that the server supports ranges and that would have to be verified and handled.
if total_downloaded < core::mem::size_of::<EspAppDesc>() { | ||
log::error!("initial frame too small to retrieve esp_app_desc_t"); | ||
} else { | ||
log::info!("data length {}", data.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a debug log, at best, or better removed
let mut loader = EspFirmwareInfoLoader::new(); | ||
loader.load(&data).map_err(OtaError::EspError)?; | ||
let new_fw = loader.get_info().map_err(OtaError::EspError)?; | ||
log::info!("current firmware: {:?}", running_fw_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall though, I do feel like this method should probably have some well thought out user facing logs that provide context about what the OTA system is doing.
log::info!("new firmware: {:?}", new_fw); | ||
if let Some(ref running_fw) = running_fw_info { | ||
if running_fw.version == new_fw.version | ||
&& running_fw.released == new_fw.released |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is released
here and what does it mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the compiled timestamp we talked about before
@@ -622,6 +600,42 @@ where | |||
} | |||
} | |||
|
|||
#[cfg(feature = "ota")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need certs? I thought the certs were for us serving
The following PR:
The errors will get handled better as part of RSDK-9214