
Proposal: add DMA support to SPI #342 (Draft)

aykevl wants to merge 6 commits into dev
Conversation

aykevl (Member) commented Apr 8, 2023

Here is a possible API I've designed to add DMA support to the machine SPI API.

Here is a copy of the API description:

func (spi SPI) IsAsync() bool

Return whether the SPI supports asynchronous operation (usually using DMA). Asynchronous operation may be supported for some or all transfers; for example, it may only be supported for send-only transfers.

func (spi SPI) StartTx(w, r []byte) error

Start a SPI transmission in the background (usually using DMA). This has the same effect as running Tx in a goroutine, but doesn't spawn a new goroutine. The w and r byte slices must not be used while the transmission is in progress, and must be kept reachable outside the StartTx call so they are not garbage collected.

It is allowed to start multiple transactions without waiting for the first to finish. They will have the same effect as if Tx were called multiple times in sequence. Much hardware won't support this, however, and will simply wait for the first transmission to finish before starting a new one.

If IsAsync returns false, this is an alias for Tx.

func (spi SPI) Wait() error

Wait until all active transactions (started by StartTx) have finished. The buffers provided in StartTx will be available after calling Wait.

If IsAsync returns false, this is a no-op.
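For reference, the proposed method set could be summarized as a Go interface (a sketch only; AsyncSPI is a hypothetical name, and in the proposal the methods would live directly on machine.SPI):

// AsyncSPI is a hypothetical name for the proposed method set; in the
// actual proposal these methods are added to machine.SPI directly.
type AsyncSPI interface {
	// IsAsync reports whether transfers can run in the background (usually via DMA).
	IsAsync() bool
	// StartTx starts a transfer; w and r must not be touched until Wait returns.
	StartTx(w, r []byte) error
	// Wait blocks until all transfers started with StartTx have finished.
	Wait() error
}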

Alternatives I've considered:

  • Making Tx yield to the scheduler and relying on goroutines. I tried using this, but found it hard to use correctly - at least when I simply added a gosched() call in Tx. I struggled to make it actually do things in parallel. Once I switched to an API that exposes asynchronous transfers directly, it worked the first time for me.
  • Letting StartTx return a promise/future that can be waited on. I liked the idea, but realized that it would be hard to put in an interface (what's the promise type going to be?). On the other hand, Wait is just a simple call that is easy to put in an interface.
  • Not including IsAsync but making this purely an implementation detail. Unfortunately, for displays it's important to know whether the transfer can be async: if the SPI peripheral isn't async, it's not worth using two separate buffers (one of which would have been used for DMA).
  • Not allowing multiple buffers in StartTx. I don't plan on implementing queuing now, but it seems like a possibly useful feature. Many chips support chained SPI transfers, and if a chip doesn't support it, the behavior can be emulated by starting the next transfer from the DMA-complete interrupt.

I also have a proposal for the Displayer interface coming up, which will use this API. Together, they can deliver very fast SPI display updates, as in my Simplex noise demo here (where I used DMA to update the display in parallel with calculating the next frame, reducing SPI time from 20ms to almost nothing):
https://twitter.com/aykevl/status/1644411686189776897

sago35 (Member) commented Apr 8, 2023

I am creating a DMA package for the samd51.
Its content is almost identical to this proposal, so I'm comfortable with it.

https://github.com/sago35/tinygo-dma

I can adapt it to your proposed interface.

		dmatx.Start() // start the DMA transfer
		dbg6.Toggle() // runs while the transfer is in progress
		dmatx.Wait()  // block until the transfer completes

https://github.com/sago35/tinygo-dma/blob/77e2b3d2bad015c3c62acf6fb16da9322853c2bb/examples/spitx/main.go

aykevl (Member, Author) commented Apr 8, 2023

> I can adapt it to your proposed interface.

You can, but my intention is to integrate this into the TinyGo machine package. So there won't be a need for an external package.

sago35 (Member) commented Apr 8, 2023

Yes, I would also like to include DMA, and SPI communication using DMA, in the machine package.

kenbell (Member) commented Apr 10, 2023

It seems like this is probably a proposal for async I/O (DMA is one implementation) for multiple types of peripherals (I2C, UART, ...).

In general, I like the pattern: StartTx, Wait

I'm not sure about StartTx as an alias of Tx - I suspect the app will be designed to depend on the timing behavior of StartTx. Unless there's some plan to have a background goroutine model for all 'async I/O on synchronous peripherals', I'd sooner catch it at compile time - if async is not available, the StartTx and Wait symbols don't exist.

Given this seems to be starting a pattern for multiple peripheral types, in my opinion it's worth thinking about when two different peripherals are involved. For example, a fully async device reading a sensor over I2C and updating an SPI display. I think you rapidly end up needing something like 'select()' - some way for the app to wait on multiple devices to complete, with a return value indicating which device 'completed'. This probably needs some integration with the scheduler?

In fact, Wait on individual devices might be dangerous - simple for the easy case, but I can see it causing additional issues if there are multiple devices in use (e.g. a library using 'Wait'). Would it be better to have a machine.WaitForAny(....)-style function?
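One hypothetical shape for such a function (a goroutine-based sketch; the Waiter interface and WaitForAny name are assumptions, and a real machine implementation would presumably avoid the allocations):

// Waiter is a hypothetical interface for any peripheral with async completion.
type Waiter interface {
	Wait() error
}

// WaitForAny blocks until the first of several devices completes, returning
// the index of the device that finished and its error.
func WaitForAny(devs ...Waiter) (int, error) {
	type result struct {
		index int
		err   error
	}
	ch := make(chan result, len(devs))
	for i, d := range devs {
		go func(i int, d Waiter) {
			ch <- result{i, d.Wait()}
		}(i, d)
	}
	r := <-ch
	return r.index, r.err
}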

kenbell (Member) commented Apr 10, 2023

Thinking some more about general-purpose async IO, you'd want something like this for the app?

display.StartDisplay(...)

for {
	select {
	case <-time.After(100*time.Millisecond - (now() - last)):
		// start a reading
		accelSensor.Start()

	case readings := <-accelSensor.Wait():
		// process sensor data

	case <-display.Wait():
		// ... generate bitmap ...

		// update screen as soon as possible
		display.StartDisplay(...)
	}
}

Assuming this kind of API, how would the sensor driver be able to process an async I2C / SPI response? Some kind of callback from the bus maybe?

aykevl (Member, Author) commented Apr 10, 2023

> It seems like this is probably a proposal for async I/O (DMA is one implementation) for multiple types of peripherals (I2C, UART, ...).

Yes, although I haven't really thought of it that way; I was just focused on SPI for now.
Not all peripherals are the same, though. For I2S, for example, we might want to use interrupts instead to avoid stuttering.

(Side note: initially I thought of putting DMA in the name, but then I realized there could be other kinds of async I/O, although I can't really think of any right now.)

> I'm not sure about StartTx as an alias of Tx - I suspect the app will be designed to depend on the timing behavior of StartTx.

With "timing behavior", do you mean the fact that it runs in the background?
My goal is to make the same API available everywhere, so that programs can unconditionally use it and perform well on hardware with DMA support but have a graceful fallback on hardware that doesn't.

For context, my main use for DMA right now is to support fast SPI display updates. The API for that would look something like this:

func (d *Device) StartDrawBitmap(x, y, w, h int16, data []pixel.RGB565) error {
    d.spi.Wait() // finish any previous transfer before touching the bus
    d.setWindow(x, y, w, h)
    return d.spi.StartTx(data) // pseudo-code: data would be passed as a byte slice
}

func (d *Device) Wait() error {
    return d.spi.Wait()
}

And it can be used like this (in pseudo-Go):

func paint(display Displayer, buffers [2][]pixel.RGB565) {
    for cycle := 0; cycle < chunks; cycle++ {
        // pick a buffer that's not currently in use
        buf := buffers[cycle % 2]
        // calculate the next frame
        calculateNextChunk(buf)
        // wait for the previous update to finish (if it hasn't already)
        // this way we can reuse the buffer in the next cycle
        display.Wait()
        // send the new buffer
        display.StartDrawBitmap(x, y, w, h, buf)
    }
    // make sure all active updates are finished
    display.Wait()
}

In this example, the display driver would need to be more complicated to support both asynchronous and synchronous APIs. By making StartTx and Wait available everywhere, the display driver only needs to be written for the asynchronous API, and it will automatically work with the synchronous API (although at reduced performance).

> Unless there's some plan to have a background goroutine model for all 'async I/O on synchronous peripherals', I'd sooner catch it at compile time - if async is not available, the StartTx and Wait symbols don't exist.

I'm not planning to, and I doubt a background goroutine will help in any way. The time between sending two bytes over SPI is likely to be shorter than a context switch.

> Given this seems to be starting a pattern for multiple peripheral types, in my opinion it's worth thinking about when two different peripherals are involved. For example, a fully async device reading a sensor over I2C and updating an SPI display. I think you rapidly end up needing something like 'select()' - some way for the app to wait on multiple devices to complete, with a return value indicating which device 'completed'. This probably needs some integration with the scheduler?

Hmm, that's an interesting thought. For more complicated cases like you describe (I2C and SPI), it might be worth relying on goroutines (the synchronous API could still use DMA and yield to the scheduler).

However, the API you describe with channels is quite interesting. I'm not sure about the overhead of channels, but I do like how it is much easier to compose with other async operations. I think I will prototype it and see what happens.
(In this case I would go for chan struct{} or similar, which can potentially be implemented more efficiently than channels that carry a value.)

kenbell (Member) commented Apr 10, 2023

> > I'm not sure about StartTx as an alias of Tx - I suspect the app will be designed to depend on the timing behavior of StartTx.
>
> With "timing behavior", do you mean the fact that it runs in the background? My goal is to make the same API available everywhere, so that programs can unconditionally use it and perform well on hardware with DMA support but have a graceful fallback on hardware that doesn't.
>
> ...snip...
>
> In this example, the display driver would need to be more complicated to support both asynchronous and synchronous APIs. By making StartTx and Wait available everywhere, the display driver only needs to be written for the asynchronous API, and it will automatically work with the synchronous API (although at reduced performance).

Thinking again, at the bus level, I agree - drivers should be written once, using StartTx. Different drivers for sync vs async capable busses would be unwieldy.

> However, the API you describe with channels is quite interesting. I'm not sure about the overhead of channels, but I do like how it is much easier to compose with other async operations. I think I will prototype it and see what happens. (In this case I would go for chan struct{} or similar, which can potentially be implemented more efficiently than channels that carry a value.)

Another example of 'async IO' might be GPIO interrupts. Currently, app logic has to implement an interrupt-safe handler for GPIO events - the rules around which are 'subtle'. Async I/O would likely be 'good enough' for many use cases, I suspect, and would allow memory allocation, etc., in response to a GPIO event:

select {
case upEvent := <-machine.GPIO10.Wait():
	// 'up' button
case downEvent := <-machine.GPIO11.Wait():
	// 'down' button

	// ...
}

aykevl (Member, Author) commented May 21, 2023

Here is one possible way to implement a DMA API using channels:

func (spi SPI) Wait() <-chan error

Return a channel that will send a single value once all queued DMA transactions have completed. The return value is nil if all transactions finished successfully, or an error if at least one transaction failed. After a value is received, the buffers previously passed to StartTx can be accessed again by the application.

Receive at most one value from the returned channel. The easiest way to ensure this is to only use <-spi.Wait(). The returned channel may be different between calls, or it may be the same; it might also be a closed channel (which always yields a value immediately). This is an implementation detail.

The SPI peripheral (like most peripherals) is not thread safe, so it's not safe to access from different goroutines. This also means you can't start new transactions while waiting for a previous transaction to complete.

It can be implemented like this:

  • The SPI peripheral has an internal buffered channel that holds zero or one element, depending on whether a transfer just finished. There is also a global closed chan error that may be returned when Wait is called after the DMA transaction has already finished.
  • StartTx clears the value in the internal channel, if there is one.
  • A DMA complete interrupt will send a single value to the channel if that's the last DMA transfer in progress.
  • Wait returns this internal channel if a DMA transfer is in progress, or else it returns the global closed error channel. A closed channel can still be read from; it returns the zero value.

The reason I'm reusing the same channel (or a shared closed channel) is that allocating a new channel on each transaction would require a heap allocation, which is too slow for an API that's supposed to be performant.
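To make the scheme concrete, here is a rough sketch of the pieces described above (the field names, closedErrChan helper, and ISR hook are all assumptions, not actual machine package code; a real implementation would also need to deliver the error of an already-finished transfer and synchronize pending against the interrupt handler):

// Illustrative sketch of the channel-based scheme.
type SPI struct {
	done    chan error // buffered with capacity 1
	pending bool       // true while a DMA transfer is in flight
}

// closedErrChan is the shared, already-closed channel returned by Wait when
// no transfer is in progress; receiving from it yields nil immediately.
var closedErrChan = func() chan error {
	ch := make(chan error)
	close(ch)
	return ch
}()

func (spi *SPI) StartTx(w, r []byte) error {
	// Clear a leftover completion value from a previous transfer, if any.
	select {
	case <-spi.done:
	default:
	}
	spi.pending = true
	// ... configure and start the DMA transfer here ...
	return nil
}

// onDMAComplete would run from the DMA end-of-transfer interrupt.
func (spi *SPI) onDMAComplete(err error) {
	spi.pending = false
	spi.done <- err // buffered send: never blocks inside the ISR
}

func (spi *SPI) Wait() <-chan error {
	if spi.pending {
		return spi.done
	}
	return closedErrChan
}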

This is how the API could be used in a display driver. Note that it exposes the Wait() function so that a user of the driver could use a select:

func (d *Device) StartDrawBitmap(x, y, w, h int16, data []pixel.RGB565) error {
    <-d.spi.Wait()
    d.setWindow(x, y, w, h)
    return d.spi.StartTx(data) // pseudo-code: data would be passed as a byte slice
}

func (d *Device) Wait() <-chan error {
    return d.spi.Wait()
}

I am leaning slightly in favor of this API, even though it's quite a bit more complex. We'd need some form of synchronization between interrupts and the waiting goroutine anyway, and those might as well be channels. Channels might have a bit of overhead (more than a dedicated synchronization primitive), but I believe those can be optimized if we hit any issues.

@kenbell what do you think?

kenbell (Member) commented May 26, 2023

@aykevl I quite like it. I think I'd have spi.StartTx do the gratuitous wait, rather than the drivers.

Overlapping calls - error or delay?

The other option would be for spi.StartTx to return an error if a transaction is in progress. I can kind of see it both ways, but doing a gratuitous wait might mask logic issues in the app?

Mixing Sync and Async

If sync and async operations occur on the same bus (e.g. on I2C, one driver/device uses async and another uses sync), should sync calls fail, or delay until the async call completes?

Channel re-use

Agree with re-using the channel, not allocating for each transaction.

Sensors, etc...

For transactions involving a read, the driver would do something like this? (bulk read)

func (d *Device) StartRead(buf []byte) error {
    <-d.spi.Wait() // unless we put this in d.spi.StartTx
    return d.spi.StartTx([]byte{READ, REG_DATA, byte(len(buf))}, buf)
}

func (d *Device) Wait() <-chan error {
    return d.spi.Wait()
}

func (d *Device) EndRead() error {
    // if the app doesn't want to wait, it can just call EndRead directly
    return <-d.spi.Wait()
}

Or if more structured data...

func (d *Device) StartHeatAndMeasure() error {
    <-d.spi.Wait() // unless we put this in d.spi.StartTx
    return d.spi.StartTx([]byte{READ, REG_START_HEATER_AND_MEASURE}, d.buf)
}

func (d *Device) Wait() <-chan error {
    return d.spi.Wait()
}

func (d *Device) EndHeatAndMeasure() (uint8, uint8, error) {
    // if the app doesn't want to wait, it can just call EndXXX directly
    err := <-d.spi.Wait()
    return d.buf[0], d.buf[1], err
}

aykevl (Member, Author) commented Jun 1, 2023

> The other option would be for spi.StartTx to return an error if a transaction is in progress. I can kind of see it both ways, but doing a gratuitous wait might mask logic issues in the app?

I guess, but that means it will be impossible for more than one transaction to be queued (meaning, calling StartTx multiple times in sequence, with Wait returning when all of them have finished). Some hardware like the rp2040 does support this, and if the hardware doesn't, it can usually be implemented in software (starting the next transaction in the "transaction complete" interrupt). Perhaps not super useful, but potentially useful as a small performance optimization.
If the hardware doesn't support this feature, the easiest way to support it would indeed be to wait until the previous transaction has finished.
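To illustrate the software fallback, the completion interrupt could pop a queued transfer and start it (a sketch with hypothetical names; startHardwareTx stands in for the low-level transfer start):

// Hypothetical software emulation of chained transfers.
type txRequest struct {
	w, r []byte
}

type txQueue struct {
	pending []txRequest
}

// onDMAComplete runs from the "transaction complete" interrupt and starts
// the next queued transfer, if there is one.
func (q *txQueue) onDMAComplete() {
	if len(q.pending) == 0 {
		return // all done; this is where Wait would be signalled
	}
	next := q.pending[0]
	q.pending = q.pending[1:]
	startHardwareTx(next.w, next.r) // hypothetical low-level start
}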

> If sync and async operations occur on the same bus (e.g. on I2C, one driver/device uses async and another uses sync), should sync calls fail, or delay until the async call completes?

Not sure, it could work either way. We could say "call Wait() before doing a sync transaction", which is of course prone to bugs. The least surprising thing to do would probably be to wait until the last async transaction has finished.

I don't fully understand your EndXxx calls: what's the purpose of those, and how are they different from plain Wait()?
(If it's to abort a transaction - that's possible in theory but probably too risky and likely to cause subtle bugs).

kenbell (Member) commented Jun 2, 2023

I think I agree with your argument. I like the model that the contract is that one or more asynchronous (+ 1 synchronous) transaction can be outstanding and the bus driver is responsible for ensuring all get done safely, whether the bus is truly asynchronous or not. A synchronous bus implementation would convert the async calls into sync ones, and an asynchronous bus implementation would convert the synchronous call into async followed by wait. That way device drivers can be written for the best performance case (async busses) or the most simple code structure (sync busses) without worrying what the bus actually does underneath. Of course the app developer may care, and want all drivers and busses to be async to avoid delays in their app, but that's just an argument to gradually move all busses and device drivers to an async model over time.
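For illustration, the two conversions might look like this (a sketch against the blocking-Wait variant of the proposal; asyncSPI and syncSPI are hypothetical bus implementations, not real machine types):

// An async-capable bus converts the synchronous call into async + wait.
func (spi *asyncSPI) Tx(w, r []byte) error {
	if err := spi.StartTx(w, r); err != nil {
		return err
	}
	return spi.Wait()
}

// A sync-only bus converts the async calls into plain synchronous ones.
func (spi *syncSPI) StartTx(w, r []byte) error { return spi.Tx(w, r) }
func (spi *syncSPI) Wait() error               { return nil }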

> I don't fully understand your EndXxx calls: what's the purpose of those, and how are they different from plain Wait()?
> (If it's to abort a transaction - that's possible in theory but probably too risky and likely to cause subtle bugs).

I'm thinking EndXXX can mirror StartXXX on drivers where an async bus call by the driver should return data to the app (for example for an I2C sensor). The app code would do either:

sensor.StartHeatAndMeasure()
// do stuff
val1, val2, err := sensor.EndHeatAndMeasure()

or the app could do:

sensor.StartHeatAndMeasure()
select {
case <-sensor.Wait():
	val1, val2, err := sensor.EndHeatAndMeasure()
case ...
}

these would be equivalent to the synchronous version:

  val1, val2, err := sensor.HeatAndMeasure()

The async versions would have the benefit that if the sensor needed to hold the bus (for, say, 100ms to heat a hotplate), the app could do other useful stuff (like update the display) without needing goroutines.

aykevl (Member, Author) commented Nov 4, 2023

Rebased the PR (no actual update).

@kenbell looking back on this I'm not so sure about the API with channels. It seems rather brittle to me and very easy to get wrong (especially in the implementation). So what I'd much rather do is add a simple blocking Wait, and only add the special method (WaitChan perhaps?) later if there is an actual need for it.

What do you think of this?

kenbell (Member) commented Nov 5, 2023

@aykevl - let me drop a few comments on the PR. I think I'm missing something about the semantics of StartTx / Wait as-implemented, and it's probably best handled inline with the code.

soypat (Contributor) commented Nov 11, 2023

A few comments:

> Return whether the SPI supports asynchronous operation (usually using DMA). Asynchronous operation may be supported for some or all transfers, for example it may only be supported for send-only transfers.

Wouldn't we be able to tell if a SPI has DMA by looking at whether it implements an interface with StartTx and Wait?
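(For illustration, such a check could be a plain type assertion; the interface literal and names below are hypothetical:)

// Hypothetical capability check via a type assertion instead of IsAsync;
// bus is assumed to be some SPI interface value.
if async, ok := bus.(interface {
	StartTx(w, r []byte) error
	Wait() error
}); ok {
	async.StartTx(txBuf, nil) // asynchronous path
	// ... do other work while the transfer runs ...
	async.Wait()
} else {
	bus.Tx(txBuf, nil) // synchronous fallback
}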

> Making Tx yield to the scheduler and relying on goroutines. I tried using this, but found it hard to use correctly - at least when I simply added a gosched() call in Tx. I struggled to make it actually do things in parallel. Once I switched to an API that exposes asynchronous transfers directly, it worked the first time for me.

This seems odd to me. A scheduled Tx seems like it could be implemented by putting the contents of StartTx and Wait inside the same function and adding gosched within the for loops? At least that's how I implemented it in piolib; the whole 'add scheduled DMA to SPI' change takes up around 35 lines of code.
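(Roughly this shape; startDMA, dmaDone, and txError are hypothetical stand-ins for the low-level pieces:)

// Hypothetical scheduled Tx: start the transfer, then yield to the
// scheduler until the DMA controller reports completion.
func (spi SPI) Tx(w, r []byte) error {
	if err := spi.startDMA(w, r); err != nil { // hypothetical low-level start
		return err
	}
	for !spi.dmaDone() { // hypothetical completion poll
		gosched() // let other goroutines run while the transfer proceeds
	}
	return spi.txError() // hypothetical error retrieval
}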

Which leads me to the next point: I like my DMA abstraction a bit more, since it is readily reusable for PIO, SPI, I2C, etc. Maybe we need a better DMA abstraction so that scheduled SPI can be as good as the fully async approach? (I say "my abstraction", but it's actually heavily inspired by embassy-rs's implementation, which relies heavily on async to use DMA.)

So far I've been using SPI with DMA on the PIO with excellent results, using the above API to communicate with the Pico W's CYW43439 at a 25MHz clock. The core APIs used are the push and pull methods.

aykevl (Member, Author) commented Dec 9, 2023

Sorry I missed this comment.

> Wouldn't we be able to tell if a SPI has DMA by looking at whether it implements an interface with StartTx and Wait?

(I think I answered this elsewhere but I don't remember where, so repeating here).

Yes, and it's certainly a viable implementation strategy. However, having a single API for all SPI objects makes it slightly easier to write drivers: they can use either API and be correct in both cases.

Also, this only works at the machine.SPI layer. Once you write a driver (like a display driver), you're not going to have different implementations depending on whether the bus supports DMA or not. So in that case, you'd have an IsAsync method on the driver object anyway.

> > Making Tx yield to the scheduler and relying on goroutines. I tried using this, but found it hard to use correctly - at least when I simply added a gosched() call in Tx. I struggled to make it actually do things in parallel. Once I switched to an API that exposes asynchronous transfers directly, it worked the first time for me.
>
> This seems odd to me. A scheduled Tx seems like it could be implemented by putting the contents of StartTx and Wait inside the same function and adding gosched within the for loops?

Already explained this in my other comment: tinygo-org/tinygo#3985 (comment)
I'd be very curious if you have found a way around this scheduling issue? Or does the pio library suffer from the same problem?

ysmilda commented Jun 20, 2024

The initial proposed API covers the use cases for a display quite nicely. The way StartTx and Wait are handled is very similar to how one would do this outside (Tiny)Go, and in my opinion it is good to stay close to the hardware here.

Will the same API be made available for the other peripherals and how will they handle sharing the available DMA channels?
