Proposal: add DMA support to SPI #342
base: dev
Conversation
I am creating a DMA package for the SAMD51: https://github.com/sago35/tinygo-dma. I can adapt it to your proposed interface.
dmatx.Start()
dbg6.Toggle()
dmatx.Wait() |
You can, but my intention is to integrate this into the TinyGo machine package. So there won't be a need for an external package. |
Yes, I too would like to include DMA, and SPI communication using DMA, in the machine package. |
It seems like this is probably a proposal for async I/O (DMA is one implementation) for multiple types of peripherals (I2C, UART, ...). In general, I like the pattern: StartTx, Wait.
I'm not sure about StartTx as an alias of Tx - I suspect the app will be designed to depend on the timing behavior of StartTx. Unless there's some plan to have a background goroutine model for all 'async I/O on synchronous peripherals', I'd sooner catch it at compile time - if async is not available, the StartTx and Wait symbols don't exist.
Given this seems to be starting a pattern for multiple peripheral types, in my opinion it's worth thinking about when two different peripherals are involved. For example, a fully async device reading a sensor over I2C and updating an SPI display. I think you rapidly end up needing something like 'select()' - some way for the app to wait on multiple devices to complete, with a return value indicating which device 'completed'. This probably needs some integration with the scheduler?
In fact, Wait on individual devices might be dangerous - simple for the easy case, but I can see it causing additional issues if there are multiple devices in use (e.g. a library using 'Wait'). Would be better to have a |
Thinking some more about general-purpose async I/O, you'd want something like this for the app?
display.StartDisplay(...)
for {
select {
case <-time.After(100 * time.Millisecond - (now() - last)):
// start a reading
accelSensor.Start()
case readings := <-accelSensor.Wait():
// process sensor data
case <-display.Wait():
// ... generate bitmap ...
// update screen as soon as possible
display.StartDisplay(...)
}
}
Assuming this kind of API, how would the sensor driver be able to process an async I2C / SPI response? Some kind of callback from the bus maybe? |
Yes, although I haven't really thought of it that way. I was just focused on SPI for now. (Side note: initially I thought of putting DMA in the name but then I realized there could be other kinds of async I/O although I can't really think of any right now).
With "timing behavior", do you mean the fact that it runs in the background? For context, my main use for DMA right now is to support fast SPI display updates. The API for that would look something like this: func (d *Device) StartDrawBitmap(x, y, w, h int16, data []pixel.RGB565) error {
d.spi.Wait()
d.setWindow(x, y, w, h)
return d.spi.StartTx(data)
}
func (d *Device) Wait() error {
return d.spi.Wait()
}
And it can be used like this (in pseudo-Go):
func paint(display Displayer, buffers [2][]pixel.RGB565) {
for cycle := 0; cycle < chunks; cycle++ {
// pick a buffer that's not currently in use
buf := buffers[cycle % 2]
// calculate the next frame
calculateNextChunk(buf)
// wait for the previous update to finish (if it hasn't already)
// this way we can reuse the buffer in the next cycle
display.Wait()
// send the new buffer
display.StartDrawBitmap(x, y, w, h, buf)
}
// make sure all active updates are finished
display.Wait()
}
In this example, the display driver would need to be more complicated to support both asynchronous and synchronous APIs. By making
I'm not planning to, and I doubt a background goroutine will help in any way. The time between sending two bytes over SPI is likely to be shorter than a context switch.
Hmm, that's an interesting thought. For more complicated cases like you describe (I2C and SPI), it might be worth relying on goroutines (the synchronous API could still use DMA and yield to the scheduler). However, the API you describe with channels is quite interesting. I'm not sure about the overhead of channels, but I do like how it is much easier to compose with other async operations. I think I will prototype it and see what happens. |
...snip...
Thinking again, at the bus level, I agree - drivers should be written once, using StartTx. Different drivers for sync vs async capable busses would be unwieldy.
Another example of 'async I/O' might be GPIO interrupts. Currently, app logic has to implement an interrupt-safe handler for GPIO events - the rules around which are 'subtle'. Async I/O would likely be 'good enough' for many use cases, I suspect, and would allow memory allocation, etc. in response to a GPIO event:
select {
case upEvent := <- machine.GPIO10.Wait():
// 'up' button
case downEvent := <- machine.GPIO11.Wait():
// 'down' button
//...
} |
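As an illustration of how such a per-pin Wait channel could be fed (a sketch only, not part of the proposal; it assumes the runtime tolerates a non-blocking channel send from an interrupt handler, and GPIO10/PinFalling are just example choices):

package main

import "machine"

// A reused, buffered channel so the interrupt handler never has to block.
var upEvents = make(chan machine.Pin, 1)

func main() {
    machine.GPIO10.Configure(machine.PinConfig{Mode: machine.PinInputPullup})
    machine.GPIO10.SetInterrupt(machine.PinFalling, func(p machine.Pin) {
        // Non-blocking send from the handler; drops the event if the previous
        // one hasn't been consumed yet. Whether channel operations are safe
        // from interrupt context is exactly the runtime support this idea needs.
        select {
        case upEvents <- p:
        default:
        }
    })

    for range upEvents {
        // 'up' button: ordinary Go code here, allocation included.
    }
}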
Here is one possible way to implement a DMA API using channels:
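A rough sketch of the shape such an API could take, inferred from the usage further down (the exact names and signatures here are assumptions):

// Sketch only; signatures inferred from the display driver example below.
type asyncSPI interface {
    // StartTx begins an asynchronous (DMA-backed) transfer and returns right
    // away; the buffer must not be reused until the transfer has completed.
    StartTx(tx []byte) error
    // Wait returns a channel that receives the result of the transfer started
    // by StartTx. If nothing is in flight, it returns an already-closed
    // channel, so receiving from it never blocks.
    Wait() chan error
}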
It can be implemented like this:
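One way it could be implemented, again as a sketch (the pre-closed channel trick, the initDMA helper, and the interrupt hook name are illustrative assumptions):

type SPI struct {
    done chan error // allocated once and reused for every transfer
    busy bool       // a real port needs interrupt-safe access to this flag
}

// closedChan is handed out by Wait when nothing is in flight, so that
// <-spi.Wait() returns immediately in that case.
var closedChan = func() chan error {
    ch := make(chan error)
    close(ch)
    return ch
}()

func (spi *SPI) initDMA() {
    spi.done = make(chan error, 1) // buffered: the interrupt's send never blocks
}

func (spi *SPI) StartTx(tx []byte) error {
    spi.busy = true
    // ...program the DMA channel with tx and kick off the transfer...
    return nil
}

func (spi *SPI) Wait() chan error {
    if !spi.busy {
        return closedChan
    }
    return spi.done
}

// txComplete would be called from the DMA "transfer complete" interrupt.
// (A real implementation also has to drain stale results and report DMA
// errors; both are omitted here.)
func (spi *SPI) txComplete() {
    spi.busy = false
    spi.done <- nil // buffered send: does not block inside the interrupt
}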
The reason why I'm using the same channels (or a shared closed channel) is that allocating a new channel on each transaction would require a heap allocation, and is therefore too slow for an API that's supposed to be performant.
This is how the API could be used in a display driver. Note that it exposes the underlying SPI Wait channel directly:
func (d *Device) StartDrawBitmap(x, y, w, h int16, data []pixel.RGB565) error {
<-d.spi.Wait()
d.setWindow(x, y, w, h)
return d.spi.StartTx(data)
}
func (d *Device) Wait() chan error {
return d.spi.Wait()
}
I am leaning slightly in favor of this API, even though it's quite a bit more complex. We'd need some form of synchronization between interrupts and the waiting goroutine anyway, and those might as well be channels. Channels might have a bit of overhead (more than a dedicated synchronization primitive), but I believe those can be optimized if we hit any issues. @kenbell what do you think? |
@aykevl I quite like it. I think I'd have
Overlapping calls - error or delay?
The other option would be for
Mixing Sync and Async
If sync and async operations occur on the same bus - e.g. I2C, where one driver/device uses async and another uses sync - should sync calls fail, or delay until the async call completes?
Channel re-use
Agree with re-using the channel, not allocating for each transaction.
Sensors, etc...
For transactions involving a read, the driver would do something like this? (bulk read)
func (d *Device) StartRead(buf []byte) error {
<-d.spi.Wait() // unless we put this in d.spi.StartTx
return d.spi.StartTx([]byte{READ, REG_DATA, byte(len(buf))}, buf)
}
func (d *Device) Wait() chan error {
return d.spi.Wait()
}
func (d *Device) EndRead() error {
<- d.spi.Wait() // if app doesn't want to wait, it can just call EndRead directly
return nil
} Or if more structured data... func (d *Device) StartHeatAndMeasure() error {
<-d.spi.Wait() // unless we put this in d.spi.StartTx
return d.spi.StartTx([]byte{READ,REG_START_HEATER_AND_MEASURE}, d.buf)
}
func (d *Device) Wait() chan error {
return d.spi.Wait()
}
func (d *Device) EndHeatAndMeasure() (uint8, uint8, error) {
<- d.spi.Wait() // if app doesn't want to wait, it can just call EndXXX directly
return d.buf[0], d.buf[1], nil
} |
I guess, but that means that it will be impossible for more than one transaction to be queued (meaning, calling
Not sure, it could work either way. We could say "call Wait() before doing a sync transaction" which is of course kind of sensitive to bugs. The least surprising thing to do would probably be to wait until the last async transaction has finished. I don't fully understand your |
I think I agree with your argument. I like the model that the contract is that one or more asynchronous (+ 1 synchronous) transaction can be outstanding and the bus driver is responsible for ensuring all get done safely, whether the bus is truly asynchronous or not. A synchronous bus implementation would convert the async calls into sync ones, and an asynchronous bus implementation would convert the synchronous call into async followed by wait. That way device drivers can be written for the best performance case (async busses) or the most simple code structure (sync busses) without worrying what the bus actually does underneath. Of course the app developer may care, and want all drivers and busses to be async to avoid delays in their app, but that's just an argument to gradually move all busses and device drivers to an async model over time.
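To make that contract concrete, here is a sketch (illustrative types, not code from this PR) of how a synchronous-only port and a DMA-capable port could both provide the same Tx/StartTx/Wait set:

// A synchronous-only port: the async methods fall back to blocking behaviour.
type syncSPI struct{ /* ... */ }

func (s *syncSPI) Tx(w, r []byte) error      { return nil }       // shift the bytes out right now
func (s *syncSPI) StartTx(w, r []byte) error { return s.Tx(w, r) } // "async" call just runs synchronously
func (s *syncSPI) Wait() error               { return nil }       // nothing can still be in flight

// A DMA-capable port: the blocking call is built from the async pair.
type dmaSPI struct{ /* ... */ }

func (s *dmaSPI) StartTx(w, r []byte) error { /* program DMA and return */ return nil }
func (s *dmaSPI) Wait() error               { /* block until the DMA IRQ fires */ return nil }
func (s *dmaSPI) Tx(w, r []byte) error {
    if err := s.StartTx(w, r); err != nil {
        return err
    }
    return s.Wait()
}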
I'm thinking
sensor.StartHeatAndMeasure()
// do stuff
val1, val2, val3 := sensor.EndHeatAndMeasure()
or the app could do:
sensor.StartHeatAndMeasure()
select {
case <-sensor.Wait():
val1, val2, val3 := sensor.EndHeatAndMeasure()
case ...
}
These would be equivalent to the synchronous version:
val1, val2, val3 := sensor.HeatAndMeasure()
The async versions would have the benefit that if the sensor needed to hold the bus (for, say, 100ms to heat a hotplate), the app can do other useful stuff (like update the display) without needing goroutines. |
Tested on Fedora 39.
This makes test outputs more consistent, even though it is a lot slower on my arm64 laptop (running the tests in QEMU).
Rebased the PR (no actual update).
@kenbell looking back on this I'm not so sure about the API with channels. It seems rather brittle to me and very easy to get wrong (especially in the implementation). So what I'd much rather do is add a simple blocking Wait() call.
What do you think of this? |
@aykevl - let me drop a few comments on the PR. I think I'm missing something about the semantics of StartTx / Wait as-implemented, and it's probably best handled inline with the code. |
A few comments:
Wouldn't we be able to tell if a SPI has DMA by looking at whether it implements an interface with StartTx and Wait?
This seems odd to me. A scheduled Tx seems like it could be implemented by putting the contents of StartTx and Wait inside the same function and adding
Which leads me to the next point: I like my DMA abstraction a bit more, since it is readily reusable for PIO, SPI, I2C, etc. Maybe we need a better DMA abstraction for scheduled SPI to be as good as async? (I say "my abstraction", but it's actually heavily inspired by embassy-rs's implementation, which relies heavily on async to use DMA.)
So far I've been using SPI with DMA on the PIO with excellent results, using the above API to communicate with the Pico W's CYW43439 at a 25MHz clock. The core API methods used are push and pull. |
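The push/pull shape being described could look roughly like this (a sketch only; dmaChannel and the signatures below are hypothetical illustrations, not the abstraction referenced above):

import "runtime/volatile"

// dmaChannel is a hypothetical type illustrating the push/pull idea.
type dmaChannel struct {
    // chip-specific channel registers/state would live here
}

// Push streams a memory buffer into a fixed peripheral data register
// (memory -> peripheral), e.g. an SPI or PIO TX register.
func (ch *dmaChannel) Push(dst *volatile.Register8, src []byte) error {
    // a real implementation would configure: read address incrementing over
    // src, write address fixed at dst, transfer count len(src), then trigger
    // the channel and wait for (or report) completion.
    return nil
}

// Pull streams a fixed peripheral data register into a memory buffer
// (peripheral -> memory); the mirror image of Push.
func (ch *dmaChannel) Pull(dst []byte, src *volatile.Register8) error {
    // read address fixed at src, write address incrementing over dst.
    return nil
}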
Sorry I missed this comment.
(I think I answered this elsewhere but I don't remember where, so repeating here.) Yes, and it's certainly a viable implementation strategy. However, having a single API for all SPI objects makes it slightly easier to write drivers: they can use one or the other API and be correct in both cases. Also, this only works at the
Already explained this in my other comment: tinygo-org/tinygo#3985 (comment) |
The initial proposed API covers the use cases for a display quite nicely. The way a
Will the same API be made available for the other peripherals, and how will they handle sharing the available DMA channels? |
Here is a possible API I've designed to add DMA support in the machine SPI API.
Here is a copy of the API description:
Alternatives I've considered:
- Making Tx yield to the scheduler and relying on goroutines. I tried using this, but found it hard to use correctly - at least when I simply added a gosched() call in Tx. I struggled to make it actually do things in parallel. Once I switched to an API that exposes asynchronous transfers directly, it worked the first time for me.
- Making StartTx return a promise/future that can be waited on. I liked the idea, but realized that it would be hard to put in an interface (what's the promise type going to be?). On the other hand, Wait is just a simple call that is easy to put in an interface.
- Dropping IsAsync but making this purely an implementation detail. Unfortunately, for displays it's important to know whether the transfer can be async: if the SPI peripheral isn't async, it's not worth using two separate buffers (one of which would have been used for DMA).
- Allowing multiple transfers to be queued with StartTx. I don't plan on implementing queuing now, but it seems like a possibly useful feature. Many chips support chained SPI transfers, and if the chip doesn't support it, it can be emulated by starting the next transfer from the DMA end interrupt.
I also have a proposal for the Displayer interface coming up which will use this API. Together, they can deliver very fast SPI display updates, like in my Simplex noise demo here (where I used DMA to update the display in parallel with calculating the next frame, reducing SPI time from 20ms to almost nothing):
https://twitter.com/aykevl/status/1644411686189776897
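Taken together with the discussion above, the overall interface shape would look roughly like this (a sketch only; the authoritative signatures are the ones in the PR's code changes):

// Sketch of the shape implied by the names used in this thread (Tx exists in
// machine.SPI today; StartTx, Wait and IsAsync are the proposed additions).
type asyncSPIBus interface {
    // Tx is the existing blocking transfer.
    Tx(w, r []byte) error
    // StartTx begins an asynchronous (typically DMA-backed) transfer and
    // returns immediately; the buffers must stay untouched until Wait returns.
    StartTx(w, r []byte) error
    // Wait blocks until the transfer started by StartTx has finished.
    Wait() error
    // IsAsync reports whether StartTx really runs in the background on this
    // chip; if not, double buffering in a display driver buys nothing.
    IsAsync() bool
}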