MAM: Acknowledge writes #30

imphil · 2016-11-24T17:35:12Z

Currently, osd_memory_write() is fire-and-forget; the software does not know when MAM has finished writing data to the memory. This causes problems when memory loading must be finished before CPUs can be started.

We therefore should

Add an write acknowledgement event to MAM which is sent out if a write request has finished.
Make the OSD software block until this write ACK has arrived.

The text was updated successfully, but these errors were encountered:

wsong83 · 2016-11-24T17:48:37Z

At least for the lowRISC implementation, all transactions are in order.
So any read following the write is served after the write is finished.
Although this does not guarantees that if there is any internal buffers between MAM and memory/L2.
Currently there is none. So if a read after write finished, it is guaranteed that the write is finished (observable to processor).

wallento · 2016-11-24T17:53:36Z

Yes, the packets are ordered, but currently it is possible to send the "start" packet while the MAM is still writing, leading potentially to corruption..

wallento · 2016-11-24T17:56:04Z

@imphil I "solved" that before by reading back the memory (-verify)

wsong83 · 2016-11-24T18:24:50Z

Enforcing a reply for every write leads to latency overhead and control overhead in the chisel mam. I think if a reply is really needed just for strong order, may be a special type of write rather than every write transaction.

wallento · 2016-11-24T18:31:52Z

You are right, Wei. For now I was only thinking about an acknowledge on the internal MAM interface when the write is issued. But that is not necessarily when it is finished.

imphil · 2016-11-24T23:42:48Z

First, how can lowrisc not be affected by that? The problem is independent of the ordering or buffering. Assume the following scenario: You issue a MAM write request, and directly after the last MAM packet you send the start cpu register write to STM (that's what happens currently). This start cpu packet will reach STM (which is the second module in the NoC) before even the MAM worm is at its destination (in our case for example, it's ~20 more hops) causing the system to start before MAM has a chance to write the data to the memory, independent of any buffering.

Second, there is no significant additional latency overhead. The software does not need to wait for the ACK before it sends the next write. It only should wait for the ACK of the last packet. So essentially we introduce one roundtrip latency between writing the last memory through the MAM and starting the system. That should be acceptable. Also the additional NoC traffic caused by ACK packets is small enough to be acceptable.

wsong83 · 2016-11-25T11:01:27Z

lowRISC is affected by this and this is indeed a problem which should be fixed.
What I argue about is how to fix it, either ask debug software to work around it or ask hardware to give an ack in a light-weighted manner.

The connection can be looked as:
HIM <> Debug Network <> Debug MAM module <> Chisel MAM (MAM/TileLink bridge) <> TileLink Bus (crossbar) <> L2 cache coherence tracker <> L2 cache

Currently, all components, from HIM to L2 cache coherence tracker treat write operations as fire and forget. To safely provide an ack (ensure that the write operation is finished), it requires the Chisel MAM to issue a special write operation to the TileLink Bus; therefore, the L2 cache coherence tracker will send back a finish packet to Chisel MAM when the data are written the L2 cache. Then the Chisel MAM can generate an ack to HIM.

To efficiently utilize the available bandwidth, Chisel MAM will not wait for the finish packet from L2 coherence tracker before sending the following write requests to TileLink. This means multiple write operations are in fly simultaneously. A buffer is needed to record all these concurrent operations. The reasons for this buffer: (1) L2 is banked and every L2 bank have multiple parallel tracker. So L2 does not ensure the write order. This means the finish packets may come back out-of-order. (2) If there are multiple masters on the debug network, for example we support HIM and CTM to store traces in memory/cache, the Chisel MAM needs to record which packet an ack is responding to.

Once this buffer is required, the control logic of Chisel MAM is increased, could be significantly. If the Chisel MAM do not want to keep a record, it then force sequential write operations: block further write operations until a finish packet is received from L2 coherence trackers. Then this would incur latency overhead.

My proposals are:
(1) Ask debug software to get around this issue. After all write requests are issued to HIM, debug software reads the last byte written. Chisel MAM issues all operations in order, so the read operation is issued to L2 after all write.
Although this still does not guarantee all write operations are finished (L2 is still out-of-order), the chance is extremely low considering the long delay in sending back an ack.
To play safe, debug software can then wait for a short while after the read is finished.
This would require no hardware changes.

(2) A light-weight hardware fix.
If the debug network can support two types of write, the normal one which is fire and forget and the special one which asks for an ack.
Then the Chisel MAM can enforce an ack to the L2 only for this special type.
Since at any time there should only be one special write, the Chisel MAM can block any further write operation when a special write is still on the fly. There is no need to have a record of all concurrent write but only the special one. Both latency and control overhead is very low.

I hope I have explained it better this time.

wallento · 2016-11-25T11:20:41Z

Hi,

I think we should fix it for now with an acknowledge on the last packet of a write on write setup (write_valid & write_ready). As @wsong83 said it is nearly impossible to guarantee a write was effective as long as there is no RISC-V/Tilelink memory ordering model (which is pending, right?). In the long run we can then change the hardware behavior with an extension to the MAM interface:

Add write_sync output that tells the interconnect layer that this write and everything before need to complete
Add write_complete input that the interconnect layer uses to signal the fence back

That would not require another software change.

Does that sound reasonable?

imphil · 2016-11-25T12:00:47Z

Thanks Wei for the explanation, now I understand what you meant.

The most significant problem (and only one in OpTiMSoC actually) is the path from the host to the MAM. For this problem, I think the easiest solution is just to issue a memory read after the last memory write, as you proposed Wei in your first solution.

The much harder problem of ensuring that the data actually made it from MAM to the memory (or the caches) can for now either solved by a simple sleep() (since the time window here should be really small) or actually going the way Stefan/Wei Solution 2. I'll leave that up for you guys, since you better know what's going on inside LowRISC.

imphil · 2016-11-25T12:28:01Z

Maybe to add to that, from how i understood Wei we need to have two writes, one fire and forget and one sync one, as always issuing sync writes is too costly.

If I get your idea right Stefan then you think about a version with acks which we can later extend to provide fully synced writes to the memory system without software changes. But if we need two write types, this won't work without additional changes to the software to use the sync write type anyways? (the other option would be to add a config register to mam which can switch between two "meanings" of the ack, and set that one to "full barrier mode" before issuing the last write request)

wallento · 2016-11-25T12:32:02Z

Actually it will be a flag in the last packet that this needs to sync I suppose. Alternatively, we can just have a memory sync packet I think.

wsong83 · 2016-11-25T13:07:41Z

@imphil Yes, my idea was to have two types of write. I think @wallento 's idea is to let debug MAM handle it internally because the long debug packet from HIM is broken into smaller packets to Chisel MAM. I think @wallento 's idea could work just fine. Then this is hidden from software.

Just find out that actually now the Chisel MAM does not run the TileLink with full throttle. It actually waits from an ack from TileLink before sending the next write. I think the idea was the internal buffer in Chisel MAM is able to hide this latency considering TileLink is much faster than Debug network. However, synced write still has delay overhead.

I think a single write_sync would be enough. If write_sync is high for a write request, the Chisel MAM can hold req.ready low until the whole write request is finished. I think I can have this done with couple of lines. It should be easy to do in debug MAM as well I think?

It could get complicated when MAM runs in full throttle, but let's leave it for the future.

imphil mentioned this issue Nov 24, 2016

MAM does not properly load last tile optimsoc/optimsoc#30

Closed

wallento mentioned this issue Nov 29, 2016

MAM: Add synchronous flag #31

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAM: Acknowledge writes #30

MAM: Acknowledge writes #30

imphil commented Nov 24, 2016

wsong83 commented Nov 24, 2016

wallento commented Nov 24, 2016

wallento commented Nov 24, 2016

wsong83 commented Nov 24, 2016

wallento commented Nov 24, 2016

imphil commented Nov 24, 2016

wsong83 commented Nov 25, 2016

wallento commented Nov 25, 2016

imphil commented Nov 25, 2016

imphil commented Nov 25, 2016

wallento commented Nov 25, 2016

wsong83 commented Nov 25, 2016

MAM: Acknowledge writes #30

MAM: Acknowledge writes #30

Comments

imphil commented Nov 24, 2016

wsong83 commented Nov 24, 2016

wallento commented Nov 24, 2016

wallento commented Nov 24, 2016

wsong83 commented Nov 24, 2016

wallento commented Nov 24, 2016

imphil commented Nov 24, 2016

wsong83 commented Nov 25, 2016

wallento commented Nov 25, 2016

imphil commented Nov 25, 2016

imphil commented Nov 25, 2016

wallento commented Nov 25, 2016

wsong83 commented Nov 25, 2016