Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAM: Acknowledge writes #30

Open
imphil opened this issue Nov 24, 2016 · 12 comments
Open

MAM: Acknowledge writes #30

imphil opened this issue Nov 24, 2016 · 12 comments

Comments

@imphil
Copy link
Contributor

imphil commented Nov 24, 2016

Currently, osd_memory_write() is fire-and-forget; the software does not know when MAM has finished writing data to the memory. This causes problems when memory loading must be finished before CPUs can be started.

We therefore should

  • Add an write acknowledgement event to MAM which is sent out if a write request has finished.
  • Make the OSD software block until this write ACK has arrived.
@wsong83
Copy link
Member

wsong83 commented Nov 24, 2016

At least for the lowRISC implementation, all transactions are in order.
So any read following the write is served after the write is finished.
Although this does not guarantees that if there is any internal buffers between MAM and memory/L2.
Currently there is none. So if a read after write finished, it is guaranteed that the write is finished (observable to processor).

@wallento
Copy link
Member

Yes, the packets are ordered, but currently it is possible to send the "start" packet while the MAM is still writing, leading potentially to corruption..

@wallento
Copy link
Member

@imphil I "solved" that before by reading back the memory (-verify)

@wsong83
Copy link
Member

wsong83 commented Nov 24, 2016

Enforcing a reply for every write leads to latency overhead and control overhead in the chisel mam. I think if a reply is really needed just for strong order, may be a special type of write rather than every write transaction.

@wallento
Copy link
Member

You are right, Wei. For now I was only thinking about an acknowledge on the internal MAM interface when the write is issued. But that is not necessarily when it is finished.

@imphil
Copy link
Contributor Author

imphil commented Nov 24, 2016

First, how can lowrisc not be affected by that? The problem is independent of the ordering or buffering. Assume the following scenario: You issue a MAM write request, and directly after the last MAM packet you send the start cpu register write to STM (that's what happens currently). This start cpu packet will reach STM (which is the second module in the NoC) before even the MAM worm is at its destination (in our case for example, it's ~20 more hops) causing the system to start before MAM has a chance to write the data to the memory, independent of any buffering.

Second, there is no significant additional latency overhead. The software does not need to wait for the ACK before it sends the next write. It only should wait for the ACK of the last packet. So essentially we introduce one roundtrip latency between writing the last memory through the MAM and starting the system. That should be acceptable. Also the additional NoC traffic caused by ACK packets is small enough to be acceptable.

@wsong83
Copy link
Member

wsong83 commented Nov 25, 2016

lowRISC is affected by this and this is indeed a problem which should be fixed.
What I argue about is how to fix it, either ask debug software to work around it or ask hardware to give an ack in a light-weighted manner.

The connection can be looked as:
HIM <> Debug Network <> Debug MAM module <> Chisel MAM (MAM/TileLink bridge) <> TileLink Bus (crossbar) <> L2 cache coherence tracker <> L2 cache

Currently, all components, from HIM to L2 cache coherence tracker treat write operations as fire and forget. To safely provide an ack (ensure that the write operation is finished), it requires the Chisel MAM to issue a special write operation to the TileLink Bus; therefore, the L2 cache coherence tracker will send back a finish packet to Chisel MAM when the data are written the L2 cache. Then the Chisel MAM can generate an ack to HIM.

To efficiently utilize the available bandwidth, Chisel MAM will not wait for the finish packet from L2 coherence tracker before sending the following write requests to TileLink. This means multiple write operations are in fly simultaneously. A buffer is needed to record all these concurrent operations. The reasons for this buffer: (1) L2 is banked and every L2 bank have multiple parallel tracker. So L2 does not ensure the write order. This means the finish packets may come back out-of-order. (2) If there are multiple masters on the debug network, for example we support HIM and CTM to store traces in memory/cache, the Chisel MAM needs to record which packet an ack is responding to.

Once this buffer is required, the control logic of Chisel MAM is increased, could be significantly. If the Chisel MAM do not want to keep a record, it then force sequential write operations: block further write operations until a finish packet is received from L2 coherence trackers. Then this would incur latency overhead.

My proposals are:
(1) Ask debug software to get around this issue. After all write requests are issued to HIM, debug software reads the last byte written. Chisel MAM issues all operations in order, so the read operation is issued to L2 after all write.
Although this still does not guarantee all write operations are finished (L2 is still out-of-order), the chance is extremely low considering the long delay in sending back an ack.
To play safe, debug software can then wait for a short while after the read is finished.
This would require no hardware changes.

(2) A light-weight hardware fix.
If the debug network can support two types of write, the normal one which is fire and forget and the special one which asks for an ack.
Then the Chisel MAM can enforce an ack to the L2 only for this special type.
Since at any time there should only be one special write, the Chisel MAM can block any further write operation when a special write is still on the fly. There is no need to have a record of all concurrent write but only the special one. Both latency and control overhead is very low.

I hope I have explained it better this time.

@wallento
Copy link
Member

Hi,

I think we should fix it for now with an acknowledge on the last packet of a write on write setup (write_valid & write_ready). As @wsong83 said it is nearly impossible to guarantee a write was effective as long as there is no RISC-V/Tilelink memory ordering model (which is pending, right?). In the long run we can then change the hardware behavior with an extension to the MAM interface:

  • Add write_sync output that tells the interconnect layer that this write and everything before need to complete
  • Add write_complete input that the interconnect layer uses to signal the fence back

That would not require another software change.

Does that sound reasonable?

@imphil
Copy link
Contributor Author

imphil commented Nov 25, 2016

Thanks Wei for the explanation, now I understand what you meant.

The most significant problem (and only one in OpTiMSoC actually) is the path from the host to the MAM. For this problem, I think the easiest solution is just to issue a memory read after the last memory write, as you proposed Wei in your first solution.

The much harder problem of ensuring that the data actually made it from MAM to the memory (or the caches) can for now either solved by a simple sleep() (since the time window here should be really small) or actually going the way Stefan/Wei Solution 2. I'll leave that up for you guys, since you better know what's going on inside LowRISC.

@imphil
Copy link
Contributor Author

imphil commented Nov 25, 2016

Maybe to add to that, from how i understood Wei we need to have two writes, one fire and forget and one sync one, as always issuing sync writes is too costly.

If I get your idea right Stefan then you think about a version with acks which we can later extend to provide fully synced writes to the memory system without software changes. But if we need two write types, this won't work without additional changes to the software to use the sync write type anyways? (the other option would be to add a config register to mam which can switch between two "meanings" of the ack, and set that one to "full barrier mode" before issuing the last write request)

@wallento
Copy link
Member

Actually it will be a flag in the last packet that this needs to sync I suppose. Alternatively, we can just have a memory sync packet I think.

@wsong83
Copy link
Member

wsong83 commented Nov 25, 2016

@imphil Yes, my idea was to have two types of write. I think @wallento 's idea is to let debug MAM handle it internally because the long debug packet from HIM is broken into smaller packets to Chisel MAM. I think @wallento 's idea could work just fine. Then this is hidden from software.

Just find out that actually now the Chisel MAM does not run the TileLink with full throttle. It actually waits from an ack from TileLink before sending the next write. I think the idea was the internal buffer in Chisel MAM is able to hide this latency considering TileLink is much faster than Debug network. However, synced write still has delay overhead.

I think a single write_sync would be enough. If write_sync is high for a write request, the Chisel MAM can hold req.ready low until the whole write request is finished. I think I can have this done with couple of lines. It should be easy to do in debug MAM as well I think?

It could get complicated when MAM runs in full throttle, but let's leave it for the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants