- revision history
- preface
- 1. Introduction
- 2. programmers model
- 3. XTightlyCoupledIO subextensions
- 3.1. XTightlyCoupledIOsupp
- 3.2. XTightlyCoupledIOsuppbfi
- 3.3. XTightlyCoupledIOsuppbri
- 3.4. XTightlyCoupledIOaddi
- 3.5. XTightlyCoupledIOa
- 3.6. XTightlyCoupledIOad
- 3.7. XTightlyCoupledIObb
- 3.8. XTightlyCoupledIObbd
- 3.9. XTightlyCoupledIOsb
- 3.10. XTightlyCoupledIOsbd
- 3.11. XTightlyCoupledIObf
- 3.12. XTightlyCoupledIOsbbr
- 3.13. XTightlyCoupledIOfcvt
- 3.14. XTightlyCoupledIOcm
- 3.15. XTightlyCoupledIOsideOPdelay
- Appendix A: code samples
- A.1. stm32 GPIO output toggle
- A.2. stm32f0 minimum PLL clock init (assume reset state of registers, no other config)
- A.3. stm32f0 minimum PLL clock init (assume unknown or "worst case" state of registers)
- A.4. stm32f0 gpio + timer init for 7segment display (assume reset state of registers)
- A.5. stm32f0 7segment display interrupt handler
- A.6. c2000 workshop sample
- A.7. c2000 "This is very efficient; there is a one-to-one correlation between C and assembly"
- A.8. c2000 magic value less, whole register write
- A.9. c2000 preserving write 1 to clear bits
- A.10. c2000 32bit only peripherals
- A.11. stm32g4 buck converter (3p3z, voltage mode)
- Bibliography
Jan Oleksiewicz [email protected]
document version 3.2.50
extension status: unstable/PoC
This document is released under a Creative Commons Attribution 4.0 International License
Version | Change |
---|---|
v3.2.50 | trimmed non-functional changes from revision history, smaller font |
v3.2.0 | improved sideattach syntax, extra notes |
v3.1.0 | added |
v3.0.0 | rework of encodings, removed destructive shifts and |
v2.8.0 | added fence interop |
v2.7.0 | added relaxation section, supplementary instrs can also pseudoinstr |
v2.6.0 | added sideOPdelay subextension |
v2.5.0 | initial memory model |
v2.4.0 | added bitfield insert from immediate |
v2.3.0 | use |
v2.2.0 | added |
v2.1.0 | added |
v2.0.0 | major rework of encodings, the |
This document uses semantic versioning with respect to potential hardware designs. An assembly syntax change is a minor increment. Version 1.0.0 is the first publicly released version. Changes in prior versions are not versioned properly and are not tracked in the revision history. The major version number does not indicate freeze or ratification status.
The document is written to minimize duplication, as duplicated content is hard to maintain.
No attempt has been made yet to optimize instruction encodings, other than staying close to the canonical RISC-V encodings.
The spec can be donated (to a FOSS org?) if that allows it to undergo more comparative studies and proceed toward "standardization".
The scope of the XTightlyCoupledIO extension is to reduce code size and register pressure, and to increase the performance of peripheral accessing code, all of which results in reduced latency in control loops etc.
This spec was created solely because otherwise we would have to wait for a proprietary one.
As for proprietary extensions, they are usually:
- Done wrong, mainly because those specs are created on tight deadlines without community feedback (like the sorely missing instructions in XTheadBs)
- Not done at all (the most obvious and common approach)
- Those specs also almost never see the outside world, and if they do, they are very badly documented or not documented at all (let’s guess what custom instructions the ch32v003 or ch32v307 implements…)
- They also focus on GPIO too much, leaving out the most frequently used or most critical peripherals.
Note
|
In modern microcontroller codebases GPIO tends to be accessed less frequently than other peripherals, for a simple reason: when hardware peripherals are present, they no longer have to be bit-banged over GPIO as was done in the past. |
My observations of frequent peripheral access patterns are:
- only a single bit needs to be modified or branched on
- the register is written with a heavy (expensive to materialize) constant, including memory addresses
- the register is written with zero
- in specific cases like STM32 BSRR or flag clearing, a single bit or inverted single bit constant is used
- the register content comes directly from/goes directly to memory
- otherwise the content is used in/comes from computations
- the register content is immediately converted to float for computation
- small bitfields are extracted from or inserted into registers
Note
|
Also, the C/C++ volatile specifier prevents many possible compiler optimizations.
The "side effecting" accesses must follow what was written in the source code exactly, even though a
read + 2 single bit branches could actually be optimized into just two tio.bsb*.y instructions.
There is no way to tell whether the intent was to avoid side effects, to take a snapshot of status flags at a point in time,
or just to optimize for typical architectures.
|
avr8 provides 64 IO registers, each accessible by the in and out instructions, 32 of which are also available to the single bit instructions.
All registers are accessible through both the IO address space and the memory address space.
The single bit instructions are:
- sbi and cbi for setting and clearing IO bits
- sbis and sbic that skip one instruction if an IO bit is set/cleared
- sbrs and sbrc that skip one instruction if a bit in a general purpose register is set/cleared
There are also gpior registers that serve as scratch registers for e.g. global variables/flags.
Those have to be used explicitly in source code.
Let’s have a look at how efficiently the IO space is used:
- atmega8
  - 3 reserved registers in the bottom IO space
  - 8 non-bit registers in the bottom IO space
- atmega328p
  - the most used chip in Arduino, as well as the most cloned one
  - 15 reserved registers in the bottom IO space
  - 10 reserved registers in the upper IO space
  - many registers available only as memory mapped
- xmega
  - half of the bottom IO space is dedicated to GPIO (aka gpior) registers
  - the other half is taken by VPORTs that can be mapped to any configured GPIO port
  - the area between 0x1f and 0x30 is not populated at all
  - 0x30 to 0x3f is populated by "CPU"
  VPORTs have to be configured and used explicitly in source code.
- AVR-DA
  - one of the most recent avr8 families, from after the Microchip acquisition
  - similarly to xmega, there are only 7 virtual GPIO ports and 4 GPR (aka gpior) registers
  - the upper part is populated only by "CPU"
It is also worth mentioning that the avr8 architecture has not been licensed to third parties the way the 8051 was, even though it could offer better PPA [1] and ease of development than an average "1T" 8051. Today there are only a few Chinese clones of the atmega328p, made possible by expired patents.
Proprietary TI RISC architecture [3], popularized by the BeagleBone SBCs.
Only the GPIO pins are mapped to the r30 and r31 registers, though sometimes there is a mux on the r30/r31 interfaces with e.g. MII or shift registers [4] (5.2.2).
Special instructions for:
- set/clear bit
- branch if bit is set/cleared
Source and destination operands can independently address their bytes and half-words.
Proprietary TI accumulator-memory architecture [5], similar to the classic CISCs.
Peripherals can be accessed using indirect (XAR pointer registers) or DP addressing (16-bit + 6-bit offset from the instruction). It provides AMO-like ALU instructions as well as integer to float conversions.
The CLA can also convert to float directly from memory (including peripherals).
[6] claims 2 cycles for an ADC register to float conversion; Fig 4-3 claims a 3x cycle speedup over Cortex-M4 (stm32g4).
Uses exactly the same code as memory mapped IO, but the loads and stores execute in 1 cycle instead of 2.
Referred to as a programmable state machine, able to emulate serial and parallel peripherals over GPIO, with a very limited instruction set.
It assumes a cycle accurate, single cycle microarchitecture.
It has an optional "side-set" operation and a delay, which stalls execution of the following instruction.
8051 dedicates half of the IRAM address space (aka zero page) to IO SFRs. SFRs are not accessible through indirect addressing, as it targets the "hidden" SRAM instead.
The 0x20-0x2F memory range is bit-addressable.
The "vertical" SFR registers at addresses divisible by 8 (0x80, 0x88, 0x90…) are bit-addressable.
Some of them are occupied by (mandatory) standard SFRs, including the accumulator A and the less useful B.
Bit-addressable registers can be operated on by special irregular instructions:
- set/clear/complement bit
- jump if bit is set/clear
- jump if bit is set, then clear it
- mov between a bit and the carry flag
- and/or of a bit (or its inverse) into the carry flag
x86 offers a 16-bit IO address space accessible by the in* and out* instructions [15].
Some legacy peripherals sit at fixed IO addresses; the rest are typically remappable.
It was originally designed for 8080/8086 peripherals hanging on an off-chip bus, and thus not tightly integrated.
Today it serves as legacy ballast, as the address space is no longer constrained and the code size gains are negligible,
especially considering that the offending peripherals typically use MMIO mappings instead anyway.
The IO ports can be 1, 4, 8, 16, or 32 bits wide.
They are buffered by shift registers, clocked by a timer or an external clock,
and accessible by the in and out instructions (including partial and shifting variants).
The original 56000 [18] architecture offers an IO address space that can be
accessed with a 6-bit immediate addressing mode ("6-bit I/O Short Address"),
provided by the following instructions:
- jump if bit is set/clear
- jump to subroutine if bit is set/clear
- bit test (and set/clear/change) instructions (updating the carry flag)
Later versions (e.g. 56800) [19] extended the single bit into a bitmask match,
where all of the selected bits must be set or cleared to satisfy the condition.
Masks in branching instructions are limited to 8 bits, targeting the top or bottom byte.
Available on RVE only and limited to 16 GPR-mapped registers. Allows reusing a major part of the microarchitectural pipeline, as well as the standard RISC-V instructions operating on GPRs.
The csrr* instructions implement atomic swap and immediate bitmask set/clear operations.
However, CSRs are generally used to modify core architectural behaviour and thus perform slower than expected.
Note
|
for this reason the RISC-V V spec forbids writes to vtype and vl by anything other than the vsetvl* instructions
|
Note
|
xpulp extension is also planning on disallowing writes to hwloop registers with general csr instructions |
Optionally implemented by Cortex-M3 and Cortex-M4; not available on Cortex-M0 and Cortex-M7.
Still requires loading the base address of the bit-banded bit.
Must be used explicitly in source code.
Special kind of write-only registers, e.g. BSRR/IFCR found in STM32 and clones.
They still require loading the peripheral base address, and also require generating
preformatted (shifted) constants even if only a single bit is written.
Note
|
BSRR is still useful for tio.mv access, as it can work on non-contiguous bitfields
or on content from pre-generated lookup tables [7]
|
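A minimal C model of why a single BSRR word pairs well with one tio.mv, assuming the usual STM32 semantics (bits 15:0 set ODR bits, bits 31:16 reset them, set wins if both are requested). The helper names are hypothetical; the point is that one precomputed word drives a non-contiguous pin group without disturbing other pins.

```c
#include <stdint.h>

/* Assumed STM32-style BSRR behaviour: low half sets, high half resets,
   set has priority when both bits of a pin are written. */
static uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}

/* Build the BSRR word that drives the pins in `mask` (possibly non-contiguous)
   to `pattern`; such words can be precomputed into a lookup table. */
static uint32_t bsrr_word(uint16_t mask, uint16_t pattern)
{
    uint32_t set   = (uint32_t)(pattern & mask);
    uint32_t reset = (uint32_t)((uint16_t)~pattern & mask);
    return set | (reset << 16);
}
```

For example, driving pins 0-3 and 8-11 (mask 0x0F0F) to 0x0A05 yields the word 0x050A0A05, and pins outside the mask keep their previous state.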
Similar to the TI PRU approach.
Only a few registers can be reserved like that, and it takes general purpose registers out of use, leading to less efficient code. Some assembly code would have to be rewritten to avoid the now reserved registers.
Note
|
ABI deviations are not standardized at this moment |
The A extension has limited availability across embedded cores.
It still requires loading the base address, which must be generated with a full lui
+ addi sequence, as there is no immediate offset
like in regular load/store instructions.
Only swap/add/or/and/xor/min/max operations are implemented.
Note
|
still available with the first alternative approach, as well as with the ABI deviations one |
Useful to directly store or load IO content to/from memory without processing. It is also non-deterministic and can trap due to e.g. alignment or PMP restrictions, violating the atomicity guarantee (or requiring expensive workarounds). Such instructions would also consume a lot of encoding space.
Useful for fixed point arithmetic scaling etc.
Sometimes multi-cycle and non-deterministic.
Even single cycle implementations are potentially problematic, as the multiplier can span more pipeline stages than regular ALUs.
In the presence of P or another custom DSP extension, it would be necessary to provide
IO versions of the myriad multiply-accumulate instructions.
Otherwise tio.mul + add wouldn’t provide any benefit over a tio.mv + dsp.macc
sequence.
Note
|
if mulh is necessary, tio.mul becomes useless
|
Note
|
a P-extension-like tio.mull.xy with a destination register pair should still be possible
|
IIR and FIR filters need to cache the raw ADC readings, effectively enforcing the use
of tio.mv instead of directly sourced multiplications (or MACs).
Note
|
Typical control loop IIR/FIR filters are designed to accept raw ADC readings. |
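A minimal C sketch of the caching requirement, assuming a plain direct-form FIR with an integer circular delay line (length, types and names are illustrative, not part of this spec): each raw reading is stored once and then reused by the next TAPS-1 outputs, so a multiply sourced directly from the IO register could only serve the newest tap.

```c
#include <stdint.h>

#define TAPS 4  /* illustrative filter length */

typedef struct {
    int32_t  hist[TAPS]; /* delay line of cached raw ADC samples */
    unsigned pos;        /* slot that receives the next sample */
} fir_state;

/* One direct-form FIR step: y[n] = sum over k of coef[k] * x[n-k]. */
static int32_t fir_step(fir_state *s, const int32_t coef[TAPS], int32_t raw)
{
    s->hist[s->pos] = raw;  /* the single IO read (e.g. tio.mv) lands here */
    int64_t  acc = 0;
    unsigned idx = s->pos;
    for (unsigned k = 0; k < TAPS; k++) {
        acc += (int64_t)coef[k] * s->hist[idx];  /* older samples come from the cache */
        idx = (idx == 0u) ? TAPS - 1u : idx - 1u;
    }
    s->pos = (s->pos + 1u) % TAPS;
    return (int32_t)acc;
}
```

With all coefficients set to 1 and inputs 1, 2, 3, 4, 5 the outputs are 1, 3, 6, 10, 14: every sample contributes to four outputs, which is only possible because it was cached.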
Note
|
ADCs can usually be configured to sign extend their outputs (e.g. 12 → 16 bits). tio.sbfextracti
could be used to perform such sign extension without relying on that ADC feature.
|
Too few use cases to be worth it.
A constant with a single cleared bit among the bottom 11 bits (pos 0 to 10) can be written with a single instruction, as ~(1<<pos) still fits the signed 12-bit immediate:
tio.addi iod, zero, (~(1<<pos))
Otherwise we can achieve this in 2 instructions:
lui t0, %hi(~(1<<pos)) // 'c.' if bits 16-12 zeroed
tio.addi iod, t0, %lo(~(1<<pos))
or
c.li t0, -1
tio.bclri iod, t0, pos
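A small C check of the 11-bit limit, assuming RV32 and the usual signed 12-bit immediate range of -2048..2047 (function names are hypothetical): ~(1<<pos) stays within that range only for pos 0 to 10, otherwise one of the 2-instruction sequences is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a 32-bit pattern as signed without implementation-defined casts. */
static int64_t as_signed32(uint32_t u)
{
    return u >= 0x80000000u ? (int64_t)u - 0x100000000LL : (int64_t)u;
}

/* Instructions needed to write ~(1 << pos) into an IO register:
   1 if it fits the tio.addi immediate (tio.addi iod, zero, imm), else 2. */
static int clear_bit_const_insns(unsigned pos)
{
    assert(pos < 32);
    int64_t v = as_signed32(~(1u << pos));
    return (v >= -2048 && v <= 2047) ? 1 : 2;
}
```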
Few use cases for independent IO to IO moves/ops.
Less implementation flexibility, as non-destructive ops cannot provide AMO-like decoupled execution.
Note
|
Destructive encodings are also justified by the bitfield insert instructions, which are only possible with a destructive encoding. |
Note
|
P extension is about to introduce instructions with destructive rd encodings,
including IFMA, designated for DSP tasks of the same domain as targeted by XTightlyCoupledIO
|
Requires a 4-instruction sequence to insert a constant. Consider the following sample:
// switch PLL (0b10) to HSE (0b01)
RCC->CFGR = (RCC->CFGR & ~RCC_CFGR_SW_Msk) | (RCC_CFGR_SW_HSE);
using bfp:
li t1, RCC_CFGR_SW_HSE
addi t0, zero, {length[3:0], offset[7:0]}
pack t0, t1, t0
bfp a0, a0, t0
Note
|
the samples below cannot be performed directly on an IO SFR (they require caching the intermediate result) |
In the best case it can be done in 2 instructions:
andi a0, a0, ~RCC_CFGR_SW_Msk
ori a0, a0, RCC_CFGR_SW_HSE
or, in the scenario considered here:
bseti a0, a0, RCC_CFGR_SW_Pos
bclri a0, a0, RCC_CFGR_SW_Pos+1
Alternatively, a more general sequence (4-6 instructions):
li a1, RCC_CFGR_SW_Msk // non inverted can be a single lui
andn a0, a0, a1 // use ~RCC_CFGR_SW_Msk for and, when Zbb is missing
li a1, RCC_CFGR_SW_HSE
or a0, a0, a1
Note
|
Can use bseti or bclri to cover a single bit in a field and avoid loading constants.
|
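A quick C check, under assumed field values matching the STM32 CMSIS headers (SW at bits 1:0, HSE = 0b01, PLL = 0b10), that the bseti/bclri pair gives the same result as the andi/ori pair: both bits of the 2-bit SW field are written explicitly, so the prior field value doesn't matter.

```c
#include <stdint.h>

/* Assumed CMSIS-style definitions for the 2-bit SW field. */
#define RCC_CFGR_SW_Pos 0u
#define RCC_CFGR_SW_Msk (0x3u << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SW_HSE (0x1u << RCC_CFGR_SW_Pos)
#define RCC_CFGR_SW_PLL (0x2u << RCC_CFGR_SW_Pos)

/* andi + ori: clear the whole field, then OR in the new value */
static uint32_t sw_andi_ori(uint32_t cfgr)
{
    return (cfgr & ~RCC_CFGR_SW_Msk) | RCC_CFGR_SW_HSE;
}

/* bseti + bclri: set bit 0 and clear bit 1 of the field individually */
static uint32_t sw_bseti_bclri(uint32_t cfgr)
{
    cfgr |= 1u << RCC_CFGR_SW_Pos;
    cfgr &= ~(1u << (RCC_CFGR_SW_Pos + 1u));
    return cfgr;
}
```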
In [8], bfp
didn’t yield enough improvement.
It would be more efficient if the offset and length of the field could be given as immediate values, so that the preparatory setup steps aren’t needed.
The XTightlyCoupledIO extension adds 4 banks of 32 XLEN-sized IO registers each.
The IO registers are referenced from the rs1 or rd field, named ios1 and iod respectively.
If a given bank is not populated, the corresponding instructions are reserved.
IO-targeting instructions must execute atomically; they cannot be interrupted with visible side effects.
Note
|
the number of banks and their availability in certain instructions were decided totally arbitrarily and will be refined later |
For easier mapping to high level languages, any access to IO registers causes side effects as if the entire XLEN sized word was accessed.
A partial modification triggers side effects as if the entire XLEN sized word was read, modified and written back.
GPIOA->OUT |= (1<<13);
//is equivalent to
tio.bseti io123, 13
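A minimal C model of the side-effect rule, assuming a simulated 32-bit IO register that counts full-word accesses (the sim_io_reg type and its helpers are hypothetical): tio.bseti behaves like exactly one full-word read followed by one full-word write, which is what the C statement above does on a volatile register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulated IO register that counts full-word accesses. */
typedef struct {
    uint32_t value;
    unsigned reads;
    unsigned writes;
} sim_io_reg;

static uint32_t io_read(sim_io_reg *r)              { r->reads++;  return r->value; }
static void     io_write(sim_io_reg *r, uint32_t v) { r->writes++; r->value = v; }

/* Model of tio.bseti: side effects of one full-word read and one full-word write. */
static void model_tio_bseti(sim_io_reg *r, unsigned bit)
{
    assert(bit < 32);
    uint32_t v = io_read(r);
    io_write(r, v | (1u << bit));
}
```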
For more efficient use of IO register space available by certain instructions.
Not reflecting actual memory mapped registers.
Accesses to IO registers by tio.
instructions follow the TSO memory model with respect to each other.
Repeated accesses to the same IO register are sequentially consistent.
Note
|
TSO model is the best fit for typical in-order pipelines longer than 2-3 stages |
Note
|
implementations cannot reuse operand forwarding to solve RAW hazards of IO registers
due to volatile rules
|
Synchronization with (idempotent) memory accesses requires an explicit FENCE.
Accesses to IO registers through tio.
instructions and through the memory mapped interface are not synchronized.
Note
|
it would be too expensive to sync read-ALU-writeback stages with memory interface |
Note
|
implementations are still free to microcode tio. instructions using memory load and store
|
The fence instruction orders accesses of tio.
instructions using the PI/PO/SI/SO fields.
An RMW operation is treated as a combined read and write.
fence must also properly order tio.
accesses with respect to memory mapped IO accesses that use the same PI/PO/SI/SO fields.
Note
|
it was decided to not extend fence instruction, due to limited use cases
|
For efficient use of the tio instructions (or for them to be used at all), compilers
need to automatically translate accesses to memory mapped registers into the IO address space.
In the case of avr8, the IO address space was mapped linearly to a specific offset in the data address space (+0x20).
In the case of Arm or RISC-V, the peripherals are scattered over a large memory area with a minimum spacing of 1024 bytes. Because of this, there needs to be a special mapping into the IO address space, and we risk ending up with thousands of outdated (sometimes GPL-violating) custom toolchain builds, one for each of those mappings, as is already happening with interrupt controllers (e.g. WCH hardware stacking).
Therefore we need a unified file format describing the peripheral to IO mapping, provided by vendors. It will be passed on the compiler command line, similarly to source code or linker scripts.
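The file format itself is not defined yet. As a rough sketch of the information it needs to carry, assuming a simple table from memory mapped register address to linearized IO register number (all entries are hypothetical), a compiler would look up each volatile access and fall back to a regular load/store when the register is not remapped:

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical peripheral to IO mapping table. */
typedef struct {
    uint32_t mmio_addr; /* memory mapped address of the register */
    uint8_t  io_linear; /* linearized IO register number (bank * 32 + index) */
} io_map_entry;

/* Illustrative entries only; a real table would come from the vendor file. */
static const io_map_entry io_map[] = {
    { 0x48000014u, 123 },
    { 0x40013804u,  40 },
};

/* Linearized IO number, or -1 if the register is not remapped into IO space. */
static int io_lookup(uint32_t addr)
{
    for (size_t i = 0; i < sizeof io_map / sizeof io_map[0]; i++)
        if (io_map[i].mmio_addr == addr)
            return io_map[i].io_linear;
    return -1;
}
```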
Note
|
Those mapping files can also be self-made in the case of "typical chinese vendors" |
Note
|
Those files could be used to provide named aliases in debuggers/decompilers |
Note
|
it is recommended not to map registers linearly one after the other, but to split them into appropriate banks, e.g. a read/write data register doesn’t need to live in a bit-operable bank. |
Even though compilers can automatically remap accesses in compiled code, assembly has to explicitly use the dedicated IO instructions, leading to unportable code.
Note
|
in theory, load/store with an absolute addressing mode could be relaxed
into in and out instructions, but RISC-V doesn’t have absolute addressing like avr8
|
In the avr world, portability of IO-accessing assembly code was handled like this:
#if defined(atmega1234)||defined(atmega12345)
#define RDR_REGISTER_IN_IO
#define CONTROL1_REGISTER_IN_IO
#define CONTROL1_REGISTER_IN_LOWER_IO
#elif defined(atmega123456)
//...
#endif
Then the actual code is appropriately spammed with #ifdef’s.
As can be seen, each new device has to be added to the config header manually.
Therefore we need a way to discover whether a given peripheral register is remapped into the IO space, and to use this information in e.g. #ifdefs.
Note
|
assembly will stay messy anyway, especially when the number of used registers needs to be kept low in default inline interrupts |
Apart from the peripherals, the IO address space can hold avr8-like scratch registers, which can be used to store global variables/flags.
They can be:
- used explicitly like in avr8
  - highly unportable
  - falls into the "premature optimization" category
  - how many avr projects using gpior (aka GPIO aka GPR) did you see so far?
- automatically mapped to global variables/flags
  - allows those scratch regs to actually be used
  - no longer relaxable to gp-rel load/stores
- used with an explicit attribute, e.g. __attribute__((mapto_ioscratch("bsb_accessible,bool_mergable,1cycle")))
  - useful for critical control loop globals
  - can override the default cost function of the above option
  - the variable is not forced into a scratch register if specific criteria are not met
  - no longer relaxable to gp-rel load/stores
It is possible to have SFRs that are not mapped into the memory address space and are accessed through e.g. a special
__attribute__, but this prevents the use of pointers to such peripherals.
Pointers are often used to avoid code duplication and the resulting size increase [9]
(even with respect to tio access, in some scenarios). They are also commonly used in various HALs.
Compilers could theoretically track and translate pointer usage, but it would eventually
lead to highly inefficient code in corner or even regular cases.
Note
|
still suitable for a dedicated IO slave cores. |
All IO accessing instructions are prefixed with tio.
The bank number is part of the instruction name, except for supplementary instructions.
The suffix denotes whether the rd or rs1 field targets IO registers.
It takes the form tio.instr{n}.{rdm}{rsm}, where {n} is the bank number
and {rdm} and {rsm} are each substituted with one of the following letters:
- x - integer reg
- s - floating point reg
- y - io reg
Register specifiers use the same letter.
tio.bseti3.y y11, 13 // set bit 13 in io 11 register in bank 3
tio.bseti2.yx y22, zero, 17 // write (1<<17) to io 22 register in bank 2
Note
|
letter y was picked totally arbitrarily, as it’s a single letter and has no conflicts |
Pseudoinstructions are tio instructions referred to without the bank number and suffix.
They use the io name prefix as the register specifier, with
linearized addressing.
The supplementary instructions with omitted suffix are also considered as pseudoinstructions.
tio.bseti io107, 13 // set bit 13 in io 11 register in bank 3
tio.bseti io86, zero, 17 // write (1<<17) to io 22 register in bank 2
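A small C sketch of the linearized numbering, assuming the 4 banks of 32 registers described earlier (helper names are hypothetical): io<N> = bank * 32 + register index, so io107 is register 11 in bank 3 and io86 is register 22 in bank 2.

```c
#include <assert.h>

/* Linearized IO register number used by the pseudoinstructions. */
static unsigned io_linear(unsigned bank, unsigned reg)
{
    assert(bank < 4 && reg < 32);
    return bank * 32u + reg;
}

static unsigned io_bank(unsigned linear)  { return linear / 32u; }
static unsigned io_index(unsigned linear) { return linear % 32u; }
```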
The following instructions are designated as the canonical IO move instructions:
tio.add{n}.yx iod, zero, rs2
tio.add{n}.xy rd, ios1, zero
They are available under the tio.mv name, in both suffixed and linearized versions.
Note
|
The canonical move in base RISC-V is addi, but because of
limited encoding space, tio.addi cannot be provided in all necessary forms.
Therefore an alternative instruction was picked.
|
Note
|
tio.add was picked because addition is one of the most common
operations and the adder tends to be the most widely available ALU, e.g. cortex-m7
doesn’t provide bitwise and/or/xor in its early ALU
|
Note
|
moves to/from IO registers are not named in and out,
as I find those names confusing
|
Only the pseudo instructions are allowed to be relaxed into a different instruction, be it compressed or different one of the same size.
Note
|
BTW, this is how it should be done with base RISC-V instructions,
where e.g. i.add a0, a0, a1 must always emit exactly the specified encoding,
while add a0, a0, a1 can be relaxed to a compressed instruction or a different one
(e.g. bseti a0, a1, 10 can be turned into ori a0, a1, (1<<10) for assumed
better execution unit availability; bit 11 would no longer fit the ori immediate).
For now we only have the unreliable and bloaty .option norvc + .option norelax workaround.
|
The sideOP value can optionally be encoded by a value placed in square brackets
after the last instruction param, separated by a comma if there is at least one param.
If omitted, the value 0 is encoded.
If an extension chooses to use a syntax other than a plain uimm[4:0] constant,
it must be placed within the square brackets.
If the square brackets contain a single number, it must always be interpreted as a uimm[4:0]
constant.
- usage
1: tio.bseti GPIOA_ODR, 13
2: tio.bseti GPIOA_ODR, 13, [0] // equivalent to 1
3: tio.bseti GPIOA_ODR, 13, [31]
4: tio.bseti GPIOA_ODR, 13, [sideset 0b10, 7] // imaginary extension
Note
|
Square brackets were selected because the MIPS syntax inherited by RISC-V doesn’t use them. |
Note
|
pioasm uses them for delay only, not separated by a comma from the rest of the instruction params. |
When the iom bit is present, it controls whether rd or rs1 targets an IO register.
When high, the rd field targets the IO register. When low, the rs1 field targets the IO register.
The bsel immediate selects the accessed bank number. Bits missing from encodings are implied to be zero.
sideOP encodes a side operation that will be part of another extension. Otherwise this field is reserved
and must be set to 0b00000 (no extra operation).
The name XTightlyCoupledIO can be used as a catch-all for the following extensions.
Supplementary instructions, useful for the alternative upper GPR approach.
Necessary when working on "cached" IO register content, as IO registers cannot be
accessed multiple times due to volatile rules.
Note
|
also useful in non-IO code. |
- Synopsis
Branch if single bit in register is set (immediate)
- Mnemonic
tio.bsbseti.x rs1, shamt, label
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] }, { bits: 5, name: 'imm[4:1|11]' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'rs1' }, { bits: 5, name: 'shamt' }, { bits: 7, name: 'imm[12|10:5]' }, ]}
Note
|
instruction proposed as Zce 32bit candidate |
Note
|
only bottom 32 bits of target register are accessible on rv64 |
- Synopsis
Branch if single bit in register is cleared (immediate)
- Mnemonic
tio.bsbclri.x rs1, shamt, label
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] }, { bits: 5, name: 'imm[4:1|11]' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'rs1' }, { bits: 5, name: 'shamt' }, { bits: 7, name: 'imm[12|10:5]' }, ]}
Note
|
instruction proposed as Zce 32bit candidate |
Note
|
only bottom 32 bits of target register are accessible on rv64 |
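A C reference sketch of both branch conditions and of the encoding diagrams above, assuming the standard B-type offset layout implied by the imm[4:1|11] and imm[12|10:5] fields (function names are hypothetical). Because shamt is 5 bits wide, only bits 31:0 are testable even when rs1 is 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool bsbseti_taken(uint64_t rs1, unsigned shamt)
{
    assert(shamt < 32);                 /* 5-bit shamt: bottom 32 bits only */
    return ((rs1 >> shamt) & 1u) != 0;
}

static bool bsbclri_taken(uint64_t rs1, unsigned shamt)
{
    return !bsbseti_taken(rs1, shamt);
}

/* funct3 = 0x2 (bsbseti) or 0x3 (bsbclri), shamt in the rs2 slot, CUSTOM-3 opcode. */
static uint32_t enc_bsb(unsigned funct3, unsigned rs1, unsigned shamt, int32_t offset)
{
    assert(offset % 2 == 0 && offset >= -4096 && offset <= 4094);
    uint32_t imm = (uint32_t)offset;
    return (((imm >> 12) & 0x1u)  << 31) |
           (((imm >> 5)  & 0x3Fu) << 25) |
           ((shamt & 0x1Fu)       << 20) |
           ((rs1 & 0x1Fu)         << 15) |
           ((funct3 & 0x7u)       << 12) |
           (((imm >> 1)  & 0xFu)  << 8)  |
           (((imm >> 11) & 0x1u)  << 7)  |
           0x7Bu;
}
```

Under these assumptions, tio.bsbseti.x a0, 13, with a +8 branch offset encodes as 0x00D5247B.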
- Synopsis
extract bitfield from register (immediate)
- Mnemonic
tio.bfextracti.xx rd, rs1, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'rs1' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 0x0 }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'rs1' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
Note
|
instruction is equivalent to slli + srli sequence
|
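A C model of the note above, written literally as the slli + srli pair on RV32. The draft doesn't yet say what len = 0 means in the 5-bit field, so the sketch assumes 1 <= len and offset + len <= 32.

```c
#include <assert.h>
#include <stdint.h>

/* tio.bfextracti.xx rd, rs1, offset, len modelled as slli + srli (RV32). */
static uint32_t bfextracti(uint32_t rs1, unsigned offset, unsigned len)
{
    assert(len >= 1 && offset + len <= 32);
    uint32_t t = rs1 << (32u - offset - len); /* slli: drop the bits above the field */
    return t >> (32u - len);                  /* srli: move the field to bit 0, zero filled */
}
```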
- Synopsis
extract and sign extend bitfield from register (immediate)
- Mnemonic
tio.sbfextracti.xx rd, rs1, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'rs1' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 0x0 }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'rs1' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
Note
|
instruction is equivalent to slli + srai sequence
|
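A C model giving the same result as slli + srai on RV32, under the same len >= 1 assumption, written without right-shifting negative values (implementation-defined in C). The 12-bit ADC case mentioned earlier corresponds to offset = 0, len = 12.

```c
#include <assert.h>
#include <stdint.h>

/* tio.sbfextracti.xx rd, rs1, offset, len: extract, then sign extend from bit len-1. */
static int32_t sbfextracti(uint32_t rs1, unsigned offset, unsigned len)
{
    assert(len >= 1 && offset + len <= 32);
    uint32_t field = (rs1 << (32u - offset - len)) >> (32u - len);
    int64_t v = (int64_t)field;
    if (field & (1u << (len - 1u)))   /* sign bit of the field is set */
        v -= (int64_t)1 << len;
    return (int32_t)v;
}
```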
Supplementary bitfield insert, useful for the alternative upper GPR approach.
Necessary when working on "cached" IO register content, as IO registers cannot be
accessed multiple times due to volatile rules.
- Synopsis
Destructive bitfield insert into register (immediate)
- Mnemonic
tio.bfinserti.xx rd, rs1, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x0 }, { bits: 5, name: 'rs1' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 0x0 }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x0 }, { bits: 5, name: 'rs1' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
Note
|
due to encoding constraints only destructive form is provided |
Note
|
instruction was proposed for P extension as there are many more rd destructive ones |
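The register form has no description yet; a minimal C sketch, assuming the usual bitfield-insert meaning (the low len bits of rs1 replace rd[offset+len-1:offset], all other rd bits are kept) and 1 <= len, offset + len <= 32 on RV32:

```c
#include <assert.h>
#include <stdint.h>

/* Mask of `len` ones starting at bit `offset`. */
static uint32_t field_mask(unsigned offset, unsigned len)
{
    assert(len >= 1 && offset + len <= 32);
    uint32_t ones = (len == 32u) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    return ones << offset;
}

/* Assumed semantics of tio.bfinserti.xx rd, rs1, offset, len (destructive on rd). */
static uint32_t bfinserti_xx(uint32_t rd, uint32_t rs1, unsigned offset, unsigned len)
{
    uint32_t m = field_mask(offset, len);
    return (rd & ~m) | ((rs1 << offset) & m);
}
```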
- Synopsis
Destructive bitfield insert into register from immediate (immediate)
- Mnemonic
tio.bfinserti.xi rd, uimm, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'uimm[4:0]' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 0x0 }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'uimm[4:0]' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
- Description
Insert len bits of the expanded uimm[4:0] constant into the rd register at the offset
position. The value uimm=0 is mapped to the constant -1.
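A C sketch of the description, assuming "expanded" means zero extended (with uimm = 0 standing for all ones) and 1 <= len, offset + len <= 32 on RV32. For the RCC example earlier, switching SW from PLL (0b10) to HSE (0b01) on a cached value would be a single tio.bfinserti.xi rd, 1, 0, 2 under these semantics.

```c
#include <assert.h>
#include <stdint.h>

/* tio.bfinserti.xi rd, uimm, offset, len: insert the low `len` bits of the
   expanded constant (uimm zero extended, uimm = 0 mapped to -1) into rd. */
static uint32_t bfinserti_xi(uint32_t rd, unsigned uimm, unsigned offset, unsigned len)
{
    assert(uimm < 32 && len >= 1 && offset + len <= 32);
    uint32_t c    = (uimm == 0u) ? 0xFFFFFFFFu : (uint32_t)uimm;
    uint32_t ones = (len == 32u) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    uint32_t m    = ones << offset;
    return (rd & ~m) | ((c << offset) & m);
}
```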
Note
|
due to encoding constraints only destructive form is provided |
Supplementary instructions for branching against an immediate.
Necessary for branching on an exact pattern match of extracted bitfields.
Note
|
xpulp does signed immediate in rs2 position, meanwhile Zce v0.50 puts nzuimm in rs1 position |
Note
|
uimm=0 can be expressed with beq/bne zero, rs2, label, therefore this case can
be reserved or mapped to another constant
|
Note
|
uimm from rs1 position was selected as it is already used by csrr*i as well as vsetivli instructions
|
Note
|
also useful for lowering general code size and register pressure (e.g. for rv32e or IPRA compilation) |
- Synopsis
Branch if equal (immediate)
- Mnemonic
tio.beqi.xi rs2, uimm, label
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x63, attr: ['BRANCH'] }, { bits: 5, name: 'imm[4:1|11]' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'uimm[4:0]' }, { bits: 5, name: 'rs2' }, { bits: 7, name: 'imm[12|10:5]' }, ]}
- Description
Branch to label if rs2 content is equal to the expanded uimm[4:0] constant. The value uimm=0
is mapped to the constant -1.
- Synopsis
Branch if not equal (immediate)
- Mnemonic
tio.bnei.xi rs2, uimm, label
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x63, attr: ['BRANCH'] }, { bits: 5, name: 'imm[4:1|11]' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'uimm[4:0]' }, { bits: 5, name: 'rs2' }, { bits: 7, name: 'imm[12|10:5]' }, ]}
- Description
Branch to label if rs2 content is not equal to the expanded uimm[4:0] constant. The value uimm=0
is mapped to the constant -1.
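A C sketch of the two branch conditions on RV32, assuming "expanded" means zero extended for uimm 1..31, with uimm = 0 standing for -1 (helper names are hypothetical). This is what allows e.g. an extracted bitfield to be matched against a small pattern without loading the constant into a register first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Expanded uimm[4:0] constant: 1..31 as is, 0 mapped to -1. */
static int32_t expand_uimm5(unsigned uimm)
{
    assert(uimm < 32);
    return uimm == 0u ? -1 : (int32_t)uimm;
}

static bool beqi_taken(int32_t rs2, unsigned uimm) { return rs2 == expand_uimm5(uimm); }
static bool bnei_taken(int32_t rs2, unsigned uimm) { return rs2 != expand_uimm5(uimm); }
```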
A single IO addi instruction, provided for minimal implementations.
- Synopsis
Add immediate and write to io register
- Mnemonic
tio.addi{bsel}.yx iod, rs1, imm
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 2, name: 0x0 }, { bits: 1, name: 'bsel' }, { bits: 5, name: 'rs1' }, { bits: 12, name: 'imm[11:0]' }, ]}
Note
|
lui + tio.addi pair can be used to write any 32-bit constant into an IO register.
|
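A C sketch of how a constant is split between the two instructions, assuming RV32 and the usual %hi/%lo convention of lui + addi (helper names are hypothetical): since tio.addi sign extends its 12-bit immediate, the upper part is rounded up whenever bit 11 of the constant is set.

```c
#include <stdint.h>

/* Split `value` into the lui upper 20 bits and the signed 12-bit tio.addi immediate. */
static void split_hi_lo(uint32_t value, uint32_t *hi20, int32_t *lo12)
{
    int32_t lo = (int32_t)(value & 0xFFFu);  /* 0..4095 */
    if (lo >= 2048)
        lo -= 4096;                          /* sign extend to -2048..2047 */
    *lo12 = lo;
    *hi20 = (value - (uint32_t)lo) >> 12;    /* (uint32_t)lo wraps for negative lo */
}

/* Value written into the IO register by lui + tio.addi (RV32). */
static uint32_t lui_addi(uint32_t hi20, int32_t lo12)
{
    return (hi20 << 12) + (uint32_t)lo12;
}
```

For 0xDEADBEEF this gives hi20 = 0xDEADC and lo12 = -273; for 0xFFFFFFFF the lui part is 0, so a single tio.addi with -1 is enough.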
General IO ALU instructions
- Mnemonic
tio.add{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x0 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.sub{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x1 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.and{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x2 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.or{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x3 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.xor{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x4 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.slli{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x3 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x3 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.srli{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x4 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x4 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.srai{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x5 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x5 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.sll{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x3 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.srl{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x4 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.sra{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x5 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
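The three-operand encodings above differ only in funct4 (bits 28:25), so one routine can pack all of them. The following is a minimal C sketch. `tio_encode_alu` and the enum names are hypothetical, and the field positions are read directly off the diagrams (0x0 is the funct4 of the first diagram, tio.add). `iom` and `bsel` are passed through as raw fields, because their meaning is defined in the programmer's model.

```c
#include <assert.h>
#include <stdint.h>

/* funct4 values from the general IO alu and bitmanip diagrams */
enum tio_alu_op {
    TIO_ADD = 0x0, TIO_SUB = 0x1, TIO_AND = 0x2, TIO_OR = 0x3, TIO_XOR = 0x4,
    TIO_ANDN = 0x5, TIO_ORN = 0x6, TIO_XNOR = 0x7,
    TIO_MIN = 0x8, TIO_MINU = 0x9, TIO_MAX = 0xa, TIO_MAXU = 0xb, TIO_REV8 = 0xc
};

/* Hypothetical encoder. Field layout from bit 0:
 * opcode[6:0]=0x2b, rd/iod[11:7], funct3[14:12]=0x1, rs1/ios1[19:15],
 * rs2[24:20], funct4[28:25], iom[29], bsel[31:30] */
static uint32_t tio_encode_alu(enum tio_alu_op op, unsigned iom, unsigned bsel,
                               unsigned rd_iod, unsigned rs1_ios1, unsigned rs2)
{
    assert(rd_iod < 32u && rs1_ios1 < 32u && rs2 < 32u);
    assert(iom < 2u && bsel < 4u && (unsigned)op < 16u);
    return 0x2bu                          /* CUSTOM-1 opcode */
         | (uint32_t)rd_iod   << 7
         | 0x1u               << 12       /* funct3 */
         | (uint32_t)rs1_ios1 << 15
         | (uint32_t)rs2      << 20
         | (uint32_t)op       << 25       /* funct4 */
         | (uint32_t)iom      << 29
         | (uint32_t)bsel     << 30;
}
```

For example, `tio_encode_alu(TIO_XOR, 1, 3, 5, 6, 7)` packs to 0xE87312AB.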
Destructive general IO alu instructions
- Mnemonic
tio.add{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x0 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.sub{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x1 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.and{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x2 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.or{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x3 }, { bits: 2, name: 'bsel' }, ]}
General IO bitmanip instructions
- Mnemonic
tio.andn{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x5 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.orn{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x6 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.xnor{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x7 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.min{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x8 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.minu{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0x9 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.max{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0xa }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.maxu{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0xb }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.rev8{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x1 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 4, name: 0xc }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
Destructive general IO bitmanip instructions
- Mnemonic
tio.andn{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x5 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.orn{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x6 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.xnor{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x7 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.min{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x8 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.minu{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0x9 }, { bits: 2, name: 'bsel' }, ]}
- Mnemonic
tio.max{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 5, name: 0xa }, { bits: 2, name: 'bsel' }, ]}
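Below is a behavioural sketch of the destructive forms, `iod = iod op rs2`, on RV32. It assumes each operation keeps the base ISA / Zbb meaning of the same mnemonic (for example, andn computes iod & ~rs2). The IO register is modelled as a plain value, and the function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum tio_dop { DOP_ADD, DOP_SUB, DOP_AND, DOP_OR, DOP_ANDN, DOP_ORN, DOP_XNOR,
               DOP_MIN, DOP_MINU, DOP_MAX };

/* Returns the new IO register value for iod = iod OP rs2 (RV32, assumed Zbb semantics) */
static uint32_t tio_destructive(enum tio_dop op, uint32_t iod, uint32_t rs2)
{
    /* flipping the sign bit maps signed ordering onto unsigned ordering */
    const uint32_t bias = 0x80000000u;
    switch (op) {
    case DOP_ADD:  return iod + rs2;      /* wraps modulo 2^32 */
    case DOP_SUB:  return iod - rs2;
    case DOP_AND:  return iod & rs2;
    case DOP_OR:   return iod | rs2;
    case DOP_ANDN: return iod & ~rs2;
    case DOP_ORN:  return iod | ~rs2;
    case DOP_XNOR: return ~(iod ^ rs2);
    case DOP_MIN:  return ((iod ^ bias) < (rs2 ^ bias)) ? iod : rs2;  /* signed */
    case DOP_MINU: return (iod < rs2) ? iod : rs2;                    /* unsigned */
    case DOP_MAX:  return ((iod ^ bias) > (rs2 ^ bias)) ? iod : rs2;  /* signed */
    }
    assert(0 && "unknown operation");
    return iod;
}
```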
Single bit IO access instructions
- Synopsis
Single bit set (immediate)
- Mnemonic
tio.bseti{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x0 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x0 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit clear (immediate)
- Mnemonic
tio.bclri{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x1 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x1 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit invert (immediate)
- Mnemonic
tio.binvi{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x2 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x2 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit extract from IO register (immediate)
- Mnemonic
tio.bexti{bsel}.xy rd, ios1, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 3, name: 0x6 }, { bits: 1, name: 0, attr: ['iom'] }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x3 }, { bits: 5, name: 'ios1/rs1' }, { bits: 6, name: 'shamt' }, { bits: 3, name: 0x6 }, { bits: 1, name: 0, attr: ['iom'] }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit set
- Mnemonic
tio.bset{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x0 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit clear
- Mnemonic
tio.bclr{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x1 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit invert
- Mnemonic
tio.binv{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x2 }, { bits: 1, name: 'iom' }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Single bit extract from IO register
- Mnemonic
tio.bext{bsel}.xy rd, ios1, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod/rd' }, { bits: 3, name: 0x2 }, { bits: 5, name: 'ios1/rs1' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 3, name: 0x6 }, { bits: 1, name: 0, attr: ['iom'] }, { bits: 2, name: 'bsel' }, ]}
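Below is a behavioural sketch of the single bit operations on RV32. It assumes the register forms take the bit index modulo XLEN, as the Zbs instructions of the same names do. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit index taken modulo XLEN (RV32), as in Zbs; assumed for the tio forms */
static uint32_t bit_mask(uint32_t index) { return UINT32_C(1) << (index & 31u); }

static uint32_t tio_bset(uint32_t io_value, uint32_t index) { return io_value | bit_mask(index); }
static uint32_t tio_bclr(uint32_t io_value, uint32_t index) { return io_value & ~bit_mask(index); }
static uint32_t tio_binv(uint32_t io_value, uint32_t index) { return io_value ^ bit_mask(index); }

/* bext returns the selected bit in bit 0 of rd */
static uint32_t tio_bext(uint32_t io_value, uint32_t index)
{
    return (io_value >> (index & 31u)) & 1u;
}
```

In these terms, the A.1 toggle `tio.binvi GPIOB_ODR, 13` corresponds to `odr = tio_binv(odr, 13)`.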
Destructive single bit IO access instructions
- Synopsis
Destructive single bit set (immediate)
- Mnemonic
tio.bseti{bsel}.y iod, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 4, name: 0x0 }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'sideOP' }, { bits: 6, name: 'shamt' }, { bits: 4, name: 0x0 }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Destructive single bit clear (immediate)
- Mnemonic
tio.bclri{bsel}.y iod, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 4, name: 0x1 }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'sideOP' }, { bits: 6, name: 'shamt' }, { bits: 4, name: 0x1 }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Destructive single bit invert (immediate)
- Mnemonic
tio.binvi{bsel}.y iod, shamt
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'shamt' }, { bits: 1, name: 0 }, { bits: 4, name: 0x2 }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'sideOP' }, { bits: 6, name: 'shamt' }, { bits: 4, name: 0x2 }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Destructive single bit set
- Mnemonic
tio.bset{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x6 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 4, name: 0x0 }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Destructive single bit clear
- Mnemonic
tio.bclr{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x6 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 4, name: 0x1 }, { bits: 2, name: 'bsel' }, ]}
- Synopsis
Destructive single bit invert
- Mnemonic
tio.binv{bsel}.y iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x6 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 'rs2' }, { bits: 1, name: 0 }, { bits: 4, name: 0x2 }, { bits: 2, name: 'bsel' }, ]}
IO bitfield instructions
- Synopsis
Destructive bitfield insert into IO register (immediate)
- Mnemonic
tio.bfinserti{bsel}.yx iod, rs1, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x4 }, { bits: 5, name: 'rs1' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x4 }, { bits: 5, name: 'rs1' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
Note
The rv64 encoding (which has no bsel field) could trade off the extra len/offset range, similarly to the branches (which reach only the bottom 32 bits on rv64).
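Below is a behavioural sketch of the register-source bitfield insert on RV32. It assumes that len is encoded directly (1 to 31) and that the inserted bits come from the low end of rs1. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of tio.bfinserti{bsel}.yx iod, rs1, offset, len (RV32, assumed semantics):
 * the low `len` bits of rs1 replace bits [offset+len-1:offset] of the IO register,
 * and all other bits are preserved. */
static uint32_t tio_bfinsert(uint32_t io_value, uint32_t rs1, unsigned offset, unsigned len)
{
    assert(len >= 1u && len <= 31u && offset + len <= 32u);
    const uint32_t field = (UINT32_C(1) << len) - 1u;  /* len ones */
    const uint32_t mask = field << offset;              /* field position in iod */
    return (io_value & ~mask) | ((rs1 & field) << offset);
}
```

For instance, `tio_bfinsert(cfgr, 0, 0, 2)` clears the two lowest bits, as `tio.bfinserti RCC_CFGR, zero, RCC_CFGR_SW_Pos, 2` does in A.3 (the plain risc-v listing uses andi with -4).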
- Synopsis
Destructive bitfield insert into IO register from immediate (immediate)
- Mnemonic
tio.bfinserti{bsel}.yi iod, uimm, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'uimm[4:0]' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'iod' }, { bits: 3, name: 0x5 }, { bits: 5, name: 'uimm[4:0]' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
- Description
Insert len bits of the expanded uimm[4:0] constant into the iod register at the offset position. uimm=0 is mapped to the -1 constant.
Note
Due to encoding constraints, only the destructive form is provided.
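Below is a sketch of the immediate expansion and insert on RV32. The description fixes only that uimm=0 maps to -1. Two further points are assumptions made here: that non-zero values are zero-extended, and that len is encoded directly. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* uimm=0 encodes -1 (all ones); non-zero values are ASSUMED to be zero-extended */
static uint32_t tio_expand_uimm(unsigned uimm)
{
    assert(uimm < 32u);
    return uimm == 0u ? UINT32_MAX : (uint32_t)uimm;
}

/* Sketch of tio.bfinserti{bsel}.yi iod, uimm, offset, len on RV32 */
static uint32_t tio_bfinserti_yi(uint32_t io_value, unsigned uimm, unsigned offset, unsigned len)
{
    assert(len >= 1u && len <= 31u && offset + len <= 32u);
    const uint32_t field = (UINT32_C(1) << len) - 1u;
    const uint32_t mask = field << offset;
    return (io_value & ~mask) | ((tio_expand_uimm(uimm) & field) << offset);
}
```

Two examples from the appendix:
- `tio_bfinserti_yi(ospeedr, 0, 0, 16)` sets the low 16 bits, matching the `tio.bfinserti GPIOA_OSPEEDR, -1, 0, 16` line of A.4.
- `tio_bfinserti_yi(cfgr, 10, 18, 4)` writes 0b1010 into bits 21:18, the PLLMUL12 value written in A.3.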
- Synopsis
Extract bitfield from IO register (immediate)
- Mnemonic
tio.bfextracti{bsel}.xy rd, ios1, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x6 }, { bits: 5, name: 'ios1' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x6 }, { bits: 5, name: 'ios1' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
Note
The instruction is equivalent to a tio.slli + srli sequence.
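The equivalence can be checked in C on RV32. Shifting left by 32 - offset - len discards the bits above the field, and the logical right shift by 32 - len discards the bits below it. Below is a sketch with hypothetical helper names, compared against the usual shift-and-mask form.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: shift the field down, then mask it */
static uint32_t bfextract_mask(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1u && len <= 31u && offset + len <= 32u);
    return (x >> offset) & ((UINT32_C(1) << len) - 1u);
}

/* slli drops the bits above the field, srli drops the bits below it
 * and zero-fills from the top */
static uint32_t bfextract_slli_srli(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1u && offset + len <= 32u);
    uint32_t t = x << (32u - offset - len);   /* slli */
    return t >> (32u - len);                  /* srli */
}
```

For example, `bfextract_slli_srli(cfgr, 2, 2)` returns the SWS field that A.2 compares against 0b10. The plain listings do the same with and/andi 12 and a compare with 8.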
- Synopsis
Extract and sign-extend bitfield from IO register (immediate)
- Mnemonic
tio.sbfextracti{bsel}.xy rd, ios1, offset, len
- Encoding (RV32)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'ios1' }, { bits: 5, name: 'offset' }, { bits: 5, name: 'len' }, { bits: 2, name: 'bsel' }, ]}
- Encoding (RV64)
{reg:[ { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 0x7 }, { bits: 5, name: 'ios1' }, { bits: 6, name: 'offset' }, { bits: 6, name: 'len' }, ]}
Note
The instruction is equivalent to a tio.slli + srai sequence.
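The signed variant replaces the logical right shift with an arithmetic one. In C, this relies on implementation-defined behaviour: converting an out-of-range value to int32_t, and right-shifting a negative value. gcc and clang define these as wrap-around and arithmetic shift. A portable sign extension is therefore shown alongside for comparison. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* slli + srai on RV32 (relies on gcc/clang wrap-around conversion and
 * arithmetic right shift of negative int32_t, matching srai) */
static int32_t sbfextract_slli_srai(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1u && offset + len <= 32u);
    int32_t t = (int32_t)(x << (32u - offset - len));  /* slli: field MSB -> bit 31 */
    return t >> (32u - len);                           /* srai: sign-filling shift */
}

/* Portable reference: extract, then sign-extend with the xor/subtract trick */
static int32_t sbfextract_portable(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1u && len <= 31u && offset + len <= 32u);
    const uint32_t field = (x >> offset) & ((UINT32_C(1) << len) - 1u);
    const uint32_t sign = UINT32_C(1) << (len - 1u);
    return (int32_t)(field ^ sign) - (int32_t)sign;
}
```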
Branch on single IO bit instructions
- Synopsis
Branch if single bit in IO register is set (immediate)
- Mnemonic
tio.bsbseti{bsel}.y ios1, shamt, label
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] }, { bits: 5, name: 'imm[4:1|11]' }, { bits: 2, name: 0x0 }, { bits: 1, name: 'bsel' }, { bits: 5, name: 'ios1' }, { bits: 5, name: 'shamt' }, { bits: 7, name: 'imm[12|10:5]' }, ]}
Note
Only the bottom 32 bits of the target register are accessible on rv64.
- Synopsis
Branch if single bit in IO register is cleared (immediate)
- Mnemonic
tio.bsbclri{bsel}.y ios1, shamt, label
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] }, { bits: 5, name: 'imm[4:1|11]' }, { bits: 2, name: 0x1 }, { bits: 1, name: 'bsel' }, { bits: 5, name: 'ios1' }, { bits: 5, name: 'shamt' }, { bits: 7, name: 'imm[12|10:5]' }, ]}
Note
Only the bottom 32 bits of the target register are accessible on rv64.
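The branch offset uses the same immediate scatter as the standard B-type branches: imm[12|10:5] in bits 31:25 and imm[4:1|11] in bits 11:7. Below is a minimal C encoder sketch with a hypothetical helper name. Per the diagrams, funct2 is 0x0 for tio.bsbseti and 0x1 for tio.bsbclri. The offset is assumed to be relative to the branch itself, as for the standard branches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder. Layout: opcode[6:0]=0x7b, imm[4:1|11] in [11:7],
 * funct2[13:12], bsel[14], ios1[19:15], shamt[24:20], imm[12|10:5] in [31:25] */
static uint32_t tio_encode_bsb(unsigned funct2, unsigned bsel, unsigned ios1,
                               unsigned shamt, int32_t offset)
{
    assert(funct2 < 2u && bsel < 2u && ios1 < 32u && shamt < 32u);
    assert(offset % 2 == 0 && offset >= -4096 && offset <= 4094);
    const uint32_t imm = (uint32_t)offset;      /* two's complement bit pattern */
    return 0x7bu
         | ((imm >> 11) & 0x1u)  << 7           /* imm[11] */
         | ((imm >> 1)  & 0xfu)  << 8           /* imm[4:1] */
         | (uint32_t)funct2      << 12
         | (uint32_t)bsel        << 14
         | (uint32_t)ios1        << 15
         | (uint32_t)shamt       << 20
         | ((imm >> 5)  & 0x3fu) << 25          /* imm[10:5] */
         | ((imm >> 12) & 0x1u)  << 31;         /* imm[12] */
}
```

Polling loops that branch back to themselves, like those in A.2 and A.3, have offset 0.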
Implemented similarly to the F or Zfinx fcvt instructions.
Note
ADC readings are often converted to float immediately for processing in control loop algorithms.
- Synopsis
Read IO register and convert signed integer to float
- Mnemonic
tio.fcvt{bsel}.s.w.sy rd, ios1, rm
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x53, attr: ['OP-FP'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 'rm' }, { bits: 5, name: 'ios1' }, { bits: 3, name: 0x4 }, { bits: 2, name: 'bsel' }, { bits: 2, name: 'fmt', attr: ['S'] }, { bits: 5, name: 0x1a }, ]}
- Prerequisites
F or Zfinx
- Synopsis
Read IO register and convert unsigned integer to float
- Mnemonic
tio.fcvt{bsel}.s.wu.sy rd, ios1, rm
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x53, attr: ['OP-FP'] }, { bits: 5, name: 'rd' }, { bits: 3, name: 'rm' }, { bits: 5, name: 'ios1' }, { bits: 3, name: 0x5 }, { bits: 2, name: 'bsel' }, { bits: 2, name: 'fmt', attr: ['S'] }, { bits: 5, name: 0x1a }, ]}
- Prerequisites
F or Zfinx
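In plain C terms, each of these instructions folds an IO register read and an integer-to-float conversion into one. Base F code would need a load followed by fcvt.s.w or fcvt.s.wu. Below is a sketch with hypothetical names. The rm field is not modelled: C uses the current rounding mode, which matters only for magnitudes above 2^24.

```c
#include <assert.h>
#include <stdint.h>

/* tio.fcvt{bsel}.s.w.sy: raw IO value interpreted as a signed 32-bit integer */
static float io_fcvt_s_w(uint32_t raw)
{
    /* two's complement reinterpretation without implementation-defined casts */
    int32_t v = (raw & 0x80000000u) ? -(int32_t)(~raw) - 1 : (int32_t)raw;
    return (float)v;
}

/* tio.fcvt{bsel}.s.wu.sy: raw IO value interpreted as an unsigned 32-bit integer */
static float io_fcvt_s_wu(uint32_t raw)
{
    return (float)raw;
}
```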
Implemented similarly to the Zcm* extensions; incompatible with Zcd.
- Synopsis
Move into IO register
- Mnemonic
tio.cm.mv{bsel}.yx iod, rs2
- Encoding (RV32, RV64)
{reg:[ { bits: 2, name: 0x0, attr: ['C0'] }, { bits: 5, name: 'rs2' }, { bits: 5, name: 'iod' }, { bits: 1, name: 'bsel' }, { bits: 3, name: 0x5, attr: ['FSD'] }, ],config:{bits:16}}
- Prerequisites
Zca
- Synopsis
Move from IO register
- Mnemonic
tio.cm.mv{bsel}.xy rd, ios1
- Encoding (RV32, RV64)
{reg:[ { bits: 2, name: 0x2, attr: ['C2'] }, { bits: 5, name: 'ios1' }, { bits: 5, name: 'rd' }, { bits: 1, name: 'bsel' }, { bits: 3, name: 0x1, attr: ['FLDSP'] }, ],config:{bits:16}}
- Prerequisites
Zca
Note
ios1 is placed in the rs2 position, while in the C extension the low bits hold only rd'; maybe swap?
- Synopsis
Set bit in IO register (immediate)
- Mnemonic
tio.cm.bseti0.y iod, shamt
- Encoding (RV32, RV64)
{reg:[ { bits: 2, name: 0x0, attr: ['C0'] }, { bits: 5, name: 'shamt' }, { bits: 5, name: 'iod' }, { bits: 1, name: '0' }, { bits: 3, name: 0x1, attr: ['FLD'] }, ],config:{bits:16}}
- Prerequisites
Zca
Note
Only the bottom 32 bits are accessible on rv64.
- Synopsis
Clear bit in IO register (immediate)
- Mnemonic
tio.cm.bclri0.y iod, shamt
- Encoding (RV32, RV64)
{reg:[ { bits: 2, name: 0x0, attr: ['C0'] }, { bits: 5, name: 'shamt' }, { bits: 5, name: 'iod' }, { bits: 1, name: '1' }, { bits: 3, name: 0x1, attr: ['FLD'] }, ],config:{bits:16}}
- Prerequisites
Zca
Note
Only the bottom 32 bits are accessible on rv64.
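The four 16-bit forms above share one field layout:
- op in bits 1:0
- a 5-bit field in bits 6:2
- another 5-bit field in bits 11:7
- one bit at bit 12
- funct3 in bits 15:13

They reuse the C0 FSD/FLD and C2 FLDSP slots, hence the incompatibility with Zcd. Below is a packing sketch with hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* op[1:0], lo5[6:2], hi5[11:7], bit12, funct3[15:13] */
static uint16_t tio_cm_pack(unsigned op, unsigned lo5, unsigned hi5,
                            unsigned bit12, unsigned funct3)
{
    assert(op < 4u && lo5 < 32u && hi5 < 32u && bit12 < 2u && funct3 < 8u);
    return (uint16_t)(op | lo5 << 2 | hi5 << 7 | bit12 << 12 | funct3 << 13);
}

/* tio.cm.mv{bsel}.yx iod, rs2: C0, FSD slot (funct3 0x5), bit 12 = bsel */
static uint16_t tio_cm_mv_yx(unsigned bsel, unsigned iod, unsigned rs2)
{ return tio_cm_pack(0x0u, rs2, iod, bsel, 0x5u); }

/* tio.cm.mv{bsel}.xy rd, ios1: C2, FLDSP slot (funct3 0x1), bit 12 = bsel */
static uint16_t tio_cm_mv_xy(unsigned bsel, unsigned rd, unsigned ios1)
{ return tio_cm_pack(0x2u, ios1, rd, bsel, 0x1u); }

/* tio.cm.bseti0.y / tio.cm.bclri0.y iod, shamt: C0, FLD slot (funct3 0x1),
 * bit 12 selects set (0) or clear (1) */
static uint16_t tio_cm_bseti0(unsigned iod, unsigned shamt)
{ return tio_cm_pack(0x0u, shamt, iod, 0u, 0x1u); }
static uint16_t tio_cm_bclri0(unsigned iod, unsigned shamt)
{ return tio_cm_pack(0x0u, shamt, iod, 1u, 0x1u); }
```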
This extension provides an optional delay of 0 to 31 cycles before the next IO-targeting instruction can be executed. The number of delay cycles is encoded as uimm[4:0] in the sideOP position.
The delay starts in the cycle after the instruction's implied writeback stage (and write side effects). The delayed instruction cannot trigger any of its side effects until the implied delay downcounter reaches zero, at the cycle of that instruction's implied writeback stage (and write side effects).
Note
Allowing regular instructions to execute within the delay window makes it possible to achieve deterministic timing under non-deterministic execution conditions (caches, flash wait states, etc.) where extra computation is necessary (bit stuffing, FIFO accesses, etc.).
Note
Other sideOP behaviour can be configured by a custom CSR of another extension.
- Example: generating a 50:50 square wave with a 64-cycle period
1:
tio.bseti GPIOA_ODR, 17, [31]
tio.bclri GPIOA_ODR, 17, [31]
b 1b
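Below is a small C timing model of the loop above. It relies on assumptions that go beyond the text:
- An IO instruction with delay d that writes back in cycle w lets the next IO instruction write back no earlier than cycle w + 1 + d.
- The non-IO branch always fits inside the delay window.

Under that model, each half period is 1 + 31 = 32 cycles, giving the stated 64-cycle, 50:50 waveform. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: writeback cycle of the n-th IO instruction (n = 0 is the
 * first, at cycle 0) of a loop whose IO instructions carry `delays`, in order.
 * Each IO instruction allows the next IO writeback 1 + delay cycles later;
 * non-IO instructions are assumed to fit inside the delay window. */
static uint64_t io_writeback_cycle(const unsigned *delays, size_t count, size_t n)
{
    assert(count > 0);
    uint64_t cycle = 0;
    for (size_t i = 0; i < n; i++) {
        assert(delays[i % count] <= 31u);   /* uimm[4:0] */
        cycle += 1u + delays[i % count];
    }
    return cycle;
}
```

With delays of [15] on the bseti and [31] on the bclri instead, the same model predicts a 48-cycle period with the pin high for 16 cycles.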
This instruction doesn’t access any IO register, but it causes pipeline contention as if it were a read-modify-write on an IO register.
- Mnemonic
tio.nop
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 0x0 }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 0x0 }, { bits: 5, name: 0xc }, { bits: 2, name: 0x0 }, ]}
Unlike tio.nop, tio.nop.sideattach doesn’t cause pipeline contention; instead, it attaches its own sideOP to the next IO-accessing tio instruction, effectively overriding that instruction’s sideOP if present (the sideOP of the next instruction has no effect).
It cannot be overridden by itself; only the last sideattach instruction is effective.
Note
Requires a special CSR to hold the attached sideOP.
Note
The uimm=0 sideOP encoding can be used to null out the sideOP of the following instruction.
- Mnemonic
tio.nop.sideattach [sideOP]
Note
The square brackets are mandatory.
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] }, { bits: 5, name: 0x0 }, { bits: 3, name: 0x5 }, { bits: 5, name: 'sideOP' }, { bits: 5, name: 0x0 }, { bits: 5, name: 0xd }, { bits: 2, name: 0x0 }, ]}
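Below is a sketch of the attach behaviour described above. It models the special CSR as a pending flag plus the attached 5-bit value. The attachment is assumed here to be consumed by the next IO-accessing tio instruction. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the CSR holding an attached sideOP */
struct sideattach_csr {
    bool pending;      /* a sideattach waits for the next IO-accessing tio instruction */
    uint8_t side_op;   /* the 5-bit sideOP it carries */
};

/* tio.nop.sideattach [side_op]: a later sideattach replaces an earlier one,
 * so only the last one before an IO instruction is effective */
static void sideattach(struct sideattach_csr *csr, uint8_t side_op)
{
    assert(side_op < 32u);
    csr->pending = true;
    csr->side_op = side_op;
}

/* Effective sideOP of an IO-accessing tio instruction: an attached value
 * overrides the instruction's own field and is consumed */
static uint8_t effective_side_op(struct sideattach_csr *csr, uint8_t own_side_op)
{
    assert(own_side_op < 32u);
    if (csr->pending) {
        csr->pending = false;
        return csr->side_op;
    }
    return own_side_op;
}
```

Attaching `[0]` before an instruction returns 0 from `effective_side_op`, which nulls out that instruction's own sideOP.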
risc-v listings were generated by "clang 15.0.0" with -Os -march=rv32imafc_zba_zbb_zbs
flags (clang was chosen because its listings are cleaner than gcc’s and the generated code is slightly more efficient).
armv7m listings were generated by "gcc 11.2.1 (none)" with -Os -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16
flags (the newest non-Linux one on godbolt).
risc-v + XTightlyCoupledIO listings are hypothetical compiler outputs. Note that many of the definitions they use don’t even exist in the device headers.
void toggle() {
GPIOB->ODR ^= GPIO_ODR_13;
}
Note
On avr8, a GPIO pin can be toggled by writing a one to its PINxn bit with the out or sbi instructions (sbi is not a read-modify-write here).
- risc-v
toggle(): # @toggle()
lui a0, 294912
lw a1, 1044(a0)
binvi a1, a1, 13
sw a1, 1044(a0)
ret
- armv7m
toggle():
ldr r2, .L5
ldr r3, [r2, #20]
eor r3, r3, #8192
str r3, [r2, #20]
bx lr
.L5:
.word 1207960576
- risc-v + XTightlyCoupledIO
toggle():
tio.binvi GPIOB_ODR, 13
ret
- Results
| | risc-v | armv7-m | risc-v + XTightlyCoupledIO |
|---|---|---|---|
| code size (bytes) | 18 | 16 | 6 |
void init_clocks()
{
FLASH->ACR = FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001); // 1ws
RCC->CFGR = RCC_CFGR_PLLMUL12;
RCC->CR |= RCC_CR_PLLON;
while(!(RCC->CR & RCC_CR_PLLRDY));
RCC->CFGR |= RCC_CFGR_SW_PLL;
while ((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL);
}
- risc-v
init_clocks(): # @init_clocks()
lui a0, 262178
li a1, 17
sw a1, 0(a0)
lui a0, 262177
lui a1, 640
sw a1, 4(a0)
lw a1, 0(a0)
bseti a1, a1, 24
sw a1, 0(a0)
.LBB0_1: # =>This Inner Loop Header: Depth=1
lw a1, 0(a0)
slli a1, a1, 6
bgez a1, .LBB0_1
lui a0, 262177 // redundant
lw a1, 4(a0)
ori a1, a1, 2
sw a1, 4(a0)
li a1, 8
.LBB0_3: # =>This Inner Loop Header: Depth=1
lw a2, 4(a0)
andi a2, a2, 12
bne a2, a1, .LBB0_3
ret
Note
gcc 12.2 fails to detect the slli + bgez pattern and emits li + and + beq instead, even though the equivalent pattern works fine on arm.
- armv7m
init_clocks():
ldr r3, .L7
movs r2, #17
str r2, [r3]
sub r3, r3, #4096
mov r2, #2621440
str r2, [r3, #4]
ldr r2, [r3]
orr r2, r2, #16777216
str r2, [r3]
.L2:
ldr r2, [r3]
lsls r2, r2, #6
bpl .L2
ldr r2, [r3, #4]
orr r2, r2, #2
str r2, [r3, #4]
.L3:
ldr r2, [r3, #4]
and r2, r2, #12
cmp r2, #8
bne .L3
bx lr
.L7:
.word 1073881088
- risc-v + XTightlyCoupledIO
init_clocks():
tio.addi FLASH_ACR, zero, (FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001))
lui t0, %hi(RCC_CFGR_PLLMUL12)
tio.cm.mv RCC_CFGR, t0 // no need for addi
tio.cm.bseti RCC_CR, RCC_CR_PLLON_Pos
1:
tio.bsbclri RCC_CR, RCC_CR_PLLRDY_Pos, 1b
tio.cm.bseti RCC_CFGR, RCC_CFGR_SW_Pos+1 // effectively 0b10
2:
tio.bfextracti t0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
tio.bnei t0, (RCC_CFGR_SWS_PLL >> RCC_CFGR_SWS_Pos), 2b
ret
- Results
| | risc-v | armv7-m | risc-v + XTightlyCoupledIO |
|---|---|---|---|
| code size (bytes) | 58 (54 without the redundant lui) | 52 | 28 |
void init_clocks2()
{
FLASH->ACR = FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001); // 1ws
if((RCC->CFGR & RCC_CFGR_SWS) == RCC_CFGR_SWS_PLL)
{
RCC->CFGR &= ~RCC_CFGR_SW_Msk; // switch to HSI (0b00)
while((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_HSI);
}
RCC->CR &= ~RCC_CR_PLLON;
while((RCC->CR & RCC_CR_PLLRDY));
RCC->CFGR = RCC_CFGR_PLLMUL12 | (RCC->CFGR & ~RCC_CFGR_PLLMUL_Msk);
RCC->CR |= RCC_CR_PLLON;
while(!(RCC->CR & RCC_CR_PLLRDY));
RCC->CFGR = RCC_CFGR_SW_PLL | (RCC->CFGR & ~RCC_CFGR_SW_Msk);
while((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL);
}
- risc-v
init_clocks2(): # @init_clocks2()
lui a0, 262178
li a1, 17
sw a1, 0(a0)
lui a0, 262177
lw a1, 4(a0)
andi a1, a1, 12
li a2, 8
bne a1, a2, .LBB1_3
lw a1, 4(a0)
andi a1, a1, -4
sw a1, 4(a0)
.LBB1_2: # =>This Inner Loop Header: Depth=1
lw a1, 4(a0)
andi a1, a1, 12
bnez a1, .LBB1_2
.LBB1_3:
lw a1, 0(a0)
bclri a1, a1, 24
sw a1, 0(a0)
.LBB1_4: # =>This Inner Loop Header: Depth=1
lw a1, 0(a0)
slli a1, a1, 6
bltz a1, .LBB1_4
lui a0, 262177 // redundant
lw a1, 4(a0)
lui a2, 1047616
addi a2, a2, -1
and a1, a1, a2
bseti a1, a1, 19
bseti a1, a1, 21
sw a1, 4(a0)
lw a1, 0(a0)
bseti a1, a1, 24
sw a1, 0(a0)
.LBB1_6: # =>This Inner Loop Header: Depth=1
lw a1, 0(a0)
slli a1, a1, 6
bgez a1, .LBB1_6
lui a0, 262177 // redundant
lw a1, 4(a0)
andi a1, a1, -4
ori a1, a1, 2
sw a1, 4(a0)
li a1, 8
.LBB1_8: # =>This Inner Loop Header: Depth=1
lw a2, 4(a0)
andi a2, a2, 12
bne a2, a1, .LBB1_8
ret
- armv7m
init_clocks2():
ldr r3, .L20
movs r2, #17
str r2, [r3]
sub r3, r3, #4096
ldr r2, [r3, #4]
and r2, r2, #12
cmp r2, #8
bne .L10
ldr r2, [r3, #4]
bic r2, r2, #3
str r2, [r3, #4]
.L11:
ldr r2, [r3, #4]
tst r2, #12
bne .L11
.L10:
ldr r2, [r3]
bic r2, r2, #16777216
str r2, [r3]
.L12:
ldr r2, [r3]
lsls r1, r2, #6
bmi .L12
ldr r2, [r3, #4]
bic r2, r2, #3932160
orr r2, r2, #2621440
str r2, [r3, #4]
ldr r2, [r3]
orr r2, r2, #16777216
str r2, [r3]
.L13:
ldr r2, [r3]
lsls r2, r2, #6
bpl .L13
ldr r2, [r3, #4]
bic r2, r2, #3
orr r2, r2, #2
str r2, [r3, #4]
.L14:
ldr r2, [r3, #4]
and r2, r2, #12
cmp r2, #8
bne .L14
bx lr
.L20:
.word 1073881088
Note
gcc generally fails to detect the bitfield-insert-from-constant pattern and emits bic + orr instead.
- risc-v + XTightlyCoupledIO
init_clocks2():
tio.addi FLASH_ACR, zero, (FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001))
tio.bfextracti a0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
tio.bnei a0, (RCC_CFGR_SWS_PLL >> RCC_CFGR_SWS_Pos), 2f
tio.bfinserti RCC_CFGR, zero, RCC_CFGR_SW_Pos, 2
1:
tio.bfextracti a0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
c.bnez a0, 1b // needs x8-x15 register
2:
tio.cm.bclri RCC_CR, RCC_CR_PLLON_Pos
3:
tio.bsbseti RCC_CR, RCC_CR_PLLRDY_Pos, 3b
tio.bfinserti RCC_CFGR, (RCC_CFGR_PLLMUL12 >> RCC_CFGR_PLLMUL_Pos), RCC_CFGR_PLLMUL_Pos, 4
tio.cm.bseti RCC_CR, RCC_CR_PLLON_Pos
4:
tio.bsbclri RCC_CR, RCC_CR_PLLRDY_Pos, 4b
tio.bfinserti RCC_CFGR, (RCC_CFGR_SW_PLL >> RCC_CFGR_SW_Pos), RCC_CFGR_SW_Pos, 2
5:
tio.bfextracti a0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
tio.bnei a0, (RCC_CFGR_SWS_PLL >> RCC_CFGR_SWS_Pos), 5b
ret
- Results
| | risc-v | armv7-m | risc-v + XTightlyCoupledIO |
|---|---|---|---|
| code size (bytes) | 116 (108 without the redundant lui) | 104 | 52 |
comes from: [7]
void init_7seg() {
RCC->AHBENR |= RCC_AHBENR_GPIOAEN | RCC_AHBENR_GPIOBEN | RCC_AHBENR_GPIOFEN;
//common
GPIOB->MODER |= (0b01 << GPIO_MODER_MODER1_Pos);
GPIOF->MODER |= (0b01 << GPIO_MODER_MODER0_Pos) | (0b01 << GPIO_MODER_MODER1_Pos);
GPIOA->MODER |= (0b01 << GPIO_MODER_MODER9_Pos);
// initialize to disabled state (common scattered will blink first
// digit on all columns on startup otherwise)
GPIOB->BSRR = GPIO_BSRR_BS_1;
GPIOF->BSRR = GPIO_BSRR_BS_0 | GPIO_BSRR_BS_1;
GPIOA->BSRR = GPIO_BSRR_BS_9;
//segment
GPIOA->MODER |= (0b01 << GPIO_MODER_MODER4_Pos)
|(0b01 << GPIO_MODER_MODER2_Pos)
|(0b01 << GPIO_MODER_MODER6_Pos)
|(0b01 << GPIO_MODER_MODER5_Pos)
|(0b01 << GPIO_MODER_MODER1_Pos)
|(0b01 << GPIO_MODER_MODER3_Pos)
|(0b01 << GPIO_MODER_MODER7_Pos)
|(0b01 << GPIO_MODER_MODER0_Pos);
GPIOA->OSPEEDR |= (0b11 << GPIO_OSPEEDR_OSPEEDR4_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR2_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR6_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR5_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR1_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR3_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR7_Pos)
|(0b11 << GPIO_OSPEEDR_OSPEEDR0_Pos);
RCC->APB2ENR |= RCC_APB2ENR_TIM16EN;
TIM16->DIER = TIM_DIER_UIE;
TIM16->ARR = 47999; // 1khz isr rate at 48 mhz
TIM16->CR1 = TIM_CR1_CEN;
//NVIC_EnableIRQ(TIM16_IRQn);
}
- risc-v
init_7seg(): # @init_7seg_gpio()
lui a0, 262177
lw a1, 20(a0)
lui a2, 1120
or a1, a1, a2
sw a1, 20(a0)
lui a1, 294912
lw a2, 1024(a1)
ori a2, a2, 4
sw a2, 1024(a1)
lui a2, 294913
lw a3, 1024(a2)
ori a3, a3, 5
sw a3, 1024(a2)
lw a3, 0(a1)
bseti a3, a3, 18
sw a3, 0(a1)
li a3, 2
sw a3, 1048(a1)
li a3, 3
sw a3, 1048(a2)
li a2, 512
sw a2, 24(a1)
lw a2, 0(a1)
lui a3, 5
addi a3, a3, 1365
or a2, a2, a3
sw a2, 0(a1)
lw a2, 8(a1)
lui a3, 16
addi a3, a3, -1
or a2, a2, a3
sw a2, 8(a1)
lw a1, 24(a0)
bseti a1, a1, 17
sw a1, 24(a0)
lui a0, 262164
li a1, 1
sw a1, 1036(a0)
lui a2, 12
addi a2, a2, -1153
sw a2, 1068(a0)
sw a1, 1024(a0)
ret
- armv7m
init_7seg():
ldr r1, .L2
ldr r0, .L2+4
ldr r3, [r1, #20]
ldr r2, .L2+8
orr r3, r3, #4587520
push {r4, lr}
str r3, [r1, #20]
ldr r3, [r0]
orr r3, r3, #4
str r3, [r0]
ldr r3, [r2]
orr r3, r3, #5
str r3, [r2]
mov r3, #1207959552
ldr r4, [r3]
orr r4, r4, #262144
str r4, [r3]
movs r4, #2
str r4, [r0, #24]
movs r0, #3
str r0, [r2, #24]
mov r2, #512
str r2, [r3, #24]
ldr r2, [r3]
orr r2, r2, #21760
orr r2, r2, #85
str r2, [r3]
ldr r2, [r3, #8]
mvn r2, r2, lsr #16
mvn r2, r2, lsl #16
str r2, [r3, #8]
ldr r3, [r1, #24]
orr r3, r3, #131072
str r3, [r1, #24]
ldr r3, .L2+12
movs r2, #1
movw r1, #47999
str r2, [r3, #12]
str r1, [r3, #44]
str r2, [r3]
pop {r4, pc}
.L2:
.word 1073876992
.word 1207960576
.word 1207964672
.word 1073824768
- risc-v + XTightlyCoupledIO
init_7seg():
lui t0, %hi(RCC_AHBENR_GPIOAEN | RCC_AHBENR_GPIOBEN | RCC_AHBENR_GPIOFEN)
tio.or RCC_AHBENR, t0
tio.cm.bseti GPIOB_MODER, GPIO_MODER_MODER1_Pos // '0' bit doesn't matter in oring
c.li t0, (0b01 << GPIO_MODER_MODER0_Pos) | (0b01 << GPIO_MODER_MODER1_Pos)
tio.or GPIOF_MODER, t0
tio.cm.bseti GPIOA_MODER, GPIO_MODER_MODER9_Pos // '0' bit doesn't matter in oring
tio.addi GPIOB_BSRR, zero, GPIO_BSRR_BS_1 // can also bseti from x0
tio.addi GPIOF_BSRR, zero, (GPIO_BSRR_BS_0 | GPIO_BSRR_BS_1)
tio.addi GPIOA_BSRR, zero, GPIO_BSRR_BS_9 // can also bseti from x0
c.lui t0, %hi(0b0101010101010101)
addi t0, t0, %lo(0b0101010101010101)
tio.or GPIOA_MODER, t0
tio.bfinserti GPIOA_OSPEEDR, -1, 0, 16 // equiv to or
tio.cm.bseti RCC_APB2ENR, RCC_APB2ENR_TIM16EN_Pos
//c.li t1, 1 // UIE and CEN, 2 bytes smaller at higher reg pressure
//tio.cm.mv TIM16_DIER, t1
tio.addi TIM16_DIER, zero, TIM_DIER_UIE // can also bseti from x0
c.lui t0, %hi(47999)
tio.addi TIM16_ARR, t0, %lo(47999)
//tio.cm.mv TIM16_CR1, t1
tio.addi TIM16_CR1, zero, TIM_CR1_CEN // can also bseti from x0
ret
- Results
| risc-v | armv7-m | risc-v + XTightlyCoupledIO |
---|---|---|---|
code size (bytes) | 128 | 124 | 60 (58 at higher pressure) |
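Several listings in this appendix build constants with a lui/addi (or c.lui/tio.addi) pair, e.g. 47999 for TIM16_ARR becomes lui 12 plus an addi of -1153. Because the 12-bit addi immediate is sign-extended, the upper part has to be rounded up whenever bit 11 of the constant is set. A minimal C sketch of that %hi/%lo split (the hi20/lo12 helper names are made up for illustration, they are not a toolchain API):

```c
#include <stdint.h>

/* Hypothetical helpers: split a 32-bit constant the way the %hi()/%lo()
 * operators do, so that (hi20(v) << 12) + lo12(v) == v, where lo12(v)
 * is the sign-extended 12-bit addi immediate. */
static uint32_t hi20(uint32_t value)
{
    /* +0x800 compensates for the sign extension of the low part */
    return ((value + 0x800u) >> 12) & 0xFFFFFu;
}

static int32_t lo12(uint32_t value)
{
    int32_t lo = (int32_t)(value & 0xFFFu);
    if (lo >= 0x800)        /* bit 11 set: the addi immediate is negative */
        lo -= 0x1000;
    return lo;
}
```

The same arithmetic reproduces the lui 1120 of the RCC_AHBENR write and the lui 16 / addi -1 pair used for the 0xFFFF OSPEEDR mask in the compiled risc-v code.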
Taken from [7]; the code relies heavily on the BSRR/BRR registers.
using segment_config = jnk0le::sseg::PinConfig<false, GPIOA_BASE, 4, 2, 6, 5, 1, 3, 7, 0>;
using common_simple = jnk0le::sseg::CommonConfig<true, GPIOB_BASE, 2, 3, 5, 8>;
jnk0le::sseg::Display<segment_config, common_simple> displ;
extern "C" void TIM16_IRQHandler()
{
TIM16->SR = 0; //bits are rc_w0
displ.defaultIrqHandler();
}
// handler in Display class
void defaultIrqHandler()
{
common_config::turnOff(cnt);
// cnt is not volatile, but gcc emits poor code without this local copy
// Its scope must not be any wider, otherwise register pressure increases in both llvm and gcc
uint32_t cnt_tmp = cnt;
if(cnt_tmp == 0)
cnt_tmp = common_config::getColumnAmount(); // 1 more than effective indexing
cnt_tmp--;
cnt = cnt_tmp;
// put delay here in case of ghosting
seg_config::getSegGPIO()->BSRR = disp_cache[cnt];
common_config::turnOn(cnt);
}
// turn off/on in CommonConfig class
static inline constexpr void turnOff([[maybe_unused]] uint32_t idx)
{
if constexpr(invert_polarity)
reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BSRR = selectAllPinsMask();
else
reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BRR = selectAllPinsMask();
}
static inline constexpr void turnOn(uint32_t idx)
{
if constexpr(invert_polarity) {
reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BRR =
static_cast<uint32_t>(column_pin_mask_lut[idx]);
} else {
reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BSRR =
static_cast<uint32_t>(column_pin_mask_lut[idx]);
}
}
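The handler leans on the STM32 BSRR/BRR semantics: writing 1 to BSRR bits 0..15 sets the matching output, writing 1 to bits 16..31 resets it (set wins when both are written), writing 1 to BRR bits 0..15 resets it, and zeros leave outputs untouched, so no read-modify-write is needed. A host-side model of those writes and of the column countdown, as a sketch only (the function names are hypothetical, real code just stores to the registers):

```c
#include <stdint.h>

/* Hypothetical model of the output data register after a BSRR write.
 * Low half: set bits, high half: reset bits, set has priority. */
static uint16_t odr_after_bsrr(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}

/* Model of a BRR write: only the low half is used, and it resets pins. */
static uint16_t odr_after_brr(uint16_t odr, uint32_t brr)
{
    return (uint16_t)(odr & (uint16_t)~(brr & 0xFFFFu));
}

/* Column countdown as in defaultIrqHandler(): ..., 2, 1, 0, columns-1, ... */
static uint32_t next_column(uint32_t cnt, uint32_t columns)
{
    if (cnt == 0)
        cnt = columns;
    return cnt - 1;
}
```

With the inverted common polarity used here, the 0x12c write to GPIOB_BSRR (pins 2, 3, 5, 8) turns all columns off before the new segment pattern goes out.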
- risc-v
TIM16_IRQHandler: # @TIM16_IRQHandler
lui a0, 262164
sw zero, 1040(a0)
lui a0, 294912
li a1, 300
sw a1, 1048(a0) //20
lui a1, %hi(displ)
addi a2, a1, %lo(displ)
lw a3, 16(a2)
li a1, 3
beqz a3, .LBB0_2 //34
addi a1, a3, -1 //38
.LBB0_2:
sw a1, 16(a2)
sh2add a2, a1, a2
lw a2, 0(a2) //46
lui a3, %hi(trimmed::column_pin_mask_lut)
addi a3, a3, %lo(trimmed::column_pin_mask_lut)
sh1add a1, a1, a3 //58
lhu a1, 0(a1) //62
sw a2, 24(a0)
sw a1, 1064(a0)
ret
- armv7m
TIM16_IRQHandler:
ldr r3, .L3
ldr r1, .L3+4
movs r2, #0
str r2, [r3, #16]
ldr r2, .L3+8
ldr r3, [r2, #16]
cmp r3, #0
it eq
moveq r3, #4
subs r3, r3, #1
mov r0, #300
str r0, [r1, #24]
ldr r0, [r2, r3, lsl #2]
str r3, [r2, #16]
mov r2, #1207959552
str r0, [r2, #24]
ldr r2, .L3+12
ldrh r3, [r2, r3, lsl #1]
str r3, [r1, #40]
bx lr
.L3:
.word 1073824768
.word 1207960576
.word .LANCHOR0
.word trimmed::column_pin_mask_lut
- risc-v + XTightlyCoupledIO
tio.cm.mv TIM16_SR, zero
tio.addi GPIOB_BSRR, zero, 0x12c // pins 2,3,5,8
lui a0, %hi(displ)
addi a0, a0, %lo(displ)
c.lw a1, 16(a0) // get cnt
c.bnez a1, 1f
c.li a1, 4
1:
c.addi a1, -1
c.sw a1, 16(a0)
sh2add a0, a1, a0 // disp_cache[cnt]
c.lw a0, 0(a0)
tio.cm.mv GPIOA_BSRR, a0
lui a0, %hi(trimmed::column_pin_mask_lut)
addi a0, a0, %lo(trimmed::column_pin_mask_lut)
sh1add a0, a1, a0
lhu a0, 0(a0) //c. with Zcb
tio.cm.mv GPIOB_BRR, a0
ret
- Results
| risc-v | armv7-m | risc-v + XTightlyCoupledIO |
---|---|---|---|
code size (bytes) | 70 (68 with Zcb) | 64 | 52 (50 with Zcb) |
Results assume that FLASH/SRAM are kept at the typical 0x08000000/0x20000000 addresses.
Note: gp relaxation can further reduce the risc-v sizes.
Taken from [10], page 3-5.
- "#define approach"
*TIMER0TCR |= 0x0010; // Stop CPU Timer0
*TIMER0TPRD32 = 0x00010000; // Load new 32-bit period value
*TIMER0TCR &= 0xFFEF; // Start CPU Timer0
- "structure approach"
CpuTimer0Regs.TCR.bit.TSS = 1; // Stop CPU Timer0
CpuTimer0Regs.PRD.all = 0x00010000; // Load new 32-bit period value
CpuTimer0Regs.TCR.bit.TSS = 0; // Start CPU Timer0
- c2000 "#define approach"
MOV @AL,*(0:0x0C04) ;4
ORB AL, #0x10 ;2
MOV *(0:0x0C04), @AL ;4
MOVL XAR5, #0x010000 ;4
MOVL XAR4, #0x000C0A ;4
MOVL *+XAR4[0], XAR5 ;2
MOV @AL, *(0:0x0C04) ;4
AND @AL, #0xFFEF ;4
MOV *(0:0x0C04), @AL ;4
32 bytes and 9 cycles
- c2000 "structure approach"
MOVW DP, #0x0030 ;4/2?
OR @4, #0x0010 ;4
MOVL XAR4, #0x010000 ;4
MOVL @2, XAR4 ;2
AND @4, #0xFFEF ;4
18 bytes (16 if DP can be done by MOVZ) and 5 cycles
- risc-v + XTightlyCoupledIO
tio.cm.bseti TIMER0TCR, TSS_Pos
tio.bseti TIMER0TPRD32, zero, 16
tio.cm.bclri TIMER0TCR, TSS_Pos
8 bytes and 3 cycles (12 bytes if tio.cm is unavailable)
Note: when using a modern compiler (gcc, llvm), there should be no difference between defines and structures.
Note: type punning through union bitfields is UB in C++ and implementation-defined in C [11].
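Both C variants reduce to a read-modify-write of TCR around a plain 32-bit store to PRD. Assuming TSS sits at bit 4 of TCR (as the 0x0010/0xFFEF masks imply), a host-side sketch with plain variables standing in for the registers shows how those masks map to the bit position taken by tio.cm.bseti/tio.cm.bclri, and that the 0x00010000 period is just bit 16 (tio.bseti ..., zero, 16). The TSS_Pos/TSS_Msk definitions and the names ending in _sketch are assumptions for illustration:

```c
#include <stdint.h>

#define TSS_Pos 4u                           /* assumed, implied by the 0x0010 mask */
#define TSS_Msk ((uint16_t)(1u << TSS_Pos))  /* 0x0010 */

/* Stand-ins for TIMER0TCR and TIMER0TPRD32 (volatile MMIO on the target). */
static volatile uint16_t tcr_sketch;
static volatile uint32_t prd_sketch;

static void reload_timer_sketch(uint32_t period)
{
    tcr_sketch |= TSS_Msk;             /* stop:  tio.cm.bseti TIMER0TCR, TSS_Pos */
    prd_sketch = period;               /* new 32-bit period value */
    tcr_sketch &= (uint16_t)~TSS_Msk;  /* start: tio.cm.bclri TIMER0TCR, TSS_Pos (mask 0xFFEF) */
}
```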
Taken from [12], par. 5.
This kind of code appears very frequently, especially in c2000 codebases. Even though all of these assignments could be coalesced into a single write, compilers cannot do it: each assignment is a separate volatile read-modify-write, and any attempt to merge them would change the resulting side effects, effectively breaking the code.
SysCtrlRegs.PCLKCR0.bit.rsvd1 = 0;
SysCtrlRegs.PCLKCR0.bit.TBCLKSYNC = 0;
SysCtrlRegs.PCLKCR0.bit.ADCENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.I2CAENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.rsvd2 = 0;
SysCtrlRegs.PCLKCR0.bit.SPICENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SPIDENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SPIAENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SPIBENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SCIAENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SCIBENCLK = 0;
SysCtrlRegs.PCLKCR0.bit.rsvd3 = 0;
SysCtrlRegs.PCLKCR0.bit.ECANAENCLK= 1;
SysCtrlRegs.PCLKCR0.bit.ECANBENCLK= 0;
- c2000
MOVW DP,#0x01C0
AND @28,#0xFFFC
AND @28,#0xFFFB
OR @28,#0x0008
OR @28,#0x0010
AND @28,#0xFFDF
OR @28,#0x0040
OR @28,#0x0080
OR @28,#0x0100
OR @28,#0x0200
OR @28,#0x0400
AND @28,#0xF7FF
AND @28,#0xCFFF
OR @28,#0x4000
AND @28,#0x7FFF
60 bytes (58 if DP can be done by MOVZ)
Note: table 3 suggests it takes 6 cycles per AMO-ALU instruction.
- risc-v + XTightlyCoupledIO
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_rsvd1_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_TBCLKSYNC_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_ADCENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_I2CAENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_rsvd2_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPICENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPIDENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPIAENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPIBENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SCIAENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SCIBENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_rsvd3_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_ECANAENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_ECANBENCLK_Pos
28 bytes (56 if tio.cm is unavailable)
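Why the 14 bitfield assignments cannot be merged: each one is a separate volatile read-modify-write, so the register sees 14 writes instead of one. The AND/OR masks below are copied from the c2000 listing; the register is modeled as a plain variable with a write counter (hypothetical helpers, host-side only) to show that the end value equals the single 0x47D8 write from any starting state, while the sequence of writes does not:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register model: holds the value and counts writes. */
typedef struct {
    uint16_t value;
    unsigned writes;
} reg_sketch;

static void reg_write(reg_sketch *r, uint16_t v) { r->value = v; r->writes++; }

/* AND/OR masks of the 14 read-modify-writes, in source order
 * (rsvd1, TBCLKSYNC, ADCENCLK, ..., ECANBENCLK). */
static const struct { uint16_t and_mask, or_mask; } pclkcr0_steps[14] = {
    {0xFFFC, 0}, {0xFFFB, 0}, {0xFFFF, 0x0008}, {0xFFFF, 0x0010},
    {0xFFDF, 0}, {0xFFFF, 0x0040}, {0xFFFF, 0x0080}, {0xFFFF, 0x0100},
    {0xFFFF, 0x0200}, {0xFFFF, 0x0400}, {0xF7FF, 0}, {0xCFFF, 0},
    {0xFFFF, 0x4000}, {0x7FFF, 0},
};

static void pclkcr0_bitfield_style(reg_sketch *r)
{
    for (size_t i = 0; i < 14; i++) {
        uint16_t v = r->value;                                /* read   */
        v = (uint16_t)((v & pclkcr0_steps[i].and_mask)
                       | pclkcr0_steps[i].or_mask);           /* modify */
        reg_write(r, v);                                      /* write  */
    }
}

static void pclkcr0_single_write(reg_sketch *r)
{
    reg_write(r, 0x47D8);
}
```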
Taken from [12], par. 5.
- using magic value
SysCtrlRegs.PCLKCR0.all = 0x47D8;
- using "shadow register"
// Enable only 2801 Peripheral Clocks
union PCLKCR0_REG shadowPCLKCR0;
shadowPCLKCR0.bit.rsvd1 = 0;
shadowPCLKCR0.bit.TBCLKSYNC = 0;
shadowPCLKCR0.bit.ADCENCLK = 1; // ADC
shadowPCLKCR0.bit.I2CAENCLK = 1; // I2C
shadowPCLKCR0.bit.rsvd2 = 0;
shadowPCLKCR0.bit.SPICENCLK = 1; // SPI-C
shadowPCLKCR0.bit.SPIDENCLK = 1; // SPI-D
shadowPCLKCR0.bit.SPIAENCLK = 1; // SPI-A
shadowPCLKCR0.bit.SPIBENCLK = 1; // SPI-B
shadowPCLKCR0.bit.SCIAENCLK = 1; // SCI-A
shadowPCLKCR0.bit.SCIBENCLK = 0; // SCI-B
shadowPCLKCR0.bit.rsvd3 = 0;
shadowPCLKCR0.bit.ECANAENCLK= 1; // eCAN-A
shadowPCLKCR0.bit.ECANBENCLK= 0; // eCAN-B
SysCtrlRegs.PCLKCR0.all = shadowPCLKCR0.all;
- c2000 using magic value
MOVW DP,#0x01C0
MOV @28,#0x47D8
8 bytes (6 if DP can be done by MOVZ)
- c2000 using "shadow register"
MOV @AL,#0x47D8
MOVW DP,#0x01C0
MOV @28,AL
10 bytes (8 if DP can be done by MOVZ)
- risc-v + XTightlyCoupledIO
c.lui t0, %hi(0x47D8)
tio.addi SysCtrl_PCLKCR0, t0, %lo(0x47D8)
6 bytes
Note: when using a modern compiler (gcc, llvm), there should be no difference between magic values and the "shadow register".
Note: vendors usually provide bitmask definitions for these bits, so that the written value can be constructed with bitwise operations on them, e.g. SysCtrl.PCLKCR0 = SysCtrl_PCLKCR0_ADCENCLK | SysCtrl_PCLKCR0_I2CAENCLK […]
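Following that approach, the 0x47D8 magic value can be built from per-bit masks. The macro names mirror the SysCtrl_PCLKCR0_... naming used above, but the definitions and bit positions are assumptions inferred from the AND/OR masks of the c2000 listing, not copied from a vendor header:

```c
#include <stdint.h>

/* Hypothetical bit masks; positions inferred from the c2000 AND/OR listing. */
#define SysCtrl_PCLKCR0_ADCENCLK    (1u << 3)
#define SysCtrl_PCLKCR0_I2CAENCLK   (1u << 4)
#define SysCtrl_PCLKCR0_SPICENCLK   (1u << 6)
#define SysCtrl_PCLKCR0_SPIDENCLK   (1u << 7)
#define SysCtrl_PCLKCR0_SPIAENCLK   (1u << 8)
#define SysCtrl_PCLKCR0_SPIBENCLK   (1u << 9)
#define SysCtrl_PCLKCR0_SCIAENCLK   (1u << 10)
#define SysCtrl_PCLKCR0_ECANAENCLK  (1u << 14)

/* Bits left out (TBCLKSYNC, SCIBENCLK, ECANBENCLK, reserved) end up 0,
 * as in the bitfield and "shadow register" versions. */
static const uint16_t pclkcr0_value =
    SysCtrl_PCLKCR0_ADCENCLK  | SysCtrl_PCLKCR0_I2CAENCLK |
    SysCtrl_PCLKCR0_SPICENCLK | SysCtrl_PCLKCR0_SPIDENCLK |
    SysCtrl_PCLKCR0_SPIAENCLK | SysCtrl_PCLKCR0_SPIBENCLK |
    SysCtrl_PCLKCR0_SCIAENCLK | SysCtrl_PCLKCR0_ECANAENCLK;
```

The whole expression folds to a compile-time constant, so it still compiles to the same single write as the magic value.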
Taken from [12], par. 6.2.
- using "shadow register" to preserve TIF
union TCR_REG shadowTCR;
// Use a shadow register to stop the timer
// and preserve TIF (write 1-to-clear bit)
shadowTCR.all = CpuTimer0Regs.TCR.all;
shadowTCR.bit.TSS = 1;
shadowTCR.bit.TIF = 0;
CpuTimer0Regs.TCR.all = shadowTCR.all;
// Check the TIF flag
if(CpuTimer0Regs.TCR.bit.TIF == 1)
{
// TIF set, insert action here
// NOP is only a place holder
asm("NOP");
}
- c2000
MOVW DP,#0x0030 ;4/2?
MOV AL,@4 ;2
ORB AL,#0x10 ;2
MOVL XAR5,#0x000C00 ;4
AND AL,@AL,#0x7FFF ;4
MOV *+XAR5[4],AL ;2
TBIT *+XAR5[4],#15 ;4
SBF L1,NTC ;2 (7 bit forward range)
NOP ;2 ; placeholder
L1:
26 bytes (24 if DP can be done by MOVZ)
- risc-v + XTightlyCoupledIO
tio.bclri t0, CpuTimer0_TCR, CpuTimer0_TCR_TIF_Pos
tio.bseti CpuTimer0_TCR, t0, CpuTimer0_TCR_TSS_Pos
tio.bsbclri CpuTimer0_TCR, CpuTimer0_TCR_TIF_Pos, L1 // 11 bit forward range
nop // placeholder
L1:
14 bytes
Note: write-1-to-clear bits are usually kept separate from control registers.
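Why TIF is cleared in the shadow copy: it is write-1-to-clear, so writing back the value that was read would clear a pending flag. A host-side sketch that models such a register (TIF at bit 15 and TSS at bit 4, as the 0x7FFF and 0x10 masks in the c2000 listing imply; the macro and helper names are hypothetical) and compares a naive read-modify-write with the preserving one:

```c
#include <stdint.h>

#define TCR_TIF (1u << 15)  /* write 1 to clear */
#define TCR_TSS (1u << 4)   /* ordinary read/write bit */

/* Model of a TCR write: TIF is cleared by writing 1 and kept by writing 0,
 * every other bit simply takes the written value. */
static uint16_t tcr_write(uint16_t current, uint16_t written)
{
    uint16_t tif = (uint16_t)(current & TCR_TIF);
    if (written & TCR_TIF)
        tif = 0;
    return (uint16_t)((written & ~TCR_TIF) | tif);
}

/* Naive TCR |= TSS: writes back TIF as read, clearing a pending flag. */
static uint16_t stop_naive(uint16_t tcr)
{
    return tcr_write(tcr, (uint16_t)(tcr | TCR_TSS));
}

/* Preserving: clear TIF in the copy (tio.bclri), then set TSS (tio.bseti). */
static uint16_t stop_preserving(uint16_t tcr)
{
    uint16_t shadow = (uint16_t)(tcr & ~TCR_TIF);
    return tcr_write(tcr, (uint16_t)(shadow | TCR_TSS));
}
```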
Taken from [12], par. 7.
- using "shadow register" to force 32bit access
union CANMC_REG shadowCANMC;
// 32-bit read of CANMC
shadowCANMC.all = ECanaRegs.CANMC.all;
shadowCANMC.bit.SCB = 1;
// 32-bit write of CANMC
ECanaRegs.CANMC.all = shadowCANMC.all;
- c2000
MOVW DP,#0x0180 ;4/2?
MOVL ACC,@20 ;2
OR @AL,#0x2000 ;4
MOVL @20,ACC ;2
12 bytes (10 if DP can be done by MOVZ)
- risc-v + XTightlyCoupledIO
tio.cm.bseti ECana_CANMC, ECana_CANMC_SCB_Pos
2 bytes (4 if tio.cm is unavailable)
Note: risc-v is naturally 32-bit, no gimmicks required.
Note: C/C++ allows compilers to generate narrower accesses to type-punned volatile bitfields, and it does happen [13],[14] unless the -fstrict-volatile-bitfields flag is provided. Therefore an explicit volatile load/store must always be performed to safely use type punning through bitfields. (c2000 cannot perform accesses narrower than 16 bits, so its compiler cannot break 16-bit peripherals.)
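The pattern recommended by the note, sketched on a host: one explicit 32-bit volatile load into a non-volatile union, bitfield edits on the copy, and one explicit 32-bit volatile store. The union layout (SCB at bit 13, as the 0x2000 mask in the c2000 listing implies) and the fake register variable are assumptions for illustration; bitfield allocation order is implementation-defined, and the positions here hold for the usual LSB-first ABIs such as gcc on little-endian targets:

```c
#include <stdint.h>

/* Hypothetical layout: only SCB is named, the rest are placeholders. */
typedef union {
    uint32_t all;
    struct {
        uint32_t low  : 13;  /* bits 0..12 */
        uint32_t SCB  : 1;   /* bit 13 */
        uint32_t high : 18;  /* bits 14..31 */
    } bit;
} canmc_sketch;

/* Stand-in for the memory-mapped CANMC register. */
static volatile uint32_t fake_CANMC;

static void canmc_set_scb(volatile uint32_t *reg)
{
    canmc_sketch shadow;
    shadow.all = *reg;   /* exactly one 32-bit volatile load */
    shadow.bit.SCB = 1;  /* edit the non-volatile copy, access width is irrelevant here */
    *reg = shadow.all;   /* exactly one 32-bit volatile store */
}
```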
Note: interrupt overhead and related optimizations are out of the scope of XTightlyCoupledIO, therefore only a C function scenario is analysed. See [20] for further irq latency analysis.
Magic numbers and the overall design follow [21], with the following assumptions:
- Vref (aka target voltage, not to be confused with the ADC reference voltage) is set by a DAC on the differential ADC, or subtracted from the result by the ADC (ADC_OFRy).
- The ADC handles sign extension to 16 bits (right aligned).
- The timer saturates the output to maximum duty (assuming >16 bit values are not produced, or are handled by the timer).
- Anti-windup accumulator saturation (as used by denominators) is not considered.
- An early conversion trigger is not available.
Note: on stm32g4 it is possible to implement a c2000/dsPIC-like early ADC trigger, but only "normal" channels can generate the EOSMP flag. This is useless, because such a conversion can be interrupted by injected channels (and injected channels are the ones used for control loops). The early trigger happens at least 36 cycles ahead of the end of conversion (at 170 MHz, with a 12.5 cycle conversion and a 60 MHz ADC clock), which requires an additional wait loop.
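The 36-cycle figure is the conversion time expressed in CPU cycles: 12.5 ADC clock cycles at 60 MHz is about 208 ns, i.e. about 35.4 cycles at 170 MHz, rounded up to 36. A small integer-only sketch of that arithmetic (the function name and the half-cycle parameter are illustrative choices, not from the source):

```c
#include <stdint.h>

/* CPU cycles elapsing during an ADC conversion, rounded up.
 * conv_half_cycles is the conversion length in half ADC clock cycles
 * (12.5 -> 25), which keeps the calculation in integers. */
static uint32_t conv_time_cpu_cycles(uint32_t conv_half_cycles,
                                     uint32_t f_adc_hz, uint32_t f_cpu_hz)
{
    uint64_t num = (uint64_t)conv_half_cycles * f_cpu_hz;
    uint64_t den = 2u * (uint64_t)f_adc_hz;
    return (uint32_t)((num + den - 1u) / den);   /* ceiling division */
}
```

For example, conv_time_cpu_cycles(25, 60000000, 170000000) gives the 36 cycles quoted above.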
- 3p3z compensator irq, implemented using a transposed direct form II IIR filter
#include <algorithm>
// those numbers are obtained by external calculators/tools
#define B0 (1.553498447795f)
#define B1 (-1.361492224301f)
#define B2 (-1.547612874966f)
#define B3 (1.367377797130f)
#define A1 (1.521558814252f)
#define A2 (-0.356458881462f)
#define A3 (-0.165099932790f)
#define K (115.36533642f)
// aggregate into struct to avoid address loads of every single global variable
typedef struct {
// delay line
float Z[3]; // -1 indexed, as Z0 is handled on the fly
// keep those constants in memory, as compilers tend
// to put them right next to the code, causing a von Neumann bottleneck
// It is possible to optimize those out when something fits
// in `f.li` (Zfa) or `lui` (Zfinx) instructions, but that's a lot of manual work
float b[4];
float a[3]; // -1 indexed, as a0 is skipped
} TDF2_3p3z;
TDF2_3p3z buck2 = {
.Z = {},
.b = {B0*K,B1*K,B2*K,B3*K},
.a = {A1,A2,A3}
};
void ADC1_IRQHandler()
{
// ADC doesn't sign extend to 32bits, cast it by load insn
float in = (float)(*(volatile int16_t*)&ADC1->DR);
float out = buck2.Z[0] + in * buck2.b[0];
// saturate negative, timer saturates positive
// casting directly to unsigned is UB in C/C++ (and it does break on x86)
HRTIM1_TIMA->CMP1xR = std::max((int)out, 0);
// defer non critical code to after the timer write
ADC1->ISR = ADC_ISR_JEOC; // ack the interrupt
buck2.Z[0] = buck2.Z[1] + in * buck2.b[1] + out * buck2.a[1 - 1];
buck2.Z[1] = buck2.Z[2] + in * buck2.b[2] + out * buck2.a[2 - 1];
buck2.Z[2] = in * buck2.b[3] + out * buck2.a[3 - 1];
}
- risc-v
ADC1_IRQHandler(): # @ADC1_IRQHandler()
lui a0, 327680
lh a1, 64(a0)
lui a2, %hi(buck2)
flw ft0, %lo(buck2)(a2)
addi a3, a2, %lo(buck2)
flw ft1, 12(a3)
fcvt.s.w ft2, a1
fmadd.s ft0, ft2, ft1, ft0
fcvt.w.s a1, ft0, rtz
max a1, a1, zero
lui a4, 262167
sw a1, -1892(a4)
li a1, 32
sw a1, 0(a0)
flw ft1, 4(a3)
flw ft3, 16(a3)
flw ft4, 28(a3)
fmadd.s ft1, ft2, ft3, ft1
fmadd.s ft1, ft0, ft4, ft1
fsw ft1, %lo(buck2)(a2)
flw ft1, 8(a3)
flw ft3, 20(a3)
flw ft4, 32(a3)
fmadd.s ft1, ft2, ft3, ft1
flw ft3, 36(a3)
flw ft5, 24(a3)
fmadd.s ft1, ft0, ft4, ft1
fsw ft1, 4(a3)
fmul.s ft0, ft0, ft3
fmadd.s ft0, ft2, ft5, ft0
fsw ft0, 8(a3)
ret
Note: recent llvm versions allocate the fa0-fa5 registers first.
- armv7-m
ADC1_IRQHandler():
mov r1, #1342177280
ldr r0, .L2
ldrsh r3, [r1, #64]
vmov s14, r3 @ int
ldr r3, .L2+4
vcvt.f32.s32 s14, s14
vldr.32 s13, [r3, #12]
vldr.32 s15, [r3]
vldr.32 s12, [r3, #16]
vfma.f32 s15, s13, s14
vcvt.s32.f32 s13, s15
vmov r2, s13 @ int
vldr.32 s13, [r3, #4]
vfma.f32 s13, s12, s14
bic r2, r2, r2, asr #31
str r2, [r0, #156]
vldr.32 s12, [r3, #28]
vfma.f32 s13, s12, s15
movs r2, #32
str r2, [r1]
vldr.32 s12, [r3, #20]
vstr.32 s13, [r3]
vldr.32 s13, [r3, #8]
vfma.f32 s13, s12, s14
vldr.32 s12, [r3, #32]
vfma.f32 s13, s12, s15
vstr.32 s13, [r3, #4]
vldr.32 s13, [r3, #36]
vmul.f32 s15, s15, s13
vldr.32 s13, [r3, #24]
vfma.f32 s15, s13, s14
vstr.32 s15, [r3, #8]
bx lr
.L2:
.word 1073833984
.word .LANCHOR0
Note: the gcc result was obtained with a non-volatile cast, due to a missed optimization: float in = (float)(*(int16_t*)&ADC1->DR);
- risc-v + XTightlyCoupledIO
ADC1_IRQHandler():
// if ADC did sign extension to whole 32 bits we could convert it directly
//tio.fcvt.s.w fa0, ADC1_DR // tio.fcvt.s.h not available
tio.sbfextracti a1, ADC1_DR, 0, 16 // tio.sext.h not available
fcvt.s.w fa0, a1
lui a0, %hi(buck2) // lui+addi not needed when it can be `gp` relaxed
addi a0, a0, %lo(buck2) // can be omitted if struct doesn't span +/-2KiB boundary
flw fa1, 0(a0) // Z[0]
flw fa2, 12(a0) // b[0]
fmadd.s fa1, fa0, fa2, fa1
fcvt.w.s a1, fa1, rtz
tio.max HRTIM1_TIMA_CMP1xR, a1, zero
tio.bseti ADC1_ISR, zero, ADC_ISR_JEOC_Pos // can also tio.addi
flw fa2, 4(a0) // Z[1]
flw fa3, 16(a0) // b[1]
fmadd.s fa2, fa0, fa3, fa2
flw fa3, 28(a0) // a[0]
fmadd.s fa2, fa1, fa3, fa2
fsw fa2, 0(a0) // Z[0]
flw fa2, 8(a0) // Z[2]
flw fa3, 20(a0) // b[2]
fmadd.s fa2, fa0, fa3, fa2
flw fa3, 32(a0) // a[1]
fmadd.s fa2, fa1, fa3, fa2
fsw fa2, 4(a0) // Z[1]
flw fa3, 24(a0) // b[3]
fmul.s fa2, fa0, fa3
flw fa3, 36(a0) // a[2]
fmadd.s fa2, fa1, fa3, fa2
fsw fa2, 8(a0) // Z[2]
ret
Note: register pressure is 2 scalar and 4 fp registers, or possibly 5 in total with Zfinx. Applying pipeline optimizations may increase it slightly.
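The two integer steps done with tio operations in this handler can be modeled on a host: tio.sbfextracti a1, ADC1_DR, 0, 16 sign-extends the 16-bit ADC result, and tio.max clamps negative duty values to zero before the store (the std::max((int)out, 0) of the C source, compiled as fcvt.w.s with rtz followed by max). A sketch with made-up function names, assuming the float value is within int range (the C cast is UB otherwise):

```c
#include <stdint.h>

/* Model of a signed bitfield extract of bits 0..15 (sbfextracti x, DR, 0, 16). */
static int32_t sext16(uint32_t dr)
{
    int32_t v = (int32_t)(dr & 0xFFFFu);
    if (v & 0x8000)
        v -= 0x10000;   /* propagate bit 15 */
    return v;
}

/* Model of the truncating float to int conversion followed by max(x, 0). */
static int32_t duty_from_output(float out)
{
    int32_t v = (int32_t)out;   /* C cast truncates toward zero (rtz) */
    return v > 0 ? v : 0;
}
```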
Assuming all instructions execute in 1 cycle and there are no pipeline hazard bubbles:
- Results
| risc-v | armv7-m | risc-v + XTightlyCoupledIO |
---|---|---|---|
total cycles (possible) | 32 (30) | 33 | 28 (26) |
non filter loads/stores | 2 | 4 (2 pcrel) | 0 |
cycles to PWM (possible) | 12 (9) | 16 (13) | 9 (7) |
Note: the "possible" results assume gp relaxation of all filter variable loads and deferring all unnecessary instructions. Additional cycles, not accounted for here, can be gained by saturating negative values to zero with the float to unsigned int conversion, which is UB in C/C++ (1 cycle in risc-v and 2 in armv7-m). Another one comes from the ADC sign extending to 32 bits (1 cycle for armv7-m and XTightlyCoupledIO).
In order to reduce phase erosion (by up to 18 degrees according to [22],[21]), the ADC blanking period has to be extended towards the end of the switching cycle. The following techniques can be employed to improve the sample to PWM update latency.
- Use the ADC early trigger (if available). When the conversion extends over the computations, an explicit wait may be necessary before reading the result (while(!(ADC1->ISR & ADC_ISR_JEOSMP)); which can resolve to 1: tio.bsbclri ADC1_ISR, ADC_ISR_JEOSMP_Pos, 1b) [22].
- Use transposed direct form II, with its natural result latency of 1 MAC operation.
- Precompute most of the accumulations in the previous cycle (e.g. in direct form I), as described in [22].
- Defer the state updates until after the write to the timer registers.
- Apply the gain factor (K) to the numerator coefficients instead of applying it separately (straightforward only with FP implementations).
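The transposed direct form II and deferred-update points can be checked on a host: the output is available after a single MAC and the state update can follow the timer write, yet the result matches a direct form I evaluation of the same difference equation. A sketch using the buck2 sign convention (a coefficients added, stored -1 indexed); the struct and function names are hypothetical:

```c
/* Hypothetical host-side sketch, sign convention as in buck2:
 * y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + b3*x[n-3]
 *      + a1*y[n-1] + a2*y[n-2] + a3*y[n-3]   (a[0] holds a1) */
typedef struct {
    float z[3];
    float b[4];
    float a[3];
} tdf2_sketch;

/* Transposed direct form II: one MAC to the output, then deferred update. */
static float tdf2_step(tdf2_sketch *f, float in)
{
    float out = f->z[0] + in * f->b[0];
    /* the timer write would go here */
    f->z[0] = f->z[1] + in * f->b[1] + out * f->a[0];
    f->z[1] = f->z[2] + in * f->b[2] + out * f->a[1];
    f->z[2] = in * f->b[3] + out * f->a[2];
    return out;
}

/* Direct form I reference: the whole sum is needed before the output exists. */
typedef struct {
    float x[3];   /* x[n-1], x[n-2], x[n-3] */
    float y[3];   /* y[n-1], y[n-2], y[n-3] */
    float b[4];
    float a[3];
} df1_sketch;

static float df1_step(df1_sketch *f, float in)
{
    float out = in * f->b[0] + f->x[0] * f->b[1] + f->x[1] * f->b[2]
              + f->x[2] * f->b[3] + f->y[0] * f->a[0] + f->y[1] * f->a[1]
              + f->y[2] * f->a[2];
    f->x[2] = f->x[1]; f->x[1] = f->x[0]; f->x[0] = in;
    f->y[2] = f->y[1]; f->y[1] = f->y[0]; f->y[0] = out;
    return out;
}
```

With exactly representable coefficients both forms give identical results; with real coefficients they agree only up to rounding.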
Note: some of these techniques can affect the latency by imposing additional register pressure. Even if the output is computed early, the compiler will spill all registers before executing the actual code. Compilers also tend to reschedule code around the sensitive IO write. Therefore assembly implementations may be necessary.
In case of a simple ADC, the target voltage can be subtracted directly from the ADC readout:
int32_t in = ADC1->DR - Vref;
which can resolve to:
// Vref in a0
tio.sub a0, ADC1_DR, a0
alternatively:
// Vref in fa0
tio.fcvt.s.wu fa1, ADC1_DR
fsub.s fa0, fa1, fa0
- by adc early trigger
1: tio.bsbclri ADC1_ISR, ADC_ISR_JEOSMP_Pos, 1b
tio.fcvt.s.w fa0, ADC1_DR
fmadd.s fa1, fa0, fa2, fa1 // * b[0] + Z[0]
fcvt.w.s a1, fa1, rtz
tio.max HRTIM1_TIMA_CMP1xR, a1, zero
Note: tio.bsbclri/seti can be implemented as a pipeline stall until the condition is met, which avoids the jitter and latency introduced by the branch overhead.
- by "preserving shadow registers" (+Zfinx)
// allocation of preserving shadow registers
// x4 - x
// x5 - A1
// x6 - A2
// x7 - A3
// x12 - x
// x13 - x
// x14 - x
// x15 - x
// x20 - Z[0]
// x21 - Z[1]
// x22 - Z[2]
// x23 - x
// x28 - B0
// x29 - B1
// x30 - B2
// x31 - B3
ADC1_IRQHandler():
tio.fcvt.s.w a0, ADC1_DR // adc sign extends
fmadd.s a1, a0, x28, x20
fcvt.wu.s a2, a1 // saturate negative, UB in C/C++
tio.cm.mv HRTIM1_TIMA_CMP1xR, a2 // can do tio.max after fcvt.w
tio.bseti ADC1_ISR, zero, ADC_ISR_JEOC_Pos // can also tio.addi
fmadd.s x20, a0, x29, x21
fmadd.s x21, a0, x30, x22
fmul.s x22, a0, x31 // a0 no longer needed, can do the PWM here at lower reg pressure
fmadd.s x20, a1, x5, x20
fmadd.s x21, a1, x6, x21
fmadd.s x22, a1, x7, x22
ret
12 instructions in total, 4 of them up to the PWM write (7 at lower register pressure, which fits in rv32e registers). It can only be implemented in assembly.
Note: in [20], initializing these shadow registers would require triggering a SW deferred handler configured at the desired nesting priority.
Note: FMA4 instructions make it possible to get rid of 2 unnecessary moves (as per the TDF2 implementation).
- [1] https://web.archive.org/web/20111213030633/http://www.atmel.com/dyn/resources/prod_documents/DOC1292.PDF
- [2] https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf
- [3] https://www.ti.com/lit/ug/spruij2/spruij2.pdf?ts=1678361442691
- [4] https://mythopoeic.org/BBB-PRU/am335xPruReferenceGuide.pdf
- [5] https://www.ti.com/lit/ug/spru430f/spru430f.pdf?ts=1677869437551
- [7] https://github.com/jnk0le/random/tree/master/stm32_7segment
- [8] https://liu.diva-portal.org/smash/get/diva2:1636414/FULLTEXT01.pdf
- [9] https://github.com/mjbots/moteus/commit/a398d0c4fde08ea5a585bbf0d53da6be422e0915
- [10] http://www.ee.iitb.ac.in/~ccgroup/old/Lab_pages/experiment_files/TI.pdf
- [11] https://stackoverflow.com/questions/24542964/aliasing-type-punning-unions-structs-and-bit-fields-in-c99
- [12] http://staff.ii.pw.edu.pl/kowalski/dsp/F28x/F2808_page/spraa85a.pdf
- [13] https://stackoverflow.com/questions/67340350/bitfield-write-size
- [14] https://stackoverflow.com/questions/42171429/force-gcc-to-access-structs-with-words
- [15] https://opensecuritytraining.info/IntroBIOS_files/Day1_04_Advanced%20x86%20-%20BIOS%20and%20SMM%20Internals%20-%20IO.pdf
- [16] https://www.xmos.com/download/The-XMOS-XS1-Architecture(X7879A).pdf
- [17] https://www.xmos.com/download/XMOS-Programming-Guide-(documentation)(E).pdf
- [18] https://www.nxp.com/docs/en/reference-manual/DSP56000UM.pdf
- [19] https://www.nxp.com/docs/en/reference-manual/DSP56800FM.pdf
- [21] https://www.st.com/resource/en/application_note/an5305-digital-filter-implementation-with-the-fmac-using-stm32cubeg4-mcu-package-stmicroelectronics.pdf
- [22] https://www.how2power.com/newsletters/1603/articles/H2PToday1603_design_Microchip.pdf?NOREDIR=1