XTightlyCoupledIO

Table of Contents

revision history
preface
1. Introduction
2. programmers model
3. XTightlyCoupledIO subextensions
Appendix A: code samples
Bibliography

Jan Oleksiewicz jnk0le@hotmail.com
document version 3.2.50
extension status: unstable/PoC
This document is released under a Creative Commons Attribution 4.0 International License

revision history

Version	Change
v3.2.50	trimmed non functional changes from revision history, smaller font
v3.2.0	improved sideattach syntax, extra notes
v3.1.0	added `tio.nop` and `tio.nop.sideattach`
v3.0.0	rework of encodings, removed destructive shifts and `beq.y`/`bne.y`, canonical move into IO uses rs2 for symmetry with compressed encoding
v2.8.0	added fence interop
v2.7.0	added relaxation section, supplementary instrs can also be pseudoinstrs
v2.6.0	added sideOPdelay subextension
v2.5.0	initial memory model
v2.4.0	added bitfield insert from immediate
v2.3.0	use `.xi` suffix for reg-imm beqi
v2.2.0	added `tio.beqi.x` and `tio.bnei.x`
v2.1.0	added `tio.beq` and `tio.bne`
v2.0.0	major rework of encodings, the `.yy` is now destructive `.y` form, removed `tio.slt`/`tio.sgt` instructions, shuffled subextensions, added reg-reg single bit instructions, minor fixes

preface

This document uses semantic versioning with respect to potential hardware designs. An assembly syntax change is a minor increment. Version 1.0.0 is the first publicly released one. Changes in prior versions were not versioned properly and are not tracked in the revision history. The major version number does not indicate freeze or ratification status.

The document is written in a way that minimizes duplication, as duplicated content is hard to maintain.

No attempt has been made yet to optimize the instruction encodings, other than staying close to canonical risc-v encodings.

The spec can be donated (FOSS org??), if that allows it to undergo more comparative studies and proceed to "standardization".

1. Introduction

The scope of the XTightlyCoupledIO extension is to reduce code size and register pressure and to increase performance in peripheral accessing code, all of which results in reduced latency in control loops etc.

This spec was created solely because we would otherwise have to wait for a proprietary one.

And if we are talking about proprietary extensions, they are usually:

Done wrong, mainly because those specs are created on tight deadlines without community feedback (like the severely missing instructions in XTheadBs)
Not done at all (the most obvious and common approach)
Those specs also almost never see the outside world, and if they do, they are very badly documented or not documented at all (let’s guess what custom instructions the ch32v003 or ch32v307 implement…)
They also focus on gpio too much, leaving out the most frequently used or most critical peripherals.

Note	In modern microcontroller codebases GPIO tends to be accessed less frequently than other peripherals. The reason is simple: if the peripherals are present, they no longer have to be bit-banged over GPIO as was done in the past.

My observations of frequent peripheral access patterns are:

only single bit needs to be modified or branched on
register is written with a heavy constant (including memory addresses)
register written with zero
in specific cases like STM32 BSRR or flag clearing, a single bit or inverted single bit constant is used
the register content comes directly from/to memory
otherwise the content is used in/comes from computations
register content is immediately converted to float for computation
small bitfields are extracted or inserted from/to registers

Note

Also the C/C++ volatile specifier prevents many possible compiler optimizations. The "side effecting" accesses must follow what was written in the source code exactly, even though a read + 2 single bit branches could actually be optimized into just two tio.bsb*.y instructions. There is no way to distinguish whether the intent was to avoid side effects, to take a snapshot of the status flags at one point in time, or just an optimization for typical architectures.

1.1. prior art

1.1.1. avr8

avr8 provides 64 IO registers, each accessible by the in and out instructions, 32 of them being available to the single bit instructions. All registers are available through both the IO address space and the memory address space.

Single bit instructions consist of:

sbi and cbi for setting and clearing IO bits
sbis and sbic that can skip one instruction if IO bit is set/cleared
sbrc and sbrs that can skip one instruction if bit in general purpose register is set/cleared

There are also gpior registers that serve as scratch registers for e.g. global variables/flags. Those have to be used explicitly in source code.

everything looks clean and nice but…

let’s have a look at how efficiently it’s used:

atmega8

3 reserved registers in bottom io space
8 non-bit registers in bottom io space

atmega328p

The most used chip in arduino, as well as the most cloned one.

15 reserved registers in bottom io space
10 reserved registers in upper io space
many registers available only as memory mapped

xmega

half of the bottom IO space is dedicated for GPIO (aka gpior) registers
the other half is taken by VPORTs that can map to any gpio port configured
area between 0x1f and 0x30 is not populated at all
0x30 to 0x3f is populated by "CPU"

VPORTs have to be configured and used explicitly in source code.

AVR-DA

One of the most recent avr8 families, released after the Microchip acquisition.

similarly to xmega, there are only 7 GPIO virtual ports and 4 GPR (aka gpior) registers
the upper part is populated only by the "CPU"

It is also worth mentioning that the avr8 architecture has not been licensed to 3rd parties the way the 8051 was, even though it could offer better PPA [1] and development ease than the average "1T" 8051. Today we have only a few chinese clones of atmega328p, made possible by expired patents.

1.1.2. ti PRU

Proprietary TI RISC architecture [3]. Popularized by the beaglebone SBCs.

Only the GPIO pins are mapped to r30 and r31 registers, though sometimes there is a mux on r30/r31 interfaces with e.g. MII or shift registers [4] (5.2.2)

special instructions for:

set/clear bit
branch if bit is set/cleared

Source and destination operands can independently address their bytes and half-words.

1.1.3. ti c2000

Proprietary TI accumulator-memory architecture [5] similar to the classic CISCs.

Peripherals can be accessed using indirect (XAR pointer registers) or DP addressing (16bit + 6bit offset from instr). Provides AMO-ALU instructions as well as integer to float conversions.

The CLA can also convert to float directly from memory (including peripherals)

[6] claims 2 cycles for ADC reg to float, Fig 4-3 claims 3x cycle speedup over cortex m4 (stm32g4)

1.1.4. cortex m0+ single cycle IO

Uses exactly the same code as memory mapped IO, but the loads and stores execute in 1 cycle instead of 2 cycles.

1.1.5. PIO (in RP2040)

Referred to as a programmable state machine, able to emulate serial and parallel peripherals over GPIO. Very limited instruction set.

Assumes cycle accurate, single cycle micro architecture.
Has an optional "side-set" operation and a delay which stalls execution of any following instruction.

1.1.6. 8051

8051 dedicates half of IRAM address space (aka zero page) for IO SFRs. SFRs are not available by indirect addressing as it targets the "hidden" SRAM.

0x20-0x2F memory range is bit-addressable. 8 vertical (0x80, 0x88, 0x90…) SFR registers are bit addressable. Some of them are pre-occupied by (mandatory) standard SFRs, including the accumulator A and the less useful B.

bit-addressable registers can be operated by special irregular instructions:

set/clear/complement bit
jump if bit is set/clear
jump if bit is set then clear it
mov between bit and carry flag
and/or operation of carry flag into bit (or its inverse)

1.1.7. x86

x86 offers a 16 bit IO address space accessible by in* and out* instructions [15]

There are some legacy peripherals at fixed IO addresses. The rest are typically remappable.

Originally designed for 8080/8086 peripherals hanging on an off-chip bus, and thus not tightly integrated. Today it serves as legacy ballast, as the address space is no longer constrained and the code size gains are negligible.
This is especially true considering that the offending peripherals typically use MMIO mappings instead anyway.

1.1.8. xmos

Xmos went for software defined peripherals with barrel processing. [16], [17]

The IO ports can be divided into 1, 4, 8, 16, or 32 bit widths.
Buffered by shift registers, clocked by a timer or external clock.
Accessible by in and out (including partial and shifting variants) instructions.

1.1.9. 56k

Original 56000 [18] architecture offers IO address space that could be accessed by a 6bit immediate addressing mode ("6-bit I/O Short Address")
Provided by the following instructions:

jump if bit is set/clear
jump to subroutine if bit is set/clear
bit test (and set/clear/change) instructions (updates carry flag)

Later versions (e.g. 56800) [19] extended the single bit into a bitmask match where all of selected bits must be set or cleared to cause the condition.
Masks in branching instructions are limited to 8 bits, targeting top or bottom byte.

1.2. alternative approaches

1.2.1. map to upper GPR

Available on RVE only. Limited to 16 GPR mapped registers. Allows reuse of a major part of the microarchitectural pipeline as well as the standard risc-v instructions operating on GPRs.

1.2.2. use custom `csr` registers

csrr* instructions implement an atomic swap and immediate bitmask set/clear operations.

However csr registers are generally used to modify core architectural behaviour and thus perform slower than expected.

Note	for this reason RISC-V V spec forbids writes to `vtype` and `vl` with anything but `vsetvl` instructions

Note	xpulp extension is also planning on disallowing writes to hwloop registers with general csr instructions

1.2.3. bitbanding

Implemented by cortex-m3 and cortex-m4

Not available on cortex-m0 and cortex-m7, optional on cortex-m3/m4.
Still requires loading of base address for bitbanded bit. Must be used explicitly in source code
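The address computation that bitbanding forces onto the programmer can be sketched in C as below. This is a minimal illustration for the peripheral bit-band region of cortex-m3/m4 (region base 0x40000000, alias base 0x42000000, 1 MiB region); the function name `bitband_alias` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3/M4 peripheral bit-band region and its alias region. */
#define PERIPH_BB_BASE   0x40000000u
#define PERIPH_BB_ALIAS  0x42000000u
#define PERIPH_BB_SIZE   0x00100000u   /* 1 MiB of bit-bandable space */

/* Address of the alias word that maps to bit `bit` of the 32-bit
 * register located at `reg_addr`. Each bit gets its own 4-byte word. */
static uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    assert(reg_addr >= PERIPH_BB_BASE);
    assert(reg_addr - PERIPH_BB_BASE < PERIPH_BB_SIZE);
    assert(bit < 32);
    return PERIPH_BB_ALIAS + (reg_addr - PERIPH_BB_BASE) * 32u + bit * 4u;
}
```

The resulting alias address is a full 32-bit constant, so the access still costs an address load before the actual store, which is the overhead this section points at.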

1.2.4. special purpose write only registers

Special kind of write only registers e.g BSRR/IFCR found in STM32 and clones.
Still require loading of peripheral base address. Requires also generating preformatted (shifted) constants even if only single bit is written.

Note	BSRR is still useful for `tio.mv` access as it can work on non-contiguous bitfields or content from pre generated lookup tables [7]
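The "preformatted (shifted) constants" mentioned above can be illustrated with a small C sketch. It assumes the usual STM32 BSRR layout, where bits 0..15 set the corresponding pin and bits 16..31 reset it; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Constant to write to BSRR in order to set output `pin` (0..15). */
static uint32_t bsrr_set(unsigned pin)
{
    assert(pin < 16);
    return UINT32_C(1) << pin;
}

/* Constant to write to BSRR in order to reset output `pin` (0..15). */
static uint32_t bsrr_reset(unsigned pin)
{
    assert(pin < 16);
    return UINT32_C(1) << (pin + 16);
}
```

Even for a single pin, the compiler has to materialize both this constant and the peripheral base address before the store.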

1.2.5. use reserved registers in ABI deviations

Similar to ti PRU approach.

Only a few registers can be reserved like that. It takes out general purpose registers from use leading to less efficient code. Some assembly code would have to be rewritten to avoid now reserved registers.

Note	ABI deviations are not standardized at this moment

1.2.6. use AMO-op instructions

There is limited availability of A extension across embedded cores.

Still requires loading of base address.
Base address must be generated with full lui + addi sequence as there is no immediate offset like in regular load/store instructions.
Implements only swap/add/or/and/xor/min/max operations.

1.3. omitted instructions

Note	still available in first alternative approach as well as ABI deviations one

1.3.1. load to IO/store from IO register

Useful to directly store or load IO content to/from memory without processing. However, such an access is non deterministic and can trap due to e.g. alignment or pmp restrictions, violating the atomicity guarantee (or requiring expensive workarounds). These instructions would also consume a lot of encoding space.

1.3.2. IO with multiply/multiply-accumulate

Useful for fixed point arithmetic scaling etc.

Sometimes multi cycle, non deterministic.

Even single cycle implementations are potentially problematic to implement as the multiplier can span more pipeline stages than regular ALUs.

In the presence of P or another custom DSP extension, it would be necessary to provide IO versions of the myriad multiply accumulate instructions found there. Otherwise tio.mul + add wouldn’t provide any benefit over a tio.mv + dsp.macc sequence.

Note	if the `mulh` is necessary the `tio.mul` becomes useless

Note	P ext like, `tio.mull.xy` with destination register pair should still be possible

IIR and FIR filters need to cache the raw ADC readings, effectively enforcing use of the tio.mv instead of directly sourced multiplications (or MACs)

Note	Typical control loop IIR/FIR filters are designed to accept raw ADC readings.

Note	Usually ADCs can be configured to do a sign extension of outputs (e.g 12 → 16 bits). `tio.sbfextracti` could be used to perform such sign extension without need for additional sign extensions in ADCs.

1.3.3. inverted single bit constant

Too few use cases to be worth it.

Bottom 11 bits can be done with single instruction:

tio.addi iod, zero, (~(1<<pos))

Otherwise we can achieve this in 2 instructions:

lui t0, %hi(~(1<<pos)) // 'c.' if bit 16-12 zeroed
tio.addi iod, t0, %lo(~(1<<pos))

or

c.li t0, -1
tio.bclri iod, t0, pos
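The "bottom 11 bits" limit above follows from the 12-bit signed immediate of `tio.addi`. A minimal C check of that claim is sketched below; the function names are made up for this example, and the conversion to `int32_t` relies on the two's complement behaviour of common compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when v fits a 12-bit signed immediate (I-type style). */
static bool fits_simm12(int32_t v)
{
    return v >= -2048 && v <= 2047;
}

/* Inverted single-bit constant ~(1 << pos) seen as a signed 32-bit value.
 * Assumes two's complement conversion, as on all mainstream targets. */
static int32_t inv_single_bit(unsigned pos)
{
    assert(pos < 32);
    return (int32_t)~(UINT32_C(1) << pos);
}
```

For pos = 10 the constant is -1025 and still fits; for pos = 11 it is -2049 and needs one of the two-instruction sequences.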

1.3.4. non destructive io-io-reg instructions

Few use cases for independent io to io moves/ops.

Low flexibility of implementations, as the non destructive ops cannot provide AMO like decoupled execution.

Note	Destructive encodings are also justified by the bitfield insert instructions, which are possible only within a destructive encoding.

Note	P extension is about to introduce instructions with destructive `rd` encodings, including IFMA, designated for DSP tasks of the same domain as targeted by XTightlyCoupledIO

1.3.5. `bfp` from 0.94 bitmanip

Requires a 4 instruction sequence to insert a constant. Let’s consider the following sample:

// switch PLL (0b10) to HSE (0b01)
RCC->CFGR = (RCC->CFGR & ~RCC_CFGR_SW_Msk) | (RCC_CFGR_SW_HSE);

using bfp:

li t1, RCC_CFGR_SW_HSE
addi t0, zero, {length[3:0], offset[7:0]}
pack t0, t1, t0
bfp a0, a0, t0

Note	below samples cannot be performed directly on IO sfr (require caching of intermediate result)

In best case scenario it can be done in 2 instructions:

andi a0, a0, ~RCC_CFGR_SW_Msk
ori a0, a0, RCC_CFGR_SW_HSE

or in considered scenario:

bseti a0, a0, RCC_CFGR_SW_Pos
bclri a0, a0, RCC_CFGR_SW_Pos+1

Alternatively a more general sequence (4-6 instructions):

li a1, RCC_CFGR_SW_Msk // non inverted can be a single lui
andn a0, a0, a1 // use ~RCC_CFGR_SW_Msk for and, when Zbb is missing
li a1, RCC_CFGR_SW_HSE
or a0, a0, a1

Note	Can use `bseti` or `bclri` to cover a single bit in a field and avoid loading constants.

In [8], bfp didn’t yield enough improvement.

It would be more efficient if the offset and length of the field could be given as immediate values, so that the preparatory setup steps aren’t needed.
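The `andi`/`ori` and `bseti`/`bclri` sequences above can be compared with a small C model. The field constants are hypothetical stand-ins that mirror the names used in the sample (a 2-bit field at position 0, HSE = 0b01, PLL = 0b10); they are not taken from any vendor header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2-bit clock switch field, modeled after the sample above. */
#define SW_POS 0u
#define SW_MSK (UINT32_C(3) << SW_POS)
#define SW_HSE (UINT32_C(1) << SW_POS)
#define SW_PLL (UINT32_C(2) << SW_POS)

/* Models: andi a0, a0, ~Msk ; ori a0, a0, HSE */
static uint32_t sw_to_hse_and_or(uint32_t cfgr)
{
    return (cfgr & ~SW_MSK) | SW_HSE;
}

/* Models: bseti a0, a0, Pos ; bclri a0, a0, Pos+1 */
static uint32_t sw_to_hse_bset_bclr(uint32_t cfgr)
{
    cfgr |= UINT32_C(1) << SW_POS;          /* bseti */
    cfgr &= ~(UINT32_C(1) << (SW_POS + 1)); /* bclri */
    return cfgr;
}
```

Both models operate on a cached copy of the register, which matches the note that these sequences cannot be performed directly on an IO sfr.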

2. programmers model

The XTightlyCoupledIO extension adds 4 banks of 32 XLEN sized IO registers each. The IO registers are referenced by the rs1 or rd field, which are then named ios1 and iod respectively.

If a given bank is not populated, the corresponding instructions are reserved.

The IO targeting instructions must execute atomically. Therefore those instructions cannot be interrupted with visible side-effects.

Note	number of banks and availability in certain instructions was decided totally arbitrarily, will be refined later

2.1. side effects

For easier mapping to high level languages, any access to IO registers causes side effects as if the entire XLEN sized word was accessed.

A partial modification triggers side effects as if the entire XLEN sized word was read, modified and written back.

GPIOA->OUT |= (1<<13);
//is equivalent to
tio.bseti io123, 13

2.1.1. TODO: grouping of bits from multiple different registers

For more efficient use of IO register space available by certain instructions.

Not reflecting actual memory mapped registers.

2.1.2. memory model of IO access

Accesses to IO registers by tio. instructions follow the TSO memory model with respect to each other. Repeated accesses to the same IO register are sequentially consistent.

Note	TSO model is the best fit for typical in-order pipelines longer than 2-3 stages

Note	implementations cannot reuse operand forwarding to solve RAW hazards of IO registers due to `volatile` rules

Synchronization with (idempotent) memory accesses requires an explicit FENCE.

Access to IO registers by tio. instructions and memory mapped interface is not synchronized.

Note	it would be too expensive to sync read-ALU-writeback stages with memory interface

Note	implementations are still free to microcode `tio.` instructions using memory load and store

2.1.3. `fence` interop

fence instruction orders access of tio. instructions using the PI/PO/SI/SO fields.
RMW operation is interpreted as combined read and write.

It must also properly order tio. accesses with respect to memory mapped IO, that use the same PI/PO/SI/SO fields.

Note	it was decided to not extend `fence` instruction, due to limited use cases

2.2. automatic mapping of memory mapped registers to tightly coupled registers

For efficient use (aka having it used at all) of the tio instructions, the compilers need to automatically translate accesses to memory mapped registers into IO address space.

In case of avr8, the IO address space was mapped linearly to a specific offset in data address space (+0x20).
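The linear avr8 mapping can be written down in a few lines of C. This is only a sketch of the classic layout described above (64 IO registers at data offset +0x20); the names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of the avr8 IO space inside the data address space. */
#define AVR_IO_OFFSET 0x20u
#define AVR_IO_COUNT  64u

/* Data-space address of IO register number `io` (as used by in/out). */
static uint16_t avr_io_to_data(uint8_t io)
{
    assert(io < AVR_IO_COUNT);
    return (uint16_t)(io + AVR_IO_OFFSET);
}

/* Reverse mapping: returns false when `data` lies outside the IO window,
 * i.e. the register is reachable by load/store only. */
static bool avr_data_to_io(uint16_t data, uint8_t *io)
{
    if (data < AVR_IO_OFFSET || data >= AVR_IO_OFFSET + AVR_IO_COUNT)
        return false;
    *io = (uint8_t)(data - AVR_IO_OFFSET);
    return true;
}
```

A compiler can apply the reverse mapping to any constant address, which is what makes automatic use of `in`/`out` straightforward on avr8.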

In case of arm or risc-v the peripherals are scattered over a large memory area with 1024 byte minimum spacing. Because of this there needs to be a special mapping into the IO address space, and we are about to end up with thousands of outdated (sometimes GPL violating) builds of custom toolchains for all of those, as is already happening with interrupt controllers (e.g. WCH hw stacking).

Therefore we need a unified file format describing the peripheral to IO mapping, to be provided by vendors. It will be passed on the compiler command line, similarly to source code or linker scripts.

Note	Those mapping files can be also self made in case of "typical chinese vendors"

Note	Those files could be used to provide named aliases in debuggers/decompilers

Note	it is recommended not to keep registers mapped linearly one after the other but to split them into appropriate banks, e.g. a read/write data register doesn’t need to live in a bit operable bank.

2.2.1. TODO: define the iomapping file format

2.2.2. TODO: named aliases for use in assembly

2.2.3. TODO: IO remap detection in assembly

Even though compilers can automatically do a remap in compiled code, the assembly has to explicitly use the dedicated IO instructions leading to unportable code.

Note	in theory load/store with absolute addressing mode can indeed be relaxed into `in` and `out` instructions, but risc-v doesn’t do an absolute addressing like avr8

In the avr world, portability of IO accessing assembly code was achieved like this:

#if defined(atmega1234)||defined(atmega12345)

#define RDR_REGISTER_IN_IO
#define CONTROL1_REGISTER_IN_IO
#define CONTROL1_REGISTER_IN_LOWER_IO

#elif defined(atmega123456)
//...

And then spam the appropriate #ifdefs throughout the actual code.

As can be seen, each new device has to be added to the config header manually.

Therefore we need a way to discover whether a given peripheral register is remapped into IO space, and to use this information in e.g. #ifdefs

Note	assembly will stay messy with this anyway, especially when number of used register needs to be kept low in default inline interrupts

2.2.4. TODO: automatic mapping of globals to IO scratch registers

Apart from the peripherals, the IO address space can hold avr8 like scratch registers. Those can be used to store the global variables/flags.

it can be:

used explicitly like in avr8
- highly unportable
- falls into "premature optimization" category
- how many avr projects using gpior (aka GPIO aka GPR) did you see so far?
automatically mapped to global variables/flags
- allows those scratch regs to be actually used
- no longer relaxable to gp-rel load/stores
used with explicit attribute e.g. __attribute__((mapto_ioscratch("bsb_accessible,bool_mergable,1cycle")))
- useful for critical control loop globals
- can override default cost function of above option
- variable is not forced into scratch register if specific criteria are not met
- no longer relaxable to gp-rel load/stores

2.2.5. peripherals without memory mapped interface

It is possible to have SFRs that are not mapped to memory address space which are used by e.g. special __attribute__, but this prevents use of pointers to such peripherals.

Pointers are often used to avoid code duplication and resulting size increase [9]. (even wrt. tio access, in some scenarios). Those are also commonly used in various HALs.
Compilers could theoretically track and translate the pointer usage, but this would ultimately lead to highly inefficient code in corner or even regular cases.

Note	still suitable for a dedicated IO slave cores.

2.3. assembly syntax

All IO accessing instructions are prefixed with tio. prefix.
Bank number is part of the instruction name, except supplementary instructions.
The suffix denotes whether the rd or rs1 field targets io registers.
Instructions take the form of tio.instr{n}.{rdm}{rsm} where {n} is the bank number and {rdm} and {rsm} are each substituted with one of the following letters:

x - integer reg
s - floating point reg
y - io reg

Register specifiers use the same letter.

tio.bseti3.y y11, 13 // set bit 13 in io 11 register in bank 3
tio.bseti2.yx y22, zero, 17 // write (1<<17) to io 22 register in bank 2

Note	letter y was picked totally arbitrarily as it’s single letter and doesn’t have conflicts

2.3.1. pseudoinstructions

tio instructions referred to without the bank number and suffix.

Pseudoinstructions use the io name prefix as the register specifier with linearized addressing.

The supplementary instructions with omitted suffix are also considered as pseudoinstructions.

tio.bseti io107, 13 // set bit 13 in io 11 register in bank 3
tio.bseti io86, zero, 17 // write (1<<17) to io 22 register in bank 2
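The linearized `io` numbering used in the two examples above is consistent with index = bank * 32 + register. The following C sketch spells that out; the helper names are hypothetical and the formula is inferred from the examples (io107 is register 11 in bank 3, io86 is register 22 in bank 2).

```c
#include <assert.h>

#define TIO_BANKS          4u
#define TIO_REGS_PER_BANK 32u

/* Linear io index used by pseudoinstructions, from bank and register. */
static unsigned tio_linear(unsigned bank, unsigned reg)
{
    assert(bank < TIO_BANKS && reg < TIO_REGS_PER_BANK);
    return bank * TIO_REGS_PER_BANK + reg;
}

/* Bank number selected by a linear io index. */
static unsigned tio_bank_of(unsigned linear)
{
    assert(linear < TIO_BANKS * TIO_REGS_PER_BANK);
    return linear / TIO_REGS_PER_BANK;
}

/* Register number within its bank for a linear io index. */
static unsigned tio_reg_of(unsigned linear)
{
    assert(linear < TIO_BANKS * TIO_REGS_PER_BANK);
    return linear % TIO_REGS_PER_BANK;
}
```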

2.3.2. Canonical io move instruction

The following instructions are designated as a canonical IO move instructions:

tio.add{n}.yx iod, zero, rs2
tio.add{n}.xy rd, ios1, zero

Available under the tio.mv name in suffixed or linearized form.

Note	The canonical move in base risc-v is an `addi`, but because of limited encoding space, `tio.addi` cannot be provided in all necessary forms. Therefore an alternative instruction was picked.

Note	`tio.add` was picked because an addition is one of the most common operations and the add ALU tends to be the most available one. e.g. cortex-m7 doesn’t provide bitwise and/or/xor in its early ALU

Note	the moves to/from IO registers are not named `in` and `out` as I find those names confusing

2.3.3. code relaxation (aka compression)

Only the pseudo instructions are allowed to be relaxed into a different instruction, be it compressed or different one of the same size.

Note

BTW, this is how it should be done with base riscv instructions where e.g. i.add a0, a0, a1 must always emit exactly the specified encoding and add a0, a0, a1 can be relaxed to a compressed instruction or a different one (e.g. bseti a0, a1, 10 can be turned into ori a0, a1, (1<<10) for assumed, better execution units availability). For now we have only the unreliable and bloaty .option norvc+.option norelax workaround.

2.3.4. sideOP

The sideOP value can optionally be given in square brackets placed after the last instruction parameter, separated by a comma if there is at least one parameter. If omitted, the value 0 is encoded.

If an extension chooses to use a syntax other than a plain uimm[4:0] constant, it must be placed within the square brackets.

If the square brackets contain a single number, it must always be interpreted as a uimm[4:0] constant.

usage

1:	tio.bseti GPIOA_ODR, 13
2:	tio.bseti GPIOA_ODR, 13, [0] // equivalent to 1
3:	tio.bseti GPIOA_ODR, 13, [31]
4:	tio.bseti GPIOA_ODR, 13, [sideset 0b10, 7] // imaginary extension

Note	Square bracket was selected as MIPS syntax inherited by RISC-V doesn’t use those.

Note	pioasm uses it for delay only, not separated by a comma from the rest of the instruction params.

2.4. instruction encodings

When the iom bit is present, it controls whether rd or rs1 targets an IO register.
When high the rd field targets IO register. When low, the rs1 field targets the IO register.

bsel immediate selects the accessed bank number. Bits missing from encodings are implied to be zero.

sideOP encodes a side operation, that will be a part of another extension. Otherwise this field is reserved and must be set to 0b00000 (no extra operation)

3. XTightlyCoupledIO subextensions

The name XTightlyCoupledIO can be used as a catch-all for the following extensions.

3.1. XTightlyCoupledIOsupp

Supplementary instructions useful for alternative upper GPR approach.

Necessary when working on "cached" IO register content, as those cannot be accessed multiple times due to volatile rules.

Note	useful also in non IO code.

3.1.1. tio.bsbseti.x

Synopsis: Branch if single bit in register is set (immediate)
Mnemonic

tio.bsbseti.x rs1, shamt, label

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] },
 { bits: 5, name: 'imm[4:1|11]' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 7, name: 'imm[12|10:5]' },
]}

Note	instruction proposed as Zce 32bit candidate

Note	only bottom 32 bits of target register are accessible on rv64

3.1.2. tio.bsbclri.x

Synopsis: Branch if single bit in register is cleared (immediate)
Mnemonic

tio.bsbclri.x rs1, shamt, label

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] },
 { bits: 5, name: 'imm[4:1|11]' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 7, name: 'imm[12|10:5]' },
]}

Note	instruction proposed as Zce 32bit candidate

Note	only bottom 32 bits of target register are accessible on rv64
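The two single bit branch encodings above can be assembled by hand with the C sketch below. It assumes the fields are listed starting from the least significant bit (as in the usual bitfield diagrams), that the branch offset is packed like a standard B-type immediate, and that `shamt` occupies the rs2 position. The function name is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode tio.bsbseti.x (branch_if_set = true) or tio.bsbclri.x (false).
 * `offset` is the byte distance to the label: even, within -4096..4094. */
static uint32_t encode_bsb_x(bool branch_if_set, unsigned rs1,
                             unsigned shamt, int32_t offset)
{
    assert(rs1 < 32 && shamt < 32);
    assert(offset >= -4096 && offset <= 4094 && (offset & 1) == 0);

    uint32_t imm = (uint32_t)offset;              /* two's complement bits */
    uint32_t funct3 = branch_if_set ? 0x2u : 0x3u;

    return 0x7bu                                  /* CUSTOM-3 opcode */
         | (((imm >> 11) & 0x1u) << 7)            /* imm[11]   */
         | (((imm >> 1) & 0xfu) << 8)             /* imm[4:1]  */
         | (funct3 << 12)
         | ((uint32_t)rs1 << 15)
         | ((uint32_t)shamt << 20)
         | (((imm >> 5) & 0x3fu) << 25)           /* imm[10:5] */
         | (((imm >> 12) & 0x1u) << 31);          /* imm[12]   */
}
```

For example, `encode_bsb_x(true, 10, 13, 8)` models `tio.bsbseti.x a0, 13, label` with the label 8 bytes ahead.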

3.1.3. tio.bfextracti.xx

Synopsis: extract bitfield from register (immediate)
Mnemonic

tio.bfextracti.xx rd, rs1, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'rs1' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 0x0 },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'rs1' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Note	instruction is equivalent to `slli` + `srli` sequence
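The equivalence with a `slli` + `srli` pair can be modeled in C for RV32 as below. This is a sketch of the assumed semantics only: `len` is taken as the plain number of bits (1..32) with offset + len <= 32, and how `len` is actually packed into the 5-bit field is not modeled. Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-pair model: slli by (32 - offset - len), then srli by (32 - len). */
static uint32_t bfextract32(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1 && len <= 32 && offset <= 32 - len);
    return (x << (32 - offset - len)) >> (32 - len);
}

/* Reference model: shift down, then mask `len` bits. */
static uint32_t bfextract32_ref(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1 && len <= 32 && offset <= 32 - len);
    uint32_t mask = (len == 32) ? UINT32_MAX : (UINT32_C(1) << len) - 1;
    return (x >> offset) & mask;
}
```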

3.1.4. tio.sbfextracti.xx

Synopsis: extract and sign extend bitfield from register (immediate)
Mnemonic

tio.sbfextracti.xx rd, rs1, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'rs1' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 0x0 },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'rs1' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Note	instruction is equivalent to `slli` + `srai` sequence
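A typical use of the sign extending variant is a signed 12-bit ADC sample sitting in a wider register. The C sketch below models the assumed RV32 semantics without relying on implementation-defined signed shifts; `len` is again taken as the plain bit count, and the function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `len` bits starting at `offset` and sign extend the result. */
static int32_t sbfextract32(uint32_t x, unsigned offset, unsigned len)
{
    assert(len >= 1 && len <= 32 && offset <= 32 - len);

    /* Zero-extended field, same as the slli + srli pair. */
    uint32_t field = (x << (32 - offset - len)) >> (32 - len);

    /* Sign extension via (field ^ sign) - sign, done in 64 bits so the
     * result is computed without overflow or implementation-defined casts. */
    uint32_t sign = UINT32_C(1) << (len - 1);
    return (int32_t)((int64_t)(field ^ sign) - (int64_t)sign);
}
```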

3.2. XTightlyCoupledIOsuppbfi

Supplementary bitfield insert useful for alternative upper GPR approach.

Necessary when working on "cached" IO register content, as those cannot be accessed multiple times due to volatile rules.

3.2.1. tio.bfinserti.xx

Synopsis: Destructive bitfield insert into register (immediate)
Mnemonic

tio.bfinserti.xx rd, rs1, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x0 },
 { bits: 5, name: 'rs1' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 0x0 },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x0 },
 { bits: 5, name: 'rs1' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Note	due to encoding constraints only destructive form is provided

Note	instruction was proposed for P extension as there are many more rd destructive ones

3.2.2. tio.bfinserti.xi

Synopsis: Destructive bitfield insert into register from immediate (immediate)
Mnemonic

tio.bfinserti.xi rd, uimm, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'uimm[4:0]' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 0x0 },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'uimm[4:0]' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Description: Insert len bits of expanded 'uimm[4:0]' constant into rd register at offset position. The uimm=0 is mapped into -1 constant.

Note	due to encoding constraints only destructive form is provided
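One possible reading of the description above is modeled in C below for RV32. It assumes that "expanded" means zero extension for uimm = 1..31 with uimm = 0 replaced by all ones, and that `len` is the plain bit count; both are assumptions of this sketch, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Expand uimm[4:0]: 0 maps to -1 (all ones), other values zero extend. */
static uint32_t expand_uimm5(unsigned uimm)
{
    assert(uimm < 32);
    return uimm == 0 ? UINT32_MAX : (uint32_t)uimm;
}

/* Model of tio.bfinserti.xi: replace `len` bits of rd at `offset`
 * with the low bits of the expanded immediate. */
static uint32_t bfinserti_xi(uint32_t rd, unsigned uimm,
                             unsigned offset, unsigned len)
{
    assert(len >= 1 && len <= 32 && offset <= 32 - len);
    uint32_t field = (len == 32) ? UINT32_MAX : (UINT32_C(1) << len) - 1;
    uint32_t mask = field << offset;
    return (rd & ~mask) | ((expand_uimm5(uimm) << offset) & mask);
}
```

Under these assumptions the clock switch sample from section 1.3.5 becomes a single insert of the value 1 into a 2-bit field at offset 0, and uimm = 0 fills a field with ones.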

3.3. XTightlyCoupledIOsuppbri

Supplementary instructions for branching against immediate

Necessary for branching on exact pattern match of extracted bitfields.

Note	xpulp does signed immediate in rs2 position, meanwhile Zce v0.50 puts nzuimm in rs1 position

Note	`uimm=0` can be expressed with `beq/bne zero, rs2, label` therefore this case can be reserved or mapped to other constant

Note	`uimm` from rs1 position was selected as it is already used by `csrr*i` as well as `vsetivli` instructions

Note	useful also for lowering general code size and register pressure (for e.g. rv32e or IPRA compilation)

3.3.1. tio.beqi.xi

Synopsis: Branch if equal (immediate)
Mnemonic

tio.beqi.xi rs2, uimm, label

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x63, attr: ['BRANCH'] },
 { bits: 5, name: 'imm[4:1|11]' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'uimm[4:0]' },
 { bits: 5, name: 'rs2' },
 { bits: 7, name: 'imm[12|10:5]' },
]}

Description: Branch to label if rs2 content is equal to expanded 'uimm[4:0]' constant. The uimm=0 is mapped into -1 constant.

3.3.2. tio.bnei.xi

Synopsis: Branch if not equal (immediate)
Mnemonic

tio.bnei.xi rs2, uimm, label

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x63, attr: ['BRANCH'] },
 { bits: 5, name: 'imm[4:1|11]' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'uimm[4:0]' },
 { bits: 5, name: 'rs2' },
 { bits: 7, name: 'imm[12|10:5]' },
]}

Description: Branch to label if rs2 content is not equal to expanded 'uimm[4:0]' constant. The uimm=0 is mapped into -1 constant.

3.4. XTightlyCoupledIOaddi

Single IO addi instruction provided for minimal implementations

3.4.1. tio.addi.yx

Synopsis: Add immediate and write to io register
Mnemonic

tio.addi{bsel}.yx iod, rs1, imm

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 2, name: 0x0 },
 { bits: 1, name: 'bsel' },
 { bits: 5, name: 'rs1' },
 { bits: 12, name: 'imm[11:0]' },
]}

Note	`lui` + `tio.addi` pair can be used to write any 32bit constant into IO register.
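The `lui` + `tio.addi` pair works like the usual `%hi`/`%lo` split: because the 12-bit immediate is sign extended, the upper part has to be rounded by adding 0x800 first. A minimal C sketch of that arithmetic follows; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 32-bit constant into the lui immediate (20 bits) and the
 * sign-extended 12-bit addi immediate. */
static void split_hi_lo(uint32_t value, uint32_t *hi20, int32_t *lo12)
{
    *hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    /* Low 12 bits, minus 4096 when bit 11 is set (sign extension). */
    *lo12 = (int32_t)(value & 0xFFFu) - (int32_t)((value & 0x800u) << 1);
}

/* What the pair computes: (hi20 << 12) + sign-extended lo12, modulo 2^32. */
static uint32_t join_hi_lo(uint32_t hi20, int32_t lo12)
{
    return (hi20 << 12) + (uint32_t)lo12;
}
```

For the inverted single bit constant ~(1<<13) = 0xFFFFDFFF this gives hi20 = 0xFFFFE and lo12 = -1.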

3.5. XTightlyCoupledIOa

General IO alu instructions

3.5.1. tio.add

Mnemonic

tio.add{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x0 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.2. tio.sub

Mnemonic

tio.sub{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x1 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.3. tio.and

Mnemonic

tio.and{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x2 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.4. tio.or

Mnemonic

tio.or{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x3 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.5. tio.xor

Mnemonic

tio.xor{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x4 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}
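
The five reg-reg encodings above (3.5.1 to 3.5.5) differ only in the 4-bit operation field. A minimal encoder sketch follows; it assumes the bitfield listings enumerate fields from bit 0 upward (the usual wavedrom convention), and it passes `iom` and `bsel` through without interpreting them. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Operation field values taken from the encodings in this section. */
enum tio_alu_sel {
    TIO_SEL_ADD = 0x0,
    TIO_SEL_SUB = 0x1,
    TIO_SEL_AND = 0x2,
    TIO_SEL_OR  = 0x3,
    TIO_SEL_XOR = 0x4
};

static uint32_t tio_encode_alu_rr(enum tio_alu_sel sel, uint32_t rd_iod,
                                  uint32_t rs1_ios1, uint32_t rs2,
                                  uint32_t iom, uint32_t bsel)
{
    assert(rd_iod < 32u && rs1_ios1 < 32u && rs2 < 32u);
    assert(iom < 2u && bsel < 4u);

    return 0x2bu                     /* [6:0]   CUSTOM-1 opcode */
         | (rd_iod   << 7)           /* [11:7]  iod/rd          */
         | (0x1u     << 12)          /* [14:12] fixed 0x1       */
         | (rs1_ios1 << 15)          /* [19:15] ios1/rs1        */
         | (rs2      << 20)          /* [24:20] rs2             */
         | ((uint32_t)sel << 25)     /* [28:25] operation       */
         | (iom      << 29)          /* [29]    iom             */
         | (bsel     << 30);         /* [31:30] bsel            */
}
```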

3.5.6. tio.slli

Mnemonic

tio.slli{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x3 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x3 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.7. tio.srli

Mnemonic

tio.srli{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x4 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x4 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.8. tio.srai

Mnemonic

tio.srai{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x5 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x5 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.9. tio.sll

Mnemonic

tio.sll{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x3 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.10. tio.srl

Mnemonic

tio.srl{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x4 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.5.11. tio.sra

Mnemonic

tio.sra{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x5 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.6. XTightlyCoupledIOad

Destructive general IO alu instructions

3.6.1. tio.add.y

Mnemonic

tio.add{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x0 },
 { bits: 2, name: 'bsel' },
]}

3.6.2. tio.sub.y

Mnemonic

tio.sub{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x1 },
 { bits: 2, name: 'bsel' },
]}

3.6.3. tio.and.y

Mnemonic

tio.and{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x2 },
 { bits: 2, name: 'bsel' },
]}

3.6.4. tio.or.y

Mnemonic

tio.or{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x3 },
 { bits: 2, name: 'bsel' },
]}

3.6.5. tio.xor.y

Mnemonic

tio.xor{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x4 },
 { bits: 2, name: 'bsel' },
]}

3.7. XTightlyCoupledIObb

General IO bitmanip instructions

3.7.1. tio.andn

Mnemonic

tio.andn{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x5 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.2. tio.orn

Mnemonic

tio.orn{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x6 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.3. tio.xnor

Mnemonic

tio.xnor{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x7 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.4. tio.min

Mnemonic

tio.min{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x8 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.5. tio.minu

Mnemonic

tio.minu{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0x9 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.6. tio.max

Mnemonic

tio.max{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0xa },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.7. tio.maxu

Mnemonic

tio.maxu{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0xb },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.7.8. tio.rev8

Mnemonic

tio.rev8{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 4, name: 0xc },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.8. XTightlyCoupledIObbd

Destructive general IO bitmanip instructions

3.8.1. tio.andn.y

Mnemonic

tio.andn{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x5 },
 { bits: 2, name: 'bsel' },
]}

3.8.2. tio.orn.y

Mnemonic

tio.orn{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x6 },
 { bits: 2, name: 'bsel' },
]}

3.8.3. tio.xnor.y

Mnemonic

tio.xnor{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x7 },
 { bits: 2, name: 'bsel' },
]}

3.8.4. tio.min.y

Mnemonic

tio.min{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x8 },
 { bits: 2, name: 'bsel' },
]}

3.8.5. tio.minu.y

Mnemonic

tio.minu{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0x9 },
 { bits: 2, name: 'bsel' },
]}

3.8.6. tio.max.y

Mnemonic

tio.max{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0xa },
 { bits: 2, name: 'bsel' },
]}

3.8.7. tio.maxu.y

Mnemonic

tio.maxu{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 5, name: 0xb },
 { bits: 2, name: 'bsel' },
]}
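
The destructive forms in 3.6 and 3.8 share one 5-bit operation field (0x0 to 0xb). The reference model below is a sketch for 32-bit operands. It assumes the operations behave like their base ISA and Zbb namesakes and that the destructive form computes `iod = iod op rs2`; neither is spelled out in these sections. `bsel` and `sideOP` are ignored, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* `sel` is the 5-bit operation field of the .y encodings. */
static uint32_t tio_alu_y(uint32_t sel, uint32_t iod, uint32_t rs2)
{
    assert(sel <= 0xbu);

    switch (sel) {
    case 0x0: return iod + rs2;                                  /* add  */
    case 0x1: return iod - rs2;                                  /* sub  */
    case 0x2: return iod & rs2;                                  /* and  */
    case 0x3: return iod | rs2;                                  /* or   */
    case 0x4: return iod ^ rs2;                                  /* xor  */
    case 0x5: return iod & ~rs2;                                 /* andn */
    case 0x6: return iod | ~rs2;                                 /* orn  */
    case 0x7: return ~(iod ^ rs2);                               /* xnor */
    case 0x8: return ((int32_t)iod < (int32_t)rs2) ? iod : rs2;  /* min  */
    case 0x9: return (iod < rs2) ? iod : rs2;                    /* minu */
    case 0xa: return ((int32_t)iod > (int32_t)rs2) ? iod : rs2;  /* max  */
    default:  return (iod > rs2) ? iod : rs2;                    /* maxu */
    }
}
```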

3.9. XTightlyCoupledIOsb

Single bit IO access instructions

3.9.1. tio.bseti

Synopsis: Single bit set (immediate)
Mnemonic

tio.bseti{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x0 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x0 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.9.2. tio.bclri

Synopsis: Single bit clear (immediate)
Mnemonic

tio.bclri{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x1 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x1 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.9.3. tio.binvi

Synopsis: Single bit invert (immediate)
Mnemonic

tio.binvi{bsel}.{xy,yx} rd/iod, rs1/ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x2 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x2 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.9.4. tio.bexti.xy

Synopsis: Single bit extract from IO register (immediate)
Mnemonic

tio.bexti{bsel}.xy rd, ios1, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x6 },
 { bits: 1, name: 0, attr: ['iom'] },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x3 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 6, name: 'shamt' },
 { bits: 3, name: 0x6 },
 { bits: 1, name: 0, attr: ['iom'] },
 { bits: 2, name: 'bsel' },
]}

3.9.5. tio.bset

Synopsis: Single bit set
Mnemonic

tio.bset{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x0 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.9.6. tio.bclr

Synopsis: Single bit clear
Mnemonic

tio.bclr{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x1 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.9.7. tio.binv

Synopsis: Single bit invert
Mnemonic

tio.binv{bsel}.{xy,yx} rd/iod, rs1/ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x2 },
 { bits: 1, name: 'iom' },
 { bits: 2, name: 'bsel' },
]}

3.9.8. tio.bext.xy

Synopsis: Single bit extract from IO register
Mnemonic

tio.bext{bsel}.xy rd, ios1, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod/rd' },
 { bits: 3, name: 0x2 },
 { bits: 5, name: 'ios1/rs1' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 3, name: 0x6 },
 { bits: 1, name: 0, attr: ['iom'] },
 { bits: 2, name: 'bsel' },
]}

3.10. XTightlyCoupledIOsbd

Destructive single bit IO access instructions

3.10.1. tio.bseti.y

Synopsis: Destructive single bit set (immediate)
Mnemonic

tio.bseti{bsel}.y iod, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 4, name: 0x0 },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'sideOP' },
 { bits: 6, name: 'shamt' },
 { bits: 4, name: 0x0 },
 { bits: 2, name: 'bsel' },
]}

3.10.2. tio.bclri.y

Synopsis: Destructive single bit clear (immediate)
Mnemonic

tio.bclri{bsel}.y iod, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 4, name: 0x1 },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'sideOP' },
 { bits: 6, name: 'shamt' },
 { bits: 4, name: 0x1 },
 { bits: 2, name: 'bsel' },
]}

3.10.3. tio.binvi.y

Synopsis: Destructive single bit invert (immediate)
Mnemonic

tio.binvi{bsel}.y iod, shamt

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'shamt' },
 { bits: 1, name: 0 },
 { bits: 4, name: 0x2 },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'sideOP' },
 { bits: 6, name: 'shamt' },
 { bits: 4, name: 0x2 },
 { bits: 2, name: 'bsel' },
]}

3.10.4. tio.bset.y

Synopsis: Destructive single bit set
Mnemonic

tio.bset{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x6 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 4, name: 0x0 },
 { bits: 2, name: 'bsel' },
]}

3.10.5. tio.bclr.y

Synopsis: Destructive single bit clear
Mnemonic

tio.bclr{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x6 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 4, name: 0x1 },
 { bits: 2, name: 'bsel' },
]}

3.10.6. tio.binv.y

Synopsis: Destructive single bit invert
Mnemonic

tio.binv{bsel}.y iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x6 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 'rs2' },
 { bits: 1, name: 0 },
 { bits: 4, name: 0x2 },
 { bits: 2, name: 'bsel' },
]}
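
The single bit operations in 3.9 and 3.10 can be sketched as follows for a 32-bit IO register. The semantics are assumed to mirror the Zbs instructions of the same name; the helper names are hypothetical and the bit index is checked rather than masked.

```c
#include <assert.h>
#include <stdint.h>

/* bset / bseti: set one bit. */
static uint32_t tio_bset(uint32_t value, uint32_t index)
{
    assert(index < 32u);
    return value | (UINT32_C(1) << index);
}

/* bclr / bclri: clear one bit. */
static uint32_t tio_bclr(uint32_t value, uint32_t index)
{
    assert(index < 32u);
    return value & ~(UINT32_C(1) << index);
}

/* binv / binvi: invert one bit. */
static uint32_t tio_binv(uint32_t value, uint32_t index)
{
    assert(index < 32u);
    return value ^ (UINT32_C(1) << index);
}

/* bext / bexti: extract one bit into bit 0 of the result. */
static uint32_t tio_bext(uint32_t value, uint32_t index)
{
    assert(index < 32u);
    return (value >> index) & 1u;
}
```

In the destructive `.y` forms the first argument and the result would both be the `iod` register.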

3.11. XTightlyCoupledIObf

IO bitfield instructions

3.11.1. tio.bfinserti.yx

Synopsis: Destructive bitfield insert into IO register (immediate)
Mnemonic

tio.bfinserti{bsel}.yx iod, rs1, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x4 },
 { bits: 5, name: 'rs1' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x4 },
 { bits: 5, name: 'rs1' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Note	rv64 encoding could trade off the extra len/offset range similarly to branches

3.11.2. tio.bfinserti.yi

Synopsis: Destructive bitfield insert into IO register from immediate (immediate)
Mnemonic

tio.bfinserti{bsel}.yi iod, uimm, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'uimm[4:0]' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'iod' },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'uimm[4:0]' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Description: Insert len bits of expanded 'uimm[4:0]' constant into iod register at offset position. The uimm=0 is mapped into -1 constant.

Note	due to encoding constraints only destructive form is provided

3.11.3. tio.bfextracti.xy

Synopsis: Extract bitfield from IO register (immediate)
Mnemonic

tio.bfextracti{bsel}.xy rd, ios1, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x6 },
 { bits: 5, name: 'ios1' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x6 },
 { bits: 5, name: 'ios1' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Note	instruction is equivalent to `tio.slli` + `srli` sequence

3.11.4. tio.sbfextracti.xy

Synopsis: Extract and sign extend bitfield from IO register (immediate)
Mnemonic

tio.sbfextracti{bsel}.xy rd, ios1, offset, len

Encoding (RV32)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'ios1' },
 { bits: 5, name: 'offset' },
 { bits: 5, name: 'len' },
 { bits: 2, name: 'bsel' },
]}

Encoding (RV64)

{reg:[
 { bits: 7, name: 0x5b, attr: ['CUSTOM-2'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 0x7 },
 { bits: 5, name: 'ios1' },
 { bits: 6, name: 'offset' },
 { bits: 6, name: 'len' },
]}

Note	instruction is equivalent to `tio.slli` + `srai` sequence
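
A reference model of the bitfield instructions for a 32-bit IO register could look like the sketch below. `len` is the actual field length in bits (1 to 32); how that length maps onto the 5-bit `len` field is not covered here. The extract is written as the shift pair mentioned in the notes, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the `len` lowest bits set, len in 1..32. */
static uint32_t tio_mask(uint32_t len)
{
    assert(len >= 1u && len <= 32u);
    return (len == 32u) ? 0xffffffffu : ((UINT32_C(1) << len) - 1u);
}

/* tio.bfinserti: replace `len` bits of io at `offset` with the low bits of src. */
static uint32_t tio_bfinsert(uint32_t io, uint32_t src, uint32_t offset, uint32_t len)
{
    assert(offset < 32u && offset + len <= 32u);
    uint32_t mask = tio_mask(len) << offset;
    return (io & ~mask) | ((src << offset) & mask);
}

/* tio.bfextracti: zero-extended field, as a left shift followed by a right shift. */
static uint32_t tio_bfextract(uint32_t io, uint32_t offset, uint32_t len)
{
    assert(len >= 1u && offset < 32u && offset + len <= 32u);
    return (io << (32u - offset - len)) >> (32u - len);
}

/* tio.sbfextracti: same field, sign-extended from its top bit. */
static int32_t tio_sbfextract(uint32_t io, uint32_t offset, uint32_t len)
{
    uint32_t field = tio_bfextract(io, offset, len);
    uint32_t sign  = UINT32_C(1) << (len - 1u);
    return (int32_t)((field ^ sign) - sign);
}
```

As a usage example, the clock switch status field in the Appendix A listings (mask 12, PLL value 8) is `tio_bfextract(cfgr, 2, 2) == 2`.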

3.12. XTightlyCoupledIOsbbr

Branch on single IO bit instructions

3.12.1. tio.bsbseti.y

Synopsis: Branch if single bit in IO register is set (immediate)
Mnemonic

tio.bsbseti{bsel}.y ios1, shamt, label

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] },
 { bits: 5, name: 'imm[4:1|11]' },
 { bits: 2, name: 0x0 },
 { bits: 1, name: 'bsel' },
 { bits: 5, name: 'ios1' },
 { bits: 5, name: 'shamt' },
 { bits: 7, name: 'imm[12|10:5]' },
]}

Note	only bottom 32 bits of target register are accessible on rv64

3.12.2. tio.bsbclri.y

Synopsis: Branch if single bit in IO register is cleared (immediate)
Mnemonic

tio.bsbclri{bsel}.y ios1, shamt, label

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x7b, attr: ['CUSTOM-3'] },
 { bits: 5, name: 'imm[4:1|11]' },
 { bits: 2, name: 0x1 },
 { bits: 1, name: 'bsel' },
 { bits: 5, name: 'ios1' },
 { bits: 5, name: 'shamt' },
 { bits: 7, name: 'imm[12|10:5]' },
]}

Note	only bottom 32 bits of target register are accessible on rv64
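
The two branch encodings reuse the B-type immediate layout of the base ISA. The encoder below is a sketch under the assumption that the bitfield listings enumerate fields from bit 0 upward; `offset` is the byte distance to the label, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* clr = 0 encodes tio.bsbseti.y, clr = 1 encodes tio.bsbclri.y. */
static uint32_t tio_encode_bsb(uint32_t clr, uint32_t bsel, uint32_t ios1,
                               uint32_t shamt, int32_t offset)
{
    assert(clr < 2u && bsel < 2u && ios1 < 32u && shamt < 32u);
    assert(offset >= -4096 && offset <= 4094 && (offset % 2) == 0);

    uint32_t imm = (uint32_t)offset;

    return 0x7bu                           /* [6:0]   CUSTOM-3 opcode   */
         | (((imm >> 11) & 0x1u)  << 7)    /* [7]     imm[11]           */
         | (((imm >> 1)  & 0xfu)  << 8)    /* [11:8]  imm[4:1]          */
         | (clr   << 12)                   /* [13:12] 0x0 set, 0x1 clr  */
         | (bsel  << 14)                   /* [14]    bsel              */
         | (ios1  << 15)                   /* [19:15] ios1              */
         | (shamt << 20)                   /* [24:20] shamt             */
         | (((imm >> 5)  & 0x3fu) << 25)   /* [30:25] imm[10:5]         */
         | (((imm >> 12) & 0x1u)  << 31);  /* [31]    imm[12]           */
}
```

An offset of 0 corresponds to a busy-wait loop that branches to itself, the pattern used for polling ready flags.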

3.13. XTightlyCoupledIOfcvt

implemented similarly to F or Zfinx fcvt instructions

Note	ADC readings are often immediately converted to float for processing in control loop algorithms

3.13.1. tio.fcvt.s.w.sy

Synopsis: Read IO register and convert to float
Mnemonic

tio.fcvt{bsel}.s.w.sy rd, ios1, rm

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x53, attr: ['OP-FP'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 'rm' },
 { bits: 5, name: 'ios1' },
 { bits: 3, name: 0x4 },
 { bits: 2, name: 'bsel' },
 { bits: 2, name: 'fmt', attr: ['S'] },
 { bits: 5, name: 0x1a },
]}

Prerequisites: F or Zfinx

3.13.2. tio.fcvt.s.wu.sy

Synopsis: Read IO register as unsigned integer and convert to float
Mnemonic

tio.fcvt{bsel}.s.wu.sy rd, ios1, rm

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x53, attr: ['OP-FP'] },
 { bits: 5, name: 'rd' },
 { bits: 3, name: 'rm' },
 { bits: 5, name: 'ios1' },
 { bits: 3, name: 0x5 },
 { bits: 2, name: 'bsel' },
 { bits: 2, name: 'fmt', attr: ['S'] },
 { bits: 5, name: 0x1a },
]}

Prerequisites: F or Zfinx
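
The difference between the two conversions can be illustrated with a small C model. It is a sketch only: it assumes the signed and unsigned interpretation matches `fcvt.s.w` and `fcvt.s.wu`, it ignores the `rm` field (C's default round-to-nearest is used), and the function name is hypothetical. The selector values are the 3-bit field that distinguishes the two encodings above.

```c
#include <assert.h>
#include <stdint.h>

/* sel = 0x4 models tio.fcvt.s.w.sy, sel = 0x5 models tio.fcvt.s.wu.sy. */
static float tio_fcvt_s(uint32_t sel, uint32_t io_value)
{
    assert(sel == 0x4u || sel == 0x5u);

    if (sel == 0x4u)
        return (float)(int32_t)io_value;  /* signed 32-bit source   */
    return (float)io_value;               /* unsigned 32-bit source */
}
```

A right-aligned 12-bit ADC reading gives the same result with either form; the choice only matters for registers whose top bit can be set.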

3.13.3. TODO: double precision and rv64

3.13.4. TODO: float to int

potentially problematic to implement, as the float pipe is usually longer than the integer one

3.14. XTightlyCoupledIOcm

implemented similarly to Zcm* extensions, incompatible with Zcd

3.14.1. tio.cm.mv.yx

Synopsis: Move into IO register
Mnemonic

tio.cm.mv{bsel}.yx iod, rs2

Encoding (RV32, RV64)

{reg:[
 { bits:  2, name: 0x0, attr: ['C0'] },
 { bits:  5, name: 'rs2' },
 { bits:  5, name: 'iod' },
 { bits:  1, name: 'bsel' },
 { bits:  3, name: 0x5, attr: ['FSD'] },
],config:{bits:16}}

Prerequisites: Zca

3.14.2. tio.cm.mv.xy

Synopsis: Move from IO register
Mnemonic

tio.cm.mv{bsel}.xy rd, ios1

Encoding (RV32, RV64)

{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  5, name: 'ios1' },
 { bits:  5, name: 'rd' },
 { bits:  1, name: 'bsel' },
 { bits:  3, name: 0x1, attr: ['FLDSP'] },
],config:{bits:16}}

Prerequisites: Zca

Note	ios1 in rs2 position, the low bits store only rd' in C extension, maybe swap?

3.14.3. tio.cm.bseti0.y

Synopsis: Set bit in IO register (immediate)
Mnemonic

tio.cm.bseti0.y iod, shamt

Encoding (RV32, RV64)

{reg:[
 { bits:  2, name: 0x0, attr: ['C0'] },
 { bits:  5, name: 'shamt' },
 { bits:  5, name: 'iod' },
 { bits:  1, name: '0' },
 { bits:  3, name: 0x1, attr: ['FLD'] },
],config:{bits:16}}

Prerequisites: Zca

Note	only bottom 32 bits are accessible on rv64

3.14.4. tio.cm.bclri0.y

Synopsis: Clear bit in IO register (immediate)
Mnemonic

tio.cm.bclri0.y iod, shamt

Encoding (RV32, RV64)

{reg:[
 { bits:  2, name: 0x0, attr: ['C0'] },
 { bits:  5, name: 'shamt' },
 { bits:  5, name: 'iod' },
 { bits:  1, name: '1' },
 { bits:  3, name: 0x1, attr: ['FLD'] },
],config:{bits:16}}

Prerequisites: Zca

Note	only bottom 32 bits are accessible on rv64

3.15. XTightlyCoupledIOsideOPdelay

This extension provides an optional delay of 0 to 31 cycles before the next IO targeting instruction can be executed. The number of delay cycles is encoded as uimm[4:0] in the sideOP position.

The delay starts in the cycle after the implied writeback stage (and write side effects) of the instruction that carries it. The delayed instruction cannot trigger any side effects until the implied delay downcounter reaches zero, which happens in the cycle of that instruction’s implied writeback stage (and write side effects).

Note	allowing regular instructions to execute within the delay window makes it possible to achieve deterministic timing under non-deterministic execution conditions (caches, flash waitstates etc.), where extra computation is necessary (bit stuffing, access fifos etc.)

Note	other sideOP behaviour can be configured by a custom CSR of another extension

Example of generating a 50:50 square wave with a 64 cycle period:

1:
	tio.bseti GPIOA_ODR, 17, [31]
	tio.bclri GPIOA_ODR, 17, [31]
	b 1b
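
The 64 cycle figure can be checked with a tiny timing model. This is a sketch under stated assumptions: each IO instruction has a single writeback cycle, the delay window starts in the following cycle, and the `b 1b` completes inside the delay window so it adds nothing to the period. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Earliest cycle in which the next IO instruction may write back, given that
 * the previous one wrote back in cycle `prev_wb` and carried `delay` in sideOP. */
static uint64_t tio_next_writeback(uint64_t prev_wb, uint32_t delay)
{
    assert(delay <= 31u);
    return prev_wb + 1u + delay;
}

/* Period of a loop that alternates one bit set and one bit clear,
 * both carrying the same delay. */
static uint64_t tio_square_wave_period(uint32_t delay)
{
    uint64_t set_wb   = 0;
    uint64_t clr_wb   = tio_next_writeback(set_wb, delay);
    uint64_t next_set = tio_next_writeback(clr_wb, delay);
    return next_set - set_wb;
}
```

With `delay = 31` both half periods are 32 cycles long, which gives the 50:50 duty cycle.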

3.15.1. tio.nop

This instruction doesn’t access any IO register, but it causes pipeline contention as if it were a read-modify-write on an IO register.

Mnemonic

tio.nop

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 0x0 },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 0x0 },
 { bits: 5, name: 0xc },
 { bits: 2, name: 0x0 },
]}

3.15.2. tio.nop.sideattach

Unlike tio.nop, this instruction doesn’t cause pipeline contention. Instead it attaches its own sideOP to the next IO accessing tio instruction, effectively overriding the sideOP of that instruction if present (the next instruction’s own sideOP has no effect).

Cannot be overridden by itself, only the last sideattach instruction is effective

Note	requires special CSR to hold attached `sideOP`

Note	`uimm=0` sideOP encoding can be used to null out the sideOP of the following instruction

Mnemonic

tio.nop.sideattach [sideOP]

Note	square bracket is mandatory

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x2b, attr: ['CUSTOM-1'] },
 { bits: 5, name: 0x0 },
 { bits: 3, name: 0x5 },
 { bits: 5, name: 'sideOP' },
 { bits: 5, name: 0x0 },
 { bits: 5, name: 0xd },
 { bits: 2, name: 0x0 },
]}
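
One possible reading of the sideattach rules is sketched below. The struct stands in for the CSR that holds the attached sideOP, and the model assumes that an attachment is consumed by the first IO accessing instruction that follows it and that a later `tio.nop.sideattach` replaces an earlier pending one. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the CSR that holds an attached sideOP. */
struct tio_side_state {
    bool     attached;
    uint32_t side_op;
};

/* tio.nop.sideattach [sideOP]: remember sideOP for the next IO instruction. */
static void tio_sideattach(struct tio_side_state *st, uint32_t side_op)
{
    assert(side_op < 32u);
    st->attached = true;
    st->side_op  = side_op;  /* a later sideattach replaces a pending one */
}

/* sideOP that takes effect for an IO accessing instruction whose own field
 * is `own_side_op`; any pending attachment is consumed. */
static uint32_t tio_effective_side_op(struct tio_side_state *st, uint32_t own_side_op)
{
    assert(own_side_op < 32u);
    uint32_t effective = st->attached ? st->side_op : own_side_op;
    st->attached = false;
    return effective;
}
```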

3.15.3. TODO: delay vs interrupts

A delay pending in the interrupted context shouldn’t affect execution of IO instructions inside interrupt handlers. The delay also shouldn’t freeze for the duration of the ISR, and it shouldn’t be "removed" by interrupts shorter than the remaining delay.

Appendix A: code samples

risc-v listings were generated by "clang 15.0.0" with -Os -march=rv32imafc_zba_zbb_zbs flags. (clang as the listing is cleaner than in gcc, and the generated code is a bit more efficient)

armv7m listings were generated by "gcc 11.2.1 (none)" with -Os -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 flags. (newest non linux one on godbolt)

risc-v + XTightlyCoupledIO listings are imaginary compile outputs. Note that many of the definitions don’t even exist in device headers.

A.1. stm32 GPIO output toggle

void toggle() {
    GPIOB->ODR ^= GPIO_ODR_13;
}

Note	on avr8 GPIO pin toggling can be achieved by writing into PINxn registers by `out` or `sbi` instructions (the `sbi` here is not a RMW)

risc-v

toggle():                             # @toggle()
	lui     a0, 294912
	lw      a1, 1044(a0)
	binvi   a1, a1, 13
	sw      a1, 1044(a0)
	ret

armv7m

toggle():
	ldr     r2, .L5
	ldr     r3, [r2, #20]
	eor     r3, r3, #8192
	str     r3, [r2, #20]
	bx      lr
.L5:
	.word   1207960576

risc-v + XTightlyCoupledIO

toggle():
	tio.binvi GPIOB_ODR, 13
	ret

Results

	risc-v	armv7-m	risc-v + XTightlyCoupledIO
code size (bytes)	18	16	6

A.2. stm32f0 minimum PLL clock init (assume reset state of registers, no other config)

void init_clocks()
{
	FLASH->ACR = FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001); // 1ws

	RCC->CFGR = RCC_CFGR_PLLMUL12;

	RCC->CR |= RCC_CR_PLLON;
	while(!(RCC->CR & RCC_CR_PLLRDY));

	RCC->CFGR |= RCC_CFGR_SW_PLL;
	while ((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL);
}

risc-v

init_clocks():                       # @init_clocks()
	lui     a0, 262178
	li      a1, 17
	sw      a1, 0(a0)
	lui     a0, 262177
	lui     a1, 640
	sw      a1, 4(a0)
	lw      a1, 0(a0)
	bseti   a1, a1, 24
	sw      a1, 0(a0)
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
	lw      a1, 0(a0)
	slli    a1, a1, 6
	bgez    a1, .LBB0_1
	lui     a0, 262177 // redundant
	lw      a1, 4(a0)
	ori     a1, a1, 2
	sw      a1, 4(a0)
	li      a1, 8
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
	lw      a2, 4(a0)
	andi    a2, a2, 12
	bne     a2, a1, .LBB0_3
	ret

Note	gcc 12.2 fails to detect `slli` + `bgez` pattern and performs li + and + beq, even though on arm it works fine

armv7m

init_clocks():
	ldr     r3, .L7
	movs    r2, #17
	str     r2, [r3]
	sub     r3, r3, #4096
	mov     r2, #2621440
	str     r2, [r3, #4]
	ldr     r2, [r3]
	orr     r2, r2, #16777216
	str     r2, [r3]
.L2:
	ldr     r2, [r3]
	lsls    r2, r2, #6
	bpl     .L2
	ldr     r2, [r3, #4]
	orr     r2, r2, #2
	str     r2, [r3, #4]
.L3:
	ldr     r2, [r3, #4]
	and     r2, r2, #12
	cmp     r2, #8
	bne     .L3
	bx      lr
.L7:
	.word   1073881088

risc-v + XTightlyCoupledIO

init_clocks():
	tio.addi FLASH_ACR, zero, (FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001))
	lui t0, %hi(RCC_CFGR_PLLMUL12)
	tio.cm.mv RCC_CFGR, t0 // no need for addi
	tio.cm.bseti RCC_CR, RCC_CR_PLLON_Pos
1:
	tio.bsbclri RCC_CR, RCC_CR_PLLRDY_Pos, 1b
	tio.cm.bseti RCC_CFGR, RCC_CFGR_SW_Pos+1 // effectively 0b10
2:
	tio.bfextracti t0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
	tio.bnei t0, (RCC_CFGR_SWS_PLL >> RCC_CFGR_SWS_Pos), 2b
	ret

Results

	risc-v	armv7-m	risc-v + XTightlyCoupledIO
code size (bytes)	58(54 without redundant lui)	52	28

A.3. stm32f0 minimum PLL clock init (assume unknown or "worst case" state of registers)

void init_clocks2()
{
	FLASH->ACR = FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001); // 1ws

	if((RCC->CFGR & RCC_CFGR_SWS) == RCC_CFGR_SWS_PLL)
	{
		RCC->CFGR &= ~RCC_CFGR_SW_Msk; // switch to HSI (0b00)
		while((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_HSI);
	}

	RCC->CR &= ~RCC_CR_PLLON;
	while((RCC->CR & RCC_CR_PLLRDY));

	RCC->CFGR = RCC_CFGR_PLLMUL12 | (RCC->CFGR & ~RCC_CFGR_PLLMUL_Msk);

	RCC->CR |= RCC_CR_PLLON;
	while(!(RCC->CR & RCC_CR_PLLRDY));

	RCC->CFGR = RCC_CFGR_SW_PLL | (RCC->CFGR & ~RCC_CFGR_SW_Msk);
	while((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL);
}

risc-v

init_clocks2():                      # @init_clocks2()
	lui     a0, 262178
	li      a1, 17
	sw      a1, 0(a0)
	lui     a0, 262177
	lw      a1, 4(a0)
	andi    a1, a1, 12
	li      a2, 8
	bne     a1, a2, .LBB1_3
	lw      a1, 4(a0)
	andi    a1, a1, -4
	sw      a1, 4(a0)
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
	lw      a1, 4(a0)
	andi    a1, a1, 12
	bnez    a1, .LBB1_2
.LBB1_3:
	lw      a1, 0(a0)
	bclri   a1, a1, 24
	sw      a1, 0(a0)
.LBB1_4:                                # =>This Inner Loop Header: Depth=1
	lw      a1, 0(a0)
	slli    a1, a1, 6
	bltz    a1, .LBB1_4
	lui     a0, 262177 // redundant
	lw      a1, 4(a0)
	lui     a2, 1047616
	addi    a2, a2, -1
	and     a1, a1, a2
	bseti   a1, a1, 19
	bseti   a1, a1, 21
	sw      a1, 4(a0)
	lw      a1, 0(a0)
	bseti   a1, a1, 24
	sw      a1, 0(a0)
.LBB1_6:                                # =>This Inner Loop Header: Depth=1
	lw      a1, 0(a0)
	slli    a1, a1, 6
	bgez    a1, .LBB1_6
	lui     a0, 262177 // redundant
	lw      a1, 4(a0)
	andi    a1, a1, -4
	ori     a1, a1, 2
	sw      a1, 4(a0)
	li      a1, 8
.LBB1_8:                                # =>This Inner Loop Header: Depth=1
	lw      a2, 4(a0)
	andi    a2, a2, 12
	bne     a2, a1, .LBB1_8
	ret

armv7m

init_clocks2():
	ldr     r3, .L20
	movs    r2, #17
	str     r2, [r3]
	sub     r3, r3, #4096
	ldr     r2, [r3, #4]
	and     r2, r2, #12
	cmp     r2, #8
	bne     .L10
	ldr     r2, [r3, #4]
	bic     r2, r2, #3
	str     r2, [r3, #4]
.L11:
	ldr     r2, [r3, #4]
	tst     r2, #12
	bne     .L11
.L10:
	ldr     r2, [r3]
	bic     r2, r2, #16777216
	str     r2, [r3]
.L12:
	ldr     r2, [r3]
	lsls    r1, r2, #6
	bmi     .L12
	ldr     r2, [r3, #4]
	bic     r2, r2, #3932160
	orr     r2, r2, #2621440
	str     r2, [r3, #4]
	ldr     r2, [r3]
	orr     r2, r2, #16777216
	str     r2, [r3]
.L13:
	ldr     r2, [r3]
	lsls    r2, r2, #6
	bpl     .L13
	ldr     r2, [r3, #4]
	bic     r2, r2, #3
	orr     r2, r2, #2
	str     r2, [r3, #4]
.L14:
	ldr     r2, [r3, #4]
	and     r2, r2, #12
	cmp     r2, #8
	bne     .L14
	bx      lr
.L20:
	.word   1073881088

Note	gcc generally fails to detect the `bfi` from constant pattern

risc-v + XTightlyCoupledIO

init_clocks2():
	tio.addi FLASH_ACR, zero, (FLASH_ACR_PRFTBE | (FLASH_ACR_LATENCY_Msk & 0b001))
	tio.bfextracti a0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
	tio.bnei a0, (RCC_CFGR_SWS_PLL >> RCC_CFGR_SWS_Pos), 2f
	tio.bfinserti RCC_CFGR, zero, RCC_CFGR_SW_Pos, 2
1:
	tio.bfextracti a0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
	c.bnez a0, 1b // needs x8-x15 register
2:
	tio.cm.bclri RCC_CR, RCC_CR_PLLON_Pos
3:
	tio.bsbseti RCC_CR, RCC_CR_PLLRDY_Pos, 3b
	tio.bfinserti RCC_CFGR, (RCC_CFGR_PLLMUL12 >> RCC_CFGR_PLLMUL_Pos), RCC_CFGR_PLLMUL_Pos, 4
	tio.cm.bseti RCC_CR, RCC_CR_PLLON_Pos
4:
	tio.bsbclri RCC_CR, RCC_CR_PLLRDY_Pos, 4b
	tio.bfinserti RCC_CFGR, (RCC_CFGR_SW_PLL >> RCC_CFGR_SW_Pos), RCC_CFGR_SW_Pos, 2
5:
	tio.bfextracti a0, RCC_CFGR, RCC_CFGR_SWS_Pos, 2
	tio.bnei a0, (RCC_CFGR_SWS_PLL >> RCC_CFGR_SWS_Pos), 5b
	ret

Results

	risc-v	armv7-m	risc-v + XTightlyCoupledIO
code size (bytes)	116(108 without redundant lui)	104	52
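
The read-modify-write pattern `reg = value | (reg & ~mask)` and the masked compare on SWS used in this sample can be modelled on the host as below. This is only a sketch: `bf_mask`, `bf_insert` and `bf_extract` are hypothetical helper names, and they model the C source above, not the normative semantics of `tio.bfinserti`/`tio.bfextracti`.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering `width` bits starting at bit `pos` (1 <= width <= 32). */
static uint32_t bf_mask(unsigned pos, unsigned width)
{
	assert(width >= 1 && width <= 32 && pos + width <= 32);
	uint32_t field = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
	return field << pos;
}

/* `reg` with bits [pos, pos + width) replaced by the low bits of `value`. */
static uint32_t bf_insert(uint32_t reg, uint32_t value, unsigned pos, unsigned width)
{
	uint32_t mask = bf_mask(pos, width);
	return (reg & ~mask) | ((value << pos) & mask);
}

/* Bits [pos, pos + width) of `reg`, right adjusted and zero extended. */
static uint32_t bf_extract(uint32_t reg, unsigned pos, unsigned width)
{
	return (reg & bf_mask(pos, width)) >> pos;
}
```

With the constants visible in the compiled listings (PLLMUL field at bits 18..21, PLLMUL12 = 0b1010), `bf_insert(cfgr, 0xA, 18, 4)` reproduces the `bic #3932160` / `orr #2621440` pair of the armv7m output.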

A.4. stm32f0 gpio + timer init for 7segment display (assume reset state of registers)

comes from: [7]

void init_7seg() {
	RCC->AHBENR |= RCC_AHBENR_GPIOAEN | RCC_AHBENR_GPIOBEN | RCC_AHBENR_GPIOFEN;

	//common
	GPIOB->MODER |= (0b01 << GPIO_MODER_MODER1_Pos);
	GPIOF->MODER |= (0b01 << GPIO_MODER_MODER0_Pos) | (0b01 << GPIO_MODER_MODER1_Pos);
	GPIOA->MODER |= (0b01 << GPIO_MODER_MODER9_Pos);

	// initialize to disabled state (common scattered will blink first
	// digit on all columns on startup otherwise)
	GPIOB->BSRR = GPIO_BSRR_BS_1;
	GPIOF->BSRR = GPIO_BSRR_BS_0 | GPIO_BSRR_BS_1;
	GPIOA->BSRR = GPIO_BSRR_BS_9;

	//segment
	GPIOA->MODER |= (0b01 << GPIO_MODER_MODER4_Pos)
		|(0b01 << GPIO_MODER_MODER2_Pos)
		|(0b01 << GPIO_MODER_MODER6_Pos)
		|(0b01 << GPIO_MODER_MODER5_Pos)
		|(0b01 << GPIO_MODER_MODER1_Pos)
		|(0b01 << GPIO_MODER_MODER3_Pos)
	    |(0b01 << GPIO_MODER_MODER7_Pos)
		|(0b01 << GPIO_MODER_MODER0_Pos);

	GPIOA->OSPEEDR |= (0b11 << GPIO_OSPEEDR_OSPEEDR4_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR2_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR6_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR5_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR1_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR3_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR7_Pos)
		|(0b11 << GPIO_OSPEEDR_OSPEEDR0_Pos);

	RCC->APB2ENR |= RCC_APB2ENR_TIM16EN;

	TIM16->DIER = TIM_DIER_UIE;

	TIM16->ARR = 47999; // 1khz isr rate at 48 mhz

	TIM16->CR1 = TIM_CR1_CEN;

	//NVIC_EnableIRQ(TIM16_IRQn);
}

risc-v

init_7seg():                    # @init_7seg_gpio()
	lui     a0, 262177
	lw      a1, 20(a0)
	lui     a2, 1120
	or      a1, a1, a2
	sw      a1, 20(a0)
	lui     a1, 294912
	lw      a2, 1024(a1)
	ori     a2, a2, 4
	sw      a2, 1024(a1)
	lui     a2, 294913
	lw      a3, 1024(a2)
	ori     a3, a3, 5
	sw      a3, 1024(a2)
	lw      a3, 0(a1)
	bseti   a3, a3, 18
	sw      a3, 0(a1)
	li      a3, 2
	sw      a3, 1048(a1)
	li      a3, 3
	sw      a3, 1048(a2)
	li      a2, 512
	sw      a2, 24(a1)
	lw      a2, 0(a1)
	lui     a3, 5
	addi    a3, a3, 1365
	or      a2, a2, a3
	sw      a2, 0(a1)
	lw      a2, 8(a1)
	lui     a3, 16
	addi    a3, a3, -1
	or      a2, a2, a3
	sw      a2, 8(a1)
	lw      a1, 24(a0)
	bseti   a1, a1, 17
	sw      a1, 24(a0)
	lui     a0, 262164
	li      a1, 1
	sw      a1, 1036(a0)
	lui     a2, 12
	addi    a2, a2, -1153
	sw      a2, 1068(a0)
	sw      a1, 1024(a0)
	ret

armv7m

init_7seg():
	ldr     r1, .L2
	ldr     r0, .L2+4
	ldr     r3, [r1, #20]
	ldr     r2, .L2+8
	orr     r3, r3, #4587520
	push    {r4, lr}
	str     r3, [r1, #20]
	ldr     r3, [r0]
	orr     r3, r3, #4
	str     r3, [r0]
	ldr     r3, [r2]
	orr     r3, r3, #5
	str     r3, [r2]
	mov     r3, #1207959552
	ldr     r4, [r3]
	orr     r4, r4, #262144
	str     r4, [r3]
	movs    r4, #2
	str     r4, [r0, #24]
	movs    r0, #3
	str     r0, [r2, #24]
	mov     r2, #512
	str     r2, [r3, #24]
	ldr     r2, [r3]
	orr     r2, r2, #21760
	orr     r2, r2, #85
	str     r2, [r3]
	ldr     r2, [r3, #8]
	mvn     r2, r2, lsr #16
	mvn     r2, r2, lsl #16
	str     r2, [r3, #8]
	ldr     r3, [r1, #24]
	orr     r3, r3, #131072
	str     r3, [r1, #24]
	ldr     r3, .L2+12
	movs    r2, #1
	movw    r1, #47999
	str     r2, [r3, #12]
	str     r1, [r3, #44]
	str     r2, [r3]
	pop     {r4, pc}
.L2:
	.word   1073876992
	.word   1207960576
	.word   1207964672
	.word   1073824768

risc-v + XTightlyCoupledIO

init_7seg():
	lui t0, %hi(RCC_AHBENR_GPIOAEN | RCC_AHBENR_GPIOBEN | RCC_AHBENR_GPIOFEN)
	tio.or RCC_AHBENR, t0
	tio.cm.bseti GPIOB_MODER, GPIO_MODER_MODER1_Pos // '0' bit doesn't matter in oring
	c.li t0, (0b01 << GPIO_MODER_MODER0_Pos) | (0b01 << GPIO_MODER_MODER1_Pos)
	tio.or GPIOF_MODER, t0
	tio.cm.bseti GPIOA_MODER, GPIO_MODER_MODER9_Pos // '0' bit doesn't matter in oring
	tio.addi GPIOB_BSRR, zero, GPIO_BSRR_BS_1 // can also bseti from x0
	tio.addi GPIOF_BSRR, zero, (GPIO_BSRR_BS_0 | GPIO_BSRR_BS_1)
	tio.addi GPIOA_BSRR, zero, GPIO_BSRR_BS_9 // can also bseti from x0
	c.lui t0, %hi(0b0101010101010101)
	addi t0, t0, %lo(0b0101010101010101)
	tio.or GPIOA_MODER, t0
	tio.bfinserti GPIOA_OSPEEDR, -1, 0, 16 // equiv to or
	tio.cm.bseti RCC_APB2ENR, RCC_APB2ENR_TIM16EN_Pos
	//c.li t1, 1 // UIE and CEN, 2 bytes smaller at higher reg pressure
	//tio.cm.mv TIM16_DIER, t1
	tio.addi TIM16_DIER, zero, TIM_DIER_UIE // can also bseti from x0
	c.lui t0, %hi(47999)
	tio.addi TIM16_ARR, t0, %lo(47999)
	//tio.cm.mv TIM16_CR1, t1
	tio.addi TIM16_CR1, zero, TIM_CR1_CEN // can also bseti from x0
	ret

Results

	risc-v	armv7-m	risc-v + XTightlyCoupledIO
code size (bytes)	128	124	60(58 at higher pressure)
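
The `%hi`/`%lo` pairs used above for the MODER mask and for 47999 follow the usual RISC-V split, where `%lo` is a sign-extended 12-bit value and `%hi` is rounded up to compensate. A minimal host-side sketch of that split (the function names are illustrative only, they are not assembler features):

```c
#include <assert.h>
#include <stdint.h>

/* Upper 20 bits as used by lui, adjusted for the sign of the low part. */
static uint32_t riscv_hi20(uint32_t value)
{
	return ((value + 0x800u) >> 12) & 0xFFFFFu;
}

/* Low 12 bits, sign extended, as used by addi. */
static int32_t riscv_lo12(uint32_t value)
{
	int32_t low = (int32_t)(value & 0xFFFu);
	return (low >= 0x800) ? low - 0x1000 : low;
}

/* Value materialized by `lui rd, hi` followed by `addi rd, rd, lo`. */
static uint32_t riscv_lui_addi(uint32_t hi20, int32_t lo12)
{
	assert(hi20 <= 0xFFFFFu && lo12 >= -2048 && lo12 <= 2047);
	return (hi20 << 12) + (uint32_t)lo12;
}
```

For 47999 this yields 12 and -1153, matching the `lui a2, 12` / `addi a2, a2, -1153` pair in the plain risc-v listing.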

A.5. stm32f0 7segment display interrupt handler

comes from: [7], heavily based on BSRR/BRR registers.

using segment_config = jnk0le::sseg::PinConfig<false, GPIOA_BASE, 4, 2, 6, 5, 1, 3, 7, 0>;
using common_simple = jnk0le::sseg::CommonConfig<true, GPIOB_BASE, 2, 3, 5, 8>;

jnk0le::sseg::Display<segment_config, common_simple> displ;

extern "C" void TIM16_IRQHandler()
{
	TIM16->SR = 0; //bits are rc_w0
	displ.defaultIrqHandler();
}

// handler in Display class
void defaultIrqHandler()
{
	common_config::turnOff(cnt);

	// cnt is not volatile but gcc emits some garbage otherwise
	// It must span no more, otherwise increases register pressure in llvm and gcc
	uint32_t cnt_tmp = cnt;

	if(cnt_tmp == 0)
		cnt_tmp = common_config::getColumnAmount(); // 1 more than effective indexing

	cnt_tmp--;

	cnt = cnt_tmp;

	// put delay here in case of ghosting

	seg_config::getSegGPIO()->BSRR = disp_cache[cnt];

	common_config::turnOn(cnt);
}

// turn off/on in CommonConfig class
static inline constexpr void turnOff([[maybe_unused]] uint32_t idx)
{
	if constexpr(invert_polarity)
		reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BSRR = selectAllPinsMask();
	else
		reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BRR = selectAllPinsMask();
}

static inline constexpr void turnOn(uint32_t idx)
{
	if constexpr(invert_polarity) {
		reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BRR =
				static_cast<uint32_t>(column_pin_mask_lut[idx]);
	} else {
		reinterpret_cast<GPIO_TypeDef*>(gpio_addr)->BSRR =
				static_cast<uint32_t>(column_pin_mask_lut[idx]);
	}
}

risc-v

TIM16_IRQHandler:                       # @TIM16_IRQHandler
	lui     a0, 262164
	sw      zero, 1040(a0)
	lui     a0, 294912
	li      a1, 300
	sw      a1, 1048(a0) //20
	lui     a1, %hi(displ)
	addi    a2, a1, %lo(displ)
	lw      a3, 16(a2)
	li      a1, 3
	beqz    a3, .LBB0_2 //34
	addi    a1, a3, -1 //38
.LBB0_2:
	sw      a1, 16(a2)
	sh2add  a2, a1, a2
	lw      a2, 0(a2) //46
	lui     a3, %hi(trimmed::column_pin_mask_lut)
	addi    a3, a3, %lo(trimmed::column_pin_mask_lut)
	sh1add  a1, a1, a3 //58
	lhu     a1, 0(a1) //62
	sw      a2, 24(a0)
	sw      a1, 1064(a0)
	ret

armv7m

TIM16_IRQHandler:
	ldr     r3, .L3
	ldr     r1, .L3+4
	movs    r2, #0
	str     r2, [r3, #16]
	ldr     r2, .L3+8
	ldr     r3, [r2, #16]
	cmp     r3, #0
	it      eq
	moveq   r3, #4
	subs    r3, r3, #1
	mov     r0, #300
	str     r0, [r1, #24]
	ldr     r0, [r2, r3, lsl #2]
	str     r3, [r2, #16]
	mov     r2, #1207959552
	str     r0, [r2, #24]
	ldr     r2, .L3+12
	ldrh    r3, [r2, r3, lsl #1]
	str     r3, [r1, #40]
	bx      lr
.L3:
		.word   1073824768
		.word   1207960576
		.word   .LANCHOR0
		.word   trimmed::column_pin_mask_lut

risc-v + XTightlyCoupledIO

	tio.cm.mv TIM16_SR, zero
	tio.addi GPIOB_BSRR, zero, 0x12c // pins 2,3,5,8
	lui a0, %hi(displ)
	addi a0, a0, %lo(displ)
	c.lw a1, 16(a0) // get cnt
	c.bnez a1, 1f
	c.li a1, 4
1:
	c.addi a1, -1
	c.sw a1, 16(a0)
	sh2add a0, a1, a0 // disp_cache[cnt]
	c.lw a0, 0(a0)
	tio.cm.mv GPIOA_BSRR, a0
	lui a0, %hi(trimmed::column_pin_mask_lut)
	addi a0, a0, %lo(trimmed::column_pin_mask_lut)
	sh1add a0, a1, a0
	lh a0, 0(a0) //c. with Zcb
	tio.cm.mv GPIOB_BRR, a0
	ret

Results

	risc-v	armv7-m	risc-v + XTightlyCoupledIO
code size (bytes)	70(68 with Zcb)	64	52(50 with Zcb)

Results assume that FLASH/SRAM are kept at typical 0x08000000/0x20000000 addresses

Note	`gp` relaxing can further reduce risc-v sizes
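
The constant 0x12c and the wrap-around of `cnt` in the handler can be checked on the host. A sketch with hypothetical helper names, assuming 4 columns on pins 2, 3, 5 and 8 as in `common_simple`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OR together one bit per listed pin, e.g. for a BSRR/BRR write. */
static uint32_t pins_to_mask(const uint8_t *pins, size_t count)
{
	uint32_t mask = 0;
	for (size_t i = 0; i < count; i++) {
		assert(pins[i] < 16); /* 16 pins per GPIO port */
		mask |= UINT32_C(1) << pins[i];
	}
	return mask;
}

/* Down-counting column index that wraps from 0 to columns - 1. */
static uint32_t next_column(uint32_t cnt, uint32_t columns)
{
	if (cnt == 0)
		cnt = columns; /* 1 more than effective indexing */
	return cnt - 1;
}
```

The wrap is written as "reload, then decrement" so that it maps onto the `c.bnez` / `c.li` / `c.addi` sequence of the XTightlyCoupledIO listing.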

A.6. c2000 workshop sample

from [10], page 3-5.

"#define approach"

*TIMER0TCR |= 0x0010; // Stop CPU Timer0
*TIMER0TPRD32 = 0x00010000; // Load new 32-bit period value
*TIMER0TCR &= 0xFFEF; // Start CPU Timer0

"structure approach"

CpuTimer0Regs.TCR.bit.TSS = 1; // Stop CPU Timer0
CpuTimer0Regs.PRD.all = 0x00010000; // Load new 32-bit period value
CpuTimer0Regs.TCR.bit.TSS = 0; // Start CPU Timer0

c2000 "#define approach"

MOV @AL,*(0:0x0C04)		;4
ORB AL, #0x10			;2
MOV *(0:0x0C04), @AL	;4
MOVL XAR5, #0x010000	;4
MOVL XAR4, #0x000C0A	;4
MOVL *+XAR4[0], XAR5	;2
MOV @AL, *(0:0x0C04)	;4
AND @AL, #0xFFEF		;4
MOV *(0:0x0C04), @AL	;4

32 bytes and 9 cycles

c2000 "structure approach"

MOVW DP, #0030			;4/2?
OR @4, #0x0010			;4
MOVL XAR4, #0x010000	;4
MOVL @2, XAR4			;2
AND @4, #0xFFEF			;4

18 bytes (16 if DP can be done by MOVZ) and 5 cycles

risc-v + XTightlyCoupledIO

tio.cm.bseti TIMER0TCR, TSS_Pos
tio.bseti TIMER0TPRD32, zero, 16
tio.cm.bclri TIMER0TCR, TSS_Pos

8 bytes and 3 cycles (12 bytes if tio.cm is unavailable)

Note	when using a modern compiler (gcc, llvm), there should be no difference between defines and structures

Note	type punning by union bitfields is UB in C++ and implementation-defined in C [11]

A.7. c2000 "This is very efficient; there is a one-to-one correlation between C and assembly"

from [12], par 5.

This kind of code appears very frequently, especially in c2000 codebases. Even though it would be possible to coalesce all of it into a single write, compilers can’t do anything about that. Any optimization attempt by a compiler would change the resulting side effects, effectively breaking the code.

SysCtrlRegs.PCLKCR0.bit.rsvd1 = 0;
SysCtrlRegs.PCLKCR0.bit.TBCLKSYNC = 0;
SysCtrlRegs.PCLKCR0.bit.ADCENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.I2CAENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.rsvd2 = 0;
SysCtrlRegs.PCLKCR0.bit.SPICENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SPIDENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SPIAENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SPIBENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SCIAENCLK = 1;
SysCtrlRegs.PCLKCR0.bit.SCIBENCLK = 0;
SysCtrlRegs.PCLKCR0.bit.rsvd3 = 0;
SysCtrlRegs.PCLKCR0.bit.ECANAENCLK= 1;
SysCtrlRegs.PCLKCR0.bit.ECANBENCLK= 0;

c2000

MOVW DP,#0x01C0
AND @28,#0xFFFC
AND @28,#0xFFFB
OR @28,#0x0008
OR @28,#0x0010
AND @28,#0xFFDF
OR @28,#0x0040
OR @28,#0x0080
OR @28,#0x0100
OR @28,#0x0200
OR @28,#0x0400
AND @28,#0xF7FF
AND @28,#0xCFFF
OR @28,#0x4000
AND @28,#0x7FFF

60 bytes (58 if DP can be done by MOVZ)

Note	table 3 suggests it’s 6 cycles per one AMO-ALU instruction

risc-v + XTightlyCoupledIO

tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_rsvd1_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_TBCLKSYNC_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_ADCENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_I2CAENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_rsvd2_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPICENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPIDENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPIAENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SPIBENCLK_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SCIAENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_SCIBENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_rsvd3_Pos
tio.cm.bseti SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_ECANAENCLK_Pos
tio.cm.bclri SysCtrlRegs_PCLKCR0, SysCtrl_PCLKCR0_ECANBENCLK_Pos

28 bytes (56 if tio.cm is unavailable)
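
Why the sequence cannot be coalesced can be illustrated by replaying it against a simulated register that records every write. The model below is a sketch under stated assumptions: `sim_reg`, `rmw_and`, `rmw_or` and `replay_pclkcr0` are made-up names, the register is assumed to start at 0, and the masks are copied from the c2000 listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WRITES 32

typedef struct {
	uint16_t value;
	uint16_t log[MAX_WRITES]; /* every value that reached the "bus" */
	size_t writes;
} sim_reg;

static void sim_write(sim_reg *r, uint16_t v)
{
	assert(r->writes < MAX_WRITES);
	r->value = v;
	r->log[r->writes++] = v;
}

static void rmw_and(sim_reg *r, uint16_t mask) { sim_write(r, (uint16_t)(r->value & mask)); }
static void rmw_or(sim_reg *r, uint16_t mask)  { sim_write(r, (uint16_t)(r->value | mask)); }

/* Replays the 14 read-modify-write steps of the PCLKCR0 example. */
static void replay_pclkcr0(sim_reg *r)
{
	rmw_and(r, 0xFFFC); rmw_and(r, 0xFFFB);
	rmw_or(r, 0x0008);  rmw_or(r, 0x0010);
	rmw_and(r, 0xFFDF);
	rmw_or(r, 0x0040);  rmw_or(r, 0x0080);
	rmw_or(r, 0x0100);  rmw_or(r, 0x0200);
	rmw_or(r, 0x0400);
	rmw_and(r, 0xF7FF); rmw_and(r, 0xCFFF);
	rmw_or(r, 0x4000);  rmw_and(r, 0x7FFF);
}
```

The final value equals the 0x47D8 used in A.8, but the peripheral observes 14 separate writes with intermediate states, which a single coalesced write would not reproduce.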

A.8. c2000 magic value less, whole register write

from [12] par 5.

using magic value

SysCtrlRegs.PCLKCR0.all = 0x47D8;

using "shadow register"

// Enable only 2801 Peripheral Clocks
union PCLKCR0_REG shadowPCLKCR0;

shadowPCLKCR0.bit.rsvd1 = 0;
shadowPCLKCR0.bit.TBCLKSYNC = 0;
shadowPCLKCR0.bit.ADCENCLK = 1; // ADC
shadowPCLKCR0.bit.I2CAENCLK = 1; // I2C
shadowPCLKCR0.bit.rsvd2 = 0;
shadowPCLKCR0.bit.SPICENCLK = 1; // SPI-C
shadowPCLKCR0.bit.SPIDENCLK = 1; // SPI-D
shadowPCLKCR0.bit.SPIAENCLK = 1; // SPI-A
shadowPCLKCR0.bit.SPIBENCLK = 1; // SPI-B
shadowPCLKCR0.bit.SCIAENCLK = 1; // SCI-A
shadowPCLKCR0.bit.SCIBENCLK = 0; // SCI-B
shadowPCLKCR0.bit.rsvd3 = 0;
shadowPCLKCR0.bit.ECANAENCLK= 1; // eCAN-A
shadowPCLKCR0.bit.ECANBENCLK= 0; // eCAN-B
SysCtrlRegs.PCLKCR0.all = shadowPCLKCR0.all;

c2000 using magic value

MOVW DP,#0x01C0
MOV @28,#0x47D8

8 bytes (6 if DP can be done by MOVZ)

c2000 using "shadow register"

MOV @AL,#0x47D8
MOVW DP,#0x01C0
MOV @28,AL

10 bytes (8 if DP can be done by MOVZ)

risc-v + XTightlyCoupledIO

c.lui t0, %hi(0x47D8)
tio.addi SysCtrl_PCLKCR0, t0, %lo(0x47D8)

6 bytes

Note	when using a modern compiler (gcc, llvm), there should be no difference between magic values and "shadow register".

Note	usually vendors provide bitmask definitions for those bits so that the write can be constructed by bitwise operations on them, e.g. `SysCtrl.PCLKCR0 = SysCtrl_PCLKCR0_ADCENCLK | SysCtrl_PCLKCR0_I2CAENCLK […]`
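
Such a mask-based write might look as follows. The bit positions are inferred from the AND/OR masks of the c2000 listing in A.7; the macro names merely follow the naming used in this appendix and are not taken from a vendor header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bitmask definitions, positions inferred from A.7. */
#define SysCtrl_PCLKCR0_ADCENCLK   (UINT16_C(1) << 3)
#define SysCtrl_PCLKCR0_I2CAENCLK  (UINT16_C(1) << 4)
#define SysCtrl_PCLKCR0_SPICENCLK  (UINT16_C(1) << 6)
#define SysCtrl_PCLKCR0_SPIDENCLK  (UINT16_C(1) << 7)
#define SysCtrl_PCLKCR0_SPIAENCLK  (UINT16_C(1) << 8)
#define SysCtrl_PCLKCR0_SPIBENCLK  (UINT16_C(1) << 9)
#define SysCtrl_PCLKCR0_SCIAENCLK  (UINT16_C(1) << 10)
#define SysCtrl_PCLKCR0_ECANAENCLK (UINT16_C(1) << 14)

/* Whole-register value with only the listed clocks enabled. */
static uint16_t pclkcr0_2801_clocks(void)
{
	return (uint16_t)(SysCtrl_PCLKCR0_ADCENCLK
		| SysCtrl_PCLKCR0_I2CAENCLK
		| SysCtrl_PCLKCR0_SPICENCLK
		| SysCtrl_PCLKCR0_SPIDENCLK
		| SysCtrl_PCLKCR0_SPIAENCLK
		| SysCtrl_PCLKCR0_SPIBENCLK
		| SysCtrl_PCLKCR0_SCIAENCLK
		| SysCtrl_PCLKCR0_ECANAENCLK);
}

/* The expression is a compile time constant, so it costs nothing at run time. */
static_assert((SysCtrl_PCLKCR0_ADCENCLK | SysCtrl_PCLKCR0_ECANAENCLK) == 0x4008,
	"mask composition is a constant expression");
```

Bits that stay 0 (reserved fields, TBCLKSYNC, SCIBENCLK, ECANBENCLK) are simply left out of the expression.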

A.9. c2000 preserving write 1 to clear bits

from [12] par 6.2.

using "shadow register" to preserve TIF

union TCR_REG shadowTCR;
// Use a shadow register to stop the timer
// and preserve TIF (write 1-to-clear bit)
shadowTCR.all = CpuTimer0Regs.TCR.all;
shadowTCR.bit.TSS = 1;
shadowTCR.bit.TIF = 0;
CpuTimer0Regs.TCR.all = shadowTCR.all;

// Check the TIF flag
if(CpuTimer0Regs.TCR.bit.TIF == 1)
{
	// TIF set, insert action here
	// NOP is only a place holder
	asm("NOP");
}

c2000

	MOVW DP,#0x0030			;4/2?
	MOV AL,@4				;2
	ORB AL,#0x10			;2
	MOVL XAR5,#0x000C00		;4
	AND AL,@AL,#0x7FFF		;4
	MOV *+XAR5[4],AL		;2
	TBIT *+XAR5[4],#15		;4
	SBF L1,NTC				;2 (7 bit forward range)
	NOP						;2 ; placeholder
L1:

26 bytes (24 if DP can be done by MOVZ)

risc-v + XTightlyCoupledIO

	tio.bclri t0, CpuTimer0_TCR, CpuTimer0_TCR_TIF_Pos
	tio.bseti CpuTimer0_TCR, t0, CpuTimer0_TCR_TSS_Pos
	tio.bsbclri CpuTimer0_TCR, CpuTimer0_TCR_TIF_Pos, L1 // 11 bit forward range
	nop // placeholder
L1:

14 bytes

Note	write 1 to clear bits are usually separated from control registers
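
The hazard that the shadow register avoids in this example can be modelled on the host. The sketch below assumes a simplified register where only TIF (bit 15) is write 1 to clear and every other bit is plain read/write; the function names are hypothetical and the bit positions are taken from the `ORB AL,#0x10` and `TBIT ...,#15` instructions above.

```c
#include <assert.h>
#include <stdint.h>

#define TCR_TSS (UINT16_C(1) << 4)  /* stop bit, plain read/write */
#define TCR_TIF (UINT16_C(1) << 15) /* flag, write 1 to clear */

static_assert((TCR_TSS & TCR_TIF) == 0, "bits must not overlap");

/* Simulated hardware: new register content after a write to TCR. */
static uint16_t tcr_hw_write(uint16_t current, uint16_t written)
{
	uint16_t flag = (uint16_t)(current & TCR_TIF);
	if (written & TCR_TIF)
		flag = 0; /* writing 1 clears the flag, writing 0 keeps it */
	return (uint16_t)((written & (uint16_t)~TCR_TIF) | flag);
}

/* Naive `TCR |= TSS`: the read returns TIF = 1 and writes it back. */
static uint16_t stop_timer_naive(uint16_t tcr)
{
	return tcr_hw_write(tcr, (uint16_t)(tcr | TCR_TSS));
}

/* Shadow approach: force TIF to 0 in the written value. */
static uint16_t stop_timer_preserving(uint16_t tcr)
{
	uint16_t shadow = (uint16_t)((tcr | TCR_TSS) & ~TCR_TIF);
	return tcr_hw_write(tcr, shadow);
}
```

Starting from a pending flag (0x8000), the naive variant ends at 0x0010 and loses the flag, while the preserving variant ends at 0x8010.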

A.10. c2000 32bit only peripherals

from [12] par 7.

using "shadow register" to force 32bit access

union CANMC_REG shadowCANMC;

// 32-bit read of CANMC
shadowCANMC.all = ECanaRegs.CANMC.all;
shadowCANMC.bit.SCB = 1;

// 32-bit write of CANMC
ECanaRegs.CANMC.all = shadowCANMC.all;

c2000

MOVW DP,#0x0180	;4/2?
MOVL ACC,@20	;2
OR @AL,#0x2000	;4
MOVL @20,ACC	;2

12 bytes (10 if DP can be done by MOVZ)

risc-v + XTightlyCoupledIO

tio.cm.bseti ECana_CANMC, ECana_CANMC_SCB_Pos

2 bytes (4 if tio.cm is unavailable)

Note	risc-v is naturally 32bit, no gimmicks required.

Note

C/C++ allows compilers to generate narrower accesses to type-punned volatile bitfields, and it does happen [13],[14] unless the -fstrict-volatile-bitfields flag is provided. Therefore an explicit volatile load/store must always be performed to safely use type punning by bitfields. (c2000 cannot do accesses narrower than 16 bits, so its compiler cannot break 16-bit peripherals.)
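
The explicit full-width load/store can also be written without bitfields at all. A minimal sketch, assuming the common expectation that an aligned `volatile uint32_t` access is emitted as a single 32-bit access; `reg32_set_bits` is a hypothetical helper and `CANMC_SCB` is inferred from the `OR @AL,#0x2000` instruction above:

```c
#include <assert.h>
#include <stdint.h>

#define CANMC_SCB (UINT32_C(1) << 13) /* inferred from `OR @AL,#0x2000` */

/* One explicit 32-bit load, modify in a local, one explicit 32-bit store. */
static inline void reg32_set_bits(volatile uint32_t *reg, uint32_t mask)
{
	assert(reg != 0);
	uint32_t shadow = *reg; /* single full-width volatile read */
	shadow |= mask;
	*reg = shadow;          /* single full-width volatile write */
}
```

On real hardware the pointer would be the address of the memory mapped register; on the host any `volatile uint32_t` variable can stand in for it.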

A.11. stm32g4 buck converter (3p3z, voltage mode)

Note	interrupt overhead and related optimizations are out of scope of XTightlyCoupledIO, therefore only a C function scenario is analysed. See [20] for further irq latency analysis.

Magic numbers and overall design follow [21], which provides the following assumptions:

Vref (aka target voltage, not to be confused with ADC reference voltage) set by DAC on the differential ADC, or subtracted by ADC from result (ADC_OFRy).
ADC handles sign extension to 16 bits (right adjusted)
timer saturates the output to maximum duty (assuming >16bit values are not produced, or are handled by timer)
anti windup accumulator saturation (as used by denominators) not considered
early conversion trigger not available

Note

On stm32g4 it is possible to do a c2000/dsPIC-like early ADC trigger, but only "normal" channels can generate the EOSMP flag. This is useless, because such a conversion can be interrupted by injected channels (and injected channels are used for control loops). The early trigger happens at least 36 cycles ahead of the end of conversion (@170MHz, 12.5 cycle conv, 60MHz adc clk), requiring an additional wait loop.
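
The 36 cycle figure appears to follow from converting the 12.5 ADC clock cycles of conversion time into CPU cycles and rounding up. A small sketch of that arithmetic, using integer math to avoid rounding surprises (the function name and the half-cycle unit are choices made for this example):

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles elapsing during an ADC conversion, rounded up.
 * conv_half_cycles: conversion time in units of half an ADC clock cycle,
 * so that 12.5 cycles can be passed as 25. */
static uint32_t conv_time_in_cpu_cycles(uint32_t conv_half_cycles,
                                        uint32_t cpu_hz, uint32_t adc_hz)
{
	assert(adc_hz != 0);
	uint64_t num = (uint64_t)conv_half_cycles * cpu_hz;
	uint64_t den = 2u * (uint64_t)adc_hz;
	return (uint32_t)((num + den - 1) / den); /* ceiling division */
}
```

For 25 half cycles at 170 MHz CPU clock and 60 MHz ADC clock the exact ratio is about 35.4, hence at least 36 CPU cycles.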

3p3z compensator irq, implemented using transposed direct form II IIR filter

#include <algorithm>

// those numbers are obtained by external calculators/tools
#define B0 (1.553498447795f)
#define B1 (-1.361492224301f)
#define B2 (-1.547612874966f)
#define B3 (1.367377797130f)
#define A1 (1.521558814252f)
#define A2 (-0.356458881462f)
#define A3 (-0.165099932790f)
#define K (115.36533642f)

// aggregate into struct to avoid address loads of every single global variable
typedef struct {
	// delay line
	float Z[3]; // -1 indexed, as Z0 is handled on the fly

	// keep those constants in memory as compilers are trying
	// to put them right next to the code causing von neumann bottleneck
	// It is possible to optimize those out when something fits
	// in `fli.s` (Zfa) or `lui` (Zfinx) instructions, but that's a lot of manual work
	float b[4];
	float a[3]; // -1 indexed, as a0 is skipped
} TDF2_3p3z;

TDF2_3p3z buck2 = {
	.Z = {},
	.b = {B0*K,B1*K,B2*K,B3*K},
	.a = {A1,A2,A3}
};

void ADC1_IRQHandler()
{
	// ADC doesn't sign extend to 32bits, cast it by load insn
	float in = (float)(*(volatile int16_t*)&ADC1->DR);

	float out = buck2.Z[0] + in * buck2.b[0];

	// saturate negative, timer saturates positive
	// casting directly to unsigned is UB in C/C++ (and it does break on x86)
	HRTIM1_TIMA->CMP1xR = std::max((int)out, 0);

	// defer non critical code to after the timer write
	ADC1->ISR = ADC_ISR_JEOC; // ack the interrupt

	buck2.Z[0] = buck2.Z[1] + in * buck2.b[1] + out * buck2.a[1 - 1];
	buck2.Z[1] = buck2.Z[2] + in * buck2.b[2] + out * buck2.a[2 - 1];
	buck2.Z[2] = in * buck2.b[3] + out * buck2.a[3 - 1];
}

risc-v

ADC1_IRQHandler():                   # @ADC1_IRQHandler()
	lui     a0, 327680
	lh      a1, 64(a0)
	lui     a2, %hi(buck2)
	flw     ft0, %lo(buck2)(a2)
	addi    a3, a2, %lo(buck2)
	flw     ft1, 12(a3)
	fcvt.s.w	ft2, a1
	fmadd.s ft0, ft2, ft1, ft0
	fcvt.w.s	a1, ft0, rtz
	max     a1, a1, zero
	lui     a4, 262167
	sw      a1, -1892(a4)
	li      a1, 32
	sw      a1, 0(a0)
	flw     ft1, 4(a3)
	flw     ft3, 16(a3)
	flw     ft4, 28(a3)
	fmadd.s ft1, ft2, ft3, ft1
	fmadd.s ft1, ft0, ft4, ft1
	fsw     ft1, %lo(buck2)(a2)
	flw     ft1, 8(a3)
	flw     ft3, 20(a3)
	flw     ft4, 32(a3)
	fmadd.s ft1, ft2, ft3, ft1
	flw     ft3, 36(a3)
	flw     ft5, 24(a3)
	fmadd.s ft1, ft0, ft4, ft1
	fsw     ft1, 4(a3)
	fmul.s  ft0, ft0, ft3
	fmadd.s ft0, ft2, ft5, ft0
	fsw     ft0, 8(a3)
	ret

Note	recent llvm versions allocate fa0-fa5 registers first

armv7-m

ADC1_IRQHandler():
	mov     r1, #1342177280
	ldr     r0, .L2
	ldrsh   r3, [r1, #64]
	vmov    s14, r3 @ int
	ldr     r3, .L2+4
	vcvt.f32.s32    s14, s14
	vldr.32 s13, [r3, #12]
	vldr.32 s15, [r3]
	vldr.32 s12, [r3, #16]
	vfma.f32	s15, s13, s14
	vcvt.s32.f32    s13, s15
	vmov    r2, s13 @ int
	vldr.32 s13, [r3, #4]
	vfma.f32	s13, s12, s14
	bic     r2, r2, r2, asr #31
	str     r2, [r0, #156]
	vldr.32 s12, [r3, #28]
	vfma.f32	s13, s12, s15
	movs    r2, #32
	str     r2, [r1]
	vldr.32 s12, [r3, #20]
	vstr.32 s13, [r3]
	vldr.32 s13, [r3, #8]
	vfma.f32	s13, s12, s14
	vldr.32 s12, [r3, #32]
	vfma.f32	s13, s12, s15
	vstr.32 s13, [r3, #4]
	vldr.32 s13, [r3, #36]
	vmul.f32	s15, s15, s13
	vldr.32 s13, [r3, #24]
	vfma.f32	s15, s13, s14
	vstr.32 s15, [r3, #8]
	bx      lr
.L2:
	.word   1073833984
	.word   .LANCHOR0

Note	gcc result was manipulated with a non-volatile cast due to a missing optimization: `float in = (float)(*(int16_t*)&ADC1->DR);`

risc-v + XTightlyCoupledIO

ADC1_IRQHandler():
	// if ADC did sign extension to whole 32 bits we could convert it directly
	//tio.fcvt.s.w fa0, ADC1_DR // tio.fcvt.s.h not available
	tio.sbfextracti a1, ADC1_DR, 0, 16 // tio.sext.h not available
	fcvt.s.w fa0, a1
	lui a0, %hi(buck2) // lui+addi not needed when it can be `gp` relaxed
	addi a0, a0, %lo(buck2) // can be omitted if struct doesn't span +/-2KiB boundary
	flw fa1, 0(a0) // Z[0]
	flw fa2, 12(a0) // b[0]
	fmadd.s fa1, fa0, fa2, fa1
	fcvt.w.s a1, fa1
	tio.max HRTIM1_TIMA_CMP1xR, a1, zero
	tio.bseti ADC1_ISR, zero, ADC_ISR_JEOC_Pos // can also tio.addi
	flw fa2, 4(a0) // Z[1]
	flw fa3, 16(a0) // b[1]
	fmadd.s fa2, fa0, fa3, fa2
	flw fa3, 28(a0) // a[0]
	fmadd.s fa2, fa1, fa3, fa2
	fsw fa2, 0(a0) // Z[0]
	flw fa2, 8(a0) // Z[2]
	flw fa3, 20(a0) // b[2]
	fmadd.s fa2, fa0, fa3, fa2
	flw fa3, 32(a0) // a[1]
	fmadd.s fa2, fa1, fa3, fa2
	fsw fa2, 4(a0) // Z[1]
	flw fa3, 24(a0) // b[3]
	fmul.s fa2, fa0, fa3
	flw fa3, 36(a0) // a[2]
	fmadd.s fa2, fa1, fa3, fa2
	fsw fa2, 8(a0) // Z[2]
	ret

Note	register pressure is 2 scalar and 4 fp registers, or possibly 5 total with Zfinx. Applying pipeline optimizations may increase it a bit.

Assuming all instructions execute in 1 cycle and there are no pipeline hazard bubbles:

Results

	risc-v	armv7-m	risc-v + XTightlyCoupledIO
total cycles (possible)	32(30)	33	28(26)
non filter loads/stores	2	4(2 pcrel)	0
cycles to PWM (possible)	12(9)	16(13)	9(7)

Note

The "possible" results assume gp relaxing of all filter variable loads and deferring all unnecessary instructions. Additional unaccounted cycles can be gained by saturating negative values to zero in the float to int conversion, which is UB in C/C++ (1 cycle in risc-v and 2 in armv7-m). Another gain is the ADC sign extending to 32 bits (1 cycle for armv7-m and XTightlyCoupledIO).
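
For reference, a conversion that saturates negative values without relying on undefined behaviour can be sketched as below. The helper name is hypothetical, and the upper clamp is an addition made for this example; the handler above leaves positive saturation to the timer.

```c
#include <assert.h>
#include <stdint.h>

/* Defined-behaviour float to unsigned conversion.
 * Negative values, zero and NaN give 0, values above `max` give `max`. */
static uint32_t float_to_u32_sat(float value, uint32_t max)
{
	assert(max <= 0x00FFFFFFu); /* keeps (float)max exact */
	if (!(value > 0.0f))
		return 0;
	if (value >= (float)max)
		return max;
	return (uint32_t)value; /* in range, truncates toward zero */
}
```

The two comparisons are the cost that the "possible" figures avoid by letting the conversion instruction itself saturate.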

A.11.1. ultra low latency 3p3z

In order to reduce phase erosion (by up to 18 degrees according to [22],[21]), the ADC blanking period has to be extended towards the end of the switching cycle.
The following techniques can be employed to improve sample to PWM update latency.

Use ADC early trigger (if available). When the conversion extends over the computations, an explicit wait may be necessary before reading the result (`while(!(ADC1->ISR & ADC_ISR_JEOSMP));`, which can resolve to `1: tio.bsbclri ADC1_ISR, ADC_ISR_JEOSMP_Pos, 1b`) [22]
Use transposed direct form II with natural result latency of 1 MAC operation
(in e.g. Direct form I) precompute most of the accumulations in previous cycle, as described in [22]
Defer the state updates to after the write to timer registers
apply gain factor (K) to the numerator coefficients instead of applying it separately. (straightforward only with FP implementations)
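
Two of the points above, the single MAC output latency of transposed direct form II and folding K into the numerator, can be checked with a host-side model. This is a sketch only: it uses `double` instead of `float`, the names are hypothetical, and the `a` coefficients use the same sign convention as the handler in A.11 (they are added).

```c
#include <assert.h>

typedef struct {
	double z[3]; /* delay line */
	double b[4]; /* numerator b0..b3 */
	double a[3]; /* denominator a1..a3, added as in A.11 */
} tdf2_3p3z;

/* One sample of transposed direct form II: the output needs a single MAC. */
static double tdf2_step(tdf2_3p3z *f, double in)
{
	assert(f != 0);
	double out = f->z[0] + in * f->b[0];
	/* state updates, deferrable to after the output has been used */
	f->z[0] = f->z[1] + in * f->b[1] + out * f->a[0];
	f->z[1] = f->z[2] + in * f->b[2] + out * f->a[1];
	f->z[2] = in * f->b[3] + out * f->a[2];
	return out;
}

/* Largest difference, over n samples of a unit step, between applying
 * gain k to the output and folding k into the numerator coefficients. */
static double gain_folding_error(const double b[4], const double a[3],
                                 double k, int n)
{
	tdf2_3p3z separate = {{0}, {b[0], b[1], b[2], b[3]}, {a[0], a[1], a[2]}};
	tdf2_3p3z folded = {{0}, {b[0] * k, b[1] * k, b[2] * k, b[3] * k},
	                    {a[0], a[1], a[2]}};
	double worst = 0.0;
	for (int i = 0; i < n; i++) {
		double y_sep = k * tdf2_step(&separate, 1.0);
		double y_fold = tdf2_step(&folded, 1.0);
		double err = y_sep - y_fold;
		if (err < 0.0)
			err = -err;
		if (err > worst)
			worst = err;
	}
	return worst;
}
```

Folding K removes one multiplication from the path between the ADC read and the timer write; both variants describe the same transfer function and differ only by rounding.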

Note

Some of these techniques can affect the latency by imposing additional register pressure. Even if output is computed early, the compiler will spill all registers before executing actual code. Compilers also have tendency to reschedule code around the sensitive IO write. Therefore assembly implementations may be necessary.

A.11.2. Vref subtracted in software

In case of simple ADC, the target voltage can be subtracted directly from ADC readout:

int32_t in = ADC1->DR - Vref;

which can resolve to:

// Vref in a0
tio.sub a0, ADC1_DR, a0

alternatively:

// Vref in fa0
tio.fcvt.s.wu fa1, ADC1_DR
fsub.s fa0, fa1, fa0

A.11.3. further latency improvements

by adc early trigger

1:	tio.bsbclri ADC1_ISR, ADC_ISR_JEOSMP_Pos, 1b
	tio.fcvt.s.w fa0, ADC1_DR
	fmadd.s fa1, fa0, fa2, fa1 // * b[0] + Z[0]
	fcvt.w.s a1, fa1
	tio.max HRTIM1_TIMA_CMP1xR, a1, zero

Note	`tio.bsbclri/seti` can be implemented as a pipeline stall until the condition is met, which avoids introducing jitter/latency from branch overhead.

by "preserving shadow registers" (+Zfinx)

// allocation of preserving shadow registers
// x4 - x
// x5 - A1
// x6 - A2
// x7 - A3
// x12 - x
// x13 - x
// x14 - x
// x15 - x
// x20 - Z[0]
// x21 - Z[1]
// x22 - Z[2]
// x23 - x
// x28 - B0
// x29 - B1
// x30 - B2
// x31 - B3

ADC1_IRQHandler():
	tio.fcvt.s.w a0, ADC1_DR // adc sign extends
	fmadd.s a1, a0, x28, x20
	fcvt.wu.s a2, a1 // saturate negative, UB in C/C++
	tio.cm.mv HRTIM1_TIMA_CMP1xR, a2 // can do tio.max after fcvt.w
	tio.bseti ADC1_ISR, zero, ADC_ISR_JEOC_Pos // can also tio.addi
	fmadd.s x20, a0, x29, x21
	fmadd.s x21, a0, x30, x22
	fmul.s x22, a0, x31 // a0 no longer needed, can do the PWM here at lower reg pressure
	fmadd.s x20, a1, x5, x20
	fmadd.s x21, a1, x6, x21
	fmadd.s x22, a1, x7, x22
	ret

12 instructions total, 4 to pwm (7 at lower reg pressure, which fits rv32e regs). It can be implemented only in assembly.

Note	in [20] initializing those shadow registers would require triggering a SW deferred handler configured at the desired nesting priority.

Note	FMA4 instructions allow getting rid of 2 unnecessary moves (as per TDF2 implementation)

Bibliography

[1] https://web.archive.org/web/20111213030633/http://www.atmel.com/dyn/resources/prod_documents/DOC1292.PDF
[2] https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf
[3] https://www.ti.com/lit/ug/spruij2/spruij2.pdf?ts=1678361442691
[4] https://mythopoeic.org/BBB-PRU/am335xPruReferenceGuide.pdf
[5] https://www.ti.com/lit/ug/spru430f/spru430f.pdf?ts=1677869437551
[6] https://www.ti.com/lit/an/spracw5a/spracw5a.pdf
[7] https://github.com/jnk0le/random/tree/master/stm32_7segment
[8] https://liu.diva-portal.org/smash/get/diva2:1636414/FULLTEXT01.pdf
[9] https://github.com/mjbots/moteus/commit/a398d0c4fde08ea5a585bbf0d53da6be422e0915
[10] http://www.ee.iitb.ac.in/~ccgroup/old/Lab_pages/experiment_files/TI.pdf
[11] https://stackoverflow.com/questions/24542964/aliasing-type-punning-unions-structs-and-bit-fields-in-c99
[12] http://staff.ii.pw.edu.pl/kowalski/dsp/F28x/F2808_page/spraa85a.pdf
[13] https://stackoverflow.com/questions/67340350/bitfield-write-size
[14] https://stackoverflow.com/questions/42171429/force-gcc-to-access-structs-with-words
[15] https://opensecuritytraining.info/IntroBIOS_files/Day1_04_Advanced%20x86%20-%20BIOS%20and%20SMM%20Internals%20-%20IO.pdf
[16] https://www.xmos.com/download/The-XMOS-XS1-Architecture(X7879A).pdf
[17] https://www.xmos.com/download/XMOS-Programming-Guide-(documentation)(E).pdf
[18] https://www.nxp.com/docs/en/reference-manual/DSP56000UM.pdf
[19] https://www.nxp.com/docs/en/reference-manual/DSP56800FM.pdf
[20] https://github.com/jnk0le/riscv-total-embedded
[21] https://www.st.com/resource/en/application_note/an5305-digital-filter-implementation-with-the-fmac-using-stm32cubeg4-mcu-package-stmicroelectronics.pdf
[22] https://www.how2power.com/newsletters/1603/articles/H2PToday1603_design_Microchip.pdf?NOREDIR=1
