2 RISC-V Bitmanip Extension
In the proposals provided in this chapter, the C code examples are for illustration purposes only. They are not optimal implementations, but are intended to specify the desired functionality.
The final standard will likely define a range of Z-extensions for different bit manipulation instructions, with the “B” extension itself being a mix of instructions from those Z-extensions. It is unclear as of yet what this will look like exactly, but it will probably look something like this:
The main open questions are:
Which of the “Zb*” extensions should be included in “B”?
There is some redundancy/overlap with
add[i]wu
,subwu
,addu.w
,subu.w
, andslliu.w
. Which instructions should stay?Which “Zbp” pseudo-ops should be included in “Zbb”?
These decisions will be informed in big part by evaluations of the cost and added value for the individual instructions.
For the purpose of tool-chain development “B” is currently everything.
For extensions that only implement certain pseudo-instructions (such as “Zbb”
implements rev8
and rev
, which are pseudo-instructions for grevi rd, rs1, -8
and grevi rd, rs1, -1
respectively, the same binary encoding is used for those
instructions as are used on a core with full support for the grev[i]
instruction.
Like in the base ISA, we add W
instruction variants on RV64 with the
semantic of the matching RV32 instruction. Those instructions ignore the upper
32 bit of their input and sign-extend their 32 bit output values. The W
instruction is omitted when the 64-bit instruction produces the same result as
the W
instruction would when the 64-bit instruction is fed sign-extended
32 bit values.
2.1 Basic bit manipulation instructions
2.1.1 Count Leading/Trailing Zeros (clz, ctz
)
RV32, RV64: clz rd, rs ctz rd, rs
RV64 only: clzw rd, rs ctzw rd, rs
The clz
operation counts the number of 0 bits at the MSB end of the
argument. That is, the number of 0 bits before the first 1 bit counting from
the most significant bit. If the input is 0, the output is XLEN. If the input
is -1, the output is 0.
The ctz
operation counts the number of 0 bits at the LSB end of the
argument. If the input is 0, the output is XLEN. If the input is -1, the
output is 0.
The expression XLEN-1-clz(x)
evaluates to the index of the most significant
set bit, also known as integer base-2 logarithm, or -1 if x
is zero.
These instructions are commonly used for scanning bitmaps for set bits, for
example in malloc()
, in binary GCD, or in priority queues such as the
sched_find_first_bit()
function used in the Linux kernel real-time
scheduler.
Another common applications include normalization in fixed-point code and soft float libraries, null suppression in data compression.
2.1.2 Count Bits Set (pcnt
)
RV32, RV64: pcnt rd, rs
RV64 only: pcntw rd, rs
This instruction counts the number of 1 bits in a register. This operations is known as population count, popcount, sideways sum, bit summation, or Hamming weight. [HammingWeight Warren12]
2.1.3 Logic-with-negate (andn
, orn
, xnor
)
RV32, RV64: andn rd, rs1, rs2 orn rd, rs1, rs2 xnor rd, rs1, rs2
This instructions implement AND, OR, and XOR with the 2nd arument inverted.
This can use the existing inverter on rs2 in the ALU that’s already there to implement subtract.
Among other things, those instructions allow implementing the “trailing bit
manipulation” code patterns in two instructions each. For example, (x - 1) & ~x
produces a mask from trailing zero bits in x
.
2.1.4 Pack two words in one register (pack, packu, packh
)
RV32, RV64: pack rd, rs1, rs2 packu rd, rs1, rs2 packh rd, rs1, rs2
RV64 only: packw rd, rs1, rs2 packuw rd, rs1, rs2
The pack
instruction packs the XLEN/2-bit lower halves of rs1 and rs2 into
rd, with rs1 in the lower half and rs2 in the upper half.
The packu
instruction packs the upper halves of rs1 and rs2 into rd.
And the packh
instruction packs the LSB bytes of rs1 and rs2 into the 16 LSB bits
of rd, zero extending the rest of rd.
Applications include XLEN/2-bit funnel shifts, zero-extend XLEN/2 bit values, duplicate the lower XLEN/2 bits (e.g. for mask creation), loading unsigned 32 constants on RV64, and packing C structs that fit in a register and are therefore passed in a register according to the RISC-V calling convention.
; Constructing a 32-bit int from four bytes (RV32)
packh a0, a0, a1
packh a1, a2, a3
pack a0, a0, a1
; Load 0xffff0000ffff0000 on RV64
lui rd, 0xffff0
pack rd, rd, rd
; Same as FSLW on RV64
pack rd, rs1, rs3
rol rd, rd, rs2
addiw rd, rd, 0
; Clear the upper half of rd
pack rd, rd, zero
Paired with shfli/unshfli
and the other bit permutation instructions,
pack can interleave arbitrary power-of-two chunks of rs1
and rs2
. For
example, interleaving the bytes in the lower halves of rs1
and rs2
:
pack rd, rs1, rs2
zip8 rd, rd
pack
is most commonly used to zero-extend words <XLEN.
For this purpose we define the following assembler pseudo-ops:
RV32:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> pack rd, rs, zero
RV64:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> packw rd, rs, zero
zext.w rd, rs -> pack rd, rs, zero
RV128:.
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> packw rd, rs, zero
zext.w rd, rs -> packd rd, rs, zero
zext.d rd, rs -> pack rd, rs, zero
2.1.5 Min/max instructions (min, max, minu, maxu
)
RV32, RV64: min rd, rs1, rs2 max rd, rs1, rs2 minu rd, rs1, rs2 maxu rd, rs1, rs2
We define 4 R-type instructions min, max, minu, maxu
with the
following semantics:
Code that performs saturated arithmetic on a word size < XLEN
needs to perform
min/max operations frequently. A simple way of performing those operations without branching
can benefit those programs.
Some applications spend a lot of time on calculating the absolute values of
signed integers. One example of that would be SAT solvers, due to the way CNF
literals are commonly encoded [BiereComm]. With max
(or
minu
) this is a two-instruction operation:
neg a1, a0
max a0, a0, a1
2.1.6 Sign-extend instructions (sext.b, sext.h
)
RV32, RV64: sext.b rd, rs sext.h rd, rs
For sign-extending a byte or half-word we define two unary instructions:
Additionally, we define pseudo-instructions for zero extending bytes or half-words and for zero- and sign-extending words:
RV32:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> pack rd, rs, zero
RV64:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> packw rd, rs, zero
zext.w rd, rs -> pack rd, rs, zero
sext.w rd, rs -> addiw rd, rs, 0
Sign extending 8-bit and 16-bit values is needed when calling a function that accepts an 8-bit or 16-bit signed argument, because the RISC-V calling conventions dictates that such an argument must be passed in sign-extended form.
2.1.7 Single-bit instructions (sbset, sbclr, sbinv, sbext
)
RV32, RV64: sbset rd, rs1, rs2 sbclr rd, rs1, rs2 sbinv rd, rs1, rs2 sbext rd, rs1, rs2 sbseti rd, rs1, imm sbclri rd, rs1, imm sbinvi rd, rs1, imm sbexti rd, rs1, imm
RV64: sbsetw rd, rs1, rs2 sbclrw rd, rs1, rs2 sbinvw rd, rs1, rs2 sbextw rd, rs1, rs2 sbsetiw rd, rs1, imm sbclriw rd, rs1, imm sbinviw rd, rs1, imm
We define 4 single-bit instructions sbset
(set), sbclr
(clear),
sbinv
(invert), and sbext
(extract), and their immediate-variants,
with the following semantics:
2.1.8 Shift Ones (Left/Right) (slo, sloi, sro, sroi
)
RV32, RV64: slo rd, rs1, rs2 sro rd, rs1, rs2 sloi rd, rs1, imm sroi rd, rs1, imm
RV64 only: slow rd, rs1, rs2 srow rd, rs1, rs2 sloiw rd, rs1, imm sroiw rd, rs1, imm
These instructions are similar to shift-logical operations from the base spec, except instead of shifting in zeros, they shift in ones.
ISAs with flag registers often have a "Shift in Carry" or "Rotate through Carry" instruction. Arguably a "Shift Ones" is an equivalent on an ISA like RISC-V that avoids such flag registers.
The main application for the Shift Ones instruction is mask generation.
When implementing this circuit, the only change in the ALU over a standard logical shift is that the value shifted in is not zero, but is a 1-bit register value that has been forwarded from the high bit of the instruction decode. This creates the desired behavior on both logical zero-shifts and logical ones-shifts.
2.2 Bit permutation instructions
The following sections describe 3 types of bit permutation instructions: Rotate shift, generalized reverse, and generalized shuffle.
A bit permutation essentially is applying an invertible function to the bit addresses. (Bit addresses are 5 bit values on RV32 and 6 bit values on RV64.)
Rotate shift by k is simply addition (rol
) or subtraction (ror
) modulo XLEN.
i′rot := i ± kmod XLEN
Generalized reverse with control argument k is simply XOR-ing the bit addresses with k:
i′grev := i ⊕ k
And generalized shuffle is performing a bit permutation on the bits of the bit addresses:
i′shfl := permk(i)
With the caveat that a single shfl
/unshfl
instruction can only
perferm a certain sub-set of bit address permutations, but a sequence of 4 shfl
/unshfl
instructions can perform any of the 120 such permutations on
RV32, and a sequence of 5 shfl
/unshfl
instructions can perform any
of the 720 such permutations on RV64.
Combining those three types of operations makes a vast number of bit permutations accessible within only a few instructions [Wolf19A] (see Table [numperms]).
N | ROT-only | GREV-only | SHFL-only | ROT+GREV | ROT+GREV+SHFL |
---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 1 |
1 | 32 | 32 | 24 | 62 | 85 |
2 | — | — | 86 | 864 | 3 030 |
3 | — | — | 119 | 4 640 | 78 659 |
4 | — | — | 120 | 23 312 | 2 002 167 |
5 | — | — | — | 92 192 | 50 106 844 |
6 | — | — | — | 294 992 | 1 234 579 963 |
7 | — | — | — | 703 744 | est. 30 000 000 000 |
8 | — | — | — | 1 012 856 | est. 700 000 000 000 |
9 | — | — | — | 1 046 224 | est. 15 000 000 000 000 |
10 | — | — | — | 1 048 576 | ⋯ ⋯ ⋯ ⋯ ⋯ |
11 | — | — | — | — | ⋯ ⋯ ⋯ ⋯ ⋯ |
Sequences of ror
, grev
, and [un]shfl
instructions can
generate any arbitrary bit permutation. Often in surprising ways. For example,
the following sequence swaps the two LSB bits of a0
:
Instruction | State (XLEN=8) | Bit-Index Op |
---|---|---|
initial value | 7 6 5 4 3 2 1 0 | — |
rori a0, a0, 2 |
1 0 7 6 5 4 3 2 | i′ := i − 2 |
unshfli a0, a0, -1 |
1 7 5 3 0 6 4 2 | i′ := ror(i) |
roli a0, a0, 1 |
7 5 3 0 6 4 2 1 | i′ := i + 1 |
shfli a0, a0, -1 |
7 6 5 4 3 2 0 1 | i′ := rol(i) |
rori a0, a0, 2
unshfli a0, a0, -1
roli a0, a0, 1
shfli a0, a0, -1
The mechanics of this sequence is closely related to the fact that rol(ror(x-2)+1)
is a function that maps 1 to 0 and 0 to 1 and every other
number to itself. (With rol
and ror
denoting 1-bit rotate left and
right shifts respectively.) See Table [permdemo] for details.
The numbers in the right column of Table [numperms] might appear large,
but they are tiny in comparison to the total number of 32-bit permutations
(2.63 ⋅ 1035). Fortunately the [un]shfl
instruction “explores”
permutations that involve fields that have a size that’s a power-of-two,
and/or moves that are a power-of-two first, which means we get shorter sequences
for permutations we more often care about in real-world applications.
For example, there are 24 ways of arranging the four bytes in a 32-bit word.
ror
, grev
, and [un]shfl
can perform any of those permutations
in at most 3 instructions. See Table [permbytes] for a list of those 24
sequences.
There are 40320 ways of arranging the eight bytes in a 64-bit word or the eight
nibbles in a 32-bit word. ror
, grev
, and [un]shfl
can perform
any of those permutations in at most 9 instructions. [Wolf19A]
Besides the more-or-less arbitrary permutations we get from combining ror
, grev
, and [un]shfl
in long sequences, there are of course
many cases where just one of these instructions does exactly what we need.
For grev
and [un]shfl
we define rev*
and [un]zip*
pseudo instructions for the most common use-cases.
2.2.1 Rotate (Left/Right) (rol, ror, rori
)
RV32, RV64: ror rd, rs1, rs2 rol rd, rs1, rs2 rori rd, rs1, imm
RV64 only: rorw rd, rs1, rs2 rolw rd, rs1, rs2 roriw rd, rs1, imm
These instructions are similar to shift-logical operations from the base spec, except they shift in the values from the opposite side of the register, in order. This is also called ‘circular shift’.

ror
permutation network2.2.2 Generalized Reverse (grev
, grevi
, rev
)
RV32, RV64: grev rd, rs1, rs2 grevi rd, rs1, imm
RV64 only: grevw rd, rs1, rs2 greviw rd, rs1, imm
This instruction provides a single hardware instruction that can implement all of byte-order swap, bitwise reversal, short-order-swap, word-order-swap (RV64), nibble-order swap, bitwise reversal in a byte, etc, all from a single hardware instruction.
The Generalized Reverse (GREV) operation iteratively checks each bit i in the 2nd argument from i = 0 to log2(XLEN) − 1, and if the corresponding bit is set, swaps each adjacent pair of 2i bits.

grev
permutation networkThe above pattern should be intuitive to understand in order to extend this definition in an obvious manner for RV128.

grevi
instructionThe grev
operation can easily be implemented using a permutation
network with log2(XLEN) stages. Figure 1.1
shows the permutation network for ror
for reference.
Figure 1.2 shows the permutation network for grev
.
Pseudo-instructions are provided for the most common GREVI use-cases. Their names consist of a prefix and and optional suffix. Each prefix and suffix corresponds to a bit mask (see Table [grevi-names]). The GREVI control word is obtained by AND-ing the two masks together.
Prefix | Mask | Suffix | Mask | ||
---|---|---|---|---|---|
rev |
111111 | — | 111111 | ||
rev2 |
111110 | .w |
011111 | (w = word) |
|
rev4 |
111100 | .h |
001111 | (h = half word) |
|
rev8 |
111000 | .b |
000111 | (b = byte) |
|
rev16 |
110000 | .n |
000011 | (n = nibble) |
|
rev32 |
100000 | .p |
000001 | (p = pair) |
In other words, the prefix controls the number of zero bits at the LSB end of the control word, and the suffix controls the number of zeros at the MSB end of the control word.
rev8
reverses the order of bytes in a word, thus performs endianness conversion. This
is equivalent to the ARM REV
instructions or BSWAP
on x86. ARM also has instructions
for swapping the bytes in 16-bit and 32-bit words, and reversing the bit order (see table [grevi-cmp]).
RISC-V | ARM | X86 |
---|---|---|
rev |
RBIT |
— |
rev8.h |
REV16 |
— |
rev8.w |
REV32 |
— |
rev8 |
REV |
BSWAP |
2.2.3 Generalized Shuffle (shfl
, unshfl
, shfli
, unshfli
, zip
, unzip
)
RV32, RV64: shfl rd, rs1, rs2 unshfl rd, rs1, rs2 shfli rd, rs1, imm unshfli rd, rs1, imm
RV64 only: shflw rd, rs1, rs2 unshflw rd, rs1, rs2
Shuffle is the third bit permutation instruction in the RISC-V Bitmanip extension, after rotary shift and generalized reverse. It implements a generalization of the operation commonly known as perfect outer shuffle and its inverse (shuffle/unshuffle), also known as zip/unzip or interlace/uninterlace.
Bit permutations can be understood as reversible functions on bit indices (i.e. 5 bit functions on RV32 and 6 bit functions on RV64).
A generalized (un)shuffle operation has log2(XLEN) − 1 control bits,
one for each pair of neighbouring bits in a bit index. When the bit is set,
generalized shuffle will swap the two index bits. The shfl
operation
performs this swaps in MSB-to-LSB order (performing a rotate left shift on
contiguous regions of set control bits), and the unshfl
operation performs
the swaps in LSB-to-MSB order (performing a rotate right shift on contiguous
regions of set control bits). Combining up to log2(XLEN) of those
shfl
/unshfl
operations can implement any bitpermutation on the
bit indices.
The most common type of shuffle/unshuffle operation is one on an immediate
control value that only contains one contiguous region of set bits. We call
those operations zip/unzip and provide pseudo-instructions for them. The naming
scheme for those pseudo-instructions is similar to the naming scheme for the
grevi
pseudo-instructions (see Tables 1.3 and
[grevi-names]), except that the LSB bit of the masks in Table [grevi-names]
is not used for zip/unzip.
Shuffle/unshuffle operations that only have individual bits set (not a contiguous region of two or more bits) are their own inverse.

shfli
/unshfli
instruction
shfli
/unshfli
instruction
Like GREV and rotate shift, the (un)shuffle instruction can be implemented using a short sequence of elementary permutations, that are enabled or disabled by the shamt bits. But (un)shuffle has one stage fewer than GREV. Thus shfli+unshfli together require the same amount of encoding space as grevi.
Or for RV64:
The above pattern should be intuitive to understand in order to extend this definition in an obvious manner for RV128.
Alternatively (un)shuffle can be implemented in a single network with one more stage than GREV, with the additional first and last stage executing a permutation that effectively reverses the order of the inner stages. However, since the inner stages only mux half of the bits in the word each, a hardware implementation using this additional “flip” stages might actually be more expensive than simply creating two networks.
Figure 1.7 shows the (un)shuffle permutation network with “flip” stages and Figure 1.6 shows the (un)shuffle permutation network without “flip” stages.

The zip
instruction with the upper half of its input cleared performs
the commonly needed “fan-out” operation. (Equivalent to bdep
with a
0x55555555 mask.) The zip
instruction applied twice fans out the bits
in the lower quarter of the input word by a spacing of 4 bits.
For example, the following code calculates the bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32:
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 4
c.add a0, a1
slli a1, a0, 8
c.add a0, a1
slli a1, a0, 16
c.add a0, a1
The final prefix sum is stored in the 8 nibbles of the a0
output word.
Similarly, the following code stores the indices of the set bits in the LSB nibbles of the output word (with the LSB bit having index 1), with the unused MSB nibbles in the output set to zero:
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
orc.n a0, a0
li a1, 0x87654321
and a1, a0, a1
bext a0, a1, a0
Other zip
modes can be used to “fan-out” in blocks of 2, 4, 8, or 16 bit.
zip
can be combined with grevi
to perform inner shuffles. For example
on RV64:
li a0, 0x0000000012345678
zip4 t0, a0 ; <- 0x0102030405060708
rev4.b t1, t0 ; <- 0x1020304050607080
zip8 t2, a0 ; <- 0x0012003400560078
rev8.h t3, t2 ; <- 0x1200340056007800
zip16 t4, a0 ; <- 0x0000123400005678
rev16.w t5, t4 ; <- 0x1234000056780000
Another application for the zip instruction is generating Morton code [MortonCode].
The x86 PUNPCK[LH]*
MMX/SSE/AVX instructions perform similar operations
as zip8
and zip16
.
2.2.4 Crossbar Permutation Instructions (xperm.[nbhw]
)
RV32, RV64: xperm.n rd, rs1, rs2 xperm.b rd, rs1, rs2 xperm.h rd, rs1, rs2
RV64 only: xperm.w rd, rs1, rs2
These instructions operate on nibbles/bytes/half-words/words. rs1
is
a vector of data words and rs2
is a vector of indices into rs1
.
The result of the instruction is the vector rs2
with each element replaced
by the corresponding data word from rs1
, or zero then the index in rs2
is out of bounds.
The xperm.[nbhw]
instructions can be implemented with an XLEN/4-lane
nibble-wide crossbar switch.
The xperm.n
instruction can be used to implement an arbitrary 64-bit
bit permutation in 15 instructions, using 8 control words and 3 constant masks [Wolf20A]:
uint64_t perm64_bitmanip_cmix(perm64t &p, uint64_t v)
{
uint64_t v0 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[0]) & p.mask[0], 3); // 3 insns
uint64_t v1 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[1]) & p.mask[1], 3); // 6 insns
uint64_t v2 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[2]) & p.mask[2], 3); // 9 insns
uint64_t v3 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[3]) & p.mask[3], 3); // 12 insns
v0 = _rv_cmix(0x1111111111111111LL, v0, v1); // 13 insns
v2 = _rv_cmix(0x4444444444444444LL, v2, v3); // 14 insns
return _rv_cmix(0x3333333333333333LL, v0, v2); // 15 insns
}

gorci
instruction2.3 Generalized OR-Combine (gorc, gorci
)
RV32, RV64: gorc rd, rs1, rs2 gorci rd, rs1, imm
RV64 only: gorcw rd, rs1, rs2 gorciw rd, rs1, imm
The GORC operation is similar to GREV, except that instead of swapping pairs of bits, GORC ORs them together, and writes the new value in both positions.
GORC can be usefull for copying naturally aligned fields in a word, and testing such fields for being equal zero.
gorci
pseudo-instructions follow the same naming scheme as grevi
pseudo-instructions (see Tables 1.3 and [grevi-names]),
except the prefix orc
is used instead of rev
. See
Table 1.8 for a full list of gorci
pseudo-instructions.
An important use-case is strlen()
and strcpy()
, which can utilize
orc.b
for testing for zero bytes, and counting trailing non-zero bytes
in a word.
2.4 Bit-Field Place (bfp
)
RV32, RV64: bfp rd, rs1, rs2
RV64 only: bfpw rd, rs1, rs2
The bit field place (bfp
) instruction places up to XLEN/2 LSB bits from rs2
into
the value in rs1
. The upper bits of rs2
control the length of the bit field
and target position. The layout of rs2
is chosen in a way that makes it possible to construct
rs2
easily using pack[h]
instructions and/or andi
/lui
.
The layout of the control word in rs2 is as follows for RV32. LEN=0
encodes for LEN=16
.
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------|---------------|---------------|---------------|
| | LEN | | OFF | DATA |
|---------------|---------------|---------------|---------------|
And on RV64 (with LEN=0
encoding for LEN=32
):
| 6 5 4 3 |
|3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 .... 2 1 0|
|---------------|---------------|---------------|---------------|------ -- ------|
|SEL| | LEN | | OFF | | LEN' | | OFF' | DATA |
|---------------|---------------|---------------|---------------|------ -- ------|
When SEL=10
then LEN
and OFF
are used, otherwise LEN’
and OFF’
are used.
Placing bits from a0
in a1
, with results in t0
on RV32:
addi t0, zero, {length[3:0], offset[7:0]}
pack t0, a0, t0
bfp t0, a1, t0
And on RV64:
lui t0, zero, {3'b 100, length[4:0], offset[7:0], 4'b 0000}
pack t0, a0, t0
bfp t0, a1, t0
Or with a2=length
and a3=offset
:
packh t0, a3, a2
pack t0, a0, t0
bfp t0, a1, t0
Placing up to 16 constant bits in any contiguous region:
lui t0, ...
addi t0, t0, ...
bfp[w] t0, a1, t0
Note that above sequences only modify one register (t0
), which makes them
fuse-able sequences.
2.5 Bit Extract/Deposit (bext, bdep
)
RV32, RV64: bext rd, rs1, rs2 bdep rd, rs1, rs2
RV64 only: bextw rd, rs1, rs2 bdepw rd, rs1, rs2
This instructions implement the generic bit extract and bit deposit functions. This operation is also referred to as bit gather/scatter, bit pack/unpack, parallel extract/deposit, compress/expand, or right_compress/right_expand.
bext
collects LSB justified bits to rd from rs1 using extract mask in rs2.
bdep
writes LSB justified bits from rs1 to rd using deposit mask in rs2.
Implementations may choose to use smaller multi-cycle implementations of
bext
and bdep
, or even emulate the instructions in software.
Even though multi-cycle bext
and bdep
often are not fast
enough to outperform algorithms that use sequences of shifts and bit masks,
dedicated instructions for those operations can still be of great advantage in
cases where the mask argument is not constant.
For example, the following code efficiently calculates the index of the tenth
set bit in a0
using bdep
:
li a1, 0x00000200
bdep a0, a1, a0
ctz a0, a0
For cases with a constant mask an optimizing compiler would decide when to use
bext
or bdep
based on the optimization profile for the
concrete processor it is optimizing for. This is similar to the decision
whether to use MUL or DIV with a constant, or to perform the same operation
using a longer sequence of much simpler operations.
The bext
and bdep
instructions are equivalent to the x86 BMI2
instructions PEXT and PDEP. But there is much older prior art. For example, the
soviet BESM-6 mainframe computer, designed and built in the 1960s, had APX/AUX
instructions with almost the same semantics. [BESM6] (The BESM-6 APX/AUX
instructions packed/unpacked at the MSB end instead of the LSB end. Otherwise
it is the same instruction.)
Efficient hardware implementations of bext
and bdep
are described
in [Hilewitz06] and demonstrated in [Wolf17B].
2.6 Carry-Less Multiply (clmul, clmulh, clmulr
)
RV32, RV64: clmul rd, rs1, rs2 clmulh rd, rs1, rs2 clmulr rd, rs1, rs2
RV64 only: clmulw rd, rs1, rs2 clmulhw rd, rs1, rs2 clmulrw rd, rs1, rs2
Calculate the carry-less product [CarryLessProduct] of the two arguments. clmul
produces the lower half of the carry-less product and clmulh
produces the upper half
of the 2⋅XLEN carry-less product.
Carry-less multiplication is equivalent to multiplication in the polynomial ring over GF(2).
clmulr
produces bits 2⋅XLEN − 2:XLEN-1 of the 2⋅XLEN carry-less product.
That means clmulh
is equivalent to clmulr
followed by a 1-bit right shift.
(The MSB of a clmulh
result is always zero.) Another equivalent definition of
clmulr
is clmulr(a,b) := rev(clmul(rev(a), rev(b)))
. (The “r”
in clmulr
means reversed.)
Unlike mulh[[s]u]
, we add a *W variant of clmulh
. This is because we expect
some code to use 32-bit clmul intrisics, even on 64-bit architectures. For example in cases
where data is processed in 32-bit chunks.
The classic applications for clmul
are Cyclic Redundancy Check (CRC) [FastCRC Wolf18A]
and Galois/Counter Mode (GCM), but more applications exist, including the following examples.
There are obvious applications in hashing and pseudo random number generations. For example, it has been reported that hashes based on carry-less multiplications can outperform Google’s CityHash [CLHASH].
clmul
of a number with itself inserts zeroes between each input bit. This can
be useful for generating Morton code [MortonCode].
clmul
of a number with -1 calculates the prefix XOR operation. This can
be useful for decoding gray codes.
Another application of XOR prefix sums calculated with clmul
is
branchless tracking of quoted strings in high-performance parsers. [ParseJSON]
Carry-less multiply can also be used to implement Erasure code efficiently. [ClmulErasureCode]
SPARC introduced similar instructions (XMULX, XMULXHI) in SPARC T3 in 2010. [sparct3]
TI C6000 introduced a similar instruction (XORMPY) in C64x+. [c64xp]
2.7 CRC Instructions (crc32.[bhwd], crc32c.[bhwd]
)
RV32, RV64: crc32.b rd, rs crc32.h rd, rs crc32.w rd, rs crc32c.b rd, rs crc32c.h rd, rs crc32c.w rd, rs
RV64 only: crc32.d rd, rs crc32c.d rd, rs
Unary Cyclic Redundancy Check (CRC) instructions that interpret the bits of rs1 as a CRC32/CRC32C state and perform a polynomial reduction of that state shifted left by 8, 16, 32, or 64 bits.
The instructions return the new CRC32/CRC32C state.
The crc32.w
/crc32c.w
instructions are equivalent to executing
crc32.h
/crc32c.h
twice, and crc32.h
/crc32c.h
instructions are equivalent to executing crc32.b
/crc32c.b
twice.
All 8 CRC instructions operate on bit-reflected data.
Payload data must be XOR’ed into the LSB end of the state before executing the
CRC instruction. The following code demonstrates the use of crc32.b
:
uint32_t crc32_demo(const uint8_t *p, int len)
{
uint32_t x = 0xffffffff;
for (int i = 0; i < len; i++) {
x = x ^ p[i];
x = crc32_b(x);
}
return ~x;
}
In terms of binary polynomial arithmetic those instructions perform the operation
$$\texttt{rd}'(x) = (\texttt{rs1}'(x) \cdot x^N) \; \textrm{mod} \; \{\texttt{1}, P'\}(x)\textrm,$$
with N ∈ {8, 16, 32, 64},
P = 0xEDB8_8320
for CRC32 and P = 0x82F6_3B78
for CRC32C,
a′ denoting the XLEN bit reversal of a,
and {a, b} denoting bit concatenation.
Note that for example for CRC32 {1
, P′} = 0x1_04C1_1DB7
on RV32 and {1
, P′} = 0x1_04C1_1DB7_0000_0000
on RV64.
These dedicated CRC instructions are meant for RISC-V implementations without fast multiplier
and therefore without fast clmul[h]
. For implementations with fast clmul[h]
it is recommended to use the methods described in [FastCRC] and demonstrated in [Wolf18A]
that can process XLEN input bits using just one carry-less multiply for arbitrary CRC polynomials.
In applications where those methods are not applicable it is possible to emulate the dedicated CRC
instructions using two carry-less multiplies that implement a Barrett reduction. The following example
implements a replacement for crc32.w
(RV32).
crc32_w:
li t0, 0xF7011641
li t1, 0xEDB88320
clmul a0, a0, t0
clmulr a0, a0, t1
ret
2.8 Bit-Matrix Instructions (bmatxor, bmator, bmatflip
, RV64 only)
RV64 only: bmator rd, rs1, rs2 bmatxor rd, rs1, rs2 bmatflip rd, rs
These are 64-bit-only instruction that are not available on RV32. On RV128 they ignore the upper half of operands and sign extend the results.
This instructions interpret a 64-bit value as 8x8 binary matrix.
bmatxor
performs a matrix-matrix multiply with boolean AND as multiply
operator and boolean XOR as addition operator.
bmator
performs a matrix-matrix multiply with boolean AND as multiply
operator and boolean OR as addition operator.
bmatflip
is a unary operator that transposes the source matrix. It is
equivalent to zip; zip; zip
on RV64.
Among other things, bmatxor
/bmator
can be used to perform
arbitrary permutations of bits within each byte (permutation matrix as 2nd
operand) or perform arbitrary permutations of bytes within a 64-bit word
(permutation matrix as 1st operand).
There are similar instructions in Cray XMT [CrayXMT]. The Cray X1
architecture even has a full 64x64 bit matrix multiply unit [CrayX1].
(See Section [bmat64] for how to implement 64x64 bit matix operations
with bmat[x]or
.)
The MMIX architecture has MOR and MXOR instructions with the same semantic. [Knuth4A]
The x86 EVEX/VEX/SSE instruction GF2P8AFFINEQB is equivalent to bmatxor
.
The bmm.8
instruction proposed in [Hilewitz08] is also equivalent to bmatxor
.
2.9 Ternary Bit-Manipulation Instructions
2.9.1 Conditional Mix (cmix
)
RV32, RV64: cmix rd, rs2, rs1, rs3
(Note that the assembler syntax of cmix
has the rs2
argument first
to make assembler code more readable. But the reference C code code below uses
the “architecturally correct” argument order rs1, rs2, rs3
.)
The cmix rd, rs2, rs1, rs3
instruction selects bits from rs1
and rs3
based
on the bits in the control word rs2
.
It replaces sequences like the following.
and rd, rs1, rs2
andn t0, rs3, rs2
or rd, rd, t0
Using cmix
a single butterfly stage can be implemented in only two
instructions. Thus, arbitrary bit-permutations can be implemented using only
18 instruction (32 bit) or 22 instructions (64 bits).
2.9.2 Conditional Move (cmov
)
RV32, RV64: cmov rd, rs2, rs1, rs3
(Note that the assembler syntax of cmov
has the rs2
argument first
to make assembler code more readable. But the reference C code code below uses
the “architecturally correct” argument order rs1, rs2, rs3
.)
The cmov rd, rs2, rs1, rs3
instruction selects rs1
if the control
word rs2
is non-zero, and rs3
if the control word is zero.
The cmov
instruction helps avoiding branches, which can lead to better
performance, and helps with constant-time code as used in some cryptography
applications.
2.9.3 Funnel Shift (fsl
, fsr
, fsri
)
RV32, RV64: fsl rd, rs1, rs3, rs2 fsr rd, rs1, rs3, rs2 fsri rd, rs1, rs3, imm
RV64 only: fslw rd, rs1, rs3, rs2 fsrw rd, rs1, rs3, rs2 fsriw rd, rs1, rs3, imm
(Note that the assembler syntax for funnel shifts has the rs2
argument
last to make assembler code more readable. But the reference C code code below
uses the “architecturally correct” argument order rs1, rs2, rs3
.)
The fsl rd, rs1, rs3, rs2
instruction creates a 2 ⋅ XLEN word
by concatenating rs1 and rs3 (with rs1 in the MSB half), rotate-left-shifts that
word by the amount indicated in the log2(XLEN) + 1 LSB bits in rs2, and
then writes the MSB half of the result to rd.
The fsr rd, rs1, rs3, rs2
instruction creates a 2 ⋅ XLEN word
by concatenating rs1 and rs3 (with rs1 in the LSB half), rotate-right-shifts that
word by the amount indicated in the log2(XLEN) + 1 LSB bits in rs2, and
then writes the LSB half of the result to rd.
A shift unit capable of either fsl
or fsr
is capable of performing all
the other shift functions, including the other funnel shift, with only minimal additional
logic.
For any values of A
, B
, and C
:
fsl(A, B, C) = fsr(A, -B, C)
And for any values x
and 0 ≤ shamt
< XLEN
:
sll(x, shamt) == fsl(x, shamt, 0)
srl(x, shamt) == fsr(x, shamt, 0)
sra(x, shamt) == fsr(x, shamt, sext_x)
slo(x, shamt) == fsl(x, shamt, ~0)
sro(x, shamt) == fsr(x, shamt, ~0)
ror(x, shamt) == fsr(x, shamt, x)
rol(x, shamt) == fsl(x, shamt, x)
Furthermore an RV64 implementation of either fsl
or fsr
is capable
of performing the *W versions of all shift operations with only a few gates
of additional control logic.
On RV128 there is no fsri
instruction. But there is fsriw
and fsrid
.
2.10 Address calculation instructions
RV32, RV64: sh1add rd, rs1, rs2 sh2add rd, rs1, rs2 sh3add rd, rs1, rs2
RV64 only: sh1addu.w rd, rs1, rs2 sh2addu.w rd, rs1, rs2 sh3addu.w rd, rs1, rs2
These instructions shift rs1
left by 1, 2, or 3 bits, then add the result
to rs2
. The sh?addu.w
instructions are identical to sh?add
, except
that bits XLEN-1:32 of the rs1
argument are cleared before the shift.
An opcode for sh4add
/sh4addu.w
for RV128 and/or RVQ is reserved.
2.11 Mixed uint32/int64 arithmetic instructions
Consider RV64 C code that’s using unsigned 32-bit ints as array indices. For example:
char u32i64demo(char *p, unsigned int i) {
return p[i-1];
}
In such cases the expression within p[…]
must overflow according to
32-bit arithmetic, then be zero-extended, and then this zero-extended result
must be used in the address calculation.
The instructions below make sure that no explicit zext.w
instruction
is needed in such cases, to make sure there is no systematic performance
penalty for code like shown above on RV64 compared to RV32.
2.11.1 Add/sub with postfix zero-extend (addwu
, subwu
, addiwu
)
RV64: addwu rd, rs1, rs2 subwu rd, rs1, rs2 addiwu rd, rs1, imm
These instructions are identical to addw
, subw
, addiw
,
except that bits XLEN-1:32 of the result are cleared after the addition. I.e.
these instructions zero-extend instead of sign-extend the 32-bit result.
The 12-bit immediate to addiwu
is sign-extended, exactly like the immediate
to addiw
.
2.11.2 Add/sub/shift with prefix zero-extend (addu.w
, subu.w
, slliu.w
)
RV64: addu.w rd, rs1, rs2 subu.w rd, rs1, rs2 slliu.w rd, rs1, imm
slliu.w
is identical to slli
, except that bits XLEN-1:32 of the
rs1
argument are cleared before the shift.
addu.w
and subu.w
are identical to add
and sub
, except
that bits XLEN-1:32 of the rs2
argument are cleared before the add/subtract.
[bextcref-slliuw] [bextcref-adduw]
2.12 Opcode Encodings
This chapter contains proposed encodings for most of the instructions described in this document. DO NOT IMPLEMENT THESE OPCODES YET. We are trying to get official opcodes assigned and will update this chapter soon with the official opcodes.
The andn
, orn
, and xnor
instruction are encoded the same way
as and
, or
, and xor
, but with op[30]
set, mirroring the
encoding scheme used for add
and sub
.
All shift instructions use funct3=001
for left shifts and funct3=101
for right shifts. Just like in the RISC-V integer base ISA, the shift-immediate
instructions have a 5 bit immediate on RV32, and a 6 bit immediate on RV64, and we
reserve encoding space for a 7 bit immediate for RV128. The same sizes apply
to sbseti
, sbclri
, sbinvi
, and sbexti
.
The immediate for shfli
/unshufli
is one bit smaller than the immediate
for shift instructions, that is 4 bits on RV32, 5 bits on RV64, and we reserve 6
bits for RV128.
op[26]=1
selects funnel shifts. For funnel shifts op[30:29]
is part
if the 3rd operand and therefore unused for encoding the operation. For all other
shift operations op[26]=0
.
fsri
is also encoded with op[26]=1
, leaving a 6 bit immediate. The 7th
bit, that is necessary to perform a 128 bit funnel shift on RV64, can be
emulated by swapping rs1 and rs3.
There is no shfliw
instruction. The slliu.w
instruction occupies
the encoding slot that would be occupied by shfliw
.
On RV128 op[26]
contains the MSB of the immediate for the shift instructions.
Therefore there is no FSRI instruction on RV128. (But there is FSRIW/FSRID.)
| SLL SRL SRA | SLO SRO | ROL ROR | FSL FSR
op[30] | 0 0 1 | 0 0 | 1 1 | - -
op[29] | 0 0 0 | 1 1 | 1 1 | - -
op[26] | 0 0 0 | 0 0 | 0 0 | 1 1
funct3 | 001 101 101 | 001 101 | 001 101 | 001 101
Only an encoding for RORI exists, as ROLI can be implemented with RORI by negating the immediate. Unary functions are encoded in the spot that would correspond to ROLI, with the function encoded in the 5 LSB bits of the immediate.
The CRC instructions are encoded as unary instructions with op[24]
set. The
polynomial is selected via op[23]
, with op[23]=0
for CRC32 and
op[23]=1
for CRC32C. The width is selected with op[22:20]
, using
the same encoding as is used in funct3
for load/store operations.
cmix
and cmov
are encoded using the two remaining ternary operator
encodings in funct3=001
and funct3=101
. (There are two ternary
operator encodings per minor opcode using the op[26]=1
scheme for
marking ternary OPs.)
The single-bit instructions are also encoded within the shift opcodes, with
op[27]
set, and using op[30]
and op[29]
to select the operation:
| SBCLR SBSET SBINV | SBEXT GORC GREV
op[30] | 1 0 1 | 1 0 1
op[29] | 0 1 1 | 0 1 1
op[27] | 1 1 1 | 1 1 1
funct3 | 001 001 001 | 101 101 101
There is no sbextiw
instruction as it can be emulated trivially using
sbexti
. However, there is sbsetiw
, sbclriw
, and sbinviw
as changing bit 31 would change the sign extend. There are non-immediate *W
instructions of all single-bit instructions, including sbextw
, because
the number of used bits in rs2 is different in sbext
and sbextw
.
GORC and GREV are encoded in the two remaining slots in the single-bit instruction encoding space.
The remaining instructions are encoded within funct7=0000100
and
funct7=0000101
.
The funct7=0000101
block contains clmul[hr]
,
min[u]
, and max[u]
.
The encoding of clmul, clmulr, clmulh
is identical to the encoding of
mulh, mulhsu, mulhu
, except that op[27]=1
.
The encoding of min[u]
/max[u]
uses funct3=100..111
. The
funct3
encoding matches op[31:29]
of the AMO min/max functions.
The remaining instructions are encoded within funct7=0000100
. The
shift-like shfl
/unshfl
instructions uses the same funct3
values as the shift operations. bdep
and bext
are encoded in a
way so that funct3[2]
selects the “direction”, similar to shift
operations.
bmat[x]or
use funct3=011
and funct3=111
in funct7=0000100
.
pack
occupies funct3=100
in funct7=0000100
.
addwu
and subwu
are encoded like addw
and subw
, except
that op[25]=1
and op[27]=1
.
addu.w
and subu.w
are encoded like addw
and subw
, except
that op[27]=1
.
addiwu
is encoded using funct3=100
(XOR) instead of funct3=000
in OP-32.
Finally, RV64 has W
instructions for all bitmanip instructions, with the
following exceptions:
andn
, cmix
, cmov
, min[u]
, max[u]
have no W
variants because they already behave in the way a W
instruction would
when presented with sign-exteded 32-bit arguments.
bmatflip
, bmatxor
, bmator
have no W
variants because
they are 64-bit only instructions.
crc32.[bhwd]
, crc32c.[bhwd]
have no W
variants because crc32[c].w
is deemed sufficient.
There is no [un]shfliw
, as a perfect outer shuffle always preserves the
MSB bit, thus [un]shfli
preserves proper sign extension when the
upper bit in the control word is set. There’s still [un]shflw
that
masks that upper control bit and sign-extends the output.
Relevant instruction encodings from the base ISA are included in the table below
and are marked with a .
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------------------------------------------------------|
| funct7 | rs2 | rs1 | f3 | rd | opcode | R-type
| rs3 | f2| rs2 | rs1 | f3 | rd | opcode | R4-type
| imm | rs1 | f3 | rd | opcode | I-type
|===============================================================|
| 0000000 | rs2 | rs1 | 111 | rd | 0110011 | AND*
| 0000000 | rs2 | rs1 | 110 | rd | 0110011 | OR*
| 0000000 | rs2 | rs1 | 100 | rd | 0110011 | XOR*
| 0100000 | rs2 | rs1 | 111 | rd | 0110011 | ANDN
| 0100000 | rs2 | rs1 | 110 | rd | 0110011 | ORN
| 0100000 | rs2 | rs1 | 100 | rd | 0110011 | XNOR
|---------------------------------------------------------------|
| 0000000 | rs2 | rs1 | 001 | rd | 0110011 | SLL*
| 0000000 | rs2 | rs1 | 101 | rd | 0110011 | SRL*
| 0100000 | rs2 | rs1 | 101 | rd | 0110011 | SRA*
| 0010000 | rs2 | rs1 | 001 | rd | 0110011 | SLO
| 0010000 | rs2 | rs1 | 101 | rd | 0110011 | SRO
| 0110000 | rs2 | rs1 | 001 | rd | 0110011 | ROL
| 0110000 | rs2 | rs1 | 101 | rd | 0110011 | ROR
|---------------------------------------------------------------|
| 0010000 | rs2 | rs1 | 010 | rd | 0110011 | SH1ADD
| 0010000 | rs2 | rs1 | 100 | rd | 0110011 | SH2ADD
| 0010000 | rs2 | rs1 | 110 | rd | 0110011 | SH3ADD
|---------------------------------------------------------------|
| 0100100 | rs2 | rs1 | 001 | rd | 0110011 | SBCLR
| 0010100 | rs2 | rs1 | 001 | rd | 0110011 | SBSET
| 0110100 | rs2 | rs1 | 001 | rd | 0110011 | SBINV
| 0100100 | rs2 | rs1 | 101 | rd | 0110011 | SBEXT
| 0010100 | rs2 | rs1 | 101 | rd | 0110011 | GORC
| 0110100 | rs2 | rs1 | 101 | rd | 0110011 | GREV
|---------------------------------------------------------------|
| 00000 | imm | rs1 | 001 | rd | 0010011 | SLLI*
| 00000 | imm | rs1 | 101 | rd | 0010011 | SRLI*
| 01000 | imm | rs1 | 101 | rd | 0010011 | SRAI*
| 00100 | imm | rs1 | 001 | rd | 0010011 | SLOI
| 00100 | imm | rs1 | 101 | rd | 0010011 | SROI
| 01100 | imm | rs1 | 101 | rd | 0010011 | RORI
|---------------------------------------------------------------|
| 01001 | imm | rs1 | 001 | rd | 0010011 | SBCLRI
| 00101 | imm | rs1 | 001 | rd | 0010011 | SBSETI
| 01101 | imm | rs1 | 001 | rd | 0010011 | SBINVI
| 01001 | imm | rs1 | 101 | rd | 0010011 | SBEXTI
| 00101 | imm | rs1 | 101 | rd | 0010011 | GORCI
| 01101 | imm | rs1 | 101 | rd | 0010011 | GREVI
|---------------------------------------------------------------|
| rs3 | 11| rs2 | rs1 | 001 | rd | 0110011 | CMIX
| rs3 | 11| rs2 | rs1 | 101 | rd | 0110011 | CMOV
| rs3 | 10| rs2 | rs1 | 001 | rd | 0110011 | FSL
| rs3 | 10| rs2 | rs1 | 101 | rd | 0110011 | FSR
| rs3 |1| imm | rs1 | 101 | rd | 0010011 | FSRI
|---------------------------------------------------------------|
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|===============================================================|
| 0110000 | 00000 | rs1 | 001 | rd | 0010011 | CLZ
| 0110000 | 00001 | rs1 | 001 | rd | 0010011 | CTZ
| 0110000 | 00010 | rs1 | 001 | rd | 0010011 | PCNT
| 0110000 | 00011 | rs1 | 001 | rd | 0010011 | BMATFLIP
| 0110000 | 00100 | rs1 | 001 | rd | 0010011 | SEXT.B
| 0110000 | 00101 | rs1 | 001 | rd | 0010011 | SEXT.H
|---------------------------------------------------------------|
| 0110000 | 10000 | rs1 | 001 | rd | 0010011 | CRC32.B
| 0110000 | 10001 | rs1 | 001 | rd | 0010011 | CRC32.H
| 0110000 | 10010 | rs1 | 001 | rd | 0010011 | CRC32.W
| 0110000 | 10011 | rs1 | 001 | rd | 0010011 | CRC32.D
| 0110000 | 11000 | rs1 | 001 | rd | 0010011 | CRC32C.B
| 0110000 | 11001 | rs1 | 001 | rd | 0010011 | CRC32C.H
| 0110000 | 11010 | rs1 | 001 | rd | 0010011 | CRC32C.W
| 0110000 | 11011 | rs1 | 001 | rd | 0010011 | CRC32C.D
|---------------------------------------------------------------|
| 0000101 | rs2 | rs1 | 001 | rd | 0110011 | CLMUL
| 0000101 | rs2 | rs1 | 010 | rd | 0110011 | CLMULR
| 0000101 | rs2 | rs1 | 011 | rd | 0110011 | CLMULH
| 0000101 | rs2 | rs1 | 100 | rd | 0110011 | MIN
| 0000101 | rs2 | rs1 | 101 | rd | 0110011 | MAX
| 0000101 | rs2 | rs1 | 110 | rd | 0110011 | MINU
| 0000101 | rs2 | rs1 | 111 | rd | 0110011 | MAXU
|---------------------------------------------------------------|
| 0000100 | rs2 | rs1 | 001 | rd | 0110011 | SHFL
| 0000100 | rs2 | rs1 | 101 | rd | 0110011 | UNSHFL
| 0100100 | rs2 | rs1 | 110 | rd | 0110011 | BDEP
| 0000100 | rs2 | rs1 | 110 | rd | 0110011 | BEXT
| 0000100 | rs2 | rs1 | 100 | rd | 0110011 | PACK
| 0100100 | rs2 | rs1 | 100 | rd | 0110011 | PACKU
| 0000100 | rs2 | rs1 | 011 | rd | 0110011 | BMATOR
| 0100100 | rs2 | rs1 | 011 | rd | 0110011 | BMATXOR
| 0000100 | rs2 | rs1 | 111 | rd | 0110011 | PACKH
| 0100100 | rs2 | rs1 | 111 | rd | 0110011 | BFP
|---------------------------------------------------------------|
| 000010 | imm | rs1 | 001 | rd | 0010011 | SHFLI
| 000010 | imm | rs1 | 101 | rd | 0010011 | UNSHFLI
|===============================================================|
| immediate | rs1 | 000 | rd | 0011011 | ADDIW*
| immediate | rs1 | 100 | rd | 0011011 | ADDIWU
| 00001 | imm | rs1 | 001 | rd | 0011011 | SLLIU.W
|---------------------------------------------------------------|
| 0000000 | rs2 | rs1 | 000 | rd | 0111011 | ADDW*
| 0100000 | rs2 | rs1 | 000 | rd | 0111011 | SUBW*
| 0000101 | rs2 | rs1 | 000 | rd | 0111011 | ADDWU
| 0100101 | rs2 | rs1 | 000 | rd | 0111011 | SUBWU
| 0000100 | rs2 | rs1 | 000 | rd | 0111011 | ADDU.W
| 0100100 | rs2 | rs1 | 000 | rd | 0111011 | SUBU.W
|---------------------------------------------------------------|
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|===============================================================|
| 0010000 | rs2 | rs1 | 001 | rd | 0111011 | SLOW
| 0010000 | rs2 | rs1 | 101 | rd | 0111011 | SROW
| 0110000 | rs2 | rs1 | 001 | rd | 0111011 | ROLW
| 0110000 | rs2 | rs1 | 101 | rd | 0111011 | RORW
|---------------------------------------------------------------|
| 0010000 | rs2 | rs1 | 010 | rd | 0111011 | SH1ADDU.W
| 0010000 | rs2 | rs1 | 100 | rd | 0111011 | SH2ADDU.W
| 0010000 | rs2 | rs1 | 110 | rd | 0111011 | SH3ADDU.W
|---------------------------------------------------------------|
| 0100100 | rs2 | rs1 | 001 | rd | 0111011 | SBCLRW
| 0010100 | rs2 | rs1 | 001 | rd | 0111011 | SBSETW
| 0110100 | rs2 | rs1 | 001 | rd | 0111011 | SBINVW
| 0100100 | rs2 | rs1 | 101 | rd | 0111011 | SBEXTW
| 0010100 | rs2 | rs1 | 101 | rd | 0111011 | GORCW
| 0110100 | rs2 | rs1 | 101 | rd | 0111011 | GREVW
|---------------------------------------------------------------|
| 0010000 | imm | rs1 | 001 | rd | 0011011 | SLOIW
| 0010000 | imm | rs1 | 101 | rd | 0011011 | SROIW
| 0110000 | imm | rs1 | 101 | rd | 0011011 | RORIW
|---------------------------------------------------------------|
| 0100100 | imm | rs1 | 001 | rd | 0011011 | SBCLRIW
| 0010100 | imm | rs1 | 001 | rd | 0011011 | SBSETIW
| 0110100 | imm | rs1 | 001 | rd | 0011011 | SBINVIW
| 0010100 | imm | rs1 | 101 | rd | 0011011 | GORCIW
| 0110100 | imm | rs1 | 101 | rd | 0011011 | GREVIW
|---------------------------------------------------------------|
| rs3 | 10| rs2 | rs1 | 001 | rd | 0111011 | FSLW
| rs3 | 10| rs2 | rs1 | 101 | rd | 0111011 | FSRW
| rs3 | 10| imm | rs1 | 101 | rd | 0011011 | FSRIW
|---------------------------------------------------------------|
| 0110000 | 00000 | rs1 | 001 | rd | 0011011 | CLZW
| 0110000 | 00001 | rs1 | 001 | rd | 0011011 | CTZW
| 0110000 | 00010 | rs1 | 001 | rd | 0011011 | PCNTW
|---------------------------------------------------------------|
| 0000101 | rs2 | rs1 | 001 | rd | 0111011 | CLMULW
| 0000101 | rs2 | rs1 | 010 | rd | 0111011 | CLMULRW
| 0000101 | rs2 | rs1 | 011 | rd | 0111011 | CLMULHW
|---------------------------------------------------------------|
| 0000100 | rs2 | rs1 | 001 | rd | 0111011 | SHFLW
| 0000100 | rs2 | rs1 | 101 | rd | 0111011 | UNSHFLW
| 0100100 | rs2 | rs1 | 110 | rd | 0111011 | BDEPW
| 0000100 | rs2 | rs1 | 110 | rd | 0111011 | BEXTW
| 0000100 | rs2 | rs1 | 100 | rd | 0111011 | PACKW
| 0100100 | rs2 | rs1 | 100 | rd | 0111011 | PACKUW
| 0100100 | rs2 | rs1 | 111 | rd | 0111011 | BFPW
|---------------------------------------------------------------|
Changes in v0.93 of the RISC-V Bitmanip Spec (+
for addition,
-
for removal):
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|===============================================================|
| 0010000 | rs2 | rs1 | 010 | rd | 0110011 | + SH1ADD
| 0010000 | rs2 | rs1 | 100 | rd | 0110011 | + SH2ADD
| 0010000 | rs2 | rs1 | 110 | rd | 0110011 | + SH3ADD
|---------------------------------------------------------------|
| 0010000 | rs2 | rs1 | 010 | rd | 0111011 | + SH1ADDU.W
| 0010000 | rs2 | rs1 | 100 | rd | 0111011 | + SH2ADDU.W
| 0010000 | rs2 | rs1 | 110 | rd | 0111011 | + SH3ADDU.W
|---------------------------------------------------------------|


2.13 Future compressed instructions
The RISC-V ISA has no dedicated instructions for bitwise inverse (not
).
Instead not
is implemented as xori rd, rs, -1
and
neg
is implemented as sub rd, x0, rs
.
In bitmanipulation code not
is a very common operation. But there is
no compressed encoding for those operation because there is no c.xori
instruction.
On RV64 (and RV128) zext.w
and zext.d
(pack
and packw
)
are commonly used to zero-extend unsigned values <XLEN.
It presumably would make sense for a future revision of the “C” extension to include compressed opcodes for those instructions.
An encoding with the constraint rd = rs
would fit nicely in the
reserved space in c.addi16sp/c.lui
.
The entire RVC encoding space is 15.585 bits wide, the remaining reserved encoding space in RVC is 11.155 bits wide, not including space that is only reserved on RV32/RV64. This means that above encoding would use 0.0065% of the RVC encoding space, or 1.4% of the remaining reserved RVC encoding space. Preliminary experiments have shown that NOT instructions alone make up approximately 1% of bitmanipulation code size. [Wolf17A]
2.14 Macro-op fusion patterns
Some bitmanip operations have been left out of this spec because of lack of a (sensible) way to encode them in the 32-bit RISC-V encoding space. Instead we present fuse-able sequences of up to three instructions for those operations, so that high-end processors can implement them in a single fused macro-op, should they decide to do so.
For this document we only consider sequences fuse-able if they read at most two registers, only write one register, and contain no branch instructions.
2.14.1 Fused -bfp
sequences
The bfp
instruction is most commonly used in sequences of one the the following forms.
For 32-bit (RV32 or *W instructions on RV64):
addi rd, zero, ...
pack[w] rd, rs2, rd
bfp[w] rd, rs1, rd
lui rd, ...
addi rd, rd, ...
bfp[w] rd, rs1, rd
And for 64-bit:
lui rd, ...
pack rd, rs2, rd
bfp rd, rs1, rd
2.14.2 Load-immediate
For 32-bit code (RV32 or *W instructions on RV64) we recommend to fuse the lui+addi
pattern for loading a 32-bit constants:
lui rd, imm
addi[w] rd, rd, imm
For RV64 we also recommend to fuse the followings patterns for loading a 32-bit constant zero-extended into a 64-bit register:
lui rd, imm
addiwu rd, rd, imm
addiw rd, zero, v
pack rd, rd, zero
Further, for loading 64-bit consants in two macro-ops:
lui rd, imm
addiw rd, rd, imm
pack rd, rd, rs
2.14.3 Fused shift sequences
Pairs of left and right shifts are common operations for extracting a bit field.
To extract the contiguous bit field starting at pos
with length len
from rs
(with pos
> 0, len
> 0, and
pos
+ len
≤ XLEN):
slli rd, rs, (XLEN-len-pos)
srli rd, rd, (XLEN-len)
Using srai
instead of srli
will sign-extend the extracted bit-field.
Similarly, placing a bit field with length len
at the position pos
:
slli rd, rs, (XLEN-len)
srli rd, rd, (XLEN-len-pos)
Note that the postfix right shift instruction can use a compressed encoding, yielding a 48-bit fused instruction if the left shift is a 32-bit instruction.
More generally, a processor might fuse all destructive shift operations following any other ALU operation.
We define the following assembler pseudo-ops for sr[la]i
postfix fusion:
bfext rd, rs, len, pos -> slli rd, rs, (XLEN-len-pos); srai rd, rd, (XLEN-len)
bfextu rd, rs, len, pos -> slli rd, rs, (XLEN-len-pos); srli rd, rd, (XLEN-len)
bfmak rd, len, pos -> sroi rd, zero, len; srli rd, rd, (XLEN-len-pos)
(The names bfext
, bfextu
, and bfmak
are borrowed from m88k,
that had dedicated instructions of those names (without bf
-prefix) with
equivalent semantics. [m88k])
2.15 Other micro-architectural considerations
In addition to macro-op fusion, we issue the following recommendations for cores that aim at better performance for bitmanipulation tasks.
2.15.1 Unaligned memory access
The base ISA spec requires load/store operations that are not naturally aligned to succeed in U-mode, but explicitly states that execution may be slow.
For many bitmanipulation tasks it can be of great advantage to be able to perform load and store operations with arbitrary alignments quickly. Thus we recommend that cores optimized for bitmanipulation tasks provide fast hardware support for load/store with arbitrary alignment.
There should be a getauxval()
-based mechanism as part of the RISC-V Linux
ABI that can be used to query information on support for unaligned memory
access. [FastLdSt]
2.15.2 Fast multiply
A lot of bitmanipulation tricks rely on multiplication with “magic numbers” and similar tricks involving multiply/divide instructions. Thus, cores optimized for bitmanipulation tasks should provide reasonably fast implementations of the “M”-extension multiply/divide instructions.

<rvintrin.h>
2.16 C intrinsics via <rvintrin.h>
A C header file <rvintrin.h>
is provided that contains assembler
templates for directly creating assembler instructions from C code.
The header defines _rv_(...)
functions that operate on the long
data type, _rv32_(...)
functions that operate on the int32_t
data type, and _rv64_(...)
functions that operate on the int64_t
data type. The _rv64_(...)
functions are only available on
RV64. See table 1.11 for a complete list of intrinsics defined in
<rvintrin.h>
.
Usage example:
#include <rvintrin.h>
int find_nth_set_bit(unsigned int value, int cnt) {
return _rv32_ctz(_rv32_bdep(1 << cnt, value));
}
Defining RVINTRIN_EMULATE
before including <rvintrin.h>
will
define plain C functions that emulate the behavior of the RISC-V instructions.
This is useful for testing software on non-RISC-V platforms.