# Five EmbedDev

An Embedded RISC-V Blog

# 2 RISC-V Bitmanip Extension

In the proposals provided in this chapter, the C code examples are for illustration purposes only. They are not optimal implementations, but are intended to specify the desired functionality.

The final standard will likely define a range of Z-extensions for different bit manipulation instructions, with the “B” extension itself being a mix of instructions from those Z-extensions. It is as yet unclear exactly what this will look like.

The main open questions are:

• Which of the “Zb*” extensions should be included in “B”?

• There is some redundancy/overlap with add[i]wu, subwu, addu.w, subu.w, and slliu.w. Which instructions should stay?

• Which “Zbp” pseudo-ops should be included in “Zbb”?

These decisions will be informed in large part by evaluations of the cost and added value of the individual instructions.

For the purpose of tool-chain development, “B” is currently everything.

For extensions that only implement certain pseudo-instructions (such as “Zbb”, which implements rev8 and rev, pseudo-instructions for grevi rd, rs1, -8 and grevi rd, rs1, -1 respectively), the same binary encoding is used for those instructions as is used on a core with full support for the grev[i] instruction.

Like in the base ISA, we add *W instruction variants on RV64 with the semantics of the matching RV32 instructions. Those instructions ignore the upper 32 bits of their input and sign-extend their 32-bit output values. The W variant is omitted when the 64-bit instruction produces the same result as the W instruction would when the 64-bit instruction is fed sign-extended 32-bit values.

## 2.1 Basic bit manipulation instructions

### 2.1.1 Count Leading/Trailing Zeros (clz, ctz)

RV32, RV64: clz rd, rs ctz rd, rs

RV64 only: clzw rd, rs ctzw rd, rs

The clz operation counts the number of 0 bits at the MSB end of the argument. That is, the number of 0 bits before the first 1 bit counting from the most significant bit. If the input is 0, the output is XLEN. If the input is -1, the output is 0.

The ctz operation counts the number of 0 bits at the LSB end of the argument. If the input is 0, the output is XLEN. If the input is -1, the output is 0.

[bextcref-clz-ctz]
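As an illustrative sketch of these semantics for XLEN=32 (the `clz32`/`ctz32` names and the naive bit loops are ours, not the reference code the placeholder above stands for):

```c
#include <stdint.h>

// Count of 0 bits at the MSB end; returns XLEN (here 32) for a zero input.
uint32_t clz32(uint32_t x)
{
    for (int i = 31; i >= 0; i--)
        if ((x >> i) & 1)
            return 31 - i;   // zeros above the highest set bit
    return 32;
}

// Count of 0 bits at the LSB end; returns XLEN (here 32) for a zero input.
uint32_t ctz32(uint32_t x)
{
    for (int i = 0; i < 32; i++)
        if ((x >> i) & 1)
            return i;        // index of the lowest set bit
    return 32;
}
```

With these definitions, `31 - clz32(x)` is the integer base-2 logarithm mentioned below.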

The expression XLEN-1-clz(x) evaluates to the index of the most significant set bit, also known as integer base-2 logarithm, or -1 if x is zero.

These instructions are commonly used for scanning bitmaps for set bits, for example in malloc(), in binary GCD, or in priority queues such as the sched_find_first_bit() function used in the Linux kernel real-time scheduler.

Other common applications include normalization in fixed-point code and soft-float libraries, and null suppression in data compression.

### 2.1.2 Count Bits Set (pcnt)

RV32, RV64: pcnt rd, rs

RV64 only: pcntw rd, rs

This instruction counts the number of 1 bits in a register. This operation is known as population count, popcount, sideways sum, bit summation, or Hamming weight. [HammingWeight, Warren12]

[bextcref-pcnt]
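A minimal C sketch of the semantics for XLEN=32 (the `pcnt32` name and loop are ours, for illustration only):

```c
#include <stdint.h>

// Population count for XLEN=32: the number of set bits in x.
uint32_t pcnt32(uint32_t x)
{
    uint32_t count = 0;
    for (int i = 0; i < 32; i++)
        count += (x >> i) & 1;
    return count;
}
```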

### 2.1.3 Logic-with-negate (andn, orn, xnor)

RV32, RV64: andn rd, rs1, rs2 orn rd, rs1, rs2 xnor rd, rs1, rs2

These instructions implement AND, OR, and XOR with the second argument inverted.

[bextcref-andn]

These instructions can reuse the inverter on rs2 that already exists in the ALU to implement subtract.

Among other things, those instructions allow implementing the “trailing bit manipulation” code patterns in two instructions each. For example, (x - 1) & ~x produces a mask from trailing zero bits in x.
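A hedged C sketch of these semantics and the trailing-bit pattern just mentioned (function names are ours, for illustration; XLEN=32 assumed):

```c
#include <stdint.h>

uint32_t andn32(uint32_t rs1, uint32_t rs2) { return rs1 & ~rs2; }
uint32_t orn32 (uint32_t rs1, uint32_t rs2) { return rs1 | ~rs2; }
uint32_t xnor32(uint32_t rs1, uint32_t rs2) { return rs1 ^ ~rs2; }

// Trailing-bit pattern from the text: mask of the trailing zero
// bits of x, computed as (x - 1) & ~x in two instructions.
uint32_t mask_trailing_zeros(uint32_t x) { return andn32(x - 1, x); }
```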

### 2.1.4 Pack two words in one register (pack, packu, packh)

RV32, RV64: pack rd, rs1, rs2 packu rd, rs1, rs2 packh rd, rs1, rs2

RV64 only: packw rd, rs1, rs2 packuw rd, rs1, rs2

The pack instruction packs the XLEN/2-bit lower halves of rs1 and rs2 into rd, with rs1 in the lower half and rs2 in the upper half.

[bextcref-pack]

The packu instruction packs the upper halves of rs1 and rs2 into rd.

[bextcref-packu]

And the packh instruction packs the LSB bytes of rs1 and rs2 into the 16 LSB bits of rd, zero extending the rest of rd.

[bextcref-packh]
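The three pack variants can be sketched in C for XLEN=32 as follows (names and code are ours, standing in for the omitted reference listings):

```c
#include <stdint.h>

// pack: lower halves of rs1 (into the low half) and rs2 (into the high half).
uint32_t pack32(uint32_t rs1, uint32_t rs2)
{
    return (rs1 & 0xFFFF) | (rs2 << 16);
}

// packu: upper halves of rs1 (into the low half) and rs2 (into the high half).
uint32_t packu32(uint32_t rs1, uint32_t rs2)
{
    return (rs1 >> 16) | (rs2 & 0xFFFF0000);
}

// packh: LSB bytes of rs1 and rs2 into the 16 LSB bits, rest zeroed.
uint32_t packh32(uint32_t rs1, uint32_t rs2)
{
    return (rs1 & 0xFF) | ((rs2 & 0xFF) << 8);
}
```

Note that `pack32(x, 0)` zero-extends the lower half, which is the basis of the zext pseudo-ops below.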

Applications include XLEN/2-bit funnel shifts, zero-extending XLEN/2-bit values, duplicating the lower XLEN/2 bits (e.g. for mask creation), loading unsigned 32-bit constants on RV64, and packing C structs that fit in a register and are therefore passed in a register according to the RISC-V calling convention.

```
; Constructing a 32-bit int from four bytes (RV32)
packh a0, a0, a1
packh a1, a2, a3
pack  a0, a0, a1

; Load 0xffff0000ffff0000 on RV64
lui  rd, 0xffff0
pack rd, rd, rd

; Same as FSLW on RV64
pack  rd, rs1, rs3
rol   rd, rd, rs2
addiw rd, rd, 0

; Clear the upper half of rd
pack rd, rd, zero
```

Paired with shfli/unshfli and the other bit permutation instructions, pack can interleave arbitrary power-of-two chunks of rs1 and rs2. For example, interleaving the bytes in the lower halves of rs1 and rs2:

```
pack rd, rs1, rs2
zip8 rd, rd
```

pack is most commonly used to zero-extend values narrower than XLEN. For this purpose we define the following assembler pseudo-ops:

```
RV32:
zext.b rd, rs   ->    andi  rd, rs, 255
zext.h rd, rs   ->    pack  rd, rs, zero

RV64:
zext.b rd, rs   ->    andi  rd, rs, 255
zext.h rd, rs   ->    packw rd, rs, zero
zext.w rd, rs   ->    pack  rd, rs, zero

RV128:
zext.b rd, rs   ->    andi  rd, rs, 255
zext.h rd, rs   ->    packw rd, rs, zero
zext.w rd, rs   ->    packd rd, rs, zero
zext.d rd, rs   ->    pack  rd, rs, zero
```

### 2.1.5 Min/max instructions (min, max, minu, maxu)

RV32, RV64: min rd, rs1, rs2 max rd, rs1, rs2 minu rd, rs1, rs2 maxu rd, rs1, rs2

We define 4 R-type instructions min, max, minu, maxu with the following semantics:

[bextcref-minmax]
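The semantics can be sketched in C for XLEN=32 (function names are ours; the point is simply the signed vs. unsigned comparison):

```c
#include <stdint.h>

int32_t  min32 (int32_t  rs1, int32_t  rs2) { return rs1 < rs2 ? rs1 : rs2; }
int32_t  max32 (int32_t  rs1, int32_t  rs2) { return rs1 > rs2 ? rs1 : rs2; }
uint32_t minu32(uint32_t rs1, uint32_t rs2) { return rs1 < rs2 ? rs1 : rs2; }
uint32_t maxu32(uint32_t rs1, uint32_t rs2) { return rs1 > rs2 ? rs1 : rs2; }
```

The absolute-value idiom described below corresponds to `max32(x, -x)`.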

Code that performs saturated arithmetic on a word size < XLEN needs to perform min/max operations frequently. A simple way of performing those operations without branching can benefit those programs.

Some applications spend a lot of time on calculating the absolute values of signed integers. One example of that would be SAT solvers, due to the way CNF literals are commonly encoded [BiereComm]. With max (or minu) this is a two-instruction operation:

```
neg a1, a0
max a0, a0, a1
```

### 2.1.6 Sign-extend instructions (sext.b, sext.h)

RV32, RV64: sext.b rd, rs sext.h rd, rs

For sign-extending a byte or half-word we define two unary instructions:

[bextcref-sext]
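In C, the semantics for XLEN=32 amount to sign-extending the 8 or 16 LSB bits (the function names below are ours, for illustration):

```c
#include <stdint.h>

// sext.b / sext.h: sign-extend the least significant byte / half-word.
int32_t sext_b32(int32_t x) { return (int8_t)x; }
int32_t sext_h32(int32_t x) { return (int16_t)x; }
```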

Additionally, we define pseudo-instructions for zero extending bytes or half-words and for zero- and sign-extending words:

```
RV32:
zext.b rd, rs   ->   andi rd, rs, 255
zext.h rd, rs   ->   pack rd, rs, zero

RV64:
zext.b rd, rs   ->   andi rd, rs, 255
zext.h rd, rs   ->   packw rd, rs, zero
zext.w rd, rs   ->   pack rd, rs, zero
sext.w rd, rs   ->   addiw rd, rs, 0
```

Sign-extending 8-bit and 16-bit values is needed when calling a function that accepts an 8-bit or 16-bit signed argument, because the RISC-V calling convention dictates that such an argument must be passed in sign-extended form.

### 2.1.7 Single-bit instructions (sbset, sbclr, sbinv, sbext)

RV32, RV64: sbset rd, rs1, rs2 sbclr rd, rs1, rs2 sbinv rd, rs1, rs2 sbext rd, rs1, rs2 sbseti rd, rs1, imm sbclri rd, rs1, imm sbinvi rd, rs1, imm sbexti rd, rs1, imm

RV64: sbsetw rd, rs1, rs2 sbclrw rd, rs1, rs2 sbinvw rd, rs1, rs2 sbextw rd, rs1, rs2 sbsetiw rd, rs1, imm sbclriw rd, rs1, imm sbinviw rd, rs1, imm

We define 4 single-bit instructions sbset (set), sbclr (clear), sbinv (invert), and sbext (extract), and their immediate-variants, with the following semantics:

[bextcref-sbx]
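A C sketch of the register-register forms for XLEN=32 (names are ours; the shift amount is taken modulo XLEN, mirroring the base-ISA shifts):

```c
#include <stdint.h>

uint32_t sbset32(uint32_t rs1, uint32_t rs2) { return rs1 |  (1u << (rs2 & 31)); }
uint32_t sbclr32(uint32_t rs1, uint32_t rs2) { return rs1 & ~(1u << (rs2 & 31)); }
uint32_t sbinv32(uint32_t rs1, uint32_t rs2) { return rs1 ^  (1u << (rs2 & 31)); }
uint32_t sbext32(uint32_t rs1, uint32_t rs2) { return (rs1 >> (rs2 & 31)) & 1; }
```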

### 2.1.8 Shift Ones (Left/Right) (slo, sloi, sro, sroi)

RV32, RV64: slo rd, rs1, rs2 sro rd, rs1, rs2 sloi rd, rs1, imm sroi rd, rs1, imm

RV64 only: slow rd, rs1, rs2 srow rd, rs1, rs2 sloiw rd, rs1, imm sroiw rd, rs1, imm

These instructions are similar to shift-logical operations from the base spec, except instead of shifting in zeros, they shift in ones.

[bextcref-sxo]
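The "shift ones" semantics reduce to inverting, shifting in zeros, and inverting again; a C sketch for XLEN=32 (function names ours):

```c
#include <stdint.h>

// Shift left/right, filling with 1 bits instead of 0 bits.
uint32_t slo32(uint32_t x, uint32_t n) { return ~(~x << (n & 31)); }
uint32_t sro32(uint32_t x, uint32_t n) { return ~(~x >> (n & 31)); }
```

For example, `slo32(0, n)` produces a mask of n ones at the LSB end, which is the mask-generation use-case mentioned below.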

ISAs with flag registers often have a "Shift in Carry" or "Rotate through Carry" instruction. Arguably a "Shift Ones" is an equivalent on an ISA like RISC-V that avoids such flag registers.

The main application for the Shift Ones instruction is mask generation.

When implementing this circuit, the only change in the ALU over a standard logical shift is that the value shifted in is not zero, but is a 1-bit register value that has been forwarded from the high bit of the instruction decode. This creates the desired behavior on both logical zero-shifts and logical ones-shifts.

## 2.2 Bit permutation instructions

The following sections describe 3 types of bit permutation instructions: Rotate shift, generalized reverse, and generalized shuffle.

A bit permutation essentially is applying an invertible function to the bit addresses. (Bit addresses are 5 bit values on RV32 and 6 bit values on RV64.)

Rotate shift by k is simply addition (rol) or subtraction (ror) modulo XLEN:

i_rot := (i ± k) mod XLEN

Generalized reverse with control argument k is simply XOR-ing the bit addresses with k:

i_grev := i ⊕ k

And generalized shuffle performs a bit permutation on the bits of the bit addresses:

i_shfl := perm_k(i)

The caveat is that a single shfl/unshfl instruction can only perform a certain subset of bit address permutations, but a sequence of 4 shfl/unshfl instructions can perform any of the 120 such permutations on RV32, and a sequence of 5 shfl/unshfl instructions can perform any of the 720 such permutations on RV64.

Combining those three types of operations makes a vast number of bit permutations accessible within only a few instructions [Wolf19A] (see Table [numperms]).

| N  | ROT-only | GREV-only | SHFL-only | ROT+GREV  | ROT+GREV+SHFL           |
|----|----------|-----------|-----------|-----------|-------------------------|
| 0  | 1        | 1         | 1         | 1         | 1                       |
| 1  | 32       | 32        | 24        | 62        | 85                      |
| 2  |          |           | 86        | 864       | 3 030                   |
| 3  |          |           | 119       | 4 640     | 78 659                  |
| 4  |          |           | 120       | 23 312    | 2 002 167               |
| 5  |          |           |           | 92 192    | 50 106 844              |
| 6  |          |           |           | 294 992   | 1 234 579 963           |
| 7  |          |           |           | 703 744   | est. 30 000 000 000     |
| 8  |          |           |           | 1 012 856 | est. 700 000 000 000    |
| 9  |          |           |           | 1 046 224 | est. 15 000 000 000 000 |
| 10 |          |           |           | 1 048 576 | ⋯                       |
| 11 | ⋯        | ⋯         | ⋯         | ⋯         | ⋯                       |

Sequences of ror, grev, and [un]shfl instructions can generate any arbitrary bit permutation, often in surprising ways. For example, the following sequence swaps the two LSB bits of a0:

| Instruction        | State (XLEN=8)  | Bit-Index Op |
|--------------------|-----------------|--------------|
| initial value      | 7 6 5 4 3 2 1 0 |              |
| rori a0, a0, 2     | 1 0 7 6 5 4 3 2 | i′ := i − 2  |
| unshfli a0, a0, -1 | 1 7 5 3 0 6 4 2 | i′ := ror(i) |
| roli a0, a0, 1     | 7 5 3 0 6 4 2 1 | i′ := i + 1  |
| shfli a0, a0, -1   | 7 6 5 4 3 2 0 1 | i′ := rol(i) |

```
rori    a0, a0, 2
unshfli a0, a0, -1
roli    a0, a0, 1
shfli   a0, a0, -1
```

The mechanics of this sequence are closely related to the fact that rol(ror(x−2)+1) is a function that maps 1 to 0 and 0 to 1 and every other number to itself (with rol and ror denoting 1-bit rotate left and right shifts respectively). See Table [permdemo] for details.

The numbers in the right column of Table [numperms] might appear large, but they are tiny in comparison to the total number of 32-bit permutations (32! ≈ 2.63 ⋅ 10^35). Fortunately the [un]shfl instruction “explores” permutations involving power-of-two-sized fields and/or power-of-two moves first, which means we get shorter sequences for the permutations we more often care about in real-world applications.

For example, there are 24 ways of arranging the four bytes in a 32-bit word. ror, grev, and [un]shfl can perform any of those permutations in at most 3 instructions. See Table [permbytes] for a list of those 24 sequences.

There are 40320 ways of arranging the eight bytes in a 64-bit word or the eight nibbles in a 32-bit word. ror, grev, and [un]shfl can perform any of those permutations in at most 9 instructions. [Wolf19A]

Besides the more-or-less arbitrary permutations we get from combining ror, grev, and [un]shfl in long sequences, there are of course many cases where just one of these instructions does exactly what we need. For grev and [un]shfl we define rev* and [un]zip* pseudo-instructions for the most common use-cases.

### 2.2.1 Rotate (Left/Right) (rol, ror, rori)

RV32, RV64: ror rd, rs1, rs2 rol rd, rs1, rs2 rori rd, rs1, imm

RV64 only: rorw rd, rs1, rs2 rolw rd, rs1, rs2 roriw rd, rs1, imm

These instructions are similar to shift-logical operations from the base spec, except they shift in the values from the opposite side of the register, in order. This is also called ‘circular shift’.

[bextcref-rox]
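A hedged C sketch of the rotate semantics for XLEN=32 (names ours; the zero-shift case is handled separately to avoid an undefined 32-bit shift in C):

```c
#include <stdint.h>

uint32_t rol32(uint32_t x, uint32_t n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

uint32_t ror32(uint32_t x, uint32_t n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}
```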

### 2.2.2 Generalized Reverse (grev, grevi, rev)

RV32, RV64: grev rd, rs1, rs2 grevi rd, rs1, imm

RV64 only: grevw rd, rs1, rs2 greviw rd, rs1, imm

This single instruction can implement byte-order swap, full bitwise reversal, short-order swap, word-order swap (RV64), nibble-order swap, bitwise reversal within each byte, and more.

The Generalized Reverse (GREV) operation iteratively checks each bit i of the 2nd argument, from i = 0 to log2(XLEN) − 1, and if the corresponding bit is set, swaps each pair of adjacent blocks of 2^i bits.

[bextcref-grev]
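For XLEN=32 this butterfly structure can be sketched in C as below; the stage masks follow the standard even/odd bit patterns used in the draft's published reference code, but treat this as an illustration rather than the normative listing:

```c
#include <stdint.h>

uint32_t grev32(uint32_t x, uint32_t k)
{
    if (k &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (k &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (k &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (k &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (k & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}
```

With this definition, `grev32(x, 24)` is the byte-order swap (rev8) and `grev32(x, 31)` is the full bit reversal (rev).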

The pattern above should make it intuitive how to extend this definition to RV128.

The grev operation can easily be implemented using a permutation network with log2(XLEN) stages. Figure 1.1 shows the permutation network for ror for reference. Figure 1.2 shows the permutation network for grev.

Pseudo-instructions are provided for the most common GREVI use-cases. Their names consist of a prefix and an optional suffix. Each prefix and suffix corresponds to a bit mask (see Table [grevi-names]). The GREVI control word is obtained by AND-ing the two masks together.

| Prefix | Mask   | Suffix | Mask   |                 |
|--------|--------|--------|--------|-----------------|
| rev    | 111111 |        | 111111 |                 |
| rev2   | 111110 | .w     | 011111 | (w = word)      |
| rev4   | 111100 | .h     | 001111 | (h = half word) |
| rev8   | 111000 | .b     | 000111 | (b = byte)      |
| rev16  | 110000 | .n     | 000011 | (n = nibble)    |
| rev32  | 100000 | .p     | 000001 | (p = pair)      |

In other words, the prefix controls the number of zero bits at the LSB end of the control word, and the suffix controls the number of zeros at the MSB end of the control word.

rev8 reverses the order of bytes in a word, thus performs endianness conversion. This is equivalent to the ARM REV instructions or BSWAP on x86. ARM also has instructions for swapping the bytes in 16-bit and 32-bit words, and reversing the bit order (see table [grevi-cmp]).

| RISC-V | ARM   | x86   |
|--------|-------|-------|
| rev    | RBIT  |       |
| rev8.h | REV16 |       |
| rev8.w | REV32 |       |
| rev8   | REV   | BSWAP |

### 2.2.3 Generalized Shuffle (shfl, unshfl, shfli, unshfli, zip, unzip)

RV32, RV64: shfl rd, rs1, rs2 unshfl rd, rs1, rs2 shfli rd, rs1, imm unshfli rd, rs1, imm

RV64 only: shflw rd, rs1, rs2 unshflw rd, rs1, rs2

Shuffle is the third bit permutation instruction in the RISC-V Bitmanip extension, after rotary shift and generalized reverse. It implements a generalization of the operation commonly known as perfect outer shuffle and its inverse (shuffle/unshuffle), also known as zip/unzip or interlace/uninterlace.

Bit permutations can be understood as reversible functions on bit indices (i.e. 5 bit functions on RV32 and 6 bit functions on RV64).

A generalized (un)shuffle operation has log2(XLEN) − 1 control bits, one for each pair of neighbouring bits in a bit index. When the bit is set, generalized shuffle will swap the two index bits. The shfl operation performs these swaps in MSB-to-LSB order (performing a rotate left shift on contiguous regions of set control bits), and the unshfl operation performs the swaps in LSB-to-MSB order (performing a rotate right shift on contiguous regions of set control bits). Combining up to log2(XLEN) of those shfl/unshfl operations can implement any bit permutation on the bit indices.

The most common type of shuffle/unshuffle operation is one on an immediate control value that only contains one contiguous region of set bits. We call those operations zip/unzip and provide pseudo-instructions for them. The naming scheme for those pseudo-instructions is similar to the naming scheme for the grevi pseudo-instructions (see Tables 1.3 and [grevi-names]), except that the LSB bit of the masks in Table [grevi-names] is not used for zip/unzip.

Shuffle/unshuffle operations that only have individual bits set (not a contiguous region of two or more bits) are their own inverse.

Like GREV and rotate shift, the (un)shuffle instruction can be implemented using a short sequence of elementary permutations, that are enabled or disabled by the shamt bits. But (un)shuffle has one stage fewer than GREV. Thus shfli+unshfli together require the same amount of encoding space as grevi.

[bextcref-gzip32]

Or for RV64:

[bextcref-gzip64]

The pattern above should make it intuitive how to extend this definition to RV128.
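The RV32 stage structure can be sketched in C as below. The masks follow the draft's published reference code, but treat the exact values as an illustration; `shfl32` applies the stages MSB-to-LSB, `unshfl32` applies them in the opposite order:

```c
#include <stdint.h>

// One butterfly stage: swap the bit groups selected by maskL/maskR over distance N.
static uint32_t shuffle32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
{
    uint32_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}

uint32_t shfl32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 15;
    if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    return x;
}

uint32_t unshfl32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 15;
    if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    return x;
}
```

With this sketch, `shfl32(x, 15)` is the zip operation; applied to a value with the upper half cleared it performs the "fan-out" described below.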

Alternatively, (un)shuffle can be implemented in a single network with one more stage than GREV, with the additional first and last stage executing a permutation that effectively reverses the order of the inner stages. However, since the inner stages each only mux half of the bits in the word, a hardware implementation using these additional “flip” stages might actually be more expensive than simply creating two networks.

[bextcref-gzip32-alt]

Figure 1.7 shows the (un)shuffle permutation network with “flip” stages and Figure 1.6 shows the (un)shuffle permutation network without “flip” stages.

The zip instruction with the upper half of its input cleared performs the commonly needed “fan-out” operation. (Equivalent to bdep with a 0x55555555 mask.) The zip instruction applied twice fans out the bits in the lower quarter of the input word by a spacing of 4 bits.

For example, the following code calculates the bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32:

```
andi a0, a0, 0xff
zip  a0, a0
zip  a0, a0
slli a1, a0, 4
c.add a0, a1
slli a1, a0, 8
c.add a0, a1
slli a1, a0, 16
c.add a0, a1
```

The final prefix sum is stored in the 8 nibbles of the a0 output word.

Similarly, the following code stores the indices of the set bits in the LSB nibbles of the output word (with the LSB bit having index 1), with the unused MSB nibbles in the output set to zero:

```
andi  a0, a0, 0xff
zip   a0, a0
zip   a0, a0
orc.n a0, a0
li    a1, 0x87654321
and   a1, a0, a1
bext  a0, a1, a0
```

Other zip modes can be used to “fan-out” in blocks of 2, 4, 8, or 16 bits. zip can be combined with grevi to perform inner shuffles. For example on RV64:

```
li a0, 0x0000000012345678
zip4    t0, a0   ; <- 0x0102030405060708
rev4.b  t1, t0   ; <- 0x1020304050607080
zip8    t2, a0   ; <- 0x0012003400560078
rev8.h  t3, t2   ; <- 0x1200340056007800
zip16   t4, a0   ; <- 0x0000123400005678
rev16.w t5, t4   ; <- 0x1234000056780000
```

Another application for the zip instruction is generating Morton code [MortonCode].

The x86 PUNPCK[LH]* MMX/SSE/AVX instructions perform similar operations as zip8 and zip16.

### 2.2.4 Crossbar Permutation Instructions (xperm.[nbhw])

RV32, RV64: xperm.n rd, rs1, rs2 xperm.b rd, rs1, rs2 xperm.h rd, rs1, rs2

RV64 only: xperm.w rd, rs1, rs2

These instructions operate on nibbles/bytes/half-words/words. rs1 is a vector of data elements and rs2 is a vector of indices into rs1. The result of the instruction is the vector rs2 with each element replaced by the corresponding data element from rs1, or zero when the index in rs2 is out of bounds.

[bextcref-xperm]
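As a hedged sketch of the nibble variant for XLEN=32 (the `xperm_n32` name and loop are ours): each of the 8 index nibbles in rs2 selects one of the 8 data nibbles in rs1, and out-of-bounds indices produce zero.

```c
#include <stdint.h>

uint32_t xperm_n32(uint32_t rs1, uint32_t rs2)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 4) {        // each nibble of rs2 is an index
        uint32_t idx = (rs2 >> i) & 15;
        if (idx < 8)                          // rs1 holds 8 data nibbles
            r |= ((rs1 >> (idx * 4)) & 15) << i;
    }
    return r;
}
```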

The xperm.[nbhw] instructions can be implemented with an XLEN/4-lane nibble-wide crossbar switch.

The xperm.n instruction can be used to implement an arbitrary 64-bit bit permutation in 15 instructions, using 8 control words and 3 constant masks [Wolf20A]:

```c
uint64_t perm64_bitmanip_cmix(perm64t &p, uint64_t v)
{
    uint64_t v0 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[0]) & p.mask[0], 3);   //  3 insns
    uint64_t v1 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[1]) & p.mask[1], 3);   //  6 insns
    uint64_t v2 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[2]) & p.mask[2], 3);   //  9 insns
    uint64_t v3 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[3]) & p.mask[3], 3);   // 12 insns

    v0 = _rv_cmix(0x1111111111111111LL, v0, v1);                            // 13 insns
    v2 = _rv_cmix(0x4444444444444444LL, v2, v3);                            // 14 insns
    return _rv_cmix(0x3333333333333333LL, v0, v2);                          // 15 insns
}
```

## 2.3 Generalized OR-Combine (gorc, gorci)

RV32, RV64: gorc rd, rs1, rs2 gorci rd, rs1, imm

RV64 only: gorcw rd, rs1, rs2 gorciw rd, rs1, imm

The GORC operation is similar to GREV, except that instead of swapping pairs of bits, GORC ORs them together, and writes the new value in both positions.

[bextcref-gorc]
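Mirroring the GREV sketch, the OR-combine variant for XLEN=32 can be illustrated as follows (masks as in the published draft's reference code; the function name is ours):

```c
#include <stdint.h>

uint32_t gorc32(uint32_t x, uint32_t k)
{
    if (k &  1) x |= ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (k &  2) x |= ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (k &  4) x |= ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (k &  8) x |= ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (k & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}
```

For example, `gorc32(x, 7)` is orc.b: each byte of the result is 0xFF if the corresponding input byte is non-zero, else 0x00.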

GORC can be useful for copying naturally aligned fields in a word, and for testing such fields for being zero.

gorci pseudo-instructions follow the same naming scheme as grevi pseudo-instructions (see Tables 1.3 and [grevi-names]), except the prefix orc is used instead of rev. See Table 1.8 for a full list of gorci pseudo-instructions.

An important use-case is strlen() and strcpy(), which can utilize orc.b for testing for zero bytes, and counting trailing non-zero bytes in a word.

## 2.4 Bit-Field Place (bfp)

RV32, RV64: bfp rd, rs1, rs2

RV64 only: bfpw rd, rs1, rs2

The bit field place (bfp) instruction places up to XLEN/2 LSB bits from rs2 into the value in rs1. The upper bits of rs2 control the length of the bit field and target position. The layout of rs2 is chosen in a way that makes it possible to construct rs2 easily using pack[h] instructions and/or andi/lui.

[bextcref-bfp]

The layout of the control word in rs2 is as follows for RV32. LEN=0 encodes for LEN=16.

```
|  3                   2                   1                             |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------|---------------|---------------|---------------|
|       |  LEN  |     |   OFF   |             DATA              |
|---------------|---------------|---------------|---------------|
```

And on RV64 (with LEN=0 encoding for LEN=32):

```
|      6                   5                   4                   3              |
|3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 .... 2 1 0|
|---------------|---------------|---------------|---------------|------ -- ------|
|SEL| |   LEN   |   |    OFF    |     |   LEN'  |   |    OFF'   |      DATA      |
|---------------|---------------|---------------|---------------|------ -- ------|
```

When SEL=10 then LEN and OFF are used, otherwise LEN’ and OFF’ are used.
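A hedged C sketch of the RV32 behavior, assuming the layout above (LEN in rs2[27:24] with 0 encoding 16, OFF in rs2[20:16], DATA in rs2[15:0]); this is our illustration, not the normative reference code:

```c
#include <stdint.h>

uint32_t bfp32(uint32_t rs1, uint32_t rs2)
{
    uint32_t cfg = rs2 >> 16;
    int len = (cfg >> 8) & 15;
    int off = cfg & 31;
    if (len == 0)
        len = 16;                              // LEN=0 encodes LEN=16
    uint32_t mask = ((1u << len) - 1) << off;  // len ones at position off
    uint32_t data = rs2 << off;                // DATA field moved into place
    return (data & mask) | (rs1 & ~mask);
}
```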

Placing bits from a0 into a1, with the result in t0, on RV32:

```
addi t0, zero, {length[3:0], offset[7:0]}
pack t0, a0, t0
bfp  t0, a1, t0
```

And on RV64:

```
lui  t0, {3'b 100, length[4:0], offset[7:0], 4'b 0000}
pack t0, a0, t0
bfp  t0, a1, t0
```

Or with a2=length and a3=offset:

```
packh t0, a3, a2
pack  t0, a0, t0
bfp   t0, a1, t0
```

Placing up to 16 constant bits in any contiguous region:

```
lui t0, ...
bfp[w] t0, a1, t0
```

Note that the sequences above only modify one register (t0), which makes them fusible sequences.

## 2.5 Bit Extract/Deposit (bext, bdep)

RV32, RV64: bext rd, rs1, rs2 bdep rd, rs1, rs2

RV64 only: bextw rd, rs1, rs2 bdepw rd, rs1, rs2

These instructions implement the generic bit extract and bit deposit functions. This operation is also referred to as bit gather/scatter, bit pack/unpack, parallel extract/deposit, compress/expand, or right_compress/right_expand.

bext collects LSB justified bits to rd from rs1 using extract mask in rs2.

bdep writes LSB justified bits from rs1 to rd using deposit mask in rs2.

[bextcref-bext]
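A straightforward (non-optimized) C sketch of both operations for XLEN=32, standing in for the omitted reference listing (function names are ours):

```c
#include <stdint.h>

// bext: gather the rs1 bits selected by the mask, LSB-justified into the result.
uint32_t bext32(uint32_t rs1, uint32_t mask)
{
    uint32_t r = 0;
    for (int i = 0, j = 0; i < 32; i++)
        if ((mask >> i) & 1) {
            if ((rs1 >> i) & 1)
                r |= 1u << j;
            j++;
        }
    return r;
}

// bdep: scatter the LSB-justified bits of rs1 to the positions selected by the mask.
uint32_t bdep32(uint32_t rs1, uint32_t mask)
{
    uint32_t r = 0;
    for (int i = 0, j = 0; i < 32; i++)
        if ((mask >> i) & 1) {
            if ((rs1 >> j) & 1)
                r |= 1u << i;
            j++;
        }
    return r;
}
```

Note that `bdep32(bext32(x, m), m) == x & m` for any mask m, reflecting that the two operations are inverses on the masked bits.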

Implementations may choose to use smaller multi-cycle implementations of bext and bdep, or even emulate the instructions in software.

Even though multi-cycle bext and bdep often are not fast enough to outperform algorithms that use sequences of shifts and bit masks, dedicated instructions for those operations can still be of great advantage in cases where the mask argument is not constant.

For example, the following code efficiently calculates the index of the tenth set bit in a0 using bdep:

```
li   a1, 0x00000200
bdep a0, a1, a0
ctz  a0, a0
```

For cases with a constant mask an optimizing compiler would decide when to use bext or bdep based on the optimization profile for the concrete processor it is optimizing for. This is similar to the decision whether to use MUL or DIV with a constant, or to perform the same operation using a longer sequence of much simpler operations.

The bext and bdep instructions are equivalent to the x86 BMI2 instructions PEXT and PDEP. But there is much older prior art. For example, the Soviet BESM-6 mainframe computer, designed and built in the 1960s, had APX/AUX instructions with almost the same semantics [BESM6]. (The BESM-6 APX/AUX instructions packed/unpacked at the MSB end instead of the LSB end; otherwise they are the same instructions.)

Efficient hardware implementations of bext and bdep are described in [Hilewitz06] and demonstrated in [Wolf17B].

## 2.6 Carry-Less Multiply (clmul, clmulh, clmulr)

RV32, RV64: clmul rd, rs1, rs2 clmulh rd, rs1, rs2 clmulr rd, rs1, rs2

RV64 only: clmulw rd, rs1, rs2 clmulhw rd, rs1, rs2 clmulrw rd, rs1, rs2

Calculate the carry-less product [CarryLessProduct] of the two arguments. clmul produces the lower half and clmulh produces the upper half of the 2⋅XLEN-bit carry-less product.

Carry-less multiplication is equivalent to multiplication in the polynomial ring over GF(2).

clmulr produces bits 2⋅XLEN−2 : XLEN−1 of the 2⋅XLEN-bit carry-less product. That means clmulh is equivalent to clmulr followed by a 1-bit right shift. (The MSB of a clmulh result is always zero.) Another equivalent definition of clmulr is clmulr(a,b) := rev(clmul(rev(a), rev(b))). (The “r” in clmulr means reversed.)

Unlike with mulh[[s]u], we add a *W variant of clmulh. This is because we expect some code to use 32-bit clmul intrinsics even on 64-bit architectures, for example in cases where data is processed in 32-bit chunks.

[bextcref-clmul]
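The shift-and-XOR definition can be sketched in C for XLEN=32 (names ours; each set bit of the second operand contributes a shifted copy of the first, combined with XOR instead of addition):

```c
#include <stdint.h>

// Lower 32 bits of the 64-bit carry-less product.
uint32_t clmul32(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

// Upper 32 bits of the 64-bit carry-less product.
uint32_t clmulh32(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 1; i < 32; i++)
        if ((b >> i) & 1)
            r ^= a >> (32 - i);
    return r;
}
```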

The classic applications for clmul are Cyclic Redundancy Check (CRC) [FastCRC, Wolf18A] and Galois/Counter Mode (GCM), but more applications exist, including the following examples.

There are obvious applications in hashing and pseudo random number generation. For example, it has been reported that hashes based on carry-less multiplications can outperform Google’s CityHash [CLHASH].

clmul of a number with itself inserts zeroes between each input bit. This can be useful for generating Morton code [MortonCode].

clmul of a number with -1 calculates the prefix XOR operation. This can be useful for decoding gray codes.

Another application of XOR prefix sums calculated with clmul is branchless tracking of quoted strings in high-performance parsers. [ParseJSON]

Carry-less multiply can also be used to implement Erasure code efficiently. [ClmulErasureCode]

SPARC introduced similar instructions (XMULX, XMULXHI) in SPARC T3 in 2010. [sparct3]

TI C6000 introduced a similar instruction (XORMPY) in C64x+. [c64xp]

## 2.7 CRC Instructions (crc32.[bhwd], crc32c.[bhwd])

RV32, RV64: crc32.b rd, rs crc32.h rd, rs crc32.w rd, rs crc32c.b rd, rs crc32c.h rd, rs crc32c.w rd, rs

RV64 only: crc32.d rd, rs crc32c.d rd, rs

Unary Cyclic Redundancy Check (CRC) instructions that interpret the bits of rs1 as a CRC32/CRC32C state and perform a polynomial reduction of that state shifted left by 8, 16, 32, or 64 bits.

The instructions return the new CRC32/CRC32C state.

The crc32.w/crc32c.w instructions are equivalent to executing crc32.h/crc32c.h twice, and crc32.h/crc32c.h instructions are equivalent to executing crc32.b/crc32c.b twice.

All 8 CRC instructions operate on bit-reflected data.

[bextcref-crc]

Payload data must be XOR’ed into the LSB end of the state before executing the CRC instruction. The following code demonstrates the use of crc32.b:

```c
uint32_t crc32_demo(const uint8_t *p, int len)
{
    uint32_t x = 0xffffffff;
    for (int i = 0; i < len; i++) {
        x = x ^ p[i];
        x = crc32_b(x);
    }
    return ~x;
}
```
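As a hedged illustration of what crc32_b computes, a bit-serial C model of one crc32.b step (bit-reflected CRC-32, polynomial 0xEDB88320) together with the demo loop restated so the sketch is self-contained; the function names are ours:

```c
#include <stdint.h>

// One crc32.b step: polynomial-reduce the state shifted left by 8 bits,
// processed one bit at a time in bit-reflected form.
uint32_t crc32_b(uint32_t x)
{
    for (int i = 0; i < 8; i++)
        x = (x & 1) ? (x >> 1) ^ 0xEDB88320 : x >> 1;
    return x;
}

// Byte-wise CRC32 over a buffer, as in the crc32_demo() listing above.
uint32_t crc32_buf(const uint8_t *p, int len)
{
    uint32_t x = 0xffffffff;
    for (int i = 0; i < len; i++) {
        x = x ^ p[i];
        x = crc32_b(x);
    }
    return ~x;
}
```

This model reproduces the standard CRC-32 check value 0xCBF43926 for the input "123456789".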

In terms of binary polynomial arithmetic those instructions perform the operation
$$\texttt{rd}'(x) = (\texttt{rs1}'(x) \cdot x^N) \; \textrm{mod} \; \{\texttt{1}, P'\}(x)\textrm{,}$$
with N ∈ {8, 16, 32, 64}, P = 0xEDB8_8320 for CRC32 and P = 0x82F6_3B78 for CRC32C, a′ denoting the XLEN-bit reversal of a, and {a, b} denoting bit concatenation. Note that, for example, for CRC32 {1, P′} = 0x1_04C1_1DB7 on RV32 and {1, P′} = 0x1_04C1_1DB7_0000_0000 on RV64.

These dedicated CRC instructions are meant for RISC-V implementations without fast multiplier and therefore without fast clmul[h]. For implementations with fast clmul[h] it is recommended to use the methods described in [FastCRC] and demonstrated in [Wolf18A] that can process XLEN input bits using just one carry-less multiply for arbitrary CRC polynomials.

In applications where those methods are not applicable it is possible to emulate the dedicated CRC instructions using two carry-less multiplies that implement a Barrett reduction. The following example implements a replacement for crc32.w (RV32).

```
crc32_w:
  li     t0, 0xF7011641
  li     t1, 0xEDB88320
  clmul  a0, a0, t0
  clmulr a0, a0, t1
  ret
```

## 2.8 Bit-Matrix Instructions (bmatxor, bmator, bmatflip, RV64 only)

RV64 only: bmator rd, rs1, rs2 bmatxor rd, rs1, rs2 bmatflip rd, rs

These are 64-bit-only instructions that are not available on RV32. On RV128 they ignore the upper half of their operands and sign-extend their results.

These instructions interpret a 64-bit value as an 8×8 binary matrix.

bmatxor performs a matrix-matrix multiply with boolean AND as multiply operator and boolean XOR as addition operator.

bmator performs a matrix-matrix multiply with boolean AND as multiply operator and boolean OR as addition operator.

bmatflip is a unary operator that transposes the source matrix. It is equivalent to zip; zip; zip on RV64.

[bextcref-bmat]
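An illustrative C sketch of all three operations (the function names are ours, and the convention that bit 8·i+j of the 64-bit value is row i, column j is our assumption for the sketch; it is self-consistent with the identity matrix 0x8040201008040201):

```c
#include <stdint.h>

// Transpose the 8x8 bit matrix: (i,j) -> (j,i).
uint64_t bmatflip(uint64_t rs)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            if ((rs >> (8*i + j)) & 1)
                r |= 1ull << (8*j + i);
    return r;
}

// Matrix-matrix multiply: AND as multiply, XOR as addition.
uint64_t bmatxor(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            int bit = 0;
            for (int k = 0; k < 8; k++)
                bit ^= ((a >> (8*i + k)) & 1) & ((b >> (8*k + j)) & 1);
            if (bit)
                r |= 1ull << (8*i + j);
        }
    return r;
}

// Matrix-matrix multiply: AND as multiply, OR as addition.
uint64_t bmator(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            int bit = 0;
            for (int k = 0; k < 8; k++)
                bit |= ((a >> (8*i + k)) & 1) & ((b >> (8*k + j)) & 1);
            if (bit)
                r |= 1ull << (8*i + j);
        }
    return r;
}
```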

Among other things, bmatxor/bmator can be used to perform arbitrary permutations of bits within each byte (permutation matrix as 2nd operand) or perform arbitrary permutations of bytes within a 64-bit word (permutation matrix as 1st operand).

There are similar instructions in Cray XMT [CrayXMT]. The Cray X1 architecture even has a full 64x64 bit matrix multiply unit [CrayX1]. (See Section [bmat64] for how to implement 64x64 bit matrix operations with bmat[x]or.)

The MMIX architecture has MOR and MXOR instructions with the same semantic. [Knuth4A]

The x86 EVEX/VEX/SSE instruction GF2P8AFFINEQB is equivalent to bmatxor.

The bmm.8 instruction proposed in [Hilewitz08] is also equivalent to bmatxor.

## 2.9 Ternary Bit-Manipulation Instructions

### 2.9.1 Conditional Mix (cmix)

RV32, RV64: cmix rd, rs2, rs1, rs3

(Note that the assembler syntax of cmix has the rs2 argument first to make assembler code more readable. But the reference C code below uses the “architecturally correct” argument order rs1, rs2, rs3.)

The cmix rd, rs2, rs1, rs3 instruction selects bits from rs1 and rs3 based on the bits in the control word rs2.

[bextcref-cmix]
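In C, the semantics reduce to a bitwise multiplex (the `cmix32` name is ours; arguments in the "architecturally correct" order rs1, rs2, rs3):

```c
#include <stdint.h>

// For each bit: take rs1 where the control bit in rs2 is set, else rs3.
uint32_t cmix32(uint32_t rs1, uint32_t rs2, uint32_t rs3)
{
    return (rs1 & rs2) | (rs3 & ~rs2);
}
```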

It replaces sequences like the following.

```
and  rd, rs1, rs2
andn t0, rs3, rs2
or   rd, rd, t0
```

Using cmix, a single butterfly stage can be implemented in only two instructions. Thus, arbitrary bit permutations can be implemented using only 18 instructions (32 bit) or 22 instructions (64 bit).

### 2.9.2 Conditional Move (cmov)

RV32, RV64: cmov rd, rs2, rs1, rs3

(Note that the assembler syntax of cmov has the rs2 argument first to make assembler code more readable. But the reference C code below uses the “architecturally correct” argument order rs1, rs2, rs3.)

The cmov rd, rs2, rs1, rs3 instruction selects rs1 if the control word rs2 is non-zero, and rs3 if the control word is zero.

[bextcref-cmov]
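The semantics in C are a simple word-wise select (the `cmov32` name is ours; arguments in the "architecturally correct" order rs1, rs2, rs3):

```c
#include <stdint.h>

// Select rs1 when the control word rs2 is non-zero, else rs3.
uint32_t cmov32(uint32_t rs1, uint32_t rs2, uint32_t rs3)
{
    return rs2 ? rs1 : rs3;
}
```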

The cmov instruction helps avoid branches, which can lead to better performance, and helps with the constant-time code used in some cryptography applications.

### 2.9.3 Funnel Shift (fsl, fsr, fsri)

RV32, RV64: fsl rd, rs1, rs3, rs2 fsr rd, rs1, rs3, rs2 fsri rd, rs1, rs3, imm

RV64 only: fslw rd, rs1, rs3, rs2 fsrw rd, rs1, rs3, rs2 fsriw rd, rs1, rs3, imm

(Note that the assembler syntax for funnel shifts has the rs2 argument last to make assembler code more readable. But the reference C code below uses the “architecturally correct” argument order rs1, rs2, rs3.)

The fsl rd, rs1, rs3, rs2 instruction creates a 2 ⋅ XLEN word by concatenating rs1 and rs3 (with rs1 in the MSB half), rotate-left-shifts that word by the amount indicated in the log2(XLEN) + 1 LSB bits in rs2, and then writes the MSB half of the result to rd.

The fsr rd, rs1, rs3, rs2 instruction creates a 2 ⋅ XLEN word by concatenating rs1 and rs3 (with rs1 in the LSB half), rotate-right-shifts that word by the amount indicated in the log2(XLEN) + 1 LSB bits in rs2, and then writes the LSB half of the result to rd.

[bextcref-fsl]

[bextcref-fsr]
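As the reference listings are not reproduced here, a hedged C sketch of fsr for XLEN = 32 (fsr32 is a made-up name; the argument order is the “architecturally correct” rs1, rs2, rs3):

```c
#include <stdint.h>

/* fsr sketch for XLEN=32: concatenate {rs3, rs1} with rs1 in the LSB
 * half, rotate the 64-bit word right by the low log2(32)+1 = 6 bits
 * of rs2, and return the LSB half. (Illustrative only.) */
uint32_t fsr32(uint32_t rs1, uint32_t rs2, uint32_t rs3)
{
    int shamt = rs2 & 63;
    uint64_t word = ((uint64_t)rs3 << 32) | rs1;
    if (shamt)
        word = (word >> shamt) | (word << (64 - shamt));
    return (uint32_t)word;
}
```

With rs3 = 0 this degenerates to a plain logical right shift, and with rs3 = rs1 to a 32-bit rotate, matching the identities listed below.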

A shift unit capable of either fsl or fsr is capable of performing all the other shift functions, including the other funnel shift, with only minimal additional logic.

For any values of A, B, and C:

  fsl(A, B, C) = fsr(A, -B, C)

And for any values x and 0 ≤ shamt < XLEN:

  sll(x, shamt) == fsl(x, shamt, 0)
  srl(x, shamt) == fsr(x, shamt, 0)
  sra(x, shamt) == fsr(x, shamt, sext_x)
  slo(x, shamt) == fsl(x, shamt, ~0)
  sro(x, shamt) == fsr(x, shamt, ~0)
  ror(x, shamt) == fsr(x, shamt, x)
  rol(x, shamt) == fsl(x, shamt, x)

Furthermore, an RV64 implementation of either fsl or fsr is capable of performing the *W versions of all shift operations with only a few gates of additional control logic.

On RV128 there is no fsri instruction, but there are fsriw and fsrid.

## 2.10 Shift-and-add instructions (sh1add, sh2add, sh3add)

These instructions shift rs1 left by 1, 2, or 3 bits, then add the result to rs2. The sh?addu.w instructions are identical to sh?add, except that bits XLEN-1:32 of the rs1 argument are cleared before the shift.

An opcode for sh4add/sh4addu.w for RV128 and/or RVQ is reserved.
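As an illustration of these semantics, a hedged C sketch for XLEN = 64 (the function names sh2add and sh2addu_w are made up):

```c
#include <stdint.h>

/* sh2add sketch: rd = (rs1 << 2) + rs2, e.g. indexing an array of
 * 4-byte elements. (Illustrative only; names are made up.) */
uint64_t sh2add(uint64_t rs1, uint64_t rs2)
{
    return (rs1 << 2) + rs2;
}

/* sh2addu.w sketch: same, but bits XLEN-1:32 of rs1 are cleared
 * before the shift. */
uint64_t sh2addu_w(uint64_t rs1, uint64_t rs2)
{
    return ((rs1 & 0xFFFFFFFFull) << 2) + rs2;
}
```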

## 2.11 Mixed uint32/int64 arithmetic instructions

Consider RV64 C code that’s using unsigned 32-bit ints as array indices. For example:

  char u32i64demo(char *p, unsigned int i) {
      return p[i-1];
  }

In such cases the expression within p[…] must overflow according to 32-bit arithmetic, then be zero-extended, and then this zero-extended result must be used in the address calculation.

The instructions below ensure that no explicit zext.w instruction is needed in such cases, so that code like the example above carries no systematic performance penalty on RV64 compared to RV32.
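The required semantics can be expressed in plain C (index_zext is a made-up name for illustration):

```c
#include <stdint.h>

/* On RV64, the index expression i-1 must wrap around in 32 bits and
 * then be zero-extended before the pointer add. An instruction like
 * addiwu performs both steps at once. (Sketch; name is made up.) */
uint64_t index_zext(uint32_t i)
{
    uint32_t wrapped = i - 1;   /* 32-bit wrap-around (addiw-style) */
    return (uint64_t)wrapped;   /* zero-extension (no separate zext.w) */
}
```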

### 2.11.1 Add/sub with postfix zero-extend (addwu, subwu, addiwu)

RV64: addwu rd, rs1, rs2 subwu rd, rs1, rs2 addiwu rd, rs1, imm

These instructions are identical to addw, subw, addiw, except that bits XLEN-1:32 of the result are cleared after the addition. That is, these instructions zero-extend the 32-bit result instead of sign-extending it.

The 12-bit immediate to addiwu is sign-extended, exactly like the immediate to addiw.
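A hedged C sketch of the difference between addw and addwu (function names are made up; XLEN = 64):

```c
#include <stdint.h>

/* addwu sketch: perform a 32-bit add, then zero-extend the result. */
uint64_t addwu(uint64_t rs1, uint64_t rs2)
{
    return (uint32_t)(rs1 + rs2);
}

/* addw (base ISA) for comparison: same add, but sign-extended.
 * Written portably instead of relying on signed-shift behavior. */
uint64_t addw(uint64_t rs1, uint64_t rs2)
{
    uint32_t sum = (uint32_t)(rs1 + rs2);
    return (sum & 0x80000000u) ? (0xFFFFFFFF00000000ull | sum) : sum;
}
```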

### 2.11.2 Add/sub/shift with prefix zero-extend (addu.w, subu.w, slliu.w)

RV64: addu.w rd, rs1, rs2 subu.w rd, rs1, rs2 slliu.w rd, rs1, imm

slliu.w is identical to slli, except that bits XLEN-1:32 of the rs1 argument are cleared before the shift.

addu.w and subu.w are identical to add and sub, except that bits XLEN-1:32 of the rs2 argument are cleared before the add/subtract.
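The prefix zero-extension can be sketched the same way (addu_w and slliu_w are made-up names; XLEN = 64):

```c
#include <stdint.h>

/* addu.w sketch: rs2 is zero-extended from 32 bits before the add. */
uint64_t addu_w(uint64_t rs1, uint64_t rs2)
{
    return rs1 + (rs2 & 0xFFFFFFFFull);
}

/* slliu.w sketch: rs1 is zero-extended from 32 bits before the shift. */
uint64_t slliu_w(uint64_t rs1, unsigned imm)
{
    return (rs1 & 0xFFFFFFFFull) << (imm & 63);
}
```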

## 2.12 Opcode Encodings

This chapter contains proposed encodings for most of the instructions described in this document. DO NOT IMPLEMENT THESE OPCODES YET. We are trying to get official opcodes assigned and will update this chapter soon with the official opcodes.

The andn, orn, and xnor instructions are encoded the same way as and, or, and xor, but with op[30] set, mirroring the encoding scheme used for add and sub.

All shift instructions use funct3=001 for left shifts and funct3=101 for right shifts. Just like in the RISC-V integer base ISA, the shift-immediate instructions have a 5 bit immediate on RV32, and a 6 bit immediate on RV64, and we reserve encoding space for a 7 bit immediate for RV128. The same sizes apply to sbseti, sbclri, sbinvi, and sbexti.

The immediate for shfli/unshufli is one bit smaller than the immediate for shift instructions, that is 4 bits on RV32, 5 bits on RV64, and we reserve 6 bits for RV128.

op[26]=1 selects funnel shifts. For funnel shifts op[30:29] is part of the 3rd operand and therefore unused for encoding the operation. For all other shift operations op[26]=0.

fsri is also encoded with op[26]=1, leaving a 6 bit immediate. The 7th bit, which would be necessary to perform a 128 bit funnel shift on RV64, can be emulated by swapping rs1 and rs3.

There is no shfliw instruction. The slliu.w instruction occupies the encoding slot that would be occupied by shfliw.

On RV128 op[26] contains the MSB of the immediate for the shift instructions. Therefore there is no FSRI instruction on RV128. (But there is FSRIW/FSRID.)

         | SLL  SRL  SRA | SLO SRO | ROL ROR | FSL FSR
  op[30] |   0    0    1 |   0   0 |   1   1 |   -   -
  op[29] |   0    0    0 |   1   1 |   1   1 |   -   -
  op[26] |   0    0    0 |   0   0 |   0   0 |   1   1
  funct3 | 001  101  101 | 001 101 | 001 101 | 001 101

Only an encoding for RORI exists, as ROLI can be implemented with RORI by negating the immediate. Unary functions are encoded in the spot that would correspond to ROLI, with the function encoded in the 5 LSB bits of the immediate.

The CRC instructions are encoded as unary instructions with op[24] set. The polynomial is selected via op[23], with op[23]=0 for CRC32 and op[23]=1 for CRC32C. The width is selected with op[22:20], using the same encoding as is used in funct3 for load/store operations.

cmix and cmov are encoded using the two remaining ternary operator encodings in funct3=001 and funct3=101. (There are two ternary operator encodings per minor opcode using the op[26]=1 scheme for marking ternary OPs.)

The single-bit instructions are also encoded within the shift opcodes, with op[27] set, and using op[30] and op[29] to select the operation:

         | SBCLR  SBSET  SBINV | SBEXT   GORC   GREV
  op[30] |     1      0      1 |     1      0      1
  op[29] |     0      1      1 |     0      1      1
  op[27] |     1      1      1 |     1      1      1
  funct3 |   001    001    001 |   101    101    101

There is no sbextiw instruction as it can be emulated trivially using sbexti. However, there are sbsetiw, sbclriw, and sbinviw, as changing bit 31 would change the sign extension. There are non-immediate *W variants of all single-bit instructions, including sbextw, because the number of used bits in rs2 differs between sbext and sbextw.

GORC and GREV are encoded in the two remaining slots in the single-bit instruction encoding space.

The remaining instructions are encoded within funct7=0000100 and funct7=0000101.

The funct7=0000101 block contains clmul[hr], min[u], and max[u].

The encoding of clmul, clmulr, clmulh is identical to the encoding of mulh, mulhsu, mulhu, except that op[27]=1.

The encoding of min[u]/max[u] uses funct3=100..111. The funct3 encoding matches op[31:29] of the AMO min/max functions.

The remaining instructions are encoded within funct7=0000100. The shift-like shfl/unshfl instructions use the same funct3 values as the shift operations. bdep and bext are encoded so that funct3[2] selects the “direction”, similar to the shift operations.

bmat[x]or use funct3=011 and funct3=111 in funct7=0000100.

pack occupies funct3=100 in funct7=0000100.

addwu and subwu are encoded like addw and subw, except that op[25]=1 and op[27]=1.

addu.w and subu.w are encoded like addw and subw, except that op[27]=1.

addiwu is encoded using funct3=100 (XOR) instead of funct3=000 in OP-IMM-32.

Finally, RV64 has W instructions for all bitmanip instructions, with the following exceptions:

andn, cmix, cmov, min[u], max[u] have no W variants because they already behave in the way a W instruction would when presented with sign-extended 32-bit arguments.

bmatflip, bmatxor, bmator have no W variants because they are 64-bit only instructions.

crc32.[bhwd], crc32c.[bhwd] have no W variants because crc32[c].w is deemed sufficient.

There is no [un]shfliw, as a perfect outer shuffle always preserves the MSB, thus [un]shfli preserves proper sign extension when the upper bit in the control word is set. There’s still [un]shflw that masks that upper control bit and sign-extends the output.

Relevant instruction encodings from the base ISA are included in the table below and are marked with an asterisk (*).

|  3                   2                   1                    |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------------------------------------------------------|
|    funct7   |   rs2   |   rs1   |  f3 |    rd   |    opcode   |  R-type
|   rs3   | f2|   rs2   |   rs1   |  f3 |    rd   |    opcode   |  R4-type
|          imm          |   rs1   |  f3 |    rd   |    opcode   |  I-type
|===============================================================|
|  0000000    |   rs2   |   rs1   | 111 |    rd   |   0110011   |  AND*
|  0000000    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  OR*
|  0000000    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  XOR*
|  0100000    |   rs2   |   rs1   | 111 |    rd   |   0110011   |  ANDN
|  0100000    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  ORN
|  0100000    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  XNOR
|---------------------------------------------------------------|
|  0000000    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  SLL*
|  0000000    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  SRL*
|  0100000    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  SRA*
|  0010000    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  SLO
|  0010000    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  SRO
|  0110000    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  ROL
|  0110000    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  ROR
|---------------------------------------------------------------|
|  0010000    |   rs2   |   rs1   | 010 |    rd   |   0110011   |  SH1ADD
|  0010000    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  SH2ADD
|  0010000    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  SH3ADD
|---------------------------------------------------------------|
|  0100100    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  SBCLR
|  0010100    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  SBSET
|  0110100    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  SBINV
|  0100100    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  SBEXT
|  0010100    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  GORC
|  0110100    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  GREV
|---------------------------------------------------------------|
|  00000  |     imm     |   rs1   | 001 |    rd   |   0010011   |  SLLI*
|  00000  |     imm     |   rs1   | 101 |    rd   |   0010011   |  SRLI*
|  01000  |     imm     |   rs1   | 101 |    rd   |   0010011   |  SRAI*
|  00100  |     imm     |   rs1   | 001 |    rd   |   0010011   |  SLOI
|  00100  |     imm     |   rs1   | 101 |    rd   |   0010011   |  SROI
|  01100  |     imm     |   rs1   | 101 |    rd   |   0010011   |  RORI
|---------------------------------------------------------------|
|  01001  |     imm     |   rs1   | 001 |    rd   |   0010011   |  SBCLRI
|  00101  |     imm     |   rs1   | 001 |    rd   |   0010011   |  SBSETI
|  01101  |     imm     |   rs1   | 001 |    rd   |   0010011   |  SBINVI
|  01001  |     imm     |   rs1   | 101 |    rd   |   0010011   |  SBEXTI
|  00101  |     imm     |   rs1   | 101 |    rd   |   0010011   |  GORCI
|  01101  |     imm     |   rs1   | 101 |    rd   |   0010011   |  GREVI
|---------------------------------------------------------------|
|   rs3   | 11|   rs2   |   rs1   | 001 |    rd   |   0110011   |  CMIX
|   rs3   | 11|   rs2   |   rs1   | 101 |    rd   |   0110011   |  CMOV
|   rs3   | 10|   rs2   |   rs1   | 001 |    rd   |   0110011   |  FSL
|   rs3   | 10|   rs2   |   rs1   | 101 |    rd   |   0110011   |  FSR
|   rs3   |1|    imm    |   rs1   | 101 |    rd   |   0010011   |  FSRI
|---------------------------------------------------------------|
|  3                   2                   1                    |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|===============================================================|
|  0110000    |  00000  |   rs1   | 001 |    rd   |   0010011   |  CLZ
|  0110000    |  00001  |   rs1   | 001 |    rd   |   0010011   |  CTZ
|  0110000    |  00010  |   rs1   | 001 |    rd   |   0010011   |  PCNT
|  0110000    |  00011  |   rs1   | 001 |    rd   |   0010011   |  BMATFLIP
|  0110000    |  00100  |   rs1   | 001 |    rd   |   0010011   |  SEXT.B
|  0110000    |  00101  |   rs1   | 001 |    rd   |   0010011   |  SEXT.H
|---------------------------------------------------------------|
|  0110000    |  10000  |   rs1   | 001 |    rd   |   0010011   |  CRC32.B
|  0110000    |  10001  |   rs1   | 001 |    rd   |   0010011   |  CRC32.H
|  0110000    |  10010  |   rs1   | 001 |    rd   |   0010011   |  CRC32.W
|  0110000    |  10011  |   rs1   | 001 |    rd   |   0010011   |  CRC32.D
|  0110000    |  11000  |   rs1   | 001 |    rd   |   0010011   |  CRC32C.B
|  0110000    |  11001  |   rs1   | 001 |    rd   |   0010011   |  CRC32C.H
|  0110000    |  11010  |   rs1   | 001 |    rd   |   0010011   |  CRC32C.W
|  0110000    |  11011  |   rs1   | 001 |    rd   |   0010011   |  CRC32C.D
|---------------------------------------------------------------|
|  0000101    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  CLMUL
|  0000101    |   rs2   |   rs1   | 010 |    rd   |   0110011   |  CLMULR
|  0000101    |   rs2   |   rs1   | 011 |    rd   |   0110011   |  CLMULH
|  0000101    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  MIN
|  0000101    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  MAX
|  0000101    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  MINU
|  0000101    |   rs2   |   rs1   | 111 |    rd   |   0110011   |  MAXU
|---------------------------------------------------------------|
|  0000100    |   rs2   |   rs1   | 001 |    rd   |   0110011   |  SHFL
|  0000100    |   rs2   |   rs1   | 101 |    rd   |   0110011   |  UNSHFL
|  0100100    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  BDEP
|  0000100    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  BEXT
|  0000100    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  PACK
|  0100100    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  PACKU
|  0000100    |   rs2   |   rs1   | 011 |    rd   |   0110011   |  BMATOR
|  0100100    |   rs2   |   rs1   | 011 |    rd   |   0110011   |  BMATXOR
|  0000100    |   rs2   |   rs1   | 111 |    rd   |   0110011   |  PACKH
|  0100100    |   rs2   |   rs1   | 111 |    rd   |   0110011   |  BFP
|---------------------------------------------------------------|
|  000010   |    imm    |   rs1   | 001 |    rd   |   0010011   |  SHFLI
|  000010   |    imm    |   rs1   | 101 |    rd   |   0010011   |  UNSHFLI
|===============================================================|
|       immediate       |   rs1   | 000 |    rd   |   0011011   |  ADDIW*
|       immediate       |   rs1   | 100 |    rd   |   0011011   |  ADDIWU
|  00001  |     imm     |   rs1   | 001 |    rd   |   0011011   |  SLLIU.W
|---------------------------------------------------------------|
|  0000000    |   rs2   |   rs1   | 000 |    rd   |   0111011   |  ADDW*
|  0100000    |   rs2   |   rs1   | 000 |    rd   |   0111011   |  SUBW*
|  0000101    |   rs2   |   rs1   | 000 |    rd   |   0111011   |  ADDWU
|  0100101    |   rs2   |   rs1   | 000 |    rd   |   0111011   |  SUBWU
|  0000100    |   rs2   |   rs1   | 000 |    rd   |   0111011   |  ADDU.W
|  0100100    |   rs2   |   rs1   | 000 |    rd   |   0111011   |  SUBU.W
|---------------------------------------------------------------|
|  3                   2                   1                    |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|===============================================================|
|  0010000    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  SLOW
|  0010000    |   rs2   |   rs1   | 101 |    rd   |   0111011   |  SROW
|  0110000    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  ROLW
|  0110000    |   rs2   |   rs1   | 101 |    rd   |   0111011   |  RORW
|---------------------------------------------------------------|
|  0010000    |   rs2   |   rs1   | 010 |    rd   |   0111011   |  SH1ADDU.W
|  0010000    |   rs2   |   rs1   | 100 |    rd   |   0111011   |  SH2ADDU.W
|  0010000    |   rs2   |   rs1   | 110 |    rd   |   0111011   |  SH3ADDU.W
|---------------------------------------------------------------|
|  0100100    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  SBCLRW
|  0010100    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  SBSETW
|  0110100    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  SBINVW
|  0100100    |   rs2   |   rs1   | 101 |    rd   |   0111011   |  SBEXTW
|  0010100    |   rs2   |   rs1   | 101 |    rd   |   0111011   |  GORCW
|  0110100    |   rs2   |   rs1   | 101 |    rd   |   0111011   |  GREVW
|---------------------------------------------------------------|
|  0010000    |   imm   |   rs1   | 001 |    rd   |   0011011   |  SLOIW
|  0010000    |   imm   |   rs1   | 101 |    rd   |   0011011   |  SROIW
|  0110000    |   imm   |   rs1   | 101 |    rd   |   0011011   |  RORIW
|---------------------------------------------------------------|
|  0100100    |   imm   |   rs1   | 001 |    rd   |   0011011   |  SBCLRIW
|  0010100    |   imm   |   rs1   | 001 |    rd   |   0011011   |  SBSETIW
|  0110100    |   imm   |   rs1   | 001 |    rd   |   0011011   |  SBINVIW
|  0010100    |   imm   |   rs1   | 101 |    rd   |   0011011   |  GORCIW
|  0110100    |   imm   |   rs1   | 101 |    rd   |   0011011   |  GREVIW
|---------------------------------------------------------------|
|   rs3   | 10|   rs2   |   rs1   | 001 |    rd   |   0111011   |  FSLW
|   rs3   | 10|   rs2   |   rs1   | 101 |    rd   |   0111011   |  FSRW
|   rs3   | 10|   imm   |   rs1   | 101 |    rd   |   0011011   |  FSRIW
|---------------------------------------------------------------|
|  0110000    |  00000  |   rs1   | 001 |    rd   |   0011011   |  CLZW
|  0110000    |  00001  |   rs1   | 001 |    rd   |   0011011   |  CTZW
|  0110000    |  00010  |   rs1   | 001 |    rd   |   0011011   |  PCNTW
|---------------------------------------------------------------|
|  0000101    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  CLMULW
|  0000101    |   rs2   |   rs1   | 010 |    rd   |   0111011   |  CLMULRW
|  0000101    |   rs2   |   rs1   | 011 |    rd   |   0111011   |  CLMULHW
|---------------------------------------------------------------|
|  0000100    |   rs2   |   rs1   | 001 |    rd   |   0111011   |  SHFLW
|  0000100    |   rs2   |   rs1   | 101 |    rd   |   0111011   |  UNSHFLW
|  0100100    |   rs2   |   rs1   | 110 |    rd   |   0111011   |  BDEPW
|  0000100    |   rs2   |   rs1   | 110 |    rd   |   0111011   |  BEXTW
|  0000100    |   rs2   |   rs1   | 100 |    rd   |   0111011   |  PACKW
|  0100100    |   rs2   |   rs1   | 100 |    rd   |   0111011   |  PACKUW
|  0100100    |   rs2   |   rs1   | 111 |    rd   |   0111011   |  BFPW
|---------------------------------------------------------------|

Changes in v0.93 of the RISC-V Bitmanip Spec (+ for addition, - for removal):

|  3                   2                   1                    |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|===============================================================|
|  0010000    |   rs2   |   rs1   | 010 |    rd   |   0110011   |  + SH1ADD
|  0010000    |   rs2   |   rs1   | 100 |    rd   |   0110011   |  + SH2ADD
|  0010000    |   rs2   |   rs1   | 110 |    rd   |   0110011   |  + SH3ADD
|---------------------------------------------------------------|
|  0010000    |   rs2   |   rs1   | 010 |    rd   |   0111011   |  + SH1ADDU.W
|  0010000    |   rs2   |   rs1   | 100 |    rd   |   0111011   |  + SH2ADDU.W
|  0010000    |   rs2   |   rs1   | 110 |    rd   |   0111011   |  + SH3ADDU.W
|---------------------------------------------------------------|

## 2.13 Future compressed instructions

The RISC-V ISA has no dedicated instructions for bitwise inverse (not). Instead not is implemented as xori rd, rs, -1 and neg is implemented as sub rd, x0, rs.

In bitmanipulation code not is a very common operation, but there is no compressed encoding for it because there is no c.xori instruction.

On RV64 (and RV128) zext.w and zext.d (pack and packw) are commonly used to zero-extend unsigned values narrower than XLEN.

It presumably would make sense for a future revision of the “C” extension to include compressed opcodes for those instructions.

An encoding with the constraint rd = rs would fit nicely in the reserved space in c.addi16sp/c.lui.

The entire RVC encoding space is 15.585 bits wide, and the remaining reserved encoding space in RVC is 11.155 bits wide, not including space that is only reserved on RV32/RV64. This means that the above encoding would use 0.0065% of the RVC encoding space, or 1.4% of the remaining reserved RVC encoding space. Preliminary experiments have shown that NOT instructions alone make up approximately 1% of bitmanipulation code size. [Wolf17A]

## 2.14 Macro-op fusion patterns

Some bitmanip operations have been left out of this spec for lack of a (sensible) way to encode them in the 32-bit RISC-V encoding space. Instead we present fuse-able sequences of up to three instructions for those operations, so that high-end processors can implement them in a single fused macro-op, should they decide to do so.

For this document we only consider sequences fuse-able if they read at most two registers, only write one register, and contain no branch instructions.

### 2.14.1 Fused *-bfp sequences

The bfp instruction is most commonly used in sequences of one of the following forms.

For 32-bit (RV32 or *W instructions on RV64):

  addi rd, zero, ...
  pack[w] rd, rs2, rd
  bfp[w] rd, rs1, rd

  lui rd, ...
  bfp[w] rd, rs1, rd

And for 64-bit:

  lui rd, ...
  pack rd, rs2, rd
  bfp rd, rs1, rd

### 2.14.2 Fused load-immediate sequences

For 32-bit code (RV32 or *W instructions on RV64) we recommend fusing the lui+addi pattern for loading a 32-bit constant:

  lui rd, imm
  addi[w] rd, rd, imm

For RV64 we also recommend fusing the following patterns for loading a 32-bit constant zero-extended into a 64-bit register:

  lui rd, imm
  addiwu rd, rd, imm

  addiw rd, zero, v
  pack rd, rd, zero

  lui rd, imm
  pack rd, rd, rs

### 2.14.3 Fused shift sequences

Pairs of left and right shifts are common operations for extracting a bit field.

To extract the contiguous bit field starting at pos with length len from rs (with pos > 0, len > 0, and pos + len ≤ XLEN):

  slli rd, rs, (XLEN-len-pos)
srli rd, rd, (XLEN-len)

Using srai instead of srli will sign-extend the extracted bit-field.
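The extraction sequence can be modeled in C for XLEN = 64 (bfextu64 and bfext64 are made-up names; the signed variant relies on arithmetic right shift of negative values, which is implementation-defined in ISO C but arithmetic on common compilers):

```c
#include <stdint.h>

/* Extract len bits starting at pos via slli+srli, XLEN = 64.
 * (Sketch; names are made up for illustration.) */
uint64_t bfextu64(uint64_t rs, int len, int pos)
{
    return (rs << (64 - len - pos)) >> (64 - len);      /* slli; srli */
}

/* Same, but sign-extending the extracted field (slli; srai). */
int64_t bfext64(uint64_t rs, int len, int pos)
{
    return (int64_t)(rs << (64 - len - pos)) >> (64 - len);
}
```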

Similarly, placing a bit field with length len at the position pos:

  slli rd, rs, (XLEN-len)
srli rd, rd, (XLEN-len-pos)

Note that the postfix right shift instruction can use a compressed encoding, yielding a 48-bit fused instruction if the left shift is a 32-bit instruction.
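The placement sequence can be modeled the same way (bfplace64 is a made-up name; XLEN = 64):

```c
#include <stdint.h>

/* Place the low len bits of rs at bit position pos via slli+srli,
 * XLEN = 64. (Sketch; the name is made up for illustration.) */
uint64_t bfplace64(uint64_t rs, int len, int pos)
{
    return (rs << (64 - len)) >> (64 - len - pos);      /* slli; srli */
}
```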

More generally, a processor might fuse all destructive shift operations following any other ALU operation.

We define the following assembler pseudo-ops for sr[la]i postfix fusion:

  bfext  rd, rs, len, pos   ->   slli rd, rs, (XLEN-len-pos); srai rd, rd, (XLEN-len)
  bfextu rd, rs, len, pos   ->   slli rd, rs, (XLEN-len-pos); srli rd, rd, (XLEN-len)
  bfmak  rd, len, pos       ->   sroi rd, zero, len; srli rd, rd, (XLEN-len-pos)

(The names bfext, bfextu, and bfmak are borrowed from m88k, which had dedicated instructions of those names (without the bf prefix) with equivalent semantics. [m88k])

## 2.15 Other micro-architectural considerations

In addition to macro-op fusion, we issue the following recommendations for cores that aim at better performance for bitmanipulation tasks.

### 2.15.1 Unaligned memory access

The base ISA spec requires load/store operations that are not naturally aligned to succeed in U-mode, but explicitly states that execution may be slow.

For many bitmanipulation tasks it can be of great advantage to be able to perform load and store operations with arbitrary alignments quickly. Thus we recommend that cores optimized for bitmanipulation tasks provide fast hardware support for load/store with arbitrary alignment.

There should be a getauxval()-based mechanism as part of the RISC-V Linux ABI that can be used to query information on support for unaligned memory access. [FastLdSt]

### 2.15.2 Fast multiply

A lot of bitmanipulation tricks rely on multiplication with “magic numbers” and similar tricks involving multiply/divide instructions. Thus, cores optimized for bitmanipulation tasks should provide reasonably fast implementations of the “M”-extension multiply/divide instructions.

## 2.16 C intrinsics via <rvintrin.h>

A C header file <rvintrin.h> is provided that contains assembler templates for directly creating assembler instructions from C code.

The header defines _rv_*(...) functions that operate on the long data type, _rv32_*(...) functions that operate on the int32_t data type, and _rv64_*(...) functions that operate on the int64_t data type. The _rv64_*(...) functions are only available on RV64. See table 1.11 for a complete list of intrinsics defined in <rvintrin.h>.

Usage example:

  #include <rvintrin.h>

  int find_nth_set_bit(unsigned int value, int cnt) {
      return _rv32_ctz(_rv32_bdep(1 << cnt, value));
  }

Defining RVINTRIN_EMULATE before including <rvintrin.h> will define plain C functions that emulate the behavior of the RISC-V instructions. This is useful for testing software on non-RISC-V platforms.
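For illustration, a plain-C fallback in the spirit of what RVINTRIN_EMULATE provides for the ctz intrinsic (a sketch; emu_ctz32 is a made-up name and the actual header's implementation may differ):

```c
#include <stdint.h>

/* Emulation of the 32-bit ctz semantics described earlier: count
 * trailing zero bits, returning 32 (XLEN) for a zero input.
 * (Sketch; not the symbol defined by <rvintrin.h>.) */
int emu_ctz32(uint32_t x)
{
    if (x == 0)
        return 32;
    int n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```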