# Five EmbedDev

An Embedded RISC-V Blog
RISC-V Bitmanip Extension

# 4 Example Applications

This chapter contains a collection of short code snippets and algorithms using the Bitmanip extension. It also contains some examples of bit manipulation code that doesn’t require any extension beyond the base ISA.

## 4.1 Basic Bitmanipulation

``````  lui a0, ((v - (v << 20 >> 20)) >> 12) & 0xfffff
addi a0, a0, (v << 20 >> 20)``````

(Assuming signed 32-bit arithmetic for the expression `(v << 20 >> 20)`.)

Using the following sequence on RV64 will yield the 32-bit constant in sign-extended form.

``````  lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
addiw a0, a0, (v << 52 >> 52)``````

(`addiw` is needed instead of `addi` to handle the cases correctly that have bits 11-30 of the constant set to one.)

Using `addiwu` instead of `addiw` produces a zero-extended version of the same constant, iff any of the bits 11-31 of the constant is zero.

``````  lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
addiwu a0, a0, (v << 52 >> 52)``````

In the remaining cases with bits 11-31 all set, `addi`+`pack` can be used to produce the constant:

``````  addi a0, zero, v
pack a0, a0, zero``````

64-bit constants of the form R × 1 S × 0 T × 1 (with R + S + T = 64) can be constructed using `sloi` and `rori`:

``````  sloi a0, zero, R+T
rori a0, a0, R``````

Likewise, constructing R × 0 S × 1 T × 0:

``````  sloi a0, zero, S
slli a0, a0, T``````

Any constant that is just a repeating 16-bit pattern can be constructed in two instructions with `lui` and `orc16`. For example, constructing `0xABCD_ABCD_ABCD_ABCD`:

``````  lui a0, 0x0BCDA
orc16 a0, a0``````

Finally, any arbitrary 64-bit constant can be created using the following 5-instruction pattern and one spill register:

``````  lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
addiw a0, a0, (v << 52 >> 52)
lui a1, ((v - (v << 20 >> 20)) >> 44) & 0xfffff
addiw a1, a1, (v << 20 >> 52)
pack a0, a0, a1``````

### 4.1.2 Bitfield extract

Extracting a bit field of length `len` at position `pos` can be done using two shift operations.

``````  slli a0, a0, (XLEN-len-pos)
srli a0, a0, (XLEN-len)``````

Or using `srai` for a signed bit-field.

``````  slli a0, a0, (XLEN-len-pos)
srai a0, a0, (XLEN-len)``````

### 4.1.3 Packing bit-fields

There are different ways of packing bit-fields with the help of RISC-V BitManip instructions.

For example, packing a 16-bit RGB value in 5:6:5 format, from the bytes in a0, a1, and a2, using `pack[h]` and `bext`:

``````  li t0, 0x00f8fcf8

packh a0, a0, a1
pack  a0, a0, a2
bext  a0, a0, t0``````

Or using funnel shifts (assuming `a2` is already in zero-extended form and/or the upper bits of the return value do not matter):

``````  srli a2, a2, 3
slli a1, a1, XLEN-8
fsli a1, a2, a1, 6
slli a0, a0, XLEN-8
fsli a0, a1, a0, 5``````

Using only base-ISA instructions, at least 7 instructions are needed to pack a 5:6:5 RGB value (assuming `a0` is alredy in zero-extended form):

``````  andi a2, a2, 0xf8
slli a2, a2, 9
andi a1, a1, 0xfc
slli a1, a1, 3
srli a0, a0, 3
or a1, a1, a2
or a0, a0, a1``````

Another example for packing bit fields is generating IEEE floats using only integer instructions (aka “soft float”). For example, generating a 32-bit float in the range [ − 1.0… + 1.0) from a signed 16-bit integer:

``````  short2float:
neg a1, a0
max a1, a1, a0
clz a2, a1
srli a0, a0, 31
sll a3, a1, a2
srli a3, a3, 15
neg a2, a2
packh a0, a2, a0
pack a0, a3, a0
slli a4, a4, 7
orc a1, a1
and a0, a0, a1
ret``````

Or using funnel shifts:

``````  short2float:
neg a1, a0
max a1, a1, a0
clz a2, a1
sll a3, a1, a2
slli a3, a3, 1
neg a2, a2
fsri a3, a3, a2, 8
srli a0, a0, 31
fsri a0, a3, a0, 1
cmov a0, a1, a0, zero
ret``````

### 4.1.4 Parity check

The parity of a word (xor of all bits) is the LSB of the population count.

``````  pcnt a0, a0
andi a0, a0, 1``````

### 4.1.5 Average of two integers

The following four instructions calculate the average of the unsigned integers in `a0` and `a1`, with compensation for overflow:

``````  and  a2, a0, a1
xor  a0, a0, a1
srli a0, a0, 1

And likewise the average of two signed integers:

``````  and  a2, a0, a1
xor  a0, a0, a1
srai a0, a0, 1

With `fsri` the unsigned case can be accomplished in just three instructions:

``````  add  a0, a0, a1
sltu a1, a0, a1
fsri a0, a1, 1``````

### 4.1.6 Detecting integer overflow

Overflow in unsigned addition can be detected in two instructions:

``````  add  a0, a1, a2
bltu a0, a1, overflow``````

For signed addition, if the sign of one operand is known, for example because it is constant:

``````  addi a0, a1, +imm
blt  a0, a1, overflow``````
``````  addi a0, a1, -imm
bgt  a0, a1, overflow``````

And for signed addition in the general case:

``````  add  a0, a1, a2
slti a3, a1, 0
slt  a4, a0, a2
bne  a3, a4, overflow``````

And finally, generating the carry flag for an addition:

``````  add  a0, a1, a2
sltu a3, a0, a1``````

Thus, adding `a0`, `a1`, and `a2` with results in `a0` and carry-out in `a1`:

``````  add  a0, a0, a1
sltu a1, a0, a1
sltu a2, a0, a2

### 4.1.7 Fuse-able sequences for logic operations

RISC-V has dedicated instructions for branching on equal/not-equal. But C code such as the following would require set-equal and set-not-equal instructions, similar to `slt`.

``````  int is_equal = (a == b);
int is_noteq = (c != d);``````

Those can be implemented using the following fuse-able sequences:

``````  sub rd, rs1, rs2
sltui rd, rd, 1

sub rd, rs1, rs2
sltu rd, zero, rd``````

Likewise for logic OR:

``````  int logic_or  = (c || d);

or rd, rs1, rs2
sltu rd, zero, rd``````

And for logic AND, if `rd != rs2`:

``````  int logic_and  = (c && d);

orc rd, rs1
and rd, rd, rs2
sltu rd, zero, rd``````

### 4.1.8 Rotate shift of bytes and half-words

Rotate right shift of the byte in `a0` by the shift amount in `a1`, assuming `a0` is stored in zero-extended form:

``````  orc8 a0, a0
ror a0, a1
andi a0, a0, 255``````

And rotate right shift of the 16-bit half-word in `a0` by the shift amount in `a1`, assuming `a0` is stored in zero-extended form:

``````  orc16 a0, a0
ror a0, a1
pack[w] a0, a0, zero``````

### 4.1.9 Rank and select

Rank and select are fundamental operations in succinct data structures [SelectX86].

`select(a0, a1)` returns the position of the `a1`th set bit in `a0`. It can be implemented efficiently using `bdep` and `ctz`:

``````  select:
sbset a1, zero, a1
bdep a0, a1, a0
ctz a0, a0
ret``````

`rank(a0, a1)` returns the number of set bits in `a0` up to and including position `a1`.

``````  rank:
not a1, a1
sll a0, a1
pcnt a0, a0
ret``````

### 4.1.10 OR/AND/XOR-reduce in byte vectors

OR-ing the bytes in a register and returning the resulting byte is easy with GORC:

``````  gorci a0, a0, -8
andi a0, 255``````

AND-ing can accomplished by applying De Morgan’s laws:

``````  not a0, a0
gorci a0, a0, -8
not a0, a0
andi a0, 255``````

``````  andi a1, zero, 0x80
gorci a1, a1, -8
clmulr a0, a0, a1
andi a0, 255``````

Where the first two instructions (andi+gorci) just create the constant 0x8080..8080.

Finally, on RV64, XOR-ing the bytes in a register can also be accomplished with BMATXOR:

``````  andi a1, zero, 0xff
bmatxor a0, a1, a0``````

### 4.1.11 Counting trailing non-zero bytes

Counting the trailing (LSB-end) non-zero bytes in a word is a helpful operation in optimized implementations of `strlen()` and `strcpy()`:

``````  int count_trailing_nonzero_bytes(long x)
{
return _rv_ctz(~_rv_orc_b(x)) >> 3;
}``````

### 4.1.12 Finding bytes of certain values

Finding zero bytes is a useful operations for `strchr()` and `memchr()`:

``````  bool check_zero_bytes(long x)
{
return ~_rv_orc_b(x) != 0;
}``````

To find other bytes we simply XOR the value with a mask of the byte value we are looking for:

``````  bool check_byte(long x, unsigned char c)
{
return ~_rv_orc_b(x ^ _rv_orc8(c)) != 0;
}``````

These schemes can easily be extended with `ctz` and `pcnt` to perform operations such as counting the number of bytes of a certain value within a word, or finding the position of the first such byte.

### 4.1.13 Fill right of most significant set bit

The “fill right” or “fold right” operation is a pattern commonly used in bit manipulation code. [MAGIC]

The straight-forward RV64 implementation requires 12 instructions:

``````  uint64_t rfill(uint64_t x)
{
x |= x >> 1;   // SRLI, OR
x |= x >> 2;   // SRLI, OR
x |= x >> 4;   // SRLI, OR
x |= x >> 8;   // SRLI, OR
x |= x >> 16;  // SRLI, OR
x |= x >> 32;  // SRLI, OR
return x;
}``````

With `clz` it can be implemented in only 4 instructions. Notice the handling of the case where `x=0` using `sltiu+addi`.

``````  uint64_t rfill_clz(uint64_t x)
{
uint64_t t;
t = clz(x);         // CLZ
x = (!x)-1;         // SLTIU, ADDI
x = x >> (t & 63);  // SRL
return x;
}``````

Alternatively, a Trailing Bit Manipulation (TBM) code pattern can be used together with `rev` to implement this function in 4 instructions:

``````  uint64_t rfill_rev(uint64_t x)
{
x = rev(x);         // GREVI
x = x | ~(x - 1);   // ADDI, ORN
x = rev(x);         // GREVI
return x;
}``````

Finally, there is another implementation in 4 instructions using BMATOR, if we do not count the extra instructions for loading utility matrices.

``````  uint64_t rfill_bmat(uint64_t x)
{
uint64_t m0, m1, m2, t;

m0 = 0xFF7F3F1F0F070301LL;  // LD
m1 = bmatflip(m0 << 8);     // SLLI, BMATFLIP

t = bmator(x, m0);          // BMATOR
x = bmator(x, m2);          // BMATOR
x = bmator(m1, x);          // BMATOR
x |= t;                     // OR

return x;
}``````

### 4.1.14 Round to next power of two

One common application of `rfill()` is rounding up to the next power of two:

``````  uint64_t round_pow2(uint64_t x)
{
return rfill(x-1)+1;
}``````

This can also be implemented in just 4 instructions, if we don’t care about the case where the above code overflows because `x` is already larger than the largest power-of-two representable in an `uint64_t`.

``````  uint64_t round_pow2(uint64_t x)
{
uint64_t t;
t = clz(x-1);     // ADDI, CLZ
x = ror(!!x, t);  // SLTU, ROR
return x;
}``````

Note that this code handles 0 → 0 and 1 → 1 correctly, i.e. equivialent to `rfill(x-1)+1`.

## 4.2 Packed vectors

### 4.2.1 Packing bytes

The following RV32 code packs the lower 8 bits from a0, a1, a2, a3 into a 32-bit word returned in a0, ignoring other bits in the input values.

``````  packh a0, a0, a1
packh a1, a2, a3
pack  a0, a0, a1``````

And the following RV64 code packs 8 bytes into a register.

``````  packh a0, a0, a1
packh a1, a2, a3
packh a2, a4, a5
packh a3, a6, a7
packw a0, a0, a1
packw a1, a2, a3
pack  a0, a0, a1``````

### 4.2.2 Permuting bytes

There are 24 ways of arranging the four bytes in a 32-bit word. `ror`, ` grev`, and `[un]shfl` can perform any of those permutations in at most 3 instructions. Table [permbytes] lists those sequences. [Wolf19A]

Bytes Instructions
A B C D initial byte order
A B D C `ROR(24),SHFL(8),ROR(8)`
A C B D `SHFL(8)`
A C D B `ROR(8),GREV(8),SHFL(8)`
A D B C `ROR(16),SHFL(8),ROR(24)`
A D C B `ROR(8),GREV(8)`
B A C D `ROR(8),SHFL(8),ROR(24)`
B A D C `GREV(8)`
B C A D `ROR(16),SHFL(8),ROR(8)`
B C D A `ROR(24)`
B D A C `GREV(8),SHFL(8)`
B D C A `ROR(24),SHFL(8)`
C A B D `ROR(8),GREV(24),SHFL(8)`
C A D B `ROR(16),SHFL(8)`
C B A D `ROR(8),GREV(24)`
C B D A `SHFL(8),ROR(24)`
C D A B `ROR(16)`
C D B A `ROR(8),SHFL(8),ROR(8)`
D A B C `ROR(8)`
D A C B `SHFL(8),ROR(8)`
D B A C `ROR(8),SHFL(8)`
D B C A `GREV(24),SHFL(8)`
D C A B `ROR(24),SHFL(8),ROR(24)`
D C B A `GREV(24)`

### 4.2.3 Widening and narrowing

The `[un]zip` instructions can help with widening and narrowing packed vectors. For example, narrowing the bytes in two words into a single word with the values in nibbles with values from `a0` in LSB half and values from `a1` in MSB half:

``````  unzip4 a0, a0
unzip4 a1, a1
pack a0, a0, a1``````

And widening the nibbles from `a0` into bytes in `a1` (MSB half) and `a0` (LSB half), with zero extension:

``````  srli a1, a0, XLEN/2
pack a0, a0, zero
zip4 a1, a1
zip4 a0, a0``````

And finally the same widening operation with sign extension:

``````  addi t0, zero, 8
orc4 t0, t0
and t0, t0, a0
orc.n t0, t0
srli t1, t0, XLEN/2
srli a1, a0, XLEN/2
pack a1, a1, t1
pack a0, a0, t0
zip4 a1, a1
zip4 a0, a0``````

### 4.2.4 Shifting packed vector elements

Using `zip` we can re-arrange the bits in a packed vector of N elements so that a shift by k of each byte becomes a shift of Nk of the entire new vector. So we zip, shift, and then `unzip` to shuffle everything back. The number of `zip` and `unzip` is log2(N). This works for all kinds of shift operations. For example, rotating a vector of bytes on RV32 in 6 instructions:

``````  zip a0, a0
zip a0, a0
slli a1, a1, 2
ror a0, a0, a1
unzip a0, a0
unzip a0, a0``````

Because `zip; zip; zip` is equal to `unzip; unzip` on RV32, and equal to `unzip; unzip; unzip` on RV64, we need never more than 2 `[un]zip` on RV32, or 3 `[un]zip` on RV64.

The following six instructions will add the elements of the two vectors passed in `a0` and `a1`, and return the vector of sums in `a0`.

This expects a mask in `a2` that marks the MSB bit of each vector element. For a vector of bytes this mask would be `0x8080...80` (which can be obtained in two instructions via `orc8(0x80)`).

``````  xor  a3, a0, a1
and  a3, a3, a2
andn a0, a0, a2
andn a1, a1, a2
xor  a0, a0, a3``````

## 4.3 Funnel shifts

A funnel shift takes two XLEN registers, concatenates them to a 2 × XLEN word, shifts that by a certain amount, then returns the lower half of the result for a right shift and the upper half of the result for a left shift.

The `fsl`, `fsr`, and `fsri` instructions perform funnel shifts.

### 4.3.1 Bigint shift

A common application for funnel shifts is shift operations in bigint libraries.

For example, the following functions implement rotate-shift operations for bigints made from `n` XLEN words.

``````  void bigint_rol(uint_xlen_t data[], int n, int shamt)
{
if (n <= 0)
return;

uint_xlen_t buffer = data[n-1];
for (int i = n-1; i > 0; i--)
data[i] = fsl(data[i], shamt, data[i-1]);
data[0] = fsl(data[0], shamt, buffer);
}

void bigint_ror(uint_xlen_t data[], int n, int shamt)
{
if (n <= 0)
return;

uint_xlen_t buffer = data[0];
for (int i = 0; i < n-1; i++)
data[i] = fsr(data[i], shamt, data[i+1]);
data[n-1] = fsr(data[n-1], shamt, buffer);
}``````

These version only works for shift-amounts <XLEN. But functions supporting other kinds of shift operations, or shifts XLEN can easily be built with `fsl` and `fsr`.

### 4.3.2 Parsing bit-streams of 27-bit words

The following function parses `n` 27-bit words from a packed array of XLEN words:

``````  void parse_27bit(uint_xlen_t *idata, uint_xlen_t *odata, int n)
{
uint_xlen_t lower = 0, upper = 0;
int reserve = 0;

while (n--) {
if (reserve < 27) {
uint_xlen_t buf = *(idata++);
lower |= sll(buf, reserve);
upper = reserve ? srl(buf, -reserve) : 0;
reserve += XLEN;
}
*(odata++) = lower & ((1 << 27)-1);
lower = fsr(lower, 27, upper);
upper = srl(upper, 27);
reserve -= 27;
}
}``````

And here the same thing in RISC-V assembler:

``````  parse_27bit:
li t1, 0              ; lower
li t2, 0              ; upper
li t3, 0              ; reserve
li t4, 27             ; shamt
slo t5, zero, t4      ; mask
beqz a2, endloop      ; while (n--)
loop:
bge t3, t4, output       ; if (reserve < 27)
lw t6, 0(a0)                 ; buf = *(idata++)
sll t7, t6, t3               ; lower |= sll(buf, reserve)
or t1, t1, t7
sub t7, zero, t3             ; upper = reserve ? srl(buf, -reserve) : 0
srl t7, t6, t7
cmov t2, t3, t7, zero
addi t3, t3, 32              ; reserve += XLEN;
output:
and t6, t1, t5           ; *(odata++) = lower & ((1 << 27)-1)
sw t6, 0(a1)
fsr t1, t1, t2, t4       ; lower = fsr(lower, 27, upper)
srl t2, t2, t4           ; upper = srl(upper, 27)
sub t3, t3, t4           ; reserve -= 27
bnez a2, loop         ; while (n--)
endloop:
ret``````

A loop iteration without fetch is 9 instructions long, and a loop iteration with fetch is 17 instructions long.

Without ternary operators that would be 13 instructions and 22 instructions, i.e. assuming one cycle per instruction, that function would be about 30% slower without ternary instructions.

### 4.3.3 Parsing bit-streams of 6-bit words

The following code accepts three 64-bit words in `t0`, `t1`, and `t2` containing a bit-stream of 32 6-bit words, and outputs these 32 values in the bytes of `t0`, `t1`, `t2`, and `t3`.

``````    addi a0, zer0, 0x3f
orc8 a0, a0          // a0 = 0x3f3f3f3f3f3f3f3f

srli t3, t2, 16
bdep t3, t3, a0
fsli t2, t2, t1, 32
bdep t2, t2, a0
fsli t1, t1, t0, 16
bdep t1, t1, a0
bdep t0, t0, a0``````

That’s 7 instructions without the two instructions for constructing the mask in `a0`.

Without funnel shift this operation requires 11 instructions:

``````    addi a0, zer0, 0x3f
orc8 a0, a0          // a0 = 0x3f3f3f3f3f3f3f3f

srli t3, t2, 16
bdep t3, t3, a0
slli t2, t2, 32
srli a1, t1, 32
or t2, t2, a1
bdep t2, t2, a0
slli t1, t1, 16
srli a1, t0, 48
or t1, t1, a1
bdep t1, t1, a0
bdep t0, t0, a0``````

Sign-extending the 6-bit values in the bytes of `t0`, `t1`, `t2`, and `t3` with bit-matrix multiply:

``````    li a0, 0x80c0e01008040201

bmator t0, t0, a0
bmator t1, t1, a0
bmator t2, t2, a0
bmator t3, t3, a0``````

Or without bit-matrix multiply:

``````    addi a0, zer0, 0x60
orc8 a0, a0          // a0 = 0x6060606060606060

xor t0, t0, a0
xor t1, t1, a0
xor t2, t2, a0
xor t3, t3, a0``````

### 4.3.4 Fixed-point multiply

A fixed-point multiply is simply an integer multiply, followed by a right shift. If the entire dynamic range of XLEN bits should be useable for the factors, then the product before shift must be 2*XLEN wide. Therefore ` mul`+`mulh` is needed for the multiplication, and funnel shift instructions can help with the final right shift. For fixed-point numbers with N fraction bits:

``````  mul_fracN:
mulh a2, a0, a1
mul a0, a0, a1
fsri a0, a0, a2, N
ret``````

## 4.4 Arbitrary bit permutations

This section lists code snippets for computing arbitrary bit permutations that are defined by data (as opposed to bit permutations that are known at compile time and can likely be compiled into shift-and-mask operations and/or a few instances of bext/bdep).

### 4.4.1 Using butterfly operations

The following macro performs a stage-`N` butterfly operation on the word in `a0` using the mask in `a1`.

``````  grevi a2, a0, (1 << N)
cmix a0, a1, a2, a0``````

The bitmask in `a1` must be preformatted correctly for the selected butterfly stage. A butterfly operation only has a XLEN/2 wide control word. The following macros format the mask assuming those XLEN/2 bits in the lower half of `a1` on entry:

``````bfly_msk_0:
pack a1, a1, a1
zip a1, a1

bfly_msk_1:
pack a1, a1, a1
zip2 a1, a1

bfly_msk_2:
pack a1, a1, a1
zip4 a1, a1

...``````

A sequence of 2 ⋅ log2(XLEN) − 1 butterfly operations can perform any arbitrary bit permutation (Beneš network):

``````  butterfly(LOG2_XLEN-1)
butterfly(LOG2_XLEN-2)
...
butterfly(0)
...
butterfly(LOG2_XLEN-2)
butterfly(LOG2_XLEN-1)``````

Many permutations arising from real-world applications can be implemented using shorter sequences. For example, any sheep-and-goats operation (SAG, see section 1.4.4) with either the sheep or the goats bit reversed can be implemented in log2(XLEN) butterfly operations.

Reversing a permutation implemented using butterfly operations is as simple as reversing the order of butterfly operations.

### 4.4.2 Using omega-flip networks

The omega operation is a stage-0 butterfly preceded by a zip operation:

``````  zip a0, a0
grevi a2, a0, 1
cmix a0, a1, a2, a0``````

The flip operation is a stage-0 butterfly followed by an unzip operation:

``````  grevi a2, a0, 1
cmix a0, a1, a2, a0
unzip a0, a0``````

A sequence of log2(XLEN) omega operations followed by log2(XLEN) flip operations can implement any arbitrary 32 bit permutation.

As for butterfly networks, permutations arising from real-world applications can often be implemented using a shorter sequence.

### 4.4.3 Using baseline networks

Another way of implementing arbitrary 32 bit permutations is using a baseline network followed by an inverse baseline network.

A baseline network is a sequence of log2(XLEN) butterfly(0) operations interleaved with unzip operations. For example, a 32-bit baseline network:

``````  butterfly(0)
unzip
butterfly(0)
unzip.h
butterfly(0)
unzip.b
butterfly(0)
unzip.n
butterfly(0)``````

An inverse baseline network is a sequence of log2(XLEN) butterfly(0) operations interleaved with zip operations. The order is opposite to the order in a baseline network. For example, a 32-bit inverse baseline network:

``````  butterfly(0)
zip.n
butterfly(0)
zip.b
butterfly(0)
zip.h
butterfly(0)
zip
butterfly(0)``````

A baseline network followed by an inverse baseline network can implement any arbitrary bit permutation.

### 4.4.4 Using sheep-and-goats

The Sheep-and-goats (SAG) operation is a common operation for bit permutations. It moves all the bits selected by a mask (goats) to the LSB end of the word and all the remaining bits (sheep) to the MSB end of the word, without changing the order of sheep or goats.

The SAG operation can easily be performed using `bext` (data in `a0` and mask in `a1`):

``````  bext a2, a0, a1
not a1, a1
bext a0, a0, a1
pcnt a1, a1
ror a0, a0, a1
or a0, a0, a2``````

Any arbitrary bit permutation can be implemented in log2(XLEN) SAG operations.

The Hacker’s Delight describes an optimized standard C implementation of the SAG operation. Their algorithm takes 254 instructions (for 32 bit) or 340 instructions (for 64 bit) on their reference RISC instruction set. [Seander05]

### 4.4.5 Using bit-matrix multiply

`bat[x]or` performs a permutation of bits within each byte when used with a permutation matrix in `rs2`, and performs a permutation of bytes when used with a permutation matrix in `rs1`.

## 4.5 Mirroring and rotating bitboards

Bitboards are 64-bit bitmasks that are used to represent part of the game state in chess engines (and other board game AIs). The bits in the bitmask correspond to squares on a 8 × 8 chess board:

`````` 56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8  9 10 11 12 13 14 15
0  1  2  3  4  5  6  7``````

Many bitboard operations are simple straight-forward operations such as bitwise-AND, but mirroring and rotating bitboards can take up to 20 instructions on x86.

### 4.5.1 Mirroring bitboards

Flipping horizontally or vertically can easily done with `grevi`:

``````Flip horizontal:
63 62 61 60 59 58 57 56    RISC-V Bitmanip:
55 54 53 52 51 50 49 48       rev.b
47 46 45 44 43 42 41 40
39 38 37 36 35 34 33 32
31 30 29 28 27 26 25 24    x86:
23 22 21 20 19 18 17 16       13 operations
15 14 13 12 11 10  9  8
7  6  5  4  3  2  1  0

Flip vertical:
0  1  2  3  4  5  6  7    RISC-V Bitmanip:
8  9 10 11 12 13 14 15       rev8
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39    x86:
40 41 42 43 44 45 46 47       bswap
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63``````

Rotating by 180 (flip horizontal and vertical):

``````Rotate 180:
7  6  5  4  3  2  1  0    RISC-V Bitmanip:
15 14 13 12 11 10  9  8       rev
23 22 21 20 19 18 17 16
31 30 29 28 27 26 25 24
39 38 37 36 35 34 33 32    x86:
47 46 45 44 43 42 41 40       14 operations
55 54 53 52 51 50 49 48
63 62 61 60 59 58 57 56``````

### 4.5.2 Rotating bitboards

Using `zip` a bitboard can be transposed easily: [transposebitboard]

``````Transpose:
7 15 23 31 39 47 55 63    RISC-V Bitmanip:
6 14 22 30 38 46 54 62       zip, zip, zip
5 13 21 29 37 45 53 61
4 12 20 28 36 44 52 60
3 11 19 27 35 43 51 59    x86:
2 10 18 26 34 42 50 58       18 operations
1  9 17 25 33 41 49 57
0  8 16 24 32 40 48 56``````

A rotation is simply the composition of a flip operation and a transpose operation. This takes 19 operations on x86 [ChessProg]. With Bitmanip the rotate operation only takes 4 operations:

``````rotate_bitboard:
rev8 a0, a0
zip a0, a0
zip a0, a0
zip a0, a0``````

### 4.5.3 Explanation

The bit indices for a 64-bit word are 6 bits wide. Let `i[5:0]` be the index of a bit in the input, and let `i``[5:0]` be the index of the same bit after the permutation.

As an example, a rotate left shift by N can be expressed using this notation as `i``[5:0]` = `i[5:0]` + N   (mod 64).

The GREV operation with shamt N is `i``[5:0]` = `i[5:0]` XOR N.

And a SHFL operation corresponds to a rotate left shift by one position of any contiguous region of `i[5:0]`. For example, `zip` is a left rotate shift of the entire bit index:

`i``[5:0]` = {`i[4:0]`, `i[5]`}

And `zip4` performs a left rotate shift on bits `5:2`:

`i``[5:0]` = {`i[4:2]`, `i[5]`, `i[1:0]`}

In a bitboard, `i[2:0]` corresponds to the X coordinate of a board position, and `i[5:3]` corresponds to the Y coordinate.

Therefore flipping the board horizontally is the same as negating bits `i[2:0]`, which is the operation performed by `grevi rd, rs, 7` (`rev.b`).

Likewise flipping the board vertically is done by `grevi rd, rs, 56` (`rev8`).

Finally, transposing corresponds by swapping the lower and upper half of `i[5:0]`, or rotate shifting `i[5:0]` by 3 positions. This can easily done by rotate shifting the entire `i[5:0]` by one bit position (`zip`) three times.

### 4.5.4 Rotating Bitcubes

Let’s define a bitcube as a 4 × 4 × 4 cube with x = `i[1:0]`, y = `i[3:2]`, and z = `i[5:4]`. Using the same methods as described above we can easily rotate a bitcube by 90 around the X-, Y-, and Z-axis:

3

``````rotate_x:
rev16 a0, a0
zip4 a0, a0
zip4 a0, a0``````
``````rotate_y:
rev.n a0, a0
zip a0, a0
zip a0, a0
zip4 a0, a0
zip4 a0, a0``````
``````rotate_z:
rev4.h
zip.h a0, a0
zip.h a0, a0``````

## 4.6 Manipulating 64x64 Bit Matrices

The `bmat[x]or` and `bmatflip` instructions operate on 8x8 bit matrices stored in single 64-bit registers, where each byte of such a 64-bit value represents one row (column) of a 8x8 bit matrix.

Let’s assume we have a 64x64 bit matrix in memory, stored as one row (column) per 64-bit value. In order to use `bmat[x]or` and `bmatflip` on such a matrix, we must first convert it into a 8x8 block matrix of 64 individual 8x8 matrices, each stored in a 64-bit value. The following function performs this transformation for a single row (column) of the block matrix in 40 instructions.

``````void conv8x8(const uint64_t x[8], uint64_t y[8])
{
uint64_t x0_x1_31_00 = _rv64_pack (x[0], x[1]);
uint64_t x2_x3_31_00 = _rv64_pack (x[2], x[3]);
uint64_t x4_x5_31_00 = _rv64_pack (x[4], x[5]);
uint64_t x6_x7_31_00 = _rv64_pack (x[6], x[7]);
uint64_t x0_x1_63_32 = _rv64_packu(x[0], x[1]);
uint64_t x2_x3_63_32 = _rv64_packu(x[2], x[3]);
uint64_t x4_x5_63_32 = _rv64_packu(x[4], x[5]);
uint64_t x6_x7_63_32 = _rv64_packu(x[6], x[7]);``````
``````  uint64_t x0_x1_31_00_z = _rv64_unzip16(x0_x1_31_00);
uint64_t x2_x3_31_00_z = _rv64_unzip16(x2_x3_31_00);
uint64_t x4_x5_31_00_z = _rv64_unzip16(x4_x5_31_00);
uint64_t x6_x7_31_00_z = _rv64_unzip16(x6_x7_31_00);
uint64_t x0_x1_63_32_z = _rv64_unzip16(x0_x1_63_32);
uint64_t x2_x3_63_32_z = _rv64_unzip16(x2_x3_63_32);
uint64_t x4_x5_63_32_z = _rv64_unzip16(x4_x5_63_32);
uint64_t x6_x7_63_32_z = _rv64_unzip16(x6_x7_63_32);``````
``````  uint64_t x0_x1_x2_x3_15_00 = _rv64_pack (x0_x1_31_00_z, x2_x3_31_00_z);
uint64_t x4_x5_x6_x7_15_00 = _rv64_pack (x4_x5_31_00_z, x6_x7_31_00_z);
uint64_t x0_x1_x2_x3_31_16 = _rv64_packu(x0_x1_31_00_z, x2_x3_31_00_z);
uint64_t x4_x5_x6_x7_31_16 = _rv64_packu(x4_x5_31_00_z, x6_x7_31_00_z);
uint64_t x0_x1_x2_x3_47_32 = _rv64_pack (x0_x1_63_32_z, x2_x3_63_32_z);
uint64_t x4_x5_x6_x7_47_32 = _rv64_pack (x4_x5_63_32_z, x6_x7_63_32_z);
uint64_t x0_x1_x2_x3_63_48 = _rv64_packu(x0_x1_63_32_z, x2_x3_63_32_z);
uint64_t x4_x5_x6_x7_63_48 = _rv64_packu(x4_x5_63_32_z, x6_x7_63_32_z);``````
``````  uint64_t x0_x1_x2_x3_15_00_z = _rv64_unzip8(x0_x1_x2_x3_15_00);
uint64_t x4_x5_x6_x7_15_00_z = _rv64_unzip8(x4_x5_x6_x7_15_00);
uint64_t x0_x1_x2_x3_31_16_z = _rv64_unzip8(x0_x1_x2_x3_31_16);
uint64_t x4_x5_x6_x7_31_16_z = _rv64_unzip8(x4_x5_x6_x7_31_16);
uint64_t x0_x1_x2_x3_47_32_z = _rv64_unzip8(x0_x1_x2_x3_47_32);
uint64_t x4_x5_x6_x7_47_32_z = _rv64_unzip8(x4_x5_x6_x7_47_32);
uint64_t x0_x1_x2_x3_63_48_z = _rv64_unzip8(x0_x1_x2_x3_63_48);
uint64_t x4_x5_x6_x7_63_48_z = _rv64_unzip8(x4_x5_x6_x7_63_48);``````
``````  y[0] = _rv64_pack (x0_x1_x2_x3_15_00_z, x4_x5_x6_x7_15_00_z);
y[1] = _rv64_packu(x0_x1_x2_x3_15_00_z, x4_x5_x6_x7_15_00_z);
y[2] = _rv64_pack (x0_x1_x2_x3_31_16_z, x4_x5_x6_x7_31_16_z);
y[3] = _rv64_packu(x0_x1_x2_x3_31_16_z, x4_x5_x6_x7_31_16_z);
y[4] = _rv64_pack (x0_x1_x2_x3_47_32_z, x4_x5_x6_x7_47_32_z);
y[5] = _rv64_packu(x0_x1_x2_x3_47_32_z, x4_x5_x6_x7_47_32_z);
y[6] = _rv64_pack (x0_x1_x2_x3_63_48_z, x4_x5_x6_x7_63_48_z);
y[7] = _rv64_packu(x0_x1_x2_x3_63_48_z, x4_x5_x6_x7_63_48_z);
}``````

Each of the 5 blocks in this function only consumes the eight outputs of the previous block. Therefore 16 registers are sufficient to run this function in registers only without the need to spill any data on the stack.

Note that this function is its own inverse. Therefore the same function can be used for the convertion from block matrix form back to row (column) major form.

A bit 64x64 bit matrix in block matrix form can easily be transposed by running `bmatflip` (or `zip; zip; zip`) on the blocks of the matrix and then renaming the individual 64-bit variables.

To multiply 64x64 bit matrices in block matrix form, the matrix-matrix-product is decomposed in the obvious way in 8 × 8 × 8 = 512 `bmat[x]or` instructions and 7 × 8 × 8 = 448 `[x]or` instructions.

## 4.7 Inverting Xorshift RNGs

Xorshift RNGs are a class of fast RNGs for different bit widths. There are 648 Xorshift RNGs for 32 bits, but this is the one that the author of the original Xorshift RNG paper recommends. [Xorshift]

``````  uint32_t xorshift32(uint32_t x)
{
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
return x;
}``````

This function of course has been designed and selected so it’s efficient, even without special bit-manipulation instructions. So let’s look at the inverse instead. First, the naïve form of inverting this function:

``````  uint32_t xorshift32_inv(uint32_t x)
{
uint32_t t;
t = x ^ (x << 5);
t = x ^ (t << 5);
t = x ^ (t << 5);
t = x ^ (t << 5);
t = x ^ (t << 5);
x = x ^ (t << 5);
x = x ^ (x >> 17);
t = x ^ (x << 13);
x = x ^ (t << 13);
return x;
}``````

This translates to 18 RISC-V instructions, not including the function call overhead.

Obviously the C expression `x ^ (x >> 17)` is already its own inverse (because 17 ≥ XLEN/2) and therefore already has an effecient inverse. But the two other blocks can easily be implemented using a single `clmul` instruction each:

``````  uint32_t xorshift32_inv(uint32_t x)
{
x = clmul(x, 0x42108421);
x = x ^ (x >> 17);
x = clmul(x, 0x04002001);
return x;
}``````

This are 8 RISC-V instructions, including 4 instructions for loading the constants, but not including the function call overhead.

An optimizing compiler could easily generate the clmul instructions and the magic constants from the C code for the naïve implementation. (`0x04002001 = (1 << 2*13) | (1 << 13) | 1` and `0x42108421 = (1 << 6*5) | (1 << 5*5) | …| (1 << 5) | 1`)

The obvious remaining question is “if `clmul(x, 0x42108421)` is the inverse of `x ^ (x << 5)`, what’s the inverse of `x ^ (x >> 5)`?” It’s `clmulr(x, 0x84210842)`, where `0x84210842` is the bit-reversal of `0x42108421`.

A special case of xorshift is `x ^ (x >> 1)`, which is a gray encoder. The corresponding gray decoder is `clmulr(x, 0xffffffff)`.

## 4.8 Cyclic redundency checks (CRC)

There are special instructions for performing CRCs using the two most widespread 32-bit CRC polynomials, CRC-32 and CRC-32C.

CRCs with other polynomials can be computed efficiently using CLMUL. The following examples are using CRC32Q.

The easiest way of implementing CRC32Q with clmul is using a Barrett reduction. On RV32:

``````  uint32_t crc32q_simple(const uint32_t *data, int length)
{
uint32_t P  = 0x814141AB;  // CRC polynomial (implicit x^32)
uint32_t mu = 0xFEFF7F62;  // x^64 divided by CRC polynomial
uint32_t mu1 = 0xFF7FBFB1; // "mu" with leading 1, shifted right by 1 bit
uint32_t crc = 0;

for (int i = 0; i < length; i++) {
crc ^= rev8(data[i]);
crc = clmulr(crc, mu1);
crc = clmul(crc, P);
}

return crc;
}``````

The following python code calculates the value of `mu` for a given CRC polynomial:

``````  def polydiv(dividend, divisor):
quotient = 0
while dividend.bit_length() >= divisor.bit_length():
i = dividend.bit_length() - divisor.bit_length()
dividend = dividend ^ (divisor << i)
quotient |= 1 << i
return quotient

P = 0x1814141AB
print("0x%X" % (polydiv(1<<64, P)))   # prints 0x1FEFF7F62``````

A more efficient method would be the following, which processes 64-bit at a time (RV64):

``````  uint32_t crc32q_fast(const uint64_t *p, int len)
{
uint64_t P  = 0x1814141ABLL;   // CRC polynomial
uint64_t k1 =  0xA1FA6BECLL;   // remainder of x^128 divided by CRC polynomial
uint64_t k2 =  0x9BE9878FLL;   // remainder of x^96 divided by CRC polynomial
uint64_t k3 =  0xB1EFC5F6LL;   // remainder of x^64 divided by CRC polynomial
uint64_t mu = 0x1FEFF7F62LL;   // x^64 divided by CRC polynomial

uint64_t a0, a1, a2, t1, t2;

assert(len >= 2);
a0 = rev8(p[0]);
a1 = rev8(p[1]);``````
``````    // Main loop: Reduce to 2x 64 bits

for (const uint64_t *t0 = p+2; t0 != p+len; t0++)
{
a2 = rev8(*t0);
t1 = clmulh(a0, k1);
t2 = clmul(a0, k1);
a0 = a1 ^ t1;
a1 = a2 ^ t2;
}``````
``````    // Reduce to 64 bit, add 32 bit zero padding

t1 = clmulh(a0, k2);
t2 = clmul(a0, k2);

a0 = (a1 >> 32) ^ t1;
a1 = (a1 << 32) ^ t2;

t2 = clmul(a0, k3);
a1 = a1 ^ t2;``````
``````    // Barrett Reduction

t1 = clmul(a1 >> 32, mu);
t2 = clmul(t1 >> 32, P);
a0 = a1 ^ t2;

return a0;
}``````

The main idea is to transform an array of arbitrary length to an array with the same CRC that’s only two 64-bit elements long. (That’s the “Main loop” portion of above code.)

Then we further reduce it to just 64-bit. And then we use a Barrett reduction to get the final 32-bit result.

The following python code can be used to calculate the “magic constants” ` k1`, `k2`, and `k3`:

``````  def polymod(dividend, divisor):
quotient = 0
while dividend.bit_length() >= divisor.bit_length():
i = dividend.bit_length() - divisor.bit_length()
dividend = dividend ^ (divisor << i)
quotient |= 1 << i
return dividend

print("0x%X" % (polymod(1<<128, P)))   # prints 0xA1FA6BEC
print("0x%X" % (polymod(1<< 96, P)))   # prints 0x9BE9878F
print("0x%X" % (polymod(1<< 64, P)))   # prints 0xB1EFC5F6``````

The above example code is taken from [Wolf18A]. A more detailed description of the algorithms employed can be found in [FastCRC].

## 4.9 Decoding RISC-V Immediates

The following code snippets decode and sign-extend the immediate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice “nothing up my sleeve”-examples for real-world bit permutations.

2

``````  decode_s:
li t0, 0xfe000f80
bext a0, a0, t0
c.slli a0, 20
c.srai a0, 20
ret``````
``````  decode_b:
li t0, 0xeaa800aa
rori a0, a0, 8
grevi a0, a0, 8
shfli a0, a0, 7
bext a0, a0, t0
c.slli a0, 20
c.srai a0, 19
ret``````
``````  decode_j:
li t0, 0x800003ff
li t1, 0x800ff000
bext a1, a0, t1
c.slli a1, 23
rori a0, a0, 21
bext a0, a0, t0
c.slli a0, 12
c.or a0, a1
c.srai a0, 11
ret``````
``````  // variant 1 (with RISC-V Bitmanip)
decode_cj:
li t0, 0x28800001
li t1, 0x000016b8
li t2, 0xb4e00000
li t3, 0x4b000000
bext a1, a0, t1
bdep a1, a1, t2
rori a0, a0, 11
bext a0, a0, t0
bdep a0, a0, t3
c.or a0, a1
c.srai a0, 20
ret``````
``````  // variant 2 (without RISC-V Bitmanip)
decode_cj:
srli a5, a0, 2
srli a4, a0, 7
c.andi a4, 16
slli a3, a0, 3
c.andi a5, 14
andi a3, a3, 32
srli a4, a0, 1
andi a4, a4, 64
slli a2, a0, 1
andi a2, a2, 128
srli a3, a0, 1
slli a4, a0, 19
andi a3, a3, 768
c.slli a0, 2
andi a0, a0, 1024
c.srai a4, 31
slli a0, a4, 11
ret``````
``````  // variant 3 (with RISC-V Zbp only)
decode_cj:
shfli   a0, a0, 15
rori    a0, a0, 28
shfli   a0, a0, 2
shfli   a0, a0, 14
rori    a0, a0, 26
shfli   a0, a0, 8
rori    a0, a0, 10
unshfli a0, a0, 12
rori    a0, a0, 18
unshfli a0, a0, 14
rori    a0, a0, 28
shfli   a0, a0, 6
rori    a0, a0, 28
unshfli a0, a0, 15
slli    a0, a0, 21
srai    a0, a0, 20
ret``````