21 “V” Standard Extension for Vector Operations, Version 0.2
This chapter presents a proposal for the RISC-V vector instruction set extension. The vector extension supports a configurable vector unit, to trade off the number of architectural vector registers and supported element widths against available maximum vector length. The vector extension is designed to allow the same binary code to work efficiently across a variety of hardware implementations varying in physical vector storage capacity and datapath parallelism.
21.1 Vector Unit State
The additional vector unit architectural state consists of 32 vector
data registers (v0–v31), 8 vector predicate registers
(vp0–vp7), and an XLEN-bit WARL vector length CSR, vl. In addition, the current configuration of the vector unit is
held in a set of vector configuration CSRs (vcmaxw, vctype,
vcnpred), as described below. The implementation determines an
available maximum vector length (MVL) for the current
configuration held in the vcmaxw and vcnpred registers.
There is also a 3-bit fixed-point rounding mode CSR vxrm, and a
single-bit fixed-point saturation status CSR vxsat.
| CSR name | Number | Base ISA |
|---|---|---|
| vl | 0x020 | RV32, RV64, RV128 |
| vxrm | 0x020 | RV32, RV64, RV128 |
| vxsat | 0x020 | RV32, RV64, RV128 |
| vcsr | 0x020 | RV32, RV64, RV128 |
| vcnpred | 0x020 | RV32, RV64, RV128 |
| vcmaxw | 0x020 | RV32, RV64, RV128 |
| vcmaxw1 | 0x020 | RV32 |
| vcmaxw2 | 0x020 | RV32, RV64 |
| vcmaxw3 | 0x020 | RV32 |
| vctype | 0x020 | RV32, RV64, RV128 |
| vctype1 | 0x020 | RV32 |
| vctype2 | 0x020 | RV32, RV64 |
| vctype3 | 0x020 | RV32 |
| vctypev0 | 0x020 | RV32, RV64, RV128 |
| vctypev1 | 0x020 | RV32, RV64, RV128 |
| ... | | |
| vctypev31 | 0x020 | RV32, RV64, RV128 |
21.2 Element Datatypes and Width
The datatypes and operations supported by the V extension depend upon
the base scalar ISA and supported extensions, and may include 8-bit,
16-bit, 32-bit, 64-bit, and 128-bit integer and fixed-point data types
(X8, X16, X32, X64, and X128 respectively), and 16-bit, 32-bit,
64-bit, and 128-bit floating-point types (F16, F32, F64, and F128
respectively). When the V extension is added, it must support the
vector data element types implied by the supported scalar types as
defined by Table 1.1. The largest element width
supported:
$$ELEN = \max(XLEN, FLEN)$$
Compiler support for vectorization is greatly simplified when any hardware-supported data types are supported by both scalar and vector instructions.
Adding the vector extension to any machine with floating-point support
adds support for the IEEE standard half-precision 16-bit
floating-point data type. This includes a set of scalar
half-precision instructions described in
Section [sec:scalarhalffloat]. The scalar half-precision
instructions follow the template for other floating-point precisions,
but using the hitherto unused fmt field encoding of 10.
We only support scalar half-precision floating-point types as part of the vector extension, as the main benefits of half-precision are obtained when using vector instructions that amortize per-operation control overhead. Not supporting a separate scalar half-precision floating-point extension also reduces the number of standard instruction-set variants.
21.3 Vector Configuration Registers (vcmaxw, vctype, vcnpred)
The vector unit must be configured before use. Each architectural
vector data register (v0–v31) is configured with the
maximum number of bits allowed in each element of that vector data
register, or can be disabled to free physical vector storage for other
architectural vector data registers. The number of available
vector predicate registers can also be set independently.
The available MVL depends on the configuration setting, but MVL must always have the same value for the same configuration parameters on a given implementation. Implementations must provide an MVL of at least four elements for all supported configuration settings.
Each vector data register’s current maximum-width is held in a
separate four-bit field in the vcmaxw CSRs, encoded as shown in
Table [tab:vcmaxw].
| Width | Encoding |
|---|---|
| Disabled | 0000 |
| 8 | 1000 |
| 16 | 1001 |
| 32 | 1010 |
| 64 | 1011 |
| 128 | 1100 |
Several earlier vector machines had the ability to configure physical vector register storage into a larger number of short vectors or a smaller number of long vectors, in particular the Fujitsu VP series [vp200].
In addition, each vector data register has an associated dynamic type
field that is held in a four-bit field in the vctype CSRs,
encoded as shown in Table [tab:vctype]. The dynamic type field of
a vector data register is constrained to only hold types that have
equal or lesser width than the value in the corresponding vcmaxw
field for that vector data register. Changes to vctype do not
alter MVL.
| Type | vctype encoding | vcmaxw equivalent |
|---|---|---|
| Disabled | 0000 | 0000 |
| F16 | 0001 | 1001 |
| F32 | 0010 | 1010 |
| F64 | 0011 | 1011 |
| F128 | 0100 | 1100 |
| X8 | 1000 | 1000 |
| X16 | 1001 | 1001 |
| X32 | 1010 | 1010 |
| X64 | 1011 | 1011 |
| X128 | 1100 | 1100 |
Vector data registers have both a maximum element width and a current element data type to support vector function calls, where the caller does not know the types needed by the callee, as described below.
To reduce configuration time, writes to a vcmaxw field also
write the corresponding vctype field. The vcmaxw field
can be written any value taken from the type encoding in
Table [tab:vctype], but only the width information as shown in
Table [tab:vcmaxw] will be recorded in the vcmaxw fields
whereas the full type information will be recorded in the
corresponding vctype field.
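The split between the two fields can be sketched as a table lookup. This is a minimal illustration, assuming an implementation that records exactly the requested width; the helper name `write_vcmaxw_field` is hypothetical, and the encodings are copied from Table [tab:vctype].

```python
# Encodings copied from the vctype table: name -> (vctype, vcmaxw equivalent).
VCTYPE_TABLE = {
    "Disabled": (0b0000, 0b0000),
    "F16": (0b0001, 0b1001),
    "F32": (0b0010, 0b1010),
    "F64": (0b0011, 0b1011),
    "F128": (0b0100, 0b1100),
    "X8": (0b1000, 0b1000),
    "X16": (0b1001, 0b1001),
    "X32": (0b1010, 0b1010),
    "X64": (0b1011, 0b1011),
    "X128": (0b1100, 0b1100),
}

# Map each 4-bit type encoding to the width-only value kept in vcmaxw.
TYPE_TO_MAXW = {vctype: maxw for vctype, maxw in VCTYPE_TABLE.values()}


def write_vcmaxw_field(value):
    """Model writing a type encoding to one vcmaxw field (hypothetical helper).

    Returns (recorded vcmaxw field, recorded vctype field).
    """
    if value not in TYPE_TO_MAXW:
        raise ValueError(f"not a type encoding from the table: {value:#06b}")
    return TYPE_TO_MAXW[value], value


maxw, vtype = write_vcmaxw_field(VCTYPE_TABLE["F32"][0])
print(f"vcmaxw={maxw:04b} vctype={vtype:04b}")
```

Writing the F32 encoding records the 32-bit width in vcmaxw and the full F32 type in vctype, while integer encodings are recorded unchanged in both fields.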
Attempting to write any vcmaxw field with a width larger than
that supported by the implementation will raise an illegal instruction
exception. Implementations are allowed to record a vcmaxw value
larger than the value requested. In particular, an implementation may
choose to hardwire vcmaxw fields to the largest supported width.
Attempting to write an unsupported type or a type that requires more
than the current vcmaxw width to a vctype field will raise
an exception.
Any write to a field in the vcmaxw register configures the
vector unit and causes all vector data registers to be zeroed and all
vector predicate registers to be set, and the vector length register
vl to be set to the maximum supported vector length.
Any write to a vctype field zeros only the associated vector
data register, leaving the other vector unit state undisturbed.
Attempting to write a type needing more bits than the corresponding
vcmaxw value to a vctype field will raise an illegal
instruction exception.
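The width check on a vctype write can be sketched as follows. This is an illustrative model only: `IllegalInstruction` and `write_vctype_field` are hypothetical names, and the sketch uses the same exception for an unsupported encoding, although the text above does not state which exception that case raises.

```python
class IllegalInstruction(Exception):
    """Hypothetical stand-in for an illegal instruction exception."""


# Width in bits for each vcmaxw encoding (from the vcmaxw table).
MAXW_BITS = {0b0000: 0, 0b1000: 8, 0b1001: 16, 0b1010: 32, 0b1011: 64, 0b1100: 128}

# Width in bits needed by each vctype encoding (from the vctype table).
TYPE_BITS = {
    0b0000: 0,                                            # Disabled
    0b0001: 16, 0b0010: 32, 0b0011: 64, 0b0100: 128,      # F16..F128
    0b1000: 8, 0b1001: 16, 0b1010: 32, 0b1011: 64, 0b1100: 128,  # X8..X128
}


def write_vctype_field(new_type, vcmaxw_field):
    """Return the new vctype field, or raise if the type does not fit."""
    if new_type not in TYPE_BITS:
        # Assumption: modelled with the same exception class for simplicity.
        raise IllegalInstruction("unsupported type encoding")
    if TYPE_BITS[new_type] > MAXW_BITS[vcmaxw_field]:
        raise IllegalInstruction("type is wider than the vcmaxw field allows")
    return new_type


# F16 fits in a register configured for 32-bit elements.
print(write_vctype_field(0b0001, 0b1010))
```

A register configured with a 32-bit maximum width accepts F16, X8, X16, X32, and F32, but rejects F64 or X64.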
Vector registers are zeroed on reconfiguration to prevent security holes and to avoid exposing differences between how different implementations manage physical vector register storage.
In-order implementations will probably use a flag bit per register to mux in 0 instead of garbage values on each source until it is overwritten. For in-order machines, partial writes due to predication or vector lengths less than MVL complicate this zeroing, but these cases can be handled by adopting a hardware read-modify-write, adding a zero bit per element, or trapping to a machine-mode trap handler if the first write access after configuration is partial. Out-of-order machines can just point the initial rename table at a physical zero register.
In RV128, vcmaxw is a single CSR holding 32 4-bit width
fields. Bits (4N + 3)–(4N) hold the maximum width of vector data
register N. In RV64, the vcmaxw2 CSR provides access to the
upper 64 bits of vcmaxw. In RV32, the vcmaxw1 CSR
provides access to bits 63–32 of vcmaxw, the vcmaxw2 CSR
to bits 95–64, and the vcmaxw3 CSR to bits 127–96.
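The field layout can be sketched by packing the 32 fields into one 128-bit integer and slicing XLEN-bit views out of it. The helper names are hypothetical. The sketch assumes that a CSR with numeric suffix K starts at bit 32K of vcmaxw, which matches the cases stated above; the RV32 view of vcmaxw2 is inferred from the CSR table rather than stated in the text.

```python
def pack_vcmaxw(fields):
    """Pack 32 four-bit fields; field N occupies bits 4N+3..4N."""
    if len(fields) != 32:
        raise ValueError("expected one field per vector data register")
    value = 0
    for n, field in enumerate(fields):
        if not 0 <= field <= 0xF:
            raise ValueError(f"field {n} does not fit in four bits")
        value |= field << (4 * n)
    return value


def csr_view(vcmaxw, xlen, suffix=0):
    """Return the XLEN-bit window that starts at bit 32*suffix (assumed rule)."""
    return (vcmaxw >> (32 * suffix)) & ((1 << xlen) - 1)


# v0-v1 hold 8-bit elements, v2-v7 hold 32-bit elements, v31 holds 64-bit elements.
fields = [0b1000] * 2 + [0b1010] * 6 + [0b0000] * 23 + [0b1011]
packed = pack_vcmaxw(fields)

print(hex(csr_view(packed, 32, 0)))  # RV32 vcmaxw
print(hex(csr_view(packed, 32, 3)))  # RV32 vcmaxw3
print(hex(csr_view(packed, 64, 2)))  # RV64 vcmaxw2
```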
The vcnpred CSR contains a single 4-bit WLRL field giving the
number of enabled architectural predicate registers, between 0 and 8.
Any write to vcnpred zeros all vector data registers, sets all
bits in visible vector predicate registers, and sets the vector length
register vl to the maximum supported vector length. Attempting
to write a value larger than 8 to vcnpred raises an illegal
instruction exception.
21.4 Vector Length
The active vector length is held in the XLEN-bit WARL vector length
CSR vl, which can only hold values between 0 and MVL inclusive.
Any writes to the maximum configuration registers (vcmaxw or
vcnpred) cause vl to be initialized with MVL. Writes to
vctype do not affect vl.
The active vector length is usually written with the setvl
instruction, which is encoded as a csrrw instruction to the vl CSR number. The source argument to the csrrw is the
requested application vector length (AVL) as an unsigned XLEN-bit
integer. The setvl instruction calculates the value to assign to
vl according to Table [tab:vlcalc].
| AVL Value | vl setting |
|---|---|
| AVL ≥ 2 MVL | MVL |
| 2 MVL > AVL > MVL | ⌊AVL/2⌋ |
| MVL ≥ AVL | AVL |
The rules for setting the vl register help keep vector
pipelines full over the last two iterations of a stripmined loop.
Similar rules were previously used in Cray-designed machines [crayx1asm].
The result of this calculation is also returned as the result of the setvl instruction. Note that unlike a regular csrrw instruction, the
value written to integer register rd is not the original CSR value but
the modified value.
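The calculation in Table [tab:vlcalc] can be written out as a small function. This is a sketch of the table as printed here, with MVL passed in as a parameter because its value is implementation-defined; the function name mirrors the instruction but is otherwise just an illustrative helper.

```python
def setvl(avl, mvl):
    """Return the vl value chosen for a requested length avl, given MVL."""
    if avl >= 2 * mvl:
        return mvl          # plenty of work left: run a full-length vector
    if avl > mvl:
        return avl // 2     # split the tail roughly evenly over two iterations
    return avl              # everything that remains fits in one vector


for avl in (100, 13, 7, 0):
    print(avl, setvl(avl, mvl=8))
```

With MVL = 8 and 13 elements remaining, the rule selects 6 and then 7 elements for the final two iterations, instead of 8 followed by 5.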
The idea of having implementation-defined vector length dates back
to at least the IBM 3090 Vector Facility [ibm370varch], which
used a special “Load Vector Count and Update” (VLVCU) instruction
to control stripmine loops. The setvl instruction included
here is based on the simpler setvlr instruction introduced by
Asanović [krstephd].
The setvl instruction is typically used at the start of every
iteration of a stripmined loop to set the number of vector elements to
operate on in the following loop iteration. The current MVL can be
obtained by performing a setvl with a source argument that has
all bits set (largest unsigned integer).
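A stripmined loop built on this rule can be emulated in software. The sketch below is illustrative only: `setvl` restates the table from this section, `vector_add` and `stripmine_lengths` are hypothetical helpers, and MVL is supplied by the caller because it is implementation-defined.

```python
def setvl(avl, mvl):
    """vl selection rule from the table in this section."""
    if avl >= 2 * mvl:
        return mvl
    if avl > mvl:
        return avl // 2
    return avl


def stripmine_lengths(n, mvl):
    """Return the vl used by each iteration when processing n elements."""
    lengths = []
    remaining = n
    while remaining > 0:
        vl = setvl(remaining, mvl)   # setvl at the start of every iteration
        lengths.append(vl)
        remaining -= vl
    return lengths


def vector_add(a, b, mvl):
    """Element-wise add, processed vl elements at a time."""
    out = [0] * len(a)
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i, mvl)
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out


# Querying MVL: request the largest unsigned 64-bit value.
print(setvl((1 << 64) - 1, mvl=8))
print(stripmine_lengths(21, mvl=8))
```

The loop terminates because setvl returns a non-zero vl whenever the remaining count is non-zero and MVL is at least one.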
No element operations are performed for any vector instruction when
vl=0.
21.5 Rapid Configuration Instructions
It can take several instructions to set vcmaxw, vctype and
vcnpred to a given configuration. To accelerate configuring the
vector unit, specialized vcfg instructions are added that are
encoded as writes to CSRs with encoded immediate values that set
multiple fields in the vcmaxw, vctype, and vcnpred
configuration registers.
The vcfgd instruction is encoded as a CSRRW that takes a
register value encoded as shown in Figure 1.3, and which
returns the corresponding MVL in the destination register. A
corresponding vcfgdi instruction is encoded as a CSRRWI that
takes a 5-bit immediate value to set the configuration, and returns
MVL in the destination register.
One of the primary uses of vcfgdi is to configure the vector
unit with single-byte element vectors for use in memcpy and
memset routines. A single instruction can configure the
vector unit for these operations.
The vcfgd instruction also clears the vcnpred register, so
no predicate registers are allocated.
Figure 1.3: vcfgd value for different base ISAs, holding 5-bit vector register numbers for each supported type. Fields must either contain 0, indicating that no vector registers are allocated for that type, or a vector register number greater than all fields to the right. All vector register numbers in between two non-zero fields are allocated to the type with the higher vector register number.
The vcfgd value specifies how many vector registers of each
datatype are allocated, and is divided into 5-bit fields, one per
supported datatype.
Each 5-bit field in the vcfgd value must contain either zero,
indicating that no vector registers are allocated for that type, or a
vector register number greater than all fields in lower bit positions,
indicating the highest vector register containing the associated type.
This encoding can compactly represent any arbitrary allocation of
vector registers to data types, except that there must be at least two
vector registers (v0 and v1) allocated to the narrowest
required type. An example allocation is shown in
Figure 1.4.
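The field rules can be sketched as a decoder from a vcfgd value to a per-register type assignment. The order in which types are assigned to fields is defined by Figure 1.3 and is not reproduced here, so the sketch takes it as a caller-supplied list; the `EXAMPLE_ORDER` below is a hypothetical ordering chosen only for illustration, as is the function name.

```python
def decode_vcfgd(value, field_types):
    """Return a list of 32 type names, one per vector data register.

    field_types names the type of each 5-bit field, lowest bit position
    first (assumed ordering, supplied by the caller).
    """
    alloc = ["Disabled"] * 32
    prev = -1  # highest register number allocated so far
    for pos, type_name in enumerate(field_types):
        high = (value >> (5 * pos)) & 0x1F
        if high == 0:
            continue  # no registers of this type
        if high <= prev:
            raise ValueError("non-zero fields must increase with bit position")
        for reg in range(prev + 1, high + 1):
            alloc[reg] = type_name
        prev = high
    return alloc


# Hypothetical field order, for illustration only.
EXAMPLE_ORDER = ["X8", "X16", "X32", "F16", "F32"]

# v0-v3 as X8, v4-v7 as X32, v8-v15 as F32; v16-v31 stay disabled.
value = (3 << 0) | (7 << 10) | (15 << 20)
alloc = decode_vcfgd(value, EXAMPLE_ORDER)
print(alloc[:16])
```

Because a field value of zero means that no registers are allocated, the lowest non-zero field always covers at least v0 and v1, which is the restriction noted above.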
Figure 1.4: vcfgd value to set configuration.
Separate vcfgp and vcfgpi instructions are provided, using
the CSRRW and CSRRWI encodings respectively, that write the source
value to the vcnpred register and return the new MVL. These
writes also clear the vector data registers, set all bits in the
allocated predicate registers, and set vl=MVL. A vcfgp or
vcfgpi instruction can be used after a vcfgd to complete a
reconfiguration of the vector unit.
If a zero argument is given to vcfgd, the vector unit will be
unconfigured with no enabled registers, and the value 0 will be
returned for MVL. Only the configuration registers vcmaxw and
vcnpred can be accessed in this state, either directly or via
vcfgd, vcfgdi, vcfgp, or vcfgpi
instructions. Other vector instructions will raise an illegal
instruction exception.
To quickly change the individual types of a vector register, each
vector data register n has a dedicated CSR address to access its
vctype field, named vctypevn. The vcfgt and vcfgti instructions are assembler pseudo-instructions for regular
CSRRW and CSRRWI instructions that update the type fields and return
the original value. The vcfgti instruction is typically used to
change to a desired type while recording the previous type in one
instruction, and the vcfgt instruction is used to revert back to
the saved type.
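The save-and-restore pattern can be sketched with a toy model of the per-register type CSRs. The class and function names are hypothetical, and the sketch models only the swap semantics of CSRRW; it ignores the zeroing of register contents and the width check that accompany a real vctype write.

```python
X32, F32 = 0b1010, 0b0010  # type encodings from the vctype table


class VctypeFile:
    """Toy model of the vctypev0..vctypev31 CSRs (hypothetical class)."""

    def __init__(self, initial):
        self.fields = list(initial)

    def csrrw(self, n, new_type):
        """Write a new type to register n and return the previous type."""
        old = self.fields[n]
        self.fields[n] = new_type
        return old


def callee(csrs):
    saved = csrs.csrrw(4, F32)   # vcfgti-style: switch v4 to F32, keep old type
    during = csrs.fields[4]      # the callee works with v4 as F32 here
    csrs.csrrw(4, saved)         # vcfgt-style: revert v4 to the saved type
    return during


csrs = VctypeFile([X32] * 32)
print(callee(csrs) == F32, csrs.fields[4] == X32)
```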
The vector extension is based on the style of vector register architecture introduced by Seymour Cray in the 1970s, as opposed to the earlier packed SIMD approach, introduced with the Lincoln Labs TX-2 in 1957 and now adopted by most other commercial instruction sets.
The vector instruction set contains many features developed in earlier research projects, including the Berkeley T0 and VIRAM vector microprocessors, the MIT Scale vector-thread processor, and the Berkeley Maven and Hwacha projects.