VC4ASM - Assembler instructions

↑ Top, ALU, mov, ldi, nop, read, Semaphore, Branch, Signal, Conditional assignment, Pack/unpack

General rules

Instruction joining

A trailing ; indicates that the assembler my try to merge the current instruction with the next one if next instruction is preceded with ; and the two instructions fit into a single opcode. E.g.

add r0, r0, 1;
;mov r2, 64

generates the same code as

add r0, r0, 1;  mov r2, 64

This might cross macro boundaries. But be aware that dependencies might break your code. E.g. a read to a register file might move immediately after a write to the same register. Take special care of branch targets. They should not be joined with the previous instruction. Placing a bare colon in front of a line defines an anonymous label and prevents instruction joining over this point. Ordinary labels will do the same.

ALU instruction

binaryopcode destination, source1, source2
unaryopcode destination, source1
opcode
.setf ...
opcode.ifcc ...
binaryopcode, unaryopcode
ALU opcode, see below.
destination
Target register.
source1, source2
Source register or small immediate value.

Opcodes

opcode source1 source2 destination flags (if .setf is used)
type type type value Z N C
add uint32 uint32 uint32 source1 + source2 destination == 0 destination >>> 31 source1 + source2 > 0xffffffff
sub uint32 uint32 uint32 source1 - source2 destination == 0 destination >>> 31 source1 < source2
min int32 int32 int32 source1 > source2 ? source2 : source1 destination == 0 destination >>> 31 source1 > source2
max int32 int32 int32 source1 > source2 ? source1 : source2 destination == 0 destination >>> 31 source1 > source2
and uint32 uint32 uint32 source1 & source2 destination == 0 destination >>> 31 0
or uint32 uint32 uint32 source1 | source2 destination == 0 destination >>> 31 0
xor uint32 uint32 uint32 source1 ^ source2 destination == 0 destination >>> 31 0
shl uint32 uint32 uint32 source1 <<< (source2 & 32) destination == 0 destination >>> 31 source1 >>> 32-source2 & 1
shr uint32 uint32 uint32 source1 >>> (source2 & 31) destination == 0 destination >>> 31 source1 >>> (source2 & 31)-1 & 1
asr int32 int32 int32 source1 >> (source2 & 31) destination == 0 destination >>> 31 source1 >>> (source2 & 31)-1 & 1
ror uint32 uint32 uint32 source1 >>< (source2 & 31) destination == 0 destination >>> 31 0
not uint32
uint32 ~source1 destination == 0 destination >>> 31 0
clz uint32
uint32 32 - floor(log₂(source1)) destination == 0 0
0
mul24 uint24 uint24 uint32 source1 * source2 destination == 0 destination >>> 31 source1 * source2 > 0xffffff
fadd float32 float32 float32 source1 + source2 destination == 0 destination >>> 31 destination > 0 (incl. +NaN)
fsub float32 float32 float32 source1 - source2 destination == 0 destination >>> 31 destination > 0 (incl. +NaN)
fmin float32 float32 float32 source1 > source2 ? source2 : source1 destination == 0 destination >>> 31 source1 > source2
fmax float32 float32 float32 source1 > source2 ? source1 : source2 destination == 0 destination >>> 31 source1 > source2
fminabs float32 float32 float32 abs(source1) > abs(source2) ? abs(source2) : abs(source1) destination == 0 destination >>> 31 abs(source1) > abs(source2)
fmaxabs float32 float32 float32 abs(source1) > abs(source2) ? abs(source1) : abs(source2) destination == 0 destination >>> 31 abs(source1) > abs(source2)
fmul float32 float32 float32 source1 * source2 destination == 0 destination >>> 31 0
itof int32
float32 source1 destination == 0 destination >>> 31 0
ftoi float32
int32 source1 destination == 0 destination >>> 31 0
v8adds uint8[4] uint8[4] uint8[4] min(source1[] + source2[], 255) destination == 0 destination >>> 31 0
v8subs uint8[4] uint8[4] uint8[4] max(min(source1[] - source2[], 255), 0) destination == 0 destination >>> 31 0
v8min uint8[4] uint8[4] uint8[4] min(source1[], source2[]) destination == 0 destination >>> 31 0
v8max uint8[4] uint8[4] uint8[4] max(source1[], source2[]) destination == 0 destination >>> 31 0
v8muld uint8[4] uint8[4] uint8[4] (source1 * source2 + 127) / 255 destination == 0 destination >>> 31 0

Example

add.setf r3, ra0, unif
mul24 r0, r1, r2

Move instruction

mov destination, source
mov destination, register << rotate
mov destination, register >> rotate
mov destination, register >> r5
mov destination1, destination2, source
mov.setf ...
mov.ifcc ...
destination
Target register(s).
source
Source register or immediate value.
register
Source register for small rotate instructions.
rotate
Optional rotation of the value.

Strictly speaking mov is no QPU instruction. It is simply a convenient way to create a identity ALU instruction like or with two identical source arguments or an ldi instruction, whatever fits best.

If source is a register, the assembler preferably uses the ADD ALU to realize the movement. If either the ADD ALU is already used by the current instruction or a rotate operation is requested it uses the MUL ALU. The op-code or is used in case of the ADD ALU and v8min for the MUL ALU. Except when 16 bit floating point unpack is requested, in this case the instruction fmin is used.

If source fits into a small immediate value then the assembler prefers this over load immediate. The assembler is quite smart when using small immediate. E.g. the immediate value 64 which has no direct equivalent can be achieved by passing 8 to both inputs of the MUL ALU with instruction mul24. Again the ADD ALU is preferred when available. But some hacks like the example before require the MUL ALU, but the same value could be constructed by the ADD ALU from the value 4 with the shl instruction. Even some pack modes are considered to achieve the desired constant. See the small immediate table for a list of supported values. The value 0 can be assigned without the use of a small immediate value by any ALU using xor or v8subs with an identical source.
Be aware that the carry flag is not well defined in case .setf is used because of the free choice of the opcode.

If neither the second ALU nor signalling flags are used at the end then the instruction is converted back to ldi to save ALU power.

If source does not fit into a small immediate than a ldi instruction is generated.

With some restrictions you can handle two move instructions in a single cycle. E.g. if both sources are registers or if one source is from register file A and the other source fits into a small immediate value of if both sources can be created from the same small immediate value.

Examples

mov ra29, 16
mov r3, rb4 << 2; mov r2, ra11 # Uses the MUL ALU for the first move (because of the vector rotation) and the ADD ALU for the second one.
mov r0, 0x8000000; mov tmurs, 1 # Uses small immediate value 1 with ror r0, 1, 1 to create the 0x80000000.

Load immediate

ldi destination, constant
ldi destination1, destination2, constant
ldi.setf ...
ldi.ifcc ...
destination
Target register.
constant
Immediate value.
In contrast to mov ldi always generates a load immediate instruction even if the constant fits into a small immediate value. The same value can be assigned to two targets at once by using the ADD and the MUL ALU output.

Example

ldi ra7, 0xffff0000

No operation

nop
anop
mnop
mnop destination
destination
Target register.

nop does nothing, well not really. At least it reserves an instruction word that causes a delay. This could be required to meet some instruction constraints.
The variants anop and mnop explicitly schedule to the ADD ALU or MUL ALU respectively. Otherwise vc4asm would take whatever ALU is available.

vc4asm allows the nop instructions to have a target. This can be used to access previous ALU results again.

Read pseudo instruction

read source
source
Source register. This must be from register file A or B including peripherals or a small immediate value.

read is also no instruction of the QPU. It is just an extension of vc4asm to create a register file A or B access without allocation of an ALU instruction. Semantically it is identical to mov -, source, but it will not create any opcode to one of the ALUs. Instead only the raddr_a or raddr_b field is assigned. You can combine up to two read with up to two ALU instructions into a single instruction word as long as they do not require the particular register file source.

Use cases

Wait for peripheral register

read vw_wait

When you read a register only for the purpose to create a QPU stall then there is no need to involve an ALU. In most cases it is a good advise to prefer read ... over mov -, ....

Discard uniform or VPM value

read unif

Prefetch small immediate value

read 8;  ...
and.setf -, elem_num, rb39;  ldtmu0

Small immediate values cannot be combined with signals. But if you can prefetch the value in the previous instruction, you are able to use the value together with a signal without the need for a temporary accumulator or even one of the ALUs of the previous instruction.

Semaphore instruction

sacq destination, number
srel destination, number
mov destination, sacqnumber
mov destination, srelnumber
destination
Target register, usually -, since the output of a semaphore instruction is not generally useful. But if it happens to be useful you may assign the value like with an ldi instruction.
number
Semaphore number to acquire or release. Only the low order 4 bits of the value are used to identify the semaphore number. Bit 4 is controlled by the acquire/release flag and any further bits are placed unchanged into the immediate value field of the instruction and may be chosen arbitrary to if you want to assign a destination.

Example

sacq -, 7
mov -, sacq7

The two instructions above are equivalent. The following function below provides Broadcom compatible syntax.

.set sacq(i) sacq0 + i
mov -, sacq(7)

Branch instruction

bra.cond destination, target
brr.cond destination, target
bra.cond destination, target1, target2
brr.cond destination, target1, target2
bra.cond destination1, destination2, target1, target2
brr.cond destination1, destination2, target1, target2
.cond
Branch condition, optional. One of:
condition zero flag negative flag carry flag
set on all SIMD elements .allz .alln .allc
not set on all SIMD elements .allnz .allnn .allnc
set on at least one SIMD element .anyz .anyn .anyc
not set on at least one SIMD element .anynz .anynn .anync
destination, destination1, destination2
Target register or -. The destination(s) receive the PC position where the branch takes place, i.e. PC + 4, but the assignment only takes place if the branch is actually taken.
The option to have two destination registers require to specify two branch targets also. In doubt use 0 or -, e.g: brr ra_link, r0, -, r:target
target, target1, target2
Register from register file A, constant or label. Branch instructions can add two targets if one of them is a register and the other one is a constant or label.
Note that the use of odd register numbers implies .setf which is generally not intended.

bra creates an absolute branch, i.e. target must be a physical memory address.
brr creates a relative branch, i.e. it adds PC + 4 to the target.

Remember that branch instructions are executed 3 instructions delayed, i.e. three further instructions are always executed before any branch is taken.

Signaling instruction

bkpt
thrsw
thrend
sbwait
sbdone
lthrsw
loadcv
loadc
ldcend
ldtmu0
ldtmu1
loadam
The above signals can be combined with any normal ALU instructions in one line, i.e. no load immediate, no small immediate, no semaphore and no branch. See Broadcom reference guide for details.