vcio2 API Reference


Access vcio2 device

open device

int vcio2 = open("/dev/vcio2", O_RDWR);

Open the vcio2 device for further usage.

Note that all resources acquired by this device are tied to this device handle. So do not close it unless you no longer need the resources.

vcio2 device handles cannot reasonably be inherited by or passed to forked processes. The resources are always tied to the PID that opened the device.

All calls to vcio2 handles are thread-safe.

close device

close(vcio2);

Close the vcio2 device and release all resources.

If QPU code is still executing when close is called, close waits until the execution has completed or timed out, then discards all memory results and closes the handle.
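The open/close life cycle can be sketched as follows. This is only a minimal illustration; the helper names open_vcio2 and close_vcio2 are hypothetical and not part of the driver API.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open the device node; returns the file descriptor or -1 on failure. */
int open_vcio2(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
    return fd;
}

/* Close the handle. This releases all resources acquired through it,
 * so call it only once they are no longer needed. */
void close_vcio2(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

Typical use is one open_vcio2("/dev/vcio2") at program start and one close_vcio2 at exit, in the same process that uses the resources.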

General IOCTLs

IOCTL_GET_VCIO_VERSION (0x000065C0)

Return the vcio2 API version. The high word is the major version, the low word the minor version. Currently 0x00000003, i.e. 0.3.

int version = ioctl(vcio2, IOCTL_GET_VCIO_VERSION, 0);

It is good advice to check the compatibility of the driver version before further use to avoid unexpected results. The inline helper function vcio2_version_is_compatible does the check:

if (!vcio2_version_is_compatible(version))
// show appropriate error
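To report a version mismatch in readable form, the version word can be split as described above. A minimal sketch; version_major, version_minor and format_version are hypothetical helper names, and it assumes the ioctl result has already been checked for a negative error value.

```c
#include <stdint.h>
#include <stdio.h>

/* High word = major version, low word = minor version. */
unsigned version_major(uint32_t version) { return version >> 16; }
unsigned version_minor(uint32_t version) { return version & 0xFFFFu; }

/* Format the version word as "major.minor", e.g. for an error message.
 * Returns the number of characters snprintf would have written. */
int format_version(uint32_t version, char *buf, size_t len)
{
    return snprintf(buf, len, "%u.%u",
                    version_major(version), version_minor(version));
}
```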

GPU memory allocation

IOCTL_MEM_ALLOCATE (0xC00C650c)

Allocate GPU memory. The memory is contiguous in physical address space and taken from the reserved GPU memory pool.

typedef struct
{ union
  { struct
    { unsigned int size;
      unsigned int alignment;
      unsigned int flags;
    } in;
    struct
    { unsigned int handle;
    } out;
  };
} vcio_mem_allocate;

vcio_mem_allocate buf;
...
int retval = ioctl(vcio2, IOCTL_MEM_ALLOCATE, &buf);
in.size
Number of bytes to allocate.
Note that with CMA enabled (dynamic GPU memory size) in /boot/config.txt allocations of more than 16 MiB seem to fail.
in.alignment
Alignment of the resulting buffer in physical memory. 4096 (the page size) is strongly recommended.
in.flags
0xC ⇒ cached; 0x4 ⇒ direct
Other flags unknown. The parameter is directly passed to the VCMSG_SET_ALLOCATE_MEM mailbox message.
return value
0 ⇒ success
EINVAL ⇒ some of the parameters are out of range
ENOMEM ⇒ cannot allocate the memory
EFAULT ⇒ failed to access provided data buffer
out.handle
Memory handle. To be used with IOCTL_MEM_LOCK.

vcio2 keeps track of the allocated memory chunks. As soon as the vcio2 device is closed or the application terminates, the memory is given back to the GPU memory pool. So remember to keep the device open!

Instead of doing all steps of the memory allocation manually, you may also allocate the memory directly by calling mmap with an address of 0 (see Automatic memory allocation).
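A manual allocation might look like the following sketch. The request code and the structure are repeated from this reference so that the example compiles without the driver header; with the header, drop them. The helper name gpu_alloc is hypothetical. The sketch also assumes the usual ioctl convention that a non-zero return value signals an error, with the code in errno.

```c
#include <string.h>
#include <sys/ioctl.h>

#define IOCTL_MEM_ALLOCATE 0xC00C650cu /* as listed in this reference */

/* Input and output share the same storage. */
typedef struct
{ union
  { struct
    { unsigned int size;
      unsigned int alignment;
      unsigned int flags;
    } in;
    struct
    { unsigned int handle;
    } out;
  };
} vcio_mem_allocate;

/* Allocate size bytes, page aligned, with flags 0xC (cached).
 * Returns 0 and stores the memory handle, or -1 on error. */
int gpu_alloc(int vcio2, unsigned int size, unsigned int *handle)
{
    vcio_mem_allocate buf;
    memset(&buf, 0, sizeof buf);
    buf.in.size = size;
    buf.in.alignment = 4096;   /* page size, as recommended above */
    buf.in.flags = 0xC;
    if (ioctl(vcio2, IOCTL_MEM_ALLOCATE, &buf) != 0)
        return -1;
    /* Read out.handle only after success: it overlays in.size. */
    *handle = buf.out.handle;
    return 0;
}
```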

IOCTL_MEM_RELEASE (0x0000650f)

int retval = ioctl(vcio2, IOCTL_MEM_RELEASE, handle);

Release GPU memory. This also unlocks the memory segment if still locked.

handle
Memory handle from IOCTL_MEM_ALLOCATE.
return value
0 ⇒ success
EINVAL ⇒ invalid memory handle
ENOMEM ⇒ firmware failed to release memory

IOCTL_MEM_LOCK (0xC004650d)

uint32_t addr = handle;
int retval = ioctl(vcio2, IOCTL_MEM_LOCK, &addr);

Lock the memory segment at a physical address.

handle
Memory handle from IOCTL_MEM_ALLOCATE.
return value
0 ⇒ success
EINVAL ⇒ invalid memory handle
ENOMEM ⇒ firmware failed to lock memory
EFAULT ⇒ failed to access provided data buffer
addr
Physical memory address where the memory segment has been locked.

IOCTL_MEM_UNLOCK (0x0000650e)

int retval = ioctl(vcio2, IOCTL_MEM_UNLOCK, handle);

Unlock memory segment and release the binding to a physical address.

handle
Memory handle from IOCTL_MEM_ALLOCATE.
return value
0 ⇒ success
EINVAL ⇒ invalid memory handle
ENOMEM ⇒ firmware failed to unlock memory
EPERM ⇒ tried to unlock memory that is not locked

Note that unlocking memory has the side effect of invalidating all memory mappings that refer to this segment. vcio2 removes the corresponding PTEs from your process, so you will get a bus error when you try to access a virtual address formerly mapped to this memory block.
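Locking and releasing a handle can be sketched like this. The request codes are copied from this reference so the example compiles without the driver header; gpu_lock and gpu_release are hypothetical helper names, and a non-zero ioctl result is assumed to mean failure.

```c
#include <stdint.h>
#include <sys/ioctl.h>

#define IOCTL_MEM_LOCK    0xC004650du /* as listed in this reference */
#define IOCTL_MEM_RELEASE 0x0000650fu /* as listed in this reference */

/* Lock a memory handle. On success *addr receives the address the
 * segment has been locked at. Returns 0 or -1. */
int gpu_lock(int vcio2, uint32_t handle, uint32_t *addr)
{
    uint32_t inout = handle;   /* in: handle, out: address */
    if (ioctl(vcio2, IOCTL_MEM_LOCK, &inout) != 0)
        return -1;
    *addr = inout;
    return 0;
}

/* Release a memory handle. The release also unlocks the segment,
 * so no separate IOCTL_MEM_UNLOCK is needed for cleanup. */
int gpu_release(int vcio2, uint32_t handle)
{
    return ioctl(vcio2, IOCTL_MEM_RELEASE, (unsigned long)handle) == 0 ? 0 : -1;
}
```

Remember that any mapping of the segment becomes invalid once it is unlocked or released, so drop all pointers into it first.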

IOCTL_MEM_QUERY (0xc010658f)

Query information about a memory allocation.

typedef struct
{ uint32_t handle;
  uint32_t bus_addr;
  void*    virt_addr;
  uint32_t size;
} vcio_mem_query;

vcio_mem_query buf;
...
int retval = ioctl(vcio2, IOCTL_MEM_QUERY, &buf);
handle
Memory handle from IOCTL_MEM_ALLOCATE.
bus_addr
Physical memory address.
virt_addr
Virtual memory address in user space.
size
Size of the memory segment.
return value
0 ⇒ success
EINVAL ⇒ there is no memory block that matches all given criteria
EFAULT ⇒ failed to access provided data buffer

All fields in vcio_mem_query are optional on input. Simply leave the unneeded fields zero. The driver will fill in all missing values on successful return. At least one of handle, bus_addr or virt_addr should be filled or you will get EINVAL. EINVAL is also returned when the supplied address or handle does not belong to a memory allocation made via the same device file handle.

You may also pass a memory address from within an allocated area. In this case the driver changes the address to the beginning of the area. This applies to both bus_addr and virt_addr.
I.e. the driver never returns partial memory segments. But what is considered a memory segment depends on the kind of query. If you ask for a virtual address, you may get smaller chunks because a virtual address mapping can cover only part of an allocated memory segment. In this case the returned bus_addr may not match the start of the returned handle but will match the returned start of virt_addr instead.

If you specify size on input the entire range from the start address must be within the same memory segment, otherwise the driver returns EINVAL. This could be used to verify if an address range is valid.
The same applies if you supply multiple fields, e.g. handle and bus_addr. If they do not match you'll get EINVAL.
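The range check described above could be wrapped like this. A sketch only: gpu_range_is_valid is a hypothetical helper, and the numeric request code assumes the 16 byte structure layout of a 32 bit ARM build, so in real code take the constant from the driver header.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Assumption: read/write request code for a 16 byte structure. */
#define IOCTL_MEM_QUERY 0xC010658Fu

typedef struct
{ uint32_t handle;
  uint32_t bus_addr;
  void*    virt_addr;
  uint32_t size;
} vcio_mem_query;

/* Check whether size bytes starting at ptr lie within one mapped
 * segment of this device handle.
 * Returns 1 if valid, 0 if the driver answers EINVAL, -1 on other errors. */
int gpu_range_is_valid(int vcio2, void *ptr, uint32_t size)
{
    vcio_mem_query q;
    memset(&q, 0, sizeof q);   /* unneeded fields stay zero */
    q.virt_addr = ptr;
    q.size = size;
    if (ioctl(vcio2, IOCTL_MEM_QUERY, &q) == 0)
        return 1;
    return errno == EINVAL ? 0 : -1;
}
```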

Memory mapping

To access the GPU memory from the ARM core you need to map the memory into your virtual address space. Simply use mmap with the vcio2 device handle for this purpose.

uint32_t *mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, vcio2, addr);
vcio2
File handle from open.
addr
Physical address of the memory block from IOCTL_MEM_LOCK or 0 to allocate new memory.
Note that mmap requires addr to be aligned at a page boundary (4096 bytes).
Note further that on a Raspberry Pi the physical memory address used by the ARM core is not the same as the bus address used by the GPU. But vcio2 will accept the bus address from IOCTL_MEM_LOCK as well.
size
Number of bytes to map. This should be the same as the size passed to IOCTL_MEM_ALLOCATE.
return value mem
Virtual address of the mapped memory or MAP_FAILED on error. errno may be one of:
EINVAL ⇒ invalid flags
EACCES ⇒ tried to map memory that is not allocated by this device handle
ENOMEM ⇒ memory mapping failed

vcio2 validates the memory mappings. I.e. you can only map memory that has been previously allocated with the same device handle. Otherwise you get an EACCES error.

Memory mappings cannot be inherited by forked or child processes. vcio2 simply does not support that.
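Mapping a previously locked segment can be sketched as follows. The helper names is_page_aligned, gpu_map and gpu_unmap are hypothetical. The sketch rejects an address of 0 on purpose, because that value requests a new allocation instead of mapping existing memory.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define GPU_PAGE_SIZE 4096u

/* mmap requires the address to start at a page boundary. */
int is_page_aligned(uint32_t addr)
{
    return (addr & (GPU_PAGE_SIZE - 1u)) == 0;
}

/* Map size bytes of locked GPU memory at addr (from IOCTL_MEM_LOCK).
 * Returns the virtual address or NULL with errno set. */
uint32_t *gpu_map(int vcio2, uint32_t addr, size_t size)
{
    if (addr == 0 || size == 0 || !is_page_aligned(addr))
    {
        errno = EINVAL;
        return NULL;
    }
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     vcio2, (off_t)addr);
    return mem == MAP_FAILED ? NULL : (uint32_t *)mem;
}

/* Remove a mapping created by gpu_map. */
void gpu_unmap(uint32_t *mem, size_t size)
{
    if (mem != NULL)
        munmap(mem, size);
}
```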

Automatic memory allocation

Allocate physical GPU memory with mmap. This allocates memory, locks it to a physical address and maps it into the virtual address space of the current process in one step. You will always get page-aligned memory without VC4 L2 cache.

uint32_t* mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, vcio2, 0);
uint32_t bus_address = *mem;
or simply
uint32_t* mem = vcio2_malloc(vcio2, size);
uint32_t bus_address = *mem;
vcio2
File handle from open.
size
Number of bytes to allocate.
return value mem
Virtual address of the mapped memory or MAP_FAILED on error. errno may be one of:
EINVAL ⇒ invalid flags
ENOMEM ⇒ out of memory or memory mapping failed
For uniforms you usually also need the bus address of the memory. To get this address simply read the first uint32_t from the allocated memory. It will always contain the bus address after an automatic allocation.

The memory allocated this way is released as soon as you call munmap.
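The complete automatic allocation, including the bus address lookup and the release, might be wrapped as in this sketch. gpu_auto_alloc and gpu_auto_free are hypothetical names; they are not the vcio2_malloc helper mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Allocate, lock and map size bytes in one step (offset 0).
 * On success *bus_addr receives the bus address, which the driver
 * stores in the first word of the new block. Returns NULL on error. */
uint32_t *gpu_auto_alloc(int vcio2, size_t size, uint32_t *bus_addr)
{
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     vcio2, 0);
    if (mem == MAP_FAILED)
        return NULL;
    /* Read the bus address before writing anything to the block. */
    *bus_addr = *(uint32_t *)mem;
    return (uint32_t *)mem;
}

/* Unmapping releases the memory again. */
void gpu_auto_free(uint32_t *mem, size_t size)
{
    if (mem != NULL)
        munmap(mem, size);
}
```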

QPU code execution

IOCTL_ENABLE_QPU (0x00006512)

Power the QPU on or off.

int retval = ioctl(vcio2, IOCTL_ENABLE_QPU, flag);
flag
1 ⇒ power on the QPU
0 ⇒ power off the QPU
Note that the QPU may not really be powered off because another open handle to the vcio2 device might still request QPU power. Only when the last open instance of vcio2 has requested power off is the QPU actually turned off.
return value
0 ⇒ success
ENODEV ⇒ firmware failed to enable QPU

The QPU is automatically powered on at IOCTL_EXEC_QPU and automatically turned off when the last process closes the vcio2 device. So there is normally no need to call this IOCTL explicitly.

IOCTL_EXEC_QPU (0x40106511)

Execute QPU code.

typedef struct
{ unsigned int uniforms;
  unsigned int code;
} vcio_exec_qpu_entry;

typedef struct
{ unsigned int num_qpus;
  unsigned int control;
  unsigned int noflush;
  unsigned int timeout;
} vcio_exec_qpu;

vcio_exec_qpu buf;
...
int retval = ioctl(vcio2, IOCTL_EXEC_QPU, &buf);
num_qpus
Number of QPUs that should be kicked off. Each QPU receives its own shader code and its own set of uniforms. So this is also the size of the control array.
control
Setup entries for each QPU containing the physical start address of the uniforms and the code. This is a physical pointer to an array of vcio_exec_qpu_entry structures with exactly num_qpus elements.
noflush
Flag: do not flush the cache before starting the QPU code.
timeout
Timeout in milliseconds to wait for a host interrupt. If the timeout elapses the function returns with an error. Note that this will not stop the QPU code so far.
return value
0 ⇒ success
EINVAL ⇒ some of the parameters are out of range
EACCES ⇒ some of the addresses passed do not belong to memory that is allocated by this device handle
ENOEXEC ⇒ execution of QPU code failed, probably a timeout
EFAULT ⇒ failed to access provided data buffer

Although vcio2 does some basic checks to prevent accidental access to invalid memory, it cannot check the memory accesses done by the QPU code. So you have to take care to execute only valid QPU code, otherwise the Raspberry Pi might crash. However, in most cases the Raspberry Pi will recover from faults after the timeout and no resources will be lost. This makes GPU development significantly more relaxed.

While QPU code is executing, the Raspbian kernel can no longer access the property channel used for several other purposes, e.g. power management or several firmware calls. Every attempt to call such a function blocks until the QPU code raises a host interrupt or the timeout elapses. This is a restriction of the firmware rather than vcio2.

If the QPU is not yet powered on, the power will be turned on automatically before this request. The power will not be turned off afterwards unless the device is closed or you explicitly request it by IOCTL_ENABLE_QPU 0 and of course no other process needs QPU power.
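Setting up the control array and starting the QPUs can be sketched as below. The structures and the request code are repeated from this reference; fill_control and run_qpus are hypothetical helpers. The sketch assumes that every QPU runs the same code with its own uniform block at a fixed stride, and that the entries array itself lives in mapped GPU memory whose GPU-side address is passed as control_addr.

```c
#include <stdint.h>
#include <sys/ioctl.h>

#define IOCTL_EXEC_QPU 0x40106511u /* as listed in this reference */

typedef struct
{ unsigned int uniforms;
  unsigned int code;
} vcio_exec_qpu_entry;

typedef struct
{ unsigned int num_qpus;
  unsigned int control;
  unsigned int noflush;
  unsigned int timeout;
} vcio_exec_qpu;

/* Fill one entry per QPU: shared code, per-QPU uniforms. */
void fill_control(vcio_exec_qpu_entry *entries, unsigned num_qpus,
                  uint32_t code_addr, uint32_t uniforms_addr,
                  uint32_t uniforms_stride)
{
    for (unsigned i = 0; i < num_qpus; ++i)
    {
        entries[i].uniforms = uniforms_addr + i * uniforms_stride;
        entries[i].code = code_addr;
    }
}

/* Start num_qpus QPUs and wait up to timeout_ms for the host interrupt.
 * Returns 0 on success, -1 on error (errno set by ioctl). */
int run_qpus(int vcio2, unsigned num_qpus, uint32_t control_addr,
             unsigned timeout_ms)
{
    vcio_exec_qpu buf;
    buf.num_qpus = num_qpus;
    buf.control = control_addr;
    buf.noflush = 0;           /* flush the cache before starting */
    buf.timeout = timeout_ms;
    return ioctl(vcio2, IOCTL_EXEC_QPU, &buf) == 0 ? 0 : -1;
}
```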

Performance counters

IOCTL_SET_V3D_PERF_COUNT (0x000065c1)

Enable or disable V3D performance counters for this instance.

int retval = ioctl(vcio2, IOCTL_SET_V3D_PERF_COUNT, enabled);
enabled
Bit vector of performance counters to activate. Any combination of
V3D_PERF_COUNT_QPU_CYCLES_IDLE
V3D_PERF_COUNT_QPU_CYCLES_VERTEX_SHADING
V3D_PERF_COUNT_QPU_CYCLES_FRAGMENT_SHADING
V3D_PERF_COUNT_QPU_CYCLES_VALID_INSTRUCTIONS
V3D_PERF_COUNT_QPU_CYCLES_STALLED_TMU
V3D_PERF_COUNT_QPU_CYCLES_STALLED_SCOREBOARD
V3D_PERF_COUNT_QPU_CYCLES_STALLED_VARYINGS
V3D_PERF_COUNT_QPU_INSTRUCTION_CACHE_HITS
V3D_PERF_COUNT_QPU_INSTRUCTION_CACHE_MISSES
V3D_PERF_COUNT_QPU_UNIFORMS_CACHE_HITS
V3D_PERF_COUNT_QPU_UNIFORMS_CACHE_MISSES
V3D_PERF_COUNT_TMU_TEXTURE_QUADS_PROCESSED
V3D_PERF_COUNT_TMU_TEXTURE_CACHE_MISSES
V3D_PERF_COUNT_VPM_CYCLES_STALLED_VDW
V3D_PERF_COUNT_VPM_CYCLES_STALLED_VCD
V3D_PERF_COUNT_L2C_L2_CACHE_HITS
V3D_PERF_COUNT_L2C_L2_CACHE_MISSES
return value
0 ⇒ success
EINVAL ⇒ enabled contains a counter that is not supported by vcio2
EBUSY ⇒ maximum number of concurrent performance counters exceeded

Performance counters are a limited resource of VideoCore IV. No more than 16 counters can be activated at the same time.
Furthermore, vcio2 currently does not support switching enabled counters for individual QPU executions of different open driver instances. I.e. no more than 16 counters can be activated at the same time over all vcio2 users. However, if two instances request the same counter it will be physically shared. But every instance has its own set of counter values. They are only activated while an execution of the own instance is performed. In fact this makes the counter V3D_PERF_COUNT_QPU_CYCLES_IDLE somewhat useless since it will not count the time between executions.

IOCTL_GET_V3D_PERF_COUNT (0x800465c1)

Get currently activated performance counters of this instance.

uint32_t enabled;
int retval = ioctl(vcio2, IOCTL_GET_V3D_PERF_COUNT, &enabled);
return value
0 ⇒ success
EFAULT ⇒ failed to access provided data buffer

IOCTL_READ_V3D_PERF_COUNT (0x804065c2)

Read all enabled performance counters.

uint32_t counters[16];
int retval = ioctl(vcio2, IOCTL_READ_V3D_PERF_COUNT, &counters);
return value
0 ⇒ success
ENODATA ⇒ performance counters are not enabled for this device handle
EFAULT ⇒ failed to access provided data buffer

The counter values are returned in ascending order and disabled counters will not have an empty slot. E.g. if you enabled V3D_PERF_COUNT_QPU_INSTRUCTION_CACHE_HITS|V3D_PERF_COUNT_L2C_L2_CACHE_HITS|V3D_PERF_COUNT_VPM_CYCLES_STALLED_VDW then you will receive exactly 3 values: V3D_PERF_COUNT_QPU_INSTRUCTION_CACHE_HITS in counters[0], V3D_PERF_COUNT_VPM_CYCLES_STALLED_VDW in counters[1] and V3D_PERF_COUNT_L2C_L2_CACHE_HITS in counters[2]. Due to restrictions of VideoCore IV the call will never return more than 16 values.
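Because disabled counters leave no empty slot, the array index of a counter depends on which other counters are enabled. The following sketch computes that index; perf_counter_slot is a hypothetical helper, and it assumes that "ascending order" means ascending bit position within the enabled bit vector.

```c
#include <stdint.h>

/* Return the index in the counters array that holds the value of
 * counter_bit, given the bit vector of enabled counters.
 * Returns -1 if counter_bit is not a single bit or is not enabled. */
int perf_counter_slot(uint32_t enabled, uint32_t counter_bit)
{
    if (counter_bit == 0 || (counter_bit & (counter_bit - 1u)) != 0)
        return -1;             /* not exactly one bit set */
    if ((enabled & counter_bit) == 0)
        return -1;             /* counter is not enabled */

    /* The slot is the number of enabled counters with a lower bit. */
    uint32_t lower = enabled & (counter_bit - 1u);
    int slot = 0;
    while (lower != 0)
    {
        lower &= lower - 1u;   /* clear the lowest set bit */
        ++slot;
    }
    return slot;
}
```

With three counters enabled, the counter with the lowest bit is found in counters[0] and the one with the highest bit in counters[2].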

IOCTL_RESET_V3D_PERF_COUNT (0x000065c3)

Reset performance counters of this instance.

int retval = ioctl(vcio2, IOCTL_RESET_V3D_PERF_COUNT, 0);