VCP Overview
The Video Coprocessor (VCP) is a small programmable unit intended for scanline/pixel-synchronous control tasks such as palette updates and simple decision-making while video is active.
A VCP program is a sequence of 32-bit instructions. Programs are uploaded to the VCP’s internal program memory via DMA and can be started/stopped independently of the CPU.
Program Format
VCP programs are arrays of uint32_t where each word is one instruction. The SDK provides macros in vcp.h to build these instruction words.
EVCPBufferSize
enum EVCPBufferSize
{
PRG_128Bytes = 0, // 32 words
PRG_256Bytes = 1, // 64 words
PRG_512Bytes = 2, // 128 words
PRG_1024Bytes = 3, // 256 words
PRG_2048Bytes = 4, // 512 words
PRG_4096Bytes = 5, // 1024 words
};
Programs must be exactly one of the sizes above. Pad with noop() instructions as needed.
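The size rule above can be checked on the host before uploading. The following is a minimal sketch, not part of the SDK: the enum is copied from this document so the block compiles on its own, and `prg_buffer_words` and `prg_pick_buffer_size` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copy of the enum from this document so the sketch is self-contained. */
enum EVCPBufferSize {
    PRG_128Bytes  = 0,
    PRG_256Bytes  = 1,
    PRG_512Bytes  = 2,
    PRG_1024Bytes = 3,
    PRG_2048Bytes = 4,
    PRG_4096Bytes = 5
};

/* Number of 32-bit words in a buffer of the given size: 32, 64, ..., 1024. */
static size_t prg_buffer_words(enum EVCPBufferSize size)
{
    return (size_t)32 << (unsigned)size;
}

/* Hypothetical helper: finds the smallest buffer size that holds word_count
   instruction words. Returns 0 and stores the size in *out on success,
   or returns -1 if the program does not fit in the largest buffer. */
static int prg_pick_buffer_size(size_t word_count, enum EVCPBufferSize *out)
{
    for (int s = PRG_128Bytes; s <= PRG_4096Bytes; ++s) {
        if (word_count <= prg_buffer_words((enum EVCPBufferSize)s)) {
            *out = (enum EVCPBufferSize)s;
            return 0;
        }
    }
    return -1;
}
```

The difference between `prg_buffer_words(size)` and the actual instruction count is the number of padding words the program array needs.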
API Documentation
void VCPUploadProgram(struct SPPlatform *ctx, const uint32_t* _program, enum EVCPBufferSize size);
Uploads a VCP program to the device. The SDK copies the program to shared memory and queues a DMA transfer into the VCP’s internal program memory. Programs must be padded to match the selected EVCPBufferSize.
void VCPExecProgram(struct SPPlatform *ctx, const uint8_t _execFlags);
Starts or stops VCP execution. Currently only the lowest bit of _execFlags is used (one VCP unit): pass 0x1 to start the VCP, or zero to stop it after the current instruction fully executes.
uint32_t VCPStatus(struct SPPlatform *ctx);
Reads the VCP status register. The returned bit pattern contains execution state, run state, program counter (PC), FIFO/DMA flags, and the opcode at the current program address.
Status Register Bitfield
0000 OOOO 0CFP PPPP PPPP PPPP RRRR EEEE
E = execstate (execution state machine state, 4 bits)
R = runstate (high for running, low for stopped, 4 bits)
P = address of current instruction (PC, 13 bits)
F = high when command FIFO is not empty (1 bit)
C = program DMA in flight (1 bit)
O = opcode at program address (4 bits)
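The layout above can be unpacked with shifts and masks. This is a sketch under one assumption: the diagram is read with bit 31 on the left and bit 0 on the right. The struct and the function name `decode_vcp_status` are illustrative and do not come from vcp.h.

```c
#include <stdint.h>

/* Illustrative container for the fields of the status word. */
struct VCPStatusFields {
    uint32_t execstate;      /* E: bits  3..0  */
    uint32_t runstate;       /* R: bits  7..4  */
    uint32_t pc;             /* P: bits 20..8  (13 bits) */
    uint32_t fifo_not_empty; /* F: bit  21 */
    uint32_t dma_in_flight;  /* C: bit  22 */
    uint32_t opcode;         /* O: bits 27..24 */
};

/* Splits a raw status word (as returned by VCPStatus) into its fields. */
static struct VCPStatusFields decode_vcp_status(uint32_t status)
{
    struct VCPStatusFields f;
    f.execstate      =  status        & 0xFu;
    f.runstate       = (status >> 4)  & 0xFu;
    f.pc             = (status >> 8)  & 0x1FFFu;
    f.fifo_not_empty = (status >> 21) & 0x1u;
    f.dma_in_flight  = (status >> 22) & 0x1u;
    f.opcode         = (status >> 24) & 0xFu;
    return f;
}
```

A caller might, for example, read the status after an upload and check `dma_in_flight` before starting the program; whether that check is required is not stated in this document.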
Instruction Encoding Notes
The instruction macros in vcp.h encode destination/source registers and immediates into fixed bit positions. There are 16 general registers (VREG_ZERO, VREG_1 … VREG_F) and a 1-bit special compare register (cmpreg) used by compare/branch.
Program memory addresses used by jump/branch/load/store must be 4-byte aligned.
Compare Flags
// Base flags (combine with COND_INV to negate)
#define COND_LE 0x01
#define COND_LT 0x02
#define COND_EQ 0x04
#define COND_INV 0x08
// Derived convenience flags
#define COND_GT (COND_LE | COND_INV)
#define COND_GE (COND_LT | COND_INV)
#define COND_NE (COND_EQ | COND_INV)
Only one of COND_LE/COND_LT/COND_EQ is considered (in that order), so prefer using a single base flag and optionally OR with COND_INV.
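To illustrate why mixing base flags is discouraged, here is a small software model of the priority rule described above. It is a sketch, not the hardware behavior: it assumes unsigned comparison (the signedness is not specified here) and assumes that a flag word with no base flag evaluates to false before inversion. The function name `model_cmp` is hypothetical; the flag values are copied from this section.

```c
#include <stdint.h>

/* Flag values copied from this document. */
#define COND_LE  0x01
#define COND_LT  0x02
#define COND_EQ  0x04
#define COND_INV 0x08
#define COND_GT  (COND_LE | COND_INV)
#define COND_GE  (COND_LT | COND_INV)
#define COND_NE  (COND_EQ | COND_INV)

/* Software model: only the first base flag found (LE, then LT, then EQ)
   is evaluated; COND_INV then negates the result. */
static int model_cmp(unsigned flags, uint32_t a, uint32_t b)
{
    int result;
    if (flags & COND_LE)      result = (a <= b);
    else if (flags & COND_LT) result = (a <  b);
    else if (flags & COND_EQ) result = (a == b);
    else                      result = 0; /* assumption: no base flag -> false */

    if (flags & COND_INV)
        result = !result;
    return result;
}
```

Under this model, `COND_LT | COND_EQ` with equal operands yields 0 because only COND_LT is evaluated, which is why COND_LE is the flag to use for a less-or-equal test.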
Register Constants
#define VREG_ZERO 0x00
#define VREG_1 0x01
...
#define VREG_E 0x0E
#define VREG_F 0x0F
VCP Instruction Reference
Instruction Latencies
Cycle counts marked with + belong to variable-latency operations: the number shown is the minimum, and the instruction can stall in wait states until its condition or interface handshake completes.
| Opcode | Mnemonic | Description | Cycles | Notes |
|---|---|---|---|---|
| 0x0 | NOP | No operation | 4 | Base pipeline |
| 0x1 | LOAD_IMM | Load 24-bit immediate into rd | 4 | Base pipeline |
| 0x2 | PAL_WRITE | Write palette entry | 4 | Base pipeline |
| 0x3 | SCANLINE_WAIT | Wait for scanline match | 4+ | Spins in EXEC until scanline == rs1; variable/unbounded |
| 0x4 | SCANPIXEL_WAIT | Wait for scan pixel match | 4+ | Spins in EXEC until scanpixel == rs1; variable/unbounded |
| 0x5 | MATHOP | ALU op (ADD/SUB/INC/DEC via imm8) | 4 | Base pipeline |
| 0x6 | JMP | Jump (direct or PC-relative) | 4 | Base pipeline; next fetch from new PC |
| 0x7 | CMP | Compare rs1 vs rs2, set flags | 5 | +1 for FINALIZE_COMPARE state |
| 0x8 | BRANCH | Conditional branch on cmpreg | 4 | Base pipeline (taken or not) |
| 0x9 | MEM_WRITE | Write rs2 to program memory at rs1 | 4 | Hijacks PC for the write port |
| 0xA | MEM_READ | Read program memory at rs1 into rd | 6 | +2 for WAIT_READ -> FINALIZE_READ |
| 0xB | READ_SCANINFO | Read scanline or scanpixel into rd | 4 | Base pipeline |
| 0xC | LOADPC | Copy next PC into rd (link register) | 4 | Base pipeline |
| 0xD | LOGICOP | Logic op (AND/OR/XOR/shifts/etc via imm8) | 4 | Base pipeline |
| 0xE | SYSMEM_WRITE | AXI write to system memory | 7+ | +3 min for AW handshake -> W handshake -> B response; each wait state stalls until AXI ready/valid |
| 0xF | SYSMEM_READ | AXI read from system memory | 6+ | +2 min for AR handshake -> R data; each wait state stalls until AXI ready/valid |
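The table can be turned into a lower-bound cycle estimate for a straight-line instruction sequence. The sketch below assumes the minimum cycle counts listed above and ignores wait states, so it under-reports anything marked with +. The names `kMinCycles` and `min_cycles_for_opcodes` are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimum cycles per opcode 0x0..0xF, taken from the latency table. */
static const uint8_t kMinCycles[16] = {
    4, /* 0x0 NOP            */
    4, /* 0x1 LOAD_IMM       */
    4, /* 0x2 PAL_WRITE      */
    4, /* 0x3 SCANLINE_WAIT  (4+) */
    4, /* 0x4 SCANPIXEL_WAIT (4+) */
    4, /* 0x5 MATHOP         */
    4, /* 0x6 JMP            */
    5, /* 0x7 CMP            */
    4, /* 0x8 BRANCH         */
    4, /* 0x9 MEM_WRITE      */
    6, /* 0xA MEM_READ       */
    4, /* 0xB READ_SCANINFO  */
    4, /* 0xC LOADPC         */
    4, /* 0xD LOGICOP        */
    7, /* 0xE SYSMEM_WRITE   (7+) */
    6  /* 0xF SYSMEM_READ    (6+) */
};

/* Sums the minimum latency of a list of opcodes executed once each, in order. */
static uint32_t min_cycles_for_opcodes(const uint8_t *opcodes, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += kMinCycles[opcodes[i] & 0x0F];
    return total;
}
```

For example, a compare followed by a branch costs at least 5 + 4 = 9 cycles.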
Wait instructions
wscn(src)
Operation: Waits for a scanline that matches the contents of register src. Valid range is 0..524. Out-of-range values cause an infinite wait.
Latency: 4+ cycles.
wpix(src)
Operation: Waits for a pixel (X coordinate) that matches the contents of register src. Valid range is 0..799. Out-of-range values cause an infinite wait.
Latency: 4+ cycles.
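Because an out-of-range target stalls the VCP indefinitely, it can be worth validating wait targets on the host before they are loaded into a register. A minimal sketch, using the ranges stated above; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Upper bounds taken from the valid ranges documented for wscn and wpix. */
#define VCP_SCANLINE_MAX  524u
#define VCP_SCANPIXEL_MAX 799u

/* Returns nonzero if the value is a scanline that wscn can ever match. */
static int wscn_target_ok(uint32_t scanline)
{
    return scanline <= VCP_SCANLINE_MAX;
}

/* Returns nonzero if the value is an X coordinate that wpix can ever match. */
static int wpix_target_ok(uint32_t pixel)
{
    return pixel <= VCP_SCANPIXEL_MAX;
}
```

This only covers targets known when the program is built; values computed by the VCP at run time would need an in-program compare and branch instead.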
Color palette access
pwrt(addrs, src)
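Operation: Writes the value of register src to the palette entry whose index is held in register addrs. Valid indices are 0..255, and consecutive entries are one index apart (the index is not a byte address).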
Latency: 4 cycles.
Arithmetic instructions
radd(dest, src1, src2)
Operation: Adds src2 to src1 and writes the result to dest. Overflow bits are discarded.
Latency: 4 cycles.
rsub(dest, src1, src2)
Operation: Subtracts src2 from src1 and writes the result to dest.
Latency: 4 cycles.
rinc(dest, src1)
Operation: Increments src1 by one and writes the result to dest.
Latency: 4 cycles.
rdec(dest, src1)
Operation: Decrements src1 by one and writes the result to dest.
Latency: 4 cycles.
Branch instructions
jump(addrs)
Operation: Unconditional jump to program memory address contained in register addrs. Address must be 4-byte aligned.
Latency: 4 cycles.
jumpim(offset)
Operation: Unconditional jump to PC-relative target (current PC plus offset). Offset is a 2’s complement (signed) 13-bit value; highest 3 bits are ignored. Target must be 4-byte aligned.
Latency: 4 cycles.
branch(addrs)
Operation: Conditional branch to address in addrs if cmpreg is nonzero. Address must be 4-byte aligned.
Latency: 4 cycles.
branchim(offset)
Operation: Conditional PC-relative branch if cmpreg is nonzero. Offset is a 2’s complement (signed) 13-bit value; highest 3 bits are ignored. Target must be 4-byte aligned.
Latency: 4 cycles.
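The offset rule for jumpim and branchim can be modelled as follows. This is a sketch with several assumptions that this document does not confirm: the offset is passed as a 16-bit field (inferred from "13-bit value; highest 3 bits are ignored"), "current PC" means the address of the jump or branch itself, and the target wraps to the 13-bit PC width. Function names are illustrative.

```c
#include <stdint.h>

/* Sign-extends the low 13 bits of the offset field; the top 3 bits are ignored. */
static int32_t decode_rel_offset(uint16_t field)
{
    uint32_t v = field & 0x1FFFu;       /* keep bits 12..0 */
    if (v & 0x1000u)                    /* bit 12 is the sign bit */
        return (int32_t)v - 0x2000;
    return (int32_t)v;
}

/* Computes the PC-relative target, wrapped to 13 bits (assumption). */
static uint32_t rel_target(uint32_t pc, uint16_t field)
{
    return (uint32_t)((int32_t)pc + decode_rel_offset(field)) & 0x1FFFu;
}

/* Returns nonzero if a program memory address is 4-byte aligned. */
static int prg_addr_is_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}
```

With these definitions, an offset field of 0x1FFC decodes to -4, that is, one instruction backwards.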
Program memory access
store(addrs, src)
Operation: Stores register src to program memory at the address held in addrs. Address must be 4-byte aligned.
Latency: 4 cycles.
load(addrs, dest)
Operation: Loads from program memory at the address held in addrs into dest. Address must be 4-byte aligned.
Latency: 6 cycles.
System memory access
sysmem_write(addrs, src)
Operation: Performs an AXI write of register src to system memory at address in addrs.
Latency: 7+ cycles.
sysmem_read(addrs, dest)
Operation: Performs an AXI read from system memory at address in addrs and writes the returned value to dest.
Latency: 6+ cycles.
Internal register access
scanline_read(dest)
Operation: Reads the current scanline into dest. Values range from 0 to 524 (can include off-screen).
Latency: 4 cycles.
scanpixel_read(dest)
Operation: Reads the current pixel (X coordinate) into dest. Values range from 0 to 799 (can include off-screen).
Latency: 4 cycles.
Logic instructions
cmp(cmpflags, src1, src2)
Operation: Compares src1 and src2 using cmpflags, ORs compare results, and writes the 1-bit result into cmpreg for use by branch instructions.
Latency: 5 cycles.
rand(dest, src1, src2)
Operation: Bitwise AND of src1 and src2 into dest.
Latency: 4 cycles.
ror(dest, src1, src2)
Operation: Bitwise OR of src1 and src2 into dest. Despite the mnemonic, this is not a rotate.
Latency: 4 cycles.
rxor(dest, src1, src2)
Operation: Bitwise XOR of src1 and src2 into dest.
Latency: 4 cycles.
rasr(dest, src1, src2)
Operation: Arithmetic shift-right of src1 by src2 (lowest 5 bits used) into dest.
Latency: 4 cycles.
rshr(dest, src1, src2)
Operation: Logical shift-right of src1 by src2 (lowest 5 bits used) into dest.
Latency: 4 cycles.
rshl(dest, src1, src2)
Operation: Shift-left of src1 by src2 (lowest 5 bits used) into dest.
Latency: 4 cycles.
rneg(dest, src)
Operation: Bitwise negation of src into dest (equivalent to src ^ 0xFFFFFF).
Latency: 4 cycles.
rcmp(dest)
Operation: Reads cmpreg into dest (zero-extended).
Latency: 4 cycles.
lctl(dest)
Operation: Loads the VPU control register into the lower 8 bits of dest. This can be used to have CPU-written VPU state influence VCP program flow.
Latency: 4 cycles.
loadpc(dest)
Operation: Copies the next program counter (PC) value into dest, commonly used as a link-register style return target.
Latency: 4 cycles.
Other instructions
ldim(dest, immed)
Operation: Loads a 24-bit immediate value into register dest. (Alias: mvim(dest, imm).)
Latency: 4 cycles.
noop()
Operation: No operation.
Latency: 4 cycles.
mv(dest, src)
Operation: Copies src to dest (implemented as an add with VREG_ZERO).
Latency: 4 cycles.
clr(dest)
Operation: Assigns zero to dest.
Latency: 4 cycles.
Example Usage
// Build a program as uint32_t words using the vcp_* macros from vcp.h.
// The array length must match the chosen EVCPBufferSize (32 words for PRG_128Bytes).
struct SPPlatform* platform = SPInitPlatform();
uint32_t program[32] = {
    // Example: wait for scanline, then write a palette entry
    vcp_ldim(VREG_1, 100),       // r1 = 100
    vcp_wscn(VREG_1),            // wait for scanline 100
    vcp_ldim(VREG_2, 0),         // r2 = palette index
    vcp_ldim(VREG_3, 0x00FF00),  // r3 = color value
    vcp_pwrt(VREG_2, VREG_3),    // palette[r2] = r3
    vcp_noop(),
    // Remaining elements are zero-initialized; a zero word carries opcode 0x0 (NOP).
};
VCPUploadProgram(platform, program, PRG_128Bytes);
VCPExecProgram(platform, 0x1); // start
// ... later: VCPExecProgram(platform, 0x0); // stop
SPShutdownPlatform(platform);
Throughput
Per pixel clock
aclk / pixel clock = 166.667 MHz / 25 MHz = 6.67 aclk cycles per pixel
At 4 cycles/instruction: ~1.67 instructions per pixel
Per frame (full VGA 640x480 timing with blanking = 800x525 = 420,000 pixel clocks)
Frame rate: 25 MHz / 420,000 = ~59.52 Hz
aclk cycles per frame: 420,000 x 6.667 ~= 2,800,000 cycles
Instructions per frame: ~700,000
Per scanline (800 pixel clocks including HBlank)
aclk cycles per line: 800 x 6.667 = 5,333 cycles
Instructions per line: ~1,333
Theoretical peak throughput
At 166.667 MHz with one instruction retired every 4 clocks, the VCP theoretical maximum is approximately 41.67 MIPS.
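The budget arithmetic in this section can be reproduced with integer math. This is a sketch using the clock figures quoted above (aclk taken as 166,666,667 Hz) and the base latency of 4 cycles per instruction. The function names are illustrative, and results are rounded down.

```c
#include <stdint.h>

#define ACLK_HZ          166666667ULL /* VCP clock, from this section */
#define PIXEL_CLK_HZ      25000000ULL /* pixel clock, from this section */
#define PIXELS_PER_LINE        800ULL /* including HBlank */
#define LINES_PER_FRAME        525ULL /* including VBlank */
#define CYCLES_PER_INSTR         4ULL /* base pipeline latency */

/* aclk cycles that elapse during the given number of pixel clocks. */
static uint64_t aclk_cycles_for_pixels(uint64_t pixel_clocks)
{
    return (ACLK_HZ * pixel_clocks) / PIXEL_CLK_HZ;
}

/* Upper bound on base-latency instructions that fit in that many pixel clocks. */
static uint64_t instr_budget_for_pixels(uint64_t pixel_clocks)
{
    return aclk_cycles_for_pixels(pixel_clocks) / CYCLES_PER_INSTR;
}
```

Calling `instr_budget_for_pixels(PIXELS_PER_LINE)` gives 1,333 instructions per scanline, and `aclk_cycles_for_pixels(PIXELS_PER_LINE * LINES_PER_FRAME)` gives 2,800,000 cycles per frame. If HBlank is taken as the 160 pixel clocks left over from 640 visible pixels out of 800, the same helper gives a budget of 266 instructions per HBlank.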