Video Coprocessor - Sandpiper

VCP Overview

The Video Coprocessor (VCP) is a small programmable unit intended for scanline/pixel-synchronous control tasks such as palette updates and simple decision-making while video is active.

A VCP program is a sequence of 32-bit instructions. Programs are uploaded to the VCP’s internal program memory via DMA and can be started/stopped independently of the CPU.

Program Format

VCP programs are arrays of uint32_t where each word is one instruction. The SDK provides macros in vcp.h to build these instruction words.

EVCPBufferSize

enum EVCPBufferSize
{
    PRG_128Bytes  = 0,  //   32 words
    PRG_256Bytes  = 1,  //   64 words
    PRG_512Bytes  = 2,  //  128 words
    PRG_1024Bytes = 3,  //  256 words
    PRG_2048Bytes = 4,  //  512 words
    PRG_4096Bytes = 5,  // 1024 words
};

Programs must be exactly one of the sizes above. Pad with noop() instructions as needed.
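
The size rule can be sketched in plain C as follows. This is a minimal illustration, not SDK code: the helper names vcp_buffer_words, vcp_pick_buffer_size and vcp_pad_program are hypothetical, the enum is repeated only so the block compiles on its own, and the NOP word is passed in by the caller because the encoding produced by noop() is not given here.

```c
#include <stddef.h>
#include <stdint.h>

/* Repeated from the listing above so this sketch compiles on its own;
   real code would include vcp.h instead. */
enum EVCPBufferSize
{
    PRG_128Bytes  = 0,
    PRG_256Bytes  = 1,
    PRG_512Bytes  = 2,
    PRG_1024Bytes = 3,
    PRG_2048Bytes = 4,
    PRG_4096Bytes = 5,
};

/* Hypothetical helper: number of 32-bit words in a buffer of the given size. */
static size_t vcp_buffer_words(enum EVCPBufferSize size)
{
    return (size_t)32u << (unsigned)size;
}

/* Hypothetical helper: smallest buffer size that holds word_count words,
   or -1 if the program is larger than the biggest buffer. */
static int vcp_pick_buffer_size(size_t word_count)
{
    for (int s = PRG_128Bytes; s <= PRG_4096Bytes; ++s)
        if (word_count <= vcp_buffer_words((enum EVCPBufferSize)s))
            return s;
    return -1;
}

/* Hypothetical helper: copy word_count words from src into dst and fill the
   rest of the buffer with nop_word. dst must have room for
   vcp_buffer_words(size) words. Returns 0 on success, -1 if src does not fit. */
static int vcp_pad_program(uint32_t *dst, const uint32_t *src, size_t word_count,
                           enum EVCPBufferSize size, uint32_t nop_word)
{
    size_t total = vcp_buffer_words(size);
    if (word_count > total)
        return -1;
    for (size_t i = 0; i < word_count; ++i)
        dst[i] = src[i];
    for (size_t i = word_count; i < total; ++i)
        dst[i] = nop_word;
    return 0;
}
```

A caller would typically pick the size first, allocate vcp_buffer_words(size) words, pad, and then hand the padded array to the upload call.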

API Documentation

void VCPUploadProgram(struct SPPlatform *ctx, const uint32_t* _program, enum EVCPBufferSize size);

Uploads a VCP program to the device. The SDK copies the program to shared memory and queues a DMA transfer into the VCP’s internal program memory. Programs must be padded to match the selected EVCPBufferSize.

void VCPExecProgram(struct SPPlatform *ctx, const uint8_t _execFlags);

Starts or stops VCP execution according to _execFlags. Currently only the lowest bit is used (one VCP unit): set it to 1 to run and to 0 to stop. When stopped, the VCP halts after the current instruction fully executes.

uint32_t VCPStatus(struct SPPlatform *ctx);

Reads the VCP status register. The returned bit pattern contains execution state, run state, program counter (PC), FIFO/DMA flags, and the opcode at the current program address.

Status Register Bitfield

0000 OOOO 0CFP PPPP PPPP PPPP RRRR EEEE
E = execstate (execution state machine state, 4 bits)
R = runstate (high for running, low for stopped, 4 bits)
P = address of current instruction (PC, 13 bits)
F = high when command FIFO is not empty (1 bit)
C = program DMA in flight (1 bit)
O = opcode at program address (4 bits)
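
As an illustration of the layout above, the following sketch splits a status word into its fields. It assumes the leftmost character of the pattern is bit 31 and the rightmost is bit 0. The struct and function names are hypothetical and are not part of the SDK.

```c
#include <stdint.h>

/* Hypothetical container; field names follow the legend above. */
struct VCPStatusFields
{
    uint32_t execstate;      /* bits  3..0  (E) */
    uint32_t runstate;       /* bits  7..4  (R) */
    uint32_t pc;             /* bits 20..8  (P, 13 bits) */
    uint32_t fifo_not_empty; /* bit  21     (F) */
    uint32_t dma_in_flight;  /* bit  22     (C) */
    uint32_t opcode;         /* bits 27..24 (O) */
};

/* Split a raw status word into its fields using shifts and masks. */
static struct VCPStatusFields vcp_decode_status(uint32_t status)
{
    struct VCPStatusFields f;
    f.execstate      = status & 0xFu;
    f.runstate       = (status >> 4) & 0xFu;
    f.pc             = (status >> 8) & 0x1FFFu;
    f.fifo_not_empty = (status >> 21) & 0x1u;
    f.dma_in_flight  = (status >> 22) & 0x1u;
    f.opcode         = (status >> 24) & 0xFu;
    return f;
}
```

The value returned by VCPStatus can be passed straight to such a decoder, for example to wait until dma_in_flight reads 0 before starting a freshly uploaded program.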

Instruction Encoding Notes

The instruction macros in vcp.h encode destination/source registers and immediates into fixed bit positions. There are 16 general registers (VREG_ZERO, VREG_1 … VREG_F) and a 1-bit special compare register (cmpreg) used by compare/branch.

Program memory addresses used by jump/branch/load/store must be 4-byte aligned.
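
Because each instruction is one 32-bit word, the alignment rule means an address is the instruction's position times four. The sketch below illustrates that arithmetic. It assumes addresses are byte addresses and that the first program word sits at address 0; both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: byte address of the instruction at word_index,
   assuming the first program word is at address 0. */
static uint32_t vcp_word_to_addr(uint32_t word_index)
{
    return word_index * 4u;
}

/* Hypothetical helper: true if addr satisfies the 4-byte alignment rule. */
static bool vcp_addr_is_aligned(uint32_t addr)
{
    return (addr & 3u) == 0u;
}
```

Checking targets with a helper like this before loading them into a register catches misaligned addresses at build time rather than on the device.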

Compare Flags

// Base flags (combine with COND_INV to negate)
#define COND_LE   0x01
#define COND_LT   0x02
#define COND_EQ   0x04
#define COND_INV  0x08

// Derived convenience flags
#define COND_GT   (COND_LE | COND_INV)
#define COND_GE   (COND_LT | COND_INV)
#define COND_NE   (COND_EQ | COND_INV)

Only one of COND_LE/COND_LT/COND_EQ is considered (in that order), so prefer using a single base flag and optionally OR with COND_INV.
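
The selection rule described above can be modelled in software as follows. This is only a sketch of the rule as stated, not a description of the hardware: the function name is hypothetical, the operands are treated as unsigned (an assumption), and a flags value with no base flag is assumed to give 0 before inversion.

```c
#include <stdint.h>

/* Repeated from the listing above so this sketch compiles on its own. */
#define COND_LE   0x01
#define COND_LT   0x02
#define COND_EQ   0x04
#define COND_INV  0x08
#define COND_GT   (COND_LE | COND_INV)
#define COND_GE   (COND_LT | COND_INV)
#define COND_NE   (COND_EQ | COND_INV)

/* Hypothetical software model: pick the first base flag that is set
   (LE, then LT, then EQ), evaluate it, then apply COND_INV if present.
   Operands are compared as unsigned values, which is an assumption. */
static int vcp_model_cmp(unsigned flags, uint32_t a, uint32_t b)
{
    int result = 0;
    if (flags & COND_LE)
        result = (a <= b);
    else if (flags & COND_LT)
        result = (a < b);
    else if (flags & COND_EQ)
        result = (a == b);
    if (flags & COND_INV)
        result = !result;
    return result;
}
```

Under this model, combining COND_LE with COND_LT behaves exactly like COND_LE alone, which is why a single base flag is the clearer choice.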

Register Constants

#define VREG_ZERO 0x00
#define VREG_1    0x01
...
#define VREG_E    0x0E
#define VREG_F    0x0F

VCP Instruction Reference

Instruction Latencies

A + after a cycle count marks a variable-latency operation: the number shown is the minimum, and the instruction can stall in wait states until its condition or interface handshake completes.

Opcode	Mnemonic	Description	Cycles	Notes
0x0	NOP	No operation	4	Base pipeline
0x1	LOAD_IMM	Load 24-bit immediate into rd	4	Base pipeline
0x2	PAL_WRITE	Write palette entry	4	Base pipeline
0x3	SCANLINE_WAIT	Wait for scanline match	4+	Spins in EXEC until scanline == rs1; variable/unbounded
0x4	SCANPIXEL_WAIT	Wait for scan pixel match	4+	Spins in EXEC until scanpixel == rs1; variable/unbounded
0x5	MATHOP	ALU op (ADD/SUB/INC/DEC via imm8)	4	Base pipeline
0x6	JMP	Jump (direct or PC-relative)	4	Base pipeline; next fetch from new PC
0x7	CMP	Compare rs1 vs rs2, write result to cmpreg	5	+1 for FINALIZE_COMPARE state
0x8	BRANCH	Conditional branch on cmpreg	4	Base pipeline (taken or not)
0x9	MEM_WRITE	Write rs2 to program memory at rs1	4	Hijacks PC for the write port
0xA	MEM_READ	Read program memory at rs1 into rd	6	+2 for WAIT_READ -> FINALIZE_READ
0xB	READ_SCANINFO	Read scanline or scanpixel into rd	4	Base pipeline
0xC	LOADPC	Copy next PC into rd (link register)	4	Base pipeline
0xD	LOGICOP	Logic op (AND/OR/XOR/shifts/etc via imm8)	4	Base pipeline
0xE	SYSMEM_WRITE	AXI write to system memory	7+	+3 min for AW handshake -> W handshake -> B response; each wait state stalls until AXI ready/valid
0xF	SYSMEM_READ	AXI read from system memory	6+	+2 min for AR handshake -> R data; each wait state stalls until AXI ready/valid
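
The table can be used to estimate a lower bound on a program's running time. The sketch below is one way to do that: it copies the minimum cycle counts into an array indexed by opcode and sums them over an opcode sequence. The names are hypothetical, and the result is a lower bound only, since the entries marked + can take longer.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimum cycles per instruction, indexed by the 4-bit opcode from the
   table above (0x0 NOP ... 0xF SYSMEM_READ). */
static const uint8_t vcp_min_cycles[16] = {
    4, 4, 4, 4,   /* NOP, LOAD_IMM, PAL_WRITE, SCANLINE_WAIT */
    4, 4, 4, 5,   /* SCANPIXEL_WAIT, MATHOP, JMP, CMP */
    4, 4, 6, 4,   /* BRANCH, MEM_WRITE, MEM_READ, READ_SCANINFO */
    4, 4, 7, 6    /* LOADPC, LOGICOP, SYSMEM_WRITE, SYSMEM_READ */
};

/* Hypothetical helper: lower bound on cycles for a straight-line
   sequence of opcodes (no stalls, no loops). */
static uint32_t vcp_min_cycles_total(const uint8_t *opcodes, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += vcp_min_cycles[opcodes[i] & 0xFu];
    return total;
}
```

For instance, a compare followed by a branch costs at least 5 + 4 = 9 cycles under these numbers.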

Wait instructions

wscn(src)

Operation: Waits for a scanline that matches the contents of register src. Valid range is 0..524. Out-of-range values cause an infinite wait.

Latency: 4+ cycles.

wpix(src)

Operation: Waits for a pixel (X coordinate) that matches the contents of register src. Valid range is 0..799. Out-of-range values cause an infinite wait.

Latency: 4+ cycles.
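
Since an out-of-range value makes the VCP wait forever, it can help to validate wait targets on the CPU before building the program. The following sketch shows such a check using the ranges given above. The macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the limits stated above. */
#define VCP_SCANLINE_MAX  524u
#define VCP_SCANPIXEL_MAX 799u

/* True if a wscn target would eventually match. */
static bool vcp_scanline_in_range(uint32_t line)
{
    return line <= VCP_SCANLINE_MAX;
}

/* True if a wpix target would eventually match. */
static bool vcp_scanpixel_in_range(uint32_t x)
{
    return x <= VCP_SCANPIXEL_MAX;
}
```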

Color palette access

pwrt(addrs, src)

Operation: Writes the value of register src to the palette entry whose index is held in register addrs. Valid indices are 0..255, and consecutive entries are one index apart.

Latency: 4 cycles.

Arithmetic instructions

radd(dest, src1, src2)

Operation: Adds src2 to src1 and writes the result to dest. Overflow bits are discarded.

Latency: 4 cycles.

rsub(dest, src1, src2)

Operation: Subtracts src2 from src1 and writes the result to dest.

Latency: 4 cycles.

rinc(dest, src1)

Operation: Increments src1 by one and writes the result to dest.

Latency: 4 cycles.

rdec(dest, src1)

Operation: Decrements src1 by one and writes the result to dest.

Latency: 4 cycles.
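
The wrap-around behaviour of the arithmetic instructions can be modelled on the CPU as shown below. This sketch assumes registers are 24 bits wide, which is inferred from the 24-bit immediate and the 0xFFFFFF mask used elsewhere in this document rather than stated outright. It also assumes subtraction wraps the same way addition does. All names are hypothetical.

```c
#include <stdint.h>

#define VCP_REG_MASK 0xFFFFFFu /* assumption: registers are 24 bits wide */

/* Software models: compute in 32 bits, then discard the overflow bits. */
static uint32_t vcp_model_radd(uint32_t a, uint32_t b)
{
    return (a + b) & VCP_REG_MASK;
}

static uint32_t vcp_model_rsub(uint32_t a, uint32_t b)
{
    return (a - b) & VCP_REG_MASK;
}

static uint32_t vcp_model_rinc(uint32_t a)
{
    return vcp_model_radd(a, 1u);
}

static uint32_t vcp_model_rdec(uint32_t a)
{
    return vcp_model_rsub(a, 1u);
}
```

Under this model, incrementing 0xFFFFFF gives 0 and decrementing 0 gives 0xFFFFFF.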

Branch instructions

jump(addrs)

Operation: Unconditional jump to program memory address contained in register addrs. Address must be 4-byte aligned.

Latency: 4 cycles.

jumpim(offset)

Operation: Unconditional jump to PC-relative target (current PC plus offset). Offset is a 2’s complement (signed) 13-bit value; highest 3 bits are ignored. Target must be 4-byte aligned.

Latency: 4 cycles.

branch(addrs)

Operation: Conditional branch to address in addrs if cmpreg is nonzero. Address must be 4-byte aligned.

Latency: 4 cycles.

branchim(offset)

Operation: Conditional PC-relative branch if cmpreg is nonzero. Offset is a 2’s complement (signed) 13-bit value; highest 3 bits are ignored. Target must be 4-byte aligned.

Latency: 4 cycles.
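
The signed 13-bit offset used by jumpim and branchim can be illustrated with the following conversion helpers. They only show the two's complement arithmetic on the low 13 bits; how the SDK macros place the offset in the instruction word is not covered here. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: interpret the low 13 bits of raw as a signed
   two's complement value. Bits above bit 12 are ignored. */
static int32_t vcp_offset13_to_int(uint32_t raw)
{
    uint32_t low13 = raw & 0x1FFFu;
    /* Flip the sign bit, then subtract its weight to sign-extend. */
    return (int32_t)(low13 ^ 0x1000u) - 0x1000;
}

/* Hypothetical helper: encode a signed offset in the range
   -4096..4095 into its 13-bit field value. */
static uint32_t vcp_offset13_from_int(int32_t offset)
{
    return (uint32_t)offset & 0x1FFFu;
}
```

With these helpers, a backwards jump of one instruction (-4) encodes as 0x1FFC, and the reachable range is -4096 to +4095.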

Program memory access

store(addrs, src)

Operation: Stores register src to program memory at the address held in addrs. Address must be 4-byte aligned.

Latency: 4 cycles.

load(addrs, dest)

Operation: Loads from program memory at the address held in addrs into dest. Address must be 4-byte aligned.

Latency: 6 cycles.

System memory access

sysmem_write(addrs, src)

Operation: Performs an AXI write of register src to system memory at address in addrs.

Latency: 7+ cycles.

sysmem_read(addrs, dest)

Operation: Performs an AXI read from system memory at address in addrs and writes the returned value to dest.

Latency: 6+ cycles.

Internal register access

scanline_read(dest)

Operation: Reads the current scanline into dest. Values range from 0 to 524, which includes off-screen (blanking) lines.

Latency: 4 cycles.

scanpixel_read(dest)

Operation: Reads the current pixel (X coordinate) into dest. Values range from 0 to 799, which includes off-screen (blanking) positions.

Latency: 4 cycles.

Logic instructions

cmp(cmpflags, src1, src2)

Operation: Compares src1 and src2 using cmpflags, ORs compare results, and writes the 1-bit result into cmpreg for use by branch instructions.

Latency: 5 cycles.

rand(dest, src1, src2)

Operation: Bitwise AND of src1 and src2 into dest.

Latency: 4 cycles.

ror(dest, src1, src2)

Operation: Bitwise OR of src1 and src2 into dest. Despite the mnemonic, this is not a rotate.

Latency: 4 cycles.

rxor(dest, src1, src2)

Operation: Bitwise XOR of src1 and src2 into dest.

Latency: 4 cycles.

rasr(dest, src1, src2)

Operation: Arithmetic shift-right of src1 by src2 (lowest 5 bits used) into dest.

Latency: 4 cycles.

rshr(dest, src1, src2)

Operation: Logical shift-right of src1 by src2 (lowest 5 bits used) into dest.

Latency: 4 cycles.

rshl(dest, src1, src2)

Operation: Shift-left of src1 by src2 (lowest 5 bits used) into dest.

Latency: 4 cycles.

rneg(dest, src)

Operation: Bitwise NOT of src into dest (equivalent to src ^ 0xFFFFFF). This is not an arithmetic negation.

Latency: 4 cycles.
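
The shift and negate operations above can be checked against a small software model like the one below. It is a sketch under two assumptions: registers are 24 bits wide (suggested by the 0xFFFFFF mask) and the sign bit for the arithmetic shift is bit 23. The function names are hypothetical.

```c
#include <stdint.h>

#define VCP_REG_MASK 0xFFFFFFu /* assumption: registers are 24 bits wide */
#define VCP_SIGN_BIT 0x800000u /* assumption: bit 23 is the sign bit */

/* Bitwise NOT within the register width. */
static uint32_t vcp_model_rneg(uint32_t src)
{
    return (src ^ 0xFFFFFFu) & VCP_REG_MASK;
}

/* Logical shift right; only the lowest 5 bits of amount are used. */
static uint32_t vcp_model_rshr(uint32_t src, uint32_t amount)
{
    return (src & VCP_REG_MASK) >> (amount & 31u);
}

/* Shift left; bits shifted past the register width are dropped. */
static uint32_t vcp_model_rshl(uint32_t src, uint32_t amount)
{
    return ((src & VCP_REG_MASK) << (amount & 31u)) & VCP_REG_MASK;
}

/* Arithmetic shift right: vacated upper bits are filled with the sign bit. */
static uint32_t vcp_model_rasr(uint32_t src, uint32_t amount)
{
    uint32_t value = src & VCP_REG_MASK;
    uint32_t n = amount & 31u;
    uint32_t result = value >> n;
    if (value & VCP_SIGN_BIT)
    {
        uint32_t fill = (n >= 24u) ? VCP_REG_MASK
                                   : ((VCP_REG_MASK << (24u - n)) & VCP_REG_MASK);
        result |= fill;
    }
    return result;
}
```

Note that a shift amount of 32 behaves like a shift by 0 in this model, because only the lowest 5 bits are used.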

rcmp(dest)

Operation: Reads cmpreg into dest (zero-extended).

Latency: 4 cycles.

lctl(dest)

Operation: Loads the VPU control register into the lower 8 bits of dest. This can be used to have CPU-written VPU state influence VCP program flow.

Latency: 4 cycles.

loadpc(dest)

Operation: Copies the next program counter (PC) value into dest, commonly used as a link-register style return target.

Latency: 4 cycles.

Other instructions

ldim(dest, immed)

Operation: Loads a 24-bit immediate value into register dest. (Alias: mvim(dest, imm).)

Latency: 4 cycles.

noop()

Operation: No operation.

Latency: 4 cycles.

mv(dest, src)

Operation: Copies src to dest (implemented as an add with VREG_ZERO).

Latency: 4 cycles.

clr(dest)

Operation: Assigns zero to dest.

Latency: 4 cycles.

Example Usage

// Build a program as uint32_t words using the vcp_* macros from vcp.h.
// The array has 32 words so that it matches PRG_128Bytes.

struct SPPlatform* platform = SPInitPlatform();

uint32_t program[32] = {
    // Example: wait for scanline, then write a palette entry
    vcp_ldim(VREG_1, 100),      // r1 = 100
    vcp_wscn(VREG_1),           // wait for scanline 100

    vcp_ldim(VREG_2, 0),        // r2 = palette index
    vcp_ldim(VREG_3, 0x00FF00), // r3 = color value
    vcp_pwrt(VREG_2, VREG_3),   // palette[r2] = r3

    vcp_noop(),
    // Words not listed here are zero-initialized by C.
    // Add explicit vcp_noop() entries to pad with NOP instructions.
};

VCPUploadProgram(platform, program, PRG_128Bytes);
VCPExecProgram(platform, 0x1); // start

// ... later: VCPExecProgram(platform, 0x0); // stop

SPShutdownPlatform(platform);

Throughput

Per pixel clock

aclk / pixel clock = 166.667 MHz / 25 MHz = 6.67 aclk cycles per pixel
At 4 cycles/instruction: ~1.67 instructions per pixel

Per frame (full VGA 640x480 timing with blanking = 800x525 = 420,000 pixel clocks)

Frame rate: 25 MHz / 420,000 = ~59.52 Hz
aclk cycles per frame: 420,000 x 6.667 ~= 2,800,000 cycles
Instructions per frame: ~700,000

Per scanline (800 pixel clocks including HBlank)

aclk cycles per line: 800 x 6.667 = 5,333 cycles
Instructions per line: ~1,333

Theoretical peak throughput

At 166.667 MHz with one instruction retired every 4 clocks, the VCP theoretical maximum is approximately 41.67 MIPS.
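
The budget arithmetic in this section can be reproduced with integer math as sketched below. The clock values and the 4 cycles per instruction come from the figures above; the macro and function names are hypothetical. Results are rounded down, and they are upper bounds because slower or stalling instructions reduce the count.

```c
#include <stdint.h>

/* Values taken from this section; the names are hypothetical. */
#define VCP_ACLK_HZ          166666667ull
#define VCP_PIXCLK_HZ         25000000ull
#define VCP_CYCLES_PER_INSTR         4ull

/* aclk cycles that elapse during the given number of pixel clocks. */
static uint64_t vcp_cycles_for_pixels(uint64_t pixel_clocks)
{
    return pixel_clocks * VCP_ACLK_HZ / VCP_PIXCLK_HZ;
}

/* Upper bound on base-latency instructions that fit in that time. */
static uint64_t vcp_instr_budget(uint64_t pixel_clocks)
{
    return vcp_cycles_for_pixels(pixel_clocks) / VCP_CYCLES_PER_INSTR;
}
```

For example, the horizontal blanking interval is 800 - 640 = 160 pixel clocks, so vcp_instr_budget(160) gives 266 instructions at most for per-line palette updates that must finish before the next visible pixel.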