
MatMul Proof-of-Work

Specification of the BTX MatMul proof-of-work consensus algorithm: finite-field matrix multiplication, transcript compression, Freivalds verification, and DoS mitigation.

Overview

BTX replaces traditional hash-based proof-of-work with MatMul PoW, an AI-native proof-of-work system based on matrix multiplication over a finite field. The protocol, derived from the cuPOW construction (Komargodski, Schen, Weinstein 2025), achieves a multiplicative overhead of only ~16.5% above a bare matrix multiply at the production parameters. Miners perform real linear-algebra computation rather than brute-force hashing, making the work inherently useful as a GPU compute benchmark and a stepping stone toward externally useful matrix workloads in a future v2.

Consensus Parameters

| Parameter | Symbol | Value | Role |
| --- | --- | --- | --- |
| Matrix dimension | n | 512 | Compute cost O(n³); ~134M field multiplications per attempt |
| Transcript block size | b | 16 | Hashing granularity; (n/b)³ = 32,768 intermediates per attempt |
| Noise rank | r | 8 | Security parameter; denoising cost O(n²r) |
| Field modulus | q | 2³¹ − 1 (2,147,483,647) | Mersenne prime M31; GPU-friendly int32 arithmetic |
| Pre-hash epsilon bits | ε | 10 (18 after height 55,000) | Sigma gate: pre-hash must beat the relaxed target (target ≪ ε) before the expensive MatMul path |
| Freivalds rounds | k | 2 | False-positive probability < 2⁻⁶² |
| Validation window | — | 1,000 blocks | Depth beyond which full Phase 2 revalidation is not required |
| Block time | — | 90 s (steady-state); 250 ms (fast-mine, h < 50,000) | Target spacing for difficulty adjustment |

Note: b (transcript block size) and r (noise rank) are independent parameters. Changing b affects transcript size and cache behavior; changing r affects security conjecture strength and denoising overhead.

Finite-Field Arithmetic (FM31)

All arithmetic operates in the Mersenne prime field Fq where q = 2³¹ − 1. This prime was chosen for GPU friendliness: reduction is a single add-and-mask on int32 hardware, and the field supports native Apple Metal and CUDA int32 pipelines.

Reduction

The reduce64 function uses a double Mersenne fold, proven safe for all uint64 inputs (the single-fold variant is buggy for x ≥ 2⁶²). The dot() accumulator is the only sanctioned path, applying per-step reduction with a length-independent safety proof.
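The fold can be sketched as follows. Function names mirror those above; the exact canonicalization step (mapping x = q to 0) is an assumption, as is the per-step bound annotation:

```python
M31 = (1 << 31) - 1  # q = 2^31 - 1

def reduce64(x: int) -> int:
    """Double Mersenne fold, safe for any uint64 input.
    A single fold can leave an unreduced value when x >= 2^62."""
    x = (x & M31) + (x >> 31)          # after fold 1: x < 2^34
    x = (x & M31) + (x >> 31)          # after fold 2: x < 2^31 + 8
    return x - M31 if x >= M31 else x  # canonical representative in [0, q)

def dot(a, b) -> int:
    """Inner product with per-step reduction: acc < 2^31 and each
    product < 2^62, so acc + x*y always fits in 64 bits regardless
    of vector length (the length-independent safety argument)."""
    acc = 0
    for x, y in zip(a, b):
        acc = reduce64(acc + x * y)
    return acc
```

The second fold is what rescues inputs near 2⁶⁴: the first fold only guarantees a value below 2³⁴, which a single-fold implementation would return unreduced.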

Matrix Generation from Seeds

Matrices A and B are deterministically expanded from 32-byte seeds carried in the block header. The expansion uses SHA-256 as a PRF with LE32 counters and rejection sampling for uniform distribution mod M31. Domain separation tags prevent cross-component collisions:

  • matmul_noise_EL_v1 — left noise factor EL
  • matmul_noise_ER_v1 — right noise factor ER
  • matmul_noise_FL_v1 — left noise factor FL
  • matmul_noise_FR_v1 — right noise factor FR
  • matmul_compress_v1 — compression vector σ
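A minimal sketch of the seed expansion, assuming the PRF input layout is tag ‖ seed ‖ LE32 counter with 4-byte little-endian words drawn from each digest; the exact concatenation order and word extraction are assumptions, but the rejection-sampling rule (accept only x < M31 after masking to 31 bits) follows the text:

```python
import hashlib

M31 = (1 << 31) - 1

def expand_field_elements(seed: bytes, tag: bytes, count: int):
    """Deterministically expand `count` uniform elements of F_q
    from a 32-byte seed under a domain-separation tag."""
    out, ctr = [], 0
    while len(out) < count:
        digest = hashlib.sha256(tag + seed + ctr.to_bytes(4, "little")).digest()
        ctr += 1
        for i in range(0, 32, 4):
            x = int.from_bytes(digest[i:i + 4], "little") & M31
            if x < M31:          # rejection sampling: x == M31 is discarded
                out.append(x)
                if len(out) == count:
                    break
    return out

# e.g. the left noise factor EL would use n*r elements:
# EL_entries = expand_field_elements(seed_sigma, b"matmul_noise_EL_v1", n * r)
```

Rejection is nearly free here: only the single value x = M31 (probability ~2⁻³¹ per word) is discarded, so each digest yields almost exactly eight elements.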

Mining Algorithm (Solve)

The mining loop for a single nonce attempt proceeds as follows:

  1. Sigma gate (pre-hash): Compute a cheap SHA-256 pre-hash from the candidate header. If the pre-hash does not beat the relaxed target (target ≪ ε), skip the expensive MatMul path entirely. This filters out ~99.9% of nonces at ε = 10.
  2. Noise generation: Derive four low-rank factor matrices from the oracle seed σ: EL ∈ Fq^(n×r), ER ∈ Fq^(r×n), FL ∈ Fq^(n×r), FR ∈ Fq^(r×n). Form E = EL·ER and F = FL·FR (each of rank ≤ r).
  3. Noisy MatMul: Compute C' = (A + E)·(B + F) using the canonical block decomposition with block size b=16. The computation iterates over (n/b)³ = 32,768 intermediate b×b partial-sum blocks in a fixed order (i, j, l).
  4. Transcript compression: Each intermediate block is compressed to a single field element via a random inner-product derived from σ. The compressed elements are fed into a rolling SHA-256d hasher, reducing hashed data from ~33.5 MB (naive) to ~131 KB.
  5. Difficulty check: If H(compressed transcript) < target, the nonce is valid.
  6. Denoise and emit: Compute C = C' − A·F − E·(B + F), which algebraically recovers the clean product A·B, then emit the block with matmul_digest, seed_a, seed_b, and the product matrix C'.
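The noise/denoise algebra can be checked at toy scale. This sketch (pure Python, n = 8, r = 2, with PRNG stand-ins for the seed-derived matrices) verifies that C' − A·F − E·(B + F) recovers A·B, since C' = (A+E)(B+F) = A·B + A·F + E·(B+F):

```python
import random

M31 = (1 << 31) - 1
n, r = 8, 2   # toy sizes; production uses n = 512, r = 8
rnd = random.Random(0)

def mat(rows, cols):
    return [[rnd.randrange(M31) for _ in range(cols)] for _ in range(rows)]

def mm(X, Y):   # matrix product mod q
    return [[sum(x * y for x, y in zip(row, col)) % M31
             for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[(x + y) % M31 for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[(x - y) % M31 for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A, B = mat(n, n), mat(n, n)
EL, ER, FL, FR = mat(n, r), mat(r, n), mat(n, r), mat(r, n)
E, F = mm(EL, ER), mm(FL, FR)                  # rank <= r noise terms
Cp = mm(add(A, E), add(B, F))                  # noisy product C'
C = sub(sub(Cp, mm(A, F)), mm(E, add(B, F)))   # C' - A·F - E·(B+F)
assert C == mm(A, B)                           # denoising recovers A·B
```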

Transcript Compression

Naive full-block hashing would require SHA-256 over ~33.5 MB per nonce attempt, adding 30–100% overhead on GPU miners. BTX uses field-algebraic inner-product compression to reduce this by ~256×:

| Approach | Bytes Hashed | SHA-256 Time (SHA-NI) | Overhead vs MatMul |
| --- | --- | --- | --- |
| Naive full-block | 33.5 MB | ~67 ms | 30–100% |
| Field-algebraic compression | ~131 KB | ~0.26 ms | < 1% |

Security is preserved: the random inner-product family is pairwise independent, so forging a different b×b block that yields the same compressed element requires guessing a field element (success probability 1/q ≈ 2⁻³¹ per intermediate). A single σ-derived compression vector is reused across all 32,768 intermediates; Carter–Wegman-style binding of each intermediate, combined with the unpredictability of σ, makes per-intermediate vectors unnecessary.
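A sketch of the inner-product compression for one intermediate block, using a PRNG stand-in for the σ-derived vector (the real derivation uses the matmul_compress_v1 tag, elided here). Any single-entry forgery shifts the compressed element by a fixed multiple of the corresponding σ component, so it collides only when that component is zero, probability 1/q:

```python
import random

M31 = (1 << 31) - 1
b = 16

rnd = random.Random(42)   # stand-in for the sigma-derived expansion
sigma = [rnd.randrange(M31) for _ in range(b * b)]

def compress(block):
    """Compress a b x b intermediate block to one field element
    via the inner product with the sigma-derived vector."""
    flat = [x for row in block for x in row]
    return sum(x * s for x, s in zip(flat, sigma)) % M31

block = [[rnd.randrange(M31) for _ in range(b)] for _ in range(b)]
tampered = [row[:] for row in block]
tampered[0][0] = (tampered[0][0] + 1) % M31
# Adding 1 to entry (0,0) shifts the compressed element by exactly
# sigma[0] mod q, so the forgery is caught unless sigma[0] == 0.
```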

On GPU architectures, the compression dot-product is fused into the matmul kernel to avoid the ~33.5 MiB data transfer penalty. Compressed elements (~128 KiB total) are written to a ring buffer and consumed by a CPU SHA-256 thread.

Verification (Freivalds' Algorithm)

Full verification requires recomputing the entire O(n³) transcript — the dominant engineering constraint. BTX uses a two-phase validation model:

Phase 1: Header Checks (O(1))

  • Validate header format, timestamp, and nBits target.
  • Verify matmul_digest matches the claimed SHA-256d of the compressed transcript.
  • Check proof-of-work: H(header) < target.

Phase 2: Transcript Recomputation (O(n³))

  • Reconstruct A, B from seeds. Regenerate noise. Recompute the canonical block MatMul.
  • Verify the compressed transcript matches matmul_digest.
  • Apply Freivalds' algorithm as a fast product check: verify (A+E)·((B+F)·r) = C'·r for k=2 random vectors r. Each round has false-positive probability 1/(2³¹ − 1); with k=2, error < 2⁻⁶².
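Freivalds' check reduces each round to matrix-vector products instead of a full matrix multiply. A hedged sketch (the function name and interface are illustrative, not from the BTX codebase):

```python
import random

M31 = (1 << 31) - 1

def mat_vec(M, v):
    return [sum(x * y for x, y in zip(row, v)) % M31 for row in M]

def freivalds_check(AE, BF, Cp, k=2, rng=None):
    """Accept C' iff (A+E)·((B+F)·r) == C'·r for k random vectors r.
    Each round costs O(n^2) matrix-vector work; a wrong C' survives
    a single round with probability at most 1/q."""
    rng = rng or random.Random()
    n = len(Cp)
    for _ in range(k):
        r = [rng.randrange(M31) for _ in range(n)]
        if mat_vec(AE, mat_vec(BF, r)) != mat_vec(Cp, r):
            return False
    return True
```

The parenthesization matters: computing (B+F)·r first keeps every step at O(n²), which is why the check is so much cheaper than recomputing the product.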

Graduated Peer Punishment

A Phase 1 pass followed by a Phase 2 failure triggers a graduated response: disconnect → discourage → ban (after the 3rd offense in 24 h). The fMatMulStrictPunishment flag enables immediate bans for post-stabilization networks.
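The escalation ladder might be tracked with a 24-hour sliding window, roughly as below; the class and method names are hypothetical, not from the BTX codebase:

```python
import time
from collections import defaultdict, deque

WINDOW = 24 * 3600   # offenses expire after 24 hours

class PeerPunisher:
    """Sketch of the disconnect -> discourage -> ban escalation."""
    def __init__(self, strict: bool = False):
        self.strict = strict                 # cf. fMatMulStrictPunishment
        self.offenses = defaultdict(deque)   # peer_id -> offense timestamps

    def record_offense(self, peer_id, now=None):
        now = time.time() if now is None else now
        q = self.offenses[peer_id]
        q.append(now)
        while q and q[0] < now - WINDOW:     # drop offenses older than 24 h
            q.popleft()
        if self.strict:                      # strict mode: ban immediately
            return "ban"
        return ("disconnect", "discourage", "ban")[min(len(q), 3) - 1]
```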

DoS Mitigation

Phase 2 verification at n=512 costs approximately 5 ms per block. Rate limits bound CPU exposure:

  • Per-peer budget: 32 Phase 2 verifications per minute per peer.
  • Global budget: 512 Phase 2 verifications per minute across all peers (~2.56 s CPU/min).
  • Validation window: Blocks deeper than 1,000 from the tip skip Phase 2 (assumevalid model for IBD).
  • Fast-phase scheduling: During the 250 ms fast-mine phase, Phase 2 is deferred to a bounded queue that must drain completely after the transition to 90 s blocks.
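The per-peer and global budgets above can be enforced with sliding one-minute windows; this is an illustrative sketch, not the node's actual mechanism (which might equally use token buckets):

```python
import time
from collections import defaultdict, deque

PER_PEER_PER_MIN = 32    # Phase 2 verifications per peer per minute
GLOBAL_PER_MIN = 512     # across all peers (~2.56 s CPU/min at 5 ms each)

class Phase2Budget:
    """Sliding one-minute window over Phase 2 verification requests."""
    def __init__(self):
        self.per_peer = defaultdict(deque)
        self.global_q = deque()

    def _prune(self, q, now):
        while q and q[0] <= now - 60:        # drop entries older than 60 s
            q.popleft()

    def allow(self, peer_id, now=None):
        now = time.time() if now is None else now
        peer_q = self.per_peer[peer_id]
        self._prune(peer_q, now)
        self._prune(self.global_q, now)
        if len(peer_q) >= PER_PEER_PER_MIN or len(self.global_q) >= GLOBAL_PER_MIN:
            return False                     # budget exhausted: defer or drop
        peer_q.append(now)
        self.global_q.append(now)
        return True
```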

Block Header Fields

The MatMul block header extends the standard Bitcoin header with:

| Field | Size | Description |
| --- | --- | --- |
| nNonce64 | 8 bytes | 64-bit mining nonce |
| matmul_digest | 32 bytes | SHA-256d of compressed transcript |
| matmul_dim | 2 bytes | Matrix dimension (512) |
| seed_a | 32 bytes | Deterministic seed for matrix A |
| seed_b | 32 bytes | Deterministic seed for matrix B |

Block size with seed-derived matrices is approximately 200 KiB, compared to ~2.2 MiB if full matrices were included. All serialization uses little-endian byte order; big-endian architectures are unsupported.
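A sketch of packing the extension fields little-endian, assuming they are laid out in the order of the table above; their position relative to the standard Bitcoin header fields, and the function name, are assumptions:

```python
import struct

def pack_matmul_extension(n_nonce: int, matmul_digest: bytes,
                          matmul_dim: int, seed_a: bytes,
                          seed_b: bytes) -> bytes:
    """Serialize the MatMul header extension little-endian:
    8-byte nonce, 32-byte digest, 2-byte dimension, two 32-byte seeds."""
    assert len(matmul_digest) == len(seed_a) == len(seed_b) == 32
    return (struct.pack("<Q", n_nonce)       # 8 bytes
            + matmul_digest                  # 32 bytes
            + struct.pack("<H", matmul_dim)  # 2 bytes, value 512
            + seed_a                         # 32 bytes
            + seed_b)                        # 32 bytes -> 106 bytes total
```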

Overhead Analysis

At production parameters (n=512, b=16, r=8), total protocol overhead is ~16.5% above a bare n×n matrix multiply:

| Component | Cost | % of Baseline |
| --- | --- | --- |
| Noisy MatMul (A+E)·(B+F) | 134M muls | baseline |
| Transcript compression dot-products | 8.4M muls | 6.3% |
| Denoising (rank-r factor products) | 8.4M muls | 6.3% |
| Noise generation + matrix additions | 4.7M ops | 3.5% |
| Rolling SHA-256d on compressed elements | 131 KB hashed | < 0.1% |
| Total overhead | ~22M muls | ~16.5% |

On GPUs where the compression dot-product is fused into the matmul kernel, measured total overhead drops to ~10–12%.
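The headline figures can be rechecked from the parameters: the compression count follows from (n/b)³ intermediates at b² multiplications each, and denoising from the four n²r rank-r factor products (A·FL, (A·FL)·FR, ER·(B+F), EL·(…)). The noise-generation figure is taken directly from the table:

```python
n, b, r = 512, 16, 8

baseline      = n ** 3                 # noisy MatMul: 134,217,728 field muls
compress_muls = (n // b) ** 3 * b * b  # one b*b inner product per intermediate
denoise_muls  = 4 * n * n * r          # A·F and E·(B+F) via rank-r factors
noise_ops     = 4_700_000              # noise generation + additions (table figure)

overhead = compress_muls + denoise_muls + noise_ops
ratio = overhead / baseline            # ~0.16, consistent with the ~16.5% headline
```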

Difficulty Adjustment Interaction

MatMul PoW difficulty is governed by the ASERT algorithm (see the Difficulty Adjustment specification). The sigma pre-hash gate interacts with the difficulty target: at higher difficulty, the pre-hash filter rejects more nonces before the expensive MatMul path, keeping average GPU utilization efficient. The pre-hash epsilon upgrade at height 55,000 (from 10 to 18 bits) further reduces wasted compute as the chain matures.

GPU Optimization

The field choice (M31 = 2³¹ − 1) was selected specifically for GPU friendliness. Key optimization points:

  • M31 reduction is a single add-and-mask on int32 ALUs, native on Apple Metal and CUDA.
  • The compression dot-product must be fused into the matmul kernel to avoid a ~33.5 MiB device-to-host transfer per attempt.
  • Compressed elements (~128 KiB) are written to a ring buffer consumed asynchronously by a CPU SHA-256 thread.
  • The 512×512 working set (~2 MiB) fits comfortably in GPU L2 cache on modern hardware.
  • Denoising runs only once per found block (not per attempt), so its O(n²r) cost is negligible in practice.

Security Properties

  • Completeness: An honest miner performing the full MatMul always produces an accepted block once a nonce meets the target; the per-attempt success probability is determined by the difficulty target.
  • Hardness: Under the direct-product conjecture for random rank-r matrix products, no adversary gains meaningful advantage with less than a full matrix multiplication.
  • Collision resistance: Per-intermediate collision probability is 2⁻³¹; a union bound over the 32,768 intermediates per block gives ≤ 2⁻¹⁶. A dual-projection upgrade path exists for 2⁻⁶² per intermediate if future risk assessment warrants.
  • ASIC resistance: ASICs are not cryptographically impossible but are economically misaligned with commodity AI/GPU hardware. The MatMul workload is identical to standard GEMM operations that AI accelerators are optimized for.

Monetary Policy

ParameterValue
Total supply21,000,000 BTX
Initial block subsidy20 BTX
Halving interval525,000 blocks
PremineNone
Fast-mine emission (blocks 0–49,999)1,000,000 BTX (4.76% of supply)

The issuance follows an infinite geometric series: 20 BTX × 525,000 blocks × (1 + 1/2 + 1/4 + …) = 20 × 525,000 × 2 = 21M. MoneyRange() and GetBlockSubsidy() enforce the hard cap at consensus level.
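The cap can be rechecked with a subsidy function in the style of GetBlockSubsidy(). The 8-decimal COIN unit is an assumption (the spec above does not state the subsidy granularity); with integer truncation the emitted total lands just under the 21M cap rather than exactly on it:

```python
COIN = 100_000_000          # assumed 8 decimal places, as in Bitcoin
HALVING_INTERVAL = 525_000
INITIAL_SUBSIDY = 20 * COIN

def block_subsidy(height: int) -> int:
    """Subsidy halves every 525,000 blocks; the integer right shift
    truncates to zero after 31 halvings, keeping supply finite."""
    halvings = height // HALVING_INTERVAL
    return 0 if halvings >= 64 else INITIAL_SUBSIDY >> halvings

# Sum the full emission schedule: one constant subsidy per halving era.
total = sum(HALVING_INTERVAL * (INITIAL_SUBSIDY >> k) for k in range(64))
assert total <= 21_000_000 * COIN
```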