Built on RISC-V: The Open Standard ISA

RISC-V represents a fundamental shift in processor architecture. Unlike proprietary instruction sets controlled by single vendors, RISC-V is developed through an open, collaborative process. This openness enables SABZEH to optimize at the deepest level while maintaining complete transparency with our customers.

The base RISC-V specification provides a clean, modular foundation. We extend this with custom instructions specifically designed for neural network operations, eliminating the overhead of general-purpose architectures that were never intended for AI workloads.

Zero Licensing Fees

No per-chip royalties or annual licensing costs that inflate your total cost of ownership.

Full Auditability

Every instruction, every register, every pipeline stage is documented and verifiable.

Custom Extensions

Add domain-specific instructions without breaking compatibility or waiting for vendor approval.

Growing Ecosystem

Major semiconductor companies, governments, and research institutions are adopting RISC-V.

Base ISA: RV64GC - Standard RISC-V
Vector: RVV 1.0 - Vector Extension
Matrix: SABZEH Matrix Extension
Tensor: SABZEH Tensor Core ISA
AI Ops: Attention, Softmax, LayerNorm

Three Pillars of AI Performance

01

Tensor Processing Units

Each NOVA processor contains hundreds of tensor cores designed specifically for matrix multiplication, the fundamental operation in neural networks. These units achieve near-theoretical peak throughput by eliminating the instruction decode and scheduling overhead present in general-purpose cores. Native support for INT8, FP16, BF16, and FP32 precision allows optimal precision selection for each layer of your model.
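The low-precision path can be sketched in a few lines of numpy: multiply at narrow precision, accumulate in a wider integer type, then rescale once. This is an illustrative sketch of the general INT8 pattern such units use, not SABZEH's implementation; `quantize_int8` and `int8_matmul` are hypothetical names.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: int8 values plus one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    # Integer products accumulate in a wide int32 accumulator; a single
    # rescale at the end returns the result to floating point.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
max_err = np.abs(int8_matmul(a, b) - a @ b).max()
```

The wide accumulator is the key design choice: quantization error enters only at the inputs, not at every one of the 64 additions in each dot product.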

02

Intelligent Memory Hierarchy

Neural networks exhibit access patterns fundamentally different from traditional applications. Our memory subsystem is designed around these patterns, with large on-chip SRAM buffers that hold entire transformer attention matrices, eliminating costly off-chip memory accesses. A novel prefetch engine predicts data requirements based on network topology, ensuring compute units are never starved.

03

Scalable Interconnect

Modern AI workloads increasingly require multi-chip configurations. Our SerDes-based interconnect provides 800 Gbps of bandwidth between chips, with latency measured in nanoseconds. Collective operations such as AllReduce are implemented in dedicated hardware, achieving near-linear scaling efficiency up to thousands of chips in distributed training scenarios.
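The schedule such collective hardware typically implements is the bandwidth-optimal ring AllReduce: 2(N-1) steps, each moving 1/N of the data, so per-link traffic stays flat as the ring grows. A pure-Python model of that schedule (illustrative only, not SABZEH firmware):

```python
def ring_allreduce(node_chunks):
    """node_chunks[i][j] = chunk j held by node i (scalars for simplicity)."""
    n = len(node_chunks)
    data = [list(c) for c in node_chunks]
    # Reduce-scatter phase: in step s, node i sends chunk (i - s) mod n to
    # its right neighbour, which accumulates it. After n-1 steps node i
    # owns the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] += val
    # All-gather phase: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data

# Three nodes, three chunks each; every node ends with the elementwise sum.
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

The `sends` snapshot models the fact that all links transfer simultaneously within a step; near-linear scaling follows because each step's traffic per link is constant regardless of N.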

Purpose-Built Instructions for AI

Matrix Extension

Beyond standard RISC-V vector operations, our Matrix Extension adds native support for 2D tile operations. A single instruction can compute an 8x8 or 16x16 matrix multiply-accumulate, dramatically reducing instruction overhead for convolution and attention layers.

Native GEMM support up to 16x16 tiles
Fused multiply-add with accumulator preservation
Automatic precision conversion between tile operations
Register blocking for complex matrix expressions
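The tile-level execution model can be sketched as follows, with each inner step standing in for one 16x16 multiply-accumulate instruction and the accumulator tile preserved across the reduction loop. A numpy illustration of the pattern, not the actual ISA:

```python
import numpy as np

TILE = 16  # the largest tile size the Matrix Extension describes

def tiled_gemm(a, b):
    """GEMM expressed as 16x16 tile operations (dims must be multiples of TILE)."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            # Accumulator tile is preserved across the k loop, mirroring
            # fused multiply-add with accumulator preservation.
            acc = np.zeros((TILE, TILE), dtype=np.float32)
            for p in range(0, k, TILE):
                # One "instruction": a 16x16 matrix multiply-accumulate.
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc
    return c

rng = np.random.default_rng(1)
a = rng.standard_normal((32, 48)).astype(np.float32)
b = rng.standard_normal((48, 32)).astype(np.float32)
```

A convolution or attention layer lowered this way issues one instruction per tile instead of hundreds of scalar or vector operations, which is where the instruction-overhead reduction comes from.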

Attention Accelerator

Transformer models spend the majority of compute time in attention mechanisms. We've implemented FlashAttention-style algorithms directly in hardware, computing attention scores with optimal memory access patterns and no intermediate materialization.

Fused QKV projection and attention computation
Online softmax with numerical stability
Multi-head attention parallelism
KV-cache optimization for autoregressive generation
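The online softmax mentioned above can be sketched in a few lines: a single streaming pass keeps a running maximum and a rescaled running sum, so the full score row never has to be materialized at once. Illustrative numpy, not the hardware datapath:

```python
import numpy as np

def online_softmax(scores):
    """Numerically stable softmax via one streaming pass over the scores."""
    m, s = -np.inf, 0.0   # running max, running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # Rescale the old sum whenever the running max changes.
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return np.exp(np.asarray(scores) - m) / s
```

Subtracting the running max keeps every exponent non-positive, so even scores in the thousands never overflow, and the incremental rescaling is exactly what lets FlashAttention-style hardware process attention in blocks.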

Activation Functions

Non-linear activation functions are ubiquitous in neural networks but expensive to compute. Dedicated hardware units implement GELU, SiLU, Softmax, and LayerNorm with single-cycle throughput, eliminating the performance penalty of these operations.

Single-cycle GELU and SiLU computation
Fused LayerNorm with residual addition
RoPE positional encoding in hardware
Group normalization for vision models
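For reference, the functions these units evaluate are compact. A numpy sketch of two of them (the tanh approximation of GELU is assumed here, one of the variants commonly chosen for hardware; the source doesn't state which form SABZEH implements):

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    """SiLU (a.k.a. Swish): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))
```

In software each call costs an exponential or hyperbolic tangent per element; a dedicated unit evaluates the whole expression at one result per cycle, which is the penalty being eliminated.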

Sparsity Engine

Production models are often pruned to reduce computational requirements. Our sparsity engine detects and skips zero values at runtime, achieving effective throughput improvements of up to 4x for appropriately structured sparse models.

Runtime zero detection with minimal overhead
Support for structured and unstructured sparsity
Compressed sparse tensor formats
Dynamic sparsity for activation tensors
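One common structured format is 2:4 sparsity, where at most two of every four consecutive weights are nonzero, a pattern hardware can skip deterministically. A sketch of 2:4 pruning (illustrative; the source doesn't specify which structured formats the engine supports):

```python
import numpy as np

def prune_2_4(w):
    """2:4 structured pruning: per group of four weights, keep the two
    largest magnitudes and zero the other two."""
    g = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(g), axis=1)[:, :2]  # two smallest per group
    np.put_along_axis(g, drop, 0.0, axis=1)
    return g.reshape(w.shape)

w = np.array([1.0, -5.0, 0.1, 2.0, 3.0, 3.5, -4.0, 0.0])
pruned = prune_2_4(w)
```

The fixed 2-of-4 structure is what makes runtime skipping cheap: the hardware knows in advance that each group contributes at most two products, so the schedule (and the up-to-2x gain per pruned operand) is deterministic rather than data-dependent.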

Designed for Neural Network Access Patterns

Traditional cache hierarchies optimize for temporal and spatial locality, assumptions that don't hold for neural networks: weights are accessed once per inference, activations flow forward through layers, and attention patterns span the entire sequence length.

Our memory system is purpose-built for these patterns. Large on-chip SRAM buffers can hold entire layer weights, eliminating repeated off-chip fetches. A software-managed scratchpad allows explicit control over data placement, enabling optimal tiling strategies for large models.

The result is 85% sustained memory bandwidth utilization compared to 40-50% typical in conventional GPU architectures. This efficiency directly translates to higher throughput and lower latency for your AI workloads.
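As a back-of-envelope check of what those figures mean at the quoted 6.4 TB/s peak (taking 45% as the midpoint of the 40-50% range for conventional GPUs):

```python
# Sustained bandwidth implied by each utilization figure, at the quoted peak.
peak = 6.4                    # TB/s, peak HBM3 bandwidth
sustained = peak * 0.85       # TB/s delivered at 85% utilization
conventional = peak * 0.45    # TB/s at a typical 45% utilization
advantage = sustained / conventional   # ~1.9x more delivered bandwidth
```

In other words, the same physical memory system delivers roughly 5.4 TB/s here versus under 3 TB/s at conventional utilization, before any difference in peak bandwidth is considered.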

85% Bandwidth Utilization
128MB On-Chip SRAM
6.4 TB/s HBM3 Bandwidth
256GB Maximum HBM Capacity
Registers: 32KB per core
L1 Cache: 256KB per cluster
Shared SRAM: 128MB total
HBM3: 96-256GB capacity

Transparency and Trust at the Silicon Level

Hardware Root of Trust

Secure boot chain validates firmware integrity from power-on through application launch. Tamper detection mechanisms protect against physical attacks on the device.

Memory Encryption

AES-256 encryption protects data at rest in HBM. Memory controllers perform encryption and decryption transparently with minimal performance impact.

Secure Enclaves

Isolated execution environments protect sensitive model weights and inference data from other workloads sharing the same processor.

Full Auditability

Complete RTL documentation enables security verification by your team or third-party auditors. No black boxes or hidden functionality.

Production-Ready Software Stack

Hardware performance means nothing without software that can utilize it. SABZEH provides a complete software stack from low-level drivers through high-level framework integrations.

Our compiler automatically maps standard model representations to SABZEH hardware, optimizing for data layout, operation fusion, and memory placement. Framework integrations for PyTorch and TensorFlow allow you to run existing models without modification.

Frameworks: PyTorch, TensorFlow, ONNX
Graph Compiler: optimization and scheduling
Runtime: memory management, dispatch
Driver: Linux kernel module
Hardware: SABZEH NOVA Processors
# Run your existing PyTorch model
import torch
import sabzeh

# Load model and move to SABZEH
model = torch.load("model.pt")
model = model.to("sabzeh")

# Compile for optimal performance
model = sabzeh.compile(model)

# Run inference
output = model(input_tensor)

Explore Our Technical Documentation

Access detailed architecture specifications, performance benchmarks, and integration guides. Our engineering team is available to discuss your specific requirements.