Built on RISC-V: The Open Standard ISA

RISC-V represents a fundamental shift in processor architecture. Unlike proprietary instruction sets controlled by single vendors, RISC-V is developed through an open, collaborative process. This openness enables SABZEH to optimize at the deepest level while maintaining complete transparency with our customers.

The base RISC-V specification provides a clean, modular foundation. We extend this with custom instructions specifically designed for neural network operations, eliminating the overhead of general-purpose architectures that were never intended for AI workloads.

Zero Licensing Fees

No per-chip royalties or annual licensing costs that inflate your total cost of ownership.

Full Auditability

Every instruction, every register, every pipeline stage is documented and verifiable.

Custom Extensions

Add domain-specific instructions without breaking compatibility or waiting for vendor approval.

Growing Ecosystem

Major semiconductor companies, governments, and research institutions are adopting RISC-V.

Base ISA: RV64GC - Standard RISC-V
Vector: RVV 1.0 - Vector Extension
Matrix: SABZEH Matrix Extension
Tensor: SABZEH Tensor Core ISA
AI Ops: Attention, Softmax, LayerNorm

Three Pillars of AI Performance

01

Tensor Processing Units

Each NOVA processor contains hundreds of tensor cores designed specifically for matrix multiplication, the fundamental operation in neural networks. These units achieve near-theoretical peak throughput by eliminating the instruction decode and scheduling overhead present in general-purpose cores. Native support for INT8, FP16, BF16, and FP32 precision allows optimal precision selection for each layer of your model.
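The low-precision path can be sketched in a few lines of numpy: multiply at narrow precision, accumulate in a wider integer type, then rescale once. This is an illustrative sketch of the general INT8 pattern such units use, not SABZEH's implementation; `quantize_int8` and `int8_matmul` are hypothetical names.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: int8 values plus one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    # Integer products accumulate in a wide int32 accumulator; a single
    # rescale at the end returns the result to floating point.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
max_err = np.abs(int8_matmul(a, b) - a @ b).max()
```

The wide accumulator is the key design choice: quantization error enters only at the inputs, not at every one of the 64 additions in each dot product.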

02

Intelligent Memory Hierarchy

Neural networks exhibit access patterns fundamentally different from traditional applications. Our memory subsystem is designed around these patterns, with large on-chip SRAM buffers that hold entire transformer attention matrices, eliminating costly off-chip memory accesses. A novel prefetch engine predicts data requirements based on network topology, ensuring compute units are never starved.

03

Scalable Interconnect

Modern AI workloads increasingly require multi-chip configurations. Our SerDes-based interconnect provides 800 Gbps of bandwidth between chips, with latency measured in nanoseconds. Collective operations such as AllReduce are implemented in dedicated hardware, achieving near-linear scaling efficiency up to thousands of chips in distributed training scenarios.
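The schedule such collective hardware typically implements is the bandwidth-optimal ring AllReduce: 2(N-1) steps, each moving 1/N of the data, so per-link traffic stays flat as the ring grows. A pure-Python model of that schedule (illustrative only, not SABZEH firmware):

```python
def ring_allreduce(node_chunks):
    """node_chunks[i][j] = chunk j held by node i (scalars for simplicity)."""
    n = len(node_chunks)
    data = [list(c) for c in node_chunks]
    # Reduce-scatter phase: in step s, node i sends chunk (i - s) mod n to
    # its right neighbour, which accumulates it. After n-1 steps node i
    # owns the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] += val
    # All-gather phase: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data

# Three nodes, three chunks each; every node ends with the elementwise sum.
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

The `sends` snapshot models the fact that all links transfer simultaneously within a step; near-linear scaling follows because each step's traffic per link is constant regardless of N.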

Purpose-Built Instructions for AI

Matrix Extension

Beyond standard RISC-V vector operations, our Matrix Extension adds native support for 2D tile operations. A single instruction can compute an 8x8 or 16x16 matrix multiply-accumulate, dramatically reducing instruction overhead for convolution and attention layers.

Native GEMM support up to 16x16 tiles
Fused multiply-add with accumulator preservation
Automatic precision conversion between tile operations
Register blocking for complex matrix expressions
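The tile-level execution model can be sketched as follows, with each inner step standing in for one 16x16 multiply-accumulate instruction and the accumulator tile preserved across the reduction loop. A numpy illustration of the pattern, not the actual ISA:

```python
import numpy as np

TILE = 16  # the largest tile size the Matrix Extension describes

def tiled_gemm(a, b):
    """GEMM expressed as 16x16 tile operations (dims must be multiples of TILE)."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            # Accumulator tile is preserved across the k loop, mirroring
            # fused multiply-add with accumulator preservation.
            acc = np.zeros((TILE, TILE), dtype=np.float32)
            for p in range(0, k, TILE):
                # One "instruction": a 16x16 matrix multiply-accumulate.
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc
    return c

rng = np.random.default_rng(1)
a = rng.standard_normal((32, 48)).astype(np.float32)
b = rng.standard_normal((48, 32)).astype(np.float32)
```

A convolution or attention layer lowered this way issues one instruction per tile instead of hundreds of scalar or vector operations, which is where the instruction-overhead reduction comes from.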

Attention Accelerator

Transformer models spend the majority of compute time in attention mechanisms. We've implemented FlashAttention-style algorithms directly in hardware, computing attention scores with optimal memory access patterns and no intermediate materialization.

Fused QKV projection and attention computation
Online softmax with numerical stability
Multi-head attention parallelism
KV-cache optimization for autoregressive generation
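The online softmax mentioned above can be sketched in a few lines: a single streaming pass keeps a running maximum and a rescaled running sum, so the full score row never has to be materialized at once. Illustrative numpy, not the hardware datapath:

```python
import numpy as np

def online_softmax(scores):
    """Numerically stable softmax via one streaming pass over the scores."""
    m, s = -np.inf, 0.0   # running max, running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # Rescale the old sum whenever the running max changes.
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return np.exp(np.asarray(scores) - m) / s
```

Subtracting the running max keeps every exponent non-positive, so even scores in the thousands never overflow, and the incremental rescaling is exactly what lets FlashAttention-style hardware process attention in blocks.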

Activation Functions

Non-linear activation functions are ubiquitous in neural networks but expensive to compute. Dedicated hardware units implement GELU, SiLU, Softmax, and LayerNorm with single-cycle throughput, eliminating the performance penalty of these operations.

Single-cycle GELU and SiLU computation
Fused LayerNorm with residual addition
RoPE positional encoding in hardware
Group normalization for vision models
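For reference, the functions these units evaluate are compact. A numpy sketch of two of them (the tanh approximation of GELU is assumed here, one of the variants commonly chosen for hardware; the source doesn't state which form SABZEH implements):

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    """SiLU (a.k.a. Swish): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))
```

In software each call costs an exponential or hyperbolic tangent per element; a dedicated unit evaluates the whole expression at one result per cycle, which is the penalty being eliminated.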

Sparsity Engine

Production models are often pruned to reduce computational requirements. Our sparsity engine detects and skips zero values at runtime, achieving effective throughput improvements of up to 4x for appropriately structured sparse models.

Runtime zero detection with minimal overhead
Support for structured and unstructured sparsity
Compressed sparse tensor formats
Dynamic sparsity for activation tensors
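One common structured format is 2:4 sparsity, where at most two of every four consecutive weights are nonzero, a pattern hardware can skip deterministically. A sketch of 2:4 pruning (illustrative; the source doesn't specify which structured formats the engine supports):

```python
import numpy as np

def prune_2_4(w):
    """2:4 structured pruning: per group of four weights, keep the two
    largest magnitudes and zero the other two."""
    g = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(g), axis=1)[:, :2]  # two smallest per group
    np.put_along_axis(g, drop, 0.0, axis=1)
    return g.reshape(w.shape)

w = np.array([1.0, -5.0, 0.1, 2.0, 3.0, 3.5, -4.0, 0.0])
pruned = prune_2_4(w)
```

The fixed 2-of-4 structure is what makes runtime skipping cheap: the hardware knows in advance that each group contributes at most two products, so the schedule (and the up-to-2x gain per pruned operand) is deterministic rather than data-dependent.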

Designed for Neural Network Access Patterns

Traditional cache hierarchies optimize for temporal and spatial locality, assumptions that don't hold for neural networks: weights are accessed once per inference, activations flow forward through layers, and attention patterns span the entire sequence length.

Our memory system is purpose-built for these patterns. Large on-chip SRAM buffers can hold entire layer weights, eliminating repeated off-chip fetches. A software-managed scratchpad allows explicit control over data placement, enabling optimal tiling strategies for large models.

The result is 85% sustained memory bandwidth utilization compared to 40-50% typical in conventional GPU architectures. This efficiency directly translates to higher throughput and lower latency for your AI workloads.
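As a back-of-envelope check of what those figures mean at the quoted 6.4 TB/s peak (taking 45% as the midpoint of the 40-50% range for conventional GPUs):

```python
# Sustained bandwidth implied by each utilization figure, at the quoted peak.
peak = 6.4                    # TB/s, peak HBM3 bandwidth
sustained = peak * 0.85       # TB/s delivered at 85% utilization
conventional = peak * 0.45    # TB/s at a typical 45% utilization
advantage = sustained / conventional   # ~1.9x more delivered bandwidth
```

In other words, the same physical memory system delivers roughly 5.4 TB/s here versus under 3 TB/s at conventional utilization, before any difference in peak bandwidth is considered.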

85% Bandwidth Utilization
128MB On-Chip SRAM
6.4 TB/s HBM3 Bandwidth
256GB Maximum HBM Capacity
Registers: 32KB per core
L1 Cache: 256KB per cluster
Shared SRAM: 128MB total
HBM3: 96-256GB capacity

Transparency and Trust at the Silicon Level

Hardware Root of Trust

Secure boot chain validates firmware integrity from power-on through application launch. Tamper detection mechanisms protect against physical attacks on the device.

Memory Encryption

AES-256 encryption protects data at rest in HBM. Memory controllers perform encryption and decryption transparently with minimal performance impact.

Secure Enclaves

Isolated execution environments protect sensitive model weights and inference data from other workloads sharing the same processor.

Full Auditability

Complete RTL documentation enables security verification by your team or third-party auditors. No black boxes or hidden functionality.

Production-Ready Software Stack

Hardware performance means nothing without software that can utilize it. SABZEH provides a complete software stack from low-level drivers through high-level framework integrations.

Our compiler automatically maps standard model representations to SABZEH hardware, optimizing for data layout, operation fusion, and memory placement. Framework integrations for PyTorch and TensorFlow allow you to run existing models without modification.

Frameworks: PyTorch, TensorFlow, ONNX
Graph Compiler: optimization and scheduling
Runtime: memory management, dispatch
Driver: Linux kernel module
Hardware: SABZEH NOVA Processors
# Run your existing PyTorch model
import torch
import sabzeh

# Load model and move to SABZEH
model = torch.load("model.pt")
model = model.to("sabzeh")

# Compile for optimal performance
model = sabzeh.compile(model)

# Run inference
output = model(input_tensor)

Explore Our Technical Documentation

Access detailed architecture specifications, performance benchmarks, and integration guides. Our engineering team is available to discuss your specific requirements.