CUDA-like programming of Cerebras WSE

GitHub - greg1232/cerebras-py-sim · GitHub


docs

src/cerebras_sim

tests

.gitignore

README.md

requirements.txt


Cerebras CS3 Simulator

A high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE). This project models the hardware architecture, interconnects, and execution environment to enable performance analysis and software development for massively parallel 2D mesh architectures.

🚀 Project Overview

Cerebras-Sim models the CS3 WSE as a massive array of 720,000 processing elements (PEs). It provides a full-stack simulation environment, from a high-level Python DSL down to a custom 32-bit ISA binary.

Key Goals

Performance Analysis: Estimate total runtime and identify bottlenecks using a hybrid performance/functional model.

Software Development: Verify kernel correctness via a CUDA-like programming model before deploying to hardware.

Architectural Exploration: Model the impact of mesh bisection bandwidth, latency, and SRAM constraints.

🏗️ Architecture

Hardware Model

Processing Element (PE): Each core implements an 8-wide SIMD unit, vector registers, and a private 48 KB local SRAM.

Interconnect: A 2D mesh (800x900) where communication occurs via SEND/RECV primitives and global address space abstractions.

Memory Hierarchy:

Local SRAM: Private high-speed memory per PE (analogous to CUDA Shared Memory).

Weight Server: External DRAM accessed via a global address space for large-scale model weights and data.

Host-Device Interface: A driver model implementing a command queue (CS3Queue) and memory movement (cs3_memcpy).
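The figures above can be cross-checked with a small sketch. `MeshConfig` is a hypothetical container, not a class from this repository. It assumes 800 columns by 900 rows (the README does not say which dimension is which), row-major PE numbering, and 1 KB = 1024 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshConfig:
    """Hypothetical holder for the hardware figures quoted in this README."""
    width: int = 800              # assumed: mesh columns
    height: int = 900             # assumed: mesh rows
    simd_width: int = 8           # SIMD lanes per PE
    sram_bytes: int = 48 * 1024   # local SRAM per PE, assuming 1 KB = 1024 B

    @property
    def num_pes(self) -> int:
        """Total number of processing elements in the mesh."""
        return self.width * self.height

    def pe_id(self, x: int, y: int) -> int:
        """Linear id of the PE at (x, y), assuming row-major numbering."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError("coordinates outside the mesh")
        return y * self.width + x

    def pe_coords(self, pe_id: int) -> tuple:
        """Inverse of pe_id: linear id back to (x, y)."""
        if not 0 <= pe_id < self.num_pes:
            raise ValueError("pe_id outside the mesh")
        return pe_id % self.width, pe_id // self.width


cfg = MeshConfig()
total_sram_bytes = cfg.num_pes * cfg.sram_bytes  # aggregate on-wafer SRAM
```

With these defaults, `cfg.num_pes` agrees with the 720,000 PEs quoted in the overview.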

Execution Model

The simulator employs a Bulk Synchronous Parallel (BSP) model, dividing execution into discrete "supersteps":

Compute: PEs perform local SIMD operations.

Communicate: PEs exchange data across the mesh or with the Weight Server.

Synchronize: A global barrier (SYNC) aligns the execution state.
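The three phases above can be illustrated with a toy superstep in plain Python. This is a minimal sketch of the BSP idea, not the simulator's engine: `superstep`, `compute`, and `send_to` are hypothetical names, each PE holds a single number, and each PE sends one message per superstep.

```python
def superstep(values, compute, send_to):
    """Run one toy BSP superstep over a dict of PE id -> local value."""
    # 1. Compute: every PE updates its local value independently.
    computed = {pe: compute(v) for pe, v in values.items()}

    # 2. Communicate: messages are buffered in inboxes, not consumed yet.
    inbox = {pe: [] for pe in values}
    for pe, v in computed.items():
        inbox[send_to(pe)].append(v)

    # 3. Synchronize: only after all sends have completed (the barrier)
    #    does each PE fold its received messages into its state.
    return {pe: computed[pe] + sum(inbox[pe]) for pe in values}


# Four PEs in a ring: double the local value, then pass it to the right.
start = {0: 1, 1: 2, 2: 3, 3: 4}
result = superstep(start, compute=lambda v: v * 2,
                   send_to=lambda pe: (pe + 1) % 4)
```

Because delivery happens only after the barrier, no PE observes a neighbour's value from the middle of the same superstep.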

To balance accuracy and speed, the simulator uses a hybrid execution track:

Performance Track (Global): All PEs are tracked for cycle counts and timing.

Functional Track (Sampled): A stochastic sampling strategy is used where only a subset of blocks is fully simulated functionally to verify correctness.
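One way to picture the two tracks is the sketch below. It is an illustration under stated assumptions, not the simulator's actual sampling policy: `choose_functional_blocks` and `run_hybrid` are hypothetical names, blocks are sampled uniformly with a fixed seed, and a constant per-block cycle count stands in for the real cost model.

```python
import random


def choose_functional_blocks(num_blocks, fraction, seed=0):
    """Pick the subset of block ids to simulate functionally (uniform sample)."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    k = max(1, round(num_blocks * fraction))
    return set(rng.sample(range(num_blocks), k))


def run_hybrid(num_blocks, fraction, cycles_per_block, seed=0):
    """Track cycles for every block; functionally check only sampled ones."""
    sampled = choose_functional_blocks(num_blocks, fraction, seed)
    cycles = {}
    verified = []
    for block in range(num_blocks):
        # Performance track: every block gets a cycle estimate.
        cycles[block] = cycles_per_block  # placeholder cost model
        # Functional track: only sampled blocks would be executed in full.
        if block in sampled:
            verified.append(block)
    return cycles, verified


cycles, verified = run_hybrid(num_blocks=50, fraction=0.1, cycles_per_block=100)
```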

💻 Software Stack

The project implements a complete toolchain: Python DSL $\rightarrow$ Tungsten-IR $\rightarrow$ ISA Binary $\rightarrow$ Simulator

Programming Example

Kernels are written in a CUDA-like Python DSL. For example, a simple SAXPY ($\mathbf{z} = \alpha \mathbf{x} + \mathbf{y}$) kernel:

```python
@cs3_kernel(block_w=16, block_h=16)
def saxpy_kernel(ctx):
    # Load inputs from global memory (Weight Server)
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)

    # Compute: z = 2.0 * x + y
    z = 2.0 * x + y

    # Store result back to global memory
    ctx.store_global(None, 8, z)
```
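When checking a kernel like this for correctness, a host-side reference of the same arithmetic is useful to compare against. The helper below is a hypothetical plain-Python reference, not part of the simulator's API, and it assumes the inputs are equal-length sequences of floats.

```python
def saxpy_reference(alpha, xs, ys):
    """Element-wise z = alpha * x + y on plain Python sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return [alpha * x + y for x, y in zip(xs, ys)]


# Same alpha as the kernel above (2.0), on a tiny input.
expected = saxpy_reference(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```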

Frontend: A CUDA-like DSL embedded in Python using @cs3_kernel decorators.

Intermediate Representation (Tungsten-IR): A dataflow-centric IR mapping compute nodes and synchronization points.

Compiler Backend:

Mapping & Scheduling: Assigns IR nodes to the physical 2D mesh and manages the SRAM budget.

Assembler: Emits the final 32-bit binary stream.

Simulator Engine: A Python-based engine that decodes the ISA and drives the hardware model.

⏱️ Performance Modeling

Instead of exhaustive packet-level simulation, the system uses a latency-and-bandwidth-aware abstract model:

Latency: Calculated based on physical Manhattan distance:

$$\text{Latency}_{\text{op}} = \text{Base Latency} + (\text{Manhattan Distance} \times \text{Hop Latency})$$

Bandwidth & Congestion: The simulator enforces a Bisection Bandwidth Constraint. If total bytes transferred per superstep exceed...
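The latency formula above translates directly into code. This is a minimal sketch: the function names are hypothetical, PEs are addressed by (x, y) mesh coordinates, and the base and hop latencies in the example are placeholder values rather than figures from the simulator.

```python
def manhattan_distance(src, dst):
    """Number of mesh hops between two (x, y) PE coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def op_latency(src, dst, base_latency, hop_latency):
    """Latency_op = Base Latency + (Manhattan Distance * Hop Latency)."""
    return base_latency + manhattan_distance(src, dst) * hop_latency


# Placeholder numbers: 10-cycle base cost, 2 cycles per hop, 7 hops apart.
latency = op_latency((0, 0), (3, 4), base_latency=10, hop_latency=2)
```

A transfer to the same PE costs only the base latency, since the Manhattan distance is zero.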
