CUDA-like programming of Cerebras WSE

gdiamos | 1 pts | 0 comments

GitHub - greg1232/cerebras-py-sim · GitHub



Cerebras CS3 Simulator

A high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE). This project models the hardware architecture, interconnects, and execution environment to enable performance analysis and software development for massively parallel 2D mesh architectures.

🚀 Project Overview

Cerebras-Sim models the CS3 WSE as an array of 720,000 processing elements (PEs). It provides a full-stack simulation environment, from a high-level Python DSL down to a custom 32-bit ISA binary.

Key Goals

Performance Analysis: Estimate total runtime and identify bottlenecks using a hybrid performance/functional model.

Software Development: Verify kernel correctness via a CUDA-like programming model before deploying to hardware.

Architectural Exploration: Model the impact of mesh bisection bandwidth, latency, and SRAM constraints.

🏗️ Architecture

Hardware Model

Processing Element (PE): Each core implements an 8-wide SIMD unit, vector registers, and a private 48KB local SRAM.

Interconnect: A 2D Mesh (800x900) where communication occurs via SEND/RECV primitives and global address space abstractions.

Memory Hierarchy:

Local SRAM: Private high-speed memory per PE (analogous to CUDA Shared Memory).

Weight Server: External DRAM accessed via a global address space for large-scale model weights and data.

Host-Device Interface: A driver model implementing a command queue (CS3Queue) and memory movement (cs3_memcpy).
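As a quick sanity check on the figures above, the mesh dimensions and per-PE resources can be captured in a small configuration object. This is a minimal sketch, not the simulator's own data model: `MeshConfig` and its field names are hypothetical, the assignment of 800 and 900 to width and height is arbitrary, and 48KB is assumed to mean 48 × 1024 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshConfig:
    """Hypothetical container for the hardware figures quoted in this README."""

    width: int = 800             # PEs along one mesh axis (axis choice is an assumption)
    height: int = 900            # PEs along the other mesh axis
    simd_lanes: int = 8          # lanes in each PE's SIMD unit
    sram_bytes: int = 48 * 1024  # private SRAM per PE, assuming 1 KB = 1024 bytes

    @property
    def pe_count(self) -> int:
        """Total number of processing elements in the mesh."""
        return self.width * self.height

    @property
    def total_sram_bytes(self) -> int:
        """Aggregate local SRAM across all PEs."""
        return self.pe_count * self.sram_bytes

    @property
    def total_simd_lanes(self) -> int:
        """Aggregate SIMD lanes across all PEs."""
        return self.pe_count * self.simd_lanes


cfg = MeshConfig()
print(cfg.pe_count)          # 720000, matching the PE count in the overview
print(cfg.total_sram_bytes)  # aggregate local SRAM in bytes
```

The 800x900 mesh multiplies out to the 720,000 PEs quoted in the Project Overview, so the two figures are consistent.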

Execution Model

The simulator employs a Bulk Synchronous Parallel (BSP) model, dividing execution into discrete "supersteps":

Compute: PEs perform local SIMD operations.

Communicate: PEs exchange data across the mesh or with the Weight Server.

Synchronize: A global barrier (SYNC) aligns the execution state.
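The three phases above can be illustrated with a toy superstep over a tiny mesh. This is a sketch of the general BSP pattern only; `run_superstep` and `mesh_neighbors` are hypothetical names, and the rule that each PE adds up the values received from its four mesh neighbors is an arbitrary choice made for the example.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Coord = Tuple[int, int]


def mesh_neighbors(pe: Coord, width: int, height: int) -> Iterator[Coord]:
    """Yield the in-bounds north/south/east/west neighbors of a PE."""
    x, y = pe
    for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def run_superstep(
    values: Dict[Coord, float],
    compute: Callable[[float], float],
    width: int,
    height: int,
) -> Dict[Coord, float]:
    """Run one toy superstep: compute, communicate, then synchronize."""
    # Compute: every PE updates its local value independently.
    computed = {pe: compute(v) for pe, v in values.items()}

    # Communicate: each PE sends its result to its neighbors.
    # Messages land in inboxes and are not read during this phase.
    inbox: Dict[Coord, List[float]] = {pe: [] for pe in values}
    for pe, v in computed.items():
        for dst in mesh_neighbors(pe, width, height):
            if dst in inbox:
                inbox[dst].append(v)

    # Synchronize: past the barrier, every PE sees all messages sent to it.
    return {pe: computed[pe] + sum(inbox[pe]) for pe in values}


mesh = {(x, y): 1.0 for x in range(2) for y in range(2)}
result = run_superstep(mesh, lambda v: 2.0 * v, width=2, height=2)
print(result[(0, 0)])  # 6.0: own value 2.0 plus 2.0 from each of two neighbors
```

Buffering messages in `inbox` until the final step is what models the barrier: no PE observes a neighbor's value from the current superstep before all PEs have finished computing.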

To balance accuracy and speed, the simulator uses a hybrid execution track:

Performance Track (Global): All PEs are tracked for cycle counts and timing.

Functional Track (Sampled): A stochastic sampling strategy is used where only a subset of blocks is fully simulated functionally to verify correctness.
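One way to picture the sampled functional track is a seeded random draw over block indices. This is a sketch under stated assumptions: the README does not describe the actual sampling strategy, and `sample_functional_blocks`, its `fraction` parameter, and the uniform draw are all hypothetical.

```python
import random
from typing import List


def sample_functional_blocks(num_blocks: int, fraction: float, seed: int = 0) -> List[int]:
    """Pick a reproducible subset of block indices for full functional simulation."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Always simulate at least one block so correctness is checked somewhere.
    count = max(1, round(num_blocks * fraction))

    # A private, seeded generator keeps the choice repeatable across runs.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_blocks), count))


sampled = sample_functional_blocks(num_blocks=1000, fraction=0.05, seed=42)
print(len(sampled))  # 50 blocks get functional simulation; all 1000 are still timed
```

Fixing the seed makes a failing run reproducible, which matters when only a subset of blocks is checked for correctness.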

💻 Software Stack

The project implements a complete toolchain:

Python DSL $\rightarrow$ Tungsten-IR $\rightarrow$ ISA Binary $\rightarrow$ Simulator

Programming Example

Kernels are written in a CUDA-like Python DSL. For example, a simple SAXPY ($\mathbf{z} = \alpha \mathbf{x} + \mathbf{y}$) kernel, here with $\alpha = 2.0$ hard-coded:

```python
@cs3_kernel(block_w=16, block_h=16)
def saxpy_kernel(ctx):
    # Load inputs from global memory (Weight Server)
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)

    # Compute: z = 2.0 * x + y
    z = 2.0 * x + y

    # Store result back to global memory
    ctx.store_global(None, 8, z)
```
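To make the data flow concrete, the kernel body above can be exercised against a stand-in context. Everything here is an assumption made for illustration: `FakeCtx` is hypothetical, the first argument of `load_global`/`store_global` is treated as an unused base handle, and the second as a byte offset into a flat global memory of 4-byte floats. The real `ctx` object may behave differently.

```python
import struct


class FakeCtx:
    """Hypothetical stand-in for the kernel context (not the simulator's class)."""

    def __init__(self, size_bytes: int = 64):
        # Flat byte-addressable "global memory", zero-initialized.
        self.memory = bytearray(size_bytes)

    def load_global(self, base, offset: int) -> float:
        # Assumption: offsets are byte offsets to little-endian float32 values.
        return struct.unpack_from("<f", self.memory, offset)[0]

    def store_global(self, base, offset: int, value: float) -> None:
        struct.pack_into("<f", self.memory, offset, value)


def saxpy_body(ctx) -> None:
    """Same statements as the kernel body, without the @cs3_kernel decorator."""
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)
    z = 2.0 * x + y
    ctx.store_global(None, 8, z)


ctx = FakeCtx()
ctx.store_global(None, 0, 3.0)   # x
ctx.store_global(None, 4, 1.5)   # y
saxpy_body(ctx)
print(ctx.load_global(None, 8))  # 7.5
```

Under this reading, offsets 0, 4, and 8 address three consecutive 32-bit floats, which would explain the spacing used in the kernel.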

Frontend: A CUDA-like DSL embedded in Python using @cs3_kernel decorators.

Intermediate Representation (Tungsten-IR): A dataflow-centric IR mapping compute nodes and synchronization points.

Compiler Backend:

Mapping & Scheduling: Assigns IR nodes to the physical 2D mesh and manages the SRAM budget.

Assembler: Emits the final 32-bit binary stream.

Simulator Engine: A Python-based engine that decodes the ISA and drives the hardware model.
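The assemble-then-decode round trip at the end of this pipeline can be sketched with fixed-width 32-bit words. The encoding below is purely illustrative: the README does not document the ISA, so the opcode numbers, the mnemonics other than SYNC, and the 8/8/8/8-bit field layout are all hypothetical.

```python
import struct
from typing import List, Tuple

# Hypothetical opcode table; the real ISA encoding is not given in the README.
OPCODES = {"LOAD": 0x01, "FMA": 0x02, "STORE": 0x03, "SYNC": 0x04}
MNEMONICS = {code: name for name, code in OPCODES.items()}

# (mnemonic, dst, src0, src1), each operand assumed to fit in 8 bits.
Instr = Tuple[str, int, int, int]


def assemble(program: List[Instr]) -> bytes:
    """Pack each instruction into one little-endian 32-bit word."""
    words = []
    for op, dst, src0, src1 in program:
        for field in (dst, src0, src1):
            if not 0 <= field <= 0xFF:
                raise ValueError(f"operand {field} does not fit in 8 bits")
        words.append((OPCODES[op] << 24) | (dst << 16) | (src0 << 8) | src1)
    return struct.pack(f"<{len(words)}I", *words)


def disassemble(binary: bytes) -> List[Instr]:
    """Decode a stream of 32-bit words back into instruction tuples."""
    if len(binary) % 4 != 0:
        raise ValueError("binary length must be a multiple of 4 bytes")
    words = struct.unpack(f"<{len(binary) // 4}I", binary)
    return [
        (MNEMONICS[w >> 24], (w >> 16) & 0xFF, (w >> 8) & 0xFF, w & 0xFF)
        for w in words
    ]


program = [("LOAD", 0, 0, 0), ("LOAD", 1, 4, 0), ("FMA", 2, 0, 1), ("STORE", 2, 8, 0)]
binary = assemble(program)
print(len(binary))                     # 16 bytes: four 32-bit words
print(disassemble(binary) == program)  # True
```

A fixed-width word keeps decoding in the simulator to a single unpack plus shifts and masks, at the cost of limiting operand ranges.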

⏱️ Performance Modeling

Instead of exhaustive packet-level simulation, the system uses a latency-and-bandwidth-aware abstract model:

Latency: Calculated based on physical Manhattan distance:

$$\text{Latency}_{\text{op}} = \text{Base Latency} + (\text{Manhattan Distance} \times \text{Hop Latency})$$

Bandwidth & Congestion: The simulator enforces a Bisection Bandwidth Constraint. If total bytes transferred per superstep exceed...
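The latency formula above translates directly into code. This is a minimal sketch: the function names are hypothetical, and the base and hop latencies used in the example are placeholder cycle counts, since the README does not state the simulator's actual values.

```python
from typing import Tuple

Coord = Tuple[int, int]


def manhattan_distance(src: Coord, dst: Coord) -> int:
    """Number of mesh hops between two PEs on a 2D grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def op_latency(src: Coord, dst: Coord, base_latency: int, hop_latency: int) -> int:
    """Latency_op = Base Latency + (Manhattan Distance x Hop Latency)."""
    return base_latency + manhattan_distance(src, dst) * hop_latency


# Placeholder parameters: 10 cycles base, 2 cycles per hop.
print(op_latency((0, 0), (3, 4), base_latency=10, hop_latency=2))  # 24
```

On the 800x900 mesh, the longest path runs corner to corner, from (0, 0) to (799, 899), which is 1,698 hops, so hop latency dominates the cost of long-range transfers.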
