GitHub - greg1232/cerebras-py-sim · GitHub
Cerebras CS3 Simulator
A high-fidelity architectural simulator for the Cerebras CS3 Wafer-Scale Engine (WSE). This project models the hardware architecture, interconnects, and execution environment to enable performance analysis and software development for massively parallel 2D mesh architectures.
🚀 Project Overview
Cerebras-Sim models the CS3 WSE as an array of 720,000 processing elements (PEs). It provides a full-stack simulation environment, from a high-level Python DSL down to a custom 32-bit ISA binary.
Key Goals
Performance Analysis: Estimate total runtime and identify bottlenecks using a hybrid performance/functional model.
Software Development: Verify kernel correctness via a CUDA-like programming model before deploying to hardware.
Architectural Exploration: Model the impact of mesh bisection bandwidth, latency, and SRAM constraints.
🏗️ Architecture
Hardware Model
Processing Element (PE): Each core implements an 8-wide SIMD unit, vector registers, and a private 48KB local SRAM.
Interconnect: A 2D mesh (800x900) where communication occurs via SEND/RECV primitives and global address space abstractions.
Memory Hierarchy:
Local SRAM: Private high-speed memory per PE (analogous to CUDA Shared Memory).
Weight Server: External DRAM accessed via a global address space for large-scale model weights and data.
Host-Device Interface: A driver model implementing a command queue (CS3Queue) and memory movement (cs3_memcpy).
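The figures quoted above can be collected in a small configuration object to sanity-check sizes. This is a minimal sketch, not the simulator's API: `MeshConfig`, its field names, and `fits_in_sram` are hypothetical. Treating 800 as the row count and 48KB as 48 KiB are also assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshConfig:
    """Hypothetical container for the hardware figures in this README."""
    rows: int = 800                      # assumption: 800 rows, 900 columns
    cols: int = 900
    simd_width: int = 8                  # lanes per PE
    sram_bytes_per_pe: int = 48 * 1024   # assumption: 48KB means 48 KiB

    @property
    def num_pes(self) -> int:
        return self.rows * self.cols

    @property
    def total_sram_bytes(self) -> int:
        return self.num_pes * self.sram_bytes_per_pe

    def fits_in_sram(self, num_float32: int) -> bool:
        """True if a float32 buffer of this length fits in one PE's SRAM."""
        return num_float32 * 4 <= self.sram_bytes_per_pe


cfg = MeshConfig()
print(cfg.num_pes)  # 720000
```

The PE count matches the 720,000 figure given in the project overview, which is a quick consistency check on the 800x900 mesh dimensions.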
Execution Model
The simulator employs a Bulk Synchronous Parallel (BSP) model, dividing execution into discrete "supersteps":
Compute: PEs perform local SIMD operations.
Communicate: PEs exchange data across the mesh or with the Weight Server.
Synchronize: A global barrier (SYNC) aligns the execution state.
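The three phases above can be illustrated with a toy superstep over a single row of PEs. This is a sketch under simplifying assumptions, not the simulator's engine: `bsp_superstep` is a hypothetical name, each PE holds one number, and every PE sends only to its east neighbour.

```python
def bsp_superstep(state):
    """Run one toy superstep over a row of PEs; `state` holds one number per PE."""
    n = len(state)

    # 1. Compute: each PE works only on its own local value.
    computed = [2 * v for v in state]

    # 2. Communicate: each PE sends its result to its east neighbour.
    #    Messages are buffered and stay invisible until the barrier.
    inbox = [0] * n
    for pe in range(n - 1):
        inbox[pe + 1] = computed[pe]

    # 3. Synchronize: after the barrier, every PE consumes its buffered message.
    return [c + m for c, m in zip(computed, inbox)]


state = [1, 2, 3, 4]
for _ in range(2):
    state = bsp_superstep(state)
print(state)  # [4, 16, 32, 48]
```

Because messages are only consumed after the barrier, no PE ever observes a neighbour's value from the middle of a superstep, which is the property that makes BSP execution easy to reason about.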
To balance accuracy and speed, the simulator uses a hybrid of two execution tracks:
Performance Track (Global): All PEs are tracked for cycle counts and timing.
Functional Track (Sampled): Only a stochastically sampled subset of blocks is simulated functionally, which verifies correctness without executing every block.
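One way to picture the two tracks is a loop that records timing for every block but marks only a random sample for functional execution. This is a rough sketch: the function names, the 10% sample fraction, and the fixed per-block cycle count are all assumptions, and the simulator's actual sampling strategy may differ.

```python
import random


def pick_functional_blocks(num_blocks, sample_fraction, seed=0):
    """Choose which blocks get full functional simulation (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    k = max(1, round(num_blocks * sample_fraction))
    return set(rng.sample(range(num_blocks), k))


def run_hybrid(num_blocks, cycles_per_block, sample_fraction=0.1, seed=0):
    sampled = pick_functional_blocks(num_blocks, sample_fraction, seed)
    cycles = {}
    functionally_checked = []
    for block in range(num_blocks):
        # Performance track: every block contributes a cycle count.
        cycles[block] = cycles_per_block
        # Functional track: only sampled blocks would execute on real data.
        if block in sampled:
            functionally_checked.append(block)
    return cycles, functionally_checked


cycles, checked = run_hybrid(num_blocks=100, cycles_per_block=250)
print(len(cycles), len(checked))  # 100 10
```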
💻 Software Stack
The project implements a complete toolchain:
Python DSL $\rightarrow$ Tungsten-IR $\rightarrow$ ISA Binary $\rightarrow$ Simulator
Programming Example
Kernels are written in a CUDA-like Python DSL. For example, a simple SAXPY kernel ($\mathbf{z} = \alpha \mathbf{x} + \mathbf{y}$, here with $\alpha = 2.0$):
```python
@cs3_kernel(block_w=16, block_h=16)
def saxpy_kernel(ctx):
    # Load inputs from global memory (Weight Server)
    x = ctx.load_global(None, 0)
    y = ctx.load_global(None, 4)

    # Compute: z = 2.0 * x + y
    z = 2.0 * x + y

    # Store result back to global memory
    ctx.store_global(None, 8, z)
```
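When verifying a kernel like this SAXPY, it can help to compare its output against a plain host-side reference. The sketch below is independent of the DSL: `saxpy_reference` is a hypothetical helper, not part of the project.

```python
def saxpy_reference(alpha, x, y):
    """Element-wise alpha * x + y on plain Python lists (host-side reference)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


expected = saxpy_reference(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(expected)  # [12.0, 24.0, 36.0]
```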
Frontend: A CUDA-like DSL embedded in Python using @cs3_kernel decorators.
Intermediate Representation (Tungsten-IR): A dataflow-centric IR mapping compute nodes and synchronization points.
Compiler Backend:
Mapping & Scheduling: Assigns IR nodes to the physical 2D mesh and manages the SRAM budget.
Assembler: Emits the final 32-bit binary stream.
Simulator Engine: A Python-based engine that decodes the ISA and drives the hardware model.
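To make the assembler and decoder stages concrete, here is a sketch of packing fixed-width 32-bit instructions and decoding them again. The field layout (8-bit opcode, 8-bit register, 16-bit immediate), the opcode numbers, and the little-endian byte order are invented for illustration. Only the SEND, RECV, and SYNC names come from this README, and the project's real encoding may look quite different.

```python
import struct

# Hypothetical opcode numbers; only the mnemonics appear in the README.
OPCODES = {"SEND": 0x01, "RECV": 0x02, "SYNC": 0x03}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def encode(op, reg=0, imm=0):
    """Pack one instruction into a 32-bit word: [opcode:8][reg:8][imm:16]."""
    if not 0 <= reg < 2**8:
        raise ValueError("reg must fit in 8 bits")
    if not 0 <= imm < 2**16:
        raise ValueError("imm must fit in 16 bits")
    return (OPCODES[op] << 24) | (reg << 16) | imm


def decode(word):
    """Inverse of encode: split a 32-bit word back into its fields."""
    return MNEMONICS[word >> 24], (word >> 16) & 0xFF, word & 0xFFFF


def assemble(program):
    """Emit a byte stream with one little-endian 32-bit word per instruction."""
    return b"".join(struct.pack("<I", encode(*ins)) for ins in program)


binary = assemble([("SEND", 1, 2), ("SYNC",)])
print(len(binary))                    # 8
print(decode(encode("SEND", 3, 42)))  # ('SEND', 3, 42)
```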
⏱️ Performance Modeling
Instead of exhaustive packet-level simulation, the system uses a latency-and-bandwidth-aware abstract model:
Latency: Calculated based on physical Manhattan distance:
$$\text{Latency}_{\text{op}} = \text{Base Latency} + (\text{Manhattan Distance} \times \text{Hop Latency})$$
Bandwidth & Congestion: The simulator enforces a Bisection Bandwidth Constraint. If total bytes transferred per superstep exceed...
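The latency formula above translates directly into code. In this sketch the function names are hypothetical, and the default base latency (10 cycles) and hop latency (1 cycle) are placeholder values, not figures from the simulator.

```python
def manhattan_distance(src, dst):
    """Hop count between two PEs given as (row, col) mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def op_latency(src, dst, base_latency=10, hop_latency=1):
    """Latency_op = Base Latency + Manhattan Distance * Hop Latency."""
    return base_latency + manhattan_distance(src, dst) * hop_latency


# Corner-to-corner transfer on an 800x900 mesh: 799 + 899 = 1698 hops.
print(op_latency((0, 0), (799, 899)))  # 1708
```

With these placeholder constants the hop term dominates long transfers, so the example also shows why placement on the mesh matters more than the fixed per-operation cost.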