Shape Suffixes — Good Coding Style
Noam Shazeer
Feb 28, 2024
Variable names should be concise and informative. For a tensor, nothing is more informative than how many dimensions it has, and what those dimensions represent.

We have been keeping this convention at Character.AI since 2022. Give it a try and let me know if you feel saner:

1. Designate a system of single-letter names for logical dimensions, e.g. B for batch size, L for sequence length, etc., and document it somewhere in your file/project/codebase.
2. When known, the name of a tensor should end in a dimension-suffix composed of those letters, e.g. input_token_id_BL for a two-dimensional tensor with batch and length dimensions.

That’s all. You can use shape suffixes with torch, JAX, whatever. See the example below.

""" Example Transformer code with shape suffixes.

This code is incomplete and possibly has bugs. Don't try to run it.
Its purpose is to illustrate shape suffixes.

Dimension key:

B: batch size
L: sequence length
M: memory length (length of sequence being attended to)
D: model dimension (sometimes called d_model or embedding_dim)
V: vocabulary size
F: feed-forward subnetwork hidden size
H: number of attention heads in a layer
K: size of each attention key or value (sometimes called d_kv)
"""
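import torch

# layer_norm is not defined in this example.


def transformer(input_token_id_BL, params):
  # Look up one D-dimensional embedding per token id.
  hidden_BLD = params.embedding_VD[input_token_id_BL]
  for layer_num in range(params.num_layers):
    # Residual connections around each sub-layer.
    hidden_BLD += attention(hidden_BLD, params.attention_params[layer_num])
    hidden_BLD += ffn(hidden_BLD, params.ffn_params[layer_num])
  hidden_BLD = layer_norm(hidden_BLD, params.final_layernorm_params)
  # The output projection reuses the embedding matrix, transposed to (D, V).
  logits_BLV = torch.matmul(hidden_BLD, params.embedding_VD.T)
  return logits_BLV
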
def ffn(input_BLD, params):
  input_BLD = layer_norm(input_BLD, params.layernorm_params)
  # Expand D -> F, apply GELU, then project F -> D.
  hidden_BLF = torch.nn.functional.gelu(torch.matmul(input_BLD, params.w_in_DF))
  output_BLD = torch.matmul(hidden_BLF, params.w_out_FD)
  return output_BLD

def attention(input_BLD, params):
  input_BLD = layer_norm(input_BLD, params.layernorm_params)
  query_BLHK = torch.einsum('BLD,DHK->BLHK', input_BLD, params.w_q_DHK)
  # Self-attention: the memory is the input itself, so M == L.
  key_BMHK = torch.einsum('BMD,DHK->BMHK', input_BLD, params.w_k_DHK)
  value_BMHK = torch.einsum('BMD,DHK->BMHK', input_BLD, params.w_v_DHK)
  logits_BHLM = torch.einsum('BLHK,BMHK->BHLM', query_BLHK, key_BMHK)
  B, L, H, K = query_BLHK.shape
  logits_BHLM /= K ** 0.5
  # Causal mask: position l must not attend to any later position m > l.
  masked_out_LM = torch.arange(L).unsqueeze(1) < torch.arange(L).unsqueeze(0)
  logits_BHLM = logits_BHLM.masked_fill(masked_out_LM, float('-inf'))
  weights_BHLM = torch.softmax(logits_BHLM, dim=-1)
  wtd_values_BLHK = torch.einsum('BMHK,BHLM->BLHK', value_BMHK, weights_BHLM)
  out_BLD = torch.einsum('BLHK,HKD->BLD', wtd_values_BLHK, params.w_o_HKD)
  return out_BLD
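
Shape suffixes are a naming convention, so nothing stops a name and a tensor from drifting apart. One possible way to catch that is a small runtime check, sketched below. This is not part of the original post: `DIMENSION_KEY`, `suffix_of`, and `check_shape` are hypothetical names introduced for illustration, and the helper takes a plain shape tuple so that it does not depend on torch or JAX.

```python
# Hypothetical helper (an assumption, not from the post): validate that a
# shape matches the suffix of a variable name, and that each dimension
# letter keeps the same size across all the tensors that are checked.

# Dimension key, mirroring the one documented in the example's docstring.
DIMENSION_KEY = {
    "B": "batch size",
    "L": "sequence length",
    "M": "memory length",
    "D": "model dimension",
    "V": "vocabulary size",
    "F": "feed-forward subnetwork hidden size",
    "H": "number of attention heads in a layer",
    "K": "size of each attention key or value",
}


def suffix_of(name):
    """Return the shape suffix of a name, e.g. 'BLD' for 'hidden_BLD'."""
    suffix = name.rsplit("_", 1)[-1]
    unknown = [c for c in suffix if c not in DIMENSION_KEY]
    if "_" not in name or not suffix or unknown:
        raise ValueError(f"{name!r} has no valid shape suffix")
    return suffix


def check_shape(name, shape, sizes):
    """Check `shape` against the suffix of `name`.

    `sizes` maps dimension letters to the sizes seen so far; it is updated
    in place, so tensors checked against the same dict must agree on the
    size of every shared letter.
    """
    suffix = suffix_of(name)
    shape = tuple(shape)
    if len(shape) != len(suffix):
        raise ValueError(f"{name}: expected {len(suffix)} dims, got shape {shape}")
    for letter, size in zip(suffix, shape):
        # setdefault records the size on first sight, else returns the old one.
        if sizes.setdefault(letter, size) != size:
            raise ValueError(
                f"{name}: {letter} ({DIMENSION_KEY[letter]}) is {size}, "
                f"but was {sizes[letter]} earlier"
            )
    return sizes


sizes = {}
check_shape("input_token_id_BL", (2, 5), sizes)
check_shape("hidden_BLD", (2, 5, 16), sizes)
check_shape("w_in_DF", (16, 64), sizes)
print(sizes)  # {'B': 2, 'L': 5, 'D': 16, 'F': 64}
```

With real tensors you would pass something like `hidden_BLD.shape` as the second argument, since torch and JAX shapes both convert to a tuple. The name is passed as a string here only because Python offers no simple way to recover a variable's name from its value.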