DeepSeek-OCR Visualized



Advanced Deep Learning


DeepSeek-OCR fully visualised

Frederik vom Lehn

6 min read · Dec 11, 2025


Understand SAM, Token compression, DeepSeek-MoE, Multi-Head-Latent-Attention.

DeepSeek-OCR is essentially a combination of known architectures: SAM, CLIP and CNNs for the vision encoder, and an MoE language model for the decoder. I visualised the architecture on a massive 6000x2000 image. I provide it as an SVG via a link, so that you can follow the full information flow and zoom in on specific parts. The images in this article are taken from that large image. On smaller devices the text might be hard to read; in that case, use the link at the end of the article.

DeepSeek-OCR compresses vision tokens: the language decoder receives fewer vision tokens than in other architectures, but each token carries more information about the input image.

Let’s start with the vision encoder. I used toy parameters for the illustration and noted the real parameters at the top of each image. We start with SAM, continue with a 2-layer CNN and finish with CLIP. The input image is resized to fit a 1024x1024 square while preserving its aspect ratio, and any leftover area is filled with a constant grey color.

    from PIL import Image, ImageOps

    # Pad to a square canvas; the border color is the normalization mean (grey).
    global_view = ImageOps.pad(image, (base_size, base_size),
                               color=tuple(int(x * 255) for x in image_transform.mean))

If the image is larger than 640px in at least one dimension, it is also split into a dynamic number of crops of 640x640 pixels each. The number of crops depends on the image's aspect ratio.
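To make the link between aspect ratio and crop count concrete, here is a minimal sketch of one way a tile grid could be chosen. Everything in it is an assumption for illustration: the helper names `choose_tile_grid` and `crop_boxes`, the limits `min_tiles=2` and `max_tiles=9`, and the closest-aspect-ratio rule are hypothetical and not taken from the DeepSeek-OCR code.

```python
def choose_tile_grid(width, height, min_tiles=2, max_tiles=9):
    """Pick a (cols, rows) grid whose shape best matches the image aspect ratio.

    Hypothetical helper: the tile limits and the closest-ratio rule are
    assumptions for illustration, not the model's actual preprocessing.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with fewer tiles.
    return min(candidates, key=lambda g: (abs(target - g[0] / g[1]), g[0] * g[1]))


def crop_boxes(cols, rows, tile=640):
    """Boxes (left, top, right, bottom) of each tile once the image is resized to the grid."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


# A wide page and a tall page end up with different grids.
for size in [(1280, 640), (640, 1920)]:
    grid = choose_tile_grid(*size)
    print(size, "->", grid, "crops:", len(crop_boxes(*grid)))
```

Under these assumptions a wide image gets more columns than rows and a tall image the reverse, which is the behaviour the paragraph above describes.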

Figure 2 by Author. SAM encoder.

We start with a CNN layer which converts the image pixels into 768 feature maps; in our toy example, into only 6 feature maps of size (4,4). I marked the output of the first kernel operation in blue and the output of the last kernel operation in yellow. Thus the pixels in the top right corner are represented in blue. We then feed this matrix into 12 sequential transformer blocks, each of which starts with layernorm.

Layernorm is applied over the last dimension (the channel/embedding dim). For each token vector, it subtracts the vector's mean and divides by its standard deviation, then applies a learned weight and bias. Then we continue with Multi-Head-Attention (MhA). However, the MhA operation differs depending on the current transformer block. In every third block (2, 5, 8, 11), the model uses the standard global attention procedure (as depicted at the top of Figure 2), which means attention is calculated between all 16 tokens, giving a 16x16 matrix of attention scores. Remember that our input was (6,4,4), got reshaped to (4,4,6) and then flattened to (16,6), so we have 16 vision tokens, each of dimension 6.

In all the other blocks (0, 1, 3, 4, 6, 7, 9, 10), the model uses window attention, which restricts the attention operation to a local window. In our case we chose a (2x2) window size, so instead of 16x16 attention scores we use 4 batches of size (2,2,6). Each window holds 4 tokens, so the attention calculation is only (4x4), but done 4 times in parallel.

I skip the part after the MhA because that is just the standard 2-layer MLP. After all 12 blocks, we have two additional CNN layers. Don’t get confused: these are still part of SAM. After the SAM encoder come two more CNN layers, and those two are the often-cited 16x compression mechanism. You can see it in Figure 3.
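The difference between global and window attention in the toy example can be counted directly. The following is a minimal sketch in plain Python; the helper names `window_partition` and `attention_score_count` are made up for illustration, and the sketch assumes the grid side is divisible by the window size.

```python
def window_partition(grid, window):
    """Split an H x W grid of tokens into non-overlapping window x window groups."""
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                grid[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows


def attention_score_count(num_tokens):
    """Every token attends to every token, so the score matrix is n x n."""
    return num_tokens * num_tokens


# Toy setup from the article: a 4x4 grid of tokens, labelled 0..15 row by row.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]
windows = window_partition(grid, window=2)

global_scores = attention_score_count(16)  # one 16x16 matrix
window_scores = sum(attention_score_count(len(w)) for w in windows)  # four 4x4 matrices

print(windows)
print(global_scores, window_scores)
```

Tokens in different windows never interact inside a window-attention block. Only the global-attention blocks (2, 5, 8, 11) mix information across the whole grid.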

Figure 3 by Author. From SAM to CNN to CLIP to language decoder.

Figure 3 shows how the SAM encoder output of size (6,4,4) gets further compressed to a single vector of size (12,1,1). Remember that this is the case for our toy example. Usually there are more vectors here, because the input image is much larger than our toy image of 24x24 pixels. Before we pass this vector into the CLIP encoder, we also add the CLIP class embedding as an extra token. The class embedding starts as a learned vector, but as it goes through the transformer blocks it attends to (and is attended by) all image patch tokens. That means its final representation aggregates information from the other embeddings, serving as a global summary.

Notably, the CLIP encoder output is concatenated with the SAM output and then multiplied with a final weight matrix. This whole process was the forward pass for the global image features. We do the same again for the local image features. Finally, we concatenate the global features, the local features, a separator embedding and the language embeddings into one matrix, which is the input to the language decoder.
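The 16x compression in Figure 3 can be sanity-checked with simple shape arithmetic. This is a sketch under stated assumptions: each of the two compression convolutions is taken to use kernel 3, stride 2 and padding 1, and the real-size grids assume a patch size of 16 (so a 1024x1024 view gives a 64x64 token grid and a 640x640 crop gives 40x40). The helper names `conv_out` and `compress` are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (standard formula with floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def compress(grid_side, num_convs=2):
    """Apply the stride-2 convolutions in sequence and return the new grid side."""
    for _ in range(num_convs):
        grid_side = conv_out(grid_side)
    return grid_side


# Grid sides: toy example from the article, then assumed real sizes (patch size 16).
for name, side in [("toy 4x4 grid", 4), ("1024x1024 global view", 64), ("640x640 crop", 40)]:
    out = compress(side)
    ratio = (side * side) // (out * out)
    print(f"{name}: {side * side} tokens -> {out * out} tokens ({ratio}x fewer)")
```

Each stride-2 convolution halves the grid side, so two of them divide the token count by 4 x 4 = 16. That matches the toy example, where the 4x4 grid collapses to a single vector.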

Figure 4 by...
