Understand Vision Language Models

Advanced Deep Learning

Deep learning is a subset of machine learning focused on training artificial neural networks to automatically learn and extract hierarchical representations from data.

How AI Sees and Reads: Visualising Vision Language Models

Follow the complete journey of pixels and tokens through the data flow and its tensor transformations. Understand the context window, Multi-Head Attention, Grouped-Query Attention, and Sliding-Window Attention. Explore the autoregressive nature of these models as well as their spatial reasoning limitations.

Frederik vom Lehn

11 min read · Oct 27, 2025

This article is a technical deep dive that focuses on visualising the information flow. In my opinion, the following illustrations help in understanding the research papers on architectural improvements. I assume a basic understanding of Transformers, Softmax, and the CLIP encoder, but I will also link corresponding articles for a refresher. In the later sections, I summarize current research on spatial reasoning issues, bag-of-words tendencies, and the large L2 norms of vision tokens.

The following illustrations are based on the LLaVA model [1], one of the first open vision language models, and aim to represent the architecture accurately. I chose this model because many researchers use it in their papers, and most vision language models work in a similar fashion. For the sake of simplicity, however, I use my own toy dimensions, so all parameters such as patch_size, hidden_dimensions, and head_size are much smaller than in the original model.

Figure 1

Figure by Author. Understanding the information flow in vision language models, including Multi-Head Attention and the context window.

We start simple. Most vision language models have a vision encoder and a language decoder. Let’s start with Figure 1. We have one input image and the input text “Describe this Image”.

Image Input. First, the input image is divided into patches. Each patch covers a square block of pixels (usually 14² or 16²). In our example, we assume a patch size of 4, so each square patch covers 16 pixels in total. Each patch is then flattened into a vector of size (16); these are the image embeddings. Normally, this also includes the 3 RGB colour channels, which would give a dimension of 16×3. For the sake of simplicity, we assume just 16 pixels. We have 9 patches/embeddings in total.

# In LLaVA, each image is resized to 336x336 pixels, which always yields 576 patches.
# Patch size is 14x14 pixels per square
image_length: 336/14=24
image_width: 336/14=24
nr_patches: 24*24=576

# Raw patch dimension in LLaVA (p²*colour_channels)
p*p*3: 14*14*3=588

We concatenate them, add a CLS token (which represents the full image), and then pass the new matrix of (10, 4) into a vision encoder, such as CLIP. The original LLaVA model [1] discards the CLS token, but we keep it in our explanation. In most cases, the vision encoder is a vision transformer, which converts the vision embeddings from 4 (588 in LLaVA) to 5 (1024 in LLaVA) dimensions. Each embedding is now also contextualised and contains information about visual features such as colour and texture.

There are different vision encoders; the most popular ones are CLIP (OpenAI), SigLIP (DeepMind), SAM and DINO (Meta). In addition, more recent but less well-known encoders such as AIMv2 by Apple [11] or Perception Encoder (PE) by Meta [12] outperform models such as CLIP and SigLIP.
All of them use a vision transformer as backbone.

It is crucial to map the vision embeddings into the same space as the text input, so that the decoder, which generates language, can process both. For that we use a simple Multi-Layer Perceptron (MLP) with two layers, which converts each vision embedding from dimension 5 (1024 in LLaVA) to the language dimension 6 (4096 in LLaVA).

Text Input. Our text “Describe this Image” is first tokenised. In our example we use the token IDs (6443), (495), (3621), which depend on the tokenizer used. With those token IDs we can look up the corresponding word embeddings in a matrix where each row corresponds to one token of the full vocabulary. LLaVA 1.5 uses Llama 2 [2] as its language model, which was trained with a fixed vocabulary of 32000 tokens. Current language models use larger vocabularies, such as 151k for Qwen3 or 202k for Llama 4.

Thus, the lookup matrix has a shape of (32000, 4096) for Llama 2 [2]. Instead of 4096, I am using a toy dimension of only 6, so the three words are represented by a matrix of size (3, 6). I am ignoring additional special tokens.

Concatenating Text and Image Embeddings. Since our image tokens now have the same dimension as the text tokens, we can simply concatenate the text tokens to our image...
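To make the shapes above concrete, here is a minimal sketch in plain Python that traces the toy dimensions from pixels to the combined sequence. It is not LLaVA's implementation, and several parts are assumptions made for illustration: the helper names (patchify, matmul, random_matrix, shape) are hypothetical, the weights are random, the vision encoder is replaced by a single linear map, the CLS token is a zero placeholder, the raw patch vectors have 16 values as implied by a patch size of 4, the projector uses a GELU activation without biases, and the image tokens are placed before the text tokens.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the random toy weights are reproducible

def shape(matrix):
    """Return (rows, cols) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))

def random_matrix(rows, cols):
    """Random stand-in for image pixels or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def gelu(x):
    """Exact GELU; the choice of activation is an assumption of this sketch."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def patchify(image, patch_size):
    """Cut a square single-channel image into flattened patch vectors."""
    side = len(image)
    return [
        [image[top + r][left + c] for r in range(patch_size) for c in range(patch_size)]
        for top in range(0, side, patch_size)
        for left in range(0, side, patch_size)
    ]

# LLaVA numbers quoted in the text
grid = 336 // 14               # 24 patches per side
llava_patches = grid * grid    # 576 patches per image
llava_patch_dim = 14 * 14 * 3  # 588 raw values per RGB patch

# Toy dimensions from the text
PATCH_SIZE, VISION_DIM, TEXT_DIM, VOCAB_SIZE = 4, 5, 6, 32000

# Image branch: 12x12 pixels -> 3x3 grid of 4x4 patches
image = random_matrix(12, 12)
patches = patchify(image, PATCH_SIZE)          # (9, 16)
cls_token = [0.0] * (PATCH_SIZE * PATCH_SIZE)  # placeholder CLS row
encoder_input = [cls_token] + patches          # (10, 16)

# Stand-in for the vision encoder (assumption: one linear map, not a real ViT)
vision_tokens = matmul(encoder_input, random_matrix(16, VISION_DIM))  # (10, 5)

# Two-layer MLP projector into the language dimension
hidden = matmul(vision_tokens, random_matrix(VISION_DIM, TEXT_DIM))
hidden = [[gelu(v) for v in row] for row in hidden]
image_tokens = matmul(hidden, random_matrix(TEXT_DIM, TEXT_DIM))      # (10, 6)

# Text branch: look up one embedding row per token ID
embedding_table = random_matrix(VOCAB_SIZE, TEXT_DIM)   # (32000, 6)
token_ids = [6443, 495, 3621]
text_tokens = [embedding_table[i] for i in token_ids]   # (3, 6)

# Concatenate along the sequence axis (assumed order: image first, then text)
sequence = image_tokens + text_tokens
print(shape(patches), shape(image_tokens), shape(text_tokens), shape(sequence))
# prints: (9, 16) (10, 6) (3, 6) (13, 6)
```

The point of the sketch is only the shape bookkeeping: once the projector has mapped every vision token to the language dimension, image and text tokens are rows of the same width and can be stacked into one sequence for the decoder.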
