DeepSeek-V4 Can't Read Images? I Made It Read
Introduction<br>Have you ever had this frustrating moment? You are coding with deepseek-v4 in OpenCode, your code throws an error, you want to screenshot it and send it to DeepSeek, and then you remember that DeepSeek cannot read images.<br>I have to say deepseek-v4 is cheap, easy to use, and has a long context. It has already become my main coding model. But as of mid-June, DeepSeek still hasn't released a multimodal version. That means it cannot do anything involving images, such as reading error screenshots, interpreting charts, or recreating pages from visual designs.<br>I am not the only one frustrated. My friends are all waiting eagerly too.<br>But I found a way: I developed a small OpenCode plugin called observer that lets deepseek-v4 call a multimodal agent, so it gains the ability to read images indirectly.<br>After more than a month of polishing, this plugin now handles all image-related coding tasks in my daily work. Today, I will share how I built it, hoping it can help you too.<br>The plugin code and agent definitions mentioned in this article are at the end. Feel free to grab them.<br>Demo of Real-World Results<br>Before diving into the long tutorial, you probably care most about how well this plugin works and whether it is worth your time to try. So let me show you some screenshots of the plugin in action.<br>1. Interpreting error stack traces<br>We start with the simplest task: have deepseek-v4 interpret a screenshot of an error stack trace and find the key information. I picked a screenshot of an error I encountered at work:<br>A screenshot of a common error stack. Image by Author<br>Then in OpenCode Desktop, I sent this image to the plan agent using deepseek-v4-pro and asked it to provide a solution:<br>The deepseek-v4-pro agent quickly picked up the error message. Image by Author<br>As you can see, the plan agent gave an answer based on the information in the screenshot.<br>2. 
Interpreting charts<br>Another multimodal use case is interpreting charts from documents. For this example, I took a screenshot of a company's annual revenue chart and tested it. I again used the plan agent with deepseek-v4-pro. For an extra challenge, I asked the agent to give some key insights on the numbers in the chart:<br>A screenshot of a listed company's financial report. Image by AlphaStreet<br>The agent read the numbers from the chart and provided some key insights:<br>The agent accurately spotted the data in the chart and offered key insights. Image by Author<br>3. Developing HTML pages from designs<br>In frontend development, the biggest demand for multimodal capability is recreating visual designs. Here I found a design with complex page elements to see if the build agent using deepseek-v4-flash could recreate the page:<br>A screenshot of a web design draft. Image by dribbble.com<br>Here is the recreated page:<br>The page that deepseek-v4-flash recreated. Image by Author<br>One thing is certain: the deepseek-v4-flash model generated the frontend code, and it took only one prompt to get this result. It is not a 100% match, but with a few more rounds of conversation, you can tweak it until it is right. Keep in mind that deepseek-v4-flash is dirt cheap.<br>It costs several times less, in some cases ten times less, than multimodal models like kimi k2.6 or qwen3.7 plus. They are not in the same league.<br>Of course, you can also crop a section of the page, mark the areas that need attention, and ask DeepSeek to adjust them, like this:<br>You can take a screenshot of the webpage and have deepseek-v4-flash make adjustments. Image by Author<br>The agent perceives the marked area and gives the primary agent an adjustment plan based on your request.<br>4. Generating HTML pages from hand-drawn sketches<br>Maybe you are like me and have zero design skills. No problem. We can hand-draw rough sketches, and the agent can understand them. 
For example, in a recent project, I hand-drew a few web page design sketches:<br>This is a hand-drawn sketch of the webpage. Image by Author<br>Then deepseek-v4-flash helped me recreate the page:<br>The agent recreated the page based on my hand-drawn reference. Image by Author<br>Impressive, right?<br>Detailed Implementation Walkthrough<br>I know you cannot wait any longer. Let me jump straight into the implementation details.<br>The whole image-reading plugin consists of two parts:<br>A sub-agent configured with a multimodal LLM. It runs in a separate sub-session, reads the images uploaded by the user, parses them into detailed text descriptions based on the scenario, and returns the results to the DeepSeek model in the main session.<br>An OpenCode plugin that intercepts images uploaded by the user, saves them as files, and triggers the sub-agent to read the images at the right time.<br>In other words, the plugin is the "dispatcher," and the sub-agent is the "image reader." They work together through independent sub-sessions, so the main session's context stays clean.<br>Let me start with the design of the agent.<br>Designing the image-reading agent<br>Since the source...