Local GitHub Copilot with Lemonade Server on Windows
3 months ago by Adam Cooper, 6 min read
This guide is for Windows; there's a Linux version here.<br>Perhaps you, like me, saw the specs for AMD Ryzen AI Max (Strix Halo) processors and thought 'cool, I can run a local LLM as a coding assistant on a general-purpose PC'. Then, once it arrived, you looked at it and thought 'actually, I have absolutely no idea how to do that'.<br>So here's the quickstart guide that I wish I'd had when I first unwrapped my new Framework Desktop.<br>1. Prerequisites<br>A Strix Halo - I'm assuming you've already completed this step; if not, let's pause while you go shopping and wait for delivery.<br>Windows 11.<br>VS Code - Again, I assume you've got this installed and set up.<br>GitHub Copilot - You'll need the Copilot Chat extension installed and working, which in turn means signing up for at least the free plan.<br>2. Get AMD Adrenalin<br>Adrenalin is the AMD equivalent of Nvidia's GeForce software: a bloated mess of marketing dark patterns that is also a convenient utility for updating drivers and managing configuration. If your manufacturer didn't preinstall it (it didn't come with my Framework Desktop), you can grab it from the link above. Next, make sure you have the latest chipset driver and software versions via Settings > System > Manage Updates.<br>For me that's AMD Software: Adrenalin Edition Version 26.1.1 and AMD Ryzen Chipset Driver 7.11.26.2142.<br>3. Configuring CPU/GPU Memory split<br>The Strix Halo has a unified memory architecture, which means that its (up to) 128GB of memory is available to both the CPU and the GPU. Kind of: on Windows, up to 96GB of it can be reserved for the GPU.
You can configure the memory split in Adrenalin via Performance > Tuning > Variable Graphics Memory.<br>It's tempting at this point to head for the Custom option and reserve 96GB for the GPU, but that can actually cause problems loading models, as explained in the Lemonade FAQ:<br>"On Windows, the GPU can access both unified RAM and dedicated GPU RAM, but the CPU is blocked from accessing dedicated GPU RAM. For this reason, allocating too much dedicated GPU RAM can interfere with model loading, which requires the CPU to access a substantial amount [of] unified RAM."<br>So instead let's set the Dedicated Graphics Memory/Remaining System Memory to a 64GB/64GB split. With this configuration the GPU can still use up to 96GB, but we avoid starving the CPU of memory.<br>4. Get Lemonade<br>What's that?<br>Lemonade Server is an open source project from AMD that bundles everything you need to run LLMs locally. It also includes an OpenAI-compatible HTTP API, plus a web UI and a CLI for downloading, loading and unloading models and for checking server stats and status.<br>Install<br>We can download an MSI installer from lemonade-server.ai, but I prefer to install it using winget.<br># Install Lemonade<br>winget install AMD.LemonadeServer
# Reload PowerShell to refresh your path<br>pwsh<br>Now we should have the lemonade CLI available; let's confirm by checking its version.<br>lemonade -v<br>Download a model<br>Great, let's use the pull command to download a model. We're going to start with Qwen3-Coder-30B-A3B-Instruct-GGUF. It's not the most powerful or modern model, but it's coding focused, a reasonable size (~18GB) and supports tool calling, which we need for Copilot.<br>lemonade pull Qwen3-Coder-30B-A3B-Instruct-GGUF<br>OK, that's going to take a while to download, so let's take a moment to talk about a couple of important concepts.<br>Context Size<br>This is the maximum number of tokens that the model can process at any one time; you can think of it as the model's working memory. It's measured in tokens, each of which maps to roughly 4 characters of English text, and it covers both the input you send to the model and the response it returns. As the context grows, so do the processing and memory costs, which means more latency; however, a larger context allows the model to provide better quality responses. We can control the maximum context size when running a model, as well as setting it globally for the Lemonade server.<br>Modality, Recipes and Backends<br>Lemonade supports several modalities (types of data processing), currently including Text generation, Speech-to-text, Text-to-speech and Image generation. For coding assistance we're interested in the Text generation modality. For each modality there are one or more recipes available, and each recipe is supported by one or more backends which will...
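To make the Context Size rule of thumb above a bit more concrete, here's a minimal Python sketch that estimates whether a prompt plus the expected response will fit in a given context. It isn't part of Lemonade or Copilot: the function names and the example sizes (4096 and 1024) are made up for illustration, and the 4-characters-per-token figure is only the approximation mentioned above, so a real tokenizer will give different counts.

```python
import math

# Rule of thumb from the text: roughly 4 characters of English per token.
# This is an approximation; real tokenizers vary by model and by content.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for a piece of English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(prompt: str, max_response_tokens: int, context_size: int) -> bool:
    """Check that the input and the response together fit in the context.

    The context has to hold both what we send and what the model returns,
    so we add the two before comparing against the limit.
    """
    return estimate_tokens(prompt) + max_response_tokens <= context_size


if __name__ == "__main__":
    # Hypothetical prompt and sizes, purely for illustration.
    prompt = "Refactor this function to use async/await. " * 100
    print("Estimated prompt tokens:", estimate_tokens(prompt))
    print("Fits:", fits_in_context(prompt, max_response_tokens=1024, context_size=4096))
```

An estimate like this is only useful for a quick sanity check, for example to get a feel for how quickly a few large source files will eat into the context you've configured.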