Learnings from training a frontier font generation model - Mixfont Blog<br>A font that looks like cheddar cheese<br>I recently finished the training work for the font generation model that is the technology behind Mixfont. If you've followed the blog, you might already know that I set out on this journey two months ago, after training my first model for identifying known fonts from images. Because that project was a success from a learning perspective, I decided to take on the much more ambitious project of training a model that can generate fonts from an image or a text prompt. In this post, I'll dive into some of the learnings from the whole process, and I'll share some results from the model at the end.<br>So, fonts?!<br>The first surprising thing about this whole endeavor was that there wasn't already an AI model for fonts. It seems like there's an AI model for everything - images, video, music, code, 3D assets, SVGs, and so on. So I was pretty surprised when I first started doing the research that there wasn't anything that could go directly from a text prompt or an image to a functional, working TTF font file. Even though the leading image models can now generate text reliably, isolating and editing that text after the fact is still a pain.<br>Even though the market for fonts might not be as large as the market for something like images, I have personally been interested in this problem for a long time. Fonts are a really creative medium, and it's always been interesting to me how the same set of letters can be rendered (and understood) in so many different decorative forms. Fonts are also pretty difficult to make - each glyph needs to be designed separately but share a common style and format. Then, when assembling the glyphs into a full font file, there are challenges around creating a common baseline and letter size, not to mention getting kerning and spacing right.<br>A font generator model seemed to check a lot of boxes for me. 
It was a fun, creative problem, it had a lot of technical challenges in areas where I wanted to grow, and finally, it seemed relatively possible to do. Or so I thought at the beginning.<br>Surprising Learning Lessons<br>Letters contain many surprising forms<br>When you think about generating a font file, you might think it’s as simple as teaching a model the style of each individual letter. But one of the first things that came up was just how many variations there are of individual letters, even the basic English ones. Our brains just happen to read them naturally, but when it comes to creating them, there's a bit of nuance for the model to understand. Take a look at these letters - they are all different forms of the letter g (upper and lowercase), and all of them need to be accounted for by a model that can generate fonts in different styles.<br>All these shapes for the letter g<br>One major challenge that I stumbled across basically on day one was figuring out the difference between the single-story ɑ that is more common in handwriting and the double-story a that is used in many fonts. Separating these glyphs and making sure the model saw enough samples of each form was a significant challenge that made the problem a lot harder than I thought it would be.
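To make the "enough samples of each form" problem concrete, here is a minimal sketch of one common way to handle it: inverse-frequency sampling, where rarer glyph variants are drawn more often. This is an illustration only, not a description of how the Mixfont model was actually trained. The variant labels (such as `a.double_story`) and both function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def variant_sampling_weights(labels: Sequence[str]) -> List[float]:
    """Return one weight per sample so every variant gets equal total weight.

    `labels` holds a (hypothetical) variant label for each training sample,
    e.g. "a.double_story" or "a.single_story". The weights sum to 1.
    """
    counts = Counter(labels)
    n_variants = len(counts)
    # Each variant gets 1 / n_variants of the total weight, shared equally
    # among its samples, so rare variants get a larger per-sample weight.
    return [1.0 / (n_variants * counts[label]) for label in labels]


def sample_balanced(samples: Sequence, labels: Sequence[str], k: int, seed: int = 0) -> list:
    """Draw k samples (with replacement) using the balanced weights."""
    rng = random.Random(seed)
    weights = variant_sampling_weights(labels)
    return rng.choices(list(samples), weights=weights, k=k)
```

With three double-story samples and one single-story sample, the single-story sample carries half of the total weight, so each form is drawn about equally often.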
Data availability and cleanliness are everything<br>Related to the above, having high-quality data in reproducible, normalized formats was essential to the training process. Luckily, there are many fonts available on the internet, many of which are free and open source, like Google Fonts. However, just having the font files themselves was not enough. Many fonts don’t support certain characters (you’d be surprised at how many fonts simply don’t render a symbol like the ampersand). Many fonts also render a fallback character (like a blank box or a logo) instead of leaving an unsupported character empty, so it was important to ensure that the data going into the model was not contaminated and was as high quality as possible.
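As a rough illustration of this kind of filtering, the sketch below checks each wanted character against a font's character map and rejects anything that is unmapped, empty, or drawn with the same shape as the fallback glyph. It is a simplified sketch under stated assumptions, not the actual Mixfont pipeline. The function name `filter_glyphs` and the `outlines` structure are hypothetical. In practice the character map could come from a library such as fontTools (for example `TTFont(path).getBestCmap()`), and `.notdef` is the conventional name of the fallback glyph in OpenType fonts.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def filter_glyphs(
    cmap: Dict[int, str],
    outlines: Dict[str, Hashable],
    wanted: Iterable[str],
    notdef_name: str = ".notdef",
) -> Tuple[List[str], Dict[str, str]]:
    """Split `wanted` characters into usable ones and rejected ones.

    cmap:     code point -> glyph name (assumed to be read from the font).
    outlines: glyph name -> any comparable summary of the drawn shape,
              e.g. a tuple of points (hypothetical representation).
    Returns (usable characters, {rejected character: reason}).
    """
    notdef_outline = outlines.get(notdef_name)
    usable: List[str] = []
    rejected: Dict[str, str] = {}
    for ch in wanted:
        glyph = cmap.get(ord(ch))
        if glyph is None:
            rejected[ch] = "unmapped"
        elif glyph == notdef_name:
            rejected[ch] = "maps to fallback glyph"
        else:
            outline = outlines.get(glyph)
            if not outline:
                # Missing or empty shape: nothing would be drawn.
                rejected[ch] = "empty outline"
            elif notdef_outline is not None and outline == notdef_outline:
                # Drawn exactly like the fallback box or logo.
                rejected[ch] = "same shape as fallback glyph"
            else:
                usable.append(ch)
    return usable, rejected
```

Recording a reason for each rejected character, rather than silently dropping it, makes it easier to audit why a given font contributed fewer glyphs than expected. Note that whitespace characters legitimately have empty outlines, so this check is only meant for visible characters.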
Oh, don't mind me, I'm just a tiny local folder of data<br>A surprising learning was that the training techniques and general scripting behind a training run seem relatively settled by now. In other words, it seems that no matter what kind of model you are training - whether it’s the next Claude or a small classifier - the techniques are fairly standardized, and the quality of the resulting model depends almost entirely on the quality and quantity of the input data. Understanding this through trial and error made me much less intimidated by the training process.
GPU availability is scarce (and used as a moat for large labs)<br>Even if you have scripts set up and a large amount of high quality data, there is one more crucial piece to any foundational training run - access to GPUs. GPUs are essential to training runs because the “trained model” in the end is, at its core, a set of weights that defines how to generate the fonts. Creating this weights file takes a...