Tagging Blog Posts with BERTopic and LLMs

✰Vicki Boykis✰ | May 18 2026

I recently added tags to my blog using BERTopic and a mix of LLMs. You can see the tags in the sidebar to the right (or in the footer on mobile).

I tried this before in 2023, with GGUF Mistral using llama-cpp, but never finished the project. Now, because the models have gotten so good, and my project was small, relatively well-defined, and easy to evaluate, it took me about 6-10 hours over a month, using BERTopic, Claude Code, and Pi with Deepseek. Why so many different AI tools? Mostly to evaluate their different ways of working. Much of that time was spent noodling on the UX of the tags rather than iterating on the tags themselves.

One of the genuinely useful use-cases of LLMs these days is finishing personal projects that don’t touch production and have a small surface area that’s personalized for you. In other words, as Robin Sloan wrote, an app can be a home-cooked meal.

I love having a static site because of how easy it is to write and publish content and how fast it loads, but sometimes I wish it were slightly more fully-featured. LLMs have allowed me to add features like search. The theme I use, Hugo Bear Blog, already has support for tags, but I’d never added them to posts, and I also wanted a slightly different way to visualize them.

We consumers generally use LLMs for text or image generation or, if we’re developers, for code generation. But one of the most underrated features of LLMs is the ability to compress rather than generate. This is really unsurprising: LLMs are, after all, natural language models. Since they were originally trained and fine-tuned on language modeling tasks, they also perform really well at the tasks language models are meant for, such as summarization, information retrieval, and question answering. LLMs are really good at labelling things.
That is, they’re good at topic modeling, the machine learning task behind tagging, especially in a zero-shot context (where they have no previous training data from you specifically).

What were tags?

In the early days of online blogging, tags were important for facilitating content discovery. People initially started tagging posts on their individual blogs. Eventually, site aggregators like Delicious surfaced top links by aggregating tags across the links users shared. Pinboard was another prominent platform where finding content through tags and browsing aggregated tags was an important feature. Coincidentally, Delicious was later acquired by Pinboard. Early on, the best way to find things that you liked was to manually participate in the curation of these folksonomies.

Twitter and Tumblr developed some of the most creative folksonomic tagging systems. On Tumblr, tags became a way to not only discover posts, but to have conversations with other people about your post. On Twitter, hashtags became a way to signal communal discovery of people with shared interests before the implementation of Twitter’s SimClusters algorithm. Tagging and hashtags served as an implicit contract of content discovery for over a decade, across Twitter, Instagram, TikTok, and many other services.

However, big social has been in big decline for a while now. Group chats have arisen as a medium for exchange and discovery, as have Discord groups, which rely less on traditional tagging mechanisms. Bluesky, a social platform that was founded more recently, has the ability to add hashtags to posts, but most folks don’t do so. Discovery happens, as on many platforms today, through starter packs and custom algorithmic feeds.

The rise of LLMs led to an even greater decrease in the power of individual websites to add signal.
An increase in AI overview features in search results, which offer either summarization or RAG-assisted source synthesis, has meant that visits to actual websites are dropping faster than ever. With LLMs, the rise of semantic and blended agentic-style search as a discovery mechanism means tags are not as important.

Within blogging, content surfaces like (oh God) LinkedIn native posts and X’s longform posts are contributing to platform-specific lock-in. All of our public blog content is being scraped as training data anyway, and agentic search and RAG mean people access content through an LLM’s interpretation of it rather than going directly to a page. RSS still exists, but who is going to syndicate a site when they could write an article on X or LinkedIn? (Me, sure.) You can subscribe to my feed!

Blog tags as a mechanism for understanding what my blog is about realistically still probably matter only to me. But I still want them!

Historical approaches to tagging with LDA

Generally, synthesizing and detecting topics across a body of text is an unsupervised learning machine learning...
