1.2M Messages to Obsidian - Building a Relationship Map from 20 Years of Chat History
Am I a Bad Friend?
I analysed 20 years of my chats and turned 1.2M messages into a structured vault of my life - to win friends and influence people. Instead, I learnt things about my emotional bandwidth, endearment cycles, and friendship half-lives that I'd rather not have.
27 May 2026 · MLX, data analysis, LLMs, 2nd brain
In 2014, Tim Urban of WaitButWhy published Your Life in Weeks - a grid where each square is one week of one's life, and most of the grid is already filled. The image bothered me for years. I started tracking things partly because of it - I wanted the grid to mean something, not just count down. But the biometric data is an odd representation of how fulfilling my life has been. The grid suggests it's the events that matter - jobs, trips, schools, marriages - and those are easy to mark. But they hardly tell how I felt during those weeks, or what I was like to the people around me. That was what I wanted to measure.
So I tried journaling. Paper first, then text files, then daily notes in Obsidian. The journal captured what I thought was important on the day I wrote it. It missed the conversations I forgot to jot down and the slow-moving patterns I couldn't see at the time.
My notes and their connections growing over the years.
Tired of being bad at maintaining relationships[1] and wanting the data to compensate, I set off on a quest to build a personal CRM of sorts, based on the record rather than on memory - thanks to the trail left by my prolific time-wasting on the Internet over the past few decades.
[1] Not bad per se - I just procrastinate a lot. Once I learnt to shoot and stalk deer because I wanted to cook a steak - and cooking is way easier than human interactions.
My digital history
My online presence breaks into roughly three eras:
ICQ, IRC, DC++ in the 2000s: midnight channels for script kiddies and banter - all gone, and probably for the best. The ten-year-old I was in those chats doesn't need a structured archive.
VK[2], Twitter, Facebook in the 2010s: school, university, early career - evenly spread.
Instagram and Telegram in the 2010s-2020s: surprisingly, even though I don't post much on Instagram, it's often easier to catch up with people in DMs, and more and more people are swapping WhatsApp for Telegram too.
[2] A now-obscure social network, popular in the post-Soviet space in the noughties. I haven't been to Russia for a decade or so, but the archives going back to 2008 are still there. Gotta love totalitarian states, eh?
Armed with GDPR and data access laws, I got myself archives with all my messages, reactions, and social graphs.
Data archives
Parsing a bunch of JSONs and HTMLs wasn't hard but wasn't fun either. Instagram double-encodes Cyrillic through latin-1. Telegram assigns different internal message IDs between exports taken at different dates. Facebook introduced E2E encryption at some point, so the same messages show up in three different folders. Telegram lets you export group chats or just your own messages. VK exports everything without asking. Instagram doesn't differentiate between broadcasts and personal chats at all.
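The double-encoding quirk is the classic mojibake case: UTF-8 bytes that were decoded as latin-1 somewhere along the way. A minimal sketch of the usual reversal, assuming each string is either fully garbled or already clean (`fix_mojibake` is a hypothetical helper name, not part of any export tooling):

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as latin-1."""
    try:
        # Map each character back to its original byte, then decode properly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as latin-1 bytes, or not valid UTF-8 afterwards:
        # assume the text was never garbled and leave it alone.
        return text


# Simulate what the export does to a Cyrillic string, then repair it.
garbled = "Привет".encode("utf-8").decode("latin-1")
print(fix_mojibake(garbled))
```

The fallback matters: running the fix over text that is already clean Cyrillic raises on the `encode` step, so the function returns the input unchanged instead of corrupting it.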
Once parsed into a uniform tab-separated format, the five exports produce different kinds of signal. Telegram and VK are mostly DMs. Instagram adds story interactions and a follower graph. Twitter is its own thing: standalone tweets are a publication corpus, DMs are half support requests and half conference coordination, so I needed the reply/mention graph to catch real signals.
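To make "uniform tab-separated format" concrete, here is a rough sketch of what one normalised record could look like. The field names are assumptions for illustration, not the actual schema. Since platform message IDs are not stable between exports, the sketch also derives a key from the content itself, which is one possible way to merge overlapping exports:

```python
import hashlib
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Message:
    # Hypothetical columns; the real schema may differ.
    platform: str
    chat: str
    sender: str
    timestamp: str  # ISO 8601 string
    text: str


def flatten(value: str) -> str:
    # Collapse tabs, newlines, and other whitespace runs into single spaces
    # so that every message stays on exactly one TSV line.
    return " ".join(value.split())


def to_tsv_line(msg: Message) -> str:
    return "\t".join(flatten(field) for field in astuple(msg))


def content_key(msg: Message) -> str:
    # Key built from content rather than from platform-assigned IDs.
    raw = "\x1f".join(astuple(msg))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


msg = Message("telegram", "family", "me", "2020-01-01T10:00:00", "line one\nline\ttwo")
print(to_tsv_line(msg))
```

Two copies of the same message from different exports then share a `content_key` even when their platform IDs differ, provided timestamps and sender names are normalised the same way.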
I wanted to capture a daily note per conversation-day, a profile per person, a stub per place, a life timeline, and whatever else surfaces - recipes, cocktails, meeting notes.
Drowning in noise
Before worrying about classification, you have to deal with the fact that most of the data is noise.
In my longest thread - 486,000+ messages with my partner across ten years - the content is 2.4% links, 9.1% media, 1.5% emoji-only messages, 28.4% short fillers, and 58.7% substantive text. That means roughly 41% is noise for the purpose of this exercise. Emojis, links, and media were easy to filter, but catching conversational fillers - short words that look like content until you see them hundreds of times per month - is harder.
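A breakdown like this can be produced by labelling each message with one category and counting. The sketch below is a rough illustration, not the exact rules used here: it assumes each parsed record carries a `has_media` flag, treats a message as a link only if it is a bare URL, and uses a crude Unicode-category check for emoji that will also match symbols such as ©.

```python
import re
import unicodedata
from collections import Counter

URL_RE = re.compile(r"https?://\S+")


def is_emoji_only(text: str) -> bool:
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    for c in chars:
        if unicodedata.category(c) == "So":          # most emoji are "Symbol, other"
            continue
        if c in ("\u200d", "\ufe0f"):                # zero-width joiner, variation selector
            continue
        if "\U0001F3FB" <= c <= "\U0001F3FF":        # skin-tone modifiers
            continue
        return False
    return True


def classify(text: str, has_media: bool, fillers: frozenset) -> str:
    stripped = text.strip()
    if has_media:
        return "media"
    if URL_RE.fullmatch(stripped):
        return "link"
    if is_emoji_only(stripped):
        return "emoji"
    if stripped.lower() in fillers:
        return "filler"
    return "text"


def shares(labels) -> dict:
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


sample = [("https://example.com/a", False), ("", True), ("😂😂", False),
          ("haha", False), ("he died", False)]
labels = [classify(t, m, frozenset({"haha"})) for t, m in sample]
print(shares(labels))
```

The order of checks encodes a priority: a photo with a caption counts as media, and a filler word is only tested once the cheaper categories have been ruled out.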
My first idea was filtering out all messages shorter than three words, but there is a lot that can be said in two (he died, we lost, etc). Building a denylist of hahahas and noices didn't work either, especially across languages.
What worked was sampling from five offset positions across the chat, frequency-counting every short token, reviewing the top 80 manually, and pairing the resulting denylist with a protected set of short messages that are life events.
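The procedure can be sketched roughly as follows. The window size, the two-word cutoff for "short", and all function names are assumptions made for illustration. The manual review step stays manual: the code only produces the candidate list.

```python
from collections import Counter


def sample_windows(messages, n_windows=5, window=2000):
    """Take n_windows contiguous slices at evenly spaced offsets."""
    if len(messages) <= n_windows * window:
        return list(messages)  # small chat: just use everything
    step = (len(messages) - window) // max(n_windows - 1, 1)
    sampled = []
    for i in range(n_windows):
        start = i * step
        sampled.extend(messages[start:start + window])
    return sampled


def normalise(text: str) -> str:
    return " ".join(text.lower().split())


def filler_candidates(messages, max_words=2, top=80):
    """Most frequent short messages, to be reviewed by hand."""
    counts = Counter(
        norm
        for norm in map(normalise, messages)
        if norm and len(norm.split()) <= max_words
    )
    return counts.most_common(top)


def is_filler(text: str, denylist: set, protected: set) -> bool:
    # The protected set always wins, so "he died" survives even if
    # it was added to the denylist by mistake.
    norm = normalise(text)
    return norm in denylist and norm not in protected


chat = ["haha", "Haha", "ok", "he died", "see you at the station at nine"] * 2
print(filler_candidates(sample_windows(chat), top=3))
```

Sampling from several offsets rather than from the start of the chat matters because filler vocabulary drifts over the years; a denylist built from one period would miss the fillers of another.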
Across all platforms and years, the cleaned corpus contains roughly 52,000 unique lemmas. The novelty rate - the share of words I hadn't used before in any chat - has...