Making a metasearch engine
1/14/2024
In 2020, tired of every search engine seemingly having suboptimal results and missing the instant answers I wanted, I decided to make a search engine for myself.
I knew making a general-purpose web search engine from scratch by myself was infeasible, so instead I opted to make a metasearch engine, which aggregates results from other web search engines.
First I tried forking Searx, but it was slow and the old Python codebase was annoying to work with.
So instead of forking an existing project, I made my own (but with several ideas borrowed from Searx) in Node.js, which I called simply “metasearch” (very unique name).
I used it as my primary search engine for over a year, but it was slow (mostly due to it being hosted on Replit and being written in JS) and brittle, to the point where at the time of writing the only working search engine left is Bing.
A few weeks ago I decided to rewrite metasearch as (brace for it) metasearch2 (my project names only continue to get more original).
In this rewrite I implemented several of the things I wish I’d done when writing my first metasearch engine, including writing it in a blazingly fast 🚀🚀🚀 language.
There’s a hosted demo at s.matdoes.dev, but I’d much rather you host it yourself so I don’t start getting captcha’d and rate-limited.
This blog post will explain what you should know if you want to make a metasearch engine for yourself.
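To make "aggregates results from other web search engines" a bit more concrete, here is a minimal sketch of the merging step. It is not how Searx or metasearch2 actually rank results; the function names (`normalize_url`, `aggregate`), the per-engine `weights`, and the `1 / (position + 1)` scoring are all assumptions chosen for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Collapse trivial differences so the same page from two engines matches."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")  # needs Python 3.9+
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def aggregate(results_by_engine, weights=None):
    """Merge per-engine ranked URL lists into one ranked list.

    results_by_engine: {"engine": [url, ...]}, each list in that engine's order.
    weights: optional {"engine": float}; engines not listed default to 1.0.
    Returns a list of (url, score, engines) tuples, best score first.
    """
    weights = weights or {}
    scores = defaultdict(float)
    engines = defaultdict(list)
    first_seen = {}  # normalized key -> first original URL we saw for it

    for engine, urls in results_by_engine.items():
        weight = weights.get(engine, 1.0)
        for position, url in enumerate(urls):
            key = normalize_url(url)
            first_seen.setdefault(key, url)
            # Assumed scoring: higher positions contribute more, and a page
            # returned by several engines accumulates score from each of them.
            scores[key] += weight / (position + 1)
            if engine not in engines[key]:
                engines[key].append(engine)

    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    return [(first_seen[k], scores[k], engines[k]) for k in ranked]


# Hypothetical input: two engines that agree on their top result.
merged = aggregate(
    {
        "google": ["https://example.com/a", "https://example.org/b"],
        "bing": ["https://www.example.com/a/", "https://example.net/c"],
    },
    weights={"bing": 0.5},
)
for url, score, found_by in merged:
    print(f"{score:.2f}  {url}  ({', '.join(found_by)})")
```

The point of the sketch is that a result returned by more than one engine ends up above results that only a single engine returned, and a per-engine weight lets you trust some engines less than others.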
Other (meta)search engines
First, some prior art.
The metasearch engine most people know is probably Searx (now SearxNG), which is open source, written in Python, and supports a very large number of engines.
It was the biggest inspiration for my metasearch engine. The main things I took from it were how the engines behind each result are shown on the search page and its ranking algorithm.
However, as mentioned previously, it’s slow and not as hackable as their readme would like you to think.
The (probably) second most well-known metasearch engine is Kagi, which sources its results from its own crawler, Google, Yandex, Mojeek, Marginalia Search, and Brave (I’ll talk about these search engines later).
One interesting feature Kagi has that users seem to appreciate is the ability to raise/lower rankings for chosen domains.
I haven’t used Kagi much, but the reasons I don’t use it are that it’s paid (I can’t afford to pay $10/month for a search engine) and that I can’t customize it as much as I can customize my own code.
There have also been some other metasearch engines in the past like Dogpile and MetaCrawler (both still exist, surprisingly), but they’re not worth talking about.
Also, of course, there’s my metasearch engine.
Instead of just listing what engines I use, I’ll give my opinion of every search engine that I think is interesting.
I haven’t used some of these in years, so if you think their quality has changed in that time, let me know.
Google: Some people deny it, but from my experience it still tends to have the best results of any normal search engine. However, they do make themselves somewhat annoying to scrape without using their (paid) API.
Google’s API: It’s paid, and its results appear to be worse sometimes, for some reason. You can see its results by searching on Startpage (which sources exclusively from Google’s API). However, you won’t have to worry about getting captcha’d if you use this.
Bing: Bing’s results are worse than Microsoft pretends, but it’s certainly a search engine that exists. It’s decent when combined with other search engines.
DuckDuckGo/Yahoo/Ecosia/Swisscows/You.com: They just use Bing. Don’t use these for your metasearch engine.
DuckDuckGo noscript: Definitely don’t use this. I don’t know why, but when you disable JavaScript on DuckDuckGo you get shown a different search experience with significantly worse results. If you know why this is, please let me know.
Brave: I may not like their browser or CEO, but I do like Brave Search. They used to mix their own crawler results with Google, but not anymore. Its results are on par with Google.
Neeva: It doesn’t exist anymore, but I wanted to acknowledge it since I used it for my old metasearch engine. I liked its results, but I’m guessing they had issues becoming profitable and then they did weird NFT and AI stuff and died.
Marginalia: It’s an open source search engine that focuses on discovering small sites. Because of this, it’s mostly only good at discovering new sites and not so much for actually getting good results. I do use it as a source for my metasearch engine because it’s fast enough and I think it’s cute, but I heavily downweight its results since they’re almost never actually what you’re looking for.
Yandex: I haven’t used Yandex much. Its results are probably decent? It captchas you too frequently though, and it’s not very fast.
Gigablast: Rest in peace. It’s open source, which is cool, but its results sucked. Also the privacy.sh thing they advertised looked...