Show HN: ELDC – Natural language identification, faster than FastText and CLD2

nitotm | 1 pts | 0 comments

I want to introduce ELDC, an efficient language detector written in C, designed to maximize speed and accuracy within a relatively constrained memory footprint.

ELDC is the latest iteration of the ELD software I made years ago. This version is available as an executable, a library, and a Python package.

This is my first C software, and my first compiled software of any kind; I previously built ELD in pure PHP, JavaScript, and Python.

Highlights:
- Performance: In my benchmarks, it runs faster than CLD2 and much faster than FastText. I believe the results are reproducible for any workload.
- Accuracy: Within its supported language set, my benchmarks show it to be more accurate than Lingua, CLD3, CLD2, FastText, and others. Accuracy is very benchmark-dependent, so I will make no claim other than that ELDC is highly accurate.
- Languages: It supports 60 languages. Its architecture scales efficiently with database size, so I can add more n-grams or languages with relatively low impact.
- Memory usage: The compiled software is about 26 MB, and it also builds a 32 MB hashtable on load.

Notes:
- Database size: I do have other database sizes (featured in the PHP version), but I went for simplicity and used the optimal size. More sizes could be added.
- Single detection: I optimized for multi-detection. For a single detection, a B-tree would offer faster loading and lower memory usage than the current hashtable. I don't expect that to be the most common use case, but it could be optimized for.

I would like to get some feedback. I'm curious to see if my speed claims hold true against your own tests. :)
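For anyone who wants to check speed and accuracy on their own data, here is a minimal sketch of a detector-agnostic harness in Python. It does not use ELDC's actual API, which is not described in this post: `benchmark`, `accuracy`, and `toy_detect` are hypothetical names, and `toy_detect` is only a stand-in for whatever detector you wrap (ELDC, CLD2, FastText, and so on) in a function that takes a string and returns a language code.

```python
import time
from typing import Callable, List, Tuple


def benchmark(detect: Callable[[str], str], texts: List[str], repeats: int = 5) -> dict:
    """Time `detect` over all `texts`, keeping the fastest of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            detect(text)
        best = min(best, time.perf_counter() - start)
    rate = len(texts) / best if best > 0 else float("inf")
    return {"seconds": best, "texts_per_second": rate}


def accuracy(detect: Callable[[str], str], labeled: List[Tuple[str, str]]) -> float:
    """Fraction of (text, expected_language) pairs the detector gets right."""
    correct = sum(1 for text, lang in labeled if detect(text) == lang)
    return correct / len(labeled)


def toy_detect(text: str) -> str:
    # Hypothetical stand-in, NOT ELDC: replace with a wrapper around a real detector.
    padded = f" {text.lower()} "
    return "es" if "ñ" in padded or " el " in padded else "en"


if __name__ == "__main__":
    labeled = [("the cat sleeps", "en"), ("el gato duerme", "es")]
    texts = [text for text, _ in labeled] * 1000
    print(benchmark(toy_detect, texts))
    print("accuracy:", accuracy(toy_detect, labeled))
```

Running every detector through the same `benchmark` call on the same list of texts keeps the comparison fair. Taking the fastest of several runs reduces noise from other processes, and load time (such as building a hashtable) should be measured separately, since it happens before the timed loop.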
