This is my first attempt at running an eval of this nature, so I would love some methodology feedback. I can't guarantee the sources weren't already in the model's training data without getting novel translations from native speakers, but from my experience using the top models, they feel very accurate. Even on somewhat obscure texts from a relatively small language, the translations generally beat Google Translate at capturing idiomatic meaning.