What are your experiences with all the different variations of finetunes on small models (<40B) with those popular datasets? My personal experience is mostly with the Opus-Reasoning ones on Qwen models, and aside from the output being subjectively better looking (ASCII charts and all), in actual coding performance every one I've tried tends to become a lot more overconfident, writing messier and buggier code and trying to gaslight me that the task I gave it is impossible when it cannot achieve it.

I have seen them perform better on public benchmarks in some cases, which shouldn't be ignored completely, but that doesn't seem to translate to better output on real work in my limited testing.

What are your observations? Any specific ones that you lean towards, or have had good experiences with?