Some Thoughts on AI Safety


Thoughts on AI Safety | Steve Kinney

June 19, 2026

A cautious, nuanced case for AI optimism: why safety, interpretability, bias, and alignment matter as much as raw capability.

To be on the Internet in the Modern Era™ is to be inundated with opinions, hype, and various flavors of doom and gloom. So, I decided to take a short respite from the infinite stream of 30-second reels and do a bit of a deeper dive. (Narrator: He downloaded a bunch of research onto his iPad and sat on the couch instead of doomscrolling.)

I’m going to make the argument that boiling things down to either AI === Good or AI === Bad is a (dangerous) oversimplification. It makes for a fine 30-second hot take, but it loses the nuance required to have the important conversations about what our shared future with AI is going to look like. Refusing to take the risks and implicit bias seriously because you’ve drunk the Kool-Aid doesn’t prepare us for what could go wrong, and neither does writing off a statistical model as inherently evil.

At this point, we’re unlikely to put the genie back in the bottle. That ship has sailed.

I’m a (cautious) optimist. It’s hard to be a total pessimist about a technology that could speed up critical cancer and vaccine research. At the same time, there are lots of reasons to have a dollop or two of anxiety: the same technology can be used for nefarious purposes. That leaves you with a few thorny questions: How do you make sure that an AI model can’t be used to do Bad Things®? How do you prevent it from doing those bad things without also limiting its ability to do the important things? And who exactly decides where that line is?

But I’m equally worried about the subtler impacts. It’s one thing to try to prevent someone from cracking the nuclear codes, but what about implicit bias? Models are trained on human-created data, and we all know that humans have been known to have a bias or two. These biases are trickier to suss out, and they carry just as much of a philosophical and ethical dilemma (if not more) about where you draw the line. The impacts that these biases can have on various populations can’t be ignored.

Despite my optimistic leanings, I won’t opine on the various positive impacts that AI might have going forward. Dario Amodei’s essay Machines of Loving Grace lays out the case better than I can: the realistic version of the upside is curing diseases that have shadowed our species for millennia, compressing decades of biological progress into a few years, lifting the poorest parts of the world onto a different trajectory entirely. That’s not a fever dream. It’s a reasonable extrapolation of what systems already in the lab can begin to do.

Regardless, a tool powerful enough to design a vaccine is powerful enough to design a pathogen. A system competent enough to run an autonomous research pipeline is competent enough to pursue a goal you didn’t intend and didn’t notice you’d given it. You don’t get the magnitude of one without the magnitude of the other. So the question that matters isn’t “how powerful can we make these things?” It’s “can we understand and steer what we’ve made before it gets more capable than we are?”

Right now, the honest answer is: not as well as we’d like. Let me explain why, what could go wrong, and (because this isn’t a doomer pamphlet) the concrete work that gives me real hope we can get this right.

TL;DR

The first thing we need is a complete understanding of what is going on inside the model. Right now? We don’t have one. So step one is interpretability: the degree to which a human can understand the cause-and-effect relationship between a model’s inputs and its outputs. It measures how easily a user can trace, comprehend, and trust the reasoning behind an AI’s decisions or predictions.
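To make "cause-and-effect between inputs and outputs" a little more concrete, here is a minimal sketch of one simple way to trace it: switch off one input at a time and see how much the output moves. Everything here is a made-up illustration. `toy_model`, its three weights, and `ablation_attribution` are hypothetical names, and a three-number weighted sum is nothing like a real language model. The point is only that this kind of tracing is trivial when the model is tiny and transparent, which is exactly what we don't have at scale.

```python
from typing import Callable, List, Sequence


def toy_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a model: a fixed weighted sum of its inputs."""
    weights = [2.0, -1.0, 0.5]  # made-up weights, chosen only for illustration
    return sum(w * x for w, x in zip(weights, features))


def ablation_attribution(
    model: Callable[[Sequence[float]], float],
    features: Sequence[float],
    baseline: float = 0.0,
) -> List[float]:
    """Score each input by how much the output changes when that input
    alone is replaced with a baseline value."""
    full_output = model(features)
    scores = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = baseline  # "switch off" only this one input
        scores.append(full_output - model(ablated))
    return scores


inputs = [1.0, 3.0, 4.0]
attributions = ablation_attribution(toy_model, inputs)
print(attributions)
```

For a weighted sum like this one, the scores add up to the difference between the model's output and its output on an all-baseline input, so the explanation is complete. With billions of interacting weights, no such tidy accounting is guaranteed.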

We Grow These Systems More than We Build Them

Start with the single weirdest fact about modern AI, because everything else follows from it. A large language model is not engineered the way a bridge or a database is engineered. It’s grown. We pick an architecture, define an objective, pour in a staggering amount of data and computation, and what comes out the other side is a tangle of billions of numbers (the model’s “weights”) that does astonishing things for reasons nobody can fully explain.

Sit with how strange that is. We deploy these systems to hundreds of millions of people, and we cannot open one up and read off why it answered the way it did, the way you’d step through code in a debugger. The subfield trying to fix that is called interpretability: reverse-engineering a network’s internal machinery into something a human can actually follow. It’s young, and it’s losing the race against raw capability. We’re much better at making models more powerful than at making them more understandable. Hold onto that asymmetry. It’s the load-bearing problem under everything else in this...

