Embeddings as Encodings


Embeddings as Encodings – HASH Developer Blog


Correctly conceptualizing and handling vectorization in knowledge graphs

January 26th, 2026

Dei Vilkinsons, CEO & Founder, HASH

Background

Embeddings are now a default building block in modern data services, powering semantic search, retrieval-augmented generative AI (RAG), clustering, deduplication, recommendations, anomaly detection, and more.

In fact, if you’re building AI-native products, you’re almost certainly storing vectors somewhere... but the how and where vary wildly.

The moment embeddings sit alongside a knowledge graph, a deceptively simple question becomes operationally important: are embeddings part of an entity, or are they metadata about an entity?

This arose internally at HASH in the way such questions often do: a few engineers, a few competing intuitions, and some spirited debate. We've since converged on a framework that’s less about word choice and more about building systems that remain reliable under model changes, scale, and security constraints.

The resulting best practice is straightforward: Embeddings are encodings of entities (derived representations of them). Embeddings therefore require content-grade access control, with the added consideration of metadata-grade lifecycle management. These principles resolve most downstream design debates.
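To make “content-grade access control” concrete, here is a minimal Python sketch of a similarity search that applies the same read check to vectors as it would to the entities they encode. The `Entity` shape, the `readers` set, and the `search` function are all hypothetical names invented for illustration; this is not a description of how HASH implements authorization.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    entity_id: str
    readers: set[str]        # principals allowed to read the entity's content
    embedding: list[float]   # derived encoding of that content


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search(entities: list[Entity], query: list[float],
           principal: str, k: int = 3) -> list[str]:
    """Rank entities by similarity, considering only those the principal may read."""
    # Filter BEFORE scoring: an entity the caller cannot read should not
    # take part in ranking at all, not merely be hidden from the output.
    visible = [e for e in entities if principal in e.readers]
    ranked = sorted(visible, key=lambda e: dot(e.embedding, query), reverse=True)
    return [e.entity_id for e in ranked[:k]]
```

The design choice worth noting is that the check happens before ranking rather than after it. If unreadable vectors were scored and then dropped, the size and ordering of the remaining results could still leak something about content the caller has no right to see.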

“Derived” doesn’t automatically mean “metadata”

A common argument goes: embeddings are computed from entity content, therefore they’re metadata.

However, that’s not quite right. Embeddings represent the content itself: not in a form that is intelligible to you or me, but in one that says far more about what entities contain than, say, the created_at dates of a million entities of different types.

Plenty of things are derived from “original data” or “content”, yet are simply different encodings of it:

a JPEG derived from RAW

an MP3 derived from WAV

a thumbnail derived from a full-resolution image

a translated document derived from an original

These are all different representations of information — often lossy, often dependent on an algorithm/codec/model, but fundamentally still encodings of the same underlying content.

Embeddings fit better in this bucket than under descriptive “metadata”, at least as most people (including developers) intuitively use the term: to refer to file authors, timestamps, tags, and so on.

JPEG: RAW image → compressed image. Lossy reduction of fine-grained pixel data. Cheap to ship, easy to view.

MP3: WAV audio → compressed audio. Drops imperceptible frequencies. Tiny file, plays anywhere.

Thumbnail: full-resolution image → small preview. Drops resolution. Optimized for at-a-glance recognition.

Translation: original document → target-language document. Re-encoded into another language. Optimized for a different audience.

Embedding: entity attributes → dense vector. Lossy reduction into a vector space optimized for similarity. Opaque to humans, fast for similarity ops.

Figure: Five derived encodings sitting alongside one another: JPEG from RAW images, MP3 from WAV audio, thumbnail from full-resolution image, translation from an original document, and a dense vector embedding from an entity's attributes. Each is a lossy re-encoding of canonical content for a specific downstream purpose.

So why think about embeddings as “metadata” at all? The value isn’t, of course, in the label, but in the required systems discipline that typically comes with handling metadata: provenance, versioning, refresh policies, and explicit separation from canonical truth.

A practical taxonomy for graph/AI systems

In knowledge graphs, the term “entity” usually means more than a blob of text: it’s a stable identifier plus a set of claims (attributes + relationships), ideally with provenance. In HASH, an entity is even more than this.

A durable way to structure this world consists of at least three layers:

1. Canonical claims

The graph’s explicit record of what is asserted:

Company: HASH

hasWebsite(hash.dev)

employs(Person: …)

provenance: where the claim came from, when, confidence, etc.

This is the layer you can audit, reason over, and reconcile.
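As a rough illustration of this layer, the sketch below models an entity as a stable identifier plus a list of claims, each carrying its own provenance. The class names, field names, and placeholder provenance values are assumptions made for the example; HASH's actual entity model is richer than this and is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source: str            # where the claim came from
    recorded_at: datetime  # when it was recorded
    confidence: float      # 0.0 to 1.0


@dataclass(frozen=True)
class Claim:
    predicate: str         # e.g. "hasWebsite" or "employs"
    value: str             # a literal, or the identifier of another entity
    provenance: Provenance


@dataclass
class Entity:
    entity_id: str         # stable identifier
    entity_type: str
    claims: list[Claim]

    def values(self, predicate: str) -> list[str]:
        """Return every asserted value for one predicate."""
        return [c.value for c in self.claims if c.predicate == predicate]


# Placeholder provenance values, chosen only to make the example runnable.
hash_co = Entity(
    entity_id="company:hash",
    entity_type="Company",
    claims=[
        Claim("hasWebsite", "hash.dev",
              Provenance(source="manual-entry",
                         recorded_at=datetime(2026, 1, 26, tzinfo=timezone.utc),
                         confidence=1.0)),
    ],
)
print(hash_co.values("hasWebsite"))  # ['hash.dev']
```

Claims are frozen so that a correction is recorded as a new claim with new provenance rather than as a silent overwrite, which is what keeps the layer auditable.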

2. Representations of those claims

Alternate encodings used for consumption or computation:

rendered HTML

translations

compressed media formats

thumbnails

...and of course embeddings.

These are often lossy, often recomputable, and usually dependent on a specific model/codec. They are valuable, but they are not the canonical record.
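The three properties named here (lossy, recomputable, model-dependent) can be seen in a few lines of Python. `toy_embed` below is a deliberately crude hashed bag-of-words function, a stand-in invented for this example and not a real embedding model; its `dims` parameter plays the role of the model configuration.

```python
import hashlib
import math


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. A stand-in for a real model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        # hashlib is stable across processes, unlike the built-in hash().
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


a = toy_embed("HASH employs people")
b = toy_embed("people employs HASH")

# Lossy: word order is discarded, so the source text cannot be recovered.
print(a == b)  # True

# Recomputable: the same input and settings reproduce the same vector.
print(toy_embed("HASH employs people") == a)  # True

# Model-dependent: changing the configuration changes the encoding.
print(len(toy_embed("HASH employs people", dims=16)))  # 16
```

Two different claims collapsing to the same vector is exactly why the vector cannot serve as the record of what was asserted.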

3. Metadata about claims and representations

The lifecycle and provenance data that keeps everything sane:

model name/version used to compute a representation

which fields were embedded

when it was generated

input fingerprint

retention policy, ACL references, lineage

This is “data about the data”—and it’s what prevents the system from devolving into unexplainable artifacts.
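One way to put this metadata to work is to store it next to each vector and use it to decide when a vector needs regenerating. The sketch below is a minimal version under assumed names (`EmbeddingRecord`, `fingerprint`, `is_stale`, and the example model identifiers are all hypothetical); retention policy, ACL references, and lineage are left out to keep it short.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def fingerprint(attributes: dict, fields: list[str]) -> str:
    """Hash only the embedded fields, serialised deterministically."""
    selected = {name: attributes.get(name) for name in sorted(fields)}
    payload = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EmbeddingRecord:
    entity_id: str
    vector: tuple[float, ...]
    model: str                 # model name/version used to compute the vector
    fields: tuple[str, ...]    # which fields were embedded
    generated_at: datetime     # when it was generated
    input_fingerprint: str     # hash of the inputs at generation time


def is_stale(record: EmbeddingRecord, attributes: dict, current_model: str) -> bool:
    """A vector is stale if the model changed or its embedded inputs changed."""
    if record.model != current_model:
        return True
    return record.input_fingerprint != fingerprint(attributes, list(record.fields))


attrs = {"name": "HASH", "website": "hash.dev", "notes": "draft"}
record = EmbeddingRecord(
    entity_id="company:hash",
    vector=(0.1, 0.2),  # placeholder values, not a real embedding
    model="example-model-v1",
    fields=("name", "website"),
    generated_at=datetime.now(timezone.utc),
    input_fingerprint=fingerprint(attrs, ["name", "website"]),
)
```

Because the fingerprint covers only the embedded fields, editing `notes` in this example would not trigger a refresh, while editing `website` or switching to a different model identifier would.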

Because embeddings are computed representations of canonical claims, they live in layer 2, instead...

