The YAML document from hell (2023)


For a data format, yaml is extremely complicated. It aims to be a human-friendly format, but in striving for that it introduces so much complexity that I would argue it achieves the opposite result. Yaml is full of footguns and its friendliness is deceptive. In this post I want to demonstrate this through an example.

This post is a rant, and more opinionated than my usual writing.

Yaml is really, really complex

Json is simple. The entire json spec consists of six railroad diagrams. It’s a simple data format with a simple syntax and that’s all there is to it. Yaml, on the other hand, is complex. So complex that its specification consists of 10 chapters with sections numbered four levels deep and a dedicated errata page.

The json spec is not versioned. There were two changes to it in 2005 (the removal of comments, and the addition of scientific notation for numbers), but it has been frozen since, for almost two decades now. The yaml spec, on the other hand, is versioned. The latest revision is fairly recent, 1.2.2 from October 2021. Yaml 1.2 differs substantially from 1.1: the same document can parse differently under different yaml versions. We will see multiple examples of this later.

Json is so obvious that Douglas Crockford claims to have discovered it, not invented it. I couldn’t find any reference for how long it took him to write up the spec, but it was probably hours rather than weeks. The change from yaml 1.2.1 to 1.2.2, on the other hand, was a multi-year effort by a team of experts:

“This revision is the result of years of work by the new YAML language development team. Each person on this team has a deep knowledge of the language and has written and maintains important open source YAML frameworks and tools.”

Furthermore, this team plans to actively evolve yaml, rather than to freeze it.

When you work with a format as complex as yaml, it is difficult to be aware of all the features and subtle behaviors it has. There is an entire website dedicated to picking one of the 63 different multi-line string syntaxes. This means that it can be very difficult for a human to predict how a particular document will parse. Let’s look at an example to highlight this.

The yaml document from hell

Consider the following document.

    server_config:
      port_mapping:
        # Expose only ssh and http to the public internet.
        - 22:22
        - 80:80
        - 443:443

      serve:
        - /robots.txt
        - /favicon.ico
        - *.html
        - *.png
        - !.git # Do not expose our Git repository to the entire world.

      geoblock_regions:
        # The legal team has not approved distribution in the Nordics yet.
        - dk
        - fi
        - is
        - no
        - se

      flush_cache:
        on: [push, memory_pressure]
        priority: background

      allow_postgres_versions:
        - 9.5.25
        - 9.6.24
        - 10.23
        - 12.13

Let’s break this down section by section and see how the data maps to json.

Sexagesimal numbers

Let’s start with something that you might find in a container runtime configuration:

    port_mapping:
      - 22:22
      - 80:80
      - 443:443

    {"port_mapping": [1342, "80:80", "443:443"]}

Huh, what happened here? As it turns out, numbers from 0 to 59 separated by colons are sexagesimal (base 60) number literals. This arcane feature was present in yaml 1.1, but silently removed from yaml 1.2, so the list element will parse as 1342 or "22:22" depending on which version your parser uses. Although yaml 1.2 is more than 10 years old by now, you would be mistaken to think that it is widely supported: the latest version of libyaml at the time of writing (which is used by PyYAML, among others) implements yaml 1.1 and parses 22:22 as 1342.

Anchors, aliases, and tags

The following snippet is actually invalid:

    serve:
      - /robots.txt
      - /favicon.ico
      - *.html
      - *.png
      - !.git

Yaml allows you to create an anchor by adding an & and a name in front of a value, and then you can later reference that value with an alias: a * followed by the name. In this case no anchors are defined, so the aliases are invalid. Let’s avoid them for now and see what happens.

    serve:
      - /robots.txt
      - /favicon.ico
      - !.git

    {"serve": ["/robots.txt", "/favicon.ico", ""]}

Now the interpretation depends on the parser you are using. The element starting with ! is a tag. This feature is intended to enable a parser to convert the fairly limited yaml data types into richer types that might exist in the host language. A tag starting with ! is up to the parser to interpret, often by calling a constructor with the given name and providing it the value that follows after the tag. This means that loading an untrusted yaml document is generally unsafe, as it may lead to arbitrary code execution. (In Python, you can avoid this pitfall by using yaml.safe_load instead of yaml.load.) In our case above, PyYAML fails to load the document because it doesn’t know the .git tag. Go’s yaml package is less strict and returns an empty string.

The Norway problem

This pitfall is so infamous that it became known as “the Norway problem”:

    geoblock_regions:
      - dk
      - fi
      - is
      - no
      - se

    {"geoblock_regions": ["dk", "fi", "is", false,...
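To make the two implicit-typing surprises so far (sexagesimal integers and the Norway problem) a bit more concrete, here is a minimal Python sketch of how a yaml 1.1 parser might decide the type of an unquoted scalar. This is not how any particular library is implemented: `resolve_yaml11_scalar` is a hypothetical helper, it uses only the standard library, and it covers just the yaml 1.1 boolean words and base 60 integers. Real parsers may accept only a subset of the boolean words and handle many more types.

```python
import re

# Boolean words from the yaml 1.1 bool type definition.
# Assumption: a real parser may recognise only a subset of these.
_TRUE = {"y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
_FALSE = {"n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}

# Base 60 integer pattern from the yaml 1.1 int type definition:
# a leading number, then one or more ":NN" groups with NN in 0..59.
_SEXAGESIMAL = re.compile(r"[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+")


def resolve_yaml11_scalar(text):
    """Hypothetical helper: guess the type a yaml 1.1 parser gives a plain scalar."""
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    if _SEXAGESIMAL.fullmatch(text):
        sign = -1 if text.startswith("-") else 1
        digits = text.lstrip("+-").replace("_", "")
        value = 0
        for part in digits.split(":"):
            value = value * 60 + int(part)  # each group is one base 60 digit
        return sign * value
    # This sketch leaves everything else as a string.
    return text


print([resolve_yaml11_scalar(s) for s in ["22:22", "80:80", "443:443"]])
# [1342, '80:80', '443:443']
print([resolve_yaml11_scalar(s) for s in ["dk", "fi", "is", "no", "se"]])
# ['dk', 'fi', 'is', False, 'se']
```

Quoting the values in the document ("22:22", "no") sidesteps both rules, because quoted scalars are always strings.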

