What makes a good variable naming convention – Ben Harrap
In particular, what makes a good variable naming convention for a longitudinal or panel survey? I wanted to answer this question so I could propose a good naming convention for a project at work. I have lots of opinions about this, largely based on working with variable names I don’t like, but I thought I should see what others thought so I posed the question on BlueSky…
I get to propose a variable naming convention.
Give me your "I'll die on this hill" opinions about variable names #databs #rstats
— Ben Harrap (@bharrap.bsky.social) February 25, 2025 at 7:27 PM
… and I got just as many comments on what not to do as what to do! So let’s go through the do’s and don’ts and I’ll use the following question to demonstrate.
How would you rate your general health?
Poor
Fair
Good
Excellent
In terms of features, this question:
Permits only one response choice
Creates ordinal data
Is asked every wave
Appears in the ‘general health’ sub-section of the questionnaire
Which is in the ‘health’ section
Asks about participants’ general health
What makes a bad naming convention?
Let’s start out with what makes for a bad naming convention. If we figure out what we don’t like, we can start to figure out what we do.
Not using a convention
Making variable names up as you go along is clearly a bad idea. It creates much more work, as you'll have to either remember every specific variable name or look at the data dictionary every time. Planning variable names ahead of time is also important: if you make up a few names and then try to fit new names into the pattern you've just made up, you'll quickly run into cases where the pattern doesn't work. So at the very least we need a naming convention.
For our general health question, maybe we called it generalhealth. Not terrible, but this isn’t going to be adequate in the context of an entire survey, as hopefully you’ll come to realise.
Prioritising brevity
Making variable names as short as possible will save a few keystrokes at the cost of constantly referring back to the data dictionary to make sure you’re using the right one.
We could call the general health question genh or gh or ghr (general health rating). Yes, they’re short, but see how I had to explain what ghr stood for? It’s not immediately clear.
Also, text-completion exists in many IDEs, so don’t prioritise brevity!
Prioritising interpretability
Conversely, prioritising interpretability can lead to excessively long variable names, which is at the other extreme!
wave_1_health_general_health_general_health_rating_ordinal encodes lots of information - the wave, the section, the sub-section, some of the question wording, and the type of data. This kind of length might work for your own solo projects but not for a dataset that’s going to be used by lots of people.
Names that are easy to mix up
This is more common with conventions that prioritise brevity, as they tend to use abbreviations and avoid delimiters, which makes variables difficult for our brains to process quickly and accurately. Imagine the convention {wave}{respondent}{topic}{2-letter ID} resulted in the variables bcgenmo and bcgemno. They’re from entirely different topics, gen and gem, but it is very easy to mix them up visually and through typos.
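One way to catch this kind of problem before a dataset is released is to measure how many single-character edits separate each pair of names. The following is a minimal sketch in base R, not part of any particular project's tooling: `adist()` computes edit distances, the third variable name is made up for contrast, and the threshold of 2 is an arbitrary choice for illustration.

```r
# The first two names come from the example above; the third is hypothetical
var_names <- c("bcgenmo", "bcgemno", "bcalcfq")

# adist() (utils package, attached by default) returns a matrix of
# Levenshtein edit distances between every pair of names
d <- adist(var_names)
dimnames(d) <- list(var_names, var_names)

# Look at each pair once (upper triangle) and flag pairs within 2 edits.
# The threshold of 2 is an assumption; tune it for your own convention.
hits <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)

risky_pairs <- data.frame(
  name_1 = var_names[hits[, "row"]],
  name_2 = var_names[hits[, "col"]],
  edits  = d[hits]
)

risky_pairs
```

Here `bcgenmo` and `bcgemno` are flagged because they are only two substitutions apart, even though they belong to different topics.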
Using more than one case
camelCase, snake_case, kebab-case, SCREAMING-KEBAB-CASE: they're all candidates, but switching between cases is a no-no for variable names. Mixing cases might be useful in other situations, say one case for functions and another for variable names. But we're just talking about variable names here, so don't use more than one case.
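Checking for a single case is easy to automate. This is a rough sketch that assumes snake_case is the chosen convention (the post doesn't prescribe one) and uses made-up variable names; the regular expression is one reasonable definition of snake_case, not the only one.

```r
# Hypothetical variable names that mix several cases
var_names <- c("general_health", "generalHealth", "GENERAL-HEALTH", "wave_1_health")

# Assumed definition of snake_case: starts with a lowercase letter, then
# lowercase letters and digits, with parts separated by single underscores
is_snake_case <- grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", var_names)

# Names that break the convention
offenders <- var_names[!is_snake_case]
offenders
```

Running a check like this over the column names of each new wave is a cheap way to stop a second case creeping in.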
Using more than one language
This was a good point from my BlueSky post (thanks Russell!). In my context in Australia, I’d use English, and only English, to name the variables in my dataset.
People are going to have an easier time if we call the general health question general_health instead of allgemeine_gesundheit. If you were collecting data in Germany, allgemeine_gesundheit might make more sense.
Using letters for wave number
OK, this is probably a controversial one because I see it done time and time again. Yes, it's a concise and convenient way of representing a number. However, it only works up to 26 waves (our survey's going to run forever, right!?). It's also a pain trying to remember which letter is which number, and it forces you to write extra code to translate letters into numbers (e.g. match("k", letters)).
Imagine we've got 20 waves of survey data. What wave is the general health question hhlthgenord in? I'm sat here counting out letters on my fingers because I don't know off the top of my head.
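To make the extra work concrete, here is a small sketch of the translation step. It assumes the wave letter is always the first character of the variable name; the helper name `wave_from_letter` and the numeric format `w08_hlth_gen` are hypothetical, used only for comparison.

```r
# Translate a leading wave letter into a wave number.
# Assumes the wave letter is always the first character of the name.
wave_from_letter <- function(var_name) {
  match(substr(var_name, 1, 1), letters)
}

wave_from_letter("hhlthgenord")  # 8, because h is the 8th letter
wave_from_letter("k")            # 11

# With the wave stored as digits (hypothetical format "w08_..."),
# the number can be read straight off the name
wave_numeric <- as.integer(sub("^w([0-9]+)_.*$", "\\1", "w08_hlth_gen"))
wave_numeric  # 8
```

Either way you need a line of code, but only the letter version also needs a lookup in your head when you're reading the name.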
Also, it can lead to odd variable names - some examples from the wild include:
asdtype - Wave 1 survey type, not type of autism spectrum disorder
ewluge - A question about geriatric care in wave 5, not a disdain for sleds
These are fairly innocuous examples but I’ve seen some...