Sometime last year, I was reading up on a longstanding open math problem that
turns out to have a really deep connection with computability theory [1].
For the unfamiliar reader, this is a corner of theoretical computer science that
deals with “math computers” — basically, defining mathematical objects that can
do computations, and proving mathematical theorems about them. These math
computers are more formally called models of computation, and there are some
really common ones like state machines and context free grammars that every
programmer probably encounters at some point in their career. I’m probably
biased though, since I’m a compiler engineer during standard working hours, so
these are kind of my bread and butter.
Anyway, this researcher, Liesbeth De Mol, used a kind of math computer called a
tag system to do number theory, and I think that’s neat. I wasn’t familiar with
this model before, which isn’t surprising because there is a vast zoo of obscure
models of computation, but this kind turns out to be particularly interesting.
The history of their development is long and storied, and starts with Emil Leon
Post (who is probably better known for his other contributions like the
Post correspondence problem).
Post posed the problem of “predicting” the behavior of tag systems as early as
1920, which he cutely named “the problem of tag” after the classic recess game.
Less cutely, it turns out that this problem is equivalent to the halting
problem, and is therefore provably impossible to solve in general. Interestingly,
this represents a relatively early brush with the limits of decidability, and
runs in tandem with many of the other major developments of computability theory.
De Mol has also written a thorough accounting of this history in a few places
[2][3], which is definitely worth
checking out if you’re curious.
Understanding Tag Systems
So… what actually is a tag system? I’ll start with an intuitive explanation.
Imagine a clear, narrow tube full of marbles of various colors. Following a very
simple set of rules, we’ll pull a fixed number marbles of various colors out of
one end of the tube, and depending on the color of the first one, we’ll add some
sequence of marbles in the other end. Because the tube is narrow, the marbles
stay in order. Because the tube is clear, we can watch them go by as we go along
removing and adding marbles. The process looks something like this:

The rules in this example are quite simple: at each step, remove two marbles from
the front of the tube. Look at the color of the first marble that was removed:
if it’s red, add a red and a blue marble to the back, in that order. If it’s
green, add two red marbles and a blue marble to the back. If it’s blue, add a
single green marble. Note that for this tag system, we always take two marbles
out of the front, but depending on the color of the marble we take out first, we
might insert one, two, or three. This means that the number of marbles in the
tube can change! If we ever find that there aren’t two marbles to take out, we
stop the process. Indeed, for our initial state of three green marbles in the
animation, we find ourselves with only one green marble in the tube after 11
steps, so we say that the tag system halts on this initial state.
\[GGG \Rightarrow GRRB \Rightarrow RBRRB \Rightarrow RRBRB \\
\Rightarrow BRBRB \Rightarrow BRBG \Rightarrow BGG \\
\Rightarrow GG \Rightarrow RRB \Rightarrow BRB \\
\Rightarrow BG \Rightarrow G\]
Tag systems don’t always halt. There are two other behaviors that are possible:
we might eventually come back to some sequence of marbles that has happened
before, and because the way we proceed is deterministic, we will do this over
and over again. When this happens, we say that the tag system loops. It’s also
possible to do neither of these things by having an ever-growing sequence of
marbles, in which case we say that the tag system diverges. Note that a
particular tag system might do different things on different initial sequences
of marbles. To make this concrete, here’s a sequence for our example tag system that loops every three steps:
\[GRB \Rightarrow BRRB \Rightarrow RBG \Rightarrow GRB \Rightarrow \dots\]
You could also imagine that a tag system neither halts nor loops on a particular
initial state. This is possible if the string keeps on growing indefinitely, in
which case we say that the tag system diverges on the initial state. This one
seems to be the trickiest to prove or reason about. For our example tag system,
I think it always either halts or loops, but I haven’t been able to prove it
yet. For an example of a system that can diverge, consider the following tag
system:
\[\begin{cases}
\mathrm{remove}\, 2\, \mathrm{marbles} \\
R \rightarrow G \\
G \rightarrow RG \\
B \rightarrow BBB
\end{cases}\]
This tag system diverges when starting with two blue marbles, because on each
step we remove two blue marbles and add three more.
\[BB \Rightarrow BBB \Rightarrow BBBB \Rightarrow \dots\]
Now you should have all the intuition necessary to understand Post’s “problem
of tag:” given the rules of a particular tag system and a particular initial
state, does the tag system eventually halt, loop, or diverge?
Into the Weeds
My actual intention with writing this post is to describe some formalization
work I’ve been doing to systematize results about tag systems in the Lean
theorem prover, so everything that follows is going to be a bit dry and very
formal. In formal mathematical style, we’ll start with definitions and enumerate
some theorems in a carefully chosen order. In less formal style, I’ll include
some informal exposition for each proof as we go along, so that you can choose
your own adventure and read only the theorems, or the first paragraph of each
proof, or the whole proof, and get a progressively more detailed but coherent
picture in any case. If you find your eyes glazing over, feel free to start
skimming!
Now let’s get to it. Instead of talking about marbles and tubes, we’ll now start
talking about an alphabet and strings. Let \(\Sigma\) be a finite set we
call the alphabet, and whose elements we call symbols. Finite sequences of
symbols are called strings, which we denote \(\Sigma^\ast\). A tag system is
defined by a deletion number \(v > 0\), alphabet \(\Sigma\), and a function
\(\sigma : \Sigma \rightarrow \Sigma^\ast\) that associates each symbol with the
string that should be appended. That is, a tag system is a tuple \(T =
(v, \Sigma, \sigma)\).
Now we can define the semantics of the tag system. For starters, we say that
\(T\) has halted on a string \(A\) iff \(|A| < v\). We also define the function
\(next\) in terms of concatenation (\(\texttt{++}\)) which joins two strings
together one after the other, \(drop\) which removes the given number of symbols
from the start of a string, and \(head\) which gives the first symbol of a
string:
\[next(T, A) \coloneqq \begin{cases}
A & \mathrm{if} \, halted(T, A) \\
drop (A \, \texttt{++} \, \sigma(head(A)), v) & \mathrm{otherwise}
\end{cases}\]
And then we can define our “end behaviors” in terms of \(next\). Halting is the
most straightforward of the three: the tag system halts on \(A\) if there is
some number of steps \(n\) after which it has halted.
\[halts(T, A) \coloneqq \exists n \in \mathbb{N}, halted(T, next^n(T, A))\]
Looping is not much more complicated, but we require an auxilary definition.
When a tag system loops on a string, it doesn’t necessarily start in the loop
itself — it may have a transient before settling into the repeating states.
Thus, we call a state periodic with period \(p\) iff we return to that state
after \(p\) steps, and a tag system loops on a string iff it becomes periodic
after \(t\) steps. Because \(next(T, A) = A\) if \(T\) is halted on \(A\), we
explicitly exclude this case. Furthermore, because \(next^0(T, A) = A\), we also
require that \(p\) is at least 1.
\[periodic(T, A, p) \coloneqq \neg halted(T, A) \wedge 0 < p \wedge next^p(T, A) = A \\
loops(T, A) \coloneqq \exists t, p \in \mathbb{N}, periodic(T, next^t(T, A), p)\]
The messiest of the three is divergence. There are actually multiple distinct
definitions used in the literature. The first is more “intuitive” to me, and I
see in a lot of the older sources, which I call weak divergence: \(T\) weakly
diverges on \(A\) iff we can always find an even longer string later in the
computation. Symbolically:
\[diverges(T, A) \coloneqq
\forall i \in \mathbb{N},
\exists j \in \mathbb{N},
j > i \wedge length(next^j(T, A)) > length(next^i(T, A))\]
The second definition is the one used by De Mol[3], which I
call strong divergence. \(T\) strongly diverges on \(A\) iff for any length
\(n\), there is a “point of no return” \(i\) where the string never gets shorter
than \(n\). Symbolically:
\[diverges'(T, A) \coloneqq
\forall n \in \mathbb{N},
\exists i \in \mathbb{N},
\forall j > i,
length(next^j(T, A)) > n\]
Some meaningful results
Now that I’ve been really pedantic and formal in these definitions, it’s time
to do some really pedantic and formal proofs. After reading a handful of papers
on tag systems, including Post’s paper from the 1940s that first defined them,
I’ve never seen anyone bother to prove the following results. Indeed, I only
really thought to prove them when I started trying to work with tag systems in
the Lean theorem prover — for better or worse, interactive theorem provers don’t
let you skip any steps (though some are more automated than others). As a
result, one often finds themselves spending a lot of time trying to prove really
basic things. In this particular case, this meant proving that halting, looping,
and diverging are mutually exclusive and exhaustive — that is, a tag system
operating on a particular string always does exactly one of those three things.
It also meant proving that weak divergence and strong divergence are equivalent,
and therefore it is appropriate to simply speak of divergence. All of the Lean
definitions and proofs can be found on my fork of Mathlib.
Mutual exclusion
Our goal in this section is to prove the following theorem:
Theorem 1: For a tag system \(T\) and a string \(A\) over the alphabet of
\(T\), no more than one of the following is true:
\[halts(T, A) \\
loops(T, A) \\
diverges'(T, A)\]
We can prove this by showing that taking any pair of these statements leads to
a contradiction, which we do in a series of lemmas to be combined later.
Lemma 1: \(T\) cannot both halt and loop on \(A\).
Proof: Suppose that \(T\) halts on \(A\) in \(s\) steps, and that \(T\)
also loops on \(A\), with transient \(t\) and period \(p\). We now show the
former assumption implies that the computation has halted at a particular future
step, whereas the latter assumption implies that the same step has not, which is
a contradiction and will complete the proof.
First, note that if \(halted(T, A')\), then also \(halted(T, next(T, A'))\).
This follows from the definition of \(next\), since \(halted(T, A')\) we have
that \(next(T, A') = A'\). Inductively, this is true for all subsequent steps,
which is to say that \(\forall i \in \mathbb{N}, halted(T, A') \rightarrow
halted(T, next^i(T, A'))\).
Next, note that \(1 \le p\). Multiplying both sides by \(s\), we have that
\(s \le ps\). Now add \(t\) to the right side of the inequality, which shows
that \(s \le ps + t\). Since \(T\) halts on \(A\) in \(s\) steps, it is also
halted after \(ps + t\) steps, which is to say \(halted(T, next^{ps+t}(T, A))\).
However, because \(loops(T, A)\), we have \(periodic(T, next^t(T, A), p)\). From
the definition of \(periodic\), we have that
\(next^t(T, A) = next^p(T, next^t(T, A)) = next^{p+t}(T, A)\). We can apply this
procedure arbitrarily many times, incrementing the exponent by \(p\) each time,
and so by induction we have that \(\forall n \in \mathbb{N},
next^t(T, A) = next^{pn+t}(T, A)\), and taking \(n = s\), we have that
\(next^t(T, A) = next^{ps+t}(T, A)\). Now, since \(periodic(T, next^t(T, A), p)\),
we have \(\neg halted(T, next^t(T, A))\), which is equivalent to
\(\neg halted(T, next^{ps+t}(T, A))\), which is the contradiction we were
looking for.
□
Lemma 2: \(T\) cannot both halt and strongly diverge on \(A\).
Proof: suppose that \(T\) halts on \(A\) in \(s\) steps, and that \(T\)
also strongly diverges on \(A\). As before, the former assumption ensures that
the computation has halted at a particular future step, whereas the latter
implies that it does not.
First, we turn our attention towards strong divergence. Taking \(n = v\) in the
definition of strong divergence, we have some \(i \in \mathbb{N}\) such that
\(\forall j > i, length(next^j(T, A)) > v\). We can weaken this inequality to
\(\forall j > i, length(next^j(T, A)) \ge v\), which can now be rewritten in
terms of \(halted\) as \(\forall j > i, \neg halted(T, next^j(T, A))\).
However, because \(halts(T, A)\), we have
\(\forall j \ge s, halted(T, next^j(T, A))\) by the same argument as the last
proof. This leads to our contradicition: let \(t = max(s, i+1)\), so that
\(t > i\) and \(t \ge s\). Since \(t > i\), we have \(\neg halted(T, next^t(T, A))\),
but because \(t \ge s\), we also have \(halted(T, next^t(T, A))\), which is a
contradiction.
□
Lemma 3: \(T\) cannot both loop and strongly diverge on \(A\).
Proof: Suppose that \(T\) loops on \(A\) with transient \(t\) and period
\(p\). We will show that this means strings of a certain length reoccur
arbitrarily far into the computation, which contradicts strong divergence.
Let \(n = length(next^t(T, A))\) be the length of the first string reached in
the cycle. By the same argument used in lemma 1, we have that \(\forall m \in \mathbb{N},
next^t(T, A) = next^{pm+t}(T, A)\). By the definition of \(n\), it follows that
\(\forall m \in \mathbb{N}, length(next^{pm+t}(T, A)) = n\).
However, by the definition of divergence, taking the string length to be \(n\),
we have \(i \in \mathbb{N}\) such that \(\forall j > i, length(next^j(T, A)) > n\).
Note that \(i + 2 > i\), and since \(p > 0\) we have that \(p (i + 2) + t > i\).
Therefore, \(length(next^{p (i + 2) + t}(T, A)) > n\). On the other hand, taking
the result from the previous paragraph with \(m = i + 2\), we have that
\(length(next^{p(i+2)+t}(T, A)) = n\), which is a contradiction.
□
Theorem 1: For a tag system \(T\) and a string \(A\) over the alphabet of
\(T\), no more than one of the following is true:
\[halts(T, A) \\
loops(T, A) \\
diverges'(T, A)\]
Proof: suppose for the sake of contradiction that more than one of the
statements are true. Then at least two of the statements are true. This means
we have at least one of the following cases:
case 1: \(halts(T, A) \wedge loops(T, A)\), which is a contradiction by lemma 1.
case 2: \(halts(T, A) \wedge diverges'(T, A)\), which is a contradiction by lemma 2.
case 3: \(halts(T, A) \wedge diverges'(T, A)\), which is a contradiction by lemma 3.
□
Exhaustiveness
In the last section, we proved that a tag system operating on a string has at
most one of our end behaviors. In this section we show that it is an exhaustive
list, which is to say that we also have at least one of our end behaviors. This
proves that we haven’t missed any possibilities.
Theorem 2: For a tag system \(T\) and a string \(A\) over the alphabet of
\(T\), \(halts(T, A) \vee loops(T, A) \vee diverges'(T, A)\).
Proof: either \(T\) strongly diverges on \(A\) or it does not. If it
does, then the claim trivially follows. So now, suppose \(\neg diverges'(T, A)\).
We now show \(halts(T, A) \vee loops(T, A)\).
First, note that applying De Morgan’s law on \(\neg diverges'(T, A)\) tells us
there is some string length \(n\), such that \(\forall i : \mathbb{N},
\exists j > i, length(next^j(T, A)) \le n\). This means that for any threshold
step number \(i\), we can always find some step number \(j > i\) such that
\(next^j(T, A)\) is no longer than \(n\). The crux of the proof is that we can
therefore find infinitely many indices of strings shorter than \(n\), but there
are finitely many such strings, which means one of them must be repeated.
Invoking the axiom of choice, there is some
function \(\varphi : \mathbb{N} \rightarrow \mathbb{N}\) that takes in \(i\) and
returns such a \(j\). Without loss of generality, suppose that \(\varphi\) is
strictly monotone. Now define a new function \(f(i) \coloneqq next^{\varphi(i)}(T, A)\).
By the definition of \(\varphi\), it must be that \(\forall i \in \mathbb{N}, length(f(i)) \le n\).
Because there are finitely many strings of length less than or equal to \(n\),
by the pigeonhole principle \(\exists i_1, i_2 \in \mathbb{N}, i_1 \ne i_2
\wedge f(i_1) = f(i_2)\). Again without loss of generality, suppose \(i_1 < i_2\).
Because we have a repeated string, we are ready to show that the tag system
halts or enters a loop. It’s easy to tell these two cases apart by the length
of the string. Now, either \(length(f(i_1)) < v\) or it is not.
Case 1: suppose \(length(f(i_1)) < v\). Then we have that
\(length(next^{\varphi(i_1)}(T, A)) < v\), so
\(halted(T, next^{\varphi(i_1)}(T, A))\). Therefore \(halts(T, A)\).
Case 2: suppose \(length(f(i_1)) \ge v\). Then we have
\(\neg halted(T, next^{\varphi(i_1)}(T, A))\). We also have that
\(next^{\varphi(i_1)}(T, A) = next^{\varphi(i_2)}(T, A)\), so
\(periodic(T, next^{\varphi(i_1)}(T, A))\) with period \(i_2 - i_1\),
and therefore \(loops(T, A)\).
□
Strong divergence and weak divergence are equivalent
You may be wondering why I didn’t prove the two definitions of divergence
equivalent right after defining them. It turns out the proof is nontrivial, and
taking advantage of the last two theorems was quite economical. The only new
building blocks we need here are mutual exclusion lemmas for weak divergence.
Lemma 4: \(T\) cannot both halt and weakly diverge on \(A\).
Proof: suppose that \(T\) halts on \(A\) in \(s\) steps. By the definition
of \(diverges\) and DeMorgan’s law, we want to show that
\(∃i \in \mathbb{N}, ∀ j > i, length(next^j A) ≤ length(next^i A)\). It happens
that \(s\) is such a number: since \(halted(T, next^s(T, A))\), we have that
\(next^j(T, A) = next^s(T, A)\), and therefore \(length(next^j A) ≤ length(next^s A)\) for
all \(j > s\).
□
Lemma 5: \(T\) cannot both loop and weakly diverge on \(A\).
Proof: suppose that \(T\) loops on \(A\) with transient \(t\) and period
\(p\). We will show that no string longer than the longest string in the loop
occurs after entering the loop, which contradicts weak divergence.
First, we must construct the longest string in the loop. Let \(L\) be the
sequence of \(p\) strings that occur in the loop. That is, \(L_j \coloneqq
next^{t+j}(T, A), 0 \le j < p\). Let \(i\) be the index of the longest string in
\(L\); if there are multiple strings of the longest length, take \(i\) to be the
smallest index of such a string. By construction, we have that \(\forall
0 \le j < p, length(next^{t+j}(T, A)) \le length(next^{t+i}(T, A))\).
Now by the definition of \(diverges\) and De Morgan’s law, the claim is that
there is some step of the computation after which a longer string is never
produced. That is, \(∃i' \in \mathbb{N}, ∀ j > i', length(next^j(T, A)) ≤
length(next^{i'}(T, A))\). Let’s demonstrate that \(t + i\) is such a value.
Now, for any \(j > t + i\), we also have that \(j \ge t\). Let
\(\delta = j - t \in \mathbb{N}\). It now suffices to show that
\(\forall \delta \in \mathbb{N}, length(next^{t+\delta}(T, A)) \le length(next^{t+i}(T, A))\).
To do this, decompose \(\delta\) into its quotient and remainder divided by \(p\).
That is, define \(q, r \in \mathbb{N}\) such that \(\delta = qp + r\) and
\(r < p\). We proceed by induction on \(q\): if \(q = 0\), then we have
\(\delta = r < p\), and the claim follows by construction of \(i\). Now we have
the induction hypothesis \(length(next^{t + qp + r}(T, A))
\le length(next^{t+i}(T, A))\). By periodicity, we have that
\(next^{t + qp + r}(T, A) = next^{(t + qp + r) + p}(T, A) = next^{t + (q+1)p + r}(T, A)\),
which we can substitute to obtain
\(length(next^{t + (q+1)p + r}(T, A)) \le length(next^{t+i}(T, A))\) which
completes our inductive argument. Therefore, the length of the string after
\(i\) steps is bounded by \(length(next^{t+i}(T, A))\), and \(T\) does not
weakly diverge on \(A\).
□
Theorem 3: \(diverges'(T, A) \iff diverges(T, A)\)
Proof: proving weak divergence from strong divergence, as the names
imply, is possible with a straightforward direct proof. Suppose
\(diverges'(T, A)\). To show that \(diverges(T, A)\), we need to prove that
\(\forall i \in \mathbb{N}, \exists j \in \mathbb{N}, j > i \wedge
length(next^j(T, A)) > length(next^i(T, A))\).
Let \(n \coloneqq length(next^i(T, A))\), and let \(i'\) be the threshold of
strong divergence for strings of length \(n\), so that \(\forall j > i',
length(next^j(T, A)) > n\). Now, for any given \(i\), taking \(j \coloneqq
max(i+1, i'+1)\) suffices. Clearly \(max(i+1, i'+1) \ge i+1 > i\). We also have
that \(max(i+1, i'+1) \ge i'+1 > i'\), so \(length(next^j(T, A)) >
length(next^i(T, A))\). Therefore, \(diverges'(T, A) \rightarrow diverges(T, A)\).
Next, to show that weak divergence implies strong divergence, consider the
contrapositive. Suppose \(\neg diverges'(T, A)\). By theorem 2, we have that
\(halts(T, A) \vee loops(T, A)\). Consider the two cases separately:
Case 1: suppose \(halts(T, A)\). By lemma 4, we have \(\neg diverges(T, A)\).
Case 2: suppose \(loops(T, A)\). By lemma 5, we have \(\neg diverges(T, A)\).
□
Even though all of these theorems are at some level intuitively obvious — for
example, clearly a thing that stops doesn’t get caught in a loop — it turns out
that none of the proofs are as trivial as the tag system literature would have
you believe. Indeed, these claims have been brushed of as “obvious” for close to
80 years, and I personally only bothered to write these proofs because the Lean
theorem prover pointed out these gaps in my reasoning for more complicated
theorems. I don’t expect anyone to be all that interested in these proofs
themselves, but they do represent a step forward in the Lean formalization of
tag systems that is my real contribution.
My efforts have also shown several gaps in the published research on tag systems. one cause of this is Post’s tendency to reference his own unpublished work
[4], and another is the abundance of paywalls and bit rot of
old math publications. There are several important results that I am familiar
with through citations, but cannot access the original works. Overall, I hope to
contribute a meaningful systematization of this work. I already have another
post planned with some more theorems, so stay tuned!
Liesbeth De Mol. Tag systems and Collatz-like functions. Theoretical Computer Science, Volume 390, Issue 1, 2008, Pages 92-101. Link
Liesbeth De Mol. Tracing Unsolvability: A Mathematical, Historical and Philosophical Analysis with a Special Focus on Tag Systems. Ghent University. Faculty of Arts and Philosophy, 2007.
Liesbeth De Mol. Solvability of the Halting and Reachability Problem for Binary 2-tag Systems. Fundamenta Informaticae, Volume 99, Issue 4, 2025, Pages 435-471. Link.
Emil Leon Post. Formal Reductions of the General Combinatorial Decision Problem. American Journal of Mathematics, Volume 65, NO 2, 1943. Link.