# TOM S JUZEK'S BLOG

 << Hacking Der Spiegel <<     – HOME

### The Zodiac Killer's unsolved z340: not your vanilla substitution cipher

By the way have you cracked the last cipher I sent you? (The Zodiac Killer, 20 Apr 1970)

To this day, the Zodiac Killer's second major cipher, z340, remains unsolved. In this post, I argue that z340 is not a classical substitution cipher and that it could well be a fake cipher. These hypotheses have been put forward before, but I provide substantial evidence for them, mainly through the analysis of mean squared distances for ngrams. This analysis is applied across various ciphers: z340, z408, which is the other major cipher by the Zodiac Killer, twenty generated ciphers based on real texts, and twenty fake ciphers based on random letter sequences. The results show that z340 behaves unlike other true substitution ciphers. Instead, it associates with the fake ciphers. However, it remains a possibility that z340 is a more intricate cipher, e.g. a Vigenère cipher.

### Introduction

Back in around 2012, I learnt about an intriguing, to this day unsolved cipher: The Zodiac Killer's z340. Here is what the beginning of the cipher looks like:

You can find the full cipher here [image link]. The cipher has gripped people's imagination for almost 50 years now. Various solutions were suggested, some more credible [html link] than others [html link]. And there are some pretty decent analyses out there [html link]. Arguably, however, the cipher remains unsolved [html link]. In my view, the reason for why there has been so little progress on the cipher is that the focus was too much on trying to solve it, instead of analysing whether or not z340 is a "bona fide" cipher in the first place.

### The z340 cipher

z340 consists of 20 rows with 17 characters each, making use of a set of 63 characters. It is the Zodiac's second and also his last major cipher, following z408, which has been cracked by the Hardens shortly after publication [html link]. Simple descriptives make z340 look like a solvable cipher. At a first glance, the character frequencies look promising (more on character frequency: [html link]) and there are some bigrams and two trigrams (these are character ngrams; more on ngrams: [html link]). The PLUS character has a relative frequency of about 7%, suggesting that it should probably be either "E", "T", "A", or, "O". Further, the PLUS occurs in various bigrams. In fact, PLUS-PLUS is one of the most frequent bigrams. In theory, these facts alone should give us a massive head start to solving z340.

Please ignore the msd's for now; we'll come back to them shortly.

A further boost comes from the fact that z340 has a lot in common with z408 – and z408 has been cracked. z408 is also of the form 17*x, in this case 17*24 characters. It has a character set of 54 characters, 47 of which it shares with z340. While the Hardens deciphered most of z408, the last 18 characters remain undeciphered. As per the Hardens' Solution, z408 is a one-to-many substitution cipher. It is also worth noting that the Zodiac Killer mapped "E" to "E" and "R" to reversed "R", which is sometimes interpreted as a double bluff ("'E' as 'E' would be too easy, it must be something else"). Choosing high-frequency letters for this ruse indicates that the Zodiac Killer was aware of English letter frequencies. Here is a snippet of the z408 cipher, taken from Wikisource [html link]:

So altogether, z340 looks pretty doable and yet, it hasn't been solved. The Zodiac Killer seemed to enjoy that people struggled to crack the cipher. He taunted the investigators with it, which is the opening quote above [html link]. However, this also bothered me about z340 right from the beginning: It looks almost too easy. If z340 is just another z408, which has been deciphered within about a week after its publication, then why bother to encrypt z340 in the first place? So, the assumption is that the Zodiac Killer must have stepped up his game big time.

Several hypotheses have been put forward, trying to explain why z340 has been so difficult to decipher. Among them:

1) There is some trick to the cipher. For instance, the text is mirrored, certain characters have to be disregarded, etc.
2) The cipher contains a obscure message that make it hard to decipher.
3) It's a many-to-many cipher.
4) The cipher does not contain any message, it's a ruse.
5) It is something more advanced, e.g. a Vigenère cipher.

I will address these points one by one. However, let's first look at the key metric that I use throughout this post: mean squared distances for ngrams.

### Mean squared distances for ngrams

I decided to use a metric based on letter/character ngrams (throughout this post, whenever I speak of ngrams, I refer to ngrams of letters/characters). To give a quick example, consider the bigrams of the word "dadaism": "da", "ad", "da", "ai", "is", and "sm". ngrams used on ciphers are a great measure of how much language is present/accessible. If ngrams are present to an expected degree, then there is language underlying the cipher. If ngrams are not present to an expected degree, then there is either no language underlying the cipher or the encryption of the cipher is so strong that ngrams do not surface sufficiently. Further, a nice feature of ngrams is that the direction of the cipher is irrelevant. If a cipher is mirrored, then ngrams will surface just as much. If you have any doubts whether ngrams are a good basis for cipher analysis or wonder what "to an expected degree" means, then please bear with me. Below, these points will become clearer.

Any ngram occurring more than once is valuable information, thus measuring an ngram's distance from 1 is a plausible metric. And the higher the frequency, the more valuable an ngram becomes, arguably in a non-linear way. Thus, squaring the distance makes sense. We can do this for all ngrams and then take the average. This is what the mean squared distance from one metric (msd) captures. The following figure, based on hypothetical data, illustrates the msd.

The formula is as follows:

N.B.: Alternatively, one could use the distance from 0, which, in my view, works equally well. There are pros and cons to both. Some transformation of the msd, e.g. taking the log or the square root could also be useful. However, for our purposes, the bare msd is sufficient.

We already saw the bigram msd and trigram msd for z340 in the figure above. It can be hard to make sense of what the values mean. So, where it becomes really interesting is when we start comparing a cipher's msd to a meaningless baseline. Such a Δ between msd's could be extremely insightful. For example, the bigram msd for z340 is 0.097. If we had the msd of a meaningless baseline, then the Δ between the two would be valuable information about the meaningfulness of z340.

The problem with the msd is that there are difficulties with comparing msd's across data sets. This is because the length of a text influences the msd, as well as the length of a text's character set. A 400 character cipher using 10 characters will see a different ngram distribution to a 100 character cipher using 40 characters. So, ideally, we would compare any given cipher to something that 1) is of the same length, 2) uses a character set of the same length, and 3) is clearly meaningless. 3) allows us to detect whether meaning is present in the target.

Luckily, we have such a baseline for any data set! We can read in the same cipher in different ways, e.g. horizontally vs vertically. This way, we have a baseline of the same length, using the same character set, and we know that at least one of the two interpretations has to be meaningless. If there is no substantial difference in msd's, then we know that both interpretations of the cipher are meaningless: The Δ in msd's between a meaningless text and another meaningless text is approximating 0. If there is a considerable difference, then we know that the interpretation with the higher msd consists of more language underlyingly and hence, this is the right way to interpret the cipher.

This is somewhat abstract and will become clearer soon. Let's return to z340.

### z340 vs z408

When I tried to solve z340, assuming that it's a one-to-many substitution cipher that goes left-to-right, I didn't get anywhere. So, I took a step back and compared z340 to z408. Comparing character frequencies and ngrams give interesting results.

First, the PLUS has a relative frequency of 7% in z340, which in itself is suspicious. Second, note how z408's character frequencies make a nice S-shaped tail, this is the relatively sharp drop towards the end, while z340 behaves like a step function. Further, z408 has a lot more bi- and trigrams. Something fishy is going on.

This is where the analysis of Δ-msd's provides important insights. Compare the |Δ-msd| for z340 vs |Δ-msd| for z408. (Note that in the following, we will always look at absolute Δ's.)

What these graphs tell us is that if you read in z408 horizontally, then you get a certain ngram distribution (the pink line). If you read it in vertically, then that distribution collapses (the dark pink triangles). This contrasts to z340, where horizontally vs vertically makes little difference. Basically, what this means is that interpreting z340 horizontally is as pointless as doing so vertically. We will back this up further below by comparing z340 to other fake ciphers.

z408 gives us an idea what the |Δ-msd|'s should look like for a meaningful one-to-many cipher vs a meaningless one-to-many cipher. Let's plot this in a two dimensional space.

### Further ways to read in z340

It doesn't matter whether or not the cipher is written left to right or right to left; this also holds for reading in lines top to bottom vs bottom to top. An ngram analysis would still capture this. These symmetries also hold when we read in the cipher vertically; top to bottom vs bottom to top would be equally reflected in the ngram analysis.

This already covers a lot of ground in terms of different ways to read in the cipher. I have added a few more, viz. only reading in every other character (these are the "alternating 1" and "alternating 2" conditions in the graph below), skipping every third character ("two-one, shift 1-3"), ignoring the PLUS ("no plus"), randomly splitting the PLUS into two characters ("plus split"), reading in the cipher diagonally ("diagonally"), and reading in the cipher from the outer edge spiraling towards the centre ("spiral"). Each one is compared to their vertical counterparts, except the diagonal condition, which is compared to the other diagonal, and the spiral, which is compared to z340 read-in normally. The |Δ-msd|'s come out as follows:

It is also worth noting that some other tricks, e.g. skipping every fourth character, alternating real words with gibberish, etc., will lead to differences in |Δ-msd|'s, and thus, such tricks could not break this analysis. So, unless I have missed a really clever trick, I don't think that different ways of reading in the cipher is the key to making sense of z340.

### Contextualising

We saw how z340 compares to z408; but some more context would be helpful. To this end, I have created forty more substitution ciphers, twenty based on real texts and twenty based on random letter sequences. The twenty real texts, ""true ciphers", are based on various sources: the first letter attributed to the Zodiac Killer [html link], the Ramsey letter sent to the police in 1996 [html link], the "Dear Boss" letter attributed to Jack the Ripper [html link], the Unabomber Manifesto [html link], the Genesis [html link], Karl Marx' The Capital [html link], Message 5 by the Shadow Brokers [html link], the Nano Whitepaper [html link], the Declaration of Independence [html link], the user guide of the Nokia 3310 [html link]. The lengths of the ciphers range from 306 to 459 characters. The twenty random letter sequences, "fake ciphers", were created with a Python script and observe letter frequency in English. They, too, range from 306 to 459 characters. All files, scripts, and further notes can be found on GitHub [html link]. In a next step, I have encrypted the forty texts, using a similar encryption to z408. The character frequencies for the unencrypted and encrypted texts are as follows, with z408 and z340 added as points of comparison. "z408 (H. S.)" in the left graph is the Hardens' Solution.

I then calculated the |Δ-msd|'s for each of the forty ciphers, where the target was the cipher read in horizontally and the baseline was the cipher read in vertically. I also did this for the unencrypted texts ("plain text" in the right graph), to add some more context. The results are as follows:

We can see that z340 associates with the fake ciphers. Interestingly, we can also see that the other true ciphers form their own cluster, away from z408, implying that the encryption mechanism used for them is a lot stronger than the encryption mechanism of z408. I think that the main reason for this is that the script is in its one-to-many mappings pseudo-random, meaning it's mimicking true randomness pretty well [html link]. It is likely that the Zodiac Killer did not reach such levels of randomness, which will be reflected in the ngrams.

This interpretation of z340 as a fake cipher is backed up by a K-means clustering analysis, as shown in the graph below. K-means clustering is a basic categorisation algorithm [html link]. The nice thing is that we know the number of clusters that we wish to see: two – one cluster for the fake ciphers and one for the true ciphers.

The K-means analysis was done with Python and graphed with R. By and large, the learning algorithm clusters the ciphers as one would expect. There are a few mis-categorisations, though. Crucially, the learning puts z340 in the same cluster as the other fake ciphers, this is the cluster near the origin, and not in the cluster of the other true ciphers, the other cluster further out. I also ran an SVM, which produces similar results (see my GitHub, [html link]).

One might speculate that maybe z340 is some kind of weird word salad and that is why the |Δ-msd|'s come out so low. But here's the thing: Even random word sequences follow the rules of English phonology, something that should be reflected in the ngram analysis. To illustrate this point, I analysed twelve more data sets, four real texts, based on the Wikipedia article on Charles Manson ("real text 21" to "real text 24", [html link]), four texts consisting of random word sequences ("random words 1" to "random words 4"), sampled from the same article, and four texts consisting of random letter sequences, similar to the random data sets from above ("random letters 21" to "random letters 24"). All texts were 400 characters long. They were encrypted in the same way as the previous data sets. Here is a bigram analysis of the encrypted texts.

The word salad ciphers behave very similar to the text ciphers and unlike the random letter sequence ciphers. So, this is also not the key to z340.

### Many-to-many ciphers

What if z340 is a many-to-many cipher? Could that be the reason for the results above? First, a quick reminder what many-to-many ciphers are. In a one-to-one cipher, any plain text letter is mapped to exactly one cipher character, e.g. "A" to "Α", "B" to "Β", etc. Critically, each cipher character is only used once. In a one-to-many cipher, plain text letters might be mapped to more than one cipher character, e.g. "A" could be "Α" and "Ω", "B" could be "Β" and "Ψ", etc. Here, too, each cipher character is only used for one plain text character. That is, "Α" is always "A". In a many-to-many cipher, this restriction is lifted. It could for instance be the case that "A" maps to "Α" and "Ω" and "B" maps to "Β" – but also "Ω". This way, it becomes a lot harder to decipher such a cipher.

Many-to-many ciphers come in degrees. Strong many-to-many ciphers are very hard, sometimes virtually impossible to crack. I wondered how many-to-many ciphers would come out with respect to |Δ-msd|'s, so I encrypted the texts from above, using many-to-many mappings. The encryption algorithm is pretty strong and arguably really, really hard to crack, maybe at times even beyond crackable. The character frequencies for the different ciphers are shown below. I repeated the analysis from above and produced |Δ-msd|'s for each data set. The results are as follows:

The clusters are less clear now; true and fake clusters intermingle to some degree. The random many-to-many ciphers are in the same space as before, which is unsurprising. The true many-to-many ciphers are closer to the origin than the one-to-many ciphers from above. Crucially: z340 still associates with the fake ciphers and less so with the true text ciphers. This is backed up by another K-means analysis and another SVM analysis, both of which can be found on GitHub [html link].

Again, many-to-many ciphers come in degrees and we see some many-to-many ciphers that behave like fake ciphers. It is a possibility that z340 is an extremely strong many-to-many cipher. If this was the case, then it is questionable whether such a strong many-to-many cipher could be cracked.

### Vigenère ciphers

An intriguing way to create a many-to-many cipher is through a Vigenère encryption, which makes use of repeated keyword-shifted mappings [html link]. This is not to be confused with a simple keyword cipher [html link], where the latter would be easily detected through the ngram-based analysis above.

To illustrate Vigenère ciphers, consider a one-to-one substitution 'cipher' without a shift, where both the plain text and the cipher use the Latin alphabet as their character sets. "HELLOWORLD" in the plaintext would also be "HELLOWORLD" in the cipher. However, now a keyword is added. Assume it is "BEE". "BEE" represents repeated shifts of 1, 4, and 4 positions. For "HELLOWORLD", "H" becomes "I" ("B", shift 1), "E" becomes "I" ("E", shift 4), the first "L" becomes "P" ("E", shift 4), the second "L" becomes "M" ("B, shift 1), and so forth. In our example, plain text "HELLOWORLD" becomes "IIPMSAPVPE". Note how plain text "L" now maps to both "P" and "M" and how cipher text "P" maps to both plain text "L" and "O". Thus, we created a many-to-many cipher with a relatively simple trick applied to a 0-shift one-to-one 'cipher'.

In principle, a Vigenère encryption can be added on top of any cipher, no matter whether it's a one-to-one cipher, a one-to-many cipher, or a many-to-many cipher. However, in my view, it is likely that the ngram-based analysis from above would still pick up on the underlying information. As to why, see here [html link]. This depends on the length of the keyword, though. A Vigenère cipher with a keyword of two or three characters should behave like a true cipher in the analysis above. Longer keywords make things more difficult as they increase the number of possible mappings. This is something I wish to follow up on.

Either way, deciphering a Vigenère cipher can be "fiendishly tricky", as asetniop puts it in his comment here [html link]. One would have to find the right keyword, with hundred of thousands of words to choose from, and find the right order of the cipher character set – because an ordered character set is the underlying assumption of a Vigenère encryption. But for z340's character set, it is not clear what the order would look like. Which character follows the "A"? "B" and then "C" seems plausible, but it's not a given. And what comes next? "Reversed C" or "D"? Further, what happens to all the non-Latin letters like "THETA" and the different boxes, etc.? And even if we had all that right, then we could still not be sure that we have covered the entire cipher character set, as asetniop pointed out. For instance, maybe the original character set included a SIGMA, but it just happened that it never got used.

### Discussion and concluding remarks

We saw how the Zodiac Killer's z340 compares to his other major cipher, z408, and to other true and fake ciphers. The results show that z340 behaves like fake ciphers and unlike true ciphers. The difficulties with z340 are also not caused by some more-or-less obvious trick of how to read in the cipher, e.g. having the cipher mirrored. Further, even if the cipher was just some obscure word salad, this should have surfaced in the analysis above.

So, z340 is not your classical substitution cipher. If you approach it as such, chances are that you won't solve it. What is going then? There is a possibility that z340 is a very strong many-to-many cipher. If z340 was a classical many-to-many cipher, then it would have to make use of ultra-strong mappings – otherwise the analysis above would have probably detected something. However, considering that the first cipher, z408, was fairly weak, this seems unlikely.

Another possibility is that the Zodiac Killer made use of a Vigenère encryption or some other, strong trick, e.g. transposition. In case of a Vigenère encryption, there is a good chance that underlying information would be reflected in the analysis above, certainly for shorter keywords. This is something I wish to follow up on. It is possible that z340 is a very strong Vigenère cipher with a long keyword or uses some other, not-so-standard encryption trick that cannot be detected by an ngram-based analysis. This is indeed a possibility – but one has to keep in mind that z340 was created before computers were readily accessible to the public.

Another possibility is that the cipher is a ruse. Serial killers tend to be subject to hubris, as Douglas points out in his excellent book on the subject [html link]. It must have been a shock to the Zodiac Killer when he learnt that his z408 cipher had been deciphered so quick, by two hobby cryptographers. To prove intellectual superiority, he needed another cipher, one so strong that couldn't be deciphered, certainly not as easily as z408. I think it is a real possibility that the Zodiac Killer was unable to produce such a cipher and that he chose to take another route: cheating. It is a route the Zodiac Killer has taken before, as he promised to reveal his name in his first cipher, which turned out to be a lie. Thus, it could well be that to keep face, the Zodiac Killer has created a fake cipher, one that no-one could ever decrypt.

### Notes

All files can be found on GitHub [html link]. I enjoyed the project and wanted a write up. If you really want to donate something, then feel free to send some amount of Nanos. I think that Nanos are truly great [html link]. Here are my address and QR code:
xrb_39s7e8mu4wbuo77k3idfgmquep3cmkb4r56mf8rnfq6r99q9gxzsnzzj9k9x
Alternatively, here is my Bitcoin address: 14nnVvewsrvE99sm4xGvqGvy4UiB8pdQa9

An earlier version compared z340 to only four true ciphers and four fake ciphers, which seemed a bit low. The results were very similar – in fact, the previous results showed an even clearer divide between true and fake ciphers. The K-means analysis' outcome with respect to z340 was the same. Also, the first version did not include the SVM analysis, which I added some kind of 'linear sanity check'. Further, I made some adjustments and edits, taking feedback into account. The edits even include the title. However, the original post is still up on Github [html link].

The spiral-analysis was added post publication to work in feedback by Sasja. Both Reddit user asetniop and Sasja pointed me to Vigenère encryptions. asetniop's comment and explanation can be found here [html link]. To address this, I've added the section "Vigenère ciphers".

Jarl from www.zodiackillersite.com [html link] got in touch after reading the present entry. Jarl argued that there is a possibility that z340 is a transposition cipher [html link]. I agree and in my view, a sensible follow-up would be to write a script that automatically tests as many transposition combinations for their |Δ-msd|'s as possible.

tsj; originally posted on 16 Mar 2018