Hypothesis Testing

From Zodiac Killer Ciphers Wiki

Introduction

The Zodiac Killer's 340-character cryptogram (referred to here as Z340) has many things in common with his 408-character cryptogram (Z408), but it remains unsolved. It seems very likely that Z340 was not constructed using the same method as Z408. Many software tools have been built that effectively solve homophonic substitution ciphers with the same properties as Z408, so if Z340 were a simple homophonic substitution cipher, one of the numerous manual or automatic attempts would have solved it by now.

Zodiac may have used some other scheme to produce Z340. In their attempts to crack Z340, people have explored many different schemes, but there is not yet a comprehensive study of which schemes are feasible. Moreover, it is not clear whether a particular scheme could be broken even if Z340 uses it. We need a systematic way to answer these questions about a scheme:

  • Is it possible to create a Z340-like cipher using this scheme?
  • Can a Z340-like cipher created using this scheme be cracked reliably?
  • Is this scheme more likely than other schemes to produce some of the unusual features observed in Z340?

Methodology

In this article I will present my approach to answering the questions above. Here is my strategy:

  • First, define a hypothesis for Z340's encipherment scheme. The hypothesis is a collection of statements about properties of Z340's plaintext, and the operations performed on the plaintext to turn it into ciphertext.
  • Create at least 100 test ciphers under the scheme.
  • Guide the generation of test ciphers so that they contain features and statistical properties that are very similar to Z340.
  • Cryptanalyze the test ciphers, and recover their plaintext messages.
  • If all the test ciphers can be solved reliably, but the same procedure fails to solve Z340, then we have stronger evidence that the chosen hypothesis is false.
  • Select another hypothesis and repeat the process.

This method is labor intensive but helps to rule out specific schemes. If we are lucky, one of the schemes will lead to a solution for Z340, or the newly gained knowledge will guide the search to more plausible encipherment schemes.

Approach for generation of test ciphers

Test cipher production consists of two phases. First, candidate plaintexts are randomly sampled from a large collection of material from Project Gutenberg. The combined material amounts to over 3.6 million words of English text. Plaintexts that, under a given scheme, do not retain some of the desired features described below are discarded. Second, once a suitable plaintext is found, a multiobjective optimization algorithm explores the space of encipherments possible under the scheme. The algorithm looks for encipherments that maximize the generated cipher's similarity to Z340. The following section describes the features and statistics the search attempts to match.
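The first phase can be sketched as follows. This is only a minimal illustration, not the tool actually used: the function name sample_plaintext, the representation of the corpus as a list of words, and the decision to strip everything except letters are all assumptions made for the example.

```python
import random

def sample_plaintext(words, length=340, rng=random):
    """Return `length` letters starting at a random word, or None if too few remain.

    `words` is assumed to be the corpus already split into words (hypothetical format).
    """
    start = rng.randrange(len(words))
    collected = []
    total = 0
    i = start
    while total < length and i < len(words):
        # Keep letters only, upper-cased, so punctuation does not reach the cipher.
        cleaned = "".join(ch for ch in words[i].upper() if ch.isalpha())
        collected.append(cleaned)
        total += len(cleaned)
        i += 1
    if total < length:
        return None  # ran off the end of the corpus; the caller should resample
    return "".join(collected)[:length]
```

A candidate returned by this function would then be checked against the desired features and discarded if it does not retain them under the scheme being tested.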

Features and statistics of Z340 that are desired in the generated test ciphers

  • Unigram distribution: The generated cipher should have the same frequencies of individual symbols.
  • Bigram distribution: The generated cipher should have the same number of repeating bigrams.
  • Trigram distribution: The generated cipher should have the same number of repeating trigrams.
  • Periodic bigram distribution: The generated cipher should have the same unusual number of repeating bigrams at periods 19 and 15.
  • Pseudo-words: Z340 has several pseudo-words ("her", "god", "zodiac") that appear directly in the cipher text. The generated ciphers should have similar words.
  • The generated cipher should contain similar box corner patterns.
  • The generated cipher should contain similar "fold marks".
  • The generated cipher should contain similar "pivot patterns" that are each oriented in the same directions.
  • The generated cipher should have similar degrees of symbol cycling, wherein regularity is found in the homophonic assignments of symbols to individual plaintext letters.

We could include more features, such as other strange phenomena like "prime phobia" and "even/odd bias". However, adding objectives limits the effectiveness of the optimization algorithm, so only the higher-priority features above are considered during the search.
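To make the bigram features concrete, here is a small sketch of how repeating bigrams at a given period might be counted. The counting convention is an assumption (a bigram seen k times contributes k - 1 repeats, and the cipher is not wrapped around at the end); other tools may count differently, and the function name is hypothetical.

```python
from collections import Counter

def repeating_bigrams(cipher, period=1):
    """Count repeats among bigrams formed by symbols `period` positions apart.

    Assumed convention: a bigram that occurs k times contributes k - 1 repeats.
    `cipher` may be a string or a list of symbols.
    """
    pairs = Counter(
        (cipher[i], cipher[i + period]) for i in range(len(cipher) - period)
    )
    return sum(count - 1 for count in pairs.values())
```

With period=1 this measures the ordinary bigram feature; calling it with period=19 or period=15 on a transcription of Z340 and on a generated cipher gives the two numbers the search would try to bring close together.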

Base Hypothesis

This base hypothesis is shared by all other hypotheses and is enforced during production of test ciphers.

  • The plaintext is in the English language. Other languages are not yet considered since there is no known indication of Zodiac writing in another language.
  • The plaintext is written out in a normal reading direction (left to right, top to bottom) before any subsequent encipherment operations are performed.
  • The plaintext contains a significant number of misspellings, a feature present in much of Zodiac's writings.
  • The plaintext contains a section of filler. This is included in the base hypothesis because the last 18 letters of Z408's plaintext are either filler or a still undetermined message.
  • Entire words may be omitted from or doubled in the plaintext. This is included because Z408's plaintext is missing some words (more details here).
  • The cipher contains encipherment errors due to transcription errors caused by "symbol confusion". This is included because Zodiac was known for making these errors in Z408 (more details here).

Hypothesis 1: Z340 is a monoalphabetic homophonic substitution cipher.

Contents

  • After the plaintext is written out to a 17 by 20 grid, a homophonic substitution key is applied.

This hypothesis is considered because Zodiac used it for Z408, and Z340 has similar qualities.
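As a rough illustration of this scheme, the sketch below applies a homophonic key by cycling through each letter's symbols in order and then lays the result out 17 symbols per row. The key shown is a hypothetical toy key, not Zodiac's, and strict cycling is an assumption: a realistic generator would need to relax it to match the degree of symbol cycling seen in Z340.

```python
from itertools import cycle

GRID_WIDTH = 17  # 17 columns by 20 rows holds 340 symbols

def homophonic_encipher(plaintext, key):
    """Replace each letter with the next symbol from that letter's homophone list."""
    cyclers = {letter: cycle(symbols) for letter, symbols in key.items()}
    # Characters that are not in the key are skipped in this sketch.
    return [next(cyclers[ch]) for ch in plaintext.upper() if ch in cyclers]

def to_grid(symbols, width=GRID_WIDTH):
    """Split a flat list of symbols into rows of `width` symbols."""
    return [symbols[i:i + width] for i in range(0, len(symbols), width)]

# Hypothetical toy key for illustration only.
TOY_KEY = {"E": ["1", "2", "3"], "T": ["4", "5"], "H": ["6"]}
toy_cipher = homophonic_encipher("THETHE", TOY_KEY)
```

Frequent letters get more symbols than rare ones, which is what flattens the unigram distribution of the ciphertext.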

Generated ciphers

Analysis and conclusions

  • I expect that zkdecrypto, azdecrypt, or a similar automatic solver will easily solve these ciphers.
  • This hypothesis is likely false for Z340.

Hypothesis 2: Columnar transposition was applied to the plaintext prior to homophonic substitution

Contents

  • A columnar transposition key with length somewhere between 2 and 20 was selected.
  • A key word or phrase may have been used, but we consider the broader case of any random selection of column ordering.
  • When the key length evenly divides 340 (that is, 2, 4, 5, 10, 17, or 20), regular columnar transposition is used. Otherwise, irregular columnar transposition is used (that is to say, no null symbols are inserted as padding).
  • The plaintext was transposed using the columnar transposition key.
  • The result was homophonically encoded.

This hypothesis is considered because it could explain the unusual spike in the number of repeating bigrams in Z340 at period 19.
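The transposition step described above might look like the following sketch. It assumes the text is written row by row into len(order) columns and that order lists the column indices in the sequence they are read out; the function names are hypothetical. Because no padding is added, the same code covers both the regular and the irregular case.

```python
def columnar_transpose(text, order):
    """Write `text` row by row into len(order) columns, then read columns in `order`."""
    width = len(order)
    columns = [text[c::width] for c in range(width)]  # short columns stay short
    return "".join(columns[c] for c in order)

def columnar_untranspose(cipher, order):
    """Invert columnar_transpose for the same column `order`."""
    width = len(order)
    full_rows, extra = divmod(len(cipher), width)
    columns = [""] * width
    pos = 0
    for c in order:
        # The first `extra` columns hold one more symbol than the rest.
        length = full_rows + (1 if c < extra else 0)
        columns[c] = cipher[pos:pos + length]
        pos += length
    # Reading the columns row by row restores the original order.
    return "".join(columns[i % width][i // width] for i in range(len(cipher)))
```

For example, columnar_transpose("ABCDEFGH", [1, 0, 2]) reads the middle column first and gives "BEHADGCF". The homophonic key would then be applied to the transposed text.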

Generated ciphers

Analysis and conclusions

  • Test ciphers with shorter key lengths should be easy to solve by generating all the possible untranspositions.
  • However, as key length increases, the number of possible untranspositions increases significantly and makes brute force solves intractable.
    • key length, number of possible untranspositions
    • 2, 2
    • 3, 6
    • 4, 24
    • 5, 120
    • 6, 720
    • 7, 5040
    • 8, 40320
    • 9, 362880
    • 10, 3628800
    • 11, 39916800
    • 12, 479001600
    • 13, 6227020800
    • 14, 87178291200
    • 15, 1307674368000
    • 16, 20922789888000
    • 17, 355687428096000
    • 18, 6402373705728000
    • 19, 121645100408832000
    • 20, 2432902008176640000
  • For larger key lengths, a hillclimbing approach is probably required to explore the search space of possible solutions. Statistics about the candidate untranspositions may also help guide the search (for example, detection of repeating fragments and/or symbol cycles).
  • Note: the generated ciphers use keys that are not based on keywords. Therefore, scanning a dictionary for possible keys will not work unless the key happens to coincide with a word.
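The counts in the table above are simply the factorial of the key length, since every ordering of the columns is a candidate key. A short sketch (with hypothetical function names) reproduces them and shows how the orderings could be enumerated for the small key lengths where brute force is practical:

```python
import math
from itertools import permutations

def untransposition_count(key_length):
    """Number of distinct column orderings for a given key length."""
    return math.factorial(key_length)

def candidate_orderings(key_length):
    """Lazily yield every column ordering; only practical for short keys."""
    return permutations(range(key_length))

for k in (2, 10, 20):
    print(k, untransposition_count(k))
```

Each ordering yielded by candidate_orderings would be used to untranspose the cipher, and the result handed to an automatic solver.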

Hypothesis 3: Scytale transposition was applied to the plaintext prior to homophonic substitution

Contents

  • A scytale key with a diameter (length or period) somewhere between 2 and 170 was selected.
  • The plaintext is rewritten using the selected scytale key.
  • The resulting text is homophonically encoded.

This hypothesis is considered because it could explain the unusual spike in the number of repeating bigrams in Z340 at period 19.
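A scytale can be modeled in more than one way. The sketch below assumes the simplest reading: the text is written row by row into a grid whose width equals the period, and the columns are then read off left to right with no padding. Under that assumption it also builds every candidate untransposition for periods 2 through 170. All names are hypothetical.

```python
def scytale_encipher(text, period):
    """Write `text` in rows of `period` symbols, then read the columns left to right."""
    return "".join(text[c::period] for c in range(period))

def scytale_decipher(cipher, period):
    """Invert scytale_encipher for the same `period`."""
    full_rows, extra = divmod(len(cipher), period)
    columns = []
    pos = 0
    for c in range(period):
        # The first `extra` columns hold one more symbol than the rest.
        length = full_rows + (1 if c < extra else 0)
        columns.append(cipher[pos:pos + length])
        pos += length
    return "".join(columns[i % period][i // period] for i in range(len(cipher)))

def all_untranspositions(cipher, low=2, high=170):
    """Map each candidate period to the corresponding untransposed text."""
    return {p: scytale_decipher(cipher, p) for p in range(low, high + 1)}
```

Each value in the dictionary returned by all_untranspositions is a candidate that can be passed to an automatic homophonic solver.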

Generated ciphers

Analysis and conclusions

  • Since the scytale key has length between 2 and 170 for each of these ciphers, they should be easy to solve by generating all the possible untranspositions and running them through an automatic solver program. There are only 169 possible untranspositions per cipher.
  • Alternatively, the hillclimber suggested in Hypothesis 2 could be effective.

Addendum: General Hypotheses

Here are some more general hypotheses about Z340:

  • Z340 contains a real plaintext message and is enciphered using homophonic substitution, similar to the construction of Z408.
  • Z340 contains a real plaintext message and is enciphered using some other undiscovered method.
  • Z340 contains no real plaintext (i.e., Zodiac deliberately created a fake cipher to keep everyone busy)
  • Z340 contains a real plaintext message but its ngram statistics are anomalous, preventing discovery of the solution (here's an example of such a message).
  • Z340 contains a real plaintext message and is enciphered using some commonly known method, but Zodiac made sufficient mistakes to prevent recovery of the plaintext.