Hypothesis Testing
Contents
- 1 Introduction
- 2 Methodology
- 3 Approach for generation of test ciphers
- 4 Base Hypothesis
- 5 Hypothesis 1: Z340 is a monoalphabetic homophonic substitution cipher.
- 6 Hypothesis 2: Columnar transposition was applied to the plaintext prior to homophonic substitution
- 7 Hypothesis 3: Scytale transposition was applied to the plaintext prior to homophonic substitution
Introduction
The Zodiac Killer's 340-character cryptogram (referred to here as Z340) has many things in common with his 408-character cryptogram (Z408), but it remains unsolved. Z340 was very likely not constructed using the same method as Z408: many software tools have been built that effectively solve homophonic substitution ciphers with the same properties as Z408, so if Z340 were a simple homophonic substitution cipher, one of the numerous manual or automated attempts would have solved it by now.
Zodiac may have used some other scheme to produce Z340. In their attempts to crack Z340, people have explored many different schemes, but there is not yet a comprehensive study of which schemes are feasible. Moreover, it is not clear whether a particular scheme could be broken even if Z340 does use it. We need a systematic way to answer these questions about a scheme:
- Is it possible to create a Z340-like cipher using this scheme?
- Can a Z340-like cipher created using this scheme be cracked reliably?
- Is this scheme more likely than other schemes to produce some of the unusual features observed in Z340?
Methodology
In this article I present my approach to answering the questions above. Here is my strategy:
- First, define a hypothesis for Z340's encipherment scheme. The hypothesis is a collection of statements about properties of Z340's plaintext, and the operations performed on the plaintext to turn it into ciphertext.
- Create at least 100 test ciphers under the scheme.
- Guide the generation of test ciphers so that they contain features and statistical properties that are very similar to Z340.
- Cryptanalyze the test ciphers, and recover their plaintext messages.
- If all the test ciphers can be solved reliably, but the same procedure fails to solve Z340, then we have stronger evidence that the chosen hypothesis is false.
- Select another hypothesis and repeat the process.
This method is labor-intensive, but it helps to rule out specific schemes. If we are lucky, one of the schemes will lead to a solution for Z340, or the newly gained knowledge will guide the search toward more plausible encipherment schemes.
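The decision step in this strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the exact-match test for "solved", and the 95% threshold for "solved reliably" are all hypothetical choices, not part of the method described above.

```python
def solve_rate(test_cases, solve):
    """Fraction of test ciphers whose plaintext the solver recovers exactly.

    test_cases: iterable of (ciphertext, plaintext) pairs generated under
    one hypothesis. solve: hypothetical solver, ciphertext -> plaintext guess.
    """
    cases = list(test_cases)
    solved = sum(1 for ciphertext, plaintext in cases
                 if solve(ciphertext) == plaintext)
    return solved / len(cases)


def verdict(rate, z340_solved, threshold=0.95):
    """Interpret the outcome for one hypothesis.

    threshold is an assumed cut-off for "solved reliably".
    """
    if z340_solved:
        return "hypothesis supported"
    if rate >= threshold:
        # Test ciphers fall reliably but Z340 does not: evidence against.
        return "evidence against hypothesis"
    # The solver is too weak under this scheme to conclude anything.
    return "inconclusive"
```

In practice an exact-match criterion is probably too strict, since the base hypothesis allows misspellings and encipherment errors; a partial-match score would be a natural substitute.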
Approach for generation of test ciphers
Test cipher production consists of two phases. First, candidate plaintexts are randomly sampled from a large collection of material from Project Gutenberg; the combined material amounts to over 3.6 million words of English text. Plaintexts that, under a given scheme, do not retain some of the desired features described below are discarded. Second, once a suitable plaintext is found, a multiobjective optimization algorithm explores the space of encipherments available under that scheme, looking for encipherments that maximize the generated cipher's similarity to Z340. The following section describes the features and statistics the search tries to match.
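The first phase might look roughly like the sketch below. It assumes the corpus has already been loaded as a list of words; the function name, the `keep` predicate (standing in for the feature checks), and the retry limit are hypothetical.

```python
import random


def sample_plaintext(words, length=340, keep=None, rng=None, max_tries=1000):
    """Draw a random run of `length` letters that starts on a word boundary.

    words: corpus as a list of words (assumed already loaded).
    keep: optional predicate; candidates it rejects are discarded.
    Returns an uppercase, letters-only string, or None if nothing passes.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        start = rng.randrange(len(words))
        letters = []
        for word in words[start:]:
            letters.extend(ch for ch in word.upper() if ch.isalpha())
            if len(letters) >= length:
                break
        if len(letters) < length:
            continue  # ran off the end of the corpus; try another start
        candidate = "".join(letters[:length])  # may cut the last word short
        if keep is None or keep(candidate):
            return candidate
    return None
```

Passing a seeded `random.Random` makes a run reproducible, which helps when a particular test cipher needs to be regenerated later.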
Features and statistics of Z340 that are desired in the generated test ciphers
- Unigram distribution: The generated cipher should have the same frequencies of individual symbols as Z340.
- Bigram distribution: The generated cipher should have the same number of repeating bigrams as Z340.
- Trigram distribution: The generated cipher should have the same number of repeating trigrams as Z340.
- Periodic bigram distribution: The generated cipher should have the same unusual number of repeating bigrams at periods 19 and 15 as Z340.
- Z340 has several pseudo-words ("her", "god", "zodiac") that appear directly in the ciphertext. The generated ciphers should have similar words.
- The generated cipher should contain similar box corner patterns.
- The generated cipher should contain similar "fold marks".
- The generated cipher should contain similar "pivot patterns" that are each oriented in the same directions.
- The generated cipher should have similar degrees of symbol cycling, wherein regularity is found in the homophonic assignments of symbols to individual plaintext letters.
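The n-gram statistics in this list are simple to compute. The sketch below assumes one common convention, in which an n-gram occurring k times contributes k - 1 repeats, and treats a bigram at period p as the pair of symbols at positions i and i + p. Both conventions and the function names are assumptions, not definitions taken from this article.

```python
from collections import Counter


def count_repeats(items):
    """Total repeats: an item seen k times contributes k - 1."""
    return sum(k - 1 for k in Counter(items).values() if k > 1)


def ngram_repeats(cipher, n):
    """Repeats among contiguous n-grams of a symbol sequence."""
    grams = [tuple(cipher[i:i + n]) for i in range(len(cipher) - n + 1)]
    return count_repeats(grams)


def periodic_bigram_repeats(cipher, period):
    """Repeats among pairs (cipher[i], cipher[i + period])."""
    pairs = [(cipher[i], cipher[i + period])
             for i in range(len(cipher) - period)]
    return count_repeats(pairs)
```

With period 1 the periodic count reduces to the ordinary bigram count, so comparing it against periods 15 and 19 shows how unusual those periods are for a given cipher.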
Base Hypothesis
This base hypothesis is shared by all other hypotheses and is enforced during production of test ciphers.
- The plaintext is in the English language. Other languages are not yet considered since there is no known indication of Zodiac writing in another language.
- The plaintext is written out in a normal reading direction (left to right, top to bottom) before any subsequent encipherment operations are performed.
- The plaintext contains a significant number of misspellings, a feature present in many of Zodiac's writings.
- The plaintext contains a section of filler. This is included in the base hypothesis because the last 18 letters of Z408's plaintext are either filler or a still-undetermined message.
- Entire words may be omitted from, or doubled in, the plaintext. This is included because Z408's plaintext is missing some words (more details here).
- The cipher contains encipherment errors caused by "symbol confusion" during transcription. This is included because Zodiac is known to have made such errors in Z408 (more details here).
Hypothesis 1: Z340 is a monoalphabetic homophonic substitution cipher.
- After the plaintext is written out to a grid of 17 columns by 20 rows, a homophonic substitution key is applied.
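A minimal sketch of this encipherment follows. It assumes the key maps each plaintext letter to a list of cipher symbols and that the symbols for a letter are used in strict cyclic order. Perfect cycling is a simplification, and the function name and key format are hypothetical.

```python
from itertools import cycle


def encipher_homophonic(plaintext, key, width=17):
    """Apply a homophonic key and lay the result out in rows of `width`.

    key: dict mapping each uppercase letter to a list of cipher symbols.
    Every letter in the plaintext must appear in the key, otherwise a
    KeyError is raised. Non-letters are dropped before encipherment.
    """
    cyclers = {letter: cycle(symbols) for letter, symbols in key.items()}
    letters = [ch for ch in plaintext.upper() if ch.isalpha()]
    symbols = [next(cyclers[ch]) for ch in letters]
    # Reading order is left to right, top to bottom, per the base hypothesis.
    return [symbols[i:i + width] for i in range(0, len(symbols), width)]
```

A 340-letter plaintext yields 20 rows of 17 symbols. To mimic the imperfect cycling seen in real ciphers, the strict `cycle` could be replaced with an occasional random choice among a letter's symbols.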
Hypothesis 2: Columnar transposition was applied to the plaintext prior to homophonic substitution
TODO
Hypothesis 3: Scytale transposition was applied to the plaintext prior to homophonic substitution
TODO