Cipher comparisons

= Introduction =

This is a comparison of various calculations applied to a large assortment of cipher texts, including the Zodiac ciphers. The purpose is to provide some quantitative ways to compare the cipher texts to each other.

= Cipher texts =

The table below contains 78 different cipher texts. Ciphers with known solutions are displayed with green backgrounds; unsolved ciphers have white backgrounds. Each cipher is linked to the transcription that was used to generate the computations. Because the symbol "?" is used as a wildcard during gapped n-gram analysis, any transcription that contains it was modified to replace every occurrence of "?" with an alternate symbol. Any transcription that uses multiple letters to represent an individual symbol or glyph of cipher text has likewise been modified to use a single letter or symbol. Finally, each transcription is a stream of text on a single line. This was done to make the calculations easier to obtain.

The original Zodiac ciphers appear towards the top of the results. Exploratory variations of the 340- and 408-character ciphers are also included.

Many test ciphers I've collected over the years are also included in the table.

The results also include other famous unsolved ciphers, such as the Beale ciphers, D'Agapeyeff cipher, Dorabella cipher, Feynman ciphers, Kryptos, and the Voynich manuscript.

Please let me know of any errors you find, or of other ciphers you want to add to this table.

= Explanation of columns =

Click on the little arrows in the table headings to sort the table by that column. The columns contain the following information:


 * #: The cipher number
 * Cipher: Description of the cipher text
 * Len: Total number of characters in the cipher text
 * S: Number of unique symbols in the cipher alphabet
 * M: Multiplicity (S divided by Len). Lower multiplicity values indicate higher numbers of constraints on substitution ciphers, making them easier to solve.  High multiplicity values indicate larger degrees of freedom on substitution ciphers, making them much harder to solve.
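 * IOC: Index of Coincidence, a measure of how likely it is that two symbols picked at random from the cipher text are the same. It is a useful calculation in the analysis of cipher text.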
 * Chi^2: The chi^2 test calculation, a useful calculation in the analysis of cipher text.
 * Entropy: A calculation of the amount of information (in bits) per letter of input text.
 * n2: The total number of distinct 2-grams that repeat in the cipher text. For example, if the cipher text is ABABABCDEFCD, we see that AB occurs three times, while BA and CD each occur twice. Thus n2 is equal to 3.
 * n3: The total number of distinct 3-grams that repeat in the cipher text.
 * n4: The total number of distinct 4-grams that repeat in the cipher text.
 * nt: A weighted sum of n2, n3, and n4, calculated by: (n2/4) + (n3/2) + n4.
 * g2: The total number of distinct 2-grams, with gaps, that repeat in the cipher text. For example, consider the cipher text AXYBAVWBATUBCDECXE.  The pattern A??B appears three times, and the pattern C?E appears twice (where "?" represents any single cipher letter).  Thus, g2 is equal to 2.
 * g3: The total number of distinct 3-grams, with gaps, that repeat in the cipher text.
 * g4: The total number of distinct 4-grams, with gaps, that repeat in the cipher text.
 * gt: A weighted sum of g2, g3, and g4, calculated by: (g2/4) + (g3/2) + g4.
 * t: The sum of nt and gt, used to score all the repeated n-grams (gapped and non-gapped) that appear in the cipher text, where n can be 2, 3, or 4.
 * t/len: The n-gram score t averaged over the entire length of the cipher text. Given two ciphers, a higher t/len suggests more internal structure.
 * cs sum: The sum of the maximum cosine similarities found for each cipher text symbol. "Cosine similarity" measures how similarly two cipher text symbols are distributed in the cipher text. It is based on a technique used by Knight et al. to cluster cipher text symbols that behave similarly, which suggests that they decode to the same plaintext letter. To compute cs sum, we first compute the cosine similarity for every possible pair of symbols in the cipher alphabet. Then, for each unique symbol, we keep only the maximum cosine similarity found in any pair that involves that symbol. These maximums are summed to produce cs sum.
 * cs min: The smallest cosine similarity found when analyzing all possible pairs of cipher text symbols.
 * cs max: The highest cosine similarity found when analyzing all possible pairs of cipher text symbols.
 * cs mean: The cs sum calculation averaged over the number of symbols in the cipher text alphabet. Given two homophonic substitution ciphers, a higher cs mean implies that, on average, more of the cipher's symbols behave like other symbols.
 * h sum: A computation of two-symbol homophone sequences (patterns). This is computed by looking at all possible pairs of cipher text symbols.  For each pair, we look for repeated sequences that occur when all the other cipher text symbols are removed.  These repeated sequences happen if the cipher text author used sequential assignments of symbols for the same underlying plaintext letter.  For each symbol, the longest repeated sequence in which it participates is retained, and the number of repetitions is noted.  Then, h sum is computed by summing the number of repetitions associated with each symbol.
 * h min: This is the smallest number of repetitions found during the search for two-symbol homophone sequences.
 * h max: This is the largest number of repetitions found during the search for two-symbol homophone sequences.
 * h mean: This is h sum averaged over the number of symbols in the cipher text alphabet.  A high value suggests strong use of sequential assignment of cipher symbols to plaintext letters.
 * total score (sums): An overall score computed as the product of t, cs sum, and h sum. This score is an attempt to quantify the overall internal structure of a cipher text.
 * total score (means): An overall score computed as the product of t/len, cs mean, and h mean. Like total score (sums), it is an attempt to quantify the overall internal structure of a cipher text, but normalized for cipher length and alphabet size.
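
The repeated n-gram columns can be illustrated with a short Python sketch. This is not the code used to produce the table: the function names are hypothetical, overlapping occurrences are counted, and the `max_gap` parameter is an assumption, since this page does not state which gap sizes are considered for g2.

```python
from collections import Counter


def repeated_ngrams(text, n):
    """Count distinct contiguous n-grams that occur at least twice."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return sum(1 for c in counts.values() if c >= 2)


def repeated_gapped_2grams(text, max_gap=2):
    """Count distinct patterns 'a, k skipped symbols, b' that occur at least twice.

    max_gap is an assumed limit on k; the gap sizes used for the table
    are not documented here.
    """
    counts = Counter()
    for gap in range(1, max_gap + 1):
        for i in range(len(text) - gap - 1):
            counts[(text[i], gap, text[i + gap + 1])] += 1
    return sum(1 for c in counts.values() if c >= 2)


def weighted_total(c2, c3, c4):
    """Weighting used for nt and gt: (c2/4) + (c3/2) + c4."""
    return c2 / 4 + c3 / 2 + c4


if __name__ == "__main__":
    cipher = "ABABABCDEFCD"
    n2, n3, n4 = (repeated_ngrams(cipher, n) for n in (2, 3, 4))
    print(n2, n3, n4, weighted_total(n2, n3, n4))
```

With `max_gap=2`, `repeated_gapped_2grams("AXYBAVWBATUBCDECXE")` finds exactly the two patterns named in the g2 example (A??B and C?E). Larger limits pick up further patterns such as A???A, so the result depends on that assumed parameter.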

TODO: Figure out why IOC for Voynich is negative.
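
As a reference point for the TODO above, here is a minimal Python sketch of Len, S, M, IOC, and Entropy using textbook definitions. The function name `basic_stats` is hypothetical, and this page does not say which IOC normalization the table uses, so the formulas are assumptions. In this form IOC is a ratio of non-negative counts and cannot fall below zero, so a negative value would have to come from a different formula or from an arithmetic problem in the implementation.

```python
import math
from collections import Counter


def basic_stats(text):
    """Return Len, S, M, IOC and Entropy for a single-line transcription."""
    counts = Counter(text)
    length = len(text)
    symbols = len(counts)

    # Textbook IOC (assumed): chance that two symbols drawn without
    # replacement from the text are identical.
    if length > 1:
        ioc = sum(c * (c - 1) for c in counts.values()) / (length * (length - 1))
    else:
        ioc = 0.0

    # Shannon entropy in bits per symbol.
    entropy = 0.0
    for c in counts.values():
        p = c / length
        entropy -= p * math.log2(p)

    return {
        "Len": length,
        "S": symbols,
        "M": symbols / length if length else 0.0,  # multiplicity = S / Len
        "IOC": ioc,
        "Entropy": entropy,
    }


if __name__ == "__main__":
    print(basic_stats("ABABABCDEFCD"))
```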

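The cs sum, cs min, cs max, and cs mean columns can be sketched in Python as follows. The aggregation (maximum per symbol, then sum) follows the column descriptions above, but the feature vector is an assumption: each symbol is represented here by counts of its immediate left and right neighbors, which may differ from what was used for the table. All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def context_vectors(text):
    """Map each symbol to counts of its left/right neighbors (assumed features)."""
    vecs = defaultdict(lambda: defaultdict(int))
    for i, sym in enumerate(text):
        if i > 0:
            vecs[sym][("L", text[i - 1])] += 1
        if i + 1 < len(text):
            vecs[sym][("R", text[i + 1])] += 1
    return vecs


def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cs_stats(text):
    """Compute cs sum, cs min, cs max and cs mean over all symbol pairs."""
    vecs = context_vectors(text)
    symbols = sorted(vecs)
    best = {s: 0.0 for s in symbols}  # best similarity seen per symbol
    pair_scores = []
    for a, b in combinations(symbols, 2):
        score = cosine(vecs[a], vecs[b])
        pair_scores.append(score)
        best[a] = max(best[a], score)
        best[b] = max(best[b], score)
    cs_sum = sum(best.values())
    return {
        "cs sum": cs_sum,
        "cs min": min(pair_scores) if pair_scores else 0.0,
        "cs max": max(pair_scores) if pair_scores else 0.0,
        "cs mean": cs_sum / len(symbols) if symbols else 0.0,
    }


if __name__ == "__main__":
    # X and Y both sit between A and B, so they should look alike.
    print(cs_stats("AXBAYB"))
```

In the toy text AXBAYB, the symbols X and Y have identical neighbors, so their pair reaches the highest similarity. This is the kind of signal that suggests two symbols decode to the same plaintext letter.
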
= Table of results =