Brute force search for homophone sequences

Introduction
This is a tabulation of results from a brute force search for homophone sequences of the 408 and 340 ciphers. We consider all combinations of symbol sequences up to 7 symbols in length. For each combination, all cipher text is removed except for the symbols in the sequence. Then, the remaining cipher text is analyzed for repetitions. In the solved 408 cipher, homophones are known, so we easily see the repetitions due to the intentional use of sequences by the cipher author. Perhaps detection of similar repetitions in the 340 will reveal the cipher author's intentional use of sequences.
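The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the tables: the function names are hypothetical, the cipher text is assumed to be a string with one character per symbol, and a "repetition" is assumed to mean a non-overlapping occurrence of the sequence in the filtered text.

```python
def filter_to_sequence(cipher_text, sequence):
    """Remove every symbol that is not part of the candidate sequence."""
    keep = set(sequence)
    return "".join(ch for ch in cipher_text if ch in keep)


def count_repetitions(cipher_text, sequence):
    """Count non-overlapping occurrences of the sequence in the filtered text."""
    return filter_to_sequence(cipher_text, sequence).count(sequence)


# Toy cipher text (not taken from either cipher): A, B, C are used in a
# strict cycle, with unrelated symbols x, y, z mixed in between them.
toy = "AxBCyAzBxCAByCz"
print(filter_to_sequence(toy, "ABC"))   # the text with x, y, z removed
print(count_repetitions(toy, "ABC"))    # how often ABC repeats in it
```

A brute force search would call `count_repetitions` for every candidate sequence of length 2 to 7 and keep those that repeat.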

The results are divided into two sections, one per cipher. Each section contains one table of results for each sequence length. Within a table, the results are ranked by a score that reflects the overall strength of the repetition. This strength score combines an odds estimate, based on a combinatorial analysis of the patterns, with a summation of cosine distances of the symbols that participate in the repeated sequence.

Here is a description of the columns:


 * L: The length of the repeated sequence. We tested L from 2 to 7.  For each L, the table shows only the 200 highest scoring results.
 * S: This is the sequence we are testing for repetitions.
 * Odds: This is an estimate of how rare this kind of sequence would be if you examined all permutations of the cipher text.  The value is calculated using a combinatorial analysis of how many similar patterns can form using the same set of symbols, the same value of L, and the same number of repetitions.  The odds are then computed by dividing the total number of permutations of the pattern sequence by the number of similar patterns that can form.  Higher values suggest the pattern would rarely form by chance.  Here is some detail about a similar calculation.
 * Rep: The number of repetitions of the sequence found in the cipher text.
 * Tot: This is the total number of times each symbol occurs in the overall cipher text.
 * %: The percentage of cipher text this sequence covers. Compare this to letter frequencies. For instance, a homophone sequence that is assigned to the plaintext letter 'E' should have a percentage value similar to the frequency of the letter 'E' in normal English text, about 12.7%.
 * CS: This is a sum of the cosine distances of all pairs within the sequence. For instance, the sequence ABC has pairs AB, BC, and AC.  The cosine distance of each pair is computed, and the results are summed.  Despite the name, the value is used here as a similarity measure: higher values mean the two symbols are distributed more similarly in the cipher text.  It is based on a technique used by Knight et al. to cluster cipher text symbols that behave similarly, suggesting that they decode to the same plaintext letter.
 * Score: Computed by multiplying CS by Odds. This lets us sort the results by both measurements.
 * Matches: Shows the occurrences of the sequence in the cipher text when all other symbols are removed.  Matches are enclosed by brackets.
 * Dec: In the 408 cipher results, this column shows the corresponding plaintext known to be assigned to these symbols.
 * Act?: In the 408 cipher results, this column tells us if the detected sequence is a true homophone sequence assigned to the same plaintext letter.
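As a rough illustration of how the %, CS, and Score columns fit together, here is a minimal Python sketch. It rests on several assumptions: the page does not say which feature vectors feed the cosine measure, so this sketch uses counts of the immediately preceding and following symbols; pairs are formed from the distinct symbols of the sequence; and all function names are hypothetical. The Odds value is taken as an input because its combinatorial calculation is not reproduced here.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def context_vector(cipher_text, symbol):
    """Count the symbols found immediately before and after each occurrence."""
    counts = Counter()
    for i, ch in enumerate(cipher_text):
        if ch != symbol:
            continue
        if i > 0:
            counts[("prev", cipher_text[i - 1])] += 1
        if i + 1 < len(cipher_text):
            counts[("next", cipher_text[i + 1])] += 1
    return counts


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0.0 if one is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cs(cipher_text, sequence):
    """CS column: sum of the cosine measure over all pairs of distinct symbols."""
    symbols = sorted(set(sequence))
    vectors = {s: context_vector(cipher_text, s) for s in symbols}
    return sum(cosine(vectors[a], vectors[b]) for a, b in combinations(symbols, 2))


def coverage_percent(cipher_text, sequence):
    """% column: share of the cipher text occupied by the sequence's symbols."""
    keep = set(sequence)
    return 100.0 * sum(ch in keep for ch in cipher_text) / len(cipher_text)


def score(cs_value, odds):
    """Score column: CS multiplied by Odds."""
    return cs_value * odds
```

With these definitions, two symbols that always appear in the same contexts get a pairwise value close to 1, and symbols that never share a context get 0.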


L=2
Showing only the top 200 results. Click here for the full raw results (614 records)

L=3
Showing only the top 200 results. Click here for the full raw results (3052 records)

L=4
Showing only the top 200 results. Click here for the full raw results (7605 records)

L=5
Showing only the top 200 results. Click here for the full raw results (16801 records)

L=6
Showing only the top 200 results. Click here for the full raw results (14688 records)

L=7
Showing only the top 200 results. Click here for the full raw results (10568 records)

L=2
Showing only the top 200 results. Click here for the full raw results (531 records)

L=3
Showing only the top 200 results. Click here for the full raw results (1663 records)

L=4
Showing only the top 200 results. Click here for the full raw results (3124 records)

L=5
Showing only the top 200 results. Click here for the full raw results (3377 records)

L=6
Showing only the top 200 results. Click here for the full raw results (2929 records)

L=7
Showing only the top 200 results. Click here for the full raw results (3229 records)

Comparison to other 340 variations
TODO: I need to recompute the Top Scores in this section, since there was a bug in the original algorithm.

These are the top scores (odds multiplied by cosine distances) for L from 2 to 4 for several 340 cipher texts:
 * The original 340 cipher
 * The text flipped horizontally (mirrored left to right), then rotated 90 degrees counterclockwise. This is the same as reading the original text column by column: top to bottom within each column, taking the columns from left to right.
 * The text flipped horizontally only.
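The equivalence claimed for the second variation can be checked with a small Python sketch. The helper names are hypothetical, the grid is assumed to be rectangular (text length a multiple of the width), and "flipped horizontally" is taken to mean mirroring each row left to right. For the 340, the width would be 17 if the usual 17-column transcription is used.

```python
def to_grid(text, width):
    """Split a flat cipher text into rows of the given width."""
    return [list(text[i:i + width]) for i in range(0, len(text), width)]


def mirror_horizontal(grid):
    """Flip the grid left to right by reversing every row."""
    return [row[::-1] for row in grid]


def rotate_ccw(grid):
    """Rotate 90 degrees counterclockwise: the last column becomes the first row."""
    width = len(grid[0])
    return [[row[c] for row in grid] for c in reversed(range(width))]


def read_columns(grid):
    """Read the grid column by column, top to bottom, columns left to right."""
    width = len(grid[0])
    return [[row[c] for row in grid] for c in range(width)]


def flatten(grid):
    """Join the rows of a grid back into one string."""
    return "".join("".join(row) for row in grid)


# Toy grid with 2 rows and 3 columns, not a real cipher.
toy = to_grid("ABCDEF", 3)
variant = rotate_ccw(mirror_horizontal(toy))
print(flatten(variant))
print(flatten(read_columns(toy)))
```

Both lines print the same string, since the mirror followed by the rotation amounts to transposing the grid.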

Results:

For reasons not yet understood, the top scores are higher when the 340 cipher is flipped horizontally.

Top scores plot
TODO: I need to recompute the Top Scores in this section, since there was a bug in the original algorithm.

Here is a plot showing the top scores for each L, for each cipher.



The 408's top scores are represented by the green bars. The 340's top scores are represented by the red bars. The Y axis is logarithmic to show the comparison more clearly. You can see here that the top-scoring sequences for the 408 score several orders of magnitude higher than those of the 340. The only exception is L=2, where the 340's sequence B+ has a higher score, due to its 10 repetitions. The 408's sequence 9k only repeats 9 times. However, the repetitions of 9k are all contiguous, suggesting a more orderly arrangement. This kind of orderliness is not yet considered by the analysis on this page.

Observations and Questions

 * Cosine distances and detection of homophone sequences both produce many false positives, so it is difficult to know with certainty which sequences are correct.
 * Among false positives, it is possible to distinguish between the falsely included symbols and the correctly identified symbols.
 * Take, for example, the sequence "pW+U" which is found repeating 6 times in the 408 cipher. The symbols p, W, and + are correctly identified, since they belong to plaintext letter 'E'.  But the symbol U is actually plaintext letter 'I'.  Let's look at the cosine distances for every possible pair of symbols in "pW+U":
 * pW: 0.4082483
 * p+: 0.34020692
 * pU: 0.23570225
 * W+: 0.33333337
 * WU: 0.18042195
 * +U: 0.19245009
 * Observe that the pairs involving the symbol U (pU, WU, and +U) have the lowest cosine distance values, suggesting that the similarity relationship between U and the other symbols is weak compared to that among the other symbols. Another way to look at this is to add up the cosine distances for all pairs involving a given symbol.  Here is the result for pairs in "pW+U":
 * Sum for symbol [U]: 0.6085743
 * Sum for symbol [W]: 0.9220036
 * Sum for symbol [p]: 0.98415744
 * Sum for symbol [+]: 0.8659904
 * Again we can see that the value is lowest for the symbol U.
 * If the cipher author assigns homophone sequences in an orderly fashion, the number of detected sequences will increase overall. Also, it is believed that more symbols will exhibit higher cosine distances.  So, taken as a whole, these measurements can be used to detect orderliness in homophone cipher text construction.
 * The 340 cipher shows significantly weaker and fewer homophone sequences than the 408 (see the plot above). This makes it difficult to make conclusions about the construction of the 340.  Did the author use sequential homophones for part of the cipher, and then switch to randomly assigned homophones?  Or do the sequences arise by chance alone?  The results suggest that at least some form of orderly homophonic substitution was used, but to a much smaller extent than for the 408.
 * Many of the repeated sequences of the 340 cipher themselves contain repeated symbols. For example: |G2++, #zKz/D, .GO_OD, RDY<R4, .fO_OD, and many more.  What does this mean, if anything?  How does this phenomenon compare to the 408?
 * Did the cipher author perform homophonic substitution prior to some other scheme that rearranged the cipher text? If so, it would disrupt the occurrences of repeated n-grams and orderly homophone sequences, reducing the orderliness measurements.  Thus, reversing the rearrangement steps would increase the orderliness measurements.  This can be exploited in cryptanalysis using a heuristic search.
 * Is the 340 a standard homophonic substitution cipher? The detected sequences above suggest that it is.  Evidence against this would be test ciphers that have similar distributions of sequence repetitions that do not actually correspond to real homophones.  Perhaps we could create such test ciphers by using random homophone assignment, and measuring how many "interesting" patterns occur by chance.
 * What is the best way to include other orderliness measures into the odds computation? For example, many sequences in the 408 are very strong because they occur without interruption.  But the odds computation considers only the number of repetitions.
 * Why do the scores increase when the cipher text is flipped horizontally?
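The per-symbol sums given for "pW+U" in the observations above can be reproduced from the six pairwise values listed there. The following Python sketch does only that arithmetic; the function name is hypothetical, and picking the symbol with the smallest sum as the likely false inclusion is a heuristic suggested by this one example, not an established rule.

```python
# Pairwise cosine values for "pW+U", copied from the list above.
pair_values = {
    ("p", "W"): 0.4082483,
    ("p", "+"): 0.34020692,
    ("p", "U"): 0.23570225,
    ("W", "+"): 0.33333337,
    ("W", "U"): 0.18042195,
    ("+", "U"): 0.19245009,
}


def per_symbol_sums(pairs):
    """Add each pair's value to the running total of both symbols in the pair."""
    sums = {}
    for (a, b), value in pairs.items():
        sums[a] = sums.get(a, 0.0) + value
        sums[b] = sums.get(b, 0.0) + value
    return sums


sums = per_symbol_sums(pair_values)

# The symbol with the smallest total is the weakest member of the sequence.
weakest = min(sums, key=sums.get)

for symbol, total in sorted(sums.items(), key=lambda item: item[1]):
    print(f"{symbol}: {total:.4f}")
print("weakest:", weakest)
```

The same ranking could be applied to any detected sequence to flag symbols worth a second look.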