Levenshtein distance analysis

From Zodiac Killer Ciphers Wiki
Revision as of 04:45, 12 December 2010 by Doranchak (talk) (Open questions)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Patterns can be found in the cipher text by identifying repeated n-grams. But what about repeated sequences in which some of the characters are permitted to be different? In the 408 cipher, for example, the trigram #BV is repeated in two positions of the cipher text. But upon closer inspection, the trigrams appear within these two sequences: A9#BV and AP#BV. These two sequences differ only in their 2nd characters. The known solution of the 408 tells us that 9 and P both represent the plaintext letter "I". So, because the two 5-character sequences differ only by their 2nd characters, it is possible to infer that 9 and P represent the same letter. (Source)

We wish to determine all such partially-matching sequences, and see if they tell us anything about the 408 and 340 cipher texts.

To enumerate all such sequences, we use the Levenshtein distance as a measurement of similarity between sequences. Two strings are compared, and the Levenshtein distance is used as an error count. Errors falling within a certain threshold are considered strong "fuzzy matches" of the two strings. The table below tabulates the results of running an exhaustive search of fuzzy matches of string pairs in each cipher text. Each fuzzy match is further constrained by rejecting any matched string pairs that do not share the same beginning and end characters - this avoids including trivial cases of string matches. The table also includes counts derived from rotation and flip operations on the cipher texts.

Distance thresholds are capped to half of the string length. This avoids counting many spurious matches.

The totals are clickable, and will take you to a page that shows all of the matches (but only for the non-rotated and non-flipped cipher texts). Cells with a gray background indicate the count is at a maximum for the given cipher.

Note: When the threshold is zero, this is equivalent to searching for exact matches.

Data

String length Threshold Total (408) Total (408 ROT90) Total (408 FLIP) Total (408 FLIP ROT90) Total (340) Total (340 ROT90) Total (340 FLIP) Total (340 FLIP ROT90)
2 0 84 18 74 26 29 31 33 29
2 1 88 87 97 86 125 147 135 136
3 0 12 0 10 0 2 0 2 1
3 1 107 82 102 82 71 83 84 81
4 0 2 0 1 0 0 0 0 0
4 1 15 1 16 2 1 3 2 4
4 2 173 142 149 139 107 105 105 103
5 0 0 0 0 0 0 0 0 0
5 1 4 0 3 0 0 0 0 0
5 2 27 13 18 11 12 8 10 10
6 0 0 0 0 0 0 0 0 0
6 1 2 0 0 0 0 0 0 0
6 2 5 1 4 0 0 0 0 2
6 3 39 13 30 12 18 16 21 14
7 0 0 0 0 0 0 0 0 0
7 1 0 0 0 0 0 0 0 0
7 2 2 0 2 0 0 0 0 0
7 3 7 0 7 0 1 0 2 1
8 0 0 0 0 0 0 0 0 0
8 1 0 0 0 0 0 0 0 0
8 2 0 0 0 0 0 0 0 0
8 3 1 0 1 0 0 0 0 0
8 4 7 0 9 2 3 3 4 3
9 0 0 0 0 0 0 0 0 0
9 1 0 0 0 0 0 0 0 0
9 2 0 0 0 0 0 0 0 0
9 3 0 0 0 0 0 0 0 0
9 4 3 0 3 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0
10 1 0 0 0 0 0 0 0 0
10 2 0 0 0 0 0 0 0 0
10 3 0 0 0 0 0 0 0 0
10 4 1 0 1 0 0 0 0 0
10 5 5 0 1 0 1 0 2 1
11 0 0 0 0 0 0 0 0 0
11 1 0 0 0 0 0 0 0 0
11 2 0 0 0 0 0 0 0 0
11 3 0 0 0 0 0 0 0 0
11 4 0 0 0 0 0 0 0 0
11 5 1 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0
12 1 0 0 0 0 0 0 0 0
12 2 0 0 0 0 0 0 0 0
12 3 0 0 0 0 0 0 0 0
12 4 0 0 0 0 0 0 0 0
12 5 0 0 0 0 0 0 0 0
12 6 1 0 1 0 0 0 0 0

Results beyond length 12 are not included because no further matches were found.

Observations

  • The count is at maximum the most for the 408 when it is not transformed via rotations and flips.
  • The count is at maximum the most for the 340 when it is flipped horizontally.

Open questions

  • What happens to the 408-related counts if the analysis is applied to the cipher text with homophones replaced with single substitution?
  • What happens to the counts when other transpositions and transformations are applied to the cipher text?