Comparison of character repetition rates

This is a discussion about character repetition rates in the 340 and 408 ciphers. Participants: forum users Glen Claston and Bert on zodiackiller.com, September 21, 2000 (source: http://www.zodiackiller.com/mba/zc/69.html):

Glen,

It's interesting to look at how the 408- & 340-ciphers compare in terms of ciphertext character repetition rates.

For example, using ciphertext notation similar to your http://www.geocities.com/cryptography_2000/340.txt we have:

408Row#1: 18P/Z/UB8kORnpXnB (4 reps/17)
408Col#3: P+Y/LqT/KDpDMIpSaXTeFdJX (4 reps/24)
...
340Row#3: ByncM+UZGWzaL6uHJ (0 reps/17)
340Row#4: Spp2^l37VmpO++RKh (3 reps/17)
340Col#2: Epyp130GvuevoGBcceFM (5 reps/20)
340Col#3: RenpMRehMlRc1FXsTakD (4 reps/20)
...

(The definition is such that in "ABAAB" there are three reps, since A is repeated twice and B is repeated once.)
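
Put another way, the rep count of a string is its length minus the number of distinct characters in it. A minimal Python sketch of that definition follows; the function name `rep_count` is an illustrative choice and does not come from the original posts.

```python
def rep_count(text):
    """Count repetitions: every occurrence of a symbol after its first one."""
    return len(text) - len(set(text))


# The example from the definition: A repeats twice, B repeats once.
print(rep_count("ABAAB"))              # 3
# Two of the 340-cipher rows quoted above.
print(rep_count("ByncM+UZGWzaL6uHJ"))  # 0
print(rep_count("Spp2^l37VmpO++RKh"))  # 3
```

Upper- and lower-case symbols are treated as different characters here, which is what the quoted counts appear to assume.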

We can find the rep rates row-by-row and get the average rep rate within rows for each cipher; similarly, we can do the same thing column-by-column to get the avg rep rate within columns for each cipher.

Here's what I find (manually) for the 408- & 340-ciphers ...

#reps in each row
408-cipher: 4 2 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 2 2 2 2 2 1 2
340-cipher: 0 0 0 3 2 1 0 0 1 2 0 1 0 2 0 0 1 2 3 1 0

#reps in each column
408-cipher: 1 3 4 4 5 6 4 7 7 3 1 5 5 3 4 5 4
340-cipher: 0 5 4 2 2 6 2 2 4 4 5 3 4 4 2 3 3

Those counts give the following average repetition rates:

By rows
408-cipher: 7% = 27 reps /(24*17)
340-cipher: 6% = 19 reps /(20*17)

By columns
408-cipher: 17% = 71 reps /(17*24)
340-cipher: 16% = 55 reps /(17*20)
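
The arithmetic behind these percentages can be checked with a few lines of Python. This is only a sketch: the count lists are copied as posted above, the denominators are the ones stated in the post, and the variable names are illustrative.

```python
# Rep counts per row and per column, copied as posted above.
rows_408 = [4, 2, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 2, 2, 2, 2, 2, 1, 2]
rows_340 = [0, 0, 0, 3, 2, 1, 0, 0, 1, 2, 0, 1, 0, 2, 0, 0, 1, 2, 3, 1, 0]
cols_408 = [1, 3, 4, 4, 5, 6, 4, 7, 7, 3, 1, 5, 5, 3, 4, 5, 4]
cols_340 = [0, 5, 4, 2, 2, 6, 2, 2, 4, 4, 5, 3, 4, 4, 2, 3, 3]

# Denominators as stated in the post: total characters in each cipher.
SIZE_408 = 24 * 17
SIZE_340 = 20 * 17


def rate_percent(counts, n_chars):
    """Total reps divided by total characters, as a percentage."""
    return 100 * sum(counts) / n_chars


for label, counts, size in [
    ("408 rows", rows_408, SIZE_408),
    ("340 rows", rows_340, SIZE_340),
    ("408 cols", cols_408, SIZE_408),
    ("340 cols", cols_340, SIZE_340),
]:
    print(f"{label}: {sum(counts)} reps, {rate_percent(counts, size):.1f}%")
```

Unrounded, the four rates come out at about 6.6%, 5.6%, 17.4% and 16.2%.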

Behavior(1): *Within* each cipher, the rep rate in rows is notably less than the rep rate in columns (6-7% << 16-17%).

Behavior(1) is consistent with homophonic substitution, possibly combined with transpositions of some kind. (The transpositions would have no effect on the rep rates.) It would be a natural consequence of deliberately suppressing (on avg) the short-term repetition of characters at the time the ciphertext substitutions are being made. This would reduce the row rates but would have much less effect on column rates, which depend on column alignments that are not likely to be controlled. On the other hand, isn't this behavior very unlikely with just about any other substitution scheme (e.g. polyalphabetic)?
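
The suppression idea can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not a reconstruction of Zodiac's method: it assumes each plaintext letter rotates strictly through its own list of homophones, uses integers as cipher symbols, and all function names and homophone counts are hypothetical.

```python
from collections import defaultdict


def rep_count(seq):
    """Length minus number of distinct symbols."""
    return len(seq) - len(set(seq))


def homophonic_cycle(plaintext, n_homophones):
    """Encipher so that each letter rotates through its own homophones.

    n_homophones maps a letter to how many cipher symbols it gets
    (default 1). Cipher symbols are plain integers in this toy model.
    """
    table = {}               # letter -> list of its cipher symbols
    uses = defaultdict(int)  # letter -> how many times it has been seen
    next_symbol = 0
    out = []
    for ch in plaintext:
        if ch not in table:
            k = n_homophones.get(ch, 1)
            table[ch] = list(range(next_symbol, next_symbol + k))
            next_symbol += k
        symbols = table[ch]
        out.append(symbols[uses[ch] % len(symbols)])
        uses[ch] += 1
    return out


def rep_rates(symbols, width):
    """Return (row rate, column rate) for symbols laid out `width` per row."""
    rows = [symbols[i:i + width] for i in range(0, len(symbols), width)]
    cols = [symbols[i::width] for i in range(width)]
    n = len(symbols)
    return (sum(rep_count(r) for r in rows) / n,
            sum(rep_count(c) for c in cols) / n)


# Illustrative homophone counts only; not taken from the 408 key.
plain = "ILIKEKILLINGPEOPLEBECAUSEITISSOMUCHFUN"
cipher = homophonic_cycle(plain, {"E": 4, "I": 4, "L": 3, "S": 3})
print(rep_rates(cipher, 17))
```

Rotation pushes repeats of a symbol as far apart as possible along the writing direction, so they seldom land in the same row, while nothing controls which column they fall into.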

Behavior(2): *Between* ciphers, the rep rates are practically the same, by row (6% ~= 7%) and by column (16% ~= 17%).

Behavior(2) seems to strongly indicate that the type of substitution scheme used in the 408-cipher, i.e. homophonic, was used again in the 340-cipher. (And again this behavior would not be affected by combining the substitution with transpositions. Also, it seems to me that the addition of some transpositions would be sufficient to account for the fact that the 340-cipher has not yet been solved.)

Doesn't it seem likely that the rep rates would have changed significantly by any fundamental change in the substitution scheme (e.g. switching from homophonic to polyalphabetic)?

In a "Paradice Slaves" thread, you mention some results of chi^2 & IoC tests performed on the 340-cipher alone. It might be interesting to see how those compare *between* the two ciphers.

Regards, Bert


Bert,

I actually did look at repetition rates among many other things, and if you’ll check my old posts you’ll see that I’ve been arguing for a while that there are very few differences between the 408- and 340-ciphers. Your illustration is a very good one and a very good addition to the already overwhelming evidence that the 340 is probably a homophonic cipher or some variant.

By rows
408-cipher: 7% = 27 reps /(24*17)
340-cipher: 6% = 19 reps /(20*17)

By columns
408-cipher: 17% = 71 reps /(17*24)
340-cipher: 16% = 55 reps /(17*20)

Your figures above are accurate, and you’ll notice that in every measure of repetition the 340-cipher rates lower than the 408. The extra characters don’t add enough to make up the difference either. Something is causing this, and as you’ve stated, transposition makes no difference in the counts, so it must be something else.

Every check I’ve made for transposition comes up negative, which is disturbing in light of all the other evidence that makes this an unsolved homophonic substitution cipher. Group repetitions and characters at distances all appear to be linear and written left to right. I’ll make a chart of percentages on these figures compared to the 408 and post it.

Behavior(1): *Within* each cipher, the rep rate in rows is notably less than the rep rate in columns (6-7% << 16-17%).

Behavior(1) is consistent with homophonic substitution, possibly combined with transpositions of some kind. (The transpositions would have no effect on the rep rates.) It would be a natural consequence of deliberately suppressing (on avg) the short-term repetition of characters at the time the ciphertext substitutions are being made. This would reduce the row rates but would have much less effect on column rates, which depend on column alignments that are not likely to be controlled. On the other hand, isn't this behavior very unlikely with just about any other substitution scheme (e.g. polyalphabetic)?

Unlikely with a textbook polyalphabetic, but try a low-level polyalphabetic using the keyword “Abracadabra” for instance. It generates roughly the same statistics as a homophonic substitution cipher, especially in conjunction with randomly selected character alphabets. Too tricky for Zodiac? Not if you consider using the repetition “slavesslavesslaves” or something similar. This was the angle of my investigation into polyalphabetics in the 340. Unfortunately the statistics nixed the idea just about as soon as I came up with it!

Behavior(2): *Between* ciphers, the rep rates are practically the same, by row (6% ~= 7%) and by column (16% ~= 17%).

Behavior(2) seems to strongly indicate that the type of substitution scheme used in the 408-cipher, i.e. homophonic, was used again in the 340-cipher. (And again this behavior would not be affected by combining the substitution with transpositions. Also, it seems to me that the addition of some transpositions would be sufficient to account for the fact that the 340-cipher has not yet been solved.)

Doesn't it seem likely that the rep rates would have changed significantly by any fundamental change in the substitution scheme (e.g. switching from homophonic to polyalphabetic)?

In a "Paradice Slaves" thread, you mention some results of chi^2 & IoC tests performed on the 340-cipher alone. It might be interesting to see how those compare *between* the two ciphers.

I’ve run the IoC tests on both ciphers and there is nothing to compare the two against. The 408 has random peaks as a result of the language used, while there are some lower peaks in the 340, none matching each other. The 340 has an extremely high peak at 78 (13x6, 26x3) which I am checking out through other means (still only 5.5%, but much closer to 6.8 than any peak in the 408). It might be of some significance if we figure the Paradice Slaves cross to be of 13 characters instead of 14, the “a” being used twice in each word. How it could be used is still a mystery however.
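
The post does not say which IoC variant produced these peaks. One common way to look for a peak at a given distance is the kappa (shifted coincidence) test, sketched below in Python as an assumption; the function names are illustrative.

```python
def kappa(text, shift):
    """Fraction of positions i where text[i] == text[i + shift]."""
    pairs = len(text) - shift
    if shift <= 0 or pairs <= 0:
        raise ValueError("shift must be between 1 and len(text) - 1")
    hits = sum(text[i] == text[i + shift] for i in range(pairs))
    return hits / pairs


def kappa_profile(text, max_shift):
    """Coincidence rate for every shift from 1 to max_shift."""
    return {d: kappa(text, d) for d in range(1, max_shift + 1)}


# Toy input: a text with period 3 coincides with itself fully at shift 3.
print(kappa_profile("ABCABCABC", 4))
```

Running `kappa_profile` over a transcription of each cipher and comparing the shifts that stand out would be one way to set the two ciphers side by side.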

I’m glad you’re looking at this cipher, and it seems you’re interested enough to have come up with your own transcription. How does your transcription compare to mine? Do I have any mistakes I need to look at?



Glen,

You said "... try a low-level polyalphabetic using the keyword “Abracadabra” for instance. It generates roughly the same statistics as a homophonic substitution cipher, especially in conjunction with randomly selected character alphabets."

I don't find that to be the case. Suppose we apply a simple Vigenere cipher to, say, the first 256 letters of your paragraph that reads as follows:

"Unlikely with a textbook polyalphabetic, but try a low-level polyalphabetic using the keyword “Abracadabra” for instance. It generates roughly the same statistics as a homophonic substitution cipher, especially in conjunction with randomly selected character alphabets. Too tricky for Zodiac? Not if you co ..."

Thus "unlikelywithatex ..." keyed by "abracadabra" (repeating) produces "uocimeoyxzthbkez..."

The resulting ciphertext, listed 16 letters per row, is as follows:

uocimeoyxzthbkez 3
teopbpompanpkacv 5
tidsuvtuybcowmvv 4
glsompalqyadewid 3
lsioxtjeneznorer 5
btafaerbrbwotiqs 3
urncfztieqesrtet 5
iowgklzkhetrmgsw 3
auzstjtscsdhpdop 5
ifnkcvucjtiultko 5
qcjghesvsrefibcl 3
yjecqnmuottipewk 2
tkrbedoncyueoedk 5
eddyataftfialqya 7
dewsufotszcmyios 3
qodjrcpowigpoudf 4

5335355125534353

where the rep counts are in the margins. The resulting averages are

by-row avg rep rate: 25%
by-col avg rep rate: 23%
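
The Vigenere step and the averages above can be reproduced with a short Python sketch. The function names are illustrative, and the margin counts are copied from the listing above rather than recomputed.

```python
def vigenere(plaintext, key):
    """Simple Vigenere: add the repeating key to the letters, mod 26."""
    a = ord("a")
    letters = [c for c in plaintext.lower() if "a" <= c <= "z"]
    return "".join(
        chr((ord(c) - a + ord(key[i % len(key)]) - a) % 26 + a)
        for i, c in enumerate(letters)
    )


def rep_count(text):
    """Length minus number of distinct characters."""
    return len(text) - len(set(text))


first_row = vigenere("Unlikely with a tex", "abracadabra")
print(first_row, rep_count(first_row))  # uocimeoyxzthbkez 3

# Margin counts as posted: one per row, then one per column.
row_margins = [3, 5, 4, 3, 5, 3, 5, 3, 5, 5, 3, 2, 5, 7, 3, 4]
col_margins = [int(d) for d in "5335355125534353"]
print(round(100 * sum(row_margins) / 256))  # 25
print(round(100 * sum(col_margins) / 256))  # 23
```

The posted margins sum to 65 reps by row and 60 by column over 256 letters, which is where the 25% and 23% figures come from.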

For this particular example, the avg rep rate has turned out to be even a bit *greater* in rows than in columns, but I think the difference between 25% and 23% here is not significant. (In this regard, there may also be no significance to the difference between 6% & 7% nor between 16% & 17% in the rates found for the 340-cipher, even if the variation is systematic between the two cases.)

In summary for this example, the rep rates by row and by column are practically the same, as expected for a non-homophonic cipher of this type. (Progressively keyed and autokeyed polyalphabetics of this kind would exhibit the same behavior, I believe.)

Have I misunderstood what you mean by "low level" versus "textbook" polyalphabetic ciphers?

(BTW, thank you for creating a text-only transcription of the 340-cipher -- it's the one I used in my posting, if I didn't err in copying it. I referred to "ciphertext notation similar to yours" only because minor extensions were needed to post the samples from the 408-cipher.)

Regards, Bert


Glen speculates on the nature of the cipher based on his analysis:

IMHO - The fact that the 340-cipher imitates the 408-cipher in just about every way makes it pretty clear that it holds a message. It certainly wasn't intended to be a joke or "gibberish". Creating a fake with these properties would be far more difficult than creating the real thing. The 340-cipher is a very difficult puzzle, obviously not easily broken. How this relates to Zodiac's intellectual sophistication is another question entirely.

Z chose a relatively simple system for the 408-cipher, but didn't stick to that system long. He may have felt that straying from his original system in the 408-cipher added more security, but his actions actually made the cipher easier to crack. I assume his original intentions were that the cipher be read after much labor, so his intended goal of getting his message out was met in the 408-cipher.

Z's history with the 408-cipher makes the question of transposition in the 340-cipher a very important question. If Z knew enough about cipher to include transposition, but then failed to stick to the system, the cipher could be virtually insolvable. For all intents and purposes it would remain "gibberish" forever. What sense is there in encoding intelligent information, sending that information to the press, but making it impossible to read? If transposition is used in an inconsistent manner in the 340-cipher, Z's lack of knowledge defeated his intentions, since there is clearly a message that may never be read.

I for one think the difficulty here lies in Z's lack of knowledge of cipher, not in his overwhelming mastery of the subject.

Thanks for the post, Bert!

Bert wrote:

You said "... try a low-level polyalphabetic using the keyword “Abracadabra” for instance. It generates roughly the same statistics as a homophonic substitution cipher, especially in conjunction with randomly selected character alphabets."

I don't find that to be the case. Suppose we apply a simple Vigenere cipher to, say, the first 256 letters of your paragraph that reads as follows:

"Unlikely with a textbook polyalphabetic, but try a low-level polyalphabetic using the keyword “Abracadabra” for instance. It generates roughly the same statistics as a homophonic substitution cipher, especially in conjunction with randomly selected character alphabets. Too tricky for Zodiac? Not if you co ..."

Thus "unlikelywithatex ..." keyed by "abracadabra" (repeating) produces "uocimeoyxzthbkez..."

.....................................

where the rep counts are in the margins. The resulting averages are

by-row avg rep rate: 25%
by-col avg rep rate: 23%

For this particular example, the avg rep rate has turned out to be even a bit *greater* in rows than in columns, but I think the difference between 25% and 23% here is not significant. (In this regard, there may also be no significance to the difference between 6% & 7% nor between 16% & 17% in the rates found for the 340-cipher, even if the variation is systematic between the two cases.)

In summary for this example, the rep rates by row and by column are practically the same, as expected for a non-homophonic cipher of this type. (Progressively keyed and autokeyed polyalphabetics of this kind would exhibit the same behavior, I believe.)

Have I misunderstood what you mean by "low level" versus "textbook" polyalphabetic ciphers?

You haven’t misunderstood me in any way! You did use the textbook Vig cipher as an example, however. Our cipher in question has over 60 characters selected for frequency suppression. Let me set up an example this weekend and post it. I’ll use the first 340 characters of the 408-cipher, since we ought to be examining this using something that represents Z’s language. Here is my transcription of the 408-cipher and its plaintext in delimited format below so we’re on the same page. Using formatting tags in this text triggers unwanted formatting codes, so you'll have to cut and paste to a text editor and use a fixed-width font. Sorry.

408 ciphertext:

1,8,P,/,Z,/,U,B,8,k,O,R,9,p,X,9,B
W,V,+,e,G,Y,F,o,1,H,P,5,K,n,q,Y,e
M,J,Y,^,U,x,k,2,q,T,t,N,Q,Y,D,0,a
S,z,/,1,8,B,P,O,R,A,U,8,f,R,l,q,E
k,^,L,M,Z,J,d,r,,p,F,H,V,W,e,2,Y
5,+,q,G,D,1,K,x,a,o,q,X,3,0,u,S,z
R,N,t,x,Y,E,l,O,1,q,G,B,T,o,S,8,B
L,d,/,P,8,B,5,X,q,E,H,M,U,^,R,R,k
c,Z,K,q,p,x,a,W,q,n,3,0,L,M,r,1,6
B,P,D,R,+,J,9,o,,N,z,e,E,U,H,k,F
Z,c,p,O,V,W,x,0,+,t,L,a,l,^,R,o,H
x,1,D,R,4,T,Y,r,,d,e,/,5,X,J,Q,A
P,0,M,3,R,U,t,8,L,a,N,V,E,K,H,9,G
r,x,n,J,k,0,1,3,L,M,l,N,A,a,Z,z,P
u,U,p,k,A,1,6,B,V,W,,+,V,T,t,O,P
^,9,S,r,l,f,U,e,o,2,D,u,G,8,8,n,M
N,k,a,S,c,E,/,1,8,8,Z,f,A,P,6,B,V
p,e,X,q,W,q,4,F,6,3,c,+,5,1,A,1,B
8,O,T,0,R,U,c,+,4,d,Y,q,4,^,S,q,W
V,Z,e,G,Y,K,E,4,T,Y,A,1,8,6,L,t,4
H,n,F,B,X,1,u,X,A,D,d,,2,L,n,9,q
4,e,d,6,6,o,e,0,P,O,R,X,Q,F,8,G,c
Z,5,J,T,t,q,4,3,J,x,+,r,B,P,Q,W,o
V,E,X,r,1,W,x,o,q,E,H,M,a,9,Y,x,k

408 plaintext:

I,L,I,K,E,K,I,L,L,I,N,G,P,E,O,P,L
E,B,E,C,A,U,S,E,I,T,I,S,S,O,M,U,C
H,F,U,N,I,T,I,S,M,O,R,E,F,U,N,T,H
A,N,K,I,L,L,I,N,G,W,I,L,D,G,A,M,E
I,N,T,H,E,F,O,R,R,E,S,T,B,E,C,A,U
S,E,M,A,N,I,S,T,H,E,M,O,S,T,D,A,N
G,E,R,O,U,E,A,N,I,M,A,L,O,F,A,L,L
T,O,K,I,L,L,S,O,M,E,T,H,I,N,G,G,I
V,E,S,M,E,T,H,E,M,O,S,T,T,H,R,I,L
L,I,N,G,E,X,P,E,R,E,N,C,E,I,T,I,S
E,V,E,N,B,E,T,T,E,R,T,H,A,N,G,E,T
T,I,N,G,Y,O,U,R,R,O,C,K,S,O,F,F,W
I,T,H,A,G,I,R,L,T,H,E,B,E,S,T,P,A
R,T,O,F,I,T,I,S,T,H,A,T,W,H,E,N,I
D,I,E,I,W,I,L,L,B,E,R,E,B,O,R,N,I
N,P,A,R,A,D,I,C,E,A,N,D,A,L,L,T,H
E,I,H,A,V,E,K,I,L,L,E,D,W,I,L,L,B
E,C,O,M,E,M,Y,S,L,A,V,E,S,I,W,I,L
L,N,O,T,G,I,V,E,Y,O,U,M,Y,N,A,M,E
B,E,C,A,U,S,E,Y,O,U,W,I,L,L,T,R,Y
T,O,S,L,O,I,D,O,W,N,O,R,S,T,O,P,M
Y,C,O,L,L,E,C,T,I,N,G,O,F,S,L,A,V
E,S,F,O,R,M,Y,A,F,T,E,R,L,I,F,E,E
B,E,O,R,I,E,T,E,M,E,T,H,H,P,I,T,I

So far every lead I’ve followed on keywords has come up short. I think we’re still stuck with a homophonic substitution cipher of some variety.




Mike & Glen,

The row & column repetition rates seem to be very clear "statistical signatures" in the 408- & 340-ciphers, being practically the same pair of rates for each. However, this alone is not sufficient to rule out plaintext that's "pure gibberish". I believe that regardless of whether or not the plaintext is "random", a homophonic scheme like that used for the 408-cipher, even combined with certain simple transpositions, would still tend to produce a very similar rep rate signature. (The rep rates say more about the enciphering scheme than about the plaintext.)

I ran a few plaintext simulations with random (26) letters, and found the repetition rate in such *plaintext* to be just under 30%, compared to a rep rate around 40% in the 408-plaintext. This means, it seems to me, that enough repetitions should occur in gibberish plaintext to give ample opportunity for a homophonic scheme to show its expected signature in the *ciphertext* rep rates.
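
As a cross-check on the simulated figure, the expected rep rate of uniformly random text has a simple closed form: a line of n letters from a k-letter alphabet contains k(1 - (1 - 1/k)^n) distinct letters on average. The Python sketch below evaluates it; the post does not say which line length was simulated, so the lengths tried here are assumptions.

```python
def expected_rep_rate(n, alphabet=26):
    """Expected rep rate of n symbols drawn uniformly from `alphabet` symbols."""
    expected_distinct = alphabet * (1 - (1 - 1 / alphabet) ** n)
    return (n - expected_distinct) / n


# Line lengths that occur in the two ciphers: 17-wide rows, 20- and 24-tall columns.
for n in (17, 20, 24):
    print(n, round(100 * expected_rep_rate(n)), "%")
```

This gives roughly 26%, 29% and 34% for lines of 17, 20 and 24 letters, so "just under 30%" is in the expected range for random 26-letter text.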

Glen said "The fact that the 340-cipher imitates the 408-cipher in just about every way makes it pretty clear that it holds a message... Creating a fake with these properties would be far more difficult than creating the real thing."

I don't really agree with this. I would conclude instead that the 340-cipher *scheme* is very similar to that of the 408, but that the 340-*plaintext* could be gibberish. But beyond the simple rep rates, I defer to Glen's experience with the various other statistical measures applied to these ciphers; however, as explained above, the rep rates neither support nor oppose that conclusion, IMO.


As Bert points out, a single measure alone does not mean the plaintext is not "gibberish". I've looked at many other factors, but since we are performing scientific examination of the cipher, all the possible statistics should be quantified and compared to the 408-cipher: doubles, triples, characters at distances, etc., and entropy calculations against the sample plaintext. I've got them all scattered around my computer from different sittings, but they need to be in one place to allow for testing.

It's quite possible that Bert or someone else may see something overlooked that might help solve the cipher when it's all viewed together.

I propose we also use two control texts of 340 characters each, one pseudo-random and one full random, alongside the 340-cipher and the first 20 lines of the 408-cipher with the first 20 lines of its plaintext. These would give us a very good statistical base for formulating and testing theories, as well as examining the core of the 340-cipher.

Full random of 26 letters is normal, but the pseudo-random should take into account appropriate Zodiac character frequency - any differences noted between the pseudo-random and the 340-cipher might then possibly point to structure. Perhaps we could scramble the first 20 lines of the 408-cipher and plaintext?
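
The two kinds of control text proposed here could be generated along these lines. This is a sketch only: the function names, the seed, and the short sample string are illustrative, and in practice the text to scramble would be the first 20 lines of the 408-cipher or its plaintext.

```python
import random
import string


def full_random(n, rng):
    """n letters drawn uniformly from the 26-letter alphabet."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(n))


def scrambled(text, rng):
    """Shuffle a text: same character frequencies, no remaining order."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(340)  # fixed seed so the controls can be regenerated
sample = "ILIKEKILLINGPEOPLEBECAUSEITISSOMUCHFUN"
control_random = full_random(340, rng)
control_scrambled = scrambled(sample, rng)
print(control_random[:17])
print(control_scrambled)
```

Scrambling keeps the exact character counts of the source text, so any statistic that differs between the scrambled control and the real cipher reflects ordering rather than frequency.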

Statistics generated from these data sets should quickly demonstrate any errors in assumption, don't you think?


Glen