TOM S JUZEK'S BLOG








Hacking Der Spiegel's paywall


The other day, my brother-in-law suggested a certain Der Spiegel article to me [html link]. Unfortunately, the article is behind a paywall – and I was too lazy/cheap to create an account and pay for it.

But then I noticed something intriguing. As you scroll further down, the article still shows what appears to be some, if not all, of the text. However, the text is obfuscated. It looks something like this:


obfuscated_text


Maybe I could get to the article after all? I opened the Inspector in the Web Developer Tools and tried to delete the masking veil to get to the real text. This didn't work: I couldn't delete the veil without also deleting the text (edit: I was deleting the wrong node, as I_Cant_Ink_Straight has pointed out; details: [html link]). So I extracted the bare HTML text instead, which gives you something like this:

<p class="obfuscated">Eboo nvtt fs mpt obdi Cfsmjo/ #Bvg {vn mfu{ufo Hfgfdiu#- svgu Sfefotdisfjcfs Kpobt Ijstdioju{- bmt tjdi ejf Svoef fsifcu/ Ijstdioju{ ibu fjof spuf TQE.Gbiof ebcfj- ejf fs kfu{u bvg efs Ufssbttf opdi fjonbm tdixfolu/</p><p class="obfuscated">#Oå#- tbhu Tdivm{/ #Ojdiu {vn mfu{ufo Hfgfdiu/#</p><p class="obfuscated">

It looks like gibberish at first glance, but my immediate suspicion was that the real text is still in there, that it's just hidden behind a simple cipher [html link]. As a computational linguist, I now had to dig deeper into this.
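The extraction step described above could be sketched roughly like this in Python. This is a minimal sketch and not the script from the repository: the class `ObfuscatedTextParser` and the function `extract_obfuscated` are names made up for illustration, and the only thing taken from the page is the `class="obfuscated"` attribute visible in the excerpt.

```python
from html.parser import HTMLParser


class ObfuscatedTextParser(HTMLParser):
    """Collect the text inside <p class="obfuscated"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inside = False    # True while we are inside a matching <p>
        self.paragraphs = []   # one string per obfuscated paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            classes = (dict(attrs).get("class") or "").split()
            self.inside = "obfuscated" in classes
            if self.inside:
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.paragraphs[-1] += data


def extract_obfuscated(html):
    """Return the non-empty obfuscated paragraphs found in an HTML string."""
    parser = ObfuscatedTextParser()
    parser.feed(html)
    parser.close()
    return [p for p in parser.paragraphs if p.strip()]
```

Joining the returned paragraphs with newlines gives one plain string that the frequency counts can work on.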

It looked like a simple cipher because the whitespace was apparently not encrypted, which gives a nice stream of pseudo-words. I guess Der Spiegel does this for aesthetic/marketing reasons: the article looks as if it's just one click away. So I started with two simple analyses: a character frequency count and a bigram (two-gram) analysis [html link]. Both analyses produced the distributions you would expect from natural language [html link]. Here's a plot of the character frequencies:

char_freq_cipher

Looks as if there is language in there, exciting! My next step was to get a point of comparison. I collected the text of related Der Spiegel articles, also on politics and the SPD, ran the same analyses, and the results came out strikingly similar:

char_freq_baseline
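For reference, the two counts behind these plots could be sketched roughly as follows. This is an illustration under my own assumptions and not the original script: the function names are hypothetical, case is left untouched, and bigrams are only counted within whitespace-separated tokens.

```python
from collections import Counter


def char_frequencies(text):
    """Relative frequency of each non-whitespace character, most common first."""
    chars = [c for c in text if not c.isspace()]
    total = len(chars)
    return [(c, n / total) for c, n in Counter(chars).most_common()]


def bigram_counts(text):
    """Count adjacent character pairs inside tokens (never across whitespace)."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts
```

Running both functions once on the cipher text and once on the baseline text gives the two distributions to compare.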

My premise now became that this is a simple substitution cipher [html link]. To decipher it, one more thing would be useful: word counts. The script is almost identical to the character count, so it was easy to add.
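A word count along these lines might look like the following sketch. The function name `word_counts` and the `top` parameter are hypothetical, and tokens are simply split on whitespace, so any punctuation stays attached to its word.

```python
from collections import Counter


def word_counts(text, top=10):
    """Absolute counts of whitespace-separated tokens, most common first."""
    return Counter(text.split()).most_common(top)
```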

At this point, creating the substitution table became child's play (a Kinderspiel). The character frequencies of the cipher and the baseline line up rather nicely:

Cipher
char_count_cipher_table
...

Baseline
char_count_baseline_table
...

And the word frequencies also give clear clues:

Cipher
word    abs_count
word_count_cipher
...

Baseline
word    abs_count
word_count_baseline
...
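One way to turn two such frequency lists into a first guess at the substitution table is to pair characters by rank. The sketch below is only a heuristic and an assumption on my part (the name `rank_aligned_guess` is made up): ties and small samples will produce wrong pairs, which is exactly where the word counts help.

```python
from collections import Counter


def rank_aligned_guess(cipher_text, baseline_text, top=5):
    """Pair the top-N cipher characters with the top-N baseline characters by rank."""

    def ranked(text):
        counts = Counter(c for c in text if not c.isspace())
        return [c for c, _ in counts.most_common(top)]

    # Maps cipher character -> guessed plain character.
    return dict(zip(ranked(cipher_text), ranked(baseline_text)))
```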

Cipher f will be plain e, o will be n, and so on; ejf will be die, efs will be der, etc. After a couple of minutes, the substitution table was done. As it turns out, they use a Caesar cipher with a shift of 1 [html link]! Check out the ordered substitution table:

substitution_table
...
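A quick way to check that a substitution table really is a plain shift is to look at the code-point offset of each cipher/plain pair. The sketch below uses only the pairs mentioned in the text (f/e, o/n, ejf/die, efs/der); the helper name `offsets` is hypothetical.

```python
def offsets(pairs):
    """Set of code-point offsets (cipher minus plain) over (cipher, plain) pairs."""
    return {ord(c) - ord(p) for c, p in pairs}


# Single letters and whole words read off the frequency tables.
pairs = [("f", "e"), ("o", "n")]
for cipher_word, plain_word in [("ejf", "die"), ("efs", "der")]:
    pairs.extend(zip(cipher_word, plain_word))

print(offsets(pairs))  # a single value means one constant shift
```

If the set contains exactly one value, every pair is consistent with a single shift; for these pairs that value is 1.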

I ran the encrypted article through a script and, behold, it came out fine! The paragraph from above, with the mark-up removed, now looks like this:

Schulz nippt an seinem Kräutertee. "Ich hab jetzt alles gegeben, was ich geben konnte", sagt er. "Physisch und psychisch." Das gebe ihm, wenn er jetzt hier sitze, ein Gefühl von innerer Freiheit.

Awesome!
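The decoding step itself fits in a few lines. This is a minimal sketch and not the script from the repository: it assumes that every non-whitespace character, punctuation included, is shifted by one code point, which is what the HTML excerpt above suggests (for example, `/` where a full stop belongs). The function name `decode` is my own.

```python
def decode(text, shift=1):
    """Shift every non-whitespace character back by `shift` code points."""
    return "".join(c if c.isspace() else chr(ord(c) - shift) for c in text)


cipher = "Eboo nvtt fs mpt obdi Cfsmjo/"
print(decode(cipher))
```

Applied to the first sentence of the HTML excerpt, this prints "Dann muss er los nach Berlin."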


All files can be found on GitHub [html link]. I wrote this post with zero expectation of getting anything back; I enjoyed the project and wanted a write-up. However, if you insist on giving something back, feel free to send some amount of Nanos. I love Nanos [html link]. Here are my address and QR code:
xrb_39s7e8mu4wbuo77k3idfgmquep3cmkb4r56mf8rnfq6r99q9gxzsnzzj9k9x

my_xrb_wallet

Addendum: cX207 has pointed out here [html link] that David Kriesel has deciphered Der Spiegel's paywall before [html link]. I didn't know about this and will keep the post up, because our approaches are different: Kriesel paid for an article and then compared cipher vs plain text to reverse engineer the cipher. I used data analysis to make sense of the cipher.



tsj; originally posted on 24 Feb 2018


last modified: 17 Mar 2018