Context-free Grammars

This is just an intuitive idea about what such grammars are. This post is thin on technical details — you have been warned. The ideas and examples are taken from here, there, and a lecture by Andrew McCallum.

The best way to understand context-free grammars is to first understand why they are called context-free. To understand why they are called context-free, consider this example:

The man from Amherst grew beautiful russet potatoes

The first underlined group of words is a Noun Phrase (because it starts with a Noun) and the second is an Adjective Phrase. These groups of words can be replaced by any other groups of words of the same type regardless of context. For example,

Joe from Florida grew large oranges

Programming languages are similar as well. A context-free grammar G can be defined as:

G = <T, N, S, R> where,

  • T: is a set of terminals (can’t be broken down further, like words of the english language used to construct sentences)
  • N: is a set of non-terminals (this is made up of terminals and other non-terminals similar to the noun phrase made up of words and so, can be broken down further)
  • R: is a set of rules that says what can be replaced with what. So, it could say that a noun phrase can be replaced by another noun phrase (non-terminal) or a noun (terminal)
  • S: ignore this, its not important to this explanation

Now, such a grammar can be ambiguous. Consider this statement in the C language:

x * y

This can be interpreted as y being a pointer to something of type x, or x multiplied with y. Its ambiguous. The only way a compiler tells the difference is by checking if x has previously been declared as a type or not.

Thinking About Security

Three main security services protect the data, the system, and its users from the most common threats:

  1. Confidentiality: conceal information and resources (e.g. using cryptography)
  2. Integrity: of data (e.g. using message digests), and of origin (e.g. authentication using passwords, or public-key schemes)
  3. Availability: ability to use the desired resource (e.g. resistance to denial-of-service attacks)

So what are the most common threats?

Shirey divides threats into four categories:

  1. Disclosure: unauthorized access. Specific threats include snooping, and wiretapping. Confidentiality services counter this class of threats.
  2. Deception: acceptance of false data. Specific threats include data modification, masquerading, spoofing, repudiation of origin, and denial of receipt. Integrity and availability services counter this class of threats.
  3. Disruption: prevention of correct operation. Specific threats include data modification, denial-of-service, and delay. Availability services counter this class of threats.
  4. Usurpation: gaining unauthorized control. Specific threats include data modification, masquerading, and spoofing. Integrity services counter this class of threats.

The above discussion describes which security services to use to counter which class of threats. It does not describe how to build a secure system. That can be done by first, writing a security policy and then, building a security mechanism that actually enforces that policy. Definitions:

  1. Security policy: a statement of what is, and what is not allowed. For example, a policy can be expressed as a state diagram consisting of secure and insecure states. Any transition from a secure to insecure state should not be allowed.
  2. Security mechanism: method, tool, procedure, or any implementation that enforces the said security policy. Mechanisms are secure, precise, or broad. Let $$Q$$ be the set of secure states, and $$R$$ be the set of states that the security mechanism restricts the system to. Then, if $$R \subset Q$$, the system is secure; if $$R = Q$$, the system is precise; and if there is a state $$r \in R$$ but $$r \notin Q$$, the system is broad.

A secure system has three basic goals:

  1. Prevent an attack (an attack is a realization of a threat). For example, use encryption to prevent access to data from wiretapping.
  2. If not, detect an attack in progress. For example, run a virus scanner to detect existing infections.
  3. Finally, recover from an attack. For example, a virus scanner may restore any software damage done by the virus.

To be able to trust a security mechanism, the designer of the system must provide some form of assurance that the mechanism correctly enforces the policy, and is working as expected. One way is to write a detailed set of system specifications, then design a system that satisfies the specifications, and then implement a system that satisfies the design. This three step process is easier said than done, because it is not trivial to verify if an implementation satisfies a design, which in turn satisfies the specification. Just verifying that the implementation itself works as expected is already very hard (if this was easy, no software would ever have bugs!)

Now even if you understand security, and you’ve written a policy, and enforced that policy using various mechanisms, there are still many issues to deal with:

  1. Cost-benefit: make sure that the cost of protecting a resource is less than the value of the resource itself.
  2. Prioritize: more important assets may require more resources to protect
  3. Laws and customs: security policy will have to be tailored to what is acceptable and what is not in different societies and cultures.
  4. People problems: security provides no direct rewards to the user. Usually its more of a hindrance, so no one wants a part in it (the hit “OK” mentality). People may have simple passwords, or reveal it simply for a cup of coffee. Thus, security must keep user interface issues in mind.

Ideas from this post were taken from this book by Matt Bishop.

Attack the MAC

The goal is to find the $$k$$-bit key $$K$$ used to create a given $$n$$-bit MAC for a given message $$M$$. There are two disjoint cases:

  • $$k \leq n$$
  • k > n

When $$k \leq n$$ then just try all $$2^k$$ combinations of the key on message $$M$$ using the respective MAC algorithm. When the computed MAC matches the given MAC, you are done.

Things get more interesting when k > n. Notice that you can create $$2^k$$ MACs with a $$k$$-bit key. However, the MAC is only $$n$$ (< $$k$$) bits long. Thus, more than one key will create the same MAC, this is called a collision. Two questions arise at this point: a) How many keys will create the same MAC? and b) If more than one key creates a given MAC, how do we find the right one?

The answer for (a) is that on average there will be $$2^k/2^n = 2^{k-n}$$ collisions or keys that will create the same MAC. The answer for (b) is that multiple rounds of attacks with different messages will be required to crack the MAC. Here is how it works:

  1. Given $$M_1, MAC_1 = C_K(M_1)$$ where $$C$$ is the given MAC algorithm, we compute the MAC for all $$2^k$$ values of $$K$$. We will get $$2^{k-n}$$ keys which will result in the same MAC ($$MAC_1$$).
  2. To narrow it down, we need another $$M_2, MAC_2 = C_K(M_2)$$ pair. By repeating the process (using only the remaining $$2^{k-n}$$ keys) we will narrow the key space down to $$2^{k-n}/2^n = 2^{k-2n}$$ keys.
  3. The next round will narrow the key space down to $$2^{k-2n}/2^n = 2^{k-3n}$$ keys and so on till $$2^{k-\alpha n} = 1$$ key. Notice that the MAC chosen in each round is created using the same key $$K$$.
  4. Thus, it will take $$\alpha$$ rounds and $$\alpha$$ different messages to crack the MAC when the size of the key $$K$$ is $$k = \alpha n$$ bits.

Most of the ideas for this post came from this book by William Stallings.

Email Security for Everyone

We all need it but we hardly ever use it. So I want to describe the concepts behind a popular method that will do this for you no matter which email client you are using (even Gmail). Its called PGP — Pretty Good Privacy.

Now, I won’t delve into the little details about click here and do this or do that. The goal is to give you an overview of how email security works. You can then download any PGP software you like or use the one integrated with your email client and find the buttons or commands associated with the security functions described here.

The idea behind securing your email is very simple. First, you and each of your friends need to have a pair of keys (public key, private key). Then, when you want to send an email to your friend Bob, you encrypt the email with Bob’s public key. When you do this, no one but Bob can decrypt the message because only he has the decryption key — Bob’s private key.

Although the idea is so simple, there are obstacles that make it hard to use for everyone:

  1. Not only do you need to use PGP to secure your email, everyone you send emails to, must do so as well.
  2. How do you find keys for people you need to send emails to?
  3. How do you trust the keys?

There is no good solution for (1). You have to personally convince your friends to get going with securing their emails. I know its hard but keep trying.

Fortunately, there are decent solutions for (2) and (3). Solution for (2): When you download PGP software, the first thing you need to do is find a way to create a key (pair) for yourself. Note that your friends will be doing the same. The software will then have a way for you to upload your keys. Don’t worry, only your public key is uploaded. Your private key is probably encrypted and stored safely on your computer. Only you can decrypt and use your private key because the encryption is protected with a password that the software should have asked you for when you created the key.

The software will also provide you with a way to search public keys using names or email addresses. This is how you find a friend’s public key. The software should then allow you to store and organize any keys you find.

The email client comes in next. It should provide you with a way to choose your friend’s public key when sending secure email to him or her. It basically knows about the PGP software and where it stores your friend’s keys.

Solution for (3): Each key that you store has a trust ranking. Suppose you create your own key. That key has “ultimate” trust. Then your friend might personally give you her key and you can mark that trusted as well. Your friend might also recommend you a website to download her key from. In this situation, make sure to call your friend after downloading the key and read it out to her to see if it is the correct one.

The way you mark a key trusted is by signing it with your own key and then maybe clicking a check box or two saying how much you trust it. Then you go ahead and upload your friend’s key as well. Now, if her key is already listed, its trust ranking should go up. If not, a trusted key for her can now be found by your other friends.

You should not mark a key trusted unless you are absolutely sure about its source and the integrity of the key itself.

To summarize, download PGP software or just find out how your email client supports PGP (believe me, most already do, you just have to find the feature). Then create your key, find your friend’s keys, call them and verify the key is correct, mark the keys trusted, upload all the keys, and encrypt email you send with the respective friend’s key.

That’s it! I hope all of us start using PGP, because without it, our emails just travel in the clear every day across the world through possibly unprotected servers. So much happens over emails now that it is absolutely imperative that we protect the information contained in them.

Hash, MAC, and HMAC

What’s what and when to use it?

Hash. Think of it as a checksum, but one that has been mathematically proven to have certain security properties (a cryptogrphic checksum). A hash or digest can be created using algorithms like md5. Typically, the digest is sent along with the message. Once received, the receiver recreates the digest using the same algorithm and compares the computed digest with the received one — A match indicates that the message has not been changed since the digest was first created by the sender.

A digest when included with the message and then encrypted (using symmetric cryptography) also provides authentication because the shared encryption key is held only by the two communicating parites. This however, does not provide non repudiation because the shared key indicates that the message could be sent by any one of the communicating parties, e.g. sender sends an encrypted message with an enclosed hash value and then denies sending it. Furthermore, he claims that the receiver actually sent the message. Unfortunately, in this situation, the receiver cannot prove that he was not the sender. Remember, looking at message sender or receiver is not enough because such attributes of a message can easily be spoofed.

MAC. A Message Authentication Code is a code (like a hash value) that provides authentication. It is usually constructed using symmetric encryption ciphers e.g. CBC-MAC. Now you may ask “why not include the hash in the encrypted message to get authentication?” (just like we talked about above). The answer is that, yes, it can be done but its not practical. Because now, to authenticate one must decrypt. By separating the encryption/decryption from integrity verification and authentication, we reduce computation required for each operation and introduce flexibility when implementing security protocols. MACs are generally computed using the plain text message and then concatenated to the encrypted message before transmission.

HMAC. A Hash Message Authentication Code is a MAC constructed using hash functions rather than block ciphers. One of the main reasons for making hash based MACs was faster performance.

Block cipher based MACs (like CBC-MAC) existed even when hash functions (like md5) were being used. Later, Kaliski et al. showed how to construct a MAC using md5. However, the methods for constructing MACs using hash functions like md5 or SHA-1 were mostly ad-hoc and without sound security analysis. Then came HMAC by Bellare et al. who described a general construction for MACs based on hash functions and showed that it would be as secure as the hash function used in the construction.

Digest Internal or External

There is a subtle difference in the security properties of an encrypted message when the message digest is internal (added before the encryption process) or external (added afterwards). The difference is: Authentication is a by-product of the operation in which the digest is internal. However, that is not the case when the digest is external.

The thing to remember is that if a message is decrypted successfully by the receiver, then it must have come from the expected sender — because they share the same key (assuming symmetric ciphers only). So if we assure somehow that when a message is decrypted, it results in the one that was sent, we will have authentication. We make this guarantee using message digests.

When the message digest is external, an attacker could drop the original message and digest, and simply replace them with new ones. The digest will still match but the message will be garbled (since the attacker’s key was used to encrypt it). But how can a receiver automatically distinguish between a garbled message and a correct one? A human could easily do that for a message in English, but it will be much harder if the data sent is a byte stream. Additionally, introducing a human in the picture is simply impractical, imagine having someone reading and verifying every encrypted HTTP packet while receiving data from an HTTPS connection. Thus, when the digest is outside, we cannot guarantee that the message decrypted is the same one that was sent.

Now, when the message digest is internal, then any modification by an attacker will be detected automatically during integrity verification. The attacker could also replace the entire cipher text, but since the decryption will be garbled (because attacker’s key was used to encrypt,) the digest verification will fail.

Update: Notice now that all of the above does not apply to a Message Authentication Code. The MAC can be placed internally or externally because the code (digest) is tied to a shared secret key. Since the attacker does not have this key, he can’t create an encrypted message for which digest verification succeeds at the receiver.

keywords: apply, digest, when

Cryptanalysis Goals

The eventual goal of an attacker trying to compromise a cryptosystem is to recover plaintext. Plaintext can be recovered from ciphertext by:

  1. Finding the key used for decryption
  2. Finding a direct mapping between plaintext and cipthertext

The most common way is to try to find the key. So most attacks are carried out with that in mind e.g. brute force attack, factoring attack on RSA etc.

Obvious as this post may seem, one tends to forget these goals while absorbed in the details of cryptographic algorithms. Usually, I find myself deeply involved in figuring out how the algorithm works rather than ask the question why this way?

The goal of designing any cryptographic algorithm is to make (1) and (2) very very hard.

keywords: strength, weakness, cryptography

ECC For Dummies

I have always wanted to get a high level idea of Elliptic Curve Cryptography without going into the mathematical details. The goal is to understand what makes it strong, and why its keys are smaller than RSA for equally strong encryption.

The idea is still the same: find a function that’s easy to solve in one direction (using some publicly known information) but hard to solve without a trap door (the private information) in the other. This type of function can then be used for encryption. Interestingly, we can build such an encryption function using the addition (and consequently multiplication) operation on points from a carefully chosen elliptic curve.

To get an intuitive idea, consider this example: given 6 = k x 3, we can easily calculate k = 6/3 = 2. However, given a point (i, j) — on a carefully chosen elliptic curve — finding (l, m) = k x (i, j) for some k is easy, but finding k = (l, m)/(i, j) given only the two points is very hard. Now if k is your private key, (l, m) is your public key, and (i, j) is publicly known, then it will be very hard for an attacker to determine your private key k by using just (l, m) and (i, j). This is ECC’s source of strength.

So how does encryption work? First, a message M is converted into points on the chosen elliptic curve (I don’t know by what miracle this happens, but the linked document has an algorithm for it). Then, generate a number n and calculate C = {A, B} = {n x (i, j), P1 + n x (l, m)}; where P1 is the point on the elliptic curve representing a portion of the message, and C is the cipher text intended for a person whose public key is (l, m). Remember, (i, j) is a publicly known parameter of the chosen elliptic curve.

The receiver gets C and computes B – k x A = P1 + n x (l, m) – k x n x (i, j) = P1 + n x k x (i, j)k x n x (i, j) = P1. Notice the substitution (l, m) = k x (i, j). Now we translate the point P1 back to the portion of the message M. voila!

So why are ECC private keys smaller than RSA’s for equally strong encryption? Because it is harder to crack ECC with small keys than RSA with large keys. The fastest known method to determine k from (l, m) and (i, j) is much slower than the fastest known method for determining p and q from n (p and q are prime factors of n).

But if keys are smaller than RSA, then doesn’t the brute force attack become easier on ECC? Well, the key point here is that RSA keys are extraordinarily large to prevent the factoring attack and not the brute force attack. Mounting a successful brute force attack even on a 128 bit key would take longer than the age of the universe; why else would Uncle SAm be OK with it:

The design and strength of all key lengths of the AES algorithm (i.e., 128, 192 and 256) are sufficient to protect classified information up to the SECRET level. TOP SECRET information will require use of either the 192 or 256 key lengths. The implementation of AES in products intended to protect national security systems and/or information must be reviewed and certified by NSA prior to their acquisition and use.

A few quick points:

  1. During encryption, if the generated number n is not used, someone can easily determine P1 from B because (l, m) is publicly known.
  2. The computational requirements for ECC and RSA are similar for the same key lengths. Since ECC has shorter keys for the same level of security, there is a computational advantage in using ECC.
  3. A lot of the ideas for this article are from this book by Stallings.

keywords: simplified, dumbed down, simple, explanation, summary

Merkle’s Puzzles

I will go through this only briefly, details can be found here. After Merkle decided to address the problem of key establishment between communicating parities without prior arrangements (through an insecure channel), he came up with an interesting solution. He called it “puzzles”. The key idea is that the puzzle can be solved more easily by the communicating parties than the eavesdropper. Solving the puzzles results in a secret key that can be used to encrypt any future communications between the parties using symmetric key cryptography.

The protocol begins with X sending a list of N weakly encrypted (Key, ID) pairs — each called a puzzle — to Y. Note that only X knows the ID of each key at first. Y then randomly chooses one of N puzzles to solve. A puzzle is “solved” when the encryption key is found (this can be done using brute force. Remember, the encryption is deliberately weak). Y then decrypts the puzzle and sends the ID back to X. At this time an eavesdropper knows N puzzles sent to Y and the ID that Y sent back. However, E does not know which ID corresponds to which key; to figure this out E would have to find encryption keys for an average of N/2 puzzles, where as Y, only needs to find one. So there is a high probability that Y and X can agree on a key before Z can find which one it is.

Timeline: Public Key Cryptography

I always get confused about when it all started, who invented what first. So here is what I found out:

  1. It all started when Merkle wrote two project proposals for his crypto class. One of them was largely the seed idea behind Public Key Cryptography
  2. Merkle started writing drafts about this concept which he called “puzzles”. This was mainly a way for two parties to agree on a key without prior arrangements.
  3. Then came the groundbreaking paper by Diffie-Hellman which actually defined the characteristics of a public key cryptosystem. The kind of mathematical functions (trap-door one way) that will need to be used, the existence of two separate keys for encryption and decryption, and so on. They did not actually find that function, they just described its properties. However, they did find a mathematical way to perform key agreement without prior shared secrets (the Diffie-Hellman key agreement protocol)
  4. Now this is where my knowledge gets a little muddy because in the same year the Merkle–Hellman as well as the RSA public key cryptosystem were published. However, I feel that RSA came first since it did not cite the Merkle–Hellman work but the opposite was true. Anyway, these systems described mathematical functions that could be used for public key cryptography in practice.
  5. Now here we are: RSA took hold and became immensley popular and until today I didn’t even know there was a competing algorithm at the time.