09/27/2013

Tokenization IS Encryption – NOT! – Part 4

This is the first addendum post to what was originally a three-part (now four-part) series written by Steve Sommers, Shift4’s SVP of Applications Development. The first three installments can be found here, here, and here.

Recently I found out that PCI SSC is taking on the tokenization subject again. This time, the goal is to take the guidance document they released back in 2011, which I commonly refer to as “The Tokenization Bastardization of 2011,” and create a tokenization standard. Since the council is re-addressing the topic, I thought I would take a stab at educating the council on where their SIG went astray in the original guidance. Hopefully the guidance mistakes will be corrected.

On a related side-note, I am still miffed that PCI SSC never approached us, the inventors of “tokenization,” back in the tokenization SIG days until just before the tokenization guidance release, after it was too late. It’s almost as if they went out of their way to keep us out of the loop, and then we received an “oh, by the way” notification a couple of days before the final deadline for feedback. But that is just the conspiracy streak in me. I’ll put my foil hat back on now and return to the topic.

This past Monday I received four new documents and a copy of the 2011 tokenization guidance document. The first document is a definition of the Tokenization Task Force. The remaining three are part of a four-part series that makes up the Payment Card Industry (PCI) Tokenization Standard.

• Terms of Reference – Tokenization Task Force, version 1.0 dated May 13, 2013
• PCI Tokenization Standard – General Principles, version 0.6 dated July 2013
• PCI Tokenization Standard – Irreversible Tokens, version 0.3.1 dated August 2013
• PCI Tokenization Standard – Reversible Tokens, version 0.3 dated September 2013

Obviously all are very new and still in draft form, so here is my chance!

Before I address each document individually, let’s get right to the heart of my beef with PCI’s version of tokenization. I’ve said this before but apparently I’m not conveying my thought clearly enough – or no one cares. I’ll assume the former and try to clarify my argument once again, since I cannot correct the latter.

Tokenization is NOT encryption. Period. Nor is it a new way to describe hashing. Exclamation point! With this in mind, I’m still baffled as to why the 2011 tokenization guidance document proceeded to relabel encryption and hashing as valid forms of tokens. By incorporating this marketing twist into the tokenization definition, PCI greatly complicated a solution that was created to simplify compliance for merchants. Under the guise of “vendor-neutrality,” vendors in the SIG were able to stamp their non-tokenization solutions as tokenization solutions.

Tokenization’s simplicity and security strength (pre-PCI guidance) were centered on the fact that tokens are not mathematically related to the PANs they are protecting. With PCI’s “guidance,” the simplicity went away, and in its place we will now have at least four standards documents covering tokenization principles, irreversible tokens, reversible tokens, and a tokenization implementation guide. And with all these documents, PCI tokenization will not be nearly as secure (and definitely not as simple) as the original “vendor-neutral” definition that Shift4 Corporation released into the public domain back in 2005.

After reading the documents, I realized PCI’s view of tokenization and my view come from different worlds. I apologize in advance for any syntax errors as I try to put my thoughts through the PCI alternate universe prism.

Let’s reference some definitions that hopefully are the same in both worlds:

Encrypt/Encryption (http://dictionary.reference.com)
1. to put (a message) into code
2. to put (computer data) into a coded form
3. to distort (a television or other signal) so that it cannot be understood without the appropriate decryption equipment

While I’m not a fan of referencing crowd-sourced definitions, I feel the above definition comes up a little short, so here is the Wikipedia definition (http://en.wikipedia.org/wiki/Encryption):

In cryptography, encryption is the process of encoding messages (or information) in such a way that eavesdroppers or hackers cannot read it, but that authorized parties can. In an encryption scheme, the message or information (referred to as plaintext) is encrypted using an encryption algorithm, turning it into an unreadable ciphertext. This is usually done with the use of an encryption key, which specifies how the message is to be encoded. Any adversary that can see the ciphertext should not be able to determine anything about the original message. An authorized party, however, is able to decode the ciphertext using a decryption algorithm that usually requires a secret decryption key, which adversaries do not have access to. For technical reasons, an encryption scheme usually needs a key-generation algorithm to randomly produce keys.

Hash/Hashing (http://dictionary.reference.com)
1. Radio. interference of signals between two stations on the same or adjacent frequencies.
2. Computers. a technique for locating data in a file by applying a transformation, usually arithmetic, to a key.
Again, a little lacking, so let’s go to Wikipedia (http://en.wikipedia.org/wiki/Hash_function):

Hash functions are primarily used to generate fixed-length output data that acts as a shortened reference to the original data. This is useful when the output data is too cumbersome to use in its entirety. One practical use is a data structure called a hash table, where the data is stored associatively. Searching for a person’s name in a list is slow, but the hashed value can be used to store a reference to the original data and retrieve it in constant time (barring collisions). Another use is in cryptography, the science of encoding and safeguarding data. It is easy to generate hash values from input data and easy to verify that the data matches the hash, but hard to ‘fake’ a hash value to hide malicious data.

While not obvious from any of the definitions above, encryption is usually thought of as reversible, provided you have the keys to decrypt the data. Hashing is irreversible, or one-way: you cannot normally recreate the original data from the hash, even with the keys or salts used to create it.
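
To make that distinction concrete, here is a minimal sketch in Python. It assumes the third-party cryptography package for the encryption half; the PAN and salt values are made up for illustration.

```python
import hashlib

from cryptography.fernet import Fernet  # pip install cryptography

pan = b"4000000000000002"  # hypothetical test PAN

# Encryption: reversible for anyone holding the key.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(pan)
assert Fernet(key).decrypt(ciphertext) == pan  # the key brings the PAN back

# Hashing: one-way. You can re-verify a candidate input against the digest,
# but no key exists that turns the digest back into the PAN.
salt = b"strong-secret-salt"  # hypothetical salt
digest = hashlib.sha256(salt + pan).hexdigest()
assert hashlib.sha256(salt + pan).hexdigest() == digest
```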

Now let’s look at some definitions of a token:
Token (2005 Shift4 original definition)
1. A random value not mathematically related to the PAN, used as a reference to the sensitive data it is protecting

Token (2011 PCI “vendor-neutral” guidance document)
1. A mathematically reversible cryptographic function, based on a known strong cryptographic algorithm and strong cryptographic key (with a secure mode of operation and padding mechanism)
2. A one-way, non-reversible cryptographic function (e.g., a hash function with strong, secret salt)
3. Assignment through an index function, sequence number or a randomly generated number (not mathematically derived from the PAN)

Token (2013 PCI tokenization standard)
1. Tokenization as used within this standard is a process by which a surrogate value called a “token” replaces the primary account number (PAN), and optionally other data. The tokenization process may or may not have a process that changes a token back into the original PAN. If the token does have a process for obtaining its associated PAN, then that is the process of “de-tokenization.” The security of an individual token relies predominantly on the infeasibility of determining the original PAN knowing only the surrogate value (token).

Depending on the particular implementation of a tokenization solution, tokens used within merchant systems and applications may not need the same level of security protection associated with the use of PAN. Storing certain types of tokens instead of PANs is an alternative that may help to reduce the amount of cardholder data in the environment, potentially reducing the merchant’s CDE. Additionally, the merchant’s scope reduction may depend on the implementation of the tokenization solution or the type of solution implemented (e.g., bespoke, packaged, or outsourced to a third party). For more information on implementation refer to Tokenization Standards: Implementation Requirements.

2. There are two classifications of tokens, and each has two sub-types:

a. Reversible (De-tokenizable)
i. Cryptographic
ii. Non-cryptographic
b. Irreversible
i. Authenticatable
ii. Non-authenticatable

Did I mention how simple tokenization was? Forget that now!

With the original Shift4 definition, since tokens are not mathematically related to the PAN, there is no way tokens can consist of encrypted or hashed PANs.
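
To make the contrast concrete, here is a minimal Python sketch (my own illustration; the PAN and salt are hypothetical). A token drawn at random and stored in a lookup table has no mathematical relationship to the PAN, while a salted hash – one of the 2011 guidance’s “token” forms – is a pure function of the PAN and can be recomputed by anyone holding the salt:

```python
import hashlib
import secrets

pan = "4000000000000002"     # hypothetical test PAN
salt = "strong-secret-salt"  # hypothetical salt

# Original definition: a random token; the only PAN/token relationship
# is a protected lookup table held by the tokenization provider.
vault = {}
token = secrets.token_hex(8)  # no math performed on the PAN whatsoever
vault[token] = pan

# 2011 guidance, form 2: a salted hash "token" is fully derived from the
# PAN, so anyone who obtains the salt can confirm a guessed PAN offline.
hash_token = hashlib.sha256((salt + pan).encode()).hexdigest()
guess = "4000000000000002"
print(hashlib.sha256((salt + guess).encode()).hexdigest() == hash_token)  # True

# No analogous computation exists for the random token above.
```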

Let’s do a quick history lesson. How many people really know where or how the term tokenization was derived? (Hint: it was referenced in the original whitepaper on tokenization.) The token that inspired tokenization was an arcade token – Chuck E. Cheese’s, to be precise (ugh!). Parents buy tokens to give to their unsecured kids, who in turn run around the arcade exchanging the tokens for quality game play. With tokens, kids aren’t running around with money and big “got money?” targets on their backs just begging for thieves to steal it. Instead, they are running around with tokens of no monetary value outside the arcade. In a brainstorming session at Shift4 in late 2004/early 2005, we were trying to create a term to describe the process of exchanging CHD for a token – and poof, tokenization was born.

Now let’s add some complexity to this previously simple concept. As of September 2011 with the PCI guidance redefinition, tokens can be:

1. Encrypted data, and fall under all the various encryption requirements of PCI – consult your acquirer or QSA for guidance.

2. Hashed data, and fall under all the various hash requirements – consult your acquirer or QSA for guidance.

3. A pre-PCI defined, non-mathematically related token – I should be happy the original definition was accommodated in the guidance, BUT consult your acquirer or QSA anyway, because no one really knows whether a given token is a random, non-mathematically related value or a mathematically related encrypted or hashed value, and the latter requires further examination (is the encryption algorithm strong enough? are the keys stored and managed properly? are salt values stored and used properly? etc.).

So now we take something that was 100% secure (I’m talking about the token here, and only the token), simple, and in no way related to the information it is protecting other than by reference, and we add enough accommodations and ambiguities to make it a security hole waiting for an exploit. That, in turn, creates the need for additional standards to mitigate the PCI-introduced risks. Back to my conspiracy mindset: maybe someone realized that tokens were out of PCI scope and crafted a land grab? Or maybe it’s just what naturally happens when various vendors’ marketing departments gather to create a security guidance document. Either way, it was to the detriment of a simple, secure, public domain concept (and by the way, public domain implies vendor-neutral; tokenization was in the public domain prior to the PCI SIG’s squawking about vendor neutrality).

Let’s get to the documents.

Now, here is where the alternate universe prism starts to take effect. As a side reference, my stance on irreversible vs. reversible focuses on the token and the tokenization method itself. To me, irreversible means that the token is not decryptable – it is not mathematically related to the PAN. Reversible means it is mathematically related, by either encryption or hashing. Now in the PCI universe, irreversible vs. reversible has nothing to do with the token or the method and instead refers to whether or not the PAN is retrievable in any way – via decryption or lookup. In my mind, both our original tokenization definition and Shift4’s TrueTokenization® solution are irreversible. In the PCI universe, our solution is considered reversible. Just keep that in mind if something gets refracted sideways through the prism and ends up in the wrong universe.

Terms of Reference – Tokenization Task Force, version 1.0 (4 pages)
This document simply describes the task force by stating a purpose, some background, minimal objectives, and some overall ground rules – all very boilerplate. The only item of importance that I noted was a Q4 2014 goal for the final documents, so – thankfully – we’ve got time to fix this!

Payment Card Industry (PCI) Tokenization Standard: General Principles, version 0.6 (25 pages)
This document is intended to define tokenization and give a bird’s-eye view of the complexities of PCI tokenization (I remind you here, the beauty of tokenization was its simplicity). This document covers the requirements common to all four flavors of PCI tokenization (irreversible i & ii, reversible i & ii).

The document defines the stakeholders, roles, and four domains that may or may not apply to each flavor of tokenization. Their “at-a-glance” section does a fairly decent job of giving an overview of tokenization and the differences between the various flavors (although I don’t agree with having all the flavors – more later). Also included here are ten General Principles requirements for tokenization – referred to as GP 1.1 through GP 1.10. I imagine this is a fluid list that will grow substantially before final release (or sub-requirements will be added, essentially turning the 10 requirements into 100 or so sub-requirements).

While most of the requirements seem pretty straightforward to me, I did want to comment on a couple:

GP 1.1 Any environment seeking PCI DSS CDE scope relief based on the use of tokenization must not retain PAN within that environment and must not be connected to an environment where PAN is stored.

My only comment here is the vagueness of “connected.” When I read something like this, my mind automatically goes to the courtroom, where the battles will arise. I assume they mean directly connected, but one could argue that any connectivity to a gateway or processor is “connected,” which would void the entire compliance of a system using tokenization.

GP 1.6 The tokenization solution shall include a mechanism for distinguishing between tokens and actual PANs…

Does this deem format-preserving tokenization out of compliance? If so, will the P2PE standards adopt similar wording to deem format-preserving encryption out of compliance?
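
For what it’s worth, one mechanism sometimes cited for meeting a requirement like GP 1.6 while still preserving format is to generate tokens that deliberately fail the Luhn check, so a sixteen-digit token can never pass for a valid PAN. Here is a sketch of that idea (my own illustration, not something from the standard):

```python
import secrets

def luhn_checksum(digits: str) -> int:
    """Luhn mod-10 checksum over a full digit string (0 means valid)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, right to left
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10

def make_format_preserving_token() -> str:
    # Draw random 16-digit values until one FAILS the Luhn check,
    # guaranteeing the token cannot collide with any valid PAN.
    while True:
        candidate = "".join(secrets.choice("0123456789") for _ in range(16))
        if luhn_checksum(candidate) != 0:
            return candidate

token = make_format_preserving_token()
assert luhn_checksum(token) != 0  # distinguishable from any real PAN
```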

Payment Card Industry (PCI) Tokenization Standard: Irreversible Tokens, version 0.3.1 (34 pages)
Much of the first third of the document is a repeat of the first third of the General Principles document. The meat starts in the domain and requirements sections. The domain section is basically an overview of what the requirements will cover. The requirements are broken up into requirements and sub-requirements grouped by domain, 4A-1 through 4D-1.2 – I count 78 of them (+/-). Good thing tokenization is simple!

My original thought here was that irreversible tokens have no value within the tokenization specification. Tokens and the whole tokenization concept were based around protecting payments. As far as PCI is concerned, I believe this should be their only focus as well – after all, they call themselves the “Payment Card Industry.” I could not fathom how an irreversible token, authenticatable or not, could be of any use, since the token references nothing that can be retrieved. Reading most of the document didn’t change my mind until I got to Annex A – Use Cases for Tokenization. In there is a use case for irreversible, authenticatable tokens: warranty enforcement, verifying in the event of a lost receipt that a given card was the form of payment used. I guess I didn’t think of everything.
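
A rough sketch of how that warranty use case might work, using an HMAC as a stand-in for an irreversible, authenticatable token (my own illustration; the key and PAN are hypothetical):

```python
import hashlib
import hmac

key = b"merchant-held-secret-key"  # hypothetical authentication key

def authenticatable_token(pan: str) -> str:
    # Keyed one-way function: nothing retrievable, but verifiable.
    return hmac.new(key, pan.encode(), hashlib.sha256).hexdigest()

# At sale time, the merchant stores only the token:
stored = authenticatable_token("4000000000000002")

# Later, a customer with no receipt presents a card; the merchant
# recomputes the token and compares to verify that card made the purchase.
claim = authenticatable_token("4000000000000002")
print(hmac.compare_digest(stored, claim))  # True -> same card was used
```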

Now, I still don’t believe mathematically related tokens should be allowed as it complicates the tokenization concept and opens many possibilities for exploitation. I’ll cover more on this when discussing the next document.

Payment Card Industry (PCI) Tokenization Standard: Reversible Tokens, version 0.3 (32 pages)
Again, the first third is a rehash, except for three Christmas-colored figures showing two permissible forms of format-preserving tokenization (encryption) and one non-permissible form. With my decades of industry experience, I “think” I get what they are talking about with the non-permissible form, but it will take some serious work for them to convey it to a layperson.

Because I still firmly believe tokens should not be mathematically related to the PANs they are protecting, I’m not going to waste much of your time or mine on the “cryptographically reversible” sections of this document – they really don’t have a place in tokenization. Plain and simple, when anyone talks about their “cryptographically reversible token” or “vaultless tokenization” solution, they are simply talking about encrypted data that they have (mis)labeled as tokenization. For the life of me, I do not understand why this is not covered in the P2PE standards. Does this mean all P2PE solutions need to comply with both standards?

For any vendors reading this and taking offense, feel free to debate me on why you think your solution is not simply a P2PE solution labeled as tokenization – or a TINO (tokenization in name only). One word of warning, though: if you somehow win the argument by proving that your reversible tokens are not using encryption, you’ll then have to prove to the world that your tokens are secure despite being reversible yet not encrypted. Good luck with that one.

Again, in this document the meat starts in the domain and requirements sections. The part here that baffles me is that while this standard includes cryptographically reversible tokens (and by now you should know my thoughts on that!), I count only 25 requirements and sub-requirements vs. 78 in the irreversible standard. To me, if properly implemented and of sufficient strength, hashed data (which irreversible, authenticatable tokens use) represents less risk than encrypted data (which reversible, cryptographic tokens use). Yet the riskier data gets fewer requirements. Maybe this is just due to the immaturity of the documents and, as they mature, the requirements for reversible tokens may surpass those for irreversible tokens. Simplicity at its finest!

As to the non-cryptographic requirements in this document, I’m mostly OK with them until I get to the token-generation randomization requirement. Provided the token is not mathematically related to the PAN, a person (or program) cannot predict a PAN knowing only the token. And tokens are only valid for a single merchant, so why can’t the token simply be a sequential number? For simplicity, if 1=93478753, 2=32897399, and 3=08547343, can anyone tell me what 4 will equal?
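
Here is that point as a sketch (my own illustration): the token sequence can be perfectly predictable while revealing nothing about the PANs, because the association is pure assignment, not computation.

```python
import itertools

class SequentialTokenizer:
    """Toy tokenizer: tokens are simply 1, 2, 3, ... Knowing every prior
    token tells you the next token, but nothing about any PAN."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._pan_by_token = {}

    def tokenize(self, pan: str) -> int:
        token = next(self._counter)
        self._pan_by_token[token] = pan  # pure assignment, no math on the PAN
        return token

t = SequentialTokenizer()
print(t.tokenize("93478753"))  # 1  (values from the example above)
print(t.tokenize("32897399"))  # 2
print(t.tokenize("08547343"))  # 3
# Token 4 is obviously next -- and an attacker still learns nothing from it.
```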

The last issue is with a device HSM requirement. I understand this requirement for solutions installed in merchant environments. Password management and protection are complex, and many merchants would have a very difficult time creating, securing, and managing that password ecosystem – HSMs are much easier to standardize. But currently, I don’t see an exception for anyone, including a merchant using a system provided by a Level 1 Service Provider. This means the Level 1 Service Provider would be required to incorporate an HSM in any tokenization solution where POI devices are used. Is PCI saying that a PCI-compliant and assessed data center, while secure enough for handling card data, is not secure enough to handle passwords?

A quick note on what I believe to be a minor miscalculation in the requirements for “odds of guessing” (the same miscalculation is in the irreversible document as well, but since I all but skipped writing about that document, I’ll mention it here):

Requirement 1B: The probability of guessing a token to PAN relationship must be less than 1 in 10^6. (The token must not give any advantage to an attacker trying to guess the corresponding PAN.)

I believe this should be 10^5, not 10^6. The number comes from the chance of guessing a truncated or masked PAN. Under truncation rules, the first 6 digits plus the last 4 digits are allowed to be exposed. For a sixteen-digit PAN, that leaves 6 digits masked or truncated, or 10^6 possibilities. BUT, what is forgotten here is that the last four digits include the Luhn mod-10 check digit, and that checksum eliminates one degree of freedom, leaving 10^5. In addition, all of this assumes a sixteen-digit PAN; American Express uses fifteen-digit PANs.
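
A brute-force check of that arithmetic (a sketch; the exposed digits are hypothetical): fix the first six and last four digits of a sixteen-digit PAN and count how many of the 10^6 hidden middle combinations are consistent with the known Luhn check digit.

```python
def luhn_valid(digits) -> bool:
    """True if the full digit sequence passes the Luhn mod-10 check."""
    total = 0
    for i, d in enumerate(reversed(digits)):  # right to left
        if i % 2 == 1:  # double every second digit
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

first6 = [4, 0, 0, 0, 1, 2]  # exposed by truncation (hypothetical BIN)
last4 = [3, 4, 5, 6]         # exposed by truncation; includes the check digit

count = sum(
    luhn_valid(first6 + [int(c) for c in f"{middle:06d}"] + last4)
    for middle in range(10**6)
)
print(count)  # 100000 -- only 10^5 middles fit the known check digit
```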

Payment Card Industry (PCI) Tokenization Standard: Implementation Requirements, version ?.? (?? pages)

This document was not provided and may not yet exist. The introduction sections of the prior standards documents all refer to it, so I know it will be coming. When it arrives, I’m sure I’ll have some thoughts for you to read – so watch for part 5 of this series!

In Parting
I’m sure I’ll have other suggestions as I reread these documents and as they change over time. I’m going to keep saying this until I am blue in the face: TOKENIZATION IS NOT ENCRYPTION. PCI has a P2PE standard in the works, just like the tokenization standard. The two should be separate and stay separate! Until next time…