Flexible Architectures in Communication Security Application

Authors Avatar

Master of Science in Computer Science and Engineering Thesis:

Flexible Architectures in Communication Security Application

Bryan Chong

Advanced Computer Architecture Laboratory

University of Michigan

Ann Arbor, MI 48109

Table of Contents

Acknowledgement............................................................................................ 5

Abstract ........................................................................................................ 7

Chapter 1 Introduction..................................................................................... 9

Section 1.1 Cryptography............................................................................. 9

Section 1.2 Contribution of This Thesis ...........................................................11

Chapter 2 The Nature of Cryptography ................................................................13

Chapter 3 Cipher Kernel Analysis ......................................................................17

Section 3.1 Cipher Analysis Tools..................................................................17

Section 3.2 Cipher Throughput Analysis ..........................................................18

Section 3.3 Bottleneck Analysis ....................................................................19

Section 3.4 Cipher Relative Run Time Cost.......................................................21

Section 3.5 Cipher Kernel Characterization .......................................................22

Chapter 4 Architectural Extensions.....................................................................25

Chapter 5 CryptoManiac Architecture..................................................................29

Section 5.1 System Architecture ....................................................................29

Section 5.2 Processing Element Architecture .....................................................31

Section 5.3 Instruction Set Architecture ...........................................................32

Section 5.4 Design Methodology ...................................................................33

Section 5.5 The Super Optimizer ...................................................................35

Section 5.6 Physical Design Characteristics.......................................................37

Chapter 6 Performance Analysis ........................................................................39

Section 6.1 Performance Analysis of ISA Extensions ...........................................39

Section 6.2 Performance Analysis of CryptoManiac.............................................42

Section 6.3 System Analysis of CryptoManiac ...................................................44

Chapter 7 Related Work..................................................................................47

Chapter 8 Conclusions and Future Work...............................................................49

References ....................................................................................................51

Page 5 4/22/01

Acknowledgement

Credit for much of the work described in this thesis belongs to my advisor, Professor Todd Austin, for

his insight, guidance, and patience. He provided for an excellent research environment, left me enough

freedom to do things the way I thought they should be done, and was always available to discuss ideas

and problems.

I would also like to thank my committee members Professor Steve Reinhardt and Professor Gary

Tyson for reviewing this document and serving on the defense committee.

Other people that have worked on the CryptoManiac project include Chris Weaver for hardware

design and synthesis support, Jerome Burke and John McDonald for earlier versions of ISA extensions

code modifications.

Page 7 4/22/01

Abstract

The growth of the Internet as a vehicle for secure communication and electronic commerce has

brought cryptographic processing performance to the forefront of high throughput system design.

Cryptography provides the mechanisms necessary to implement accountability, accuracy, and

confidentiality in communication. This trend will be further underscored with the widespread adoption of

secure protocols such as secure IP (IPSEC) and virtual private networks (VPNs). Efficient cryptographic

processing, therefore, will become increasingly vital to good system performance.

In this thesis, we explore hardware/software-design techniques to improve the performance of secretkey

cipher algorithms. We introduce new instructions that improve the efficiency of the analyzed

algorithms, and further introduce the CryptoManiac processor, a fast and flexible co-processor for

cryptographic workloads.

Our first approach is to add instruction set support for fast substitutions, general permutations,

rotates, and modular arithmetic. Performance analysis of the optimized ciphers shows an overall speedup

of 59% over a baseline machine with rotate instructions and 74% speedup over a baseline without

rotates. Even higher speedups are demonstrated with optimized substitutions (SBOX’s) and additional

functional unit resources. Our analyses of the original and optimized algorithms suggest future directions

for the design of high-performance programmable cryptographic processors.

To follow up on these suggestions, our second approach is to design an efficient piece of hardware

that runs cryptographic algorithms. We present analysis of a 0.25um physical design that runs the

standard Rijndael cipher algorithm 2.25 times faster than a 600MHz Alpha 21264 processor. Moreover,

our implementation requires 1/100th the area and power in the same technology. We demonstrate that the

performance of our design rivals a state-of-the-art dedicated hardware implementation of the 3DES

(triple DES) algorithm, while retaining the flexibility to simultaneously support multiple cipher

algorithms. Finally, we define a scalable system architecture that combines CryptoManiac processing

elements to exploit inter-session and inter-packet parallelism available in many communication

protocols. Using I/O traces and detailed timing simulation, we show that chip multiprocessor

configurations can effectively service high throughput applications including secure web and disk I/O

processing.

Page 9 4/22/01

Chapter 1 Introduction

Cryptography provides the mechanisms necessary to provide accountability, accuracy and

confidentiality in inherently public communication mediums such as the Internet. The widespread

adoption of the Internet as a trusted medium for communication and commerce has made cryptography an

essential component of modern information systems. The trend towards virtual private networks (VPNs)

[15] and secure IP (IPSEC) [3] will further emphasize the significance of cryptographic processing among

all types of communication. Security-related processing can consume as much as 95 percent of a server’s

processing capacity [25]. As demands for secure communication bandwidth grow, efficient cryptographic

processing will become increasingly critical to good system performance.

Section 1.1 Cryptography

Cryptography is a Greek word that literally means the art of writing secrets [21]. In practice,

cryptography is the task of transforming information into a form that is incomprehensible, but at the same

time allows the intended recipient to retrieve the original information using a secret key. Cryptographic

algorithms (or ciphers, as they are often called) are special programs designed to protect sensitive

information on public communication networks. During encryption, ciphers transform the original

plaintext message into unintelligible ciphertext. Decryption is the process of retrieving plaintext from

ciphertext. Two forms of cryptography are commonly used in information systems today: secret-key

ciphers and public-key ciphers. Secret-key ciphers (sometimes referred to as symmetric-key ciphers) use a

single private key to encrypt and decrypt as illustrated in Figure 1. Public-key ciphers (or asymmetric-key

ciphers) use a well-known public key to encrypt and require a different private key to decrypt. The

process may also be reversed to produce what is known as a digital signature. Digital signatures

authenticate the sender. Since only the person holding the private key knows its value, only that person

can create a digital signature that others can decrypt with the public key.

Figure 1. Public-Key and Secret-Key Ciphers.

F(x) G(x)

G(x) G(x)

Public-Key Cipher

Secret-Key Cipher

plaintext ciphertext plaintext

plaintext ciphertext plaintext

public key private key

private key private key

Lisa Wu Page 10 4/22/01

Public-key ciphers have the advantage of being able to establish a secure communication channel

without an unsafe exchange of keys. Private-key ciphers, on the other hand, require a shared private key

before secure communication can commence. The distribution of the shared private key is the primary

obstacle in making secret-key ciphers secure. Strong public-key ciphers are computationally very

expensive. Public-key encryption requires exponentiation and modular multiplication of large multiprecision

numbers of 1024 bits in length or more. Secret-key ciphers have the advantage of running as

much as 1000 times faster than comparable public-key ciphers [30]. To maximize security and

performance, most secure protocols use both forms of cryptography. Public key encryption is used at the

start of a session to authenticate communicating parties and to securely distribute a shared secret key. The

remainder of the session employs efficient secret key algorithms using the private key exchanged during

authentication. We refer this type of key management as public-secret key cryptography. An example of a

system that uses this particular session management strategy is the Secure Sockets Layer (SSL) protocol

[39].

SSL is a standard secure protocol that provides secure communication between web servers and web

clients. It is supported by most popular web browsers [25]. SSL extends TCP/IP to support secure

encrypted connections with authentication of senders and receivers. The protocol is used by web servers

to establish secure HTTP connections. For very short sessions, fast public-key cipher processing is critical

for high transaction throughput. For longer sessions, private-key cipher performance becomes more

important. Figure 2 illustrates the SSL run-time breakdown by server processing type. Data shown is

collected from a heavily loaded web server running on an iA32 platform [29]. Clearly, for very short

SSL Session Length (bytes)

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1k 2k 4k 8k 16k 32k

Publ ic

Ot her

Private

average size of a

single web object (21k)

Figure 2. SSL Protocol Relative Contribution

to Run Time.

Lisa Wu Page 11 4/22/01

sessions fast public-key cipher processing is crucial for high transaction throughput. A recent study [2]

found that the average size of a web object was 21k bytes. For a 32k session length, secret-key processing

overheads rise to 48% of overall run time. Given that a single session will be composed of many web

objects, cryptographic processing will be quickly dominated by secret-key cipher execution. To further

reduce the cost of public-key authentication, SSL allows the use of a session cache, where authenticated

private keys are held and can later be reused when users reconnect to view other web pages. As such, the

focus of our design effort is on improving secret-key cipher performance.

Section 1.2 Contribution of This Thesis

Cryptography can be implemented with software routines, directly in hardware, or a combination of

both. A software only approach is the lowest-cost solution but with accordingly lower performance. An

example of the hardware-only approach is the IDEA engine [23]. It is targeted specifically at the efficient

execution of the IDEA cipher and renders excellent performance. However, its performance is at the

expense of flexibility as the hardware cannot be used for other cryptographic processing tasks. In this

thesis, we present two hardware/software mixed solutions for efficient cryptographic processing.

The first hardware/software mixed approach is to add architectural extensions that streamline cipher

kernel processing. We first examine the execution of eight widely known strong secret-key ciphers. We

analyze their performance on detailed microarchitectural models, where we are able to clearly show their

performance and the bottlenecks that slow their progress. Armed with these insights, we propose a

general set of instructions that from which all these kernels can benefit. The technique improved

performance of secret-key ciphers through fast substitutions, general bit permutations, rotates, and

modular arithmetic. We re-code the ciphers using these new instructions and then examine their

performance on microarchitectural models with varying levels of support for fast cryptography.

We further extended this approach through the design of a fast and flexible cryptographic coprocessor.

Our design, called the CryptoManiac, addresses the primary bottleneck in secret-key ciphers,

namely efficiency, through the application of an efficient VLIW architecture with a well-matched

instruction set and functional unit resources. The programmable feature supports many secret-key ciphers,

in contrast to the IDEA engine. We hand-optimized the eight kernels and then validated the results using a

super optimizer as the scheduler with varying design parameters. By combining CryptoManiac processors

into parallel configurations, we are able to scale cryptographic performance for applications with intersession

and inter-packet parallelism. We detail the design and implementation of the CryptoManiac

processor and analyze its performance using architectural and physical design models.

My research contribution to this work was the design, implementation, and evaluation of the

CryptoManaic co-processor; this work was published in [42]. Specifically, I contributed the following: i)

design and implementation of the CryptoManiac functional units, ii) area, timing, power, and

Lisa Wu Page 12 4/22/01

performance analyses of the CryptoManiac, iii) experiments for design parameter decisions such as the

bypass logic of the CryptoManiac, the width of the VLIW co-processor, and how the instruction

combining capability affects the design of the hardware, iv) design and implementation of the super

optimizer, v) validation and instruction combination studies done by using the super optimizer, and vi) the

implications of cryptographic algorithms mixing arithmetic and logical instructions.

We discuss the operations of strong secret-key ciphers by carefully examining the Twofish cipher in

Chapter 2. In Chapter 3, we present our experimental framework and analyses of cipher kernel

characterizations, operations, and bottlenecks to gain insight as to how to build efficient designs. In

Chapter 4, we introduce the architectural extensions to support fast cryptographic processing. In Chapter

5, we present the CryptoManiac design, detailing the processor architecture, the system architecture, and

the instruction set. We then examine performance results in Chapter 6. In the final sections, we discuss

related work, summarize our findings, and make suggestions for future work.

Lisa Wu Page 13 4/22/01

Chapter 2 The Nature of Cryptography

Figure 3 shows the kernel of the Twofish cipher, developed by Counterpane Systems [35]. It is a

candidate for the Advanced Encryption Standard (AES) [1], the US government’s effort to develop a new

strong encryption standard. Twofish is a particularly good example to look at because it captures many of

the operations that ciphers employ. The code shown in Figure 3 is the encryption kernel, run on one 128-

bit block of data to encrypt it into a 128-bit ciphertext block using a 128-bit key value. The Twofish

decryption kernel is nearly identical except the order of the operations is reversed and inverted (e.g.,

rotate left becomes rotate right).

The cipher algorithm first reads the input data, XOR’s it with the 128-bit intermediate vector (IV) and

key, and then enters the encryption loop. The encryption loop executes 16 iterations (or rounds as they are

called in cryptography literature) to produce 128 bits of ciphertext. The ciphertext is then once again

XOR’ed with the key and stored to the intermediate vector and the output buffer.

Within the kernel loop, the cipher algorithm employs a series of reversible operations to implement a

process called diffusion. Diffusion works to randomly impress upon each of the output bits some

information from each of the input bits. The direction of the diffusion process is set by the private key.

The more seemingly random and complete the diffusion process is, the more difficult it is to recreate the

plaintext without the key value. With large keys and good diffusion, the ciphertext is extremely resistant

to attackers. Quantitatively, a strong encryption algorithm is one where any change in the input results in

x[0] = input[0] ˆ key[0] ˆ IV[0];

x[1] = input[1] ˆ key[1] ˆ IV[1];

x[2] = input[2] ˆ key[2] ˆ IV[2];

x[3] = input[3] ˆ key[3] ˆ IV[3];

for (ii=15, jj=0; ii >= 0; ii--, jjˆ=2) {

t0 = (sbox1[x[jj][0]] ˆ sbox2[x[jj][1]]

ˆ sbox3[x[jj][2]] ˆ sbox4[x[jj][3]]);

t1 = (sbox1[x[jjˆ1][3]] ˆ sbox2[x[jjˆ1][0]]

ˆ sbox3[x[jjˆ1][1]] ˆ sbox4[x[jjˆ1][2]]);

x[jjˆ3] = x[jjˆ3] <<< 1;

x[jjˆ2] = x[jjˆ2]ˆ(t0+t1+key[ii<<1]);

x[jjˆ3] = x[jjˆ3]ˆ(t0+(t1<<1)+key[(ii<<1)+1]);

x[jjˆ2] = x[jjˆ2] >>> 1;

}

output[0] = IV[0] = x[2] ˆ key[0];

output[1] = IV[0] = x[3] ˆ key[1];

output[2] = IV[0] = x[0] ˆ key[2];

output[3] = IV[0] = x[1] ˆ key[3];

Figure 3. The Twofish Cipher Kernel. All variables are

32-bit integers. Rotates are indicated by <<< and >>>.

Lisa Wu Page 14 4/22/01

a random perturbation of each output bit with probability 50%. Moreover, any change to the key value

should have an equally dramatic effect on the ciphertext produced.

The process of diffusion has two important implications to the underlying machine architecture. First,

diffusing input bits is computationally expensive on modern microprocessors. Most algorithms run their

kernel loop at least 16 times on each block of data encrypted, successively mixing the data more and more

on each round. The second implication is that cipher kernels have little parallelism. Parallelism in the

cipher (especially coarse-grained parallelism) would imply that some aspect of the computation does not

affect later ciphertext results, which would in turn imply that the cipher algorithm was not a strong one!

While we did find a small level of ILP in cipher kernels, the process of making the kernels run fast

primarily entails improving their execution efficiency on the underlying microarchitecture. The

intermediate vector IV ensures that the diffusion process propagates to all remaining ciphertext in the

communication stream. The ciphertext value of encrypted block i is first XOR’ed with plaintext block i+1

before it is encrypted. The end result is that the cipher kernel execution is a very long recurrence with

virtually no parallelism.

The kernel loop also exhibits a mixing of arithmetic and logical operations. Kernels that contain only

arithmetic or only logical operations can easily be attacked using linear analysis. The implication of

mixing these instructions with sequence as an ADD followed by a XOR or an AND followed by an ADD

is to make the cipher resilient to attacks.

To ensure that the ciphertext can be decrypted back to the plaintext, the cipher kernel must employ a

series of key-parameterized reversible operations. The Twofish algorithm demonstrates a number of

these:

Rotates Rotates are easily reversible (simply rotate the same distance in the opposite direction).

Rotates also have good diffusion properties, impressing each bit onto another bit of the output.

Modular Addition Modular arithmetic, if based on a power-of-two base, is cheap, fast, and has

relatively good diffusion properties. Moreover, it is easily inverted using modular subtraction or modular

addition with the two’s-complement of the addend. XOR operations, which are modular additions in base

2, are easily reversible by XOR’ing the same value onto the resulting ciphertext.

Substitutions Table-based substitutions can be used to quickly implement any keyparameterized

function. An SBOX is a table of values indexed with plaintext (usually byte) that produces

the result of the key-parameterized function. SBOX’s are easily reversible by inverting the table, i.e.,

indices become values and values become indices.

In addition, other algorithms often employ two other mechanisms, modular multiplication and XBOXs.

Lisa Wu Page 15 4/22/01

Modular Multiplication Modular multiplication has been shown to have particularly good

Join now!

diffusion properties [23], and the operation can be easily reversed with modular multiplication of the

modular inverse of the multiplicand. If the multiplicand is part of the key, all divides (which are typically

much more expensive) can be confined to the cipher setup code. If the modulus of the operation is a

power-of-two (as in RC6), it can be efficiently implemented using existing multiply instructions. A few

algorithms (most notably IDEA) use a modulus of a 2N+1 prime number. This further improves diffusion

properties of the operation at the expense of more computation. Techniques have been developed to

efficiently implement ...

This is a preview of the whole essay