Dual EC DRBG

Overview:

Index

Ecosystem vulnerability

Background:

DRBGs and back doors

Description of Dual EC

Dual EC news:

Exploitability in TLS

Attack cost

References:

Downloads and links

Performance of the attacks

This page summarizes the latest news regarding the cost of computations to exploit Dual EC in TLS. This news is from the 2014 research paper On the Practical Exploitability of Dual EC in TLS Implementations by Checkoway, Fredrikson, Niederhagen, Everspaugh, Green, Lange, Ristenpart, Bernstein, Maskiewicz, and Shacham.

Major parts of the computation are totally independent and can be performed in an arbitrary order, in particular in parallel. Therefore, an attacker can use a huge number of processor cores, for example multicore processors or computer clusters. The authors of the paper implemented the attacks using the standards OpenMP (to distribute the workload over threads running on different processor cores) and MPI (to distribute the workload over several processes running on different CPUs, e.g., in a computing cluster).

The performance of the attacks was measured on

a reference CPU, an Intel Xeon CPU E3-1275 v3, and
a reference cluster, a four-node, quad-socket AMD Opteron 6276 (Bulldozer) computing cluster.

Performance optimizations

Shumow and Ferguson mentioned experiments when they announced the basic back door at Crypto 2007 but did not publish their code. The first publicly available code was from a blog Dual_EC_DRBG backdoor: a proof of concept by Aris Adamantiadis in December 2013. He implemented a proof of concept for the basic back door. Adamantiadis' code recovers the state from 30 output bytes within 12.1 seconds on the reference CPU.

The authors of the 2014 paper used optimized routines for the elliptic curve computations. This takes only 3.3 seconds on the reference CPU for the same computation described by Adamantiadis. The biggest speedup is obtained for computations involving elliptic curve points that are fixed (the P and Q in the Description of Dual EC) but also the curve arithmetic and various functions are much faster.

Real-world targets

The authors of the 2014 paper also showed that the number and positioning (output gaps and alignment) of available Dual EC output bytes in TLS varies significantly between the different attack targets. As a consequence the run time of the attack on the reference CPU varies strongly (from a few seconds to several hours) between different attack targets. The fastest actual attack requires significantly more computation than the proof of concept.

To give an estimation of the computing cost for each attack, the researchers measured the average run time on the reference CPU. In order to mount real-time surveillance, the attack must be finished in a small amount of time. Therefore, the researchers quantified the computing cost of the attacks by computing the number of reference CPUs required to finish the attack within one second. To prove the scalability of the attack on a parallel computing system, the researchers measured the worst-case run time of the attack on the reference cluster.

General approach for real-world attacks

The general approach for all attacks described in the research paper is the same (explained in more detail in The basic back door): For a given random output string derived from the handshake message, several missing bits need to be guessed to obtain the original x coordinate from the point on the elliptic curve (i.e., the corresponding x coordinate for the integer r in the Description of Dual EC). "Guessing bits" means to try all possible choices for the bits until the correct value is found: For each candidate of the x coordinate, the corresponding y coordinate is computed. About half of the candidates do not belong to a point on the elliptic curve (there is no corresponding y coordinate), so the check for these candidates can be aborted. For candidates that do belong to a point on the curve, the back-door computation is applied to compute a candidate for the inner state, and subsequent random output is computed. The random output is compared to corresponding data fields in the handshake message. If it does not fit, the candidate is discarded. If it fits, the candidate is the correct one with high probability and more outputs can be computed and compared to the handshake message until eventually the secret keying data is recovered.

For each attack, the number of bits that need to be "guessed" and the amount of computation necessary to compute subsequent random output strings varies. Thus, each attack has a different total cost and a different average run time.

BSAFE

The researchers investigated two versions of the BSAFE library: RSA BSAFE Share for C/C++ ("BSAFE-C") and RSA BSAFE Share for Java ("BSAFE-Java").

BSAFE-C:
The simplest attack from the attacker's perspective is BSAFE-C because the largest amount of bits from the x coordinate is available: 30 out of 32 bytes are known. The average run time of the attack on the reference CPU is only 0.26 minutes. To finish the attack within one second requires only 16 reference CPUs. On the reference cluster, the worst-case run time of the attack is only 0.04 minutes.
BSAFE-Java:
The Java version of BSAFE exhibits fewer bits of the x coordinate, only 28 bytes, and therefore requires much more time than the attack on BSAFE-C. On the reference CPU, the average run time is 641 minutes. The attacker requires 38,500 reference CPUs to finish the attack in only one second. This is feasible as of today, even on a public computer: the Titan supercomputer at the Oak Ridge National Laboratory has 38,400 processors (counting both its CPUs and GPUs). On the much smaller reference cluster, the attack requires 63.96 minutes in the worst case.

Windows SChannel

There are two situations for the paper's SChannel attacks: situation I, where the attacker has access to RNG output from the session prior to the one being attacked, and situation II, where the attacker has no previous RNG output. The attack is noticeably faster in situation I. This means that an attacker with access to all communication data has an advantage over one with access to only one target session, even when it comes to attacking that single session.

Situation I: Cleartext RNG output from the previous session is available.
In the first attack situation on Windows SChannel, the same number of bits from the x coordinate is known as for BSAFE-Java but it requires slightly less computation because of other choices in the protocol. The average run time on the reference CPU is 619 minutes; 37,100 CPUs are required to finish the attack within one second. As the BSAFE-Java attack, this is possible with today's technology. On the reference cluster, the attack requires a worst-case run time of 62.97 minutes.
Situation II: Cleartext RNG output is only available from the current session.
In the second attack situation on Windows SChannel, two bits less (only 27 bytes and 6 bits) from the x coordinate than in Situation I and BSAFE-Java are known. The output bytes appear in a different part of the TLS protocol. The amount of information is even somewhat larger than in the first situation, but there are gaps between those bytes and testing each guess is more expensive. The average run time on the reference CPU is 1,760 minutes. In order to finish the attack within one second, 106,000 CPUs are required. This number of CPUs also is in the reach of today's technology: the Tianhe-2 supercomputer at China's National University of Defense Technology, the number one on the TOP-500 list from November 2013, has 80,000 CPUs in total. The worst-case run time on the reference cluster is 182.64 minutes.

OpenSSL-fixed

Similar to BSAFE-C, also for OpenSSL 30 bytes of the x coordinate are known. However, OpenSSL adds entropy in every function call to Dual EC that also needs to be guessed by the attacker. This additional input consists of the system time in microseconds, process ID (pid), and an incremental counter value. The cost of the attack on OpenSSL-fixed depends on how much information about the additional input is available.

Situation I: pid and counter are known.
The time in seconds is known from the time stamp in the server random field of the handshake massage. Only the correct microseconds for the additional input have to be found. The average run time on the reference CPU is only 0.04 minutes; just three reference CPUs are required to finish the attack in one second. The worst-case run time on the reference cluster is only 0.02 minutes.
Situation II: only the pid is known.
The average run time on the reference system is 707 minutes. 44,200 reference CPUs are required to finish the attack within one second. As in the previous cases, computing clusters of this size are feasible as of today. On the reference cluster of only 16 CPUs, the attack requires 83.32 minutes.
Situation III: neither the pid nor the counter are known.
For a counter size of k bits, the run time compared to the second situation is increased by a factor of 2^k. With today's publicly available computing power, it is infeasible to finish the attack within one second, but it is possible to finish the attack within 2^k seconds. A one-second attack could also be feasible (depending on the value of k) with non-public clusters and special-purpose hardware.

The following table summarizes the results:

Library	Situation	Reference CPU		Reference Cluster
Library	Situation	Expected Runtime (minutes)	Expected Cost	Worst-case Runtime (minutes)
BSAFE-C		0.26	16	0.04
BSAFE-Java		641	38,500	63.96
SChannel	I	619	37,100	62.97
SChannel	II	1,760	106,000	182.64
OpenSSL-fixed	I	0.04	3	0.02
OpenSSL-fixed	II	707	44,200	83.32
OpenSSL-fixed	III	2^k⋅707	2^k⋅44,200	2^k⋅83.32

Authors of this "Performance of the attacks" page (alphabetical order)

Daniel J. Bernstein, University of Illinois at Chicago and Technische Universiteit Eindhoven
Tanja Lange, Technische Universiteit Eindhoven
Ruben Niederhagen, Technische Universiteit Eindhoven

Last modified: 2014.07.07