Summary of David Ditzel talk on binary translation

Summary of David Ditzel talk on binary translation.

David Ditzel: worked at Transmeta, now at Intel

A 25 year perspective on binary translation: what worked, what didn’t work.

Examples of binary translation:

pentium pro and later: translates x86 into internal UOPS via hardware
intel ia32el runs user level x86 programs on Itanium with ~60% perf of native machine
Java JITs, .NET (MSIL), VmWare
Apple’s Rosetta runs PowerPC user programs on x86
Transmeta Crusoe and Efficeon processors - transparent, full system level translation of x86

SYMBOL computer - implemented os, editor etc. in logic gates. Lessons: don’t do that. Use the right combination of software, hardware and micro-ops.

AT&T Crisp:

1987, first CMOS superscalar chip (superscalar: multiple instructions per clock)
Reduced Instruction Set Processor targeted at C
hw translated instructions from external, compact version into 180 bit wide UOP cache
optimization tricks:
branch folding - make branches disappear from pipeline
“Stack Cache” as registers - to reduce memory references

Lessons from AT&T Crisp:

translation from external instruction set to internal instruction set works well

Binary Instrumentation

MIPS had tools pixie and pixstats (~1987) to statically modify binaries to count instructions.

Sun followed (~1988) with spix, spixstats etc. Also were able to run MIPS on Sparc (at 1/3rd speed).

Sun tried to extend Sparc with instructions to help x86 emulation but decided that hw mismatch was too big - Sparc was not the right architecture for this.

Lessons realized in 1995:

dynamic translation was reaching ¹⁄₃ speed of native, static ¹⁄₂ of native
processor designed from scratch for binary translation might improve efficiency of dynamic translation by 2-3x
full system level binary translation might soon become practical and even exceed perf of standard microprocessor

That led to Transmeta in 1995. Transmeta:

spent $600M over 12 years in R&D
5 generations of processors although only 3 announced

Key challenges for hybrid processors:

must be 100% compatible. When doing binary translation between commodity processors (e.g. x86<->PowerPC), there are corner cases where emulating things exactly is tricky due to hardware mismatch, which causes inefficiencies in translation => must design for translation
precise control over user visible state, including precise exception and interrupt semantics
delivered performance, including overhead of translation

Hardware support for hybrid processors:

private, non-volatile storage (FLASH ROM), for storing translation software
private memory for storing translated code (stole 5% of DRAM during boot)
software controlled state commit/rollback/abort
more registers than x86
alias detection under software control
fine grain detection of self-modifying code
auto-typing of pure memory vs I/O (because I/O can be memory-mapped which prohibits some optimizations)
fast traps supported by underlying runtime system
instruction primitives for fast interpretation

Software controlled atomic execution - execute in temporary space and ability to rollback to previous commit point. Used to perform not-always-safe optimization which are considered ok as long as we hit next commit point without problems. If not, rollback and re-execute without optimizations. Needed to be able to undo stores to memory.

Transmeta’s code morphing:

first level - interpreter
second compiler
translation - it can cost 10000 instructions to translate 1

Efficeon improvements used 4 levels (gears):

first level - interpreter 15 instruction per 1, gathers
after executing basic block 50 uses quick translation (cost 500 instructions per 1 native), also gathers more information
after executing more than few hundred times - more optimized translation, more costly, classic optimization like common sub-expression elimination, memory re-ordering, critical path scheduling
for hot loops, optimize multiple code blocks and use more aggressive optimization

Lessons: optimization pay off. The bigger the blocks, the bigger optimization payoff.

Binary translation myths:

myth: translation is slow. It’s only ~20% overhead
myth: saving translation to disk is a good idea. In efficeon they improved translation so that it was faster to translate than read from disk
myth: static translation is faster. They compiled Linux kernel to run natively but it was slower than dynamic translation because dynamic translation could use runtime information to optimize.
myth: software isn’t reliable. Doesn’t match transmeta’s nor Transiitive’s experience: they didn’t have any x86 compatibility bugs

Why binary translation now: power usage since increasing cores requires more power so we might not have enough power to light up all processors at full speed.

Edna	speedy note taking app with super powers
SumatraPDF	small, fast, free PDF / ePub / comic book reader for Windows