Introduction

Robert Collins of Intel Secrets refuses to publicize and document this bug, so I am going to do it. Hopefully, this document will shed some light on the bug. I am not writing just for you techies out there. This is also for people who need to run the so-called "single-user single-minded toy OS".

This bug is being referred to as the "Pentium F0" bug. I believe the media picked this up from a newsnet typo. Personally, I prefer calling it LOCK1107 or NONAME1107 or BCD1107, based on Collins' bug naming scheme. However, I am in no position to argue with the media. :)

I did not discover this bug. I am assessing it based on my knowledge of the implementation of the Pentium and Pentium Pro processors and on some tidbits from news net postings. I don't have a Pentium machine. I am just a proud owner of an AMD 486DX100 running Linux Redhat 4.0. :)

Intel has confirmed the existence of the bug. Read Techweb's article: Intel Confirms Latest Pentium Glitch.

If you see any errors in my writing or have any comments, please email me at hasdi@umich.edu with "YOUR PENTIUM ARTICLE" as the subject line.
 

General

Is Pentium Pro affected by the bug?

No. This invalid instruction only affects classic Pentiums (non-MMX) and MMX Pentiums. Pentium Pro users have so far reported no problems. Bear in mind that Pentium and Pentium Pro are two separate beasts, developed independently by two different Intel engineering teams. Pentium is as architecturally different from Pentium Pro as Windows 95 is from Windows NT. Same outside, different Intel team inside (heee).

What about Pentium II?

Pentium II is based on the Pentium Pro architecture, so I do not expect it to have this bug.

Should I care about this bug?

Depends on what you use your Pentium machine for. If you are using it to run Windows 95 for general office work or games, this bug generally does not concern you. There are many many many other ways a malicious program can freeze up Windows 95. This is one more, just a little shorter and cuter. If you are using Windows NT, Unices or any other multi-user system, you should start getting medieval on Intel. This invalid instruction can freeze up your operating system even if it is executing in user mode! This totally violates the IA-32 security model by making multi-user machines vulnerable to denial of service attacks.

Oh, if you have a Pentium card in your Mac, forget about the bug. There are *even more* ways to lock up your Mac. You have gone this far, what's one more way to lock up? :)

What is the nature of this bug?

According to the newsgroup posting, the processor will lock up when executing the "F0 0F C7 C8" byte sequence. This is NOT a valid Pentium instruction - no program, commercial or otherwise, will have this instruction by accident. You might run into it if your program goes ballistic and jumps to a memory location with junk contents, but when that happens you are screwed anyway. The expected behavior for a Pentium-level processor is to complain that this is not a valid instruction (for you techies out there, signal a #UD exception). The Pentium's actual behavior is to just lock up. Pentium Pro and the Pentium-level clone processors from Cyrix and AMD correctly handle the invalid instruction. The worst part is, the Pentium locks up even if you are executing the byte sequence at the lowest privilege level (user mode)! If you don't know what privilege level means, you don't need to know, unless you are running a real man's OS like Windows NT, Linux, Unices, BeOS-Intel, OS/2, etc., etc.

When this byte sequence is executed, you have to hit the reset button to recover from it.

How do I test this bug in DOS/Windows 95/Windows NT?

You can use the handy-dandy DEBUG.EXE utility that comes with DOS (???? stands for any 4-digit hex number):

	C:\> debug
	- a100 [ENTER]
	????:0100 db  f0 0f c7 c8 [ENTER]
	????:0104 ret [ENTER]
	????:0105 [ENTER]
	- g [ENTER]
If you have the bug, your system will hang and you will have to reboot. If you don't, in Windows 95/NT, you get the Blue Screen or a pop-up window before the operating system "safely" kills the command shell. Dunno what happens in "pure" DOS. (anybody know?)

How do I test this in Linux/Unices?

The following program works with gcc. I don't see any reason it will not work on another compiler.

	$ cat >  f0.c
	char main[] = { 0xf0, 0x0f, 0xc7, 0xc8 };
	^D
	$ gcc f0.c
	$ ./a.out
	Illegal instruction (core dumped)
If you don't have the bug, you will get a core dump. If you do, the system will hang. (AMD rulez! :)

I am using a Pentium to run a multi-user system. What should I do?

Historical

Who publicized this bug?

Some person with a bogus email address (noname@noname.com) posted information on this bug to a Linux newsgroup from a Texas University dial-up line on November 7th 1997. Robert Collins claimed that he discovered this bug a few months prior to the posting, presumably shortly after he reported the Pentium Pro floating-point bug on May 2nd 1997. Jim Brooks claimed that Collins demonstrated this bug to him about 4 - 5 months ago. Collins also claimed that he reported this to Intel twice but had received no response until the bug went public on November 7th. Intel, on the other hand, claimed they did not know about this bug before the newsgroup posting.

Who the heck is Collins?

Check out his web site for more information on his exploits. :)

Why didn't Collins publicly disclose this bug earlier?

"Sometimes keeping quiet is more appropriate than yelling FIRE in a theatre of multi-user sys admins."
-Robert Collins (11/07/1997- comp.sys.intel)
 

How was this bug discovered?

By luck, I suppose. Or by fully running chip validation tests, the one thing Intel should do before mass producing the chips. They probably skipped this one test.

Leonid A. Broukhis proposed a theory, if you don't mind me being a bit technical. The bug can be discovered when (1) you are developing multi-processing and/or embedded systems AND (2) you have a home-brewed (read: buggy) assembler that allows register addressing mode for cmpxchg8b AND (3) some programmer forgot to put brackets around the register. This is a very likely scenario. We all know how notorious GAS can be. :) It would be interesting to know the list of assemblers that would actually allow invalid addressing modes.
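
To see how small the slip in (3) is, here is a minimal C sketch of how the last byte of the instruction (the ModRM byte) gets assembled. The helper names modrm and encode_cmpxchg8b are made up for illustration; the field layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0) and the "/1" opcode extension for CMPXCHG8B are from the IA-32 instruction format. Treat it as a sketch, not as how any real assembler is written.

```c
#include <stdint.h>

/* Hypothetical helper: pack the three ModRM fields into one byte.
   mod = bits 7-6, reg = bits 5-3, rm = bits 2-0. */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/* Hypothetical encoder for "cmpxchg8b [reg]" (with_brackets != 0)
   versus "cmpxchg8b reg" (with_brackets == 0).
   CMPXCHG8B is opcode 0F C7 with the reg field set to 1 ("/1").
   Only meant for reg_no 0-3, 6 and 7: with mod = 0, r/m values 4 and 5
   do not mean [esp] and [ebp] but select SIB and disp32 forms. */
static void encode_cmpxchg8b(uint8_t out[3], int with_brackets, unsigned reg_no)
{
    out[0] = 0x0F;
    out[1] = 0xC7;
    /* mod = 0: operand is memory at [reg]; mod = 3: the register itself. */
    out[2] = modrm(with_brackets ? 0u : 3u, 1u, reg_no);
}
```

With reg_no = 0 (eax), the bracketed form comes out as 0F C7 08 and the bracket-less form as 0F C7 C8. Only the two mod bits differ, so an assembler that does not reject the register form will happily emit the bad bytes.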

Technical

What does the "F0 0F C7 C8" byte sequence stand for?

	lock:	cmpxchg8b eax

Roughly speaking, perform the following atomically by asserting the LOCK# signal:

	cmp	edx:eax, eax
	jnz	BRAN1
	mov	eax, ecx:ebx
	jmp	BRAN2
BRAN1:	mov	edx:eax, eax
BRAN2:

In other words, compare a 64-bit number with a 32-bit number and move between them depending on the result. (Thanx for the fix, Arturo! :) You can't compare numbers with unequal bit length representation, much less move between them. This instruction's operand is supposed to be a 64-bit number located in memory. The correct action is to signal #UD, the invalid opcode exception (see Pentium reference manual).
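
For contrast, this is roughly what the valid form (CMPXCHG8B with a 64-bit memory operand) is supposed to do, written as a plain C model. The struct regs_t and the function cmpxchg8b_model are names I made up for this sketch, and the model is NOT atomic; it only shows the data movement and the resulting zero flag.

```c
#include <stdint.h>

/* Hypothetical register file: just the registers CMPXCHG8B touches. */
typedef struct {
    uint32_t eax, ebx, ecx, edx;
    int zf;                       /* zero flag after the instruction */
} regs_t;

/* Software model of the valid form, CMPXCHG8B m64.
   Compare EDX:EAX with the 64-bit value in memory.
   Equal:     store ECX:EBX into memory, set ZF.
   Not equal: load the memory value into EDX:EAX, clear ZF.
   The real instruction, with the LOCK prefix, does this
   read-modify-write atomically; this model does not. */
static void cmpxchg8b_model(regs_t *r, uint64_t *m64)
{
    uint64_t expected = ((uint64_t)r->edx << 32) | r->eax;

    if (*m64 == expected) {
        *m64 = ((uint64_t)r->ecx << 32) | r->ebx;
        r->zf = 1;
    } else {
        r->edx = (uint32_t)(*m64 >> 32);
        r->eax = (uint32_t)*m64;
        r->zf = 0;
    }
}
```

Note that the model needs a 64-bit memory location to work on. With a 32-bit register as the operand there is nothing sensible to put in m64, which is why the register form is defined as invalid.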

Does this only affect "F0 0F C7 C8" byte sequence?

Navindra Umanee <navindra@cs.mcgill.ca>, in comp.sys.intel, reported that the bug affects the following byte sequences:
    0xf0, 0x0f, 0xc7, 0xc8    -> lock:    cmpxchg8b eax
    0xf0, 0x0f, 0xc7, 0xc9    -> lock:    cmpxchg8b ecx
    0xf0, 0x0f, 0xc7, 0xca    -> lock:    cmpxchg8b edx
    0xf0, 0x0f, 0xc7, 0xcb    -> lock:    cmpxchg8b ebx
    0xf0, 0x0f, 0xc7, 0xcc    -> lock:    cmpxchg8b esp
    0xf0, 0x0f, 0xc7, 0xcd    -> lock:    cmpxchg8b ebp
    0xf0, 0x0f, 0xc7, 0xce    -> lock:    cmpxchg8b esi
    0xf0, 0x0f, 0xc7, 0xcf    -> lock:    cmpxchg8b edi

In every case, you are trying to compare and move a 64-bit number with a 32-bit number, which does not make sense. All of them should signal the #UD exception.
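
What these eight sequences have in common can be read off the last byte. Below is a small C sketch that splits it into its three ModRM fields and checks for "LOCK prefix + CMPXCHG8B + register operand". The names modrm_fields, split_modrm, reg32_name and is_locked_cmpxchg8b_reg are my own; the register numbering (0 = eax, 1 = ecx, 2 = edx, 3 = ebx, 4 = esp, 5 = ebp, 6 = esi, 7 = edi) is the standard IA-32 one.

```c
#include <stdint.h>

typedef struct {
    unsigned mod;  /* bits 7-6: 3 means the operand is a register */
    unsigned reg;  /* bits 5-3: opcode extension, 1 for CMPXCHG8B */
    unsigned rm;   /* bits 2-0: register number when mod == 3 */
} modrm_fields;

/* Split a ModRM byte into its three fields. */
static modrm_fields split_modrm(uint8_t b)
{
    modrm_fields f = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return f;
}

/* Standard IA-32 numbering of the 32-bit general registers. */
static const char *reg32_name(unsigned n)
{
    static const char *const names[8] = {
        "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
    };
    return names[n & 7u];
}

/* Nonzero if the 4 bytes are F0 (LOCK), 0F C7 /1 (CMPXCHG8B)
   with mod == 3, i.e. a register where memory is required. */
static int is_locked_cmpxchg8b_reg(const uint8_t code[4])
{
    modrm_fields f = split_modrm(code[3]);
    return code[0] == 0xF0 && code[1] == 0x0F && code[2] == 0xC7
        && f.mod == 3 && f.reg == 1;
}
```

Any last byte from C8 to CF has mod = 3 and reg = 1, so the check matches all eight sequences, while the valid memory form F0 0F C7 08 (lock: cmpxchg8b [eax]) does not match.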

My original theory is that locking any instruction that signals a #UD exception will automatically freeze up a Pentium but "lock: lea ax,ax" seems to work. "lock: lea ax,ax" is invalid for two reasons: 1) LOCK does not go with LEA and 2) LEA does not accept the source operand to be in register addressing mode (as in the case with CMPXCHG8B).

My second theory is that locking any instruction with an invalid addressing mode will freeze up the processor but "lock: lea ax,ax" works as expected. As far as I know, the only opcode that LOCK can go with but cannot have register addressing mode is CMPXCHG8B.

The third theory is that locking a lock-able instruction (e.g. XCHG, CMPXCHG or CMPXCHG8B with a memory operand) will freeze up the processor if it signals any exception. One possible test is "mov esi,0; lock: cmpxchg8b [esi]". This should cause a page/seg fault, but Leonid Broukhis just confirmed to me that the Pentium responded properly (probably because the page/seg fault handler gets called often enough to be in the cache).

Why did this bug occur? Pentium is a classic pipeline machine (strictly ordered instruction execution). The decoder is a two-stage pipeline. My guess is that the Pentium recognizes, in the first stage, that LOCK and CMPXCHG8B go together AND, in the second stage, that CMPXCHG8B has a memory-only addressing mode. The LOCK is asserted somewhere in the first stage but does not get deasserted in the second stage. Then again, what do I know. :/

How do you overcome this bug?

Replace your Pentium chip with an AMD or a Cyrix chip. :)

Jim Brooks may have found a possible workaround. Check out ftp://ftp.jimbrooks.org/f0opcode.zip. Jim Brooks observed that if the processor has recently invoked the #UD exception handler, it is very possible that the Pentium will correctly handle the invalid byte sequence. It seems that if some information related to the #UD exception handler is already in the cache, the invalid instruction will not lock up. Jim Brooks guessed that it is the IDT gate descriptor for the exception handler. I suspect that there is a deadlock situation. By using the lock prefix, you are locking the external bus. With the external bus locked, you cannot invoke the exception handler because invoking it requires accessing the external bus. This cycle causes a deadlock.

Thus, a possible solution is to lock the exception handler information into the cache. On the Pentium, this would mean tripping the exception on purpose and then disabling the cache (set CD to 1). The loaded lines will still be valid (if I read the manual right). If the cache needs to be flushed, reload the exception handler. Unfortunately, disabling the cache means multi-user systems will run like molasses. :(
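
The deadlock guess and the workaround boil down to a tiny decision table. The C sketch below is a toy model of that guess only: the names outcome_t and locked_invalid_opcode are invented, and the rules in it are the hypothesis from the paragraphs above, not documented Pentium behavior.

```c
#include <stdbool.h>

typedef enum { UD_DELIVERED, CPU_HUNG } outcome_t;

/* Toy model of the hypothesis, nothing more. Assumptions:
   - delivering #UD needs the IDT gate descriptor of the handler;
   - if that descriptor is in the cache, no bus access is needed;
   - if it is not, it must be fetched over the external bus,
     which cannot happen while the bus is held locked. */
static outcome_t locked_invalid_opcode(bool bus_locked, bool idt_gate_in_cache)
{
    if (idt_gate_in_cache)
        return UD_DELIVERED;   /* workaround case: no bus access needed */
    if (bus_locked)
        return CPU_HUNG;       /* fetch waits on a bus that never frees up */
    return UD_DELIVERED;       /* ordinary invalid opcode: fetch and deliver */
}
```

In this model the only losing combination is "bus locked, descriptor not cached", which is the case the workaround tries to rule out by keeping the descriptor in the cache.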



hasdi@umich.edu
