I noticed this problem back in January of 2004, with Intel C++ 8.0, and spent nine months going back and forth with Intel's customer support trying to get it fixed, until I eventually had to abandon their compiler.
On any non-Intel processor, it specifically included an alternate code path for "memcpy" that actually used "rep movsb" to copy one byte at a time, instead of (for example) "rep movsd" to copy a doubleword at a time (or MMX instructions to copy quadwords). This was probably the most brain-dead memcpy I'd ev
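To make the byte-versus-doubleword complaint concrete, here is a rough sketch in portable C++. It is not the compiler's runtime code; copy_bytes, copy_dwords and moves_needed are names made up for this illustration, and the loops only mimic the amount of work that "rep movsb" and "rep movsd" would do per element.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies one byte per iteration, the way a
// "rep movsb" style loop moves data.
void copy_bytes(unsigned char* dst, const unsigned char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Hypothetical helper: copies four bytes per iteration, the way a
// "rep movsd" style loop moves data, then finishes the tail bytewise.
void copy_dwords(unsigned char* dst, const unsigned char* src, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, src + i, 4);  // alignment-safe 4-byte load
        std::memcpy(dst + i, &word, 4);  // alignment-safe 4-byte store
    }
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Number of element moves needed for n bytes at the given width:
// full-width moves plus one move per leftover byte.
std::size_t moves_needed(std::size_t n, std::size_t width) {
    return n / width + n % width;
}
```

For a 1 MB buffer, moves_needed gives 1048576 moves at width 1 and 262144 at width 4, which is the gap being complained about. An optimizing compiler may turn both loops into the same machine code, so this shows the idea and is not a benchmark.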
For about a year, I've been patching my Intel Compiler-compiled code because of this issue. I have to give credit to a poster on the comp.arch newsgroup for an explanation of ONE of the issues, and a workaround. This is not the only anti-Athlon trick in the compiler, but it's an easy one to verify and understand.
It's true--and they know about it (Score:5, Interesting)
A workaround for one of the compiler's tricks (Score:5, Informative)
From: iccOut (iccout2004@yahoo.com)
Subject: sleazy intel compiler trick (SOURCE ATTACHED)
Newsgroups: comp.arch
Date: 2004-02-09 14:38:40 PST
As part of my study of operating systems and embedded systems, one of
the things I've been looking at is compilers. I'm interested in
analyzing how different compilers optimize code for different
platforms. As part of this comparison, I was looking at the Intel
Compiler and how it optimizes code. The Intel Compilers have a free
evaluation download from here:
http://www.intel.com/products/software/index.htm?
One of the things that version 8.0 of the Intel compiler included
was an "Intel-specific" flag. According to the documentation, binaries
compiled with this flag would only run on Intel processors and would
include Intel-specific optimizations to make them run faster. The
documentation was unfortunately lacking in explaining what these
optimizations were, so I decided to do some investigating.
First I wanted to pick a primarily CPU-bound test to run, so I chose
SPEC CPU2000. The test system was a P4 3.2 GHz Extreme Edition with
1 GB of RAM running Windows XP Pro. First I compiled and ran SPEC with
the "generic x86 flag" (-QxW), which compiles code to run on any x86
processor. After running the generic version, I recompiled and ran
SPEC with the "Intel-specific flag" (-QxN) to see what kind of
difference that would make. For most benchmarks, there was not very
much change, but for 181.mcf, there was a win of almost 22%!
Curious as to what sort of optimizations the compiler was doing to
allow the Intel-specific version to run 22% faster, I tried running
the same binary on my friend's computer. His computer, the second test
machine, was an AMD FX51, also with 1 GB of RAM, running Windows XP
Pro. First I ran the "generic x86" binaries on the FX51, and then
tried to run the "Intel-only" binaries. The Intel-specific ones
printed out an error message saying that the processor was not
supported and exited. This wasn't very helpful. Was it true that only
Intel processors could take advantage of this performance boost?
I started mucking around with a disassembly of the Intel-specific
binary and found one particular call (proc_init_N) that appeared to be
performing this check. As far as I can tell, this call is supposed to
verify that the CPU supports SSE and SSE2, and it checks the CPUID to
ensure that it's an Intel processor. I wrote a quick utility, which I
call iccOut, to go through a binary that has been compiled with this
Intel-only flag and remove that check.
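As a rough illustration of the difference between checking what the CPU can do and checking who made it, here is a small C++ sketch. It is not a reconstruction of proc_init_N; the struct and function names are invented for this example, and the register values are passed in as plain numbers instead of being read with the CPUID instruction, so the sketch runs on any machine. The two facts it relies on are standard: CPUID leaf 0 returns the vendor string in EBX, EDX, ECX, and CPUID leaf 1 reports SSE in EDX bit 25 and SSE2 in EDX bit 26.

```cpp
#include <cstdint>
#include <string>

// Hypothetical container for the CPUID results this sketch needs.
struct CpuidInfo {
    std::uint32_t vendor_ebx;    // leaf 0, EBX: vendor characters 0..3
    std::uint32_t vendor_edx;    // leaf 0, EDX: vendor characters 4..7
    std::uint32_t vendor_ecx;    // leaf 0, ECX: vendor characters 8..11
    std::uint32_t features_edx;  // leaf 1, EDX: feature bits
};

// Rebuilds the 12-character vendor string, low byte of each register first.
std::string vendor_string(const CpuidInfo& c) {
    const std::uint32_t regs[3] = {c.vendor_ebx, c.vendor_edx, c.vendor_ecx};
    std::string out;
    for (std::uint32_t reg : regs) {
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<char>((reg >> shift) & 0xFFu));
        }
    }
    return out;
}

// Capability check: only asks whether SSE and SSE2 are present.
bool has_sse_and_sse2(const CpuidInfo& c) {
    const std::uint32_t kSse = 1u << 25;
    const std::uint32_t kSse2 = 1u << 26;
    return (c.features_edx & kSse) != 0 && (c.features_edx & kSse2) != 0;
}

// Vendor-gated check: the kind of test the post suspects, where the
// vendor string must match in addition to the feature bits.
bool vendor_gated_check(const CpuidInfo& c) {
    return vendor_string(c) == "GenuineIntel" && has_sse_and_sse2(c);
}
```

Fed the register values of an "AuthenticAMD" part with both feature bits set, has_sse_and_sse2 returns true while vendor_gated_check returns false, which matches the behavior described above: the hardware is capable, but the binary refuses to run.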
Once I ran the binary that was compiled with the Intel-specific flag
(-QxN) through iccOut, it was able to run on the FX51. Much to my
surprise, it ran fine and did not miscompare. On top of that, it got
the same 22% performance boost that I saw on the Pentium 4 with an
actual Intel processor. This is very interesting to me, since it
appears that in fact no Intel-specific optimization has been done if
the AMD processor is also capable of taking advantage of these same
optimizations. If I'm missing something, I'd love for someone to point
it out for me. From the way it looks right now, it appears that Intel
is simply "cheating" to make their processors look better against
competitors' processors.
Links:
Intel Compiler: http://www.intel.com/products/software/i
Here is the text:
/*
 * iccOut 1.0
 *
 * This program enables programs compiled with the intel compiler using the
 * -xN flag to run on non-intel processors. This can sometimes result in
 * large performance increases, depending on the application. Note that even
 * though the check will be removed, the CPU running the application *MUST*
 * support both SSE and SSE2 or the program will crash.
 *
 */
#include <stdio.h>
#include <string.h>

// x86 codes
#define X86_CALL 232 // E8 in hex
#define PUSH_EAX 80  // 50 in hex
#define X86_NOP 144  // 90 in hex

bool handleCall( unsigned char theBuffer[7], FILE* inputBinary, FILE* fixedBinary );

// conveniently, the check always seems to be one of the first calls in
// the file. this makes it easier to find.
void printUsage() {
printf("Usage:\n");
printf("iccOut filename\n\n");
printf("Filename is the name of the file to fix.\n\n");
}
// returns whether code was replaced
bool processNextCall( FILE* inputBinary, FILE* fixedBinary ) {
    int lenRead;
    int startIndex, bytesNeeded;
    unsigned char addressBuffer[4];
    unsigned char checkBuffer[2];
    unsigned char fullBuffer[7];
    unsigned char tempChar;
    bool codeReplaced;
    bool otherReplaced;

    otherReplaced = false;
    // the caller has already consumed the call opcode, so read the 4-byte
    // call operand and the 2 bytes that follow it
    lenRead = fread( &addressBuffer, 1, 4, inputBinary );
    lenRead = fread( &checkBuffer, 1, 2, inputBinary );
    fullBuffer[0] = X86_CALL;
    for ( int i = 1; i < 5; i++ ) {
        fullBuffer[i] = addressBuffer[i-1];
    }
    fullBuffer[5] = checkBuffer[0];
    fullBuffer[6] = checkBuffer[1];
    codeReplaced = handleCall( fullBuffer, inputBinary, fixedBinary );
    if ( ! codeReplaced ) {
        // if either of the last 2 bytes were a call, we need to keep doing this
        // until we run out of calls
        while ( ( fullBuffer[5] == X86_CALL ) || ( fullBuffer[6] == X86_CALL ) ) {
            if ( fullBuffer[5] != X86_CALL ) {
                // the call opcode is the last byte: flush the byte before it
                tempChar = fullBuffer[5];
                fwrite( &tempChar, 1, 1, fixedBinary );
                fullBuffer[0] = fullBuffer[6];
                bytesNeeded = 6;
                startIndex = 1;
            } else {
                fullBuffer[0] = fullBuffer[5];
                fullBuffer[1] = fullBuffer[6];
                bytesNeeded = 5;
                startIndex = 2;
            }
            for ( int i = 0; i < bytesNeeded; i++ ) {
                fread( &tempChar, 1, 1, inputBinary );
                fullBuffer[startIndex+i] = tempChar;
            }
            otherReplaced = otherReplaced || handleCall( fullBuffer, inputBinary, fixedBinary );
        }
    }
    return ( codeReplaced || otherReplaced );
}
bool handleCall( unsigned char theBuffer[7], FILE* inputBinary, FILE* fixedBinary ) {
    bool replacedCode;
    unsigned char tempChar;

    replacedCode = false;
    // check if it's what we're looking for (one of the first calls
    // followed by 2 push eax's)
    if ( ( theBuffer[5] == PUSH_EAX ) && ( theBuffer[6] == PUSH_EAX ) ) {
        printf("Located call to subroutine to check intel support!\n");
        printf("Substituting code...\n");
        // replace the call with nops
        replacedCode = true;
        for ( int i = 0; i < 5; i++ ) {
            theBuffer[i] = X86_NOP;
        }
    }
    if ( replacedCode || ( ( theBuffer[5] != X86_CALL ) && ( theBuffer[6] != X86_CALL ) ) ) {
        // write out the two as they were
        for ( int j = 0; j < 7; j++ ) {
            tempChar = theBuffer[j];
            fwrite( &tempChar, 1, 1, fixedBinary );
        }
    } else {
        // don't write last 2 bytes
        for ( int i = 0; i < 5; i++ ) {
            tempChar = theBuffer[i];
            fwrite( &tempChar, 1, 1, fixedBinary );
        }
    }
    return replacedCode;
}
void fixIntelBinary( char *filename ) {
    FILE *inputBinary;
    FILE *fixedBinary;
    char fixedName[1024];
    unsigned char theChar;
    bool editedCall;
    bool skipWrite;
    int lenRead;

    printf("iccOut is currently fixing binary: %s\n\n", filename );
    editedCall = false;
    skipWrite = false;
    // build the output name in its own buffer; appending ".fixed" to
    // filename in place would write past the end of the argv string
    if ( snprintf( fixedName, sizeof( fixedName ), "%s.fixed", filename ) >= (int) sizeof( fixedName ) ) {
        printf("File name is too long.\n");
        return;
    }
    inputBinary = fopen( filename, "rb" );
    if ( ! inputBinary ) {
        printf("Error opening input binary.\n");
        return;
    }
    fixedBinary = fopen( fixedName, "wb" );
    if ( ! fixedBinary ) {
        printf("Error opening output file.\n");
        fclose( inputBinary );
        return;
    }
    // copy byte by byte until the first call opcode turns up
    lenRead = fread( &theChar, 1, 1, inputBinary );
    while ( lenRead != 0 ) {
        if ( !skipWrite ) {
            fwrite( &theChar, 1, 1, fixedBinary );
        }
        skipWrite = false;
        lenRead = fread( &theChar, 1, 1, inputBinary );
        if ( lenRead == 0 ) {
            break;
        }
        if ( ! editedCall ) {
            if ( theChar == X86_CALL ) {
                // processNextCall writes the call bytes itself
                editedCall = processNextCall( inputBinary, fixedBinary );
                skipWrite = true;
            }
        }
    }
    printf("iccOut has saved the day!\n");
    fclose( inputBinary );
    fclose( fixedBinary );
}
bool fileExists( char *filename ) {
    FILE *temp;
    bool ret = false;

    temp = fopen( filename, "r" );
    if ( temp != 0 ) {
        ret = true;
        fclose( temp );
    }
    return ret;
}
int main( int argc, char **argv ) {
    printf("\nWelcome to iccOut!\n\n");
    printf("This will enable binaries compiled with -xN to run on non-intel machines\n\n");
    // verify parameters
    if ( argc < 2 ) {
        printUsage();
        return 0;
    }
    // make sure file exists
    if ( ! fileExists( argv[1] ) ) {
        printf("File does not exist or is not accessible: %s\n", argv[1] );
        return 0;
    }
    fixIntelBinary( argv[1] );
    return 0;
}
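For anyone who wants to see the core of the patch without the file streaming, here is a minimal C++ sketch of the same idea on an in-memory buffer. It is an illustration under assumptions, not the posted program: patch_first_check is a made-up name, and unlike iccOut it tests every offset, where the program above skips over the operand bytes of each call it examines. The pattern is the one the source looks for: a call opcode (E8), its 4-byte operand, then two "push eax" bytes (50 50).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: finds the first E8 xx xx xx xx 50 50 sequence and
// overwrites the 5-byte call with NOPs. Returns the offset of the patched
// call, or -1 if the pattern does not occur.
long patch_first_check(std::vector<unsigned char>& code) {
    const unsigned char kCall = 0xE8;     // call rel32
    const unsigned char kPushEax = 0x50;  // push eax
    const unsigned char kNop = 0x90;      // nop

    for (std::size_t i = 0; i + 7 <= code.size(); ++i) {
        if (code[i] == kCall &&
            code[i + 5] == kPushEax &&
            code[i + 6] == kPushEax) {
            // Same length in, same length out: later offsets stay valid.
            for (std::size_t j = 0; j < 5; ++j) {
                code[i + j] = kNop;
            }
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Replacing the call with five NOPs, instead of deleting it, keeps the file the same size, so no other offsets in the binary need fixing up. The two pushes are left alone, as in the original. The same warning applies as in the header comment above: with the check gone, a CPU without SSE and SSE2 will crash.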