First of all, for those wondering, vectorization [1] is the process of transforming a sequential program, where each instruction in a loop iteration operates on a single pair of operands, into a vector program, where a single instruction performs the same operation on a pair of vectors. By vector I mean here a set of data elements of the same type.
Fairly modern machines/architectures (namely x86) have one or more such vector processing units, driven by a SIMD instruction set (usually SSE, see below). SIMD [2] is one of several ways of implementing parallelism in computer hardware. In this scheme, a set of CPU components performs the same operation at the same time, each on different data, which keeps these components busy instead of spending separate CPU cycles on repetitions of the same operation.
For instance, the following loop can be vectorized:
#define SIZE 1024
int a[SIZE], b[SIZE], c[SIZE], d[SIZE];
void linear (void)
{
    int i;
    for (i = 0; i < SIZE; i++)
    {
        a[i] = b[i] * c[i] + d[i];
    }
}
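To make the transformation concrete, here is a rough sketch of what a hand-vectorized version of the loop above might look like. It is only an illustration and not what GCC actually emits: it assumes the GCC/Clang vector_size extension, and the names v4si, LANES and linear_by_hand are made up for this example. Which machine instructions each vector operation maps to is still up to the compiler and the enabled instruction set.

```c
#include <string.h>

#define SIZE 1024
#define LANES 4  /* four 32-bit ints fit in one 16-byte vector */

/* GCC/Clang vector extension: a 16-byte vector holding four ints. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical hand-vectorized counterpart of linear(): every pass of the
   loop handles LANES elements using one vector multiply and one vector add.
   Assumes SIZE is a multiple of LANES. */
void linear_by_hand(int *a, const int *b, const int *c, const int *d)
{
    int i;
    for (i = 0; i < SIZE; i += LANES)
    {
        v4si vb, vc, vd, va;
        memcpy(&vb, &b[i], sizeof vb);   /* load b[i..i+3] */
        memcpy(&vc, &c[i], sizeof vc);   /* load c[i..i+3] */
        memcpy(&vd, &d[i], sizeof vd);   /* load d[i..i+3] */
        va = vb * vc + vd;               /* element-wise multiply, then add */
        memcpy(&a[i], &va, sizeof va);   /* store a[i..i+3] */
    }
}
```

The loop now runs SIZE / 4 times instead of SIZE times, which is roughly the saving the autovectorizer is after when it rewrites the scalar loop on its own.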
If parallelism is viable, the compiler's autovectorization pass may produce high-performance code that makes heavy use of the processor's vector units. This is certainly of supreme importance in a high-performance distribution (it reduces the total number of instructions executed, where applicable). So after some research, I decided to enable vectorization in the CFLAGS on all of my Gentoo machines (as one may see on its homepage, vectorization is enabled automatically at -O3 [3]). Gentoo should be the fastest Linux distro around; END OF STORY.
CFLAGS="-march=nocona -mtune=nocona -O2 -pipe -fomit-frame-pointer -ftree-vectorizer-verbose=5 -ftree-vectorize -fassociative-math -msse -msse2 -msse3"
The option that does this is -ftree-vectorize, combined with the -msse and -msse2 options (which enable the SIMD instruction sets on the x86 (amd64) arch). If you have a more capable SIMD instruction set you may use it too (e.g. -msse3, -msse4.1, -m3dnow (:-P), etc.), while -fassociative-math can be used instead of -ffast-math to enable vectorization of reductions of floats.
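Why do float reductions need an extra flag at all? Floating-point addition is not associative, so summing in a different order can change the result, and the compiler will not reorder unless told it may. The sketch below illustrates this with two hypothetical functions (sum_in_order and sum_two_lanes are names invented for this example). The two-lane split is only an assumption about the kind of reordering a vectorizer might do, not GCC's actual output.

```c
/* Sums the array strictly left to right, as the C source order demands. */
float sum_in_order(const float *x, int n)
{
    volatile float acc = 0.0f;   /* volatile: round every partial sum to float */
    int i;
    for (i = 0; i < n; i++)
        acc = acc + x[i];
    return acc;
}

/* Sums even and odd positions in two separate accumulators and combines
   them at the end, the way a two-lane vector reduction would. */
float sum_two_lanes(const float *x, int n)
{
    volatile float lane0 = 0.0f, lane1 = 0.0f;
    int i;
    for (i = 0; i + 1 < n; i += 2)
    {
        lane0 = lane0 + x[i];
        lane1 = lane1 + x[i + 1];
    }
    if (i < n)                   /* leftover element when n is odd */
        lane0 = lane0 + x[i];
    return lane0 + lane1;
}
```

With the input { 1e8f, 1.0f, -1e8f, 1.0f } the in-order sum loses the first 1.0f when it is added to 1e8f, while the two-lane version keeps both small values in their own accumulator, so the two functions return different results. That difference is what -fassociative-math tells the compiler it is allowed to ignore.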
Here are some before-and-after results showing the vectorization speedup on:
rodos ~ # uname -a
Linux rodos 2.6.27-gentoo-r8-korki-pentium-prescott #8 SMP Mon Mar 9 19:28:11 EET 2009 x86_64 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux
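The results follow in the snippet below; they are for 8192-byte blocks and expressed in thousands (kilo). The last column is the relative difference between the two builds, (enabled - disabled) / disabled.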
Algorithm    Vectorize Enabled    Vectorize Disabled    Relative gain
md2                    7135,23               6435,79             0,11
mdc2                   9210,54               9404,42            -0,02
md4                  901018,97             636548,84             0,42
md5                  546226,18             306845,91             0,78
hmac(md5)            551026,69             551318,87             0
sha1                 299592,36             296080,73             0,01
rmd160               174093,65             148373,5              0,17
rc4                  439552,68             439959,55             0
blowfish              99672,06              98432,34             0,01
aes-128              111878,14             111209,13             0,01
aes-192              102440,96             100368,38             0,02
aes-256               91635,71              91280,73             0
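The last column of the table appears to be the relative gain of the vectorized build, which can be recomputed from the other two columns. A minimal sketch follows; the formula is inferred from the numbers above rather than stated anywhere, and relative_gain is a name made up for this example.

```c
/* Relative gain of the vectorized build over the non-vectorized one:
   0.0 means no change, 0.5 means 50% more throughput, negative means slower. */
double relative_gain(double enabled, double disabled)
{
    return (enabled - disabled) / disabled;
}
```

For the md5 row, relative_gain(546226.18, 306845.91) comes out at about 0.78, and for mdc2, relative_gain(9210.54, 9404.42) is about -0.02, matching the table.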
The same results in a more comprehensible representation:
