November 18, 2012: version 2.4 - performance improvements
Execution time in seconds for "ppsearch.exe bits=2 maxbits=929 loop"
on an Intel Core i7 2600K at 4.5 GHz:

              version 2.3      version 2.4
              SSE     AVX      SSE     AVX
Microsoft     402     148      344     122
gcc           n/a     n/a      348     111
- New function shiftLeftThenXor replaces a shiftLeft/xor
sequence in the optimized reduction function. Although the total number of
shift and xor operations is unchanged, eliminating the memory load/store
for the xor operation improves performance.
- Add support for the gcc compiler. The gcc build runs about 10%
faster for AVX execution (111 vs. 122 seconds); the Microsoft build is
slightly faster when AVX is not available.
- When the processor supports AVX, the program runs AVX-optimized
versions of all functions. Although versions 2.2 and 2.3 use the AVX clmul
instruction, they do not enable the AVX enhancements to SSE instructions.
- Tune the code that screens candidate polynomials
using small divisors: remove duplicates from the small divisor list,
replace 32-bit registers with 64-bit registers in the function that divides
by a small polynomial, and run this screen on all polynomials instead of only
those of degree 96 or higher.
- Fix a loop count that was one too large in function
multiplyPolynomialAvx, for a slight performance improvement.
- Correct some errors in the debug output logged when the
verbose option is used.
source code
factor files -
These are derived from a list from
here and reformatted with a utility. In
addition, selected factorizations beyond the 2^1200 - 1 cutoff of the Cunningham
tables have been added. Unlike the factorizations through 2^1200 - 1, these
additional factorizations do not come from a recognized source, so the
primality of their factors should be confirmed before use.
sample win64 executable built with the gcc 4.7.3 compiler.
sample win64 executable built with the Microsoft Visual Studio 2010 compiler.