Weight Distribution Utility for Binary Linear Codes
with AVX2 (Haswell New Instructions) and OpenCL GPGPU Support
Efficiently counting bits is an interesting software problem, though few applications actually depend on it. One exception is computing the weight distribution of a binary linear code: a program that finds the weight distribution spends most of its time counting bits. The program blcutil (binary linear code utility) was created to determine which bit counting method is fastest on modern (2012) x86 processors.
The utility uses separately compiled code paths for operations on 256 bits or fewer, 257-512 bits, and 513-65536 bits. Each of these three paths is further divided into six algorithms: general purpose register, SSSE3, XMM+popcnt, YMM+popcnt, AVX, and AVX2. One of the 18 separately compiled code paths is chosen based on codeword size, processor type, and command line options. The benchmark below compares the new utility to existing ones. GAP is an open source computational discrete algebra system. Magma is a commercial, closed source computational algebra system. The new utility blcutil is open source. The binary linear code used for this test is the [256,32,96] code returned by Magma's best known binary linear code function. Input files for all three programs are included in the blcutil release. Scores are in seconds to completion (smaller is better).
Execution Time in Seconds for [256,32,96] Binary Linear Code Weight Distribution

  Program              Time (s)
  GAP 4.5.7              298
  Magma V2.19-5           77.5
  blcutil 1.1 (CPU)        1.82
  blcutil 1.1 (GPU)        0.30
Notes
Compiler used for blcutil: gcc version 4.7.3 (mingw-w64-bin-x86_64-20121103.7z) from http://www.drangon.org/mingw/. The Microsoft Visual Studio 2012 compiler also produces good results; executables built with both compilers are included in the release.
blcutil requires a 64-bit version of Windows, and Windows 7 or newer is required to run the AVX and AVX2 functions.
GAP and blcutil were tested on a 4.0 GHz Intel Core i7-2600K processor with an AMD Radeon HD 7850 GPU. Magma scores are from the vendor's public server.
Optimizations
Codeword generation: Generating the codewords of an [N, K] code requires XORing every combination of the K generator rows. The number of XOR operations can be reduced to roughly one per codeword with a lookup table of suitable size. But if the lookup table is too large, it will not fit in the processor's L1 cache and performance will suffer. To overcome this problem, a two stage lookup table is used. For example, a single stage lookup table for 12 bits needs 2^12 = 4096 entries, but a two stage lookup table for 12 bits needs as few as 2^6 + 2^6 = 128 entries. The optimal sizing of the two lookup tables depends on several factors, so the utility runs a quick benchmark to determine it. A command line option specifying the two lookup table sizes can be used to bypass the benchmark tuning.
Multithreading: The utility uses all
available processor cores. The codeword list is partitioned into 16 *
[number of cores] sections. All of the cores are dispatched. When a core
needs work, it grabs the next available section from the codeword list. The
core updates a set of local weight counters as it processes the section.
When the section is complete, the core updates the master weight counters
with data from the local weight counters.
Use of local weight counters by the cores eliminates synchronization
overhead as the cores update counters in parallel. Synchronization is
needed only when the core completes a section and updates the master list.
This synchronization, and the synchronization needed when the cores grab
sections from the list, is handled by compiler intrinsic functions that
generate the x86 lock instruction prefix. This method eliminates the
overhead of an OS call.
Extended precision XOR and population count handling: The utility contains variable precision functions for XOR and population count
that operate on arrays of up to 65536 bits. If the codeword size N is 256
bits or less, these functions will exit their loops after one pass. While
this approach is fairly effective, the presence of the looping logic does
have a measurable impact on performance. For codeword size of 256 bits or
less, functions with no looping logic at all are somewhat faster. To get maximum
performance for codeword size <= 256 and still accept sizes of up to 65536
bits, the utility uses separately compiled execution paths. The single
source code for the function is compiled twice, once for maximum codeword
size 256 and once for maximum codeword size 65536. When compiled for size
256, the compiler's optimizer completely removes the looping logic and its
overhead. The same mechanism is used for codeword size 257-512 bits. The
variable precision functions are used for codeword size 513-65536.
Combined XOR and population count operations: When the data size is small, the compiler's optimizer can
combine separately coded XOR and population count functions so that no
intermediate result is written to memory. But for the case of extended
precision operations, the compiler is unable to do this. To overcome this
limitation, the XOR and population count operations are combined into a single
function in the source code.
Loop unrolling: The loop for generating
and weighing codewords is unrolled 8 times. This results in a significant
performance gain. An unoptimized loop is used for codes that are too small
for unrolled execution.
Data alignment: Codewords are aligned
to at least two cache lines for best cache efficiency.
Processor feature usage: At startup,
the utility detects the presence of x86 instruction set features: SSSE3,
popcnt, AVX, AVX2. One of 6 algorithms is chosen based on available
processor features. Command line options can override the automatic
algorithm selection.
GPGPU mode: When the command line option -opencl invokes GPU processing mode, the CPU cores do no computing work at all; instead, the same algorithm used for CPU mode runs on the GPU device. At startup, a short benchmark determines the number of work items that the GPU can handle efficiently. Another short benchmark then determines how to break the processing into segments of one to two seconds each. The GPU work is divided up this way to avoid a problem where a Windows watchdog timer restarts the graphics driver.
Revision History
version 1.0 02/18/2013 Initial version.
version 1.1 05/06/2013 Add OpenCL support.
version 1.1a 09/25/2016 Built with no OpenCL support. Removes 64 processor limitation.