Weight Distribution Utility for Binary Linear Codes
with AVX2 (Haswell New Instructions) and OpenCL GPGPU Support
Efficiently counting bits is an interesting software problem, though few applications actually depend on it. One exception is computing the weight distribution of a binary linear code: a program that finds the weight distribution spends most of its time counting bits. The program blcutil (binary linear code utility) was created to determine which bit counting method is fastest on modern (2012) x86 processors.
The utility uses separately compiled code paths for operations on 256 bits or fewer, 257-512 bits, and 513-65536 bits. Each of these three paths is further divided into six algorithms: general purpose register, SSSE3, XMM+popcnt, YMM+popcnt, AVX, and AVX2. One of the 18 separately compiled code paths is chosen based on codeword size, processor type, and command line options. The benchmark below compares the new utility to existing ones. GAP is an open source computational discrete algebra system. Magma is a commercial, closed source computational algebra system. The new utility blcutil is open source. The binary linear code used for this test is the [256,32,96] code returned by Magma's best known binary linear code function. Input files for all three programs are included in the blcutil release. Scores are in seconds to completion (smaller is better).
Execution Time in Seconds for [256,32,96] Binary Linear Code Weight Distribution

  Program              Time (s)
  GAP 4.5.7              298
  Magma V2.19-5           77.5
  blcutil 1.1 (CPU)        1.82
  blcutil 1.1 (GPU)        0.30
Notes
Compiler used for blcutil: gcc version 4.7.3 (mingw-w64-bin-x86_64-20121103.7z) from http://www.drangon.org/mingw/. The Microsoft Visual Studio 2012 compiler also produces good results; executables built with both compilers are included in the release.
blcutil requires a 64-bit version of Windows, and Windows 7 or newer is required to run the AVX and AVX2 functions.
GAP and blcutil were tested on a 4.0 GHz Intel Core i7-2600K processor with an AMD Radeon HD 7850 GPU. Magma scores are from the vendor's public server.
Optimizations
Codeword generation: Generating the codewords of an [N, K] code requires XORing every combination of the K generator rows. The number of XOR operations can be reduced to roughly one per codeword with a lookup table of suitable size. But if the lookup table is too large, it will not fit in the processor's L1 cache and performance will suffer. To overcome this problem, a two stage lookup table is used. For example, a single stage lookup table for 12 bits needs 2^12 = 4096 entries, but a two stage lookup table for 12 bits needs as few as 2^6 + 2^6 = 128 entries. The optimal sizing of the two lookup tables depends on several factors, so the utility runs a quick benchmark to determine it. A command line option specifying the two lookup table sizes can be used to bypass the benchmark tuning.
Multithreading: The utility uses all
available processor cores. The codeword list is partitioned into 16 *
[number of cores] sections. All of the cores are dispatched. When a core
needs work, it grabs the next available section from the codeword list. The
core updates a set of local weight counters as it processes the section.
When the section is complete, the core updates the master weight counters
with data from the local weight counters.
Use of local weight counters by the cores eliminates synchronization
overhead as the cores update counters in parallel. Synchronization is
needed only when the core completes a section and updates the master list.
This synchronization, and the synchronization needed when the cores grab
sections from the list, is handled by compiler intrinsic functions that
generate the x86 lock instruction prefix. This method eliminates the
overhead of an OS call.
Extended precision XOR and population count handling: The utility contains variable precision functions for XOR and population count
that operate on arrays of up to 65536 bits. If the codeword size N is 256
bits or less, these functions will exit their loops after one pass. While
this approach is fairly effective, the presence of the looping logic does
have a measurable impact on performance. For codeword size of 256 bits or
less, functions with no looping logic at all are somewhat faster. To get maximum
performance for codeword size <= 256 and still accept sizes of up to 65536
bits, the utility uses separately compiled execution paths. The single
source code for the function is compiled twice, once for maximum codeword
size 256 and once for maximum codeword size 65536. When compiled for size
256, the compiler's optimizer completely removes the looping logic and its
overhead. The same mechanism is used for codeword size 257-512 bits. The
variable precision functions are used for codeword size 513-65536.
Combined XOR and population count operations: When the data size is small, the compiler's optimizer can
combine separately coded XOR and population count functions so that no
intermediate result is written to memory. But for the case of extended
precision operations, the compiler is unable to do this. To overcome this
limitation, the XOR and population count operations are combined into a single
function in the source code.
Loop unrolling: The loop for generating
and weighing codewords is unrolled 8 times. This results in a significant
performance gain. An unoptimized loop is used for codes that are too small
for unrolled execution.
Data alignment: Codewords are aligned
to at least two cache lines for best cache efficiency.
Processor feature usage: At startup,
the utility detects the presence of x86 instruction set features: SSSE3,
popcnt, AVX, AVX2. One of 6 algorithms is chosen based on available
processor features. Command line options can override the automatic
algorithm selection.
GPGPU mode: When the command line option -opencl invokes GPU processing mode, the CPU cores do no computing work at all; instead, the same algorithm used for CPU mode runs on the GPU device. At startup, a short benchmark determines the number of work items that the GPU can handle efficiently. Another short benchmark then determines how to break the processing into segments of one to two seconds each. The GPU work is divided up this way to avoid a problem where a Windows watchdog timer restarts the graphics driver.
Revision History
version 1.0 02/18/2013 Initial version.
version 1.1 05/06/2013 Add OpenCL support.
version 1.1a 09/25/2016 Built with no OpenCL support. Removes 64 processor limitation.