Bolt 1.1
C++ template library with support for OpenCL
Building TBB (Intel) with Bolt

TBB (Intel)

Intel Threading Building Blocks (also known as TBB (Intel)) is a library developed by Intel Corporation for writing software that takes advantage of multi-core processors. The library consists of data structures and algorithms that let a programmer avoid some of the complications of native threading packages such as POSIX threads, Windows ® threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually.

Bolt supports parallelization using Intel Threading Building Blocks (TBB (Intel)). You can switch between the CL/AMP and TBB (Intel) code paths by changing the control structure.

Setting up TBB (Intel) with Bolt

To start using the high-performance multicore routines in Bolt:

  1. Install TBB (Intel).
  2. On Windows ®, add TBB_ROOT to your environment variables, e.g. TBB_ROOT=<path-to-tbb-root>.
  3. Run the batch file tbbvars.bat (e.g. tbbvars.bat intel64 vs2012), which is in the %TBB_ROOT%\bin\ directory. This batch file takes two arguments: <arch>, the target architecture (32-bit or 64-bit), and <vs>, the version of Visual Studio.
  4. To set this globally, append the TBB (Intel) DLL path, e.g. %TBB_ROOT%\intel64\vc11, to the PATH environment variable. This sets all the paths required for TBB (Intel).

NOTE: On Linux ®, set the TBB_ROOT, PATH and LD_LIBRARY_PATH variables (no spaces around '=').
E.g. 'export TBB_ROOT=<path-to-tbb-root>'
'export LD_LIBRARY_PATH=<path-to-tbb-root>/lib/intel64/gcc-4.4:$LD_LIBRARY_PATH'
'export PATH=<path-to-tbb-root>/include:$PATH'

Then install CMake (see Using CMake build infrastructure). To enable TBB (Intel), check the BUILD_TBB check box in the CMake configuration list, as shown below. The rest of the build procedure is unchanged.

[Figure: cmake.build.first.tbb.png. Caption: Check BUILD_TBB]

After a successful build, the TBB (Intel) paths are shown in the Visual Studio Output tab, as shown below.

[Figure: bolt.visual.studio.build.tbb.png. Caption: TBB_VS_Build]

TBB (Intel) routines in Bolt

The following Bolt routines support the TBB (Intel) multicore path, listed along with the backend:

Running TBB (Intel) routines in Bolt

Control object

A Bolt function can be forced to run on a specified device. The default is "Automatic", in which case the Bolt runtime selects the device. Forcing the mode to MultiCoreCpu runs the function on all detected cores. There are two ways in Bolt to force the control to MultiCoreCpu.

  1. Setting the control to MultiCoreCpu globally:
    myControl.waitMode( bolt::cl::control::NiceWait );
    myControl.setForceRunMode( bolt::cl::control::MultiCoreCpu );
    This sets the control to MultiCoreCpu globally, so every subsequent call to a Bolt function runs the multicore CPU path. (This assumes myControl is a reference to the default control object, e.g. one obtained from bolt::cl::control::getDefault(); its declaration is not shown here.)
  2. Setting the control to MultiCoreCpu locally:
    myControl.setForceRunMode(bolt::cl::control::MultiCoreCpu);
    This sets the control to MultiCoreCpu locally (assuming myControl is a separate control object rather than the default one). Passing this control object as the first parameter to a Bolt function enables the multicore path only for that call.

The AMP backend works the same way; only the namespace changes from CL (bolt::cl) to AMP (bolt::amp).
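The difference between the global and local settings comes down to whether the shared default object or a private copy is modified. The sketch below illustrates that idea with a hypothetical stand-in class, DemoControl; it is not Bolt's control class, and the names RunMode, DemoControl, makeLocalMultiCore and forceGlobalMultiCore are invented for illustration only.

```cpp
#include <cassert>

// Hypothetical stand-in for a control class; NOT Bolt's actual implementation.
enum class RunMode { Automatic, SerialCpu, MultiCoreCpu };

class DemoControl {
public:
    // One shared default instance for the whole program.
    static DemoControl& getDefault() {
        static DemoControl instance;
        return instance;
    }
    void setForceRunMode(RunMode mode) { mode_ = mode; }
    RunMode getForceRunMode() const { return mode_; }

private:
    RunMode mode_ = RunMode::Automatic;
};

// Local setting: copy the default, then change only the copy.
inline DemoControl makeLocalMultiCore() {
    DemoControl local = DemoControl::getDefault();   // copy, not a reference
    assert(&local != &DemoControl::getDefault());    // the copy is a distinct object
    local.setForceRunMode(RunMode::MultiCoreCpu);
    return local;
}

// Global setting: change the shared default through a reference.
inline void forceGlobalMultiCore() {
    DemoControl& global = DemoControl::getDefault(); // reference to the shared object
    global.setForceRunMode(RunMode::MultiCoreCpu);
}
```

In this sketch, a caller that wants the multicore path for a single call would pass the object returned by makeLocalMultiCore(), while forceGlobalMultiCore() affects every later reader of the default.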

Other Scenarios:

  1. Using the MultiCoreCpu flag with a Bolt function when TBB (Intel) is not installed on the machine throws an exception such as "The MultiCoreCpu version of <function> is not enabled to be built." Make sure that TBB (Intel) is installed on the system before forcing this mode.
  2. The default mode is "Automatic", which means Bolt tries the OpenCL ™ path first, then TBB (Intel), then SerialCpu. The examples in the next subsection show how TBB (Intel) parallelization is used with different functions.
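The two scenarios above can be modeled as a small decision function. This is only a sketch of the behavior described in the text, not Bolt's dispatch code: the enums Mode and Path, the functions selectPath and pathName, and the two availability flags are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <stdexcept>

enum class Mode { Automatic, MultiCoreCpu };           // mode requested by the caller
enum class Path { OpenCL, MultiCoreCpu, SerialCpu };   // path that ends up running

// Models the documented rules: a forced MultiCoreCpu request fails when TBB
// is unavailable; Automatic falls back OpenCL -> TBB -> SerialCpu.
inline Path selectPath(Mode requested, bool openclAvailable, bool tbbAvailable) {
    if (requested == Mode::MultiCoreCpu) {
        if (!tbbAvailable) {
            throw std::runtime_error(
                "The MultiCoreCpu version is not enabled to be built.");
        }
        return Path::MultiCoreCpu;
    }
    if (openclAvailable) return Path::OpenCL;
    if (tbbAvailable) return Path::MultiCoreCpu;
    return Path::SerialCpu;
}

// Human-readable name for logging which path was chosen.
inline const char* pathName(Path p) {
    switch (p) {
        case Path::OpenCL:       return "OpenCL";
        case Path::MultiCoreCpu: return "MultiCoreCpu";
        case Path::SerialCpu:    return "SerialCpu";
    }
    assert(false && "unhandled Path value");
    return "";
}
```

A caller that forces the multicore mode would typically wrap the call in a try/catch block so that a missing TBB installation is reported rather than terminating the program.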

Example:

Transform Reduce:

transform_reduce applies the transformation defined by unary_op to the input, producing a temporary sequence, and then performs a reduce on the transformed sequence.

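.....
int length = 10;
std::vector< float > input( length );
// ctl is a bolt::cl::control object (its declaration is elided above)
ctl.setForceRunMode(bolt::cl::control::MultiCoreCpu);
bolt::cl::negate<float> unary_op;   // transformation applied to each element
bolt::cl::plus<float> binary_op;    // reduction operator
// 4.f is the initial value of the reduction
float boltReduce = bolt::cl::transform_reduce(ctl, input.begin(), input.end(), unary_op, 4.f, binary_op );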

AMP backend variant:

...
int length = 10;
std::vector< float > input( length );
ctl.setForceRunMode(bolt::amp::control::MultiCoreCpu);
bolt::amp::negate<float> unary_op;
bolt::amp::plus<float> binary_op;
float boltReduce = bolt::amp::transform_reduce(ctl, input.begin(), input.end(), unary_op, 4.f, binary_op );
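When checking a multicore result, it can help to compare against a serial computation of the same quantity. The following is a minimal standard-library sketch of what a transform-then-reduce computes; it does not use Bolt, and the names transformReduceRef, negatedSum and demo are hypothetical helpers introduced here.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Serial reference: apply unary_op to each element, then fold the results
// into init with binary_op.
template <typename T, typename UnaryOp, typename BinaryOp>
T transformReduceRef(const std::vector<T>& input, UnaryOp unary_op,
                     T init, BinaryOp binary_op) {
    return std::accumulate(
        input.begin(), input.end(), init,
        [&](const T& acc, const T& x) { return binary_op(acc, unary_op(x)); });
}

// Same operator choice as the snippets above: negate each element, then add.
inline float negatedSum(const std::vector<float>& input, float init) {
    return transformReduceRef(input, std::negate<float>(), init,
                              std::plus<float>());
}

// Usage: ten zero-initialized floats with an initial value of 4.f.
inline void demo() {
    std::vector<float> input(10);           // all elements are 0.0f
    assert(negatedSum(input, 4.f) == 4.f);  // 4 + ten negated zeros
}
```

With an input of {1, 2, 3} and an initial value of 4, this reference returns 4 + (-1) + (-2) + (-3) = -2.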

Inclusive and Exclusive Scan By Key:

Inclusive_scan_by_key performs, on a sequence, an inclusive scan of each sub-sequence as defined by equivalent keys.

......
int length = 10;
std::vector< int > keys = {1, 2, 2, 3, 3, 3, 4, 4, 4, 4};
// input and output vectors for device and reference
std::vector< float > input( length);
std::vector< float > output( length);
bolt::cl::equal_to<int> eq;
bolt::cl::plus<float> plus;
ctl.setForceRunMode(bolt::cl::control::MultiCoreCpu);
// Inclusive call:
bolt::cl::inclusive_scan_by_key(ctl, keys.begin(), keys.end(), input.begin(), output.begin(), eq, plus);
// Exclusive call:
bolt::cl::exclusive_scan_by_key(ctl, keys.begin(), keys.end(), input.begin(), output.begin(), 4.0f, eq, plus);

AMP backend variant:

#include <bolt/amp/scan_by_key.h>
......
ctl.setForceRunMode(bolt::amp::control::MultiCoreCpu);
std::vector< float > input( length);
std::vector< float > output( length);
bolt::amp::equal_to<int> eq;
bolt::amp::plus<float> plus;
// Inclusive call:
bolt::amp::inclusive_scan_by_key(ctl, keys.begin(), keys.end(), input.begin(), output.begin(), eq, plus);
// Exclusive call:
bolt::amp::exclusive_scan_by_key(ctl, keys.begin(), keys.end(), input.begin(), output.begin(), 4.0f, eq, plus);
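To make the segmented behavior concrete, here is a serial standard-library sketch of scan by key with addition as the operator. It does not use Bolt; the function names are hypothetical, and it assumes the common convention that an exclusive scan restarts from the initial value at the start of every key segment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive scan by key: running sum that restarts whenever the key changes.
inline std::vector<float> inclusiveScanByKeyRef(const std::vector<int>& keys,
                                                const std::vector<float>& values) {
    assert(keys.size() == values.size());
    std::vector<float> out(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        const bool sameSegment = i > 0 && keys[i] == keys[i - 1];
        out[i] = sameSegment ? out[i - 1] + values[i] : values[i];
    }
    return out;
}

// Exclusive scan by key: each segment starts at init (assumed convention),
// and each output excludes the element at its own position.
inline std::vector<float> exclusiveScanByKeyRef(const std::vector<int>& keys,
                                                const std::vector<float>& values,
                                                float init) {
    assert(keys.size() == values.size());
    std::vector<float> out(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        const bool sameSegment = i > 0 && keys[i] == keys[i - 1];
        out[i] = sameSegment ? out[i - 1] + values[i - 1] : init;
    }
    return out;
}
```

For the keys {1, 2, 2, 3, 3, 3, 4, 4, 4, 4} and an input of ten ones, the inclusive reference yields {1, 1, 2, 1, 2, 3, 1, 2, 3, 4}, and the exclusive reference with an initial value of 4 yields {4, 4, 5, 4, 5, 6, 4, 5, 6, 7}.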

Sort:

Sorts the input range using the comparison function provided.

#include <bolt/cl/sort.h>
.....
int length = 1024;
std::vector< float > input( length, 0.0 );
ctl.setForceRunMode(bolt::cl::control::MultiCoreCpu);
// cmp_fun is a user-supplied comparison functor (its declaration is elided above)
bolt::cl::sort( ctl, input.begin( ), input.end( ), cmp_fun );

AMP backend variant:

#include <bolt/amp/sort.h>
....
int length = 1024;
std::vector< float > input( length, 0.0 );
ctl.setForceRunMode(bolt::amp::control::MultiCoreCpu);
bolt::amp::sort( ctl, input.begin( ), input.end( ), cmp_fun );
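The sort snippets above leave cmp_fun undefined. As an illustration of what such a comparison function might look like, the sketch below sorts with the standard library instead of Bolt; descending and sortedCopy are hypothetical names, and a functor passed to Bolt's OpenCL path may need additional Bolt-specific declaration that is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical comparison function: orders floats from largest to smallest.
// It returns true when a must come before b.
inline bool descending(float a, float b) { return a > b; }

// Serial reference: returns a copy of values sorted according to cmp.
inline std::vector<float> sortedCopy(std::vector<float> values,
                                     bool (*cmp)(float, float)) {
    std::sort(values.begin(), values.end(), cmp);
    assert(std::is_sorted(values.begin(), values.end(), cmp));
    return values;
}
```

Calling sortedCopy({3.f, 1.f, 2.f}, descending) returns {3, 2, 1}; the result of a multicore sort with the same comparison can be compared against it element by element.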