Bolt  1.1
C++ template library with support for OpenCL
Bolt Documentation

Table of Contents

bolt.png

Introduction

Bolt is a C++ template library optimized for GPUs. Bolt is designed to provide high-performance library implementations for common algorithms such as scan, reduce, transform, and sort. The Bolt interface was modeled on the C++ Standard Template Library (STL). Developers familiar with STL will recognize many of the Bolt APIs and customization techniques.

C++ templates can be used to customize the algorithms with new types (for example, the Bolt sort can operate on ints, float, or any custom type defined by the user). Additionally, Bolt lets users customize the template routines using function objects (functors) written in OpenCL ™ for example, to provide a custom comparison operation for sort, or a custom reduction operation.

Bolt can directly interface with host memory structures such as std::vector or host arrays (e.g. float*). On today's GPU systems, the host memory is mapped or copied automatically to the GPU. On future systems that support the Heterogeneous System Architecture, the GPU will directly access the host data structures. Bolt also provides a bolt::cl::device_vector that can be used to allocate and manage device-local memory for higher performance on discrete GPU systems. Bolt APIs can accept either host memory or the device vector.

This document introduces the architecture of Bolt and also provides a reference for the Bolt APIs.

Examples

Simple Example

The simple example below shows how to use Bolt to sort a random array of 8192 integers.

#include <bolt/cl/sort.h>
#include <vector>
#include <algorithm>
int main()
{
// generate random data (on host)
std::vector<int> a(8192);
std::generate(a.begin(), a.end(), rand);
// sort, run on best device in the platform
bolt::cl::sort(a.begin(), a.end());
return 0;
}

The code will be familiar to the programmers who have used the C++ Standard Template Library; the difference is the include file (bolt/cl/sort.h) and the bolt::cl namespace before the sort call. Bolt developers do not need to learn a new device-specific programming model to leverage the power and performance advantages of GPU computing.

The example demonstrates two important features of Bolt:

Bolt Functor Example

Below example shows how to use functors with BOLT. For user defined datatypes to work with Bolt functor, user has to wrap his data type around a macro BOLT_FUNCTOR. This enables the user defined class to be available as a string to the OpenCL compiler which is invoked during a call to clBuildProgram(). Consider the following example.

BOLT_FUNCTOR(MyType<int>,
template <typename T>
struct MyType {
T a;
bool operator() (const MyType& lhs, const MyType& rhs) const {
return (lhs.a > rhs.a);
}
bool operator < (const MyType& other) const {
return (a < other.a);
}
bool operator > (const MyType& other) const {
return (a > other.a);
}
bool operator >= (const MyType& other) const {
return (a >= other.a);
}
MyType()
: a(0) { }
};
);

After defining the class, user need to register it with Bolt. Internally Bolt algorithms will make use of the device_vector class. But these are defined only for the basic data types like int, float, unsigned int etc. For user defined data types it’s the responsibility of the application developer to create a definition of the deveice_vector for his own defined data type. See the below example.

BOLT_TEMPLATE_REGISTER_NEW_TYPE( bolt::cl::greater, int, MyType< int> );

In the above example bolt::cl::greater functor is defined in include/bolt/cl/functional.h. For the user defined data types it is the responsibility of the application developer to define as shown above for the functors which he uses from include/bolt/cl/functional.h .

Stitching it all together with a below sample code for Sort:

int main()
{
typedef MyType<int> mytype;
std::vector<mytype> myTypeBoltInput(length);
bolt::cl::sort(myTypeBoltInput.begin(), myTypeBoltInput.end(),bolt::cl::greater<mytype>());
return 0;
}

User can also work with his own defined Functor as shown below:

BOLT_FUNCTOR(uddtD4,
struct uddtD4
{
double a;
double b;
double c;
double d;
};
);
// Functor for UDD. Adds all four double elements and returns true if lhs_sum > rhs_sum
struct AddD4
{
bool operator()(const uddtD4 &lhs, const uddtD4 &rhs) const
{
if( ( lhs.a + lhs.b + lhs.c + lhs.d ) > ( rhs.a + rhs.b + rhs.c + rhs.d) )
return true;
return false;
};
};
);
int main()
{
std::vector <uddtD4> boltInput(sizeOfInputBuffer);
bolt::cl::sort( boltInput.begin( ), boltInput.end( ), AddD4() );
return 0;
}

Templatized Bolt Functor

User can also use Templatized version of Bolt Functor with User defined data types as shown below:

BOLT_TEMPLATE_FUNCTOR1( MyFunctor, int
template <typename T>
struct MyFunctor{
T a;
T b;
bool operator() (const MyFunctor& lhs, const MyFunctor& rhs) const {
return (lhs.a > rhs.a);
}
bool operator < (const MyFunctor& other) const {
return (a < other.a);
}
bool operator > (const MyFunctor& other) const {
return (a > other.a);
}
MyFunctor(const MyFunctor &other)
: a(other.a), b(other.b) { }
MyFunctor()
: a(0), b(0) { }
MyFunctor(T& _in)
: a(_in), b(_in) { }
};
);
BOLT_TEMPLATE_REGISTER_NEW_TYPE( bolt::cl::greater, int, MyFunctor< int> );
int main()
{
typedef MyFunctor<int> myfunctor;
std::vector <myfunctor> boltInput(sizeOfInputBuffer);
bolt::cl::sort( boltInput.begin( ), boltInput.end( ), bolt::cl::greater<myfunctor>() );
return 0;
}

If the user wants to use TEMPLATE_FUNCTOR for more than one datatype then he can use other variants of this like BOLT_TEMPLATE_FUNCTOR2, BOLT_TEMPLATE_FUNCTOR3 etc from include/bolt/cl/clcode.h file.

Supported Functions and Code Paths

List of Supported Functions and Code Paths.

Code Path Behaviour

Default Behaviour

Bolt function is designed to be executed with four code paths (OpenCL™, C++ AMP, Multicore CPU and Serial CPU). The default mode is "Automatic" which means it will go into GPU path first, then Multicore CPU (Intel TBB), then SerialCpu with below mentioned order, control will go to other paths only if the selected one not found.

Selection order will be:

Force Behaviour

Forcing mode to any device will run the function on that device only. There are two ways in BOLT to force the control to specific Device.

  1. Setting control to Device Globally:
    myControl.waitMode( bolt::cl::control::NiceWait );
    myControl.setForceRunMode( bolt::cl::control::OpenCL );
  2. Setting control to Device locally
    ctl.setForceRunMode(bolt::cl::control::OpenCL);
    This will set the control to specified device locally, passing this control object to BOLT function enables specified device path only for that function, so reference to any BOLT function will always run specified device path.

Debug Log Facility

If Debug Log is enabled, we record the actual code path taken for execution. It could be OpenCL™ GPU/CPU, Multicore TBB or Serial. Users need to initialize the log object before the bolt call and a query after the call to know which all paths have been executed.

  1. Initialization
    #define BOLT_DEBUG_LOG
    #include "bolt/BoltLog.h"
    BOLTLOG::CaptureLog *xyz = BOLTLOG::CaptureLog::getInstance();
    xyz->Initialize();
    std::vector< BOLTLOG::FunPaths> paths;
  2. Querying
    xyz->WhatPathTaken(paths);
    for(std::vector< BOLTLOG::FunPaths>::iterator parse=paths.begin(); parse!=paths.end(); parse++)
    {
    std::cout<<(*parse).fun<<”\n”; // prints function number as defined in Boltlog.h
    std::cout<<(*parse).path<<”\n”; // prints the path number – 0 for Multicore TBB, 1 for OpenCL™ GPU, 2 for OpenCL™ CPU and 3 for Serial.
    std::cout<<(*parse).msg<<”\n”; // prints the Path taken
    }
    When multiple functions are called, they are logged in the order invoked. E.g. sort & then search.

Requirements

Bolt uses an OpenCLTM implementation that supports the static C++ kernel features; specifically, C++ template support for OpenCLTM kernels. Currently, the AMD OpenCLTM SDK 2.7 and above versions are designed to provides this support; other vendors may adopt this feature in the future. If you face any issue with Bolt. Please log an issue in Bolt Issues page. Bolt Issues page.

Prerequisites

Windows

  1. Visual Studio 2010 onwards (VS2012 for C++ AMP)
  2. Tested with 32/64 bit Windows® 7/8 and Windows® Blue
  3. CMake 2.8.10
  4. TBB (For Multicore CPU path only) (4.1 Update 1 or Above) . See Building Bolt with TBB.
  5. APP SDK 2.8 or onwards.

Note: If the user has installed both Visual Studio 2012 and Visual Studio 2010, the latter should be updated to SP1.

Linux

  1. GCC 4.6.3 and above
  2. Tested with OpenSuse 12.3, RHEL 6.4 64bit, RHEL 6.3 32bit, Ubuntu 13.4
  3. CMake 2.8.10
  4. TBB (For Multicore CPU path only) (4.1 Update 1 or Above) . See Building Bolt with TBB.
  5. APP SDK 2.8 or onwards.

Note: Bolt pre-built binaries for Linux are build with GCC 4.7.3, same version should be used for Application building else user has to build Bolt from source with GCC 4.6.3 or higher.

Catalyst™ package

The latest Catalyst package contains the most recent OpenCL runtime. Recommended Catalyst package is 13.11 Beta V1.

13.4 and higher is supported.

Note: 13.9 in not supported.

Supported Devices

AMD APU Family with AMD Radeon™ HD Graphics

  1. A-Series
  2. C-Series
  3. E-Series
  4. E2-Series
  5. G-Series
  6. R-Series

AMD Radeon™ HD Graphics

  1. 7900 Series (7990, 7970, 7950)
  2. 7800 Series (7870, 7850)
  3. 7700 Series (7770, 7750)

AMD Radeon™ HD Graphics

  1. 6900 Series (6990, 6970, 6950)
  2. 6800 Series (6870, 6850)
  3. 6700 Series (6790 , 6770, 6750)
  4. 6600 Series (6670)
  5. 6500 Series (6570)
  6. 6400 Series (6450)
  7. 6xxxM Series

Compiled binary windows packages (zip packages) for Bolt may be downloaded from the Bolt landing page hosted on AMD's Developer Central website.