GPGPU harnesses the number-crunching speed and massive parallelism of your graphics card to accelerate general-purpose tasks. When your algorithm maps well onto GPU hardware, the speedup from running hundreds of concurrent threads can be enormous.
There are a number of ways to implement GPGPU, ranging from multi-platform frameworks such as OpenCL to single-company frameworks such as NVIDIA's CUDA. I've gotten to play around with CUDA while TAing the parallel computing class, and it's lots of fun.
With NVIDIA (other vendors are probably similar; I'll update this as I learn more), each GPU device has a block of global memory serving a number of multi-processors, and each multi-processor contains several cores that can execute concurrent threads.
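To make that hierarchy concrete, here's a minimal CUDA sketch (the kernel and variable names are my own, not from any particular codebase): the launch carves the work into a grid of thread blocks, the hardware schedules blocks onto multi-processors, and each thread runs on a core.

```c
#include <cuda_runtime.h>

// Each of the many concurrent threads scales one array element.
__global__ void scale(float *data, float factor, int n) {
    // Global index: which block we're in, times block size,
    // plus our position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard against the final partial block
        data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Launch enough 256-thread blocks to cover all n elements;
    // the device distributes the blocks across its multi-processors.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```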
Specs on NVIDIA's GeForce GTX 580:
- 512 cores (16 MPs ⋅ 32 cores/MP)
- 1.5 GB GDDR5 RAM
- 192.4 GB/sec memory bandwidth
- 1.54 GHz processor clock rate
- 1.58 TFLOPs per second
Zoom.
The FLOPs/s computation is cores ⋅ clock ⋅ 2, because (from page 94 of the CUDA programming guide) each core can execute a single multiply-add operation (2 FLOPs) per cycle. For the GTX 580, that's 512 ⋅ 1.54 GHz ⋅ 2 ≈ 1.58 TFLOPs/s. Also take a look at the graph of historical performance on page 14, the table of device capabilities that starts on page 111, and the description of warps on page 93.
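If you want to see the cores ⋅ clock ⋅ 2 arithmetic fall out of the runtime API, here's a rough sketch. `cudaDeviceProp` reports the multi-processor count and clock rate, but not cores per multi-processor, so I hardcode 32 (correct for this Fermi-class card; an assumption for anything else).

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    // Cores per MP isn't in cudaDeviceProp; 32 is correct for
    // Fermi (compute capability 2.0) parts like the GTX 580.
    const int coresPerMP = 32;          // assumption for other devices
    int cores = prop.multiProcessorCount * coresPerMP;

    // clockRate is reported in kHz; each core retires one
    // multiply-add (2 FLOPs) per cycle, hence the factor of 2.
    double gflops = cores * (prop.clockRate / 1e6) * 2.0;

    printf("%s: %d MPs x %d = %d cores, %.2f GFLOPs/s peak\n",
           prop.name, prop.multiProcessorCount, coresPerMP, cores, gflops);
    return 0;
}
```

On a GTX 580 this should print something close to the 1.58 TFLOPs/s figure above.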