GPGPU uses the number-crunching speed and massive parallelism of your graphics card to accelerate general-purpose tasks. When your algorithm maps well onto GPU hardware, the speedup from running hundreds of concurrent threads can be enormous.

There are a number of ways to implement GPGPU, ranging from multi-platform frameworks such as OpenCL to single-vendor frameworks such as NVIDIA's CUDA. I've gotten to play around with CUDA while TAing the parallel computing class, and it's a lot of fun.

With NVIDIA (other vendors are probably similar; I'll update this as I learn more), each GPU device has a block of global memory serving a number of multiprocessors, and each multiprocessor contains several cores that execute concurrent threads.
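To make that concrete, here's a throwaway kernel of my own (not from any NVIDIA doc): each thread scales one array element, blocks of threads get scheduled onto the multiprocessors, and the array lives in that global memory. The sizes are just made-up toy numbers.

    // scale.cu -- each thread handles one element of the array
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        // Global thread index: which element this thread owns.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;                      // 1M floats, just a toy size
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));     // allocated in the GPU's global memory
        cudaMemset(d_data, 0, n * sizeof(float));

        int threadsPerBlock = 256;                  // a multiple of the 32-thread warp size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        // Blocks get distributed across the multiprocessors;
        // each multiprocessor's cores run the threads within them.
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }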

Specs on NVIDIA's GeForce GTX 580:

  • 512 cores (16 multiprocessors ⋅ 32 cores/MP)
  • 1.5 GB GDDR5 RAM
  • 192.4 GB/sec memory bandwidth
  • 1.54 GHz processor clock rate
  • 1.58 TFLOPS
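
You can pull most of these numbers off the card at runtime with cudaGetDeviceProperties. Here's a little query program (the memoryClockRate/memoryBusWidth fields need a reasonably recent CUDA toolkit, and the bandwidth line is the usual 2 ⋅ memory clock ⋅ bus width estimate, not a measurement):

    // devinfo.cu -- print the specs the runtime reports for device 0
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Device:           %s\n", prop.name);
        printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
        printf("Global memory:    %.2f GB\n", prop.totalGlobalMem / 1e9);
        printf("Core clock:       %.2f GHz\n", prop.clockRate / 1e6);   // clockRate is in kHz
        // Peak-bandwidth estimate: 2 transfers per clock (DDR) times bus width in bytes
        printf("Memory bandwidth: %.1f GB/s\n",
               2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1e6);
        return 0;
    }

It doesn't report cores per multiprocessor; that depends on the compute capability (32 for a compute 2.0 card like the GTX 580).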

Zoom.

The FLOPS figure is cores ⋅ clock ⋅ 2, because (per page 94 of the CUDA programming guide) each core can execute a single multiply-add operation (2 FLOPs) per cycle. Also take a look at the graph of historical performance on page 14, the table of device capabilities that starts on page 111, and the description of warps on page 93.
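
Plugging in the GTX 580's numbers from above:

    512 cores ⋅ 1.54 GHz ⋅ 2 FLOPs/cycle ≈ 1.58 ⋅ 10^12 FLOPs/s = 1.58 TFLOPS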