Graphics Processing Units (GPUs) are essentially supercomputers sitting inside your desktop workstation. Over the years, the design of these computing devices, alongside their software stacks, has evolved tremendously in terms of raw computational power and memory bandwidth. Today, GPUs lead CPUs by an order of magnitude in instruction throughput and memory bandwidth (see Figure 1, courtesy of Nvidia Corp.).
Since the initial release of Nvidia's CUDA software framework in 2007, the computational performance of these devices has increased aggressively, driven mostly by the needs of the gaming industry, the entertainment industry (rendering), high-performance computing fields (scientific computing), and other market forces.
Designed towards massive parallelization
In contrast to CPUs, which are excellent general-purpose computing machines capable of handling diverse workloads, GPUs are designed and specialized for massive parallelization (e.g. tens of thousands of concurrent threads), computing fine-grained parallel tasks on lightweight threads.
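To make the idiom concrete, here is a minimal CPU-side sketch of how a GPU kernel maps work onto lightweight threads: every output element gets its own "thread" running the same small function. The names (`saxpy_kernel`, `launch`) are illustrative, not from any real API; on an actual GPU the explicit loop disappears because the hardware schedules one thread per index.

```python
# Illustrative sketch of the GPU data-parallel idiom, simulated on the CPU.
# On a GPU, each index i would be handled by its own lightweight thread,
# all executing the same kernel function concurrently.

def saxpy_kernel(i, a, x, y, out):
    """Per-thread body: computes exactly one output element."""
    out[i] = a * x[i] + y[i]

def launch(n, a, x, y):
    out = [0.0] * n
    # On a GPU this loop disappears: the hardware launches one thread per i.
    for i in range(n):
        saxpy_kernel(i, a, x, y, out)
    return out

print(launch(4, 2.0, [1, 2, 3, 4], [10, 20, 30, 40]))  # [12.0, 24.0, 36.0, 48.0]
```

Because each element is computed independently, tens of thousands of such threads can run concurrently without any coordination, which is precisely the workload shape GPUs are built for.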
Additionally, while there is a great legacy of tradition and resources in software that supports development for CPUs, GPU software development is more recent and employs a different programming paradigm. The challenge in GPU development is to algorithmically extract and exploit the inherent low-level parallelism in a given problem and present it to the GPU hardware in a way that efficiently utilizes its computing and memory resources. The benefit of doing so is a dramatic increase in execution speed, or simply enabling existing problems to be solved in real time or near real time.
This has resulted in a coprocessor with tremendous computational power that is available everywhere, sitting under your desk, mostly unused. To put things in perspective, Figures 2 and 3 (courtesy of Nvidia) show the computational power and bandwidth of the GPU as over an order of magnitude higher than those of the CPU and RAM.
Computer vision and image processing on GPUs
Thanks to their data-parallel SIMT architecture, GPUs are extremely well suited to computer vision and image processing workloads. Here, operations performed on pixels or image sub-regions map well onto the numerous GPU cores, which are fed by the GPU's high-bandwidth memory interface. It is not uncommon to achieve two orders of magnitude speedup over CPU solutions, given the right device, adequate parallelization of the problem, and careful profiling and optimization. Simple textbook examples of such operations include image filtering (convolutions, correlations, edge detection, etc.), transformations (e.g. warping), color conversions, inpainting, and statistical analyses (reductions, scans, histograms, etc.), as well as higher-level computer vision algorithms like feature recognition and tracking, segmentation, stereo correspondence, structure from motion, and many others. Just as deep learning workloads benefit greatly from GPUs for training and inference through higher-level libraries, almost any computer vision workload can benefit through lower-level libraries or custom implementations.
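The image filtering case illustrates why these workloads parallelize so well. Below is a minimal sketch (a plain-Python 3x3 box blur, purely illustrative) in which every output pixel depends only on its input neighborhood, so on a GPU each `(y, x)` pair would be assigned its own thread:

```python
# Minimal sketch of a 3x3 box blur. Each output pixel depends only on its
# input neighborhood, so every (y, x) position is independent work -- the
# kind of per-pixel parallelism that maps one GPU thread per output pixel.

def box_blur(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):          # on a GPU: one thread per (y, x) pair
        for x in range(w):
            acc, cnt = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:   # clamp at the border
                        acc += img[yy][xx]
                        cnt += 1
            out[y][x] = acc / cnt
    return out

img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
print(box_blur(img)[1][1])  # 1.0 (the 9 averaged over its 3x3 neighborhood)
```

A megapixel image yields about a million such independent tasks, which is more than enough to keep tens of thousands of GPU threads busy.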
Efficiency boosted by GPU-processed key operations
For applications which can benefit from using GPUs, part of the algorithm can often be identified as parallelizable after a quick analysis of the given problem. For example, in the field of physics-based simulation a typical workload involves inversions of large sparse or dense matrices arising from discretization of some physical medium. Off-loading only these key operations to the GPU while leaving the remaining code unchanged is an excellent first step. Such targeted interventions can bring significant improvements to code efficiency for a relatively small investment in development time. As an added benefit, delegating demanding tasks to the GPU also frees up CPU cycles for other work, as GPU and CPU operate asynchronously.
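The structure of such a targeted offload can be sketched as follows. All names here are hypothetical: the point is that the hot spot (the linear solve) is isolated behind one function, so a GPU-backed version (for example, one calling into a cuSOLVER or cuSPARSE binding) could replace only that function while the surrounding application code stays untouched. A plain-Python Gaussian elimination stands in as the CPU reference:

```python
# Sketch of the "offload only the hot spot" pattern (all names hypothetical).
# The application code calls solve() through a single seam; a GPU port would
# swap only this function's body (e.g. for a cuSOLVER-backed routine) and
# leave everything else unchanged.

def solve(A, b):
    """CPU reference: Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] for row in A]   # work on copies; don't mutate caller data
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))  # pivot row
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):                         # eliminate below
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                        # back-substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

def simulation_step(A, b):
    # Surrounding application code: identical whichever backend solve() uses.
    return solve(A, b)

print(simulation_step([[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0]))  # [1.0, 2.0]
```

Keeping the offload behind one narrow seam like this is what makes the "small investment, large payoff" first step practical: profiling confirms the solve dominates runtime, and only that routine needs a GPU implementation.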
Many other applications can benefit from GPU implementation, especially if the critical constituent parts are already efficiently implemented and accessible through Nvidia's libraries: signal and image processing routines, linear algebra algorithms, fast linear solvers, and so on.
GPUs for performance-critical applications
For applications that are absolutely performance-critical, parallelizing a problem involves writing custom solutions in a lower-level language such as CUDA. The objective of these solutions is to balance the target algorithm's inherent parallelizability (how, and at which level) with the intricacies of the compute and memory subsystems and their interaction during execution. This process is typically iterative, involving several rounds of implementation, profiling, and optimization. However, even a short pilot phase can provide a rough estimate of what is achievable, based on quantifiable measures such as the utilization of the GPU's different resources.