An Introduction to GPU Driving with OpenCL

Press the gas pedal of a Venom GT car to the max and you can reach a speed of over 400 km/h. Ask Andy Roddick to show you a fast serve and you will hear the sound that a tennis ball makes when flying at a speed of almost 250 km/h.

Now imagine not just two, not tens, but thousands of acceleration pedals being pressed to the floor at the same time. All in parallel! Imagine thousands of powerful tennis serves and the sound that they make. All in parallel!

No, Clarkson will not be your trainer for this workshop, nor will Andy Roddick explain to you the secret recipe for the perfect forehand.

But, as we still have to quench our thirst for speed and performance, we’ll take a look together at the hundreds or thousands of cores that your computer probably has and we will teach you the fundamentals for starting your own experiments with parallel burning cores.

Throughout the course you will learn the basics of the OpenCL parallel programming paradigm, with a focus on GPUs. While getting familiar with the OpenCL concepts, you will add OpenCL functionality to an existing image processing application written in C and port its algorithms to run on the GPU.

Can you make it run faster? How much faster?

When and Where?

September 12th - September 16th 2015.

Date | Time | Room
September 12th 2015 | 10:00-13:00 | EG304
September 13th 2015 | 10:00-13:00 | EG304
September 14th 2015 | 18:00-20:30 | EG304
September 15th 2015 | 18:00-20:30 | EG304
September 16th 2015 | 18:00-20:30 | EG304

Workshop Agenda


  • Theory
    • OpenCL platforms, hosts and devices
    • Compute units, work groups, work items
    • Memory hierarchy
  • Lab session
    • Detect available OpenCL platforms and devices on your system
    • Query capabilities of the detected OpenCL platforms and devices


  • Theory
    • OpenCL execution model
    • Kernels, queues, synchronization
    • Memory objects
    • The OpenCL language
  • Lab session
    • How to map work items on the problem space
    • Transfer data to/from GPU
    • Implement the OpenCL kernel for the first image processing operation (IPO1)
    • Transfer data to GPU and back from the GPU


  • Theory
    • Profiling, events
  • Lab session
    • Profile the kernel for IPO1
    • Implement the kernel for the second image processing operation (IPO2)
    • Profiling and analysis
    • Implement the kernel for the third image processing operation (IPO3)


  • Theory
    • Recap memory hierarchy and memory objects
    • Synchronization across work items
  • Lab session
    • Profile and optimize the kernels


  • Theory
    • Images
  • Lab session
    • Profile and optimize the existing OpenCL implementation

Target Audience and Prerequisites

If you are interested in learning the fundamentals of OpenCL or simply eager to take a first step in the world of parallel programming with GPUs, then you're definitely part of the target audience. You are expected to be familiar with computer architecture and have good C programming knowledge.


To register for this workshop, please fill in the form. Please just be yourself and provide honest and simple answers. We want to get a better idea of what you already know and what you would like to learn, and to polish the last details of the training materials according to your requirements and preferences. For any questions regarding this workshop, please feel free to contact the trainer.

Registration is now closed.

About the Organizers

The workshop is organized by ROSEdu in partnership with StreamComputing.

We, the people at StreamComputing, are crazy about speed and performance. We specialize in optimizing software by means of GPUs, multi-core CPUs, FPGAs or any other kind of hardware that usually lies around unused by normal applications. When people need faster code, that's when we come in.

Course Staff

Trainer: Anca Hamuraru

Assistant Trainer: Albert Zaharovits

After the Workshop

For some of the participants the lab sessions were simply not enough. So after the workshop we had no other option but to have a small competition for them. The participants were given a functional implementation of an algorithm in C and OpenCL. There were two goals: to get the best possible performance out of the OpenCL kernel and to get the best overall speedup for the entire application. All participants had to use the same machine and the same GPU.

And the winners are (…drumroll…): Cristi Alexandru Vasile and Costin Giorgian Papuc! Congratulations! The runner-up, with very close performance, is Alexandru Grad.

Here are the results of our winners:

Name | Input Size | Overall Speedup | Kernel Speedup
Cristi Alexandru Vasile | 16K | 28.22X | 2.31X
Cristi Alexandru Vasile | 64K | 26.16X | 2.29X
Cristi Alexandru Vasile | 144K | 25.97X | 2.31X
Cristi Alexandru Vasile | 256K | 25.81X | 2.51X
Costin Giorgian Papuc | 16K | 29.18X | 2.32X
Costin Giorgian Papuc | 64K | 26.86X | 2.29X
Costin Giorgian Papuc | 144K | 26.98X | 2.29X
Costin Giorgian Papuc | 256K | 26.27X | 2.36X

The overall speedup is measured as the ratio between the execution time of the C implementation and the execution time of the OpenCL implementation. The measured execution time of the OpenCL implementation includes the time for allocating buffers on the device and for transferring the data to the device and back to the host. However, it does not include the time needed for initializing the OpenCL context and building the OpenCL kernel. The C implementation is single-threaded and does not make use of SIMD instructions.
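The overall speedup measurement described above can be sketched in a few lines of C. This is only an illustration and not the code used in the competition: the `OclTimings` struct, its field names, the two helper functions and the choice of milliseconds are all assumptions made for this example. Context initialization and kernel build time are deliberately left out of the struct, matching the description.

```c
#include <assert.h>

/* Hypothetical timing record for one run of the OpenCL implementation.
 * Field names and the millisecond unit are assumptions for this sketch.
 * Context initialization and kernel build time are intentionally absent. */
typedef struct {
    double alloc_ms;     /* allocating buffers on the device */
    double to_device_ms; /* transferring input data, host to device */
    double kernel_ms;    /* executing the kernel */
    double to_host_ms;   /* transferring results, device to host */
} OclTimings;

/* Total time counted for the OpenCL implementation. */
double ocl_total_ms(const OclTimings *t) {
    return t->alloc_ms + t->to_device_ms + t->kernel_ms + t->to_host_ms;
}

/* Speedup as a ratio: reference time divided by optimized time. */
double speedup(double reference_ms, double optimized_ms) {
    assert(optimized_ms > 0.0); /* a zero or negative time is a measurement error */
    return reference_ms / optimized_ms;
}
```

With made-up timings of 1, 2, 4 and 1 ms for the four OpenCL stages and 200 ms for the C implementation, `speedup(200.0, ocl_total_ms(&t))` gives 25, that is, a 25X overall speedup.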

sesiuni/opencl.txt · Last modified: 2015/10/03 18:37 by ahamuraru