Parallelizing Ray Tracing

Setup

I’m going to use Ray Tracing in One Weekend as the basis for this article. I implemented a ray tracer based on these books by Peter Shirley, but the books have since been updated and moved into a new repository, so this is a nice and clean entry point if you want to follow along.

Single Threaded

You can fork my version of RayTracingOneWeekend; I’ve added a CMake file and edited the project so it writes the resulting image to a file. I also made a baseline branch that produces the default result: a 1200×800 image with 10 samples per pixel.

Single threaded, 1200×800, 10 spp, 88 seconds.

On my CPU this takes around 88 seconds, and overall it uses only about 13 MB of memory at runtime.
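For reference, the hot loop we’re about to parallelize looks roughly like this (a simplified sketch; vec3, ray, cam, color() and random_double() all come from the book’s code, and names may differ slightly):

for (int j = ny - 1; j >= 0; --j) {
	for (int i = 0; i < nx; ++i) {
		vec3 col(0, 0, 0);
		for (int s = 0; s < ns; ++s) {
			float u = float(i + random_double()) / float(nx);
			float v = float(j + random_double()) / float(ny);
			ray r = cam.get_ray(u, v);
			col += color(r, world, 0);
		}
		col /= float(ns);                                      // average the samples
		col = vec3(sqrt(col[0]), sqrt(col[1]), sqrt(col[2])); // gamma correction
		image[j * nx + i] = col;                               // store for the file write
	}
}

Every pixel is completely independent of the others, which is what makes this such a good candidate for parallelization.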

Naive Jobification

To get an idea of the different approaches for making this run faster, I decided to start with std::async, creating one job per pixel. I assumed from the start that this would bring some speed gains, but with some caveats.

Once again, I’ve made a new branch for the Jobs version of the ray tracer; feel free to download it and experiment. The main changes are creating a std::async job for each pixel, saving each std::future<RayResult> in a vector, and using a std::condition_variable to wait until all jobs have been queued; the std::future::get() calls then block until each job completes.

std::mutex mutex;
std::condition_variable cvResults;
std::vector<std::future<RayResult>> m_futures;

for (int j = 0; j < ny; ++j) {
	for (int i = 0; i < nx; ++i) {
		// async | deferred is the default policy: the implementation is
		// free to defer a job until its future is queried. Pass
		// std::launch::async alone to force a concurrent launch.
		auto future = std::async(std::launch::async | std::launch::deferred,
		[&cam, &world, ns, i, j, nx, ny]() -> RayResult {
			const unsigned int index = j * nx + i;
			vec3 col(0, 0, 0);
			for (int s = 0; s < ns; ++s) {
				float u = float(i + random_double()) / float(nx);
				float v = float(j + random_double()) / float(ny);

				ray r = cam.get_ray(u, v);
				col += color(r, world, 0);
			}
			col /= float(ns);

			RayResult result;
			result.index = index;
			result.col = vec3(sqrt(col[0]), sqrt(col[1]), sqrt(col[2]));
			return result;
		});

		{
			std::lock_guard<std::mutex> lock(mutex);
			m_futures.push_back(std::move(future));
		}
	}
}

This is what the main loop now looks like. RayResult is a simple struct holding the image index and the resulting color value; a minimal sketch of it is shown below. After this, I wait for m_futures to hold as many futures as there are pixels in the image, then build the image and write it to a file.
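For reference, RayResult can be as simple as this (a sketch; the field names match how it is used above):

struct RayResult
{
	unsigned int index; // position in the image buffer
	vec3 col;           // final, gamma-corrected color
};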

// launched jobs. need to build image.
// wait until the number of queued futures equals the pixel count.
// (since the futures are pushed on this same thread, the predicate is
// already satisfied by the time we get here; the real per-job blocking
// happens in get() below.)
{
	std::unique_lock<std::mutex> lock(mutex);
	cvResults.wait(lock, [&m_futures, &pixelCount] {
		return m_futures.size() == pixelCount;
	});
}

// reconstruct image.
for (std::future<RayResult>& rr : m_futures)
{
	RayResult result = rr.get(); // blocks until this pixel's job has finished
	image[result.index] = result.col;
}

1 Job per Pixel, 1200×800, 10 spp, 23 seconds.

The resulting image is the same, and as expected execution time dropped to around 23 seconds: a 3.8x improvement. But not everything looks good; on the contrary, we now use a lot more memory!

This run took almost 1 GB of memory, compared to 13 MB for the single-threaded version! CPU usage stays close to 100% for almost the entire execution, meaning most cores were kept busy, but that memory usage is way too high. I think we can do better!

Threads and Blocks

The next implementation creates N threads, where N is the number of threads my CPU can run concurrently, and splits the image into N blocks of image rows. I’ll use a std::condition_variable to determine when each thread has finished as well, and we’ll see if this improves our program.
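The core of it looks something like the sketch below. Note that RenderRows is an illustrative helper, not the exact code from the branch: it would contain the same per-pixel sampling loop shown earlier, applied to rows rowStart to rowEnd.

const int nThreads = std::thread::hardware_concurrency();
const int rowsPerThread = ny / nThreads;

std::vector<std::thread> threads;
for (int t = 0; t < nThreads; ++t)
{
	const int rowStart = t * rowsPerThread;
	// the last thread also takes the leftover rows
	const int rowEnd = (t == nThreads - 1) ? ny : rowStart + rowsPerThread;
	threads.emplace_back([=, &cam, &world, &image]() {
		RenderRows(cam, world, image, rowStart, rowEnd, nx, ny, ns);
	});
}

for (std::thread& t : threads)
{
	t.join(); // joining is enough for a sketch; the branch signals completion with a condition variable
}

Since each thread writes to a disjoint range of rows, no locking is needed on the image buffer itself.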

We get around the same speed benefit, with only a small increase in memory consumption over the baseline test. The std::async version still performs faster, but I suspect that’s because some blocks had less work to do than others and therefore finished first. That leaves some CPU cores idle while the remaining threads finish their blocks (we can see this in the decreased CPU usage in the screenshot above). The image is less computationally intensive in some areas than others; think diffuse spheres versus refractive ones.

I also suspect that if we used std::async and split the work equally into blocks, we would likewise reduce memory consumption, at the cost of computing the image more slowly. We need to find a good balance of job size: one job per pixel is clearly too small, while a very large block can cause idle time once the faster blocks are done (assuming the idle thread has no other job to pick up).

You can grab the source code on GitHub

Fine Tuning Job Sizes

If we have fewer jobs than CPU cores, some cores become idle with no more jobs to take on. I’ve created new tests to try out different job sizes. You can check out the code for the image-block version using threads and using std::async.

In both branches, you can edit nRowsPerJob to test different job sizes.

const int nThreads = std::thread::hardware_concurrency();
int nRowsPerJob = 10; // play with the number of rows per job
int nJobs = ny / nRowsPerJob;
int leftOver = ny % nRowsPerJob; // rows that don't fill a whole job
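Dispatching the jobs then looks something like this (a sketch, not the exact branch code; RenderRows is the same hypothetical helper as before, and each job renders its own disjoint block of rows):

std::vector<std::future<void>> futures;
for (int job = 0; job < nJobs; ++job)
{
	const int rowStart = job * nRowsPerJob;
	// the last job also picks up the leftover rows
	const int rowEnd = (job == nJobs - 1) ? rowStart + nRowsPerJob + leftOver
	                                      : rowStart + nRowsPerJob;
	futures.push_back(std::async(std::launch::async, [=, &cam, &world, &image]() {
		RenderRows(cam, world, image, rowStart, rowEnd, nx, ny, ns);
	}));
}

// block until every row block has been rendered
for (std::future<void>& f : futures)
{
	f.get();
}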

I managed to get the same results with both methods. I no longer see the gigantic 1 GB memory usage with std::async, since each job now generates a reasonable number of pixels instead of a single one. There was no visible performance difference between threads and std::async that I could see: on both versions, with various block sizes, I got roughly the same results, around 24 seconds per image and 30 MB of memory usage. Keeping the number of image rows per block low creates more jobs, which is ideal for spreading the work evenly across CPU cores.

Takeaway

I set out to expand and consolidate my knowledge of multi-threading paradigms and concepts in C++, using a mix of older and newer C++ features, and I had a lot of fun doing so.

I’m sure there’s lots of room for improvement, and I probably made plenty of mistakes; if I did, let me know. In any case, I hope you can draw your own conclusions and maybe learn something too.

Exploring Multi-Threading in C++: Loading Textures

Problem Overview

Let’s say we have a game engine that uses OpenGL, and we need to load textures asynchronously so that we don’t block the main thread and can load the editor or game much faster.

As previously demonstrated, I could launch a new set of threads and have them load the textures (I’m using stb_image for that) and generate them with glGenTextures. The main issue is that the OpenGL context is only available on the main thread, so if we want to take advantage of multi-threaded texture loading, we need to split texture loading from texture generation.

Loading is going to be done in worker threads, and generating textures will be done in the main thread. The following diagram shows a simplified workflow of what we’ll achieve.

Simplified execution of loading textures in worker threads, and processing them in the main thread.

On the main thread we have a method that checks the processing-textures queue for a job. If it finds one, it generates the OpenGL texture and assigns it back to the material.

void AssetManager::Update()
{
	if (!m_processingTexturesQueue.Empty())
	{
		TextureLoadJob assetJob;
		if (m_processingTexturesQueue.TryPop(assetJob))
		{
			// Generate OpenGL texture
			Texture outputTexture = GenerateTexture(assetJob.loadedData, assetJob.textureType);
			// Update Material
			assetJob.materialOwner->AddTexture(outputTexture);
		}
	}
}

The loader thread runs continuously, checking the loading-textures queue for jobs. Here, I load the texture from a file path and assign the result to the job’s loaded data.

void AssetManager::LoaderThread()
{
	while (m_loadingThreadActive)
	{
		if (!m_loadingTexturesQueue.Empty())
		{
			TextureLoadJob assetJob;
			if (m_loadingTexturesQueue.TryPop(assetJob))
			{
				// Load texture data into asset job
				assetJob.loadedData = LoadTextureData(assetJob.texturePath);
				// push job into processing queue
				m_processingTexturesQueue.Push(assetJob);
			}
		}
		// ....
	}
}

In-game Sponza scene.

This architecture allows me to load textures while the game is running without blocking the main thread. It’s a bit pointless to compare timings here, since I’m using my own sandbox rather than a sample program that isolates this problem. See Part 1 and Part 2 for more info and code you can follow along with.
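Both AssetManager methods above assume that Empty, TryPop and Push are thread-safe. That wrapper isn’t shown here, but a minimal sketch of such a queue could look like this (a mutex-guarded std::queue; the real implementation may differ):

#include <mutex>
#include <queue>

template <typename T>
class ThreadSafeQueue
{
public:
	void Push(const T& item)
	{
		std::scoped_lock<std::mutex> lock(m_mutex);
		m_queue.push(item);
	}

	// returns false instead of blocking when the queue is empty
	bool TryPop(T& outItem)
	{
		std::scoped_lock<std::mutex> lock(m_mutex);
		if (m_queue.empty())
		{
			return false;
		}
		outItem = m_queue.front();
		m_queue.pop();
		return true;
	}

	bool Empty() const
	{
		std::scoped_lock<std::mutex> lock(m_mutex);
		return m_queue.empty();
	}

private:
	mutable std::mutex m_mutex; // mutable so Empty() can stay const
	std::queue<T> m_queue;
};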

Full source code can be found on GitHub ( commit )

In the next part, we’ll parallelize a toy ray tracer. That’s a different problem in its own right, where we need to use the resulting values of multiple threads or jobs to build a final image.

Exploring Multi-Threading in C++ (Cont.)

Specific Worker Threads for Specific Jobs

My next test case has different worker threads running different kinds of tasks. The idea is to have a couple of threads for important jobs and others for less important ones. I’ve split the tasks into different job queues for simplicity.

static std::mutex g_mutexLowJobQ;
static std::mutex g_mutexMediumJobQ;
static std::mutex g_mutexHighJobQ;

std::queue<CalcPiJob*> GetJobsOfType(int count, int iterations)
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < count; ++i)
	{
		jobQ.emplace(new CalcPiJob(iterations));
	}
	return jobQ;
}

void RunThreadedPriority()
{
	int nHighThreads = 3;
	int nMediumThreads = 2;
	int nLowThreads = 2;
	
	std::queue<CalcPiJob*> lowJobQ = GetJobsOfType(Settings::JobCountLow, Settings::IterationCountLow);
	std::queue<CalcPiJob*> mediumJobQ = GetJobsOfType(Settings::JobCountMedium, Settings::IterationCountMedium);
	std::queue<CalcPiJob*> highJobQ = GetJobsOfType(Settings::JobCountHigh, Settings::IterationCountHigh);

	std::vector<std::thread> threads;

	std::atomic<bool> hasHighJobsLeft = true;
	for (int i = 0; i < nHighThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasHighJobsLeft, highJobQ, g_mutexHighJobQ);
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> hasMediumJobsLeft = true;
	for (int i = 0; i < nMediumThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasMediumJobsLeft, mediumJobQ, g_mutexMediumJobQ);
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> hasLowJobsLeft = true;
	for (int i = 0; i < nLowThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasLowJobsLeft, lowJobQ, g_mutexLowJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	while (hasHighJobsLeft || hasMediumJobsLeft || hasLowJobsLeft)
	{
		if (hasHighJobsLeft) 
		{
			ExecuteJobsQ(hasHighJobsLeft, highJobQ, g_mutexHighJobQ);
		}
		else
		{
			// wait for other threads to complete.
			std::this_thread::sleep_for(std::chrono::milliseconds(10));
		}
	}

	const int threadCount = threads.size();
	for (int i = 0; i < threadCount; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 6059 ms. (4 threads on high-priority jobs, counting the main thread, 2 on medium and 2 on low.)

The profile image shows the 4 threads handling only the big jobs, 2 threads handling the medium jobs, and the other 2 handling the smaller ones. As we can see, this doesn’t win us much time: when threads finish their own queue, they stand idle, no longer contributing to the bigger picture.

We can try to fix that by implementing some kind of work stealing: when a thread has no more jobs meant for it, it can steal jobs from the other queues.

Specific Threads with Work Stealing

This next test does just that. Each thread type is set up to grab a job of lower priority than its main one once its own jobs run out. Hopefully this prevents threads from going idle.

void RunThreadedPriorityWorkStealing()
{
	int nHighThreads = 5;
	int nMediumThreads = 1;
	int nLowThreads = 1;

	std::queue<CalcPiJob*> lowJobQ = GetJobsOfType(Settings::JobCountLow, Settings::IterationCountLow);
	std::queue<CalcPiJob*> mediumJobQ = GetJobsOfType(Settings::JobCountMedium, Settings::IterationCountMedium);
	std::queue<CalcPiJob*> highJobQ = GetJobsOfType(Settings::JobCountHigh, Settings::IterationCountHigh);

	std::vector<std::thread> threads;

	std::atomic<bool> isHighPriorityThreadsActive = true;
	for (int i = 0; i < nHighThreads; ++i)
	{
		std::thread t([&]() {
			
			while (isHighPriorityThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(highJobQ, g_mutexHighJobQ);

				// if no more High Jobs, take on Medium ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);
				}

				// if no more Medium Jobs, take on Small ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
				}

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isHighPriorityThreadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> isMediumThreadsActive = true;
	for (int i = 0; i < nMediumThreads; ++i)
	{
		std::thread t([&]() {
			while (isMediumThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);

				// if no more Medium Jobs, take on Small ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
				}

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isMediumThreadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> isLowThreadsActive = true;
	for (int i = 0; i < nLowThreads; ++i)
	{
		std::thread t([&]() {
			while (isLowThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isLowThreadsActive = false;
				}
			}
			});
		threads.push_back(std::move(t));
	}

	// main thread
	while (isLowThreadsActive || isMediumThreadsActive || isHighPriorityThreadsActive)
	{
		if (isHighPriorityThreadsActive)
		{
			CalcPiJob* currentJob = GetAndPopJob(highJobQ, g_mutexHighJobQ);

			// if no more High Jobs, take on Medium ones.
			if (!currentJob)
			{
				currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);
			}

			// if no more Medium Jobs, take on Small ones.
			if (!currentJob)
			{
				currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
			}

			if (currentJob)
			{
				currentJob->DoWork();
				delete currentJob;
			}
			else
			{
				isHighPriorityThreadsActive = false;
			}
		}
		else
		{
			// wait for other threads to complete.
			std::this_thread::sleep_for(std::chrono::milliseconds(10));
		}
	}

	const int threadCount = threads.size();
	for (int i = 0; i < threadCount; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

Now we can see that the high-priority worker threads start taking on medium-sized jobs as soon as the high-priority queue is depleted, with the small jobs following after.

Synchronizing Threads

Now let’s say I’m processing data and need jobs to start in sync across multiple threads, or maybe I’m building a game engine and my main update loop needs to start at the same time as the physics loop on another thread. Whatever the case, I thought looking into synchronization mechanisms was worth doing as well.

std::mutex g_syncMutex;
std::condition_variable g_conditionVariable;

void RunSynchronizedThreads()
{
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> signal = false;
	std::atomic<bool> threadsActive = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			while (threadsActive)
			{
				// Wait until the main thread signals that work may start.
				{
					std::unique_lock<std::mutex> lk(g_syncMutex);
					g_conditionVariable.wait(lk, [&] { return signal == true; });
				}

				CalcPiJob* currentJob = GetAndPopJob(jobQ, g_mutexJobQ);

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					threadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	// main thread
	std::atomic<bool> mainThreadActive = true;
	while (mainThreadActive && threadsActive)
	{
		// send signal to worker threads, they can start work.
		{
			std::lock_guard<std::mutex> lk(g_syncMutex);
			signal = true;
		}
		g_conditionVariable.notify_all();

		// clear the signal so workers block again until the next update.
		std::this_thread::sleep_for(std::chrono::milliseconds(1));
		{
			std::lock_guard<std::mutex> lk(g_syncMutex);
			signal = false;
		}
		g_conditionVariable.notify_all();

		// main thread work.
		CalcPiJob* currentJob = GetAndPopJob(jobQ, g_mutexJobQ);

		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			mainThreadActive = false;
		}
	}

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time: 2674 ms

I’ve set this one up so the worker threads only start work at the same frequency as the main thread. The goal was to use condition variables to synchronize the threads and to confirm the behavior with the profiler, which we can see in the image above.

Test                          Run Time (ms)   Improvement
One Thread                    10396           1.99x
Threaded                      2625            7.88x
Threaded with Priority        6059            3.4x
Threaded with Work Stealing   2625            7.8x
Synchronized Threads          2674            7.7x

Download code from GitHub

Exploring Multi-Threading in C++

The pursuit of performance is something that interests me as a developer, so as a learning exercise I decided to experiment and consolidate my knowledge about multi-threading. It’s becoming ever more important nowadays, as our CPUs gain more and more cores. Modern game engines and applications use multiple CPU cores to stay fast and responsive.

Setup and Baseline Result

As a test case, I decided to create a series of tasks, some small and others big, to simulate different workload types. For an easy workload, I grabbed a method that calculates Pi and run it a varying number of times, depending on how heavy I want the workload to be.

double CalcPi(int n)
{
	double sum = 0.0;
	int sign = 1;
	for (int i = 0; i < n; ++i)
	{
		sum += sign / (2.0 * i + 1.0);
		sign *= -1;
	}
	return 4.0 * sum;
}
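This is the Leibniz series, pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...), which converges very slowly; that slow convergence is exactly what makes it a convenient fake workload, since n directly controls how much work a job does. For example:

CalcPi(5);     // 3.339683, very coarse
CalcPi(10000); // ~3.14149, closer, and 2000x the work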

Now I create a number of different jobs running CalcPi and add them to a vector or a queue (depending on the test I’m running). My CalcPiJob class looks something like this:

class CalcPiJob
{
public:
	CalcPiJob(int iterations)
		: m_iterations(iterations)
	{ }

	void DoWork()
	{
		// Run CalcPi over and over just to burn CPU time; the result
		// itself is discarded, we only care about the workload.
		float p = 0.0f;
		for (int i = 0; i < m_iterations; ++i) {
			p += CalcPi(m_iterations);
		}

		p /= m_iterations;
		// Artificial pause so every job also spends some fixed time "waiting".
		std::this_thread::sleep_for(std::chrono::milliseconds(Settings::ThreadPause));
	}

private:
	int m_iterations;
};

Creating the series of different workload types looks something like this:

std::queue<CalcPiJob*> GetJobsQ()
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobQ;
}

std::vector<CalcPiJob*> GetJobVector()
{
	std::vector<CalcPiJob*> jobs;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobs;
}

I have also defined a couple of constants to help out.

struct Settings
{
	enum class Priority : int {
		Low = 0,
		Medium,
		High
	};

	static const int JobCountLow = 120;
	static const int JobCountMedium = 60;
	static const int JobCountHigh = 25;

	static const int ThreadPause = 100;

	static const int IterationCountLow = 5000;
	static const int IterationCountMedium = 10000;
	static const int IterationCountHigh = 20000;

	static const int PrecisionHigh = 100;
	static const int PrecisionMedium = 100;
	static const int PrecisionLow = 100;
};

Now, for a baseline, I go through all the jobs and execute DoWork sequentially.

void RunSequential()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();
	while (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		job->DoWork();
		delete job;
	}
}

I’m running all my tests on an i7 4770K, which has 4 cores and 8 threads. All timings were taken from a release build, and all profile images are from debug builds (to better illustrate the workload).

Sequential run time: 20692 ms

First Worker Thread

Let the interesting part begin. As an easy first step towards a multi-threaded application, I’m going to create just one extra thread to share the workload with the main thread.

This already introduces a few new concepts to be aware of, such as sharing data across multiple threads. We protect our data access with a std::mutex and lock it with std::scoped_lock (introduced in C++17; use the similar std::lock_guard if your compiler doesn’t support it).

You’ll need a few includes first.

// you should already have these.
#include <vector>
#include <queue>

#include <thread> // thread support
#include <mutex>  // mutex support
#include <atomic> // atomic variables
#include <future> // later on for std::async

CalcPiJob* GetAndPopJob(std::queue<CalcPiJob*>& jobQ, std::mutex& mutex)
{
	std::scoped_lock<std::mutex> lock(mutex);
	if (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		return job;
	}
	return nullptr;
}

GetAndPopJob does exactly what it says: it gets a job, if one exists, and pops it from the queue. The calls to empty(), front() and pop() are all protected inside this method by the std::scoped_lock.
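To see why the whole check-and-pop sequence must happen under a single lock, consider the racy alternative where each call takes the lock on its own:

// BROKEN, for illustration only: between empty() and front(), another
// thread may pop the last job, after which front() is undefined behavior.
if (!jobQ.empty())                 // thread A sees one job left...
{
	CalcPiJob* job = jobQ.front(); // ...but thread B popped it in the meantime
	jobQ.pop();
}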

void ExecuteJobsQ(std::atomic<bool>& hasWork, 
	std::queue<CalcPiJob*>& jobQ, 
	std::mutex& mutex)
{
	while (hasWork)
	{
		CalcPiJob* currentJob = GetAndPopJob(jobQ, mutex);
		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			hasWork = false;
		}
	}
}

ExecuteJobsQ runs on both the main thread and the worker thread. It gets a job, executes it, and continues until there is no more work to do.

// global mutex for read/write access to Job Queue
static std::mutex g_mutexJobQ;

void RunOneThread()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> jobsPending = true;

	// Starting new thread
	std::thread t([&]() {
		ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);
	});

	// main thread, also does the same.
	ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);

	t.join();
}

One worker thread run time: 10396 ms

The image above shows the execution of the jobs: the larger ones first, then the medium-sized ones, and lastly the smaller ones. This was the order in which the tasks were added to the queue.

More Worker Threads

Now this is nice, so let’s add more threads! How many? Well, I know my CPU can run 8 threads, but nothing guarantees they will run only for my program. The operating system time-slices program execution across cores and threads, so even if you create more threads than your CPU supports, there’s no “problem”: the operating system will switch execution time between them on its own.

C++ provides a way of determining how many concurrent threads our system supports, so let’s just use that: std::thread::hardware_concurrency().

void RunThreaded()
{
	// -1 to make space for main thread
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> hasJobsLeft = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

8 threads

Now this is a much nicer view: 7 worker threads working alongside the main thread to process all the jobs. Again, first the bigger jobs, then the medium ones, then the smaller ones, in the order they were added.
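One caveat worth knowing: the standard allows std::thread::hardware_concurrency() to return 0 when the value can’t be determined, so production code usually guards against that. A small defensive sketch:

unsigned int nThreads = std::thread::hardware_concurrency();
if (nThreads == 0)
{
	nThreads = 4; // arbitrary but sane fallback
}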

Async Tasks

When spawning tasks with std::async, we don’t create threads manually; the tasks typically run on threads the implementation manages for us, often from a thread pool (this is implementation-defined; MSVC, for example, uses one).

void RunJobsOnAsync()
{
	std::vector<CalcPiJob*> jobs = GetJobVector();

	std::vector<std::future<void>> futures;
	for (int i = 0; i < jobs.size(); ++i)
	{
		auto j = std::async([&jobs, i]() {
			jobs[i]->DoWork();
			});
		futures.push_back(std::move(j));
	}

	// Wait for Jobs to finish, .get() is a blocking operation.
	for (int i = 0; i < futures.size(); ++i)
	{
		futures[i].get();
	}

	for (int i = 0; i < jobs.size(); ++i)
	{
		delete jobs[i];
	}
}

Run time: 2220 ms
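Worth noting: when no launch policy is given, std::async behaves as if called with std::launch::async | std::launch::deferred, which permits the implementation to defer a task until its future is queried instead of running it concurrently. If a task must run concurrently, pass the policy explicitly:

auto f1 = std::async([] { /* may be deferred until .get() */ });
auto f2 = std::async(std::launch::async, [] { /* runs as if on a new thread */ });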

Overview

This table only serves as an overview for this particular case; of course, in real applications results will vary.

Test          Run Time (ms)   Improvement
Sequential    20692           1x
One Thread    10396           1.99x
Threaded      2625            7.88x
Async Tasks   2220            9.3x

The sample code is my exploration of this specific case and is by no means free of bugs. Still, it is interesting to see how the code runs across multiple threads, how to synchronize them, and how to make the most of my system.

All screenshots were taken with the debug version of the program, so the workload shows up clearly in the profiler. For that I used Superluminal Profiler, which I found to be an amazing, lightweight profiler. You can also use Intel’s VTune for free.

Download code from GitHub
