Parallelizing Ray Tracing

Index

Setup

I’m going to use Ray Tracing in one weekend as a basis for this article. I’ve implemented a ray tracer based on these books by Peter Shirley, but the books have since been updated and moved into a new repository. So this is a nice and clean entry point if you want to follow along.

Single Threaded

You can fork my version of RayTracingOneWeekend, I’ve added a CMakeFile and edited the project so it writes the resulting image into a file. I made a baseline branch for the default result for creating a 1200×800 image with 10 samples per pixel.

Singlethreaded, 1200×800, 10spp, 88 seconds.

On my CPU this takes around 88 seconds, and overall uses only 13Mb of memory during runtime.

Naive Jobification

In order to have an idea of different approaches to make this run faster, I decided to start by using std::async, and create one job per pixel. I assumed from the beginning that this would bring “some” speed gains, but some caveats.

Once again, I’ve made a new branch for the Jobs version of the Ray Tracer, feel free to download that and experiment with it. The main changes include creating a std::async job for each pixel, saving the std::future<ResultJob> in a vector, and using a std::condition_variable to wait until all jobs are complete.

std::mutex mutex;
std::condition_variable cvResults;
std::vector<std::future<RayResult>> m_futures;

for (int j = 0; j < ny; ++j) {
	for (int i = 0; i < nx; ++i) {
		auto future = std::async(std::launch::async | std::launch::deferred, 
		[&cam, &world, &ns, i, j, nx, ny, &cvResults]() -> RayResult {
			const unsigned int index = j * nx + i;
			vec3 col(0, 0, 0);
			for (int s = 0; s < ns; ++s) {
				float u = float(i + random_double()) / float(nx);
				float v = float(j + random_double()) / float(ny);

				ray r = cam.get_ray(u, v);
				col += color(r, world, 0);
			}
			col /= float(ns);

			RayResult result;
			result.index = index;
			result.col = vec3(sqrt(col[0]), sqrt(col[1]), sqrt(col[2]));
			return result;
		});

		{
			std::lock_guard<std::mutex> lock(mutex);
			m_futures.push_back(std::move(future));
		}
	}
}

This is what the main loop now looks like. RayResult is a simple struct containing the image index and resulting color value. After this, I wait for m_futures to be the same value as the number of pixels in the image, and then build the image before writing it to a file.

// launched jobs. need to build image.
// wait for number of jobs = pixel count
{
	std::unique_lock<std::mutex> lock(mutex);
	cvResults.wait(lock, [&m_futures, &pixelCount] {
		return m_futures.size() == pixelCount;
	});
}

// reconstruct image.
for (std::future<RayResult>& rr : m_futures)
{
	RayResult result = rr.get();
	image[result.index] = result.col;
}
1 Job per Pixel, 1200×800, 10spp, 23 seconds.

The resulting image is the same. As expected this reduced execution time to around 23 seconds, 3.8x improvement. But not everything is looking good, on the contrary, we now started to use a lot more memory!

Now this took almost 1 Gb of memory while running, compared to the 13Mb for the single threaded version! CPU usage is almost 100% across all the execution, meaning most cores where used, but that memory usage is way too high. I think we can do better!

Threads and Blocks

The next implementation involves creating N-Threads, the number of threads my CPU can run concurrently, and splitting the image into N blocks of image rows. I’ll be using a std::condition_variable to determine if each thread has finished as well, and we’ll see if this improves our program.

We do get around the same speed benefit and a small enough increase in memory consumption from the baseline test. std::async jobs still performs faster, but I suspect that is it because some of the blocks had less work to do than other, and therefor, finished first. This will make some of our CPU cores idle while the threads finish their blocks ( we can see that from the decrease CPU usage in the screenshot above ). The image is less computationally intensive in some areas than others, think about diffuse spheres versus refractive ones.

Now, I also think that if we used std::async, and split work equally in blocks, we would also reduce memory consumption and calculate the image slower. I think we need to find a nice balance between jobs sizes, obviously one job per pixel is too little and a too big block might cause idle time if the jobs is performed too fast ( assuming that thread doesn’t have another job to perform )

You can grab the source code on GitHub

Fine Tuning Job Sizes

If we have less jobs than CPU cores, some of them become idle and have no more jobs to take on. I’ve created new tests to try out different job sizes. You can check out the code for the image block version using threads and using std::async.

In both branches, you can edit nRowsPerJob to test different job sizes.

const int nThreads = std::thread::hardware_concurrency();
int nRowsPerJob = 10; // play with amount of rows for each job
int nJobs = ny / nRowsPerJob;
int leftOver = ny % nThreads;

I managed to get the same results on both methods. I no longer get a gigantic 1Gb memory usage with std::async, but now using a reasonable amount of pixels to generate, instead of one. There is no visible benefit in terms of performance from threads vs std::async that I could see. On both versions, with various block sizes, I had the same results: around 24 seconds per image and 30Mb of memory usage. By keeping the number of images rows per block low, more jobs will be created and this is ideal to split jobs evenly across CPU cores.

Take away

I set out to expand and consolidate my knowledge on multi threading paradigms and concepts using c++, using some older and some newer c++ features, and I had a lot of fun doing so.

I’m sure there’s lot of room for improvement and provably made lots of mistakes, if I did, let me know. In any case I hope you can take your own conclusions and maybe learn something too.

Exploring Multi-Threading in C++: Loading Textures

Index

Problem Overview

Let’s say we have a game engine that uses OpenGL and we need to load textures asynchronously so that we don’t block the main thread, and we can load the editor or game much faster.

Now as previously demonstrated, I could launch a new set of threads and have them load up the textures ( I’m using stb_image for that ) and generate textures with glGenTextures. The main issue here is that OpenGL context is only availabe in the main thread so if we want to take advantage of multi-threading texture loading we need to split the loading and generating for textures.

Loading is going to be done in worker threads, and generating textures will be done in the main thread. The following diagram shows a simplified workflow of what we’ll achieve.

Simplified execution of loading textures in worker threads, and processing them in the main thread.

In our main thread we have a method that will check our processing textures queue for a job. If it finds one, Generates the OpengGL texture and assigns it back to the material.

void AssetManager::Update()
{
	if (!m_processingTexturesQueue.Empty())
	{
		TextureLoadJob assetJob;
		if (m_processingTexturesQueue.TryPop(assetJob))
		{
			// Generate OpenGL texture
			Texture outputTexture = GenerateTexture(assetJob.loadedData, assetJob.textureType);
			// Update Material
			assetJob.materialOwner->AddTexture(outputTexture);
		}
	}
}

The loader thread will continuously run and check the loading textures queue for jobs. In this case, I load the texture from a file path and assigning the result into the loaded data.

void AssetManager::LoaderThread()
{
	while (m_loadingThreadActive)
	{
		if (!m_loadingTexturesQueue.Empty())
		{
			TextureLoadJob assetJob;
			if (m_loadingTexturesQueue.TryPop(assetJob))
			{
				// Load texture data into asset job
				assetJob.loadedData = LoadTextureData(assetJob.texturePath);
				// push job into processing queue
				m_processingTexturesQueue.Push(assetJob);
			}
		}
		// ....
	}
}
In game sponza scene.

This architecture allows me to load textures while the game is running without blocking the main thread. Its a bit pointless to compare times here since I’m using my own sandbox instead of a sample program to test only this matter. See Part 1 and Part 2 for more info and code you can follow along.

Full source code can be found on GitHub ( commit )

In the next part, we’ll parallelize a toy Ray Tracer. This a different problem on its own where we need to use the resulting values of multiple threads or jobs to build a final image.

Continue Reading