Parallelizing Ray Tracing

Index

Setup

I’m going to use Ray Tracing in one weekend as a basis for this article. I’ve implemented a ray tracer based on these books by Peter Shirley, but the books have since been updated and moved into a new repository. So this is a nice and clean entry point if you want to follow along.

Single Threaded

You can fork my version of RayTracingOneWeekend, I’ve added a CMakeFile and edited the project so it writes the resulting image into a file. I made a baseline branch for the default result for creating a 1200×800 image with 10 samples per pixel.

Singlethreaded, 1200×800, 10spp, 88 seconds.

On my CPU this takes around 88 seconds, and overall uses only 13Mb of memory during runtime.

Naive Jobification

In order to have an idea of different approaches to make this run faster, I decided to start by using std::async, and create one job per pixel. I assumed from the beginning that this would bring “some” speed gains, but some caveats.

Once again, I’ve made a new branch for the Jobs version of the Ray Tracer, feel free to download that and experiment with it. The main changes include creating a std::async job for each pixel, saving the std::future<ResultJob> in a vector, and using a std::condition_variable to wait until all jobs are complete.

std::mutex mutex;
std::condition_variable cvResults;
std::vector<std::future<RayResult>> m_futures;

for (int j = 0; j < ny; ++j) {
	for (int i = 0; i < nx; ++i) {
		auto future = std::async(std::launch::async | std::launch::deferred, 
		[&cam, &world, &ns, i, j, nx, ny, &cvResults]() -> RayResult {
			const unsigned int index = j * nx + i;
			vec3 col(0, 0, 0);
			for (int s = 0; s < ns; ++s) {
				float u = float(i + random_double()) / float(nx);
				float v = float(j + random_double()) / float(ny);

				ray r = cam.get_ray(u, v);
				col += color(r, world, 0);
			}
			col /= float(ns);

			RayResult result;
			result.index = index;
			result.col = vec3(sqrt(col[0]), sqrt(col[1]), sqrt(col[2]));
			return result;
		});

		{
			std::lock_guard<std::mutex> lock(mutex);
			m_futures.push_back(std::move(future));
		}
	}
}

This is what the main loop now looks like. RayResult is a simple struct containing the image index and resulting color value. After this, I wait for m_futures to be the same value as the number of pixels in the image, and then build the image before writing it to a file.

// launched jobs. need to build image.
// wait for number of jobs = pixel count
{
	std::unique_lock<std::mutex> lock(mutex);
	cvResults.wait(lock, [&m_futures, &pixelCount] {
		return m_futures.size() == pixelCount;
	});
}

// reconstruct image.
for (std::future<RayResult>& rr : m_futures)
{
	RayResult result = rr.get();
	image[result.index] = result.col;
}
1 Job per Pixel, 1200×800, 10spp, 23 seconds.

The resulting image is the same. As expected this reduced execution time to around 23 seconds, 3.8x improvement. But not everything is looking good, on the contrary, we now started to use a lot more memory!

Now this took almost 1 Gb of memory while running, compared to the 13Mb for the single threaded version! CPU usage is almost 100% across all the execution, meaning most cores where used, but that memory usage is way too high. I think we can do better!

Threads and Blocks

The next implementation involves creating N-Threads, the number of threads my CPU can run concurrently, and splitting the image into N blocks of image rows. I’ll be using a std::condition_variable to determine if each thread has finished as well, and we’ll see if this improves our program.

We do get around the same speed benefit and a small enough increase in memory consumption from the baseline test. std::async jobs still performs faster, but I suspect that is it because some of the blocks had less work to do than other, and therefor, finished first. This will make some of our CPU cores idle while the threads finish their blocks ( we can see that from the decrease CPU usage in the screenshot above ). The image is less computationally intensive in some areas than others, think about diffuse spheres versus refractive ones.

Now, I also think that if we used std::async, and split work equally in blocks, we would also reduce memory consumption and calculate the image slower. I think we need to find a nice balance between jobs sizes, obviously one job per pixel is too little and a too big block might cause idle time if the jobs is performed too fast ( assuming that thread doesn’t have another job to perform )

You can grab the source code on GitHub

Fine Tuning Job Sizes

If we have less jobs than CPU cores, some of them become idle and have no more jobs to take on. I’ve created new tests to try out different job sizes. You can check out the code for the image block version using threads and using std::async.

In both branches, you can edit nRowsPerJob to test different job sizes.

const int nThreads = std::thread::hardware_concurrency();
int nRowsPerJob = 10; // play with amount of rows for each job
int nJobs = ny / nRowsPerJob;
int leftOver = ny % nThreads;

I managed to get the same results on both methods. I no longer get a gigantic 1Gb memory usage with std::async, but now using a reasonable amount of pixels to generate, instead of one. There is no visible benefit in terms of performance from threads vs std::async that I could see. On both versions, with various block sizes, I had the same results: around 24 seconds per image and 30Mb of memory usage. By keeping the number of images rows per block low, more jobs will be created and this is ideal to split jobs evenly across CPU cores.

Take away

I set out to expand and consolidate my knowledge on multi threading paradigms and concepts using c++, using some older and some newer c++ features, and I had a lot of fun doing so.

I’m sure there’s lot of room for improvement and provably made lots of mistakes, if I did, let me know. In any case I hope you can take your own conclusions and maybe learn something too.

Exploring Multi-Threading in C++: Loading Textures

Index

Problem Overview

Let’s say we have a game engine that uses OpenGL and we need to load textures asynchronously so that we don’t block the main thread, and we can load the editor or game much faster.

Now as previously demonstrated, I could launch a new set of threads and have them load up the textures ( I’m using stb_image for that ) and generate textures with glGenTextures. The main issue here is that OpenGL context is only availabe in the main thread so if we want to take advantage of multi-threading texture loading we need to split the loading and generating for textures.

Loading is going to be done in worker threads, and generating textures will be done in the main thread. The following diagram shows a simplified workflow of what we’ll achieve.

Simplified execution of loading textures in worker threads, and processing them in the main thread.

In our main thread we have a method that will check our processing textures queue for a job. If it finds one, Generates the OpengGL texture and assigns it back to the material.

void AssetManager::Update()
{
	if (!m_processingTexturesQueue.Empty())
	{
		TextureLoadJob assetJob;
		if (m_processingTexturesQueue.TryPop(assetJob))
		{
			// Generate OpenGL texture
			Texture outputTexture = GenerateTexture(assetJob.loadedData, assetJob.textureType);
			// Update Material
			assetJob.materialOwner->AddTexture(outputTexture);
		}
	}
}

The loader thread will continuously run and check the loading textures queue for jobs. In this case, I load the texture from a file path and assigning the result into the loaded data.

void AssetManager::LoaderThread()
{
	while (m_loadingThreadActive)
	{
		if (!m_loadingTexturesQueue.Empty())
		{
			TextureLoadJob assetJob;
			if (m_loadingTexturesQueue.TryPop(assetJob))
			{
				// Load texture data into asset job
				assetJob.loadedData = LoadTextureData(assetJob.texturePath);
				// push job into processing queue
				m_processingTexturesQueue.Push(assetJob);
			}
		}
		// ....
	}
}
In game sponza scene.

This architecture allows me to load textures while the game is running without blocking the main thread. Its a bit pointless to compare times here since I’m using my own sandbox instead of a sample program to test only this matter. See Part 1 and Part 2 for more info and code you can follow along.

Full source code can be found on GitHub ( commit )

In the next part, we’ll parallelize a toy Ray Tracer. This a different problem on its own where we need to use the resulting values of multiple threads or jobs to build a final image.

Continue Reading

Exploring Multi-Threading in C++ Cont

Index

Specific Worker Threads for Specific Jobs

My next test case is to have different worker thread to run different kinds of tasks. The idea is to have a couple of threads for important jobs, others for less important jobs. I’ve split the tasks into different Job queues for simplicity.

static std::mutex g_mutexLowJobQ;
static std::mutex g_mutexMediumJobQ;
static std::mutex g_mutexHighJobQ;

std::queue<CalcPiJob*> GetJobsOfType(int count, int iterations)
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < count; ++i)
	{
		jobQ.emplace(new CalcPiJob(iterations));
	}
	return jobQ;
}

void RunThreadedPriority()
{
	int nHighThreads = 3;
	int nMediumThreads = 2;
	int nLowThreads = 2;
	
	std::queue<CalcPiJob*> lowJobQ = GetJobsOfType(Settings::JobCountLow, Settings::IterationCountLow);
	std::queue<CalcPiJob*> mediumJobQ = GetJobsOfType(Settings::JobCountMedium, Settings::IterationCountMedium);
	std::queue<CalcPiJob*> highJobQ = GetJobsOfType(Settings::JobCountHigh, Settings::IterationCountHigh);

	std::vector<std::thread> threads;

	std::atomic<bool> hasHighJobsLeft = true;
	for (int i = 0; i < nHighThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasHighJobsLeft, highJobQ, g_mutexHighJobQ);
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> hasMediumJobsLeft = true;
	for (int i = 0; i < nMediumThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasMediumJobsLeft, mediumJobQ, g_mutexMediumJobQ);
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> hasLowJobsLeft = true;
	for (int i = 0; i < nLowThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasLowJobsLeft, lowJobQ, g_mutexLowJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	while (hasHighJobsLeft || hasMediumJobsLeft || hasLowJobsLeft)
	{
		if (hasHighJobsLeft) 
		{
			ExecuteJobsQ(hasHighJobsLeft, highJobQ, g_mutexHighJobQ);
		}
		else
		{
			// wait for other threads to complete.
			std::this_thread::sleep_for(std::chrono::milliseconds(10));
		}
	}

	const int threadCount = threads.size();
	for (int i = 0; i < threadCount; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 6059 ms. ( 4 High Job threads, 2 medium and 2 low threads. )

( click to expand )

The profile image show the 4 threads handling only big jobs, 2 threads handling medium jobs and the other 2 threads handling smaller jobs. As we can see, this won’t win us much time, since when some thread finish their work, they stand idle, not contributing to the bigger picture.

We can try to fix that by implementing some kind of work stealing. When a thread has no more jobs meant for them, they can steal jobs from other thread queues.

Specific Threads with Work Stealing

This next test is just that. Each thread type was setup to grab a job of less priority from their main one, once they run out of jobs. Hopefully we will prevent threads from going idle.

void RunThreadedPriorityWorkStealing()
{
	int nHighThreads = 5;
	int nMediumThreads = 1;
	int nLowThreads = 1;

	std::queue<CalcPiJob*> lowJobQ = GetJobsOfType(Settings::JobCountLow, Settings::IterationCountLow);
	std::queue<CalcPiJob*> mediumJobQ = GetJobsOfType(Settings::JobCountMedium, Settings::IterationCountMedium);
	std::queue<CalcPiJob*> highJobQ = GetJobsOfType(Settings::JobCountHigh, Settings::IterationCountHigh);

	std::vector<std::thread> threads;

	std::atomic<bool> isHighPriorityThreadsActive = true;
	for (int i = 0; i < nHighThreads; ++i)
	{
		std::thread t([&]() {
			
			while (isHighPriorityThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(highJobQ, g_mutexHighJobQ);

				// if no more High Jobs, take on Medium ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);
				}

				// if no more Medium Jobs, take on Small ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
				}

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isHighPriorityThreadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> isMediumThreadsActive = true;
	for (int i = 0; i < nMediumThreads; ++i)
	{
		std::thread t([&]() {
			while (isMediumThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);

				// if no more Medium Jobs, take on Small ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
				}

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isMediumThreadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> isLowThreadsActive = true;
	for (int i = 0; i < nLowThreads; ++i)
	{
		std::thread t([&]() {
			while (isLowThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isLowThreadsActive = false;
				}
			}
			});
		threads.push_back(std::move(t));
	}

	// main thread
	while (isLowThreadsActive || isMediumThreadsActive || isHighPriorityThreadsActive)
	{
		if (isHighPriorityThreadsActive)
		{
			CalcPiJob* currentJob = GetAndPopJob(highJobQ, g_mutexHighJobQ);

			// if no more High Jobs, take on Medium ones.
			if (!currentJob)
			{
				currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);
			}

			// if no more Medium Jobs, take on Small ones.
			if (!currentJob)
			{
				currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
			}

			if (currentJob)
			{
				currentJob->DoWork();
				delete currentJob;
			}
			else
			{
				isHighPriorityThreadsActive = false;
			}
		}
		else
		{
			// wait for other threads to complete.
			std::this_thread::sleep_for(std::chrono::milliseconds(10));
		}
	}

	const int threadCount = threads.size();
	for (int i = 0; i < threadCount; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

(click to expand)

Now we can see that the high priority worker threads started to take on medium sized jobs as soon as the higher ones depleted, and then the small jobs followed.

Synchronizing Threads

Now lets say I’m processing data and I need to start Jobs in sync in between multiple threads, or maybe I’m building a game engine and my main update loop needs to start at the same time as the physics loop in some other thread. Whichever the case, I tough looking up synchronization mechanisms was worth doing as well.

std::mutex g_syncMutex;
std::condition_variable g_conditionVariable;

void RunSynchronizedThreads()
{
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> signal = false;
	std::atomic<bool> threadsActive = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			while (threadsActive)
			{
				// Tell main thread, worker is available for work
				{
					std::unique_lock<std::mutex> lk(g_syncMutex);
					g_conditionVariable.wait(lk, [&] { return signal == true; });
				}

				CalcPiJob* currentJob = GetAndPopJob(jobQ, g_mutexJobQ);

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					threadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	// main thread
	std::atomic<bool> mainThreadActive = true;
	while (mainThreadActive && threadsActive)
	{
		// send signal to worker threads, they can start work.
		{
			std::lock_guard<std::mutex> lk(g_syncMutex);
			signal = true;
		}
		g_conditionVariable.notify_all();

		// send signal to worker threads, so they have to wait for their next update.
		std::this_thread::sleep_for(std::chrono::milliseconds(1));
		{
			std::lock_guard<std::mutex> lk(g_syncMutex);
			signal = false;
		}
		g_conditionVariable.notify_all();

		// main thread work.
		CalcPiJob* currentJob = GetAndPopJob(jobQ, g_mutexJobQ);

		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			mainThreadActive = false;
		}
	}

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time: 2674 ms

(click to expand)

I’ve setup this one up so worker thread only start at the same frequency of the main thread. The goal here was to use condition variables to synchronize the threads, and hopefully confirm it with the profiler., which we can look at in the image above.

Test RunTime (ms)Improvement
One Thread 103961.99x
Threaded 26257.88x
Threaded with Priority 60593.4x
Threaded with Work Stealing 26257.8x
Synchronized Threads 26747.7x

Download code from GitHub

Continue Reading

Exploring Multi-Threading in C++

The pursuit of performance is something that interests me as a developer, so as a learning exercise I decided to experiment and consolidate my knowledge about multi-threading. Nowadays it’s becoming even more important since our CPUs get more and more cores. Modern game engines and applications use multiple CPU cores to stay fast and responsive.

Index

Setup and Baseline Result

As a test case, I decided to create a series of tasks, ones small and other big, to simulate different workload types. As an easy test case, I grabbed a method to calculate Pi, and run that method multiple times, depending on how heavy I want the workload to be.

double CalcPi(int n)
{
	double sum = 0.0;
	int sign = 1;
	for (int i = 0; i < n; ++i)
	{
		sum += sign / (2.0 * i + 1.0);
		sign *= -1;
	}
	return 4.0 * sum;
}

Now I create a couple of different Jobs running CalcPi and add them into a vector or a queue ( depending on the test I’m running ). My CalcPiJob class looks something like this.

class CalcPiJob
{
public:
	CalcPiJob(int iterations)
		: m_iterations(iterations)
	{ }

	void DoWork()
	{
		float p = 0.0f;
		for (int i = 0; i < m_iterations; ++i) {
			p += CalcPi(m_iterations);
		}

		p /= m_iterations;
		std::this_thread::sleep_for(std::chrono::milliseconds(Settings::ThreadPause));
	}

private:
	int m_iterations;
};

Creating a series of different workload types looks something like:

std::queue<CalcPiJob*> GetJobsQ()
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobQ;
}

std::vector<CalcPiJob*> GetJobVector()
{
	std::vector<CalcPiJob*> jobs;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobs;
}

I have also defined a couple of constants to help out.

struct Settings
{
	enum class Priority : int {
		Low = 0,
		Medium,
		High
	};

	static const int JobCountLow = 120;
	static const int JobCountMedium = 60;
	static const int JobCountHigh = 25;

	static const int ThreadPause = 100;

	static const int IterationCountLow = 5000;
	static const int IterationCountMedium = 10000;
	static const int IterationCountHigh = 20000;

	static const int PrecisionHigh = 100;
	static const int PrecisionMedium = 100;
	static const int PrecisionLow = 100;
};

Now for baseline, I go through all Jobs and execute DoWork sequentially.

void RunSequential()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();
	while (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		job->DoWork();
		delete job;
	}
}

I’m running all my tests on a i7 4770K, that has 4 cores and 8 threads. All timings where taken from a release build, and all profile images from debug builds ( for illustration of workload purposes ).

Sequential run time: 20692 ms

First Worker Thread

Let the interesting part begin. As an easy step towards a multi-threading application, I’m going to create only one thread, to share the workload with the main thread.

This already brings a few new concepts to be aware of such as sharing data across multiple threads. We protect our data access with a std::mutex, and lock it with std::scoped_lock ( introduced in C++17. Use similar std::lock_guard if your compiler doesn’t support it ).

You’ll need a few includes first.

// you should already have these.
#include <vector>
#include <queue>

#include <thread> // thread support
#include <mutex>  // mutex support
#include <atomic> // atomic variables
#include <future> // later on for std::async
CalcPiJob* GetAndPopJob(std::queue<CalcPiJob*>& jobQ, std::mutex& mutex)
{
	std::scoped_lock<std::mutex> lock(mutex);
	if (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		return job;
	}
	return nullptr;
}

GetAndPopJob does exactly what is says, it will get a job if one exists and pop it from the queue. empty(), front() and pop() are protected inside this method with the use of the std::scoped_lock.

void ExecuteJobsQ(std::atomic<bool>& hasWork, 
	std::queue<CalcPiJob*>& jobQ, 
	std::mutex& mutex)
{
	while (hasWork)
	{
		CalcPiJob* currentJob = GetAndPopJob(jobQ, mutex);
		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			hasWork = false;
		}
	}
}

ExecuteJobsQ will run in the main thread and the worker thread. It gets a job, execute it, and continue until there is no more work to do.

// global mutex for read/write access to Job Queue
static std::mutex g_mutexJobQ;

void RunOneThread()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> jobsPending = true;

	// Starting new thread
	std::thread t([&]() {
		ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);
	});

	// main thread, also does the same.
	ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);

	t.join();
}

One worker thread run time: 10396 ms

(click to expand)

The image above show the execution of the jobs, the larger ones first, then the medium sized ones and lastly the smaller ones. This was the order at which the tasks where added into the queue.

More Worker Threads

Now this is nice, so lets add more threads! How many? Well, I know my CPU has 8 thread, but nothing guarantees they will only run for my program tho. Operating system time slice program execution across multiple cores/threads, so even if you create more threads than your max CPU threads, there’s no “problem” because the operating system will switch execution time for them on its own.

C++ provides us a way of determining how many concurrent threads our system supports, so lets just use that: std::thread::hardware_concurrency()

void RunThreaded()
{
	// -1 to make space for main thread
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> hasJobsLeft = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

8 threads ( click to expand )

Now this is a nicer view. 7 worker threads working with the main thread to process all jobs. Again, first we see the bigger jobs, then medium, then smaller ones being processed. This is being processed in the order they were added.

Async Tasks

When spawning tasks with std::async, we don’t manually create threads, they are spawned from a thread pool.

void RunJobsOnAsync()
{
	std::vector<CalcPiJob*> jobs = GetJobVector();

	std::vector<std::future<void>> futures;
	for (int i = 0; i < jobs.size(); ++i)
	{
		auto j = std::async([&jobs, i]() {
			jobs[i]->DoWork();
			});
		futures.push_back(std::move(j));
	}

	// Wait for Jobs to finish, .get() is a blocking operation.
	for (int i = 0; i < futures.size(); ++i)
	{
		futures[i].get();
	}

	for (int i = 0; i < jobs.size(); ++i)
	{
		delete jobs[i];
	}
}

Run time: 2220 ms

(click to expand)

Overview

This time table only serves as an overview for this particular case. Of course, in real applications, results vary.

Test RunTime (ms)Improvement
Sequential 206921.x
One Thread 103961.99x
Threaded 26257.88x
Async Tasks22209.3x

The sample codes are my exploration of this specific case and by no means is free of bugs. But it is interesting to see how the code would run across multiple thread, how to synchronize and make the most of my system.

All screenshots are taken with the debug version of the program, so we could clearly see the workload in the profiler. For that I used Superluminal Profiler. I found out that it is an amazing, lightweight profiler. You can also use Intel’s VTune for free.

Download code from GitHub

Continue Reading

Variable Jump Height in Unity

In numerous games, characters have variable jump height. What this means is that the more your press the jump button, the more the character will remain in the air, or even jump a little higher ( think super mario games ). In this tutorial, I’ll show you a simple way to implement this kind of jumping in your games.

You can begin by making a new project, or create a new scene from a project you have in Unity. Add a cube to the scene to act as the ground, so you might want to scale it forward (z) and sideways (x), and finally add another cube to act as our player. By default, both of these will have BoxCollider Components attached to them, which is good, but our player will also need a Rigidbody Component, so we can use Unity’s physics engine. Note that if you add a Rigidbody component to the ground, make sure you have IsKinematic enabled and Gravity disabled, so it does not fall. On the contrary, make sure the player has Gravity enabled and IsKinematic disabled ( those should be set by default ). You should now have something pretty simple such as:

Screen Shot 2015-07-23 at 21.29.55

So, using Unity’s physics engine, how can we make our player jump? By looking at the Rigidbody’s documentation, we can see that we have many methods to add force, and we’ll use the normal AddForce, which takes a directional vector and a force mode. As far as direction goes, we’ll want the player to jump up, so we can use the local up vector from the player’s transform. For the ForceMode, we have Force, Acceleration, Impulse, and VelocityChange, and I will use VelocityChange for now.

In order to control how much I want my player to jump I need to setup some variables first, namely, I will need a jumpVelocityChange and isJumping variable. I will also store the reference to the rigidbody, that I will get in the Awake method.

[SerializeField] private float _jumpVelocityChange;
[SerializeField] private bool _isJumping;

private Rigidbody _rigidBody;
void Awake( )
{
	_rigidBody = this.GetComponent();
}

Quick Tip: [SerializeField] is used in unity, not only but also, to expose private variables in the Inspector. Why don’t I just make it public you ask? Because I don’t want other classes to access those variables, they are meant to be private.

So now that I have those setup, I’ll set them on the Inspector. I chose to set _jumpVelocityChange to 50 to begin with and _isJumping variable to false. Now since we’re using the physics engine, we should apply our changes in the FixedUpdate and not in the Update method by default. Meaning that we will not need to multiply our force values by Time.deltaTime, since FixedUpdate always ticks at the same rate.

Now, to make the actual jump. In our Update function, I’m going to check if I click the left mouse button. If the left mouse button is pressed and I am not currently jumping, I’ll set the _isJumping variable to true, and add a vertical force to the player’s rigidbody.

if ( Input.GetMouseButtonDown(0) && !_isJumping ) {
	_isJumping = true;
	_rigidBody.AddForce(this.transform.up * _jumpVelocityChange, ForceMode.VelocityChange);
}

This only will make the player jump when you click the left mouse button. But now when it falls down, you can’t jump again. Care to guess why? Because we haven’t set our _isJumping variable to false. So when should we do that? I’m going to go pretty simple with this, since both the ground and the player have colliders and rigidbodies, I can call the OnCollisionEnter method from my player, and when it’s called, switch the variable to false. I’m not going to add anything else now, and leave it like this, but ideally, you’ll want to check with whom the collision was made, if indeed it was the floor, for instance.

void OnCollisionEnter( Collision collision )
{
	_isJumping = false;
}

Now everything seems to work fine. You can jump, fall down, and then jump again. But what about the SuperMario style jump? You know, the more you press, the longer you remain in the air? For this, I need to setup three new variables, _startJumpTime, _maxJumpTime and _jumpAcceleration, so I can control air time and the acceleration force added during the jump.

[SerializeField] private float _startJumpTime;
[SerializeField] private float _maxJumpTime;
[SerializeField] private float _jumpAcceleration;

I’ve changed the update method to include the “hold-to-jump-higher-and-for-more-time” behaviour, and updated the first impulse to add the start jump time. So your FixedUpdate method should look something like this:

if ( Input.GetMouseButtonDown(0) && !_isJumping ) {
	_isJumping = true;
    _startJumpTime = Time.time;
	_maxJumpTime = _startJumpTime + _airJumpTime;
	_rigidBody.AddForce(this.transform.up * _jumpVelocityChange, ForceMode.VelocityChange);
}
else if ( Input.GetMouseButton(0) && _isJumping && ( _startJumpTime + _maxJumpTime > Time.time ) ) 
{
	_rigidBody.AddForce(Vector3.up * _jumpAcceleration, ForceMode.Acceleration);
}

This checks for a different Input method, GetMouseButton instead of GetMouseButtonDown. The first one is called whenever the button is down, and the later is called once, when the button was pressed. So we capture the long button press, check if we are indeed already jumping, and check the amount of time we have to allow this long jump. If all those conditions are met, we keep on jumping, giving a continuous acceleration. This time we use an acceleration instead of velocity change, so the effects is a lot smaller.

So now you should be able to have your awesome “Super Mario style Jump”:

[RequireComponent (typeof(Rigidbody))]
public class Character : MonoBehaviour 
{
	[SerializeField] private float _jumpVelocityChange;
	[SerializeField] private float _jumpAcceleration;
 
	[SerializeField] private float _startJumpTime;
	[SerializeField] private float _maxJumpTime;
	
	[SerializeField] private bool _isJumping;
 
	private Rigidbody _rigidBody;
 
	void Awake( )
	{
		_rigidBody = this.GetComponent();
	}
	
	void FixedUpdate() 
	{
		if ( Input.GetMouseButtonDown(0) && !_isJumping ) {
			_isJumping = true;
			_startJumpTime = Time.time;
            _maxJumpTime = _startJumpTime + _airJumpTime;
			_rigidBody.AddForce(this.transform.up * _jumpVelocityChange, ForceMode.VelocityChange);
		}
		else if ( Input.GetMouseButton(0) && _isJumping && ( _startJumpTime + _maxJumpTime > Time.time ) ) 
		{
			_rigidBody.AddForce(Vector3.up * _jumpAcceleration, ForceMode.Acceleration);
		}
	}
 
	void OnCollisionEnter( Collision collision )
	{
		_isJumping = false;
	}
}

Quick Tip: Notice the [RequireComponent (typeof(RigidBody))] on the top of the class. What this does is automatically add the RigidBody component to the object you are adding this class to. So if there is no rigidbody, it will automatically add it. This makes sure that whenever I call the component in the Awake method, I always get a rigidbody, thus avoiding future errors.

Here’s the final result for my game:

How to create a recursive call with Unity’s Coroutines

During the development of Super Stems I’ve had to deal with chain reaction. I started a battle with one tile, and if they win, the captured tile would start a battle, and this would happen recursively. For some reason, I have this battle method call on a different thread, in Unity’s terms, a Coroutine.

Now comes the question. How do I handle recursion with Coroutines? After looking around the doc and experimenting, I found a solution.

Here’s a sample code on how to make it work.

public void StartBattle( )
{
	StartCoroutine(BattleRecursive(0));
}

public IEnumerator BattleRecursive( int depth )
{
	// Start Coroutine"
	
	yield return new WaitForSeconds(2f);

	if( depth == 0 )
		yield return StartCoroutine( BattleRecursive(depth+1) );

	if( depth == 1 )
		yield return StartCoroutine( BattleRecursive(depth+1) );
	
	Debug.Log( "MyCoroutine is now finished at depth " + depth );
}

After calling the entry point to your recursive method StartBattle, inside, you have yield return a new Coroutine call to the recursive method. This way, the execution order will be correct. Try it out and see for yourself.

Output should be.

MyCoroutine is now finished at depth 2
MyCoroutine is now finished at depth 1
MyCoroutine is now finished at depth 0

Enjoy

Simple Achievement System in C#

Achievements are becoming more and more usual in games. They provide the player a sense of accomplishment and progress by rewarding them with badges that proves their skill and experience. Some achievements are simple and other require a combination of particular actions to unlock. In this article I show you how to make a simple Achievement System using C# and will demonstrate it using Unity3D, but this should be easy enough for you to port it to whatever language you’re more familiar with for your games.

Continue reading “Simple Achievement System in C#”

Non Blocking C# Task Cancelling

In our previous sample snippet, Cancel a Loop in a Task with CancellationTokens in c# , I try to explain how we can get out of a looping c# task, but a problem may arise from that situation. If we were to wait for any result out of that Task, we would be blocking the calling thread until the task returned, which is not good if we are on the main thread. We would locking our UI and might crash our application.

So I’ve been testing different ways to get out of that loop without causing any trouble, and you can achieve what we want many different ways.

So, to begin with, I think I would correctly assume that it is only necessary to wait for a task to complete if that task will return something. If there is no return value, why would we want to call wait on it? We can just break out of it, correct me if I’m wrong. If we have a return value, then it is necessary to surround the wait call on the task with try/catch to receive its result. But then again, we can avoid the locking here with a continuation task, which will create and start the task after the first one completes, giving us the result from the previous task to work with.

Continue reading “Non Blocking C# Task Cancelling”