Using different technologies to optimize linear complexity
When studying programming concepts that cover algorithms and their best and worst case scenarios, and while trying to write efficient code, aspiring programmers often find themselves examining their program's execution to find which code blocks and functions they could improve. While this is an understandable undertaking, a less experienced practitioner of our field will usually frame it with a single question:
What is the time complexity of my code?
Big O notation stands as the most common approach to answering this question, and it boils down to a simple idea: under the worst case scenario, how does the time my code takes to execute grow with the size of its input?
As a developer myself, my undergraduate studies indirectly led me to believe that optimization became irrelevant once an algorithm was already linear. For example, if every element of an array has to be examined anyway, say, to get the total sum of its elements, no further action seemed necessary: the worst case scenario is O(n), and since computing the sum requires touching each element once, the linear complexity felt fully justified.
The main problem arises in the context of usage. It is more than likely that budding developers and students will work with small data. However, in a real-life scenario where thousands, if not millions, of records in a database or a list need to be examined, a linear traversal without any form of enhancement quickly becomes a bottleneck.
It is for such cases that different optimizations have arisen, attempting to cut the running time of a linear pass down to a fraction of itself. Of course, counting only the number of operations, the complexity may still be linear. But it also pays to understand at least a little about how a computer's architecture processes this information. The small tweaks described briefly below can reduce the total elapsed time of your operations significantly.
The Basic Approach
Before I start going through the options available, let's consider the following example: you have a million elements and want to add them all. For simplicity, let's assume you store them in a 2D array of a thousand rows and a thousand columns, and that each element is 1. In short, you want to "get the million": add all elements so that the end result is 1,000,000.
Following this, we want to add all the elements. With no attempt at optimizing, we could use a simple nested loop, iterating over all the elements up to the constant DIM, which is 1000:
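The actual code lives in the repository linked at the end of this article; a minimal sketch of that baseline could look like the following (the names matrix, fillMatrix and sumBasic are placeholders of mine, not necessarily those used in the repository):

```cpp
const int DIM = 1000;          // 1000 x 1000 = 1,000,000 elements
short matrix[DIM][DIM];        // global, so it doesn't live on the stack

// Fill every cell with 1 so the expected total is exactly 1,000,000
void fillMatrix() {
  for (int i = 0; i < DIM; i++)
    for (int j = 0; j < DIM; j++)
      matrix[i][j] = 1;
}

// Baseline: a plain nested loop that visits every element exactly once
int sumBasic() {
  int total = 0;
  for (int i = 0; i < DIM; i++) {
    for (int j = 0; j < DIM; j++) {
      total += matrix[i][j];
    }
  }
  return total;                // 1,000,000 after fillMatrix() has run
}
```

The sketches in the following sections reuse this matrix, DIM and fillMatrix.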
So how can we improve on this simple nested loop?
I. General Optimizations
At the risk of hurting readability (depending on the complexity of your code), there are several techniques that can be applied to code blocks or loops to improve their execution time. Techniques such as strength reduction (like replacing integer multiplication with bit-wise shifts), loop fusion, nested loop interchange, loop unrolling, and so on remove part of the loop overhead the compiler would otherwise leave in place.
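Applied to our running example, an unrolled version could look something like the sketch below, where both loops advance in steps of 4; sumUnrolled is again a placeholder name, and the exact unrolling in the repository may differ:

```cpp
// General optimizations: counters pre-declared outside the loops and a
// 4 x 4 unrolling, so only 250 x 250 = 62,500 iterations are executed
int sumUnrolled() {
  int total = 0;
  int i, j;                                   // declared once, up front
  for (i = 0; i < DIM; i += 4) {
    for (j = 0; j < DIM; j += 4) {
      // 16 array accesses per iteration, one loop test/update for all of them
      total += matrix[i][j]     + matrix[i][j + 1]     + matrix[i][j + 2]     + matrix[i][j + 3]
             + matrix[i + 1][j] + matrix[i + 1][j + 1] + matrix[i + 1][j + 2] + matrix[i + 1][j + 3]
             + matrix[i + 2][j] + matrix[i + 2][j + 1] + matrix[i + 2][j + 2] + matrix[i + 2][j + 3]
             + matrix[i + 3][j] + matrix[i + 3][j + 1] + matrix[i + 3][j + 2] + matrix[i + 3][j + 3];
    }
  }
  return total;
}
```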
Note the step value of the iterations! In this case, because the dimension is known to be divisible by 4 (1000 / 4 = 250), we can reduce the total number of iterations from 1,000,000 to just 62,500, only 6.25% of the total! Sure, the number of accesses to the array is the same, but because each iteration incurs the test and update operations only a fraction as often as the non-optimized function does, overhead is reduced dramatically. Also note the pre-declaration of both i and j, which removes the need to set them up anew in each loop's initialization phase.
II. SIMD
Single Instruction, Multiple Data is an execution model that focuses, as the name implies, on performing the same operation on groups of elements in parallel, as opposed to performing it one element at a time. Since their introduction around 1997 with the special MMX registers, SIMD instruction sets have received a number of extensions, from SSE, which introduced the XMM registers, to AVX with its YMM registers.
The gist of this is relatively simple: we are allowed to work with data types larger than your typical integer. With XMM registers, for example, we can declare variables of up to 128 bits, enough to fit eight 16-bit integers, which is what I'll be using for this case:
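The repository holds the actual routine; the sketch below is one way to match this section's description (four __m128i accumulators, an inner loop stepping 16 shorts at a time, lanes combined with _mm_extract_epi16 at the end), with sumSIMD being a placeholder name of mine:

```cpp
#include <immintrin.h>   // SSE2 intrinsics: __m128i, _mm_add_epi16, etc.

int sumSIMD() {
  __m128i sum0 = _mm_setzero_si128();
  __m128i sum1 = _mm_setzero_si128();
  __m128i sum2 = _mm_setzero_si128();
  __m128i sum3 = _mm_setzero_si128();

  for (int i = 0; i < DIM; i++) {
    // even rows feed sum0/sum1, odd rows feed sum2/sum3, so no 16-bit
    // lane ever has to hold more than about 35,000
    __m128i& a = (i % 2 == 0) ? sum0 : sum2;
    __m128i& b = (i % 2 == 0) ? sum1 : sum3;
    int j = 0;
    for (; j + 16 <= DIM; j += 16) {          // 16 shorts (two loads) per step
      a = _mm_add_epi16(a, _mm_loadu_si128((const __m128i*)&matrix[i][j]));
      b = _mm_add_epi16(b, _mm_loadu_si128((const __m128i*)&matrix[i][j + 8]));
    }
    for (; j < DIM; j++) {                    // row tail: 1000 % 16 = 8 elements
      a = _mm_add_epi16(a, _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, matrix[i][j]));
    }
  }

  // pull the eight 16-bit lanes out of each accumulator and add them up
  int total = 0;
  const __m128i sums[4] = { sum0, sum1, sum2, sum3 };
  for (int v = 0; v < 4; v++) {
    total += _mm_extract_epi16(sums[v], 0) + _mm_extract_epi16(sums[v], 1)
           + _mm_extract_epi16(sums[v], 2) + _mm_extract_epi16(sums[v], 3)
           + _mm_extract_epi16(sums[v], 4) + _mm_extract_epi16(sums[v], 5)
           + _mm_extract_epi16(sums[v], 6) + _mm_extract_epi16(sums[v], 7);
  }
  return total;
}
```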
Note that you'll need the immintrin.h header file in order to use __m128i variables. What this function does is create four 128-bit accumulators to add all of the values in the 2D array, in a nested loop whose inner loop advances at a step of 16. 1000 / 16 * 1000 is still only 6.25% of the iterations, but the whole idea here is that a single SIMD addition covers a whole group of lanes at once rather than one element at a time. So, in theory, this SIMD function should be several times faster than the General Optimization one, but that's not quite the case.
Note as well that extra operations are added to the overhead by the use of four __m128i variables. Why four? The optimistic approach would be to use one single accumulator and simply add its lanes together with _mm_extract_epi16 at the end. The problem arises when we take range into consideration! Each of the 16-bit integers in a 128-bit variable is equivalent to an unsigned short, meaning that for the purpose of adding positive numbers it can only reach up to 65,535, which is a wee bit too short of the roughly 125,000 each lane of a single accumulator would need to hold. So we spread the sum across four variables to cover that range, ensuring the extra accumulators don't add anything beyond the last iterations in each row, to avoid getting the wrong result.
III. OpenMP
Open Multi-Processing is a technique where a master thread, the main execution of the program, forks a specified number of worker threads, dividing the load of the task across them for the code block the directive is declared in.
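As a rough sketch for our example, the directive could look like this, where reduction(+:total) merges each thread's partial sum back into the final total (sumOpenMP is a placeholder name of mine):

```cpp
#include <omp.h>   // compile with -fopenmp (or your compiler's equivalent flag)

int sumOpenMP() {
  int total = 0;
  // The parallel-for directive forks the worker threads and hands each of
  // them a share of the rows; the reduction clause combines their results
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < DIM; i++) {
    for (int j = 0; j < DIM; j++) {
      total += matrix[i][j];
    }
  }
  return total;
}
```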
To less experienced developers, OpenMP will come as a godsend, in that it allows for a rather simple introduction to multi-threading. And yet, depending on the task, OpenMP may not help much. It is worth considering that creating the extra threads and then consolidating and combining their results will, for a simple enough task, only add to the total cost of executing the program, i.e. the time it takes to complete.
Although this technique provides very little speedup for this case, I wanted to showcase it because it becomes very useful, even a must, for larger operations such as motion estimation in video compression and game development.
IV. Combining Everything!
Last but not least, we have the option to combine all of the technologies described above, hopefully creating a scenario where every possible optimization working together greatly improves the program's performance:
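One possible combination, sketched here with a placeholder name (sumCombined), is to let OpenMP split the rows across threads while each thread runs the SIMD inner loop:

```cpp
// OpenMP distributes the rows; each thread sums its row with SSE and the
// reduction clause merges the per-thread totals at the end
int sumCombined() {
  int total = 0;
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < DIM; i++) {
    __m128i acc = _mm_setzero_si128();
    int j = 0;
    for (; j + 8 <= DIM; j += 8) {            // 8 shorts per vector addition
      acc = _mm_add_epi16(acc, _mm_loadu_si128((const __m128i*)&matrix[i][j]));
    }
    // one row adds at most 125 to each 16-bit lane, so nothing overflows
    int rowSum = _mm_extract_epi16(acc, 0) + _mm_extract_epi16(acc, 1)
               + _mm_extract_epi16(acc, 2) + _mm_extract_epi16(acc, 3)
               + _mm_extract_epi16(acc, 4) + _mm_extract_epi16(acc, 5)
               + _mm_extract_epi16(acc, 6) + _mm_extract_epi16(acc, 7);
    for (; j < DIM; j++) rowSum += matrix[i][j];  // remainder (none when DIM = 1000)
    total += rowSum;
  }
  return total;
}
```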
Measuring Performance
In the same way we can test the endurance and speed of a professional athlete, we can use a form of chronometer to time how long finding the total sum of the 2D array takes. Conveniently enough, C++11 introduces a library called chrono, which lets us measure elapsed time using the steady_clock type. The concept behind measuring elapsed time is simple: compare a time in the past to the time now. The current time will always be larger, so simply subtract the past time from the current one to get the duration.
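The two timing helpers referenced below aren't reproduced from the repository; a minimal sketch matching the names used in this article (startClock, stopClock, ellapsedTime; the exact signatures are my assumption) could be:

```cpp
#include <chrono>

using std::chrono::steady_clock;

// Record the moment the measurement begins
void startClock(steady_clock::time_point& start) {
  start = steady_clock::now();
}

// Record the end moment and compute the elapsed time in milliseconds
void stopClock(steady_clock::time_point& start, steady_clock::time_point& stop,
               double& ellapsedTime) {
  stop = steady_clock::now();
  ellapsedTime = std::chrono::duration<double, std::milli>(stop - start).count();
}
```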
With these two functions, we can then start the clock with startClock(start), stop it with stopClock(start, stop, ellapsedTime), and display the result in the console.
It is then possible to run every test once to get the time for each, starting the clock before each function call and stopping it right after it is done, to display the elapsed time of each and get a winner! You could do so via a switch statement that calls each function depending on an index. Please feel free to clone the repository and test it yourself: https://github.com/gfcf14/GetMillion
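For illustration only, a hypothetical driver tying the sketches above together (the runTest name and the case ordering are my assumptions, not necessarily how the repository organizes it) might look like:

```cpp
#include <iostream>

// Time one of the variants, chosen by an index, and print the result
void runTest(int option) {
  steady_clock::time_point start, stop;
  double ellapsedTime = 0.0;
  int result = 0;

  fillMatrix();                  // make sure every cell holds 1 before timing
  startClock(start);
  switch (option) {
    case 0: result = sumBasic();    break;
    case 1: result = sumUnrolled(); break;
    case 2: result = sumSIMD();     break;
    case 3: result = sumOpenMP();   break;
    default: result = sumCombined(); break;
  }
  stopClock(start, stop, ellapsedTime);

  std::cout << "Sum: " << result << " in " << ellapsedTime << " ms" << std::endl;
}
```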
Note that, although times will appear different with every run, and although, because of the simplicity of the operation, the OpenMP optimization may at times be slightly slower than the execution with no optimizations, we have accomplished our objective: despite still performing a number of operations comparable to the number of elements, the total elapsed time is reduced to a fraction of what the plain linear version needs, as reflected below:
Because our technology-oriented times demand it, more and more methods will keep arising to accomplish the same operations we handle today, only far more efficiently. Wherever possible, do not hesitate to consider improving the efficiency of your code!