Hyperthreading is a technology used by processors to improve their performance by allowing a single core to handle multiple threads of instructions simultaneously. It works by duplicating certain processor components, chiefly the architectural register state, while the two threads share the core's execution pipeline, so the processor can keep its execution units busy with instructions from both threads in parallel.
While hyperthreading can improve processor performance, it can also introduce latency in some cases. Today I’m going to cover the common cases and give you an overview of how this happens outside your application, at the kernel level, too.
Summary
Before I jump into the details I would like to give you an overview of these factors so you can decide whether any of this applies to you. Do not hesitate to use the table of contents above if you want to drill into any of the particular factors we will talk about.
Factor | Context | Workaround
---|---|---
Cache Thrashing | Application level | Improve code |
False Sharing | Application level | Improve code |
Context Switching | Application level | Improve code |
Resource Contention | Both OS and Application | None |
SMT Overhead | Kernel level | Disable |
Thread Synchronization | Application level | Improve code |
CPU Level | Hardware level | BIOS disable
Windows Kernel | Kernel level | None |
Linux Kernel | Kernel level | Fine tune kernel |
The table above lists all the ways hyperthreading latency may affect your system, with a column for how you can work around each one. In a few cases there’s nothing to be done other than changing your operating system. Windows in particular is very limiting when it comes to this sort of thing and doesn’t leave you with a lot of options. In my opinion, if you think you are affected, the best choice is to go with Linux, as it gives you far more room to tune.
Cache thrashing
When multiple threads are running on a hyperthreaded processor, they may end up accessing the same cache lines, which can cause cache thrashing. This happens when the processor needs to repeatedly evict and reload data from the cache, which can slow down performance. The following code snippet shows an example of cache thrashing:
```c
// Arrays that both hyperthreads pull through the same small caches.
int a[1000], b[1000], c[1000];
int d[1000], e[1000], f[1000];

// Thread 1
void thread1() {
    for (int i = 0; i < 1000; i++)
        a[i] = b[i] + c[i];
}

// Thread 2
void thread2() {
    for (int i = 0; i < 1000; i++)
        d[i] = e[i] + f[i];
}
```
In this code, Thread 1 and Thread 2 run on the same physical core and compete for the same small caches, which can cause cache thrashing. Behind the scenes the CPU first checks the level 1 and 2 caches before falling back to main memory. When two hyperthreads share those caches, each thread keeps evicting the other’s data, and that thrashing shows up as latency. It gets worse as your working set grows, since more collisions happen between the two hyperthreads, and exactly how bad it gets depends on your code.
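One way to ease the pressure, sketched below under the assumption that the arrays are shared between the threads, is to give each thread its own disjoint half of the data, so the two hyperthreads stop pulling the same lines back and forth. The `add_range` helper here is a hypothetical name, not anything from the snippet above:

```cpp
#include <thread>
#include <vector>

// Shared arrays; each thread will only touch its own half.
static std::vector<int> a(1000), b(1000), c(1000);

void add_range(int begin, int end) {
    for (int i = begin; i < end; i++)
        a[i] = b[i] + c[i];
}

int main() {
    // Disjoint, contiguous ranges keep each hyperthread streaming
    // through its own region of memory.
    std::thread t1(add_range, 0, 500);
    std::thread t2(add_range, 500, 1000);
    t1.join();
    t2.join();
}
```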
False sharing
False sharing occurs when multiple threads access different variables that happen to live in the same cache line. This causes unnecessary cache line invalidations and updates, which can slow down performance. Let’s take a look at how this works in practice with some C code.
```c
// Shared cache line
struct {
    int a;
    int b;
} shared;

// Thread 1
void thread1() {
    for (int i = 0; i < 1000; i++) {
        shared.a += i;
    }
}

// Thread 2
void thread2() {
    for (int i = 0; i < 1000; i++) {
        shared.b += i;
    }
}
```
In this code, Thread 1 and Thread 2 are writing to different variables that live in the same cache line, which causes false sharing. The hardware tracks ownership at cache-line granularity, so every write by one thread invalidates the whole line for the other, even though they never touch the same variable. Think of it as walking in single file: you can’t skip past the person in front of you, so you aren’t really sharing the path with them, you are waiting on them. This can add real latency to your hyperthreaded code.
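The usual fix is to pad or align the hot fields so each one lands on its own cache line. Here’s a minimal sketch, assuming a 64-byte line (typical on x86; C++17 also offers `std::hardware_destructive_interference_size` as a portable hint where your standard library provides it):

```cpp
#include <thread>

// Each counter is forced onto its own 64-byte cache line, so the
// two threads no longer invalidate each other's line on every write.
// The 64 here is an assumption about the target's cache line size.
struct Counters {
    alignas(64) int a;
    alignas(64) int b;
};

Counters shared;

void thread1() { for (int i = 0; i < 1000; i++) shared.a += i; }
void thread2() { for (int i = 0; i < 1000; i++) shared.b += i; }

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}
```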
Context switching
When a hyperthreaded processor switches between two threads, there is a cost associated with saving and restoring the processor state. This can introduce latency, especially if the threads are frequently switching. The following code snippet shows an example of context switching:
```cpp
#include <thread>

// Thread 1
void thread1() {
    for (int i = 0; i < 1000; i++) {
        // Do some work
    }
}

// Thread 2
void thread2() {
    for (int i = 0; i < 1000; i++) {
        // Do some work
    }
}

int main() {
    // Start both threads
    std::thread t1(thread1);
    std::thread t2(thread2);

    // Wait for both threads to finish
    t1.join();
    t2.join();

    return 0;
}
```
In this code, the processor may need to switch between Thread 1 and Thread 2 many times, and each switch adds latency. You want to avoid excessive context switching at all costs; how you handle it largely defines how quick and responsive your application will be. A metaphor: imagine two apple trees, and you want to climb up and pick an apple from each. Every time you take a few steps up one ladder, you stop, climb down, and take a few steps up the other. Moving between ladders is costly, and here the cost is latency. So always account for context-switching overhead, as it can be one of the biggest factors behind slow hyperthreaded code.
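One simple guard against excessive switching, sketched below, is to never keep more runnable threads than you have logical processors, so the scheduler isn’t forced to time-slice many threads across few cores. `std::thread::hardware_concurrency` gives you that count:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // Size the worker pool to the number of logical processors.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1; // the call may return 0 if the count is unknown

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; t++) {
        workers.emplace_back([t] {
            // Do a slice of the work here.
            std::printf("worker %u running\n", t);
        });
    }
    for (auto &w : workers) w.join();
}
```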
Resource Contention
Hyperthreading can also cause latency when multiple threads compete for the same resources, such as the memory bus or execution units, which means increased waiting times. The example we used earlier also demonstrates resource contention; the difference is that the contended resource can be anything, not just memory. This applies to every hardware resource in the system that your hyperthread tries to access or interact with.
Basically in our previous example, Thread 1 and Thread 2 may end up competing for the same resources, which can cause resource contention and increase latency.
SMT Overhead
Simultaneous Multithreading (SMT) is the technology that enables hyperthreading, and it can introduce overhead in some cases. SMT works by duplicating certain processor components, which increases the complexity of the processor and can introduce additional overhead. Again our reference will be the last code example we used.
The processor needs to perform additional work to manage the two threads running on the hyperthreaded core, which can introduce overhead and increase latency, particularly when there is resource contention or cache thrashing. If you are not careful and fall into those two traps, your SMT code can end up slower than a plain single-threaded implementation would have been.
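If you want to check whether two logical CPUs are actually hyperthread siblings before blaming SMT, Linux exposes the CPU topology in sysfs. A minimal sketch, assuming the usual sysfs layout:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Lists which logical CPUs share a physical core with CPU 0.
// If the file shows two IDs (e.g. "0,8"), those two logical CPUs
// are hyperthread siblings on the same core.
int main() {
    std::ifstream f(
        "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
    std::string siblings;
    if (std::getline(f, siblings))
        std::cout << "cpu0 shares a core with: " << siblings << "\n";
    else
        std::cout << "topology info not available\n";
}
```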
Thread Synchronization
Hyperthreading can also cause latency when multiple threads need to synchronize their operations. Synchronization involves ensuring that two or more threads are coordinating their activities and not interfering with each other. This can result in increased waiting times for synchronization, which can slow down performance. The following code snippet shows an example of thread synchronization:
```cpp
#include <mutex>
#include <thread>

// Shared variable and the mutex that guards it
int counter = 0;
std::mutex mutex;

// Thread 1
void thread1() {
    for (int i = 0; i < 1000; i++) {
        // Lock the mutex
        std::lock_guard<std::mutex> lock(mutex);
        // Increment the counter
        counter++;
    }
}

// Thread 2
void thread2() {
    for (int i = 0; i < 1000; i++) {
        // Lock the mutex
        std::lock_guard<std::mutex> lock(mutex);
        // Decrement the counter
        counter--;
    }
}

int main() {
    // Start both threads
    std::thread t1(thread1);
    std::thread t2(thread2);

    // Wait for both threads to finish
    t1.join();
    t2.join();

    return 0;
}
```
In this code, Thread 1 and Thread 2 need to synchronize their operations using a mutex, which can introduce additional waiting times and increase latency.
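For something as simple as a shared counter, one way to shrink that cost is to drop the mutex entirely and use an atomic. Here’s a sketch of the same program with `std::atomic`; note the atomic still bounces a cache line between the threads, it just avoids the lock acquire/release:

```cpp
#include <atomic>
#include <thread>

// An atomic counter needs no mutex; each increment/decrement is a
// single indivisible hardware operation.
std::atomic<int> counter{0};

void thread1() { for (int i = 0; i < 1000; i++) counter++; }
void thread2() { for (int i = 0; i < 1000; i++) counter--; }

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}
```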
A common thread-synchronization pattern is waiting on a result produced by another thread. Take a network scenario: one thread requests some resource from a web API, and another thread relies on the result. The waiting thread has to synchronize with the fetching thread, and because of the timing gap you end up blocking it from doing other things. A more optimal approach is to trigger an event between the threads rather than synchronizing with a blocking wait, as sketched below. How much of this you can avoid ultimately comes down to how well your code design sidesteps those situations.
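Here’s a minimal sketch of that event-style handoff using a `std::promise`/`std::future` pair; `fetch_from_api` is a hypothetical stand-in for your real network call:

```cpp
#include <future>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical placeholder for a real web API request.
std::string fetch_from_api() {
    return "response body";
}

int main() {
    std::promise<std::string> result;
    std::future<std::string> ready = result.get_future();

    // The producer fulfils the promise exactly once when the data
    // arrives; no polling or repeated locking on the consumer side.
    std::thread producer([&result] {
        result.set_value(fetch_from_api());
    });

    // ... the consumer is free to do unrelated work here ...

    // Blocks only at the single point where the result is needed.
    std::cout << ready.get() << "\n";
    producer.join();
}
```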
CPU Level Hyperthreading Latency
At the CPU level, hyperthreading can impact latency in a number of ways. Hyperthreading allows a single physical core to run multiple threads simultaneously, which can improve performance by utilizing unused processing resources. However, this can also introduce additional latency due to resource contention and other factors.
CPUs with hyperthreading support include features designed to work well with it. For example, Intel processors offer Resource Director Technology (RDT), which can monitor and partition shared resources such as the last-level cache and memory bandwidth to limit contention between threads. Modern CPUs with hyperthreading also tend to include larger caches and other features designed to reduce the likelihood of cache thrashing. These features mitigate the issues above, but developers still must design their applications to work effectively with hyperthreading in order to benefit from them.
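If you want to know whether the package even advertises hyperthreading, you can ask CPUID. A hedged sketch for GCC/Clang on x86; note that the HTT flag (leaf 1, EDX bit 28) only says the package can expose multiple logical processors, not that SMT is currently enabled, so treat it as a hint:

```cpp
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    // CPUID leaf 1: EDX bit 28 is the HTT capability flag.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 28)))
        std::printf("CPU reports the HTT capability flag\n");
    else
        std::printf("no HTT flag reported\n");
}
```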
Hyperthreading In Linux Kernel
Hyperthreading in the Linux kernel can impact latency in a number of ways. While hyperthreading can improve performance by allowing the kernel to make better use of available processing resources, it can also introduce additional latency due to resource contention, scheduling overhead, and other factors.
One of the key ways that hyperthreading can impact latency in the Linux kernel is through resource contention, which we gave examples of earlier in the article. More particularly, at the kernel level, when multiple threads run on the same physical core they may compete for resources such as cache, memory bandwidth, and execution units. This can lead to increased latency as threads are forced to wait for access to shared resources.
Scheduling overhead associated with hyperthreading can also impact latency. When multiple threads are running on the same core, the kernel must switch between them more frequently in order to ensure that each thread receives adequate processing time. This can result in additional scheduling overhead and context switching, which can introduce additional latency.
To mitigate these issues, the Linux kernel includes a number of features designed to work effectively with hyperthreading. The kernel’s scheduler is designed to be aware of hyperthreading and to distribute threads across physical cores in a way that minimizes resource contention. The kernel includes support for CPU affinity, which allows developers to specify which threads should run on which cores, in order to minimize resource contention and reduce scheduling overhead.
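Here’s a minimal affinity sketch for Linux using `pthread_setaffinity_np` (a GNU extension) on a `std::thread`’s native handle; the CPU number is illustrative, and in practice you would pick cores that are not hyperthread siblings of each other:

```cpp
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Compile with: g++ -pthread (g++ defines _GNU_SOURCE by default,
// which pthread_setaffinity_np requires).
int main() {
    std::thread worker([] {
        std::printf("worker running\n");
    });

    // Restrict the worker to logical CPU 2 (illustrative choice).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);

    worker.join();
}
```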
While hyperthreading can improve performance by allowing the kernel to better utilize available processing resources, it can also introduce additional latency due to the factors we discussed earlier. The biggest problem is that these are outside your control as a developer, since they happen at the operating system level. That’s why Linux offers ways of disabling hyperthreading, either at boot with the `nosmt` kernel parameter or at runtime through `/sys/devices/system/cpu/smt/control`, which may help if it gets out of hand.
Hyperthreading In Windows Kernel
Hyperthreading in the Windows kernel can have an impact on latency, both positive and negative. While hyperthreading can improve performance by utilizing unused processing resources, it can also introduce additional latency due to resource contention and scheduling overhead.
One of the key ways hyperthreading impacts latency in the Windows kernel is through increased resource contention, similar to what we discussed with the Linux kernel.
The Windows kernel also includes features designed to work effectively with hyperthreading. Its scheduler is hyperthreading-aware and prefers to place threads on idle physical cores before doubling up on the logical processors of a busy core, which helps minimize resource contention. Processor groups additionally let the kernel manage large numbers of logical processors as distinct scheduling entities.
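The equivalent per-application knob on Windows is thread affinity. A small sketch with `SetThreadAffinityMask`; the mask bit is illustrative, and on machines with more than 64 logical processors you would reach for `SetThreadGroupAffinity` and processor groups instead:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Bit n of the mask corresponds to logical processor n;
    // here we pin the current thread to logical processor 2.
    DWORD_PTR mask = 1ull << 2;
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        std::printf("failed to set affinity\n");
    else
        std::printf("thread pinned to logical processor 2\n");
    return 0;
}
```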
As we discussed with Linux earlier, this is a big problem because it’s outside your control. Operating-system hyperthreading latency is, in general, something you need to find ways to fine-tune at the kernel level if it starts impacting your application.