Recently someone asked:
A simple question. I have a set of tasks to perform and I want to optimize their performance. Basically reading a file and writing to a database. Each operation is atomic, none of them depend on one another. So, which is better? Task.WhenAll or Parallel.ForEach?
Task.WhenAll is about asynchronous concurrency: it waits for a set of tasks that are already in flight. Parallel is about using the underlying CPU resources at the same time.
This may sound the same, but there are crucial differences. When you dispatch some work to a separate thread, there is no guarantee that the thread will be scheduled on a different processor core; that is the OS scheduler's decision. Even if you create 1000 threads, they may all take turns on the same core. And Task.WhenAll does not create any threads itself: it simply gives you one task that completes when all the tasks you passed in have completed. So those operations are indeed in flight concurrently, but this “concurrency” is only at the high app level. It does not guarantee that anything runs in parallel at the CPU level. It’s just a high-level abstraction for you. And that’s fine, because concurrency in this context is all about IO operations, where the time is spent waiting rather than computing.
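To make the IO case concrete, here is a minimal sketch. FakeIoAsync is a hypothetical stand-in for a real IO call (not something from the original question); it uses Task.Delay, which waits without holding a thread, much like real asynchronous IO does.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class IoConcurrencyDemo
{
    // Hypothetical stand-in for an IO call (HTTP request, DB query, file read).
    // Task.Delay waits without occupying a thread while the "operation" is pending.
    public static async Task<int> FakeIoAsync(int id)
    {
        await Task.Delay(200);
        return id * 2;
    }

    // Starts every operation first, then awaits all of them together.
    public static async Task<(int[] Results, TimeSpan Elapsed)> RunAsync(int count)
    {
        var stopwatch = Stopwatch.StartNew();

        // ToArray forces all the tasks to start before we begin waiting.
        Task<int>[] tasks = Enumerable.Range(1, count)
            .Select(id => FakeIoAsync(id))
            .ToArray();

        // Results come back in the same order as the input tasks.
        int[] results = await Task.WhenAll(tasks);

        stopwatch.Stop();
        return (results, stopwatch.Elapsed);
    }
}
```

Calling IoConcurrencyDemo.RunAsync(20) should take roughly as long as one delay rather than twenty of them, because the waits overlap. No extra cores are needed for that.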
On the other hand, if you have CPU-bound operations, then the way in which the work is scheduled becomes important. You want to maximize the CPU usage. That’s when you need Parallel. It partitions the work, runs the pieces on multiple thread pool threads at the same time, and blocks until everything is done, so the OS has runnable threads to put on multiple cores.
You need to be careful though. If you use the default configuration, Parallel puts no limit on the degree of parallelism and will use as many threads as the thread pool gives it, which can consume all cores. This may bring your machine to a seemingly “frozen” condition, and other processes on the host may be affected badly. So, it’s always nice to explicitly cap the number of concurrent operations with ParallelOptions.MaxDegreeOfParallelism.
For example, if you want to calculate a factorial, or find prime numbers, you want parallel processing and you should use Parallel.
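A rough sketch of the prime-number case with a cap on the parallelism. The PrimeCounter class and its method names are made up for illustration; the Parallel.ForEach overload with per-worker local state and ParallelOptions.MaxDegreeOfParallelism are the real pieces.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PrimeCounter
{
    // Simple trial division: deliberately CPU-heavy, no IO involved.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; (long)d * d <= n; d++)
        {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Counts the primes in [2, upperBound] with at most maxWorkers concurrent workers.
    public static int CountPrimes(int upperBound, int maxWorkers)
    {
        if (upperBound < 2) return 0;

        // The cap must be a positive number (or -1 for "no limit", the default).
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        int total = 0;

        Parallel.ForEach(
            Enumerable.Range(2, upperBound - 1),
            options,
            () => 0,                                                   // each worker starts its own count
            (n, loopState, local) => IsPrime(n) ? local + 1 : local,   // no shared state in the hot path
            local => Interlocked.Add(ref total, local));               // merge once per worker

        return total;
    }
}
```

Something like PrimeCounter.CountPrimes(1_000_000, Math.Max(1, Environment.ProcessorCount - 1)) leaves one core's worth of room for the rest of the machine. How much headroom you actually want depends on what else runs on the host.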
On the other hand, if your job is to get some data from some endpoint, this is in essence mostly IO, and you can use Task.WhenAll. The CPU utilization won’t make a difference here.
Ok So Which One Do I Want?
I wrote a lot, but didn’t answer the question 🙂 For this scenario, if you don’t have any heavy operations and the job only reads a file, does some simple mapping and writes to a database, stick to Task.WhenAll.
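Here is roughly what that could look like. This is only a sketch under a few assumptions: it targets .NET Core or later (for File.ReadAllTextAsync), the "mapping" is a trivial uppercase transform, and FakeDatabase with SaveAsync is a hypothetical stand-in for whatever async API your data access library offers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class FileImporter
{
    // Hypothetical stand-in for the database; a real version would call
    // the async API of the data access library instead.
    public static readonly ConcurrentBag<string> FakeDatabase = new ConcurrentBag<string>();

    private static async Task SaveAsync(string record)
    {
        await Task.Delay(50);   // simulated network round trip
        FakeDatabase.Add(record);
    }

    // One independent unit of work: read, map, write.
    public static async Task ImportFileAsync(string path)
    {
        string content = await File.ReadAllTextAsync(path);
        string record = Path.GetFileName(path) + ":" + content.Trim().ToUpperInvariant();
        await SaveAsync(record);
    }

    // Starts one import per file and completes when all of them have finished.
    public static Task ImportAllAsync(IEnumerable<string> paths)
    {
        return Task.WhenAll(paths.Select(path => ImportFileAsync(path)));
    }
}
```

You would call it with something like await FileImporter.ImportAllAsync(Directory.GetFiles(folder)). With a very large number of files you may still want to limit how many imports are in flight at once, since the database and the disk have their own limits.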
It’s not as simple as I described it, but something similar 🙂. This is a high level explanation and like all such explanations, not 100% accurate. Check the documentation to learn more about the low level details.
The thread is the smallest unit that can get CPU time (in almost any OS). So, whenever you create a thread, you need an OS thread to do the work. In managed environments (like .NET) that’s a bit different, since we usually keep a pool of threads and reuse them (so we don’t ask the OS for a new one all the time, which is costly). But, anyhow, you always have a 1:1 mapping of app thread to OS thread (we don’t have green threads in .NET).
The subtle difference is in how the work reaches the OS. If you just say give me a thread, then where it is scheduled is at the OS’s discretion. You still get the impression of concurrency, since there is CPU time scheduling. The Parallel approach does not pin threads to cores either, but it deliberately keeps several threads busy with CPU work at once, so the OS scheduler spreads them across as many cores as are available. So, it’s just more specific in communicating the intent.
Summary
The key difference between Task.WhenAll and Parallel.ForEach is how they use real OS threads. Task.WhenAll only waits for the tasks you give it, and with asynchronous IO those tasks hold hardly any thread while they wait, while Parallel.ForEach splits the work across several thread pool threads so it can keep multiple cores busy. If you have a CPU-bound operation, you will get better performance from Parallel.ForEach, but take care you don’t saturate every core and “freeze” the host OS!