I’m trying to understand how multi-core systems work and how to write efficient programs for systems with many cores. I know this is a very hard topic, but I’m very interested in the fastest solutions possible.
First of all, I’m trying to understand how threading works. It seems obvious that in most cases multithreading can increase performance dramatically. According to this page, this is how multithreading works:

But why is switching between N threads faster than just running N threads one by one? How can threading work on a system with only one CPU?
Next, what is the point of multi-core programming? I assume that the point is to split threads between cores and split tasks between them? But how can I split 8 threads equally on a system with 4 CPUs?
Do I have to use processor affinity (cpu_affinity) to split threads/processes between CPUs? Can I create 4 threads using pthread_create on a system with 4 CPUs so that each thread runs on its own CPU?
How does hyper-threading help, and does it help at all? How can we use CPU cache programming for multi-core systems?
Why is it so hard for big old projects, like MySQL for example, to take full advantage of many-CPU systems?
I’m interested in the theory of this problem and also in practical solutions/examples/projects/books/articles for Linux systems (using C).
I know this is an increasingly important topic and I hope I’m not the only one interested.
The difference between switching N threads and running N threads one by one is what happens when a thread temporarily cannot make forward progress. If you switch between N threads and one of them temporarily can’t make forward progress, say because it’s waiting for data to be read from the disk, another thread can make forward progress in the meantime. If you ran them completely sequentially, the CPU would be wasted while a thread was waiting for disk I/O to complete.
Hyper-threading helps by allowing you to make fuller use of CPU core execution resources. For example, if a thread isn’t doing any floating point math, the floating point units of that core are wasted. With hyper-threading, another thread can use those execution units.
On a typical modern core, operations take many clock cycles and a number of operations are in progress at a time. This means a core typically has many extra execution resources that it can’t use at any particular instant. Hyper-threading allows a higher percentage of those execution resources (barrel shifters, adders, logic units, branch units, and so on) to be used. Typically, hyper-threading may improve performance by 10% to 15%. The benefit isn’t greater because the threads also steal execution resources from each other, pollute each other’s use of the cache, and so on.
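The CPU cache is used automatically and you generally don’t have to do anything special to use it. Perhaps the most common exception is dealing with false sharing or cache ping-ponging, where threads on different cores write to distinct variables that happen to share a cache line, so that line keeps bouncing between the cores’ caches.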