Found an awesome book in my uni's library today with a comprehensible explanation of the CPU pipeline and other goodies, so I want to share some of it. (If anyone is wondering when the book was released, it was around the Athlon XP and Pentium 4 era.)
Before you read on, I want to mention that the information isn't given in strict chronological order and has been adjusted to make the explanation easier.
Days before CPU cache and pipelining
The CPU got its information from RAM, and RAM got the information it needed from the hard drive. That was slow, because hard drives were slow and there was very little RAM. The CPU ran at a constant clock speed, often the same as the RAM's. If RAM had no information loaded, the CPU couldn't do anything, and the whole computer was held back by the speed of the hard drive. That wasn't ideal. The CPU also ran at much lower clock speeds than today. On top of that, IPC (instructions per clock cycle) wasn't a meaningful metric yet, as the CPU needed several cycles to execute a single instruction.
The clock speed improvements
CPUs were getting faster and faster. To achieve that, manufacturers mostly increased the clock speed. Raising the clock speed required supplying more voltage, so the chip could switch between binary 0 and 1 quickly and reliably; in electrical terms, keeping the oscillations (the 0s and 1s) strong enough to be detected as 0s and 1s by the rest of the CPU's internals. If there isn't enough voltage, the CPU's internals will fail to detect the 0 or 1 state and make incorrect calculations, eventually leading to crashes.
Die shrinking and lithography
In order to increase computational power, more transistors were needed. While in theory it's possible to just make the CPU bigger, in practice that's a bad idea. First, you can only grow the chip up to a point: beyond it, electrical signals have to travel longer distances, reach their destinations later, and slow down processing. Second, you can only fit so many transistors on a chip with a given lithography before the CPU gets too hot to operate (because the electrical characteristics have to be ramped up for signals to travel reliably). To counter those problems, CPU engineers started looking for solutions before hitting either limit, and they came up with die shrinking, i.e. moving to a smaller lithography.
Making CPUs on smaller lithographies let manufacturers increase the transistor count per given area. Besides that, they could reduce voltage requirements, because each signal travels a much shorter distance. And as consumers we don't need huge CPUs, which was mostly a big concern for portable devices.
Longer signal travel distances would have meant higher voltage requirements. When a manufacturer can choose between more voltage and a smaller lithography, lithography wins: you need less power to complete the same tasks faster, which makes the CPU much more efficient.
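The efficiency point can be sketched with a rough rule of thumb (all numbers below are made up for illustration): the dynamic power of CMOS logic scales roughly as P = C * V^2 * f, so a voltage reduction enabled by a smaller lithography pays off quadratically.

```python
# Illustrative sketch only (made-up numbers): dynamic power in CMOS
# logic scales roughly as P = C * V^2 * f (switched capacitance,
# supply voltage, clock frequency).
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    return capacitance_f * voltage_v ** 2 * frequency_hz

# Hypothetical chip: same clock, but the shrunk die runs at half the voltage.
old = dynamic_power(1e-9, 1.2, 2e9)
new = dynamic_power(1e-9, 0.6, 2e9)
print(round(old / new, 6))  # -> 4.0, a quarter of the power for the same work
```

This is why halving the voltage matters far more than any linear tweak: the savings are squared.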
It's worth mentioning that, R&D costs aside, a smaller CPU costs less to make, which makes both the manufacturer and the buyer happier.
CPUs got faster and faster, with ever-increasing clock speeds, and RAM always had to play catch-up. At some point RAM couldn't get faster at the same rate as CPUs did, and engineers had a problem to solve: how were they supposed to make CPUs with higher clock speeds without fast RAM? They could have just raised the CPU's clock speed, but then a big problem would appear: for some cycles the CPU would be fed data, while for others it would have to wait for RAM to deliver it. The CPU would have to perform empty cycles while data loaded from RAM. That would be wasteful and inefficient: the CPU would sip power while doing nothing, and the user would waste time while the load from RAM happened.
At the time you could split the multipliers and abandon the classical 1:1 ratio between CPU and RAM clock speeds, but with performance penalties. Soon engineers came up with a solution: the cache.
What it enabled was this: while RAM was slower than the CPU in refresh rate, the cache could store a small amount of data quickly and let the CPU keep working while RAM loaded data in the meantime. Efficiency was greatly improved, and there was no longer any need to keep the clock speed ratio at 1:1 for optimal performance. That split could be made bigger and bigger, and the CPU clock speed improved further and further.
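The benefit of the cache can be sketched with a classic back-of-the-envelope formula (the latencies below are made up for illustration, not from the book): average memory access time = hit time + miss rate * miss penalty.

```python
# Sketch with made-up latencies: average memory access time (AMAT)
# when a fast cache sits in front of slow RAM.
def amat_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical: 1 ns cache hit, 100 ns trip to RAM on a miss.
print(amat_ns(1, 0.05, 100))   # 95% hit rate -> roughly 6 ns on average
print(amat_ns(1, 1.0, 100))    # cache never hits -> 101 ns every time
```

Even a small cache helps enormously as long as the CPU re-uses recently touched data, which most programs do.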
The IPC problem
While all those improvements are nice and highly helpful, there was another problem. The CPU's cycling rate could keep improving, but how much work got done per cycle was still limited. For example, suppose you needed 12 cycles to execute one instruction. Let's say our theoretical CPU runs at 1 Hz, so that instruction takes 12 seconds. If we double the clock speed to 2 Hz, the same task takes 6 seconds. Double it again to 4 Hz, and the task that once took 6 seconds now completes in just 3 seconds. Very impressive.
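The doubling above is just a division (using a 1 Hz baseline so a 12-cycle instruction takes 12 seconds):

```python
# Execution time of one fixed 12-cycle instruction as the clock doubles.
def exec_time_s(cycles, clock_hz):
    return cycles / clock_hz

for hz in (1, 2, 4):
    print(hz, "Hz:", exec_time_s(12, hz), "s")
# 1 Hz: 12.0 s, 2 Hz: 6.0 s, 4 Hz: 3.0 s
```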
Still, we would be limited by the 12 cycles needed to complete the instruction, and increasing the clock speed wasn't an easy task.
If in one CPU clock cycle we could process only one unit of information, then with pipelining we could try to process several units of information in a single time frame. That's awesome! But how does that work?
For this explanation it helps to imagine a work situation with humans. Imagine one man wants to screw in light bulbs in a factory. The factory is big and there are 1000 bulb sockets in total. Let's say it takes 1 minute to install each bulb, so the whole job takes 1000 minutes. That's a lot of time, almost 17 hours of pure work. A human understands what screwing in a light bulb means, but a computer's CPU needs instructions for what to do. Those could be broken down into:
Place a ladder under the light bulb socket
Screw in the light bulb
For a CPU, one abstract task like this means at least six smaller ones. A human performs those steps automatically.
Of course, a human might not want to work almost 17 hours, so he can ask more people to help. Then everyone does a smaller task in the same time frame. Some people, after finishing their step, could just move on to another light bulb socket, but they would have to wait for the other workers to finish theirs. People would be less exhausted overall, but the task would still take almost 17 hours.
If the data in the CPU pipeline is arranged well, the CPU can do tasks faster. Which brings us back to the question: how? For that we must know that some tasks inside the CPU can be done without waiting for others to complete. This table makes it easy to visualize:

Cycle:    1    2    3    4    5    6    7    8    9
Instr 1:  S1   S2   S3   S4   S5
Instr 2:       S1   S2   S3   S4   S5
Instr 3:            S1   S2   S3   S4   S5
Instr 4:                 S1   S2   S3   S4   S5
Instr 5:                      S1   S2   S3   S4   S5

In this table we have 5 instructions to complete, and each of them consists of 5 smaller tasks (stages). In total the work takes 9 CPU cycles. In the first cycle the CPU can execute only one small task, but in the second cycle it can start executing the next instruction alongside the previous one. This peaks at 5 instructions being worked on in a single cycle, one in each stage; in that steady state, one instruction finishes every cycle instead of one every 5 cycles.
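The pipeline fill described above can be sketched in a few lines (a toy model, not how any real CPU is scheduled):

```python
# 5 instructions, each needing 5 stages. Instruction i enters stage s
# at cycle i + s, so the batch takes stages + instructions - 1 cycles
# instead of stages * instructions.
def pipelined_cycles(n_instructions, n_stages):
    return n_stages + n_instructions - 1

print(pipelined_cycles(5, 5))   # -> 9, matching the table
print(5 * 5)                    # -> 25 cycles without pipelining

# Print which stage each instruction occupies in every cycle.
n_instructions, n_stages = 5, 5
for cycle in range(pipelined_cycles(n_instructions, n_stages)):
    row = []
    for i in range(n_instructions):
        stage = cycle - i  # which stage instruction i is in this cycle
        row.append("S%d" % (stage + 1) if 0 <= stage < n_stages else "--")
    print("cycle %d: %s" % (cycle + 1, " ".join(row)))
```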
A pipeline means the CPU has a "task completion line" in which, under ideal conditions, all the workers on the line work at the same time within a given time frame. If each stage of our production line is utilized efficiently, we perform 5 smaller tasks per time frame and keep 5 instructions in flight at once, without waiting for each to finish before starting the next. That would be awesome, and indeed it is.
If our light bulb workers did the same, imagine one person placing a ladder and moving on to the next bulb socket without letting another person climb up. Ouch!
Now someone might say that my light bulb worker example was a waste of time and a poor one. That would be correct and incorrect at the same time.
Correct in the sense that it doesn't show the theoretical workings of a CPU pipeline, but very correct in practice.
If a CPU were strictly pipelined and the "bulb task" were forced to run like that, it would crash. The thing is, many tasks can only be executed in sequence, while others can be broken down into smaller pieces and done independently.
So the question remains: what can be done? The answer is to execute several instructions at the same time without breaking them down; the pipelined CPU acts like a manager and assigns several people to work on light bulbs independently. So we're back to a single worker per instruction, but instead of a group of people working on one action sequence, several people work through their own sequences without waiting for the others, increasing efficiency and speed.
Now we can reliably execute several instructions per time frame, so for the same number of CPU cycles, multiple instructions can be executed.
Sadly, the pipeline length of a CPU is fixed in hardware and cannot be changed per task. So if the most common task takes 5 cycles and your CPU has a 12-stage pipeline, only 5 stages will be used and the rest will sit idle until the whole pipeline's work completes. So there are dangers in making the pipeline too long, while making it too short works fine but caps the maximum IPC. While the improvements mentioned earlier are definitive improvements, pipeline improvements only pay off if they are actually utilized; if not, they're a loss. Pipelining is therefore a bit of a gamble.
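The gamble can be sketched with invented numbers: finishing a stream of instructions takes depth + n - 1 cycles, but every time the pipeline must be emptied and refilled, roughly `depth` cycles of work are thrown away, so a deep pipeline loses more each time.

```python
# Toy model (invented numbers): cost of refilling a pipeline of a given
# depth. Each refill wastes roughly `depth` cycles.
def total_cycles(n_instructions, depth, refills):
    return depth + n_instructions - 1 + refills * depth

short_pipe = total_cycles(1000, 5, refills=50)
long_pipe = total_cycles(1000, 20, refills=50)
print(short_pipe, long_pipe)  # the deeper pipeline pays more per refill
```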
For the CPU to know how to arrange tasks in the pipeline, we can use a thing called branch prediction. It predicts what tasks will look like before their execution, which lets data be loaded before processing and therefore improves processing speed. That works well when the predictions are right; when they aren't, the CPU's pipeline is either poorly loaded or the memory-loading work has to be redone. Either way, an incorrect prediction costs processing speed, so we have more gambling going on. The good news is that branch prediction is correct far more often than not, so we usually get the speedup without the penalties.
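How much the gamble costs on average comes down to the predictor's hit rate (the accuracy and penalty figures below are made up for illustration):

```python
# Sketch with made-up numbers: average cycles lost per branch when a
# misprediction forces the pipeline to be flushed and refilled.
def avg_penalty_cycles(prediction_accuracy, flush_penalty):
    return (1 - prediction_accuracy) * flush_penalty

print(avg_penalty_cycles(0.95, 20))  # good predictor: ~1 cycle per branch
print(avg_penalty_cycles(0.50, 20))  # coin flip: 10 cycles per branch
```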
Core count increase
While CPUs can have more clock speed, cache, pipelining and branch prediction, what if we want several big tasks done at the same time? Normally that would happen slowly, but if we have more than one data factory (core) or processor, things can be done side by side: two serial tasks can be executed in parallel. To do this we could reuse the same cache, but we need two instruction pipelines working at the same refresh rate. As said before, not all tasks can be broken into smaller pieces, so those must run in series, without the ability to run in parallel. Meanwhile, some tasks can be broken into smaller ones and spread across several processing pipelines, or we can run two programs at once and utilize several cores. In theory we could see big speed gains if all cores are used, but in reality not everything can be broken into smaller pieces, meaning more cores don't guarantee speed improvements; they only offer the ability to improve speed if the program's author decides to use them.
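The "not everything can be broken down" limit is usually stated as Amdahl's law: if some fraction of a program is serial, that fraction caps the overall speedup no matter how many cores you add. The 90% parallel fraction below is an assumed example, not a figure from the book.

```python
# Amdahl's law: overall speedup with a given parallelizable fraction.
def speedup(parallel_fraction, cores):
    return 1 / ((1 - parallel_fraction) + parallel_fraction / cores)

for cores in (2, 4, 16, 1000):
    print(cores, "cores:", round(speedup(0.9, cores), 2), "x")
# even 1000 cores give less than 10x when 10% of the work stays serial
```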
Hopefully this gave some idea of how a CPU works. Sorry for the messy pipeline explanation; I mixed it up with core count a bit, but I tried to make clear which is which and what happens in each. There's still no explanation of what data width means in processing, or of some other things like Hyper-Threading.