Over the last few years we’ve seen a sustained migration of hardware toward architectures built around multi-core CPUs or multiple CPUs. I’ve even written about them here. This shift has happened mostly for performance reasons. Previously, computers were made faster by increasing the speed at which the CPU completed an instruction. This worked for many years, but eventually the performance gains from this method became minimal; another method was required to meet the growing demand for computing power.
Multi-core CPUs, or computers with many CPUs, allow the computer to execute multiple instructions at a time - one for each CPU core. If the cores are hyper-threaded, each core can also work on more than one thread of instructions at once. Adding cores can therefore increase computing performance significantly, provided the software can spread its work across them.
Until now, most multi-core computers available to consumers were limited to just a few cores. There are many reasons for this. Firstly, designers are just beginning to understand how to manufacture these CPUs efficiently and cheaply. Secondly, many of the issues involved in getting the maximum benefit from multiple cores are pushed onto the programmer, and the software industry is only beginning to catch up.
This sort of architecture also introduces complexities, cache coherency for instance. Basically, the term cache coherency refers to the process used to ensure that the data within each core’s cache is valid. If multiple cores have placed the same data in their caches and are both reading and writing it, one core’s copy may no longer match the other’s. There are protocols that ensure coherency amongst the caches of the cores. However, as the number of cores (and therefore the number of caches) grows, these protocols produce a significant amount of communication overhead. So much overhead, in fact, that the coherency traffic interferes with the cores’ other communication and performance degrades. Thus, either new, less noisy cache coherency protocols must be developed, or the noisy protocols must be dropped and the responsibility for cache coherency moved to the programmer.
Figure 1: Cache coherency illustration courtesy of Wikipedia
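To make the stale-data problem described above a bit more concrete, here is a minimal Python sketch of a toy model. Everything in it is an assumption made for illustration: the `Memory` and `Core` classes, the `demo` helper, and the simple write-through, write-invalidate scheme are mine, and real coherency protocols are considerably more sophisticated.

```python
class Memory:
    """Shared main memory: address -> value (toy model)."""
    def __init__(self):
        self.data = {}


class Core:
    """A core with its own private cache (hypothetical, for illustration)."""
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def read(self, addr):
        # Serve from the private cache if present, otherwise fetch from memory.
        if addr not in self.cache:
            self.cache[addr] = self.memory.data.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value, peers=()):
        # Write through to memory and update this core's own cached copy.
        self.cache[addr] = value
        self.memory.data[addr] = value
        # Write-invalidate: tell every peer to drop its copy of the address.
        messages = 0
        for peer in peers:
            peer.cache.pop(addr, None)
            messages += 1
        return messages


def demo(num_cores, coherent):
    """Return (value core 1 reads after core 0 writes, invalidation messages sent)."""
    memory = Memory()
    memory.data[0x10] = 1
    cores = [Core(memory) for _ in range(num_cores)]
    for core in cores:
        core.read(0x10)  # every core now caches the value 1
    peers = cores[1:] if coherent else ()
    messages = cores[0].write(0x10, 2, peers)
    return cores[1].read(0x10), messages
```

With `coherent=False` the second core keeps reading its old cached value even though memory has changed. With `coherent=True` it sees the new value, but a single write costs one message per other core, which hints at why the traffic grows as cores are added.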
Intel has recently developed a research chip (not for consumer availability) which pushes the limits of multi-core technology. The research chip is designed for academics and other industry partners to experiment on and provide feedback. This chip may be similar to what is available to consumers in the future. The chip itself has 48 cores, and its architecture is designed to scale to as many as 1000 cores!
Single Chip Cloud Computer
The multi-core chip that Intel has developed for research is called the Single Chip Cloud Computer, or SCC. I think the name aptly describes the capabilities of this chip. In fact, when thinking about this chip it’s helpful to picture a large room full of servers acting as a cloud.
The cores on this chip are actually networked together and follow a protocol to send messages to one another through a router. These messages are also how cache coherency gets handled: the SCC has no hardware coherency protocol, so keeping the caches coherent is a responsibility left to the programmer. The inter-core messaging capability is opened up to the programmer via a custom-built application programming interface (API) called RCCE (pronounced Rocky). This is one of the motivations for sharing the SCC with academics and industry partners first - so that they can develop the programming models and processes which work best for this programming paradigm.
RCCE is really just an optimized, lightweight library of commands that follow a protocol. The designers also got message passing on the chip working over TCP/IP. This worked great and really followed the thinking of the cloud. For one test they installed a different operating system on each core and had them speaking to each other via TCP/IP. This is exactly what you would have in an actual cloud - it was just on a single chip. This type of experiment, while interesting, doesn’t really push the limits of multiple CPUs though. In developing RCCE, Intel has made optimizations designed for networking at the chip level, and these require an evolution in programming paradigms.
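The message-passing style described above can be sketched without the real hardware. The Python below simulates cores as threads, each with its own mailbox, passing a token around a ring. To be clear, `send`, `recv`, and `run_ring` are hypothetical helpers I made up for this sketch; they are not the actual RCCE calls, and the core count is arbitrary.

```python
import queue
import threading


def run_ring(num_cores=4):
    """Pass a token around a ring of simulated cores and return its final value."""
    # One mailbox per core, standing in for a per-core message buffer.
    mailboxes = [queue.Queue() for _ in range(num_cores)]
    result = []

    def send(dest, payload):
        mailboxes[dest].put(payload)

    def recv(core_id):
        return mailboxes[core_id].get()  # blocks until a message arrives

    def core_main(core_id):
        next_core = (core_id + 1) % num_cores
        if core_id == 0:
            send(next_core, 0)          # core 0 starts the token at 0
            result.append(recv(0))      # then waits for it to come back around
        else:
            token = recv(core_id)
            send(next_core, token + core_id)  # add own id and pass it on

    threads = [threading.Thread(target=core_main, args=(i,))
               for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

Nothing is shared between the simulated cores except the messages themselves, which is the point: each core only learns about another core's data when that data is explicitly sent to it.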
Messages can reach each core because the cores are networked together. To explain how the cores are networked together I’ll need to also explain the general architecture of the chip.
The chip is actually a collection of tiles where each tile has two cores, a router shared by the cores, and a communications buffer. The router is connected to a mesh network on the die. The SCC utilizes both on-chip SRAM and off-chip DRAM, and both are directly accessible by the programmer. The SRAM can be accessed using the RCCE API. The DRAM can be accessed through a memory controller. The SCC actually has multiple memory controllers, and each group of tiles is assigned to always use one of them. For more information on memory controllers see my previous article on Direct Memory Access.
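The tile layout described above maps nicely onto a small data model. The following Python is only a sketch under stated assumptions: two cores per tile comes from the description above, but the 6x4 grid size, the four memory controllers, the quadrant-based controller assignment, and the dimension-ordered (X then Y) routing are assumptions I picked for illustration, and all of the names are hypothetical.

```python
from dataclasses import dataclass

CORES_PER_TILE = 2             # from the description above
MESH_COLS, MESH_ROWS = 6, 4    # assumed grid size, for illustration only


@dataclass(frozen=True)
class Tile:
    x: int
    y: int


def tile_of_core(core_id):
    """Map a core number to the tile that hosts it (row-major numbering assumed)."""
    tile_index = core_id // CORES_PER_TILE
    return Tile(x=tile_index % MESH_COLS, y=tile_index // MESH_COLS)


def hops(src, dst):
    """Router-to-router hops, assuming messages travel along x first, then y."""
    return abs(src.x - dst.x) + abs(src.y - dst.y)


def controller_of(tile):
    """Assumed assignment: one memory controller per quadrant of the mesh."""
    right = int(tile.x >= MESH_COLS // 2)
    lower = int(tile.y >= MESH_ROWS // 2)
    return lower * 2 + right
```

Under these assumptions, cores 0 and 1 share a tile and so need no mesh hops to talk to each other, while a message from core 0 to core 47 crosses the whole mesh. That difference in distance is one reason where a program runs on the chip can matter.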
The data in each core’s DRAM is cached in L1 and L2 in the expected manner. However, because there is no cache coherency built into the chip, the programmer is responsible for maintaining coherency. RCCE provides APIs to do this. You can learn more about this API here.
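What "the programmer is responsible" might look like in practice is easier to see in code. This is a toy Python sketch, not the RCCE API: the `CoreCache` class and its `flush` and `invalidate` methods are hypothetical names, and the write-back behaviour is an assumption chosen to keep the example small.

```python
class SharedMemory:
    """Off-chip memory shared by all cores (toy model)."""
    def __init__(self):
        self.data = {}


class CoreCache:
    """A private write-back cache with no hardware coherency (hypothetical)."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # addr -> cached value
        self.dirty = set()  # addresses written but not yet flushed

    def load(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        # The new value stays in this cache until the programmer flushes it.
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self, addr):
        # Push a locally modified value out to shared memory.
        if addr in self.dirty:
            self.memory.data[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        # Drop the local copy so the next load re-reads shared memory.
        self.lines.pop(addr, None)
        self.dirty.discard(addr)


def exchange(disciplined):
    """Producer writes 42; return what the consumer then reads."""
    memory = SharedMemory()
    memory.data[0] = 1
    producer, consumer = CoreCache(memory), CoreCache(memory)
    consumer.load(0)            # consumer caches the old value 1
    producer.store(0, 42)
    if disciplined:
        producer.flush(0)       # the writer publishes its change
        consumer.invalidate(0)  # the reader discards its stale copy
    return consumer.load(0)
```

If either the flush or the invalidate is forgotten, the consumer quietly reads old data. Nothing crashes, which is exactly what makes this kind of bug hard to find.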
Another function provided by RCCE is power management. While still in its infancy, the power management capabilities are quite interesting. To manage power, the SCC can currently control the frequency of each tile (or the entire mesh) and the voltage delivered to each group of four tiles - called a voltage island. This combination of frequency and voltage control is what allows the SCC to manage its power usage.
While at first this might not sound all that impressive, you have to know that you can’t just assign any random voltage to the voltage island. The allowable voltages are defined by the frequency of the tiles. Likewise the allowable frequencies are defined by the voltage level. RCCE handles these complexities to ensure that the programmer cannot accidentally fry the chip by improperly assigning a frequency or voltage.
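The coupling between voltage and frequency can be pictured as a lookup table of allowed operating points. The Python below is a sketch only: the voltage and frequency values are invented for illustration, and `max_frequency`, `min_voltage`, and `request` are hypothetical helpers rather than anything RCCE actually exposes.

```python
# (voltage in volts, highest frequency in MHz allowed at that voltage).
# These numbers are invented for illustration; they are not SCC specifications.
OPERATING_POINTS = [
    (0.7, 200),
    (0.9, 400),
    (1.1, 800),
]


def max_frequency(voltage):
    """Highest frequency that is safe at the given voltage."""
    allowed = [freq for volts, freq in OPERATING_POINTS if volts <= voltage]
    if not allowed:
        raise ValueError(f"{voltage} V is below the lowest operating point")
    return max(allowed)


def min_voltage(frequency):
    """Lowest voltage that supports the given frequency."""
    for volts, freq in OPERATING_POINTS:  # table is sorted by voltage
        if freq >= frequency:
            return volts
    raise ValueError(f"{frequency} MHz is above the highest operating point")


def request(voltage, frequency):
    """Accept a (voltage, frequency) pair only if the combination is allowed."""
    if frequency > max_frequency(voltage):
        raise ValueError(
            f"{frequency} MHz needs at least {min_voltage(frequency)} V"
        )
    return voltage, frequency
```

A library that sits between the programmer and the hardware can run a check like `request` before touching anything, which is how a bad combination gets rejected instead of reaching the chip.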
It should also be noted that these changes are not instantaneous. It may take hundreds of clock cycles for a change to take effect. So this type of power management should not be applied around a small number of instructions, but rather over longer periods of time. For instance, if a server were using the SCC, a given program might be assigned to a certain tile, similar to how processor affinity works. When this happens, a decision could be made about how much power is required to run that program. Perhaps the program is not very intensive and could run quite well at very low power. In this case the power of that tile could be lowered to reduce the power usage of the overall server. This is a big deal. Not only could you save space by having so many cores in one server, but you can save power too!
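A little arithmetic shows why the transition time matters and why lowering voltage is so attractive. The Python below is a rough sketch: it assumes the common approximation that dynamic power scales with voltage squared times frequency (ignoring leakage), and the function names and cycle counts are made up for illustration.

```python
def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """Power relative to baseline, assuming dynamic power ~ V^2 * f (leakage ignored)."""
    return voltage_ratio ** 2 * frequency_ratio


def transition_overhead(transition_cycles, work_cycles):
    """Fraction of total cycles spent waiting for a power change to take effect."""
    return transition_cycles / (transition_cycles + work_cycles)


# Halving both voltage and frequency leaves one eighth of the dynamic power.
print(relative_dynamic_power(0.5, 0.5))

# A 500-cycle transition wrapped around only 500 cycles of work wastes half the time...
print(transition_overhead(500, 500))

# ...but wrapped around millions of cycles it is lost in the noise.
print(transition_overhead(500, 4_999_500))
```

Under these assumptions the savings are large, but only when the lower setting is held long enough that the hundreds of cycles spent switching are a small fraction of the total.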