February 16, 2010

Single-chip Cloud Computer Symposium at Intel Labs in Santa Clara

A few months back, in December 2009, Intel Labs announced a new many-core research prototype called the SCC. SCC stands for Single-chip Cloud Computer; it integrates 48 Intel Architecture (IA) cores on a single chip, the largest number ever put on a single CPU. Last week, on February 12th, Intel Labs held the SCC Symposium, inviting researchers to get to know this chip in more detail. The goal of the symposium was to give researchers a really close look at the chip and its capabilities, the idea being that attending researchers apply for access to such a system in order to explore the possibilities of many-core computing.

In this blog post I will give you an introduction to the SCC architecture, its challenges and opportunities.

SCC Platform

As mentioned before, the SCC platform is a single chip with 48 IA cores. The cores are arranged in a two-dimensional array of 6 x 4 tiles, with two cores per tile; each core is a Pentium-class processor running at a maximum frequency of 1.0 GHz. Each core has a private 16 kB L1 cache and a private 256 kB L2 cache. Main memory is attached to the chip through 4 DDR memory controllers.
For on-chip communication, all cores are connected to each other through an on-chip mesh network with 256 GB/s bisection bandwidth. Furthermore, the SCC introduces hardware support for message passing: the two cores on a tile share an additional 16 kB message passing buffer (MPB). Access to the MPB is very fast, and each core can write into and read from any other core's MPB. The MPB has the same structure and size as a core's L1 cache, but instead of being private memory it is shared.
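To make the MPB idea more concrete, here is a minimal Python sketch of how message passing through per-core shared buffers might look. This is not SCC or RCCE code: the flag-plus-length framing protocol and the single-message-per-buffer limitation are my assumptions for illustration.

```python
# Toy emulation of MPB-style message passing: every "core" owns a
# 16 kB buffer that any other core may write into. Byte 0 is a
# "message ready" flag, bytes 1-2 hold the payload length, and the
# payload follows (framing is an assumption, not the RCCE protocol).

MPB_SIZE = 16 * 1024  # 16 kB, as on an SCC tile

class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.mpb = bytearray(MPB_SIZE)  # shared: other cores write here

    def send(self, dest, payload: bytes):
        assert len(payload) <= MPB_SIZE - 3, "message must fit in the MPB"
        dest.mpb[1:3] = len(payload).to_bytes(2, "little")
        dest.mpb[3:3 + len(payload)] = payload
        dest.mpb[0] = 1  # raise the ready flag last

    def recv(self):
        if self.mpb[0] != 1:
            return None  # nothing arrived yet
        n = int.from_bytes(self.mpb[1:3], "little")
        msg = bytes(self.mpb[3:3 + n])
        self.mpb[0] = 0  # clear the flag: buffer is free again
        return msg

a, b = Core(0), Core(1)
a.send(b, b"hello core 1")
print(b.recv())  # b'hello core 1'
```

On the real chip the same pattern would run in parallel on two cores, with the flag write ordering guaranteeing that the receiver never sees a half-written payload.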

Intel Labs engineers took special care to make inter-core communication as fast as possible, and the resulting latency is amazingly low. Since each message travels via the routers of the mesh network, the latency depends on the number of hops a message takes, with each hop adding 4 cycles. Another interesting fact is that the router network runs at a different clock speed than the cores, and Intel engineers said this design decision was made on purpose: the mesh network is over-provisioned so that the cores cannot max out the available on-chip bandwidth.
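The numbers above make worst-case routing latency easy to estimate. The following Python sketch computes hop latency between tiles on the 6 x 4 grid; minimal Manhattan-distance routing is my assumption here, not a documented detail.

```python
# Estimate mesh latency between tiles on the SCC's 6 x 4 grid,
# assuming minimal (Manhattan-distance) routing at 4 cycles per hop.

GRID_W, GRID_H = 6, 4
CYCLES_PER_HOP = 4

def tile_coords(tile):
    """Tiles numbered 0..23 in row-major order."""
    return tile % GRID_W, tile // GRID_W

def hop_latency(src_tile, dst_tile):
    sx, sy = tile_coords(src_tile)
    dx, dy = tile_coords(dst_tile)
    hops = abs(sx - dx) + abs(sy - dy)
    return hops * CYCLES_PER_HOP

# Neighbouring tiles pay 4 cycles; opposite corners pay (5 + 3) * 4.
print(hop_latency(0, 1))   # 4
print(hop_latency(0, 23))  # 32
```

So even the longest path across the chip costs only a few dozen router cycles, which explains why the latency figures quoted at the symposium were so low.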
Because the SCC lacks hardware support for cache coherence, it is not possible to simply boot any given operating system on the SCC platform and use all 48 cores at the same time. Instead, Intel Labs decided to boot a separate Linux instance on each core. Inter-core communication is handled by RCCE (pronounced "Rocky"), a communication library based on message passing. How future operating systems could drive all cores of the SCC is an open question that Intel has put out to the research community - for example, the Barrelfish project at ETH Zurich is considering porting its next-generation multi-core OS to the SCC platform.


The system design of the SCC allows ultra-fast communication on the chip, but it introduces a couple of challenges that software engineers have to deal with in order to exploit the chip to the maximum extent. For the sake of a simple chip design, the maximum address space is limited to 64 GB and -- more importantly -- to an address length of 32 bits. As a consequence, each core can only address 4 GB of main memory at a time. Of course the lack of 64-bit addresses can be problematic - especially in the world of main memory databases - but Intel expects the research community to scale down their problems and promises to add more features over time as the SCC platform evolves. Since system design and fabrication is a long and expensive process, the goal is to identify the sets of features that software engineers require and add them to the platform in order of priority.

Another important fact is that there is no hardware support for cache coherence, so each core must consequently take care of this problem itself. The reason for this decision is that with hardware cache coherence the system bus would be saturated too early by snoop messages triggered by changes to cached memory, slowing down the complete system. To ease memory management, the SCC introduces lookup tables (LUTs). Each LUT can be dynamically configured and modified at runtime, allowing dynamic changes to how the available physical memory is mapped into a core's local address space. The challenge here is to use application-level cache coherence protocols for managing memory access.
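The LUT mechanism also explains how a 32-bit core can still reach memory beyond its own 4 GB window. Here is a Python sketch of the translation step; the specific geometry (256 entries, each mapping a 16 MB segment) is commonly reported for the SCC, but treat it as an assumption here.

```python
# Sketch of LUT-based address translation: a core's 32-bit address is
# split into a LUT index (top 8 bits) and an offset within a 16 MB
# segment; the LUT entry supplies the segment's base address in the
# larger (64 GB) system address space. Entry count and segment size
# are assumptions based on common SCC descriptions.

SEGMENT_BITS = 24      # 16 MB segments
NUM_ENTRIES = 256      # 256 * 16 MB = the core's 4 GB address space

lut = [None] * NUM_ENTRIES  # entry -> system base address, or unmapped

def map_segment(entry, system_base):
    assert system_base % (1 << SEGMENT_BITS) == 0, "base must be aligned"
    lut[entry] = system_base  # reconfigurable at runtime, as on the SCC

def translate(core_addr):
    entry = core_addr >> SEGMENT_BITS
    offset = core_addr & ((1 << SEGMENT_BITS) - 1)
    if lut[entry] is None:
        raise ValueError("unmapped segment")
    return lut[entry] + offset

# Map core segment 1 to a region far beyond the core's own 4 GB limit:
map_segment(1, 0x2_0000_0000)        # an 8 GB system address
print(hex(translate(0x0100_0042)))   # 0x200000042
```

By remapping entries at runtime, an application-level coherence protocol can decide which parts of the 64 GB each core sees at any moment.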


The SCC platform offers a wide range of possible research topics that could not be addressed until now. Besides topics related to operating systems and clusters, I see the following interesting topics with regard to main memory databases:
Power Management - The SCC allows very fine-grained control of power management. The chip is divided into multiple frequency and voltage islands: each tile with its two cores can run at its own frequency, and each block of 4 tiles (8 cores) can run at an individual voltage. Both frequency and voltage can be changed dynamically, allowing the system to run anywhere between 125 W and 25 W(!) of power consumption. From my point of view, fine-grained power control is a very important factor in densely packed data centers, and it also matters in scenarios where power consumption maps directly to TCO (e.g. in SaaS applications).
Compute location latency - With 4 memory controllers and 24 routers on the chip, it is likely that data has to be fetched from far away, so the hop distance must be optimized. Thinking about what the query plan of a main memory DBMS looks like, this issue becomes even more important: the query scheduler has to assign sub-plans to specific cores while always making sure that a certain maximum distance is never exceeded.
Message passing based query processing - It would be really interesting to see how one could implement query processing using message passing on a single chip, and how this might affect the layout of data in memory (i.e. rows, columns, or hybrid).
Database OS - Besides the typical Linux environment, the SCC allows running bare-metal C applications, making it possible to implement a DBMS as close to the CPU as possible. It is important to mention that this is a tedious task, but it comes with all the opportunities mentioned above.
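The island layout described under Power Management above can be sketched as a simple mapping from cores to their frequency and voltage domains. A Python sketch; the 2 x 2-tile island geometry is my assumption, derived from the "4 tiles, 8 cores per voltage island" figure.

```python
# Map a core id (0..47) to its frequency island (its tile) and its
# voltage island (a block of 4 tiles / 8 cores). The 2 x 2-tile
# voltage-island geometry on the 6 x 4 grid is an assumption based on
# the "each block of 4 tiles can run at an individual voltage" figure.

GRID_W = 6  # tiles per row on the SCC

def frequency_island(core):
    """One frequency domain per tile; two cores share a tile."""
    return core // 2

def voltage_island(core):
    tile = core // 2
    tx, ty = tile % GRID_W, tile // GRID_W
    return (ty // 2) * (GRID_W // 2) + (tx // 2)  # 3 x 2 = 6 islands

# Cores 0 and 1 share a tile; the 48 cores span exactly 6 voltage islands.
print(frequency_island(0), frequency_island(1))          # 0 0
print(len({voltage_island(c) for c in range(48)}))       # 6
```

A power-aware scheduler could use such a mapping to decide which island to slow down or undervolt when parts of a workload are idle.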
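The compute-location-latency idea above can also be sketched: a scheduler picks, for each sub-plan, the free tile closest to the memory controller holding its data, and rejects placements beyond a hop budget. The controller positions at the four grid corners are purely an assumption for illustration.

```python
# Toy hop-aware sub-plan placement on the SCC's 6 x 4 tile grid.
# Memory controller positions at the grid corners are an assumption;
# the real chip attaches its 4 DDR controllers at the chip's edges.

GRID_W, GRID_H = 6, 4
CONTROLLERS = {0: (0, 0), 1: (5, 0), 2: (0, 3), 3: (5, 3)}  # assumed

def hops(a, b):
    """Manhattan distance between two tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def place(subplan_mc, free_tiles, max_hops=3):
    """Pick the free tile closest to memory controller subplan_mc,
    enforcing a maximum hop distance to bound access latency."""
    mc = CONTROLLERS[subplan_mc]
    best = min(free_tiles, key=lambda t: hops(t, mc))
    if hops(best, mc) > max_hops:
        raise RuntimeError("no free tile within the hop budget")
    return best

free = [(2, 1), (4, 0), (1, 3)]
print(place(1, free))  # (4, 0): only 1 hop from the controller at (5, 0)
```

A real query scheduler would of course also weigh load balance and data placement, but the hop-budget check captures the "never exceed a certain maximum distance" constraint.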


To summarize my impressions from the SCC Symposium: I really enjoyed it, and I would like to thank Intel for their great work. What impressed me most was Intel's very open attitude towards the research community. Instead of hiding source code and tools behind corporate walls, they announced that they will publish all the tools they have developed, and in exchange they expect the community to build new tools, extending the already available tools and documentation.
By the way, Intel and HPI are currently in the process of setting up a research cooperation. As part of the Future SOC Lab initiative, Intel intends to make the SCC physically available to HPI as one of the first institutions in Europe.

Author: Martin Grund