TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and while this is true, the really interesting technology is in the interconnects. The overall chip is quite a marvel, and it is unlike any mainstream CPU you have ever heard of.

Making a lot of cores on a chip isn’t very hard. Larrabee, for example, has 32 heavily modified Pentium (P54) cores as the basis of the GPU. If Intel wanted to, it could put hundreds of cores on a die; that part is actually quite easy. Keeping those cores fed is the most important problem of modern chipmaking, and that part is not easy.

Large caches, wide memory busses, on-chip ring busses, stacking, and optical interfaces are all attempts to feed the beast. Everyone thought Intel’s Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago, was about packing cores onto a die. It wasn’t; it was a test of routing algorithms and structures. Routing is where the action is now; packing cores in is not a big deal.

Routing is where Tilera shines. It has put a great deal of thought into getting data from core to core with minimal latency and problems. Its rather unique approach involves five different interconnect networks, programmable partitioning, accelerators, and simply tons of I/O. Together, these allow Tilera’s third generation Tile-Gx CPUs to scale from 16 to 100 cores without choking on congestion. They may not have the same single-threaded performance as a Nehalem or Shanghai core, but they make up for it with volume.

Tilera 100 core chip

The basic structure is a square array of small cores, 4×4, 6×6, 8×8 or 10×10, each connected via five (5) on-chip networks, and flanked by some very interesting accelerators. The cores themselves are a proprietary 32-bit ISA in the first two generations of Tilera chips, and in the Gx, it is extended to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and the memory controller now sees 64 bits as well.

In previous generations, there was no floating-point (FP) hardware in Tilera products. The company strongly recommended against using FP code because it had to be emulated, taking hundreds or thousands of cycles. With the new Gx series chips, FP code is still frowned upon, but there is some FP hardware to catch the odd instruction without a huge speed hit. The 100 core part can do 50 GigaFLOPS of FP, which may sound like a large number, but that is only about 1/50th of what an ATI Cypress HD5870 chip can do.

The majority of the new instructions are aimed at what the Tilera chips do best, integer calculations. Things like shuffle and DSP-like multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where these new chips shine. Basically, the Gx moves information around very quickly while twiddling bits here and there with integer functions.

While the cores might not be overly complex, the on-chip busses are. Each Gx core has 64K of L1 cache, 32K data and 32K instruction, along with a unified 8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent, and the cache subsystem can reorder requests to other caches or DRAM. On top of this, the core supports cache pinning to keep often used data or instructions in cache. On the 100 core model, the Gx has 32MB of cache.
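The 32MB total can be checked against the per-tile numbers. The sketch below assumes the figure is simply L1 plus L2 summed over every tile; the function and constant names are illustrative, not anything from Tilera.

```python
# Per-tile cache sizes as quoted in the article, in KB.
L1_DATA_KB = 32
L1_INSTR_KB = 32
L2_KB = 256


def total_cache_kb(tiles: int) -> int:
    """Sum of L1 and L2 capacity across all tiles, in KB (assumed accounting)."""
    return tiles * (L1_DATA_KB + L1_INSTR_KB + L2_KB)


if __name__ == "__main__":
    for tiles in (16, 36, 64, 100):
        kb = total_cache_kb(tiles)
        print(f"{tiles:3d} tiles: {kb} KB total, {kb / 1024:.2f} MB at 1024 KB per MB")
```

At 320KB per tile, 100 tiles come to 32,000KB, which rounds to the quoted 32MB.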

Tiles are the name Tilera uses for a basic unit of repetition. The 16 core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core, the L1 and L2 caches, and something Tilera calls the Terabit Switch. More than anything, this switch is the heart of the chip.

A Tilera tile

Remember when we said that cramming 100 cores on a die is not a big problem, but feeding them is? The Terabit Switch is how Tilera solves the problem, and it is a rather unique solution. Instead of one off-core bus, there are five. Each of them has a dedicated purpose, and that not only gives huge bandwidth, it also goes a fair way towards minimizing contention. Cache traffic will never be stepped on by user data, and so on.

The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two generations of Tilera chips, all of these networks were 32 bits wide, but on the Gx, the widths vary to give each one more or less bandwidth depending on their functions.

QDN is called the reQuest Dynamic Network, and it is used for memory and cache. QDN is 64 bits wide. RDN is the Response Dynamic Network, and it is used to feed memory reads back to the chips. RDN is 112 bits wide, an unusual width, 64 + 48 from the look of it.

FDN is the widest at 128 bits, and it is used for cache to cache transfers and cache coherency. Given the critical nature of cache transactions like this, the width is no surprise. The last two, IDN and UDN, are both 32 bits wide. IDN is the I/O Dynamic Network, and passes data on and off the chip. With a dedicated channel for off-chip transfers, you can see that reaching theoretical numbers was a priority at Tilera.

The last network, UDN, is the User Dynamic Network, basically the one users get to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping; they work in the background. If you want to send things from point A to point B, you send it across the UDN.
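To keep the five networks straight, here is a small lookup table built only from the widths and roles given above. The class and function names are made up for illustration, and the "combined width" is just a sum of the quoted figures, not a number Tilera published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshNetwork:
    name: str
    width_bits: int
    purpose: str


# Widths and roles as described in the article for the Tile-Gx.
GX_NETWORKS = (
    MeshNetwork("QDN", 64, "memory and cache requests"),
    MeshNetwork("RDN", 112, "memory read responses"),
    MeshNetwork("FDN", 128, "cache to cache transfers and coherency"),
    MeshNetwork("IDN", 32, "I/O on and off the chip"),
    MeshNetwork("UDN", 32, "user-directed tile to tile messages"),
)


def combined_width_bits(networks=GX_NETWORKS) -> int:
    """Total width if one link of every network is counted once."""
    return sum(n.width_bits for n in networks)


if __name__ == "__main__":
    for net in GX_NETWORKS:
        print(f"{net.name}: {net.width_bits:3d} bits, {net.purpose}")
    print("Gx combined:", combined_width_bits(), "bits")
    print("Earlier generations (five networks at 32 bits):", 5 * 32, "bits")
```

Counted this way, the Gx carries 368 bits across the five networks, against 160 bits when all five were 32 bits wide.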

Although Tilera didn’t explicitly state it, each hop from router to router takes one cycle. This means that in a pathological case, corner core to memory on the far corner, it could take 19 cycles to go from request to memory, plus the memory round trip time, and then another 19 cycles to get back. That is what you call a long time in computer speak. Even in an ‘average’ case, you have a 10 cycle latency, which is very long as well.
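A rough sketch of the hop arithmetic, assuming minimal routing on a square mesh and one cycle per hop as inferred above. The function names are illustrative. Corner to corner on a 10×10 grid is 18 hops under these assumptions; the 19 cycles quoted would be consistent with one extra step to reach a memory controller off the edge of the array, though that is a guess rather than something Tilera stated.

```python
def hops(src, dst):
    """Manhattan distance between two (row, col) tiles on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def worst_case_hops(n):
    """Hops between opposite corners of an n x n mesh."""
    return hops((0, 0), (n - 1, n - 1))


def mean_hops(n):
    """Average hops over every ordered pair of tiles, same-tile pairs included."""
    tiles = [(r, c) for r in range(n) for c in range(n)]
    total = sum(hops(a, b) for a in tiles for b in tiles)
    return total / (len(tiles) ** 2)


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        print(f"{n}x{n}: worst case {worst_case_hops(n)} hops, mean {mean_hops(n):.2f} hops")
```

This only counts tile to tile distance. It does not model the trip to a memory controller, congestion, or the DRAM access itself, so real latencies will be higher than the hop count alone.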

To be fair, the Tilera architecture is not made to run general purpose code. As it was described when the first generation came out, workloads are meant to be chunked up, so a single tile does a function, then the data gets passed to the next tile for more work, and so on and so forth. If your program has 20 steps, you use 20 tiles and pipeline the work.
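One way to picture that pipelining is to lay the stages out so each one sits right next to the following one on the mesh. The sketch below uses a serpentine (back and forth) placement; this is purely an illustration of the idea, not Tilera's tooling or a placement the company recommends, and the function names are hypothetical.

```python
def serpentine_placement(stages, n):
    """Assign pipeline stages to (row, col) tiles of an n x n mesh so that
    consecutive stages are always exactly one hop apart."""
    if stages > n * n:
        raise ValueError("more stages than tiles")
    coords = []
    for i in range(stages):
        row, col = divmod(i, n)
        if row % 2 == 1:
            # Odd rows run right to left so the path snakes back.
            col = n - 1 - col
        coords.append((row, col))
    return coords


def hop_distance(a, b):
    """Manhattan distance between two tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


if __name__ == "__main__":
    placement = serpentine_placement(20, 10)
    for stage, tile in enumerate(placement):
        print(f"stage {stage:2d} -> tile {tile}")
```

With a 20 step program on the 100 core part, the stages fill the first two rows and every hand-off between stages is a single hop.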

This solves many of the problems with variable latency and multi-hop traffic. The other more elegant solution is the ability to section off chunks of the chip into sub-units. There is a hypervisor that can partition each Gx chip into programmable blocks.

Sub-sections of tiles

As you can see in the diagram above, each Gx is broken up into sub-chips in software. You can give each process as much CPU power as it needs, and arrange it so the output of one block feeds into the input of the next in a single clock. This example has two Apache web server instances, an intrusion prevention system (IPS), a secure sockets layer (SSL) stack, a network stack and a few other processes running next to each other.

The Apache instances have their own memory controller, as do the IPS and the SSL stack. The network stack is sitting on top of the memory controller for decreased latency. Basically, the programmer can choose where to put each process to minimize latency. It doesn’t take much to figure out how to apply these concepts to a database plus web server scenario, or a three-tiered SAP-like workload.

Basically, Tilera allows you to explicitly place the data and compute resources where, when and how you need them. The chunks are done at roughly the same level as hardware VMs are in x86 CPUs, running below the level that a process can affect. This creates hardware walls to segregate data transfers, cache coherency traffic, and other tile to tile transfers. If done correctly, it can minimize latency a lot in addition to keeping processes from stepping on each other.
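The partitioning idea can be sketched as rectangles carved out of the tile grid. Everything here is hypothetical: the layout, the block sizes and the names are invented for illustration on an 8×8 part, and this is not how Tilera's hypervisor is actually configured.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Partition:
    name: str
    row: int   # top-left tile row
    col: int   # top-left tile column
    rows: int  # height in tiles
    cols: int  # width in tiles

    def tiles(self):
        return {
            (r, c)
            for r in range(self.row, self.row + self.rows)
            for c in range(self.col, self.col + self.cols)
        }


def validate(partitions, n):
    """Check partitions fit an n x n mesh and do not overlap.
    Returns the number of tiles assigned."""
    for p in partitions:
        for r, c in p.tiles():
            if not (0 <= r < n and 0 <= c < n):
                raise ValueError(f"{p.name} falls outside the mesh")
    for a, b in combinations(partitions, 2):
        if a.tiles() & b.tiles():
            raise ValueError(f"{a.name} overlaps {b.name}")
    return sum(len(p.tiles()) for p in partitions)


# Hypothetical layout loosely modelled on the workloads named in the article.
LAYOUT = (
    Partition("apache-1", 0, 0, 4, 4),
    Partition("apache-2", 0, 4, 4, 4),
    Partition("ips", 4, 0, 4, 4),
    Partition("ssl", 4, 4, 2, 4),
    Partition("net-stack", 6, 4, 2, 4),
)

if __name__ == "__main__":
    print("tiles assigned:", validate(LAYOUT, 8), "of", 8 * 8)
```

The non-overlap check stands in for the hardware walls described above: a tile belongs to exactly one block, so traffic inside a block has no reason to cross into a neighbour.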

Now that you know how the cores work, talk, and are partitioned, what about the ‘uncore’? That discussion starts with the memory controllers: four DDR3-2133 banks on the 64 and 100 core Gx, two on the 16 and 36 core models. For the keen eyed out there, this means Tilera has two different socket configurations, one for the 64 and 100 core chips, and another one for the 16 and 36 core chips.

DDR3-2133 memory is very fast, hugely fast in fact. The math says 17GBps per controller, so 68GBps for the chip, with an 80 cycle ‘typical’ load-to-use latency. Each controller can support 16 ranks, vastly more than a PC controller. In all, the 64 and 100 core Gxs can support 1TB of memory if you can find 4Gb DDR3 chips. The memory controllers also support request reordering and some QoS provisioning.
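The bandwidth math works out as follows, assuming each controller drives a standard 64-bit DDR3 data bus (the article does not state the bus width, so that is an assumption). The function name is illustrative.

```python
def ddr3_peak_gb_per_s(mega_transfers_per_s, bus_bits=64):
    """Peak bandwidth of one DDR3 channel in GB/s (1 GB = 1e9 bytes).
    bus_bits=64 is an assumed standard data bus width."""
    bytes_per_transfer = bus_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    per_controller = ddr3_peak_gb_per_s(2133)
    print(f"per controller: {per_controller:.1f} GB/s")
    print(f"4 controllers (64 and 100 core): {4 * per_controller:.1f} GB/s")
    print(f"2 controllers (16 and 36 core): {2 * per_controller:.1f} GB/s")
```

That gives roughly 17GBps per controller and 68GBps for the four-controller parts. These are peak figures; sustained bandwidth will be lower.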

Moving along to I/O, there are 20 PCIe 2.0 lanes, arranged in three controllers of 8, 8 and 4 lanes. There are also 8 Ethernet controllers capable of supporting 4 GigE lanes per controller. These can be aggregated into 8 10Gb or 2 40Gb lanes. Basically, this chip has a lot of available bandwidth. As you might imagine, on the 16 and 36 core variants, there are only half the controllers, so half the bandwidth.
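For a sense of scale, the I/O numbers can be added up. PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, so each lane carries 4Gbps of data in each direction; the rest comes straight from the port counts above. The function names are illustrative.

```python
PCIE2_GT_PER_S = 5.0     # PCIe 2.0 raw signalling rate per lane
PCIE2_ENCODING = 8 / 10  # 8b/10b line coding leaves 80% for data


def pcie2_gbps(lanes):
    """Usable PCIe 2.0 bandwidth per direction, before protocol overhead."""
    return lanes * PCIE2_GT_PER_S * PCIE2_ENCODING


def ethernet_gbps(ports, gbps_per_port):
    """Aggregate Ethernet line rate for a given port configuration."""
    return ports * gbps_per_port


if __name__ == "__main__":
    print(f"PCIe, 20 lanes: {pcie2_gbps(20):.0f} Gbps per direction")
    print(f"PCIe,  8 lanes: {pcie2_gbps(8):.0f} Gbps per direction")
    print("Ethernet as 8 x 10Gb:", ethernet_gbps(8, 10), "Gbps")
    print("Ethernet as 2 x 40Gb:", ethernet_gbps(2, 40), "Gbps")
    print("Ethernet as 32 x GigE:", ethernet_gbps(8 * 4, 1), "Gbps")
```

The 8 lane result, 32Gbps, is the same figure the article quotes later for the FPGA interface.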

In addition, you have generic controllers for USB, UARTs, JTAG and I2C. Given that Tilera chips are basically embedded, these are not likely to be used for much more than booting and diagnostics.

On the core diagram above, there are two other blocks, the orange MiCA and mPIPE accelerators. These are where the other parts of the Tilera Gx ‘magic’ happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is short for multicore Programmable Intelligent Packet Engine. If it isn’t blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O.

The mPIPE does a lot of interesting things, all supposedly at wire speed. It has a programmable packet classification engine, said to be usable at 80Gbps or 120M packets per second. It can twiddle headers and do other evil things that would make Comcast drool with the potential for ‘network management’ extortion payments.
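The two figures line up if the packet rate is quoted for minimum-size Ethernet frames, which is an assumption on our part. A 64 byte frame occupies 84 bytes on the wire once the 8 byte preamble and 12 byte inter-frame gap are counted. The names below are illustrative.

```python
ETH_MIN_FRAME_BYTES = 64       # minimum Ethernet frame
ETH_PREAMBLE_BYTES = 8         # preamble plus start-of-frame delimiter
ETH_INTERFRAME_GAP_BYTES = 12  # mandatory idle time between frames


def max_packets_per_second(line_rate_bps, frame_bytes=ETH_MIN_FRAME_BYTES):
    """Upper bound on packets per second at a given Ethernet line rate."""
    wire_bits = (frame_bytes + ETH_PREAMBLE_BYTES + ETH_INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / wire_bits


if __name__ == "__main__":
    pps = max_packets_per_second(80e9)
    print(f"80 Gbps of minimum-size frames: {pps / 1e6:.1f} M packets/s")
```

That lands just under 120M packets per second, so the claim amounts to classifying worst-case small packets at the full 80Gbps.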

In addition, it can also load balance across the various I/O lanes, and redirect tile to tile ‘I/O’ in a somewhat intelligent fashion. On top of that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep latencies low. Think of it as a programmable housekeeping offload engine.

The most interesting bit is that the mPIPE can tag a packet with a 32 bit header before it sends it onto the internal network. This is where the programmable part shines. You can set up fields in the I/O packet itself to pass along pre-decode information and other time-saving tidbits. Since I/O is fully virtualizable, you could theoretically tag the packets with VM data, or just about anything else a bored programmer can think of.

The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto offload engines. They can work either ‘inline’ or as full-blown offload engines; that is up to the programmer. The MiCA can pull data directly from caches or main memory without CPU overhead, basically fire and forget.

If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA, DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be encrypted along with any other text that uses correct grammar. RLY.

Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the 100 core, 20K keys per second for the 36 core. Not bad at all. In addition, the MiCA supports a hardware compression engine that uses the tried and true Deflate algorithm.

The last piece of the puzzle is something that Tilera calls external acceleration interfaces. This could be as simple as plugging in a PCIe card, but that lacks elegance. The interesting part is a field programmable gate array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the FPGA to the serial deserial unit (SerDes) to enable basically direct and low latency 32Gbps transfers. Direct transfers to cache and multiple contexts are supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip.

In the end, you have a monster chip for I/O and packet processing. It doesn’t do single-threaded applications all that fast, but it really isn’t meant to. The chip itself is not out yet, nor is there even silicon yet. The first version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core later in Q4 or possibly Q1 of 2011. These both share the same socket configuration and a 35*35mm package.

In Q1 of 2011, the 100 core chip will come out on a new socket and in a 45*45mm package. A bit after that, the 64 core will hit the market. Power ranges from 10W for the 16 core to 55W for the 100 core, but you can get power optimized variants that will only suck 35W. Given the programmability of the parts, power use is likely more dependent on the programs running on it.

The last bit of information is clock speeds. The 64 and 100 core models will come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much there is to synchronize and keep going. The 36 core models will come in 1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in 1.0GHz or 1.25GHz versions. Given the core count, internal interconnections, memory and I/O capabilities, Tilera will pack a lot of power into these small packages.