Transcription

1 ANALYSIS OF SUPERCOMPUTER DESIGN
CS/ECE 566 Parallel Processing, Fall
Anh Huy Bui, Nilesh Malpekar, Vishnu Gajendran

2 AGENDA
- Brief introduction to supercomputers
- Supercomputer design concerns and analysis:
  - Processor architecture and memory
  - Interconnection model
  - Cluster design
  - Software (system and application)
- Conclusions

3 1. BRIEF INTRODUCTION
Brief history of the supercomputer:
- Introduced in the 1960s
- By the father of the supercomputer, Seymour Cray, at CDC
- The first supercomputer was a CDC machine: a scalar processor running at 40 MHz

4 1. BRIEF INTRODUCTION
Roadmap of the supercomputer from then until now: processors
- Early machines: scalar processors
- 1970s: most supercomputers used vector processors
- Mid-1980s: a number of vector processors working in parallel
- Late 1980s and 1990s: massively parallel processing systems with thousands of ordinary CPUs, either off-the-shelf units or custom designs
- Today: supercomputers are highly tuned computer clusters using commodity processors combined with custom interconnects

5 1. BRIEF INTRODUCTION
Roadmap of the supercomputer from then until now: speed (chart)

6 SUPERCOMPUTER DEFINITION
- Per Landau and Fink: the class of fastest and most powerful computers available
- Per the Dictionary of Science and Technology: any computer that is one of the largest, fastest, and most powerful available at a given time

7 LINPACK BENCHMARK
- Introduced by Jack Dongarra
- Reflects the performance of a dedicated system solving a dense system of linear equations
- The algorithm must conform to LU factorization with partial pivoting and perform 2/3 n^3 + O(n^2) double-precision floating-point operations
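A quick illustration of how a LINPACK rate follows from that operation count (a minimal sketch in C; the problem size and run time below are hypothetical values, not from the slides):

    #include <stdio.h>

    /* LINPACK operation count: solving a dense n-by-n system via LU
     * factorization with partial pivoting costs about 2/3 * n^3 flops. */
    static double linpack_flops(double n) {
        return (2.0 / 3.0) * n * n * n;  /* O(n^2) terms ignored */
    }

    int main(void) {
        double n = 10000.0;     /* hypothetical problem size */
        double seconds = 66.7;  /* hypothetical measured run time */
        double gflops = linpack_flops(n) / seconds / 1e9;
        printf("%.1f Gflop/s\n", gflops);  /* ~10.0 Gflop/s */
        return 0;
    }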

8 LINPACK BENCHMARK - DETAILS
- Flop/s: 64-bit floating-point operations per second; operations refer to additions or multiplications
- Gigaflops => 10^9 flop/s
- Teraflops => 10^12 flop/s
- Petaflops => 10^15 flop/s
- Exaflops => 10^18 flop/s

9 LINPACK BENCHMARK - DETAILS
- Rpeak: theoretical peak performance
- The number of full-precision floating-point additions and multiplications that can be completed within the cycle time of the machine
- Example: if a 1.5 GHz computer completes 4 floating-point operations per cycle, then its Rpeak is 6 Gigaflops
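That example generalizes to Rpeak = clock rate x operations per cycle x core count; a minimal sketch in C using the slide's single-core numbers:

    #include <stdio.h>

    /* Rpeak = clock rate * floating-point operations per cycle
     * (times core count for a multi-core machine). */
    int main(void) {
        double clock_ghz = 1.5;      /* slide example: 1.5 GHz */
        double ops_per_cycle = 4.0;  /* 4 flops completed per cycle */
        double rpeak_gflops = clock_ghz * ops_per_cycle;
        printf("Rpeak = %.1f Gigaflops\n", rpeak_gflops);  /* 6.0 */
        return 0;
    }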

10 LINPACK BENCHMARK - DETAILS
- Rmax: maximum measured performance of the supercomputer, in Gigaflops
- Nhalf: the size of the problem for which the machine achieves half its peak speed
  - A good indicator of machine bandwidth
  - A small value of Nhalf => good machine balance

11 2. DESIGN CONCERNS
- Processor architecture and memory
- Interconnection model
- Cluster design
- Software (system and application)

12 SUPERCOMPUTER ARCHITECTURE
- Processor architecture: Flynn's taxonomy (SISD, SIMD, MISD, MIMD)
- Memory: shared memory, distributed memory, virtual shared memory

13 VECTOR PROCESSING (SIMD)
- Acts on an array of data instead of a single data item
- Pipelines the data to the ALU; scalar processors pipeline only the instruction execution
- Example: A[i] = B[i] + C[i] for i = 1 to 10 (see the C sketch after the next two slides)

14 SCALAR PROCESSOR EXECUTION
Execute this loop 10 times:
- read the next instruction and decode it
- fetch this number
- fetch that number
- add them
- put the result here
- end loop
Demerits:
- The instruction is fetched and decoded ten times
- Memory is accessed ten times

15 VECTOR PROCESSOR EXECUTION
Read the instruction and decode it; fetch array B[1..10] and array C[1..10], add them, and put the results in A[1..10].
Merits:
- Only two address translations are needed
- Instruction fetch and decode is done only once
Demerits:
- Increase in the complexity of the decoder
- Might slow down the decoding of normal instructions
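To make the scalar-versus-vector contrast concrete, here is the slides' example in C (an illustrative sketch, not from the slides). The first loop is the scalar view; the `#pragma omp simd` hint on the second loop asks a supporting compiler to execute it with vector instructions, standing in for real vector hardware:

    #include <stdio.h>

    #define N 10

    int main(void) {
        float A[N], B[N], C[N];
        for (int i = 0; i < N; i++) { B[i] = i; C[i] = 2 * i; }

        /* Scalar view: the fetch/decode/add sequence repeats N times. */
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[i];

        /* Vector view: conceptually one instruction over whole arrays.
         * The pragma (ignored by compilers without OpenMP SIMD support)
         * requests vectorized execution of the loop. */
        #pragma omp simd
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[i];

        printf("A[9] = %g\n", A[9]);  /* 9 + 18 = 27 */
        return 0;
    }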

16 VECTOR PROCESSOR BASED
- Fujitsu VPP500 series
- Cray-1, Cray-2, Cray X-MP, Cray Y-MP
- NEC SX-4 series

17 RISC ARCHITECTURE
- Simple instructions
- Simple hardware design
- Pipelining is used to speed up RISC machines
- Lower cost and good performance

18 PIPELINED VS. NON-PIPELINED (diagram)

19 RISC-BASED SUPERCOMPUTERS
- IBM Roadrunner: #1 spot among supercomputers in 2008; uses Cell processors
- Tianhe-1A: #1 spot among supercomputers in 2010; uses Intel Xeon processors and NVIDIA Tesla GPGPUs

20 GPGPU
- General-purpose computing on graphics processing units
- The GPU is a stream processor: a processor that can run a single kernel on many records
- SIMD, with high arithmetic intensity
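The stream-processing model can be sketched in plain C (real GPU code would use C for CUDA or OpenCL, as the software slides note): one kernel function is applied independently to every record, and that independence is what lets a SIMD machine process the records in parallel:

    #include <stdio.h>

    #define N 8

    /* The "kernel": one function applied independently to each record.
     * On a GPU, thousands of such invocations run in SIMD fashion. */
    static float kernel_fn(float x) {
        return 2.0f * x + 1.0f;  /* arbitrary per-record arithmetic */
    }

    int main(void) {
        float in[N], out[N];
        for (int i = 0; i < N; i++) in[i] = (float)i;

        /* Stream processing: no dependence between iterations, so
         * every record could be handled by a separate GPU thread. */
        for (int i = 0; i < N; i++)
            out[i] = kernel_fn(in[i]);

        printf("out[7] = %g\n", out[7]);  /* 15 */
        return 0;
    }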

21 GPGPU-BASED SUPERCOMPUTERS
3 of the top 5 supercomputers in the world use NVIDIA Tesla GPUs:
- Tianhe-1A
- Nebulae
- Tsubame

22 SPECIAL PURPOSE SUPERCOMPUTERS
- High-performance computing devices with a hardware architecture dedicated to a single problem
- Custom FPGA or VLSI chips are used
Examples:
- GRAPE, for astrophysics
- D. E. Shaw Research's ANTON, for simulating molecular dynamics
- MDGRAPE-3, for protein structure computation
- BELLE, for playing chess

23 TOP 500: THE CPU ARCHITECTURE
(Chart: CPU architecture share of Top500 rankings, from 1993 onward)

24 SHARED AND DISTRIBUTED MEMORY (diagram)

25 SHARED AND DISTRIBUTED MEMORY
Virtual shared memory:
- A programming model that allows processors on a distributed-memory machine to be programmed as if they had shared memory
- A software layer takes care of the necessary communication

26 MEMORY HIERARCHIES
Two types:
- Cache based
- Vector register based
Factors affecting memory latency (illustrated in the C sketch below):
- Temporal locality, for instructions and data
- Spatial locality, for data only
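A small C illustration of spatial locality (an illustrative sketch, not from the slides): C stores matrices row-major, so the row-order loop walks memory sequentially and makes full use of each cache line, while the column-order loop strides across rows and misses repeatedly:

    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static double m[ROWS][COLS];

    /* Good spatial locality: the inner loop touches adjacent
     * addresses, so each fetched cache line is fully used. */
    static double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += m[i][j];
        return s;
    }

    /* Poor spatial locality: the inner loop jumps COLS * 8 bytes
     * between accesses, so cache lines are evicted before reuse. */
    static double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += m[i][j];
        return s;
    }

    int main(void) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                m[i][j] = 1.0;
        /* Same result, very different memory behavior. */
        printf("%g %g\n", sum_row_major(), sum_col_major());
        return 0;
    }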

27 CACHE BASED
- A hierarchy of memory
- The most recently used data is kept in the cache memory
- Cost increases and access time decreases going up the hierarchy

28 VECTOR REGISTER BASED
- Consists of a small set of vector registers
- Main memory is built from SRAM
- Instructions move data from main memory to the vector registers in high-bandwidth bulk transfers

29 CACHE BASED & VECTOR REGISTER BASED
Cache based:
- Merits: lower average access time; low cost
- Demerits: lower bandwidth to memory; programs not exhibiting spatial or temporal locality are penalized
Vector register based:
- Merits: faster access to main memory
- Demerits: expensive

30 LATEST DEVELOPMENTS
Using Flash memory instead of DRAM:
- Cheaper than DRAM
- Retains data when the power is turned off
- Reduces the space and power requirements
- Livermore's Hyperion supercomputer uses Flash-based memory

31 2.2 INTERCONNECTION
The supercomputer interconnect joins the nodes within a supercomputer:
- Compute nodes
- I/O nodes
- Service nodes
- Network nodes
It needs to support:
- High bandwidth
- Very low communication latency

32 INTERCONNECT TOPOLOGY
- Static (fixed) or dynamic (switched), plus routing
- Involves large quantities of network cabling that often must fit within small spaces
- Supercomputers do NOT utilize wireless networking technology internally!

33 INTERCONNECT USAGE (chart)

34 WIDELY USED INTERCONNECTS
Quadrics: 6 of the 10 fastest supercomputers used Quadrics in 2003.
Hardware:
- QsNet I: 350 MB/s, 5 us MPI latency
- QsNet II: lower MPI latency
- QsTenG: 10 Gigabit Ethernet switches, from 24 ports
- QsNet III: approx. 2 GB/s in each direction, 1.3 us MPI latency

35 WIDELY USED INTERCONNECTS
InfiniBand: a switched-fabric communication link
- Point-to-point bidirectional serial links between processor nodes and high-speed peripherals
- Supports several signaling rates
- Links can be bonded together for additional throughput

36 WIDELY USED INTERCONNECTS
Myrinet: a high-speed LAN with much lower protocol overhead
- Better throughput, less interference, and lower latency
- Can bypass the operating system
- Physically two fiber-optic cables, upstream and downstream

37 TOFU: 6D MESH/TORUS
- From Fujitsu, for large-scale supercomputers that exceed 10 petaflops
- Stands for TOrus FUsion
- Can be divided into rectangular submeshes of arbitrary size, and provides a torus topology for each submesh
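A hedged sketch of what 6D addressing looks like in C. Per the Tofu paper cited in the references, each node has a coordinate (x, y, z, a, b, c), where (x, y, z) index a scalable 3D torus of node groups and (a, b, c) index the 2x3x2 nodes inside a group; the (x, y, z) sizes below are illustrative assumptions, not Tofu parameters:

    #include <stdio.h>

    #define SX 8   /* illustrative (x, y, z) torus sizes */
    #define SY 8
    #define SZ 8
    #define SA 2   /* 2x3x2 group dimensions, per the Tofu paper */
    #define SB 3
    #define SC 2

    /* Flatten a 6D coordinate into a linear node id. */
    static int node_id(int x, int y, int z, int a, int b, int c) {
        return ((((x * SY + y) * SZ + z) * SA + a) * SB + b) * SC + c;
    }

    /* Neighbor along one torus dimension, with wraparound. */
    static int torus_next(int coord, int size) {
        return (coord + 1) % size;
    }

    int main(void) {
        printf("id(1,2,3,0,1,0) = %d\n", node_id(1, 2, 3, 0, 1, 0));
        printf("x-neighbor of x=7: %d\n", torus_next(7, SX));  /* wraps to 0 */
        return 0;
    }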

38 TOFU: 6D MESH/TORUS (diagram)

39 TOFU: MULTIPATH ROUTING (diagram)

40 TOFU: 3D TORUS VIEW (diagram)

41 TOFU: OTHER FEATURES
Throughput and packet transfer:
- 10 GB/s of fully bidirectional bandwidth for each link
- 100 GB/s of off-chip bandwidth for each node, to feed enough data to a massive array of 128-Gflops processors
- Variable packet length: 32 B to 2 KB, including header and CRC

42 TOFU: 6D MESH/TORUS (diagram)

43 TOFU: 6D MESH/TORUS (diagram)

44 TOFU: 6D MESH/TORUS (diagram)

45 2.3 CLUSTER DESIGN
Nowadays, most supercomputers are clusters. Topics:
- Typical nodes in a cluster
- Tiered architecture of a cluster
- Energy consumption
- The cooling problem

46 2.3 CLUSTER DESIGN
Typical nodes in a cluster:
- Compute nodes: comprise the heart of the system; this is where user jobs run
- I/O nodes: dedicated to performing all I/O requested by compute nodes; not available to users directly
- Login/front-end nodes: where users log in, compile, and interact with the batch system
- Service nodes: for management functions such as system boot, machine partitioning, system performance measurement, and system health monitoring

47 2.3 CLUSTER DESIGN
Nodes in a BlueGene/P general configuration (diagram)

48 2.3 CLUSTER DESIGN
Scaling architecture (hardware scaling) (diagram)

49 2.3 CLUSTER DESIGN
A schematic overview of a Blue Gene/L supercomputer (diagram)

50 2.3 CLUSTER DESIGN
A schematic overview of the tiered composition of the Roadrunner supercomputer cluster (diagram)

51 2.3 CLUSTER DESIGN
Energy consumption:
- A typical supercomputer consumes a lot of energy, most of which turns into heat; it then requires cooling
Examples (the cost arithmetic is sketched below):
- Tianhe-1A draws 4.04 MW; at 10 cents per kWh, that is about $400 per hour and $3.5M per year
- The K computer draws 9.89 MW, roughly the consumption of 10,000 suburban homes, costing about $10M per year
- Energy efficiency is measured in FLOPS per watt
- Green500, June 2011: IBM BlueGene/Q ranked 1st in MFLOPS per watt
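The dollar figures follow from a simple rate calculation; a quick sketch in C using the slide's numbers:

    #include <stdio.h>

    int main(void) {
        double power_mw = 4.04;       /* Tianhe-1A power draw, in MW */
        double price_per_kwh = 0.10;  /* 10 cents per kWh */

        double kw = power_mw * 1000.0;              /* 4040 kW */
        double cost_per_hour = kw * price_per_kwh;  /* ~$404/hour */
        double cost_per_year = cost_per_hour * 24.0 * 365.0;

        printf("$%.0f/hour, $%.1fM/year\n",
               cost_per_hour, cost_per_year / 1e6);  /* ~$404, ~$3.5M */
        return 0;
    }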

52 2.3 CLUSTER DESIGN
Cooling techniques:
- Liquid cooling: the Fluorinert "cooling waterfall" of the Cray-2; hot-water cooling in the IBM Aquasar system (the water is used to heat the building as well)
- Air cooling: IBM BlueGene/P
- Combination of air conditioning with liquid cooling: System X at Virginia Tech
- Using low-power processors: IBM BlueGene systems

53 2.3 CLUSTER DESIGN
IBM BlueGene/P cooling system (diagram)

54 2.3 CLUSTER DESIGN
IBM Aquasar cooling system (diagram)

55 2.4 SOFTWARE - SYSTEM SOFTWARE
- Operating systems: most supercomputers now run Linux
- Operating systems used by the top supercomputers (chart)

56 2.4 SOFTWARE - APPLICATION SOFTWARE/TOOLS
Programming languages:
- Base languages: Fortran, C
- Variants of C: C for CUDA, or OpenCL for GPGPUs
Libraries:
- Loosely connected clusters: PVM, MPI (see the sketch below)
- Tightly coordinated shared-memory clusters: OpenMP
Key software for different functions:
- Full Linux kernel on I/O nodes
- Proprietary kernel dedicated to compute nodes
- Scalable control system based on an external service node
Tools: open-source solutions such as Beowulf, Warewulf, ...
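As a taste of the message-passing style used on loosely connected clusters, a minimal MPI program in C (a standard hello-world sketch, not from the slides): each process learns its rank and the total process count:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with, for example, mpirun -np 4, the same binary runs as four communicating processes, which on a cluster would be spread across compute nodes.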

57 2.4 SOFTWARE - APPLICATION SOFTWARE/TOOLS
Software stacks: IBM BlueGene (diagram)

58 3. CONCLUSIONS
This talk gave an overview of the concerns in designing a supercomputer:
- Hardware design
- Interconnection
- Software design
- Cluster layout
- Other concerns: power consumption and cooling
Not all topics were covered; designs vary, and many remain proprietary.

59 REFERENCES
- Supercomputer, Wikipedia
- Yuichiro Ajima, Shinji Sumimoto, and Toshiyuki Shimizu, "Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers," Fujitsu
- Evolution of the IBM System Blue Gene Solution, IBM Redpaper REDP
- Using the Dawn BG/P System

60 THANK YOU!