CSI 702: High Performance Computing — Class Notes
This 53-page set of class notes was uploaded by Summer Kreiger on Monday, September 28, 2015. The notes are for CSI 702 at George Mason University, taught by John Wallin in Fall. Since its upload, it has received 46 views. For similar materials see /class/215158/csi-702-george-mason-university in Computer & Information Science at George Mason University.
CSI 702: High Performance Computing
Dr. John Wallin, Research 1, room 352, 703-993-3617, jwallin@gmu.edu
http://www.cos.gmu.edu/~jwallin/c702f07

My Interests
- observations and simulations of colliding galaxies
- numerical methods
- high-velocity impacts
- high performance computing

Prerequisites
- Fluency with one of the following computer languages: FORTRAN, FORTRAN 90, C, or C++ (CSI 603/604 or equivalent)
- Fluency with the Unix operating system (CSI 601/602 or equivalent)
- CSI 700 (Numerical Methods) and CSI 701 (Foundations of Computational Science), OR permission of instructor

Translation: all homework will be written in C, C++, Fortran 90, or Fortran. You are also expected to know how to use Matlab, the basic numerical methods taught in CSI 700, and the techniques, methods, and standards taught in CSI 701. Your codes must also compile and work on the Linux machines in the COS lab and on the other machines selected by the instructors.

A Mini-Quiz
1. How do you create, change, or move directories?
2. How do you delete, rename, or move a file?
3. How do you use tar and gzip to compress and back up a directory?
4. What file and directory permissions are required to set up a website on your account?
5. In which directory do you normally place your web pages?
6. How do you modify your path?
7. How do you find the location of a binary that you wish to execute?
8. Have you used Matlab to simulate real-world problems, such as a traveling salesman problem?
9. Have you ever used regular expressions to do searches?

A Word About Textbooks
Steve McConnell's book "Code Complete" is required for this class. It describes how to write high-quality software. The emphasis is on software construction, that is, writing readable and maintainable code. This is an excellent reference for both experienced and inexperienced programmers.

Heath's book "Scientific Computing: An Introductory Survey" is required for the class and is also used in CSI 700 and CSI 701. In this class we cover the last sections of the book, including ODEs, PDEs, and FFTs. We also review selected other sections.
This book is a good overview of numerical methods, focusing on algorithms rather than formal proofs.

Why Do Scientists Use Computers?
- experiments are impossible
- experiments are too expensive
- equations are too difficult to be solved analytically
- experiments don't provide enough insight or accuracy
- data sets are too complex to be analyzed by hand
Computers bridge the gap between experiments and theory.

What is a Supercomputer?
Define what a supercomputer is and come up with some reasons why we need them.

The Atanasoff-Berry Computer
The earliest electronic digital computer was built in the basement of the Physics Department at Iowa State University by Atanasoff and Berry in 1937-1942. It was a special-purpose machine that was used to solve a 27x27 system of linear equations. Even though its programming was limited to a single task, it contained all the elements (storage, digital logic, base 2) of modern machines. Although this seems like a trivial problem now, solving this type of matrix problem is extremely difficult without a computer.

Supercomputer Speeds
Taken from http://home.earthlink.net/~mrob/pub/computer-history.html and http://www.top500.org/main/archive.php

Year  Computer                          Speed          CPUs
1947  Eniac                             500 FLOPS      1
1955  IBM 704                           10 kFLOPS      1
1964  CDC 6600                          12 MFLOPS      1
1976  Cray-1                            125 MFLOPS     1
1988  Cray Y-MP                         2 GFLOPS       16
1993  Fujitsu Numerical Wind Tunnel     124.5 GFLOPS   140
2000  ASCI Red (Sandia)                 2.4 TFLOPS     9632
2005  BlueGene/L (DOE/NNSA/LLNL)        280 TFLOPS     131072

Supercomputer Speeds (new additions)

Year  Computer                          Speed          CPUs
1947  Eniac                             500 FLOPS      1
1955  IBM 704                           10 kFLOPS      1
1964  CDC 6600                          12 MFLOPS      1
1976  Cray-1                            125 MFLOPS     1
1988  Cray Y-MP                         2 GFLOPS       16
1993  Fujitsu Numerical Wind Tunnel     124.5 GFLOPS   140
2000  ASCI Red (Sandia)                 2.4 TFLOPS     9632
2005  BlueGene/L (DOE/NNSA/LLNL)        280 TFLOPS     131072
2007  Dual Core AMD/Intel               30-50 GFLOPS   2
2007  NVIDIA 8800 GTX Video Card        350-500 GFLOPS 128

[Figure: Historical Trends in Supercomputing 1 — total FLOPS vs. year, 1940-2010]

[Figure: Historical Trends in Supercomputing 2 — FLOPS per CPU vs. year, 1940-2010]

The Drive Toward High Performance Computing
- resolution
- dimensions
- physical realism

Resolution
When we increase the resolution we are using to solve a problem, computational time increases as well. The increase in CPU time is usually much worse than a linear increase with the number of computational cells.

The Euler Equations
Consider the Euler equations. The size of the time step is limited by the Courant condition,

    δt ≤ δx / (|v_i| + c_i)

where δx is the grid size, v_i is the bulk fluid velocity, and c_i is the local sound speed. If we double the resolution, we decrease δx by a factor of two AND halve the size of the time step. This means we need four times the CPU time to solve the same physical problem with twice the spatial resolution.

N-body Methods
The first N-body simulations included only a few hundred particles. Since every particle exerts a force on every other particle, the order of calculations goes as O(n²). There are about 100 billion stars in our galaxy, not including the dark matter and gas. Modern cosmological simulations usually try to simulate the volume that contains 10,000 or more galaxies. The current state-of-the-art cosmological simulation has 10 billion stars.

Dimensions
Adding a physical dimension to a simulation greatly increases the cost of solving physical problems. Early models were typically done in only one dimension. Most physical models are now done easily in two dimensions, but it is still computationally very expensive to do three-dimensional simulations. Even going from a two- to a three-dimensional problem with the same physical resolution changes the cost from O(n²) to O(n³), where n is the grid size along one spatial dimension.

Physical Realism
Any set of equations is an approximation to physical reality. However, there are always choices in how much physics to include in any particular simulation. If you take the example of galaxies, we can characterize different physical effects by their relative importance in changing the overall structure of the galaxy.
Similar problems occur across computational science.

Galaxy Dynamics: Observation vs. Simulation

Galaxy Dynamics
- large-scale gravitational encounters
- internal gravitational forces
- gas dynamics
- formation of stars from gas
- feedback from star formation back into the gas
- active galactic nuclei
All programs approximate reality, but the better the physical model, the closer the results are to the real world.

Are Algorithms Important?
Which is more important:
- an efficient algorithm?
- a fast computer?

- The FFT algorithm changed the computational cost of calculating Fourier transforms from O(n²) to O(n log n).
- The particle-mesh method and the hierarchical tree code changed the cost of solving N-body problems from O(n²) to O(n log n).
- Assume it takes 1 second to calculate the Fourier transform of a given size. Approximately how much longer will it take to calculate the Fourier transform of a problem one thousand times larger?

Hardware Architecture
How does a computer work?

Building Fast Computers
- The underlying basis for all computers is digital logic circuits; this has not changed since 1937.
- All CPUs are based on AND, OR, and NOT circuits.
- If you can build faster digital circuits, you can increase the clock speed of your machine.
- You can also alter the design of your machine to execute more simultaneous instructions.

The Cray-1

The NAND Gate

Input A  Input B  Output
   0        0        1
   0        1        1
   1        0        1
   1        1        0

All the digital logic circuits you need can be built from NAND gates. If you build a faster NAND gate, the world will beat a path to your door.

Early Parallel Computers
- The development of parallel computers was predicated on the creation of computer networks.
- In the mid-1980s, high performance computing began moving toward parallelism.
- Major computer companies began to create and sell multiprocessor machines.
SIMD Machines
- Early parallel computers executed a single instruction on all of their CPUs. Each CPU held a different set of data, so they are called Single Instruction, Multiple Data machines.
- The instructions that could be executed in parallel were fairly limited, the individual CPUs were not very powerful, and the networking was slow, but there were usually a LOT of CPUs in the grid.
- Thinking Machines CM-2: 4k to 64k processors
- MasPar: 16k processors

Limitations of SIMD Machines
- The biggest limitation was that the architectures limited the types of problems that could be run on these systems. Problems that could not be decomposed to the size of the CPU array didn't work well. Communication was limited both in speed and in cost.
- Not all programs could be mapped to this type of programming paradigm.

The Arrival of MIMD Machines
- In about 1993, computer companies started moving toward having more powerful nodes connected with networks.
- Each node was a fully functional computer.
- The nodes could all execute their own instructions on their own data: Multiple Instructions, Multiple Data.
- Communications were handled by message passing.
- Each machine was constructed completely with proprietary hardware, operating systems, and software.

Early MIMD: Reinventing the Wheel
- Every major computer company hired a vast set of hardware and software specialists: operating systems needed to be written, networking protocols needed to be created, network switches needed to be built, and compilers and languages needed to be developed.
- Creating a computer from aluminum, copper, and silicon was very expensive.

Early MIMD: Reinventing the Wheel
- Since machines were built by different vendors, users needed to rewrite their programs when a new machine arrived, operating systems were unstable, and tools were unreliable. The development costs for new machines were HUGE.
- By the late 1990s, most builders of large computers were in deep financial trouble.

Cluster Based Computing
- In 1994, Donald Becker and Thomas Sterling created the first commodity-based cluster computer at GSFC, working under a USRA contract.
- Beowulf clusters became wildly popular. Numerous organizations started putting together their own cluster-based computers.
- The cost per calculation for Beowulf clusters was MUCH lower than what was available from large computer vendors.

Cluster Based Computing
The new direction for parallel computing is in cluster-based computing. The essential characteristics are:
- multiple commercial off-the-shelf (COTS) machines: Intel, AMD, or Apple Macintosh
- no special operating system or compilers (usually Linux)
- high-speed but COTS networking cards and routers connecting separate boxes
- a standard message-passing library handling communications through RSH or SSH
- The common name for these machines is Beowulf clusters.

Performance Characteristics of Beowulf Clusters
These are general characteristics of Beowulf clusters. Consider them rules of thumb rather than certainties.
- individual nodes are moderate-end PC boxes
- communications are usually routed through a single router or switch
- most communications are fairly slow, with high to moderate latency
- system performance depends heavily on the type of applications and the user base

Grids
One of the growing themes seen in high performance computing is grids. Grids are a distributed set of computers and clusters used to solve common problems. The nodes on a grid can be separated between continents, so we have to use the web and advanced web tools to manage them. Grids are very much cutting-edge technology, but there are some good examples of them being used for data access and computing.

Message Passing Libraries
With the creation of MIMD machines, message-passing libraries had to be created to allow general communication between computational nodes. The original libraries used were proprietary and sold only with parallel machines. Every company wanted to have new and better features than every other company. This competition led to completely machine-dependent programming. Every new parallel machine required a complete rewrite of the message-passing sections.
It is not very surprising that parallel computing was a commercial failure, at least until recently. MPI has emerged as the primary standard for message passing.

Hardware Issues
The hardware you have on your cluster determines your performance. There are a number of choices you can make:
- CPU
- memory
- I/O channels
- graphics cards
- motherboards, chipsets, BIOS
- hard drives
- networking
We will not go into all of these issues in detail today, but some of them will recur as the class continues.

CPUs
The CPU is one of the key links that determine the single-node performance of a computer.
- Opteron: AMD's 64-bit chip, most commonly used in high-end computing nodes
- Intel 64, Itanium, Itanium 2: Intel's 64-bit chips, commonly used in many PCs and Apple Macs
There is a nice comparison between these two chip sets on Wikipedia. Both chips are very competitive.

MultiCore Chips
One of the BIG changes in computing is the universal adoption of multicore chips.
- Virtually all commodity machines are dual core.
- Most high-end machines are quad or oct core.
- You can currently buy an 8-core Macintosh.
- Intel has a working prototype of an 80-core teraflop CPU.
This hardware change will force a change in how we program.

Memory / IO Bandwidth
Memory bandwidth is often the primary bottleneck in high performance computing. The general rule of thumb is that for every floating point operation per second your chip is capable of, it should have one byte of RAM. The limits of accessing memory depend on the bus speed. PCI-X is a commonly used standard, but it is slowly being replaced by serial standards like HyperTransport and InfiniBand. Networking and internal bus distinctions are blurring.

Graphics Cards
Although graphics cards are usually associated with doing graphics, there has been considerable work in making them do other tasks. The use of general-purpose graphics processing units (GPGPUs) has made it possible to do some general scientific programming. In some applications, this speed-up has been considerable.
For clusters, having a GPGPU may provide a computational advantage. http://www.gpgpu.org

Graphics Cards 2
The power and ease of use of graphics cards for scientific programming is rapidly changing. The new generation of NVIDIA cards can be programmed with CUDA. The $300 version of these cards can get 300 GFLOP speeds using 96 parallel pipelines. Codes are no longer executed only on the CPU.

Networking
Interconnection of boxes on clusters is both simple and subtle. Getting a simple router is now inexpensive for 100BASE-T and Gigabit networks. However, the performance difference between these common protocols and faster networking standards such as InfiniBand can be very large.

Design Choices
Ultimately, you have to make compromises when creating a new cluster. There are three major tradeoffs to consider:
- number of nodes
- power/memory of a single node
- networking interconnect

Communication vs. Load Balancing
In parallel codes there is a major tradeoff between communication and load balancing.

How do we communicate?
- intra-node: within a single box, with shared memory
- inter-node: between nodes on a cluster

Intra-node Processes
- different codes
- unique address space
- run independently
- communicate through interprocess communication

Inter-node Communication
Passing messages between nodes is often more complicated. We need to use some type of networking protocol to get the information between the systems.

Types of Interprocess Communication
You can broadly characterize communication between processes or threads into four groups:
- message passing
- synchronization
- shared memory
- remote procedure calls

Implementations of IPC
The Wiki page on interprocess communication lists 11 different methods. These include:
- sockets
- message passing
- signals
- files
- semaphores
- shared memory
- pipes
The specific methods available depend on the operating system.

Intra-node Threads
- threads: a way to split a code into simultaneous tasks
- sharing memory
- sharing resources
- sharing state
Threading works well on multicore CPUs for creating parallel jobs.

Threads Tutorial
http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
http://www.ibm.com/developerworks/linux/library/l-posix1.html
http://www.ibiblio.org/pub/Linux/docs/faqs/Threads-FAQ/html