
Advanced Topics in High Performance Computing Video Transcript

Frank: We’re talking about research computing at Boise State, so let me just roll into that. If you don’t know already, we’re Boise State University’s research computing department. We’re sponsoring this event, and we’re here to help. This is our team, and we’ve got a diverse set of backgrounds. Elizabeth, our leader, does a lot of outreach. Steve is a visualization person and also faculty, our faculty liaison. Jason is our cyberinfrastructure and network expert. Kyle does software development and partners with research teams. James does the more systems-heavy stuff, the heavy lifting there. I do a lot of software installation. Jenny does a lot with user-facing software and works a lot with Python. Jenny and I are both from chemical engineering and materials science backgrounds, so for applications in that area, we’re good people to talk to. Eli is our intern; he helps out wherever he can contribute. So, who we are, where we are, and what we do are the three things you should know. The easiest way to reach us, and what most people do, is to email us at Research Computing; that starts the process. We also have a phone number. Is this phone number still manned? They said it’s in your office, so this is the direct line to Jason. Some people prefer to make a phone call, so we have that available. Physically, we’re located in Riverfront Hall. We’re often working remotely, but you’re welcome to set up a meeting with us there. And of course, we have our website, boisestate.edu/rcs. And one more person is in the waiting room. Okay, Jason, you said you’re monitoring this? Oh, you can do that, so I don’t need to. Alright. This is where we are, and physically, on the map, this is in the slides. All the slides are in the course materials git repo for the lesson I’m doing today. Please forgive me; I’m working off a cough here, and I grabbed a Diet Coke, so I’m inadvertently belching while I’m trying to speak. The last slide is just to talk about what research computing does. We provide computing resources: we have two clusters of our own, plus Falcon, the shared cluster that we share with the other universities in Idaho and manage together. We do installation of software for research, data management, and data publication. We can help with the computational components of proposal development; if the Office of Sponsored Projects sends it to us, we can help you get the information you need to put your proposal together in terms of computational resources. We’re also a liaison for ACCESS, which is NSF’s program for providing research computing resources. If you need something different from what we offer on our own clusters, we can help you find something in that ecosystem; there’s quite a bit available. We also do training and offer one-on-one consultations. So, if you’re not sure, or you think you might need something from us, just ask. Reach out, and we can either help you directly or guide you to the right resources.

With that, let’s get into the topic of numbers. What are numbers? Big numbers, small numbers, real numbers, and even terms like “fake numbers” come up when you talk about computing, especially in research. There’s a long list of categories that numbers belong to. I’ll focus primarily on floating-point numbers in the context of research computing and high-performance computing. It’s crucial to understand things like the accuracy of these numbers. For instance, I went through my entire PhD without realizing the precision of the single-precision floating-point type, which is roughly seven decimal places. That’s something researchers should be aware of and know how to manage.

Now, when we talk about numbers in computing, we recognize that they are finite. Storing them demands a certain structure. One of the most efficient storage media, physically speaking, is DNA. With its quaternary system, DNA can store four different values per position within its molecules, and because of the double helix, it has built-in redundancy. It’s a very compact system, though interfacing with it isn’t particularly straightforward. In computing we deal with a different setup: while DNA achieves impressive feats, our world has to make do with transistors, blinky lights, and electronic components, in contrast to the organic intricacy of DNA. The notable thing is that everything is organized into memory. We have to store numbers, and they have to be stored in memory. Memory is a bunch of bits and bytes, ones and zeros, organized eight at a time; a byte is eight bits. So you have a set of memory addresses at which you can store different values. When you’re retrieving something from memory and ask the computer to fetch a value from a memory address, you don’t actually get just the one value. You get what’s called a cache line. I put a dozen eggs up here as an example; for me, this was a good visual. You need this one egg, but you can’t just pull one egg out of the fridge. You go to the box in the fridge that’s got the egg in it, and you pull out the whole box of eggs. It’s like that when you’re grabbing values out of memory: you fetch the whole line, it goes onto the bus, and the bus takes it to the processor where you need it. That’s just a bit about storage.
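For reference, here’s a rough sketch (not from the course repo) of how that cache-line behavior shows up in practice: summing an array sequentially and summing it with a 64-byte stride pull in the same number of cache lines, so the strided loop often takes a comparable amount of time even though it does a sixteenth of the additions. The array size and timings are illustrative assumptions; actual behavior depends on your hardware.

```c
/* cache_line_demo.c -- rough sketch: fetching "one egg" really fetches the carton.
 * Summing sequentially touches each 64-byte cache line once; striding by
 * 16 floats (64 bytes) also touches a new line on every access, so it does
 * far fewer additions for a similar amount of memory traffic. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)          /* 16M floats, ~64 MB, bigger than cache */

int main(void) {
    float *a = malloc(N * sizeof *a);
    for (long i = 0; i < N; i++) a[i] = 1.0f;

    clock_t t0 = clock();
    double seq = 0.0;
    for (long i = 0; i < N; i++) seq += a[i];          /* sequential: reuses each cache line */
    clock_t t1 = clock();

    double strided = 0.0;
    for (long i = 0; i < N; i += 16) strided += a[i];  /* stride of 16 floats: one new line per access */
    clock_t t2 = clock();

    printf("sequential sum %.1f in %.3f s\n", seq, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strided    sum %.1f in %.3f s\n", strided, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```

Compile it with something like gcc cache_line_demo.c and compare the two timings on your own machine.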

But to bring it back to the machine representation of the floating-point number: the very commonly used floating-point type is 32 bits. In C, it’s just called a float. This was before they had a double, double precision; it was just the float, and many years ago there were competing definitions of what a float was, depending on what machine you were running on, until the 1980s when the IEEE standardized it: “Here’s what a float is, and here’s what you can do with it.” This is IEEE 754-2008, the current working definition. A 32-bit float has bits numbered zero to thirty-one. There’s a sign bit; a fancy way to think of it is minus one raised to the value of the sign bit. Minus one to the zero is one, minus one to the one is minus one. Anyway, it’s just a sign bit: your number is either plus or minus. The next piece is the exponent, because we’re talking about floating point, which means the point can float; you can shift it to where you want it, so you get a larger absolute range of numbers than you could with plain integers. This works in a slightly fancy way. It’s not just the raw value of the exponent field: you take it with a shift of minus 127, because there are eight bits, 256 possible values, and you want the exponent to range from negative values to positive values, so they just shift it by 127. In this particular example, the exponent bits add up to 124; subtract 127 and you get minus three, so the number we get out of this is going to be scaled by 2 to the minus three. Then the main part is the fraction. This is where you’re storing all the precision of your floating-point number. And they do something clever here: notice that the fraction is 23 bits, but it actually stores 24 bits of information. They shift it around the same way you do when writing numbers in scientific notation, moving the point until there is one non-zero digit to the left of it. Except here it’s not a decimal point, it’s a binary point, and because it’s binary, that leading digit is always one, so you don’t need to record it. It’s implicit; that’s why there are only 23 stored bits but effectively 24 bits of precision. Now, in this example, only one fraction bit is set, the second one from the left, representing two to the minus two. So you have one plus two to the minus two, which is one plus a quarter, 1.25. You multiply that by the exponent factor and the sign, and that’s how you get the value 0.15625. The takeaway is that there’s a limited amount of precision available when storing a number in computer memory. We’ll get into what those limits are, when you need to be aware of them, and, just as importantly, when you can disregard them, because if you can get away with less precision, you can sometimes push more compute through in less time.
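To make that concrete, here’s a small sketch (mine, not from the course materials) that pulls those three fields out of the example value 0.15625:

```c
/* float_bits.c -- pull apart the IEEE 754 layout described above:
 * 1 sign bit, 8 exponent bits (biased by 127), 23 stored fraction bits
 * plus the implicit leading 1. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float x = 0.15625f;                 /* the example from the slide */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the 32 bits of the float */

    uint32_t sign     = bits >> 31;            /* bit 31 */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* bits 30..23 */
    uint32_t fraction = bits & 0x7FFFFF;       /* bits 22..0  */

    printf("0x%08X  sign=%u  exponent=%u (unbiased %d)  fraction=0x%06X\n",
           bits, sign, exponent, (int)exponent - 127, fraction);
    /* Prints 0x3E200000  sign=0  exponent=124 (unbiased -3)  fraction=0x200000,
     * i.e. (-1)^0 * (1 + 2^-2) * 2^-3 = 1.25 * 0.125 = 0.15625. */
    return 0;
}
```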

And if you’re not aware of it, you can run into rounding errors, because you can only record so much precision. These rounding errors are a source of error, and as a scientist you need to account for every source of error, just as you account for measurement error. Modeling shortfalls are, we believe, the predominant source of error in computing; models are useful, but none are perfect. Still, you have to account for the numerical errors too. So what can you do? Recognizing the limits of machine precision, you can make responsible choices about it. For example, a lot of AI and machine learning applications can get by with very low precision; they’ll have an 8-bit floating-point type. I don’t know a lot of the specifics of machine learning, but for what you’re doing there, you can get by with a very low-precision type. GROMACS, the molecular dynamics package, builds in single precision by default. One of the reasons it runs faster than some other packages is that you can run it in single-precision mode, and especially on a system where you’re able to vectorize, you can pack a lot of operations into a single tick of the clock. So there can be good reasons to go with lower precision.

It’s crucial, though, to document your choices, and you can do better research if you’re aware of how machine precision is affecting you. So how can we sum this up? If you’re adding two numbers, where does this rounding affect you? To summarize: if one of those numbers is one, how small can the other one be before it gets rounded off? That threshold is something we call the machine delta. For the 32-bit floating-point type, it’s 2 to the minus 24. If x is larger than that value, then 1 plus x gives you 1 plus x; if it’s smaller than that value, it gets rounded off and you get just one. The same statement applies when you’re adding any two numbers whose magnitudes differ by more than that relative amount; it’s just the general form of the same statement.

And just to note, what is the final limit of precision for a 32-bit floating-point type? Remember, there are effectively 24 bits in the fraction, so it’s log base 10 of 2 to the 24. A little bit of math: 24 times log base 10 of 2, which gives you about 7.2 decimal digits. So if you report eight decimal places, that eighth digit only carries about twenty percent of a digit’s worth of information; the last place you’re reporting is often not actually correct. That’s the limitation: you can’t record more than about seven decimal places accurately with this type. But that’s a pretty good amount; seven places is plenty for a lot of things. A lot of people will just automatically use double precision because they think it’s better, but it might not be. It might just be twice as slow.
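A quick way to see that seven-digit limit for yourself (a sketch, not part of the exercise repo):

```c
/* seven_digits.c -- illustrate the ~7-digit limit of a 32-bit float.
 * Neither 0.1 nor the integer 16777217 (2^24 + 1) can be stored exactly,
 * because the fraction only carries 24 bits of information. */
#include <stdio.h>

int main(void) {
    float tenth = 0.1f;
    float big   = 16777217.0f;   /* 2^24 + 1 */
    printf("%.9f\n", tenth);     /* prints 0.100000001 -- wrong past ~7 digits */
    printf("%.1f\n", big);       /* prints 16777216.0 -- the +1 was rounded away */
    return 0;
}
```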

Okay, to illustrate this, let’s go to our first exercise. It’s a very short, simple C code that shows where this rounding affects you. I’ll walk through the code and then we can go ahead with the exercise. There’s a constant I call TTF, which stands for two to the twenty-fourth; it’s assigned the decimal value of 2 to the 24, and delta is the reciprocal of that. Then we have a sum that we’re going to play around with. To show what we’re starting with, here’s the print statement. Even if you don’t read C, this should be pretty straightforward to follow: you’ve got variables you’re assigning values to, and you’re looping. What we’re doing is, two to the 24th times, we take our sum, starting at zero, and add delta, that little number. After we do that two to the 24th times, we print out what we get. And then what happens if we keep going? We do it another two to the 24th times, keep adding this little delta on, and see what we get. Does anybody want to guess what we’d expect to see at the first print, adding 1 over 2 to the 24, 2 to the 24th times? Benji says one. Right. Okay, how about if you do it another two to the 24th times? We’d think we might get two, but if we’re not making decisions with machine precision in mind, we’ll get an answer that’s off by a factor of two. Okay, let’s go ahead and get started with this simple exercise, and we can tweak the parameters a little, just to have some fun and play with it. All right, connect to Borah. I think everybody has an account on Borah. If anybody needs an account on Borah, I’ve got a couple of guest accounts. Does anybody in the room need an account, Jason? Anybody online? All right, if we need them, I have a couple up here. That’s a good question. Yes, it should. There’s a GPU thing at the end of this, but you can just skip that one. So, there’s a git repository. Did I not put that in the slide? Okay, let me post a link to it in the chat. All right, let me know. Here we go.
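For anyone following along without the repo, the exercise code is roughly this shape; this is my sketch of what was just described, so the file in the repo may differ in details:

```c
/* add_floats.c -- sketch of the exercise.  TTF is 2^24 written out in
 * decimal, and delta is its reciprocal, 2^-24. */
#include <stdio.h>

#define TTF 16777216          /* 2^24 */

int main(void) {
    float delta = 1.0f / (float)TTF;
    float sum = 0.0f;

    for (long i = 0; i < TTF; i++)   /* add delta 2^24 times */
        sum += delta;
    printf("after %d additions: sum = %f\n", TTF, sum);      /* 1.000000 */

    for (long i = 0; i < TTF; i++)   /* add delta another 2^24 times */
        sum += delta;
    printf("after %d additions: sum = %f\n", 2 * TTF, sum);  /* still 1.000000 */
    return 0;
}
```

Mathematically the second print should say 2, but in float arithmetic the sum sticks at 1, off by a factor of two, as advertised.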

Okay, so just go ahead: that’s the address, git clone that. I’ll just go ahead and do it here. I dropped it in the chat if anyone’s following along. Okay, let me go back to the slides here.

Okay, so there was a dev session queue for BSU, and there was a couple of them. What was the name of the other queue? Jason, do you remember? Okay, dev-session-rc, that’s the one for RC Days. Okay, I’ll go and try this one.

So, this should just put us on a node interactively. Oh, okay, there’s a window showing up in the way. Okay, is the text big enough to read? Can you see it okay, or do I need to make it bigger? I can size it up once, maybe just a little bigger. All right. Yeah, okay. So, I cloned it, and now I’m going there. We’re on the node, let’s see, and we’re going to go into the

subdirectory. So, all right, into the machine numbers subdirectory.

Here we go. Okay, so this is the code that we just looked at. Let’s inspect it, and I’ll go ahead and compile and run it. So we run the add-floats program, and we see we start with our sum and this delta value, which is a very small value. We get one for the sum the first time, and we get one for the sum again the second time, because after we’ve summed it 2 to the 24th times, the sum gets so big that adding the next delta gets rounded off. Floating-point numbers aren’t infinite precision; that’s the point where they break. They have that delta, and when the difference in scale is greater than that delta, the small value disappears.

What happens then? A couple of things to try just for fun. What happens if you increase the size of this TTF? For example, it’s no longer 2 to the 24th, it’s 2 to the 24th plus one. So we change it by just one.

And we run it again.

Okay, it stays the same. All right, why is that? Because delta became smaller. When it tried to store that delta, it still had to round it slightly to account for that extra value. And what happens if we decrease its size? It was a six; let’s change it to a five and see.

So, we modified it again. Now the result is two because it’s just large enough that it doesn’t get rounded off. I’ll try it another way. Okay. And… Hold on. Oh! There’s a complication.

This is interesting. I’m using GCC 9.2 here. I overlooked one detail; let me backtrack. There’s another step I forgot. Okay, I can’t recall the exact command I use for switching. Let’s see. If you’re working on Borah, you’ll need to load our base package. Borah has a lot of legacy software that predates the current research computing team, which we’ve kept around because people might still be using it; I wouldn’t want to disrupt their work.

However, the newer additions are built on a different base. Now, GCC 12 is available. Let’s proceed and load that version.

And let’s see, we’re getting that version now. Let’s go back.

To our exercise. Sorry, that’s the binary. Okay, so I’ve set it back to the original value, and now we’re compiling it with gcc-12.

Okay, and we see the same result as before. Sometimes when you upgrade your compiler you can actually get different results, because there could be a bug in the compiler, or it could be handling things differently, or moving to support a new architecture. Something to watch out for. But what I wanted to show was, let’s change this a little bit, and sometimes the compiler will... okay, so that one works, that one still compiles. I want to show what happens when I make it just a little bit bigger than this. Hmm, it’s not complaining. When I was working on this before, I actually had the compiler flagging and warning me that it was going to be rounding, because I was choosing a constant of a certain size. I’m curious that I’m not getting that compiler warning right now; sometimes things work differently when you do it live. Lots of moving parts. Anyway, it’s one of these things where the compiler can catch the fact that you’re using a number that’s too large and may not do what you think it’s doing. Let me poke at this and see if I can get there again. I’m going to shift my window up a little. All right, let’s go back to the original value. So we’re doing 1.0 divided by this one. Okay, I can’t remember; there’s a conversion happening there. It might have been because I was doing a conversion of that. No, that’s not it. Oh, this just specifies that this is actually a float and makes it explicit, because I can never remember exactly where everything converts. You can go through the order of operations and work it out definitively, but I try to throw parentheses around and make it explicit as much as I can. Anyway, I’m going to move on from this. I did, at one point, get the compiler to generate a warning about what I was doing. So that’s showing that you can get a rounding issue that gives you a surprising result.

All right, does anybody need more time to look at the exercise or want to play with that more? Anyone online? It’s a lean crowd today. So, you said we have two online? Six? Oh, wow. Okay. All right, so does anybody else, anyone online, still working? I’ll just take a minute. And a bonus if anyone can get it to generate the compiler warning.

We got these Sour Patch candies because they were all sold out of cough drops at the C-store downstairs, so we’re going with Sour Patch candies. I caught every cold that was going around in the past month, I think; not COVID, I got that last summer. Anyway, the past month has been rough, just a nasty cough, and that’s the end of it now. Oh, those are sour! Oh my gosh. Wow, that’s a good one. Hey!

Okay, so I’ll go. If anyone still wants more time, just let me know and I can stop. But otherwise, I’ll go on with the example of an MPI reduction.

Parallel processing. When you’re using MPI, you don’t know the order that things are going to run in because all MPI is doing is spawning a bunch of different Unix processes that talk to each other. But they’re independent processes. MPI has stuff built into it to help those processes communicate. They pass messages back and forth to each other, and that’s why it’s the Message Passing Interface.

But the processes inherently don’t run in any particular order. So if you’re expecting processes to run in a certain order, you can have some strange, surprising things happen.

Rounding is one of those things that can spring up. Say you’re adding up contributions to a force field from different images, or from different partitions of a physical domain, in a molecular dynamics simulation. The contributions from further away are smaller in magnitude, and those can end up getting rounded off if you add them last. If you’re in the middle of a huge cloud of atoms trying to compute all the interactions, adding up the far-away ones first lets them accumulate to something that survives, whereas if you add the close ones first and then the far-away ones, each small contribution can get rounded away.
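Here is a small sketch (not from the repo) of that ordering effect, with one big contribution and a million small ones:

```c
/* sum_order.c -- how summation order changes the result.  One "big"
 * contribution of 1.0 and a million "small" contributions of 2^-24 each.
 * Adding the small ones first lets them accumulate; adding the big one
 * first means every small one gets rounded away. */
#include <stdio.h>

int main(void) {
    const long n = 1000000;
    const float small = 1.0f / 16777216.0f;   /* 2^-24 */

    /* small contributions first, big one last */
    float a = 0.0f;
    for (long i = 0; i < n; i++) a += small;
    a += 1.0f;

    /* big contribution first, small ones last */
    float b = 1.0f;
    for (long i = 0; i < n; i++) b += small;

    printf("small-to-large order: %.7f\n", a);   /* about 1.0596046 */
    printf("large-to-small order: %.7f\n", b);   /* exactly 1.0000000 */
    return 0;
}
```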

These are the kinds of places where how you partition the physical domain, the order of things, or the way you do a reduction can have real consequences for the science. The result can depend on the order in which operations are performed, and it can also depend on the size of the run. Interestingly, it can even depend on which MPI you’re using; it’ll actually give you a different result.

Which I’ll show in a little bit, because we have this example, and I run it with OpenMPI and then again with MPICH. All right, so in a reduction, and this is a way scaled-down example, you’ve got three tasks running and you’re just adding things up. A reduction is basically where you take something from all of the different MPI processes and reduce it down to a single value. Here we’ve got three processes. Two of the processes are going to contribute the value of delta, and the third is going to contribute the value of one. Depending on where you put the one, you’ll get a different result. Here, delta plus delta gives two delta, and of course one plus two delta is one plus two delta. But if you’ve got delta, 1, and delta, and the one plus delta happens first, the delta gets rounded off and it reduces to one. Then you’ve got one plus delta again, the delta gets rounded off again, and you get one. And similarly for the last case: if the second and third are reduced together in the same step, that’s where the rounding occurs. So depending on where that value is placed, you can get a different result.

All right, let’s play with that. We’ve got a simple C program with the same value of delta specified. We initialize MPI, grab the rank and size, and then assign a local value, either one or delta depending on which rank you are, just as in that image. Then you do the reduction and see what the result is. And this last bit is just shorthand for saying: if you’re rank zero, go ahead and print out the result of the reduction, and then there’s a finalize step. So let’s go to this exercise.
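Pieced together from that walkthrough, the reduce code looks roughly like this (my reconstruction, so the repo version will differ in details such as how the special rank is chosen):

```c
/* reduce.c (sketch) -- every rank contributes delta = 2^-24 except one rank,
 * which contributes 1.0.  Where you place the 1.0 (WHICH_RANK) can change
 * whether the deltas survive the rounding inside MPI_Reduce. */
#include <stdio.h>
#include <mpi.h>

#define WHICH_RANK 0    /* which rank holds the 1.0 */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float delta = 1.0f / 16777216.0f;               /* 2^-24 */
    float local = (rank == WHICH_RANK) ? 1.0f : delta;
    printf("rank %d of %d contributes %g\n", rank, size, local);

    float total = 0.0f;
    MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("reduced sum = %.10f (1 + 2*delta = %.10f)\n",
               total, 1.0f + 2.0f * delta);

    MPI_Finalize();
    return 0;
}
```

Build and run it with something like mpicc reduce.c -o reduce and mpirun -np 3 ./reduce, and try moving WHICH_RANK around.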

This is just as we looked at it before.

You can run make reduce. And we don’t have MPI loaded, so, just like this, trying to run “make reduce” says “mpicc: command not found”, because I forgot to load the MPI module. Let me go ahead and do that, and let’s see what we have available. Okay, we’ve got OpenMPI, we’ve got MPICH, and there are some newer ones. The ones that are on the base, the Borah base, are the ones we recommend for new work. We suggest not mixing and matching with the old ones, because inconsistencies in the compiler can lead to surprising and difficult-to-debug results. So let’s load OpenMPI first: “module load OpenMPI”, and we’ll run “make reduce”. We’re on a compute node, so we can compile and run it here. On some systems you have to cross-compile; Cray systems are notoriously challenging that way, but on these you can compile on the head node or the compute node without any issues. Falcon is super easy because everything is Broadwell, so there are no worries about cross-compilation at all; there’s only one architecture. Anyway, let’s run this. Oh, sorry, I ran it without mpirun around it; I forgot to do that. So let’s run this again with three tasks. And I should mention, MPI processes run in no specific order. We got processes zero, one, and two, but process zero finished and printed its output before the other two even reported back. And the result shows our sum: it’s one plus two delta when we place the one at rank zero. But if we position it at rank one or rank two, the delta gets rounded off. You can use this to deduce the order of the reductions on the system; the point is that you can get different results based on where the value is placed. So, what happens if you run it again?

Do you get the same result? Interestingly, the ranks finish in a different order than in my earlier run; when I was preparing my slides, the order was one, two, zero. But the outcome is the same. What if you run it again with four tasks and do the same reduction? We see that with the one at rank 0 or rank 2, the value comes back without rounding, while the other placements give rounded values. So depending on how the reduction is executed and the scale of the run, that is, how many tasks are involved, the outcome can vary. And lastly, what I find particularly interesting is building this with MPICH instead of OpenMPI.

Let’s make sure we’re using the MPICH that goes with gcc-12, and make sure it’s picking up the right C compiler. It’s useful to see the details; this isn’t exactly a dry run, but it shows what the mpicc command actually executes. Let’s compile it again. Oh, my mistake, all the make command does is run mpicc. So we have our reduce binary; let’s overwrite the previous version and run it with four processes. Hmm, there seems to be an issue. MPI isn’t responding; there’s a problem with MPI.

I need to troubleshoot MPI. Let me ensure there are no residual processes that might cause interference.

Let’s try it again.

Okay, I’m not sure what went wrong. We rebuilt it, and now it looks like MPI didn’t initialize correctly. When you see ‘0 1’ repeated, every process reporting rank zero of size one, it means the processes aren’t communicating; they’re running serially instead of as one parallel job. In this attempt we see consistent results from the reduction regardless of where we place the larger value. Running it with three processes, we get a different result than with OpenMPI; MPICH and OpenMPI can handle the reduction differently. There’s no guarantee on the order of operations or the sequence in which processes execute, so developers need to account for these nuances in their code. The popular simulation packages most likely address these issues, but when you’re writing custom code, you need to be aware of it. That becomes especially important if you’re handling unconventional cases or using established software in an unorthodox way. For instance, if your research involves examining an anomaly, like a single oversized atom, the results might behave unpredictably in a way the software never accounted for or anticipated. Again, that’s a modeling issue, but it can manifest itself in your computation. All right.

Any questions on that, or does anyone want to try anything different? Want me to try anything different with this? Well, we could do this with more processes. Let me back up a second. Let’s try this just for fun: 36 processes. Okay, 36 processes, and we don’t get 36 different answers, but I’m counting, let’s see, one, two, three, four... at least four different answers here, so there are probably more. All right. And again, if anyone online wants to ask any questions or have me go over anything, feel free to interrupt; we’re moving at a race car pace here. We have three hours, and it looks like we’re probably going to end up finishing early. So okay, all right.

So, the next exercise looks at the Central Limit Theorem as an illustration of sources of error: something that uses randomness and can be affected by rounding, though randomness is not necessarily a source of error. So let’s dig into that.

If we’re considering a randomly sampled variable, call it script R, ranging in value from zero to one, a uniformly distributed random variable, this is basically what you get when you ask for a random number from Linux or from the C standard library. The typical backend is something like the Mersenne Twister; there are other backends, and there are people who live to argue about which random number generator is better. I’m not going to get into that discussion.

We’re just using a seeded random number generator: give it the same seed and it gives you the same sequence, and give it a different seed to get a different sequence. The values typically come back as integers ranging from zero to some maximum, and if you divide by that maximum, you get values distributed between zero and one.

And if we’re looking at sets of these numbers, taking a set at a time (for the example I’m going to do, I’ll take eight samples, so n equals eight), you just take the average of those eight samples of the uniformly distributed variable. If you look at the statistics of that, the probability distribution for the set of averages, it comes out proportional to the exponential of minus the squared difference from the mean; essentially, it comes out to a Gaussian. Another way to say it: the averages of sets of random variables are distributed normally. You take averages of uniform random variables, and the distribution of those averages converges toward a Gaussian distribution. That’s the Central Limit Theorem; it just says that this is what happens when you do this.
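Written out (my notation, not from the slides), with n samples of the uniform variable, whose mean is 1/2 and variance 1/12:

```latex
\bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i,
\qquad
P(\bar{R}) \;\propto\; \exp\!\left[-\,\frac{(\bar{R}-\mu)^2}{2\sigma^2/n}\right],
\qquad \mu = \tfrac{1}{2},\ \ \sigma^2 = \tfrac{1}{12}.
```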

The experiment is this: it’s an MPI code that scatters different random number seeds to different MPI processes, so each MPI process gets its own starting value; it lands in its own place in the random number generator’s sequence and picks numbers from there. So you’re effectively getting different values of that script R to generate one value of R sub i, and each MPI task ends up with its own R sub i. These MPI processes run and complete in whatever order they happen to. Because there’s rounding, there can be errors when you do the reduction, when you pull all those values back together. And as we’ve shown, the MPI operations can give different results depending on the MPI implementation; results can differ between runs and with the size of the run.

Okay, so the code for that, just to walk through it, I’m going to pull this code up. I’ll leave the slide up, but I’m going to pull the code up in the window here. It’s “CentralLimit.c”, and just to walk through it. Should pop up.

So, initializing MPI. Remember I said that this capital N is the number of elements per process, and I’m setting that to eight. The initial seed for the random number generator is just one-one-two-three-four-five-six, plus the MPI rank. That ensures each MPI rank gets a slightly different seed, off by one, although being off by one in the seed actually puts you a very long way away within the random number generator’s sequence, so there’s no worry about periodicity.

Okay, let’s see. We’re grabbing the size and rank, printing a little message to say what we’re doing, creating the seeds, and then scattering them. Again, each seed is the initial seed plus the MPI rank; those are the values we scatter with an MPI scatter. And then, this is all happening in parallel: when we get our seed value back from the scatter, we seed our random number generator with it.

So then we’re just picking numbers using that random number generator, making calls to rand. That returns an integer, which we divide by RAND_MAX, doing a floating-point division, to get the sample back.

Then we sum the samples and divide by the number of elements per process. So we’re getting a number between zero and one: the average of eight samples from the random number generator.

That happens on each of the MPI ranks, since the seeds were scattered, and then we do our gather here to pull those results back, whatever they are.

So we gather those back and bin the results, which is basically looking at how frequently a value in a particular range occurs. Then, on rank zero, we print it out. All this is doing is generating a histogram from the binning, by printing a row of X characters proportional to the number of samples that land in each bin.
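Putting the pieces of that walkthrough together, CentralLimit.c is roughly this shape (a sketch; the actual file will differ in the binning, printing, and error handling):

```c
/* CentralLimit.c (sketch) -- each rank averages N uniform samples from its
 * own seeded generator; rank 0 gathers the averages and histograms them. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N      8      /* samples averaged per rank */
#define NBINS 10

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rank 0 builds one seed per rank (base seed + rank) and scatters them */
    int *seeds = NULL;
    if (rank == 0) {
        seeds = malloc(size * sizeof *seeds);
        for (int i = 0; i < size; i++) seeds[i] = 1123456 + i;
    }
    int myseed;
    MPI_Scatter(seeds, 1, MPI_INT, &myseed, 1, MPI_INT, 0, MPI_COMM_WORLD);

    srand((unsigned)myseed);
    float avg = 0.0f;
    for (int i = 0; i < N; i++)
        avg += (float)rand() / (float)RAND_MAX;   /* uniform sample in [0,1] */
    avg /= (float)N;

    /* gather every rank's average back to rank 0 and histogram them */
    float *all = NULL;
    if (rank == 0) all = malloc(size * sizeof *all);
    MPI_Gather(&avg, 1, MPI_FLOAT, all, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int bins[NBINS] = {0};
        for (int i = 0; i < size; i++) {
            int b = (int)(all[i] * NBINS);
            if (b >= NBINS) b = NBINS - 1;
            bins[b]++;
        }
        for (int b = 0; b < NBINS; b++) {
            printf("%4.2f ", (b + 0.5f) / NBINS);
            for (int j = 0; j < bins[b]; j++) putchar('X');
            putchar('\n');
        }
        free(all);
        free(seeds);
    }

    MPI_Finalize();
    return 0;
}
```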

Let me get out of the editor and build this.

And for this, pick a pretty big number; I’ll start with 16 just to illustrate.

Okay, so it’s got eight elements per task, and these are the values that come back. Binning them, we get a kind of weird-looking Gaussian distribution, but it’s only got 16 samples in it. Let’s try this a little bigger, say 32 samples. Okay, it’s starting to look more Gaussian-like. And just for fun, let’s push it and put in, like, 240. Hopefully that doesn’t crash the node; I said I should just keep going until it crashes the node, and then James can yell at me. You know, crashing a node during a workshop is bad form, Frank.

It takes a little while to launch that many processes, but okay, we’re getting something back that looks pretty much like a Gaussian now. I don’t want to get too far off into the weeds on the statistics part of this. Let me just try to get it to where it looks pretty, but the terminal settings are fighting me. Never mind, okay.

Something like this; here’s what it looks like with 48. If you’re running with the same random number seed, you’re going to get the same result, unless you’re getting a rounding problem somewhere. And you can count on rounding in this particular scenario: if we’re picking numbers between zero and one, we’re actually picking an integer between one and two to the thirty-second and mapping it to something between zero and one. You’ve got two to the 24th possible values for the float versus two to the 32nd possible values for the integer, which means there are stretches of about two to the eighth integer values that end up mapping onto the same float. But two to the 8th is not a very big number. So roughly 256 times out of two to the 32nd, which is around four billion, you could have the random number generator pick something that just gets crushed to zero. Actually, it’s not even an error; it’s just where the value falls. It gets binned at zero because it gets mapped to the value zero, and you’re just as likely to get the value zero as you are to get zero plus delta, and so on. So, okay.

But the point here is to think through what the potential sources of error and uncertainty are. The way things are getting binned, we’re using the random number generator with a given seed value; we’re not going off /dev/urandom or something that pulls in thermal entropy and the like from the system. We’re starting with something that’s reproducible. So the random number generator is not a source of error here. Rounding might be, depending on how you’re using it.

And what happens if you run it with a different seed value? You could certainly get a different result, but that’s a property of the random number generator, probably not rounding. If you increase the number of MPI processes, your results get better, because more statistics means a larger sample size, and you’re more likely to get something that converges.

And what if you use a different number of samples for each of the R sub i? I was using eight. If you use zero samples, you’re not doing anything, so that one doesn’t count. If you use one, you just get back the uniform distribution, not something that looks like a Gaussian; you can think of the flat distribution as an infinitely wide Gaussian.

If you go the other way, what happens if you use, say, a million samples for each of your R sub i? Then you end up with a distribution that gets really, really narrow, and if you keep pushing that number up, it converges to a Dirac delta distribution, basically a single point. So, anyway, these are some of the things that can happen when you use machine numbers in MPI.

And what happens if we have very small values of R? If you’re using a 32-bit floating-point type for this, you have to understand that the first 256 integer values will collapse to zero, the next 256 will be rounded down to delta, the next 256 to two delta, and so on. If that’s not what you’re expecting, you need to account for it; if it is what you expect and intend to work with, it’s still good to understand that that’s what’s happening.

So, some takeaway points. Greater precision isn’t always better: it can be slower, and it may not buy you anything. You can actually propagate some strange things if you’re carrying higher precision than your model justifies.

How to put this: you can end up deceiving yourself if your model is not that good and you crank up the precision. You can convince yourself you’re getting much more accurate results than you really are. If all you know is that the answer is between one and two, computing it to 32 decimal places doesn’t mean you know it any better; you still don’t know if your answer is any good.

So anyway: smaller types can often mean better performance, which is especially true on vector machines. Intel’s latest round of vector processing, AVX-512, gives you 512-bit-wide vector operations. What’s 512 divided by 32? It’s two to the ninth over two to the fifth, so two to the fourth: 16. So you can get 16 single-precision floating-point operations into one 512-bit-wide vector operation, or eight double-precision. But if you’ve got a 16-bit float, you can cram 32 of them in there for a single tick of the clock. So you can certainly get better performance using a smaller type, if you’re on a system that supports it.
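As a sketch of what that looks like in code, assuming an AVX-512 capable CPU and a compiler flag like gcc -mavx512f (this isn’t from the course materials):

```c
/* vec_add.c -- one 512-bit vector add processes 16 single-precision floats
 * at once; the same register width would hold 8 doubles, or 32 half-precision
 * values on hardware that supports them. */
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    float a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 100.0f; }

    __m512 va = _mm512_loadu_ps(a);    /* load 16 floats into one register */
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_add_ps(va, vb); /* 16 additions in a single instruction */
    _mm512_storeu_ps(c, vc);

    for (int i = 0; i < 16; i++) printf("%g ", c[i]);
    putchar('\n');
    return 0;
}
```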

Let’s see. Okay, so again, it comes down to making choices and understanding what the limits are. The operations you’re running can give you results that vary between runs and with the size of the run. Randomness is not automatically a source of error; it depends on how you’re using it. If you work in Monte Carlo, you probably understand this pretty well; if you don’t, you may just think random is magic. I come from a statistical mechanics background, so we have our particular biases in how we look at some of these things.

It’s worth mentioning that the IEEE 754 standard I talked about is what’s been in play since all the vendors got together and agreed on a floating-point standard, and it’s been in play for a long time. There are some pretty harsh critiques of it, though.

It’s great to have a standard, but if you happen to catch John Gustafson’s talk on the inherent problems with the IEEE floating-point standard, it’s pretty hilarious; very entertaining, actually. He’s developing an alternate standard. I don’t know a lot of the specifics, but he’s presented results that show a lot of promise for using less memory to span more usable ranges, a more elegant standard, and it seems to be getting a lot of traction. I believe it’s implemented in silicon in some experimental chips now, and it’s implemented in software in Julia, which, if nothing else, is high praise for it being taken seriously as a standard. It’s something worth looking into; I’ll include a link to the standard later on. It’s certainly plausible that things will go that way in the long term, particularly as people move away from x86 and some of the other things that have dominated high-performance computing for a long time.

The final takeaway, though, is that these limitations of machine precision are inherent. No matter what the standard is, 32 bits can only store 32 bits of information, and a 64-bit type can only store so much information too. At some point you always run into those limits, and it doesn’t matter whether you’re doing it with posits, with IEEE numbers, or with exotic machine learning types; there are still limitations.

Any questions to this point on the fun of machine numbers? It’s kind of a dry topic; I find it pretty interesting to dig into, but I try to make it a little more fun anyway. Shifting gears a little bit; any questions online? Looking at the GPU application: we talked about vector machines, and GPUs are just pure vector machines. All they do is the same thing on huge sets of data, which is why they’re so streamlined. So I posed the question: when is it a good time to use a GPU? And I like to answer: for anything that looks, or can be made to look, like a rendering problem. A rendering problem is taking some information and generating colors, pixel colors, based on it. In a video game, it depends on where the light is, where the objects are, whatever goes into the rendering algorithms. But if you can trick your problem into looking like a rendering problem, then you might be able to apply a GPU to it. So, for example, let’s say you have a bunch of atoms in a simulated sample of polystyrene, like this one here.

And you want to do something with this sample, something at each one of these points along a grid; maybe you want to figure out what color to assign each part of that grid, or maybe you want to do something different, but it can look like a rendering problem. So first off, you’re decomposing your domain along these lines. This one is Cartesian with equidistant spacing, but you could partition it in reciprocal space, or in log space; you can approach it many different ways. The idea is that it’s regularly spaced intervals in some domain, some conceptual domain.

Okay, let’s say you’re trying to determine the chemical potential, which is the additional energy it takes to add another atom into your sample. This is something you do if you’re looking at gas diffusion, which is something I’ve studied, and you want to see what happens if you insert an atom at different points in your sample. What you’ve got is an energy of interaction, where the thing you’re inserting interacts with everything else in the sample, and the only thing that changes is exactly which point, which x, y, z, you insert it at. A researcher in the 1960s, back when computers were rudimentary and you had to operate them by hand, figured out you could do this by just taking a bunch of sample points. You could do it computationally, but they lacked the computing power then, so it fell out of favor until more recently. That researcher was Widom.

Widom came up with the idea of an insertion parameter, where the exponent is the energy of interaction divided by kT, because that’s how you normalize an energy for a partition function; statistical mechanics again, not especially important here. The point is that you’re doing the same operation at a bunch of evenly spaced points in a particular space, and thus it can look like a rendering problem. I’ve taken this a step further in the stuff I work on, which I call the free volume index, because I’m interested in finding where the free space in a particular sample of material is and how it’s shaped. I do that by sticking test atoms into it and seeing how they interact, and that gives you a function of x, y, and z based on this energy. This was my first big problem to render on a GPU. Because it ranges between zero and one, you can map it to a color; the first thing I did was map it to a green channel and view it as a TIFF. Anyway, there’s just a quick exercise here that I’m going to run through. I won’t suggest doing this one, because we’re limited on GPU nodes, and also because I had this code running fine last fall but can’t recall the exact steps, and now it’s been hanging. But let’s look at what does work. So, just to show, I’m launching a GPU session; it’s just an interactive session on Borah, on a GPU node. Let’s see if there’s anybody else on here that we’re going to bother. Yeah, okay.

They’re pretty quiet. Okay, let’s see... oh.

Okay, this is just some miscellaneous programs, and this code is one of the programs that’s in this

package, “vacuums”, which is something that I developed when I had some spare time. And let me go back to the CPC.

All right, so we have this ps.gfg; it’s a polystyrene configuration. It’s 64,040 atoms, one atom per line: the x, y, and z positions, and then what they call the sigma and epsilon parameters. Sigma is how big the atom is; epsilon is, roughly, how soft it is when you’re trying to stick something into it. And we’ll just do this. Okay, let’s do it.

I’m going to make this bigger so I can see it; I can close that one. All right, so this is running the program with a set of parameters. This is the box size I mentioned, since it’s a sample in a box. The potential says how to interpret the force-field parameters, which are the last two columns of the data. And then resolution: this is a three-dimensional sample, so the resolution is how many slices I want to make of it. 1024 means a thousand twenty-four slices in each of the x, y, and z directions, so already we’re at about a billion points; three-dimensional stuff gets big fast. Not as fast as combinatorial stuff, but pretty fast. And it takes as its input the polystyrene configuration, ps.gfg, and just dumps the output to a file. It’s a horrible data format, but it was something easy, because it’s x, y, and z plus the insertion parameter. So that’s the code, and this is just the script to run it. We run it, it says it’s calculating, and let’s go run it in the background. And this again.
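To give a feel for what the program is computing, here’s a much-simplified serial sketch of the idea; it is not the actual vacuums code, and all the numbers in it are made up for illustration. On a GPU, each grid point would become its own thread:

```c
/* widom_grid.c -- simplified sketch of a Widom-style insertion map.
 * For every point on a regular grid, sum a Lennard-Jones interaction energy
 * with all atoms and record exp(-U/kT).  All values here are illustrative;
 * the real code's normalization of its 0..1 index will differ. */
#include <stdio.h>
#include <math.h>

#define NATOMS 3     /* toy configuration; the real ps.gfg has tens of thousands */
#define RES    8     /* grid resolution per axis (the real run used 1024) */

int main(void) {
    /* x, y, z, sigma, epsilon per atom -- made-up numbers */
    double atoms[NATOMS][5] = {
        {2.0, 2.0, 2.0, 3.4, 0.2},
        {5.0, 5.0, 5.0, 3.4, 0.2},
        {7.0, 3.0, 6.0, 3.4, 0.2},
    };
    double box = 10.0, kT = 2.5;   /* box edge and thermal energy, arbitrary units */

    for (int i = 0; i < RES; i++)
      for (int j = 0; j < RES; j++)
        for (int k = 0; k < RES; k++) {
            double x = (i + 0.5) * box / RES;
            double y = (j + 0.5) * box / RES;
            double z = (k + 0.5) * box / RES;

            double U = 0.0;
            for (int n = 0; n < NATOMS; n++) {   /* interact with every atom */
                double dx = x - atoms[n][0], dy = y - atoms[n][1], dz = z - atoms[n][2];
                double r2 = dx*dx + dy*dy + dz*dz;
                double s2 = atoms[n][3] * atoms[n][3] / r2;
                double s6 = s2 * s2 * s2;
                U += 4.0 * atoms[n][4] * (s6 * s6 - s6);   /* Lennard-Jones pair energy */
            }
            double w = exp(-U / kT);        /* Widom insertion factor */
            if (w > 1.0) w = 1.0;           /* crude clamp onto a 0..1 index */
            printf("%g %g %g %g\n", x, y, z, w);
        }
    return 0;
}
```

Compile with something like gcc widom_grid.c -lm. The real run does the same inner loop over tens of thousands of atoms at 1024 cubed grid points, which is exactly the kind of uniform, embarrassingly parallel work a GPU is built for.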

Okay, so the job’s going. I was off the screen for a second; I launched the job, and it’s running in the background. If you run nvidia-smi, you can see it’s running on GPU zero. There’s something hung in it, I know, and I need to go back and look at the code for that, but this is just what launching a GPU job looks like. You can go to the node and see what GPUs are available; here we see two Tesla V100 GPUs, GPU zero and one, and I have my process running on GPU zero. Just an example of the kind of thing you can run on a GPU and what it looks like while it’s running. Okay, this one is not going to finish, so I’m just going to go ahead and cancel it.

All right, so with that, we’re going to wrap up early here. We’ll find something to do with ourselves.

Okay, there are a few things to say about GPUs in relation to the machine number topics that we discussed before. The first GPUs actually didn’t perform arithmetic operations that were compliant with the IEEE standard. They didn’t need to. The original GPUs were graphics cards for video gamers and it was just rendering pixels. If you had a few dead pixels or a few frames get dropped, nobody cared. Maybe the gamer cared, but it wasn’t like you would get the wrong result in your research or something.

The first GPUs were doing their floating-point operations, but they weren’t padding them out. The IEEE standard for floating-point operations requires that you pad the operands out to 40 bits through the operation and then round the result back to 32 bits. The original GPUs weren’t doing that; they didn’t need to. It just wasn’t important enough; even if your pixel was off by one shade out of 256, it didn’t matter.

And then with the rise of general-purpose GPU computing, that started becoming important.

Just a few links here for additional resources:

The IEEE 754 standard, if you’re interested in what floating-point numbers look like; it’s pretty interesting. This is a link to the Wikipedia article about the standard, which is a little more readable than the standard itself, and it’s also where I got the images I showed.

Another standard is posits; this is a link to that standard. It’s a very short read, maybe a dozen pages. And curiously, our director is friends with John Gustafson, who’s been developing the standard; the Gustafson of Gustafson’s law, same guy. This is what he’s been working on lately.

And then finally, just a link to the MPI forum for general stuff on MPI standards and things like that.

And last but not least, here’s our address again: our website, and researchcomputing@boisestate.edu. Get in touch with us for any research computing needs; we’ll help you out, or help you figure out somewhere else that can.

And that’s what I’ve got for today. Questions, comments, or fun things we want to try? You can go back and try to make the compiler scream at us, or we can save that for another day. All these materials are in the repo, with the code samples; all the slides are in there as well.