
Video Recap: Fast Data Transfer Full Transcript

JASON: Let’s go ahead and start. Hi, I’m Jason Watt with OIT Research Computing, and today I’m going to talk a little bit about transferring data using two utilities, Globus and Rclone, and then I’m also going to talk a little bit about a Science DMZ, what it is, and how it might benefit you if you’ve got really large data you need to move across institutions. This will be a high-level overview; it’s not a hands-on workshop, but you can follow along, and after we go through the slides, if you have questions or want to try it, and you already have a Borah account, you’ll be able to follow along and use some of these utilities. Yeah, you’re getting my screen, right? Can you still see the slide? Yeah, it didn’t come up on this.
Right, you’re seeing the control screen? Yeah, but do you see this slide on the control screen? That’s weird; it says screen sharing was paused. I don’t think I’ve ever seen that happen before.
Yeah, right, stop… Is anybody familiar enough with Zoom that they’ve seen it pause a screen share?

[Screen share paused]

Oh, are we supposed to record this? Oh, it’s set to clarity. Let me retry the screen share, then.

[Went to next slide – Data Transfer Utilities]

How’s that? Is that better? Okay, so the two utilities, once again, that we’ll talk about are Globus and Rclone. These are two things you can use as opposed to traditional tools, if you’re familiar with cp or scp on Linux, or maybe rsync, or GUI utilities like FileZilla and Windows Explorer, the common ways of moving files around, or even really legacy utilities like email or FTP, which at one time sufficed but no longer do. We’ll also talk about a traditional campus network and a Science DMZ and how they differ. On a traditional campus network, you have email, you’ve got video streams, you have all your Canvas or Blackboard classes, whatever you use, and there’s a lot of chatter on those networks. A Science DMZ is a parallel network built across a campus where the only traffic is specifically those science flows, the flows of research data. What does that look like?

[Went to next slide – Quick Network Overview]

Yeah, okay, a little bit about our Globus infrastructure on campus. For people who have used some of the clusters, you may or may not be familiar with some of these, but we’ve got basically a Globus environment across both our campus network and our Science DMZ network. We have Globus infrastructure on R2, and then we have other endpoints called Bronconet01, Bronconet02, and Globus VM; I’ll talk about how those are different in a little while. Some of those can be used to access research shares, so if you’re part of a lab that has, say, a Windows share, we can set it up so you can access your Windows share via Globus to facilitate moving your data back and forth to either our clusters or external clusters. For Science DMZ endpoints, we have DTN-R2 and then a shared storage resource called DTN-01, and then two of these nodes live over in Idaho Falls at our data center there, which houses Borah: Borah-DTN and Bronconet02…

[Went to next slide – Research Data Storage and Globus Transfer Infrastructure: An intricate network of data storage and transfer infrastructure. From the top left, the R2 Cluster Infiniband Network links to the R2 Storage Node, which feeds into the R2 Scratch Space Storage. This network interacts with the Infiniband Switch Fabric, the Campus Network, and the Science DMZ, all looping back to the R2 Storage Node.

The Infiniband Switch Fabric connects to multiple elements, notably the CCP OIT L3 Switch, which in turn connects to the Globus DTN and the RFH OIT L3 Switch. The RFH switch is linked to the Cisco ASA FW and Border Firewall, ultimately leading to the Border Routers and the RFH Science DMZ Arista 7280 Switch.

The Bronconet01 DTN sits within the Campus Network, while the Production NetApp Storage links to the Production Virtual Infrastructure via the Cisco ASA FW. The Science DMZ Network originates from DTN-R2, passing through the CCP Science DMZ Cisco Nexus 3548 Switch and the RFH Science DMZ Arista 7280 Switch, leading into DTN-01, and is connected to the Science DMZ NetApp Storage.

Lastly, the RFH Science DMZ Arista 7280 Switch connects to the Border Router, leading to Syringa/IRON/IRON EE and then to the Campus Network Extension to INL C3 and Science DMZ Extension to INL C3. On the right, but not connected, the R2 Compute Cluster InfiniBand Network at C3 is placed.]

And then, what does this look like? In orange, there’s our campus network, which is basically all of your email, classes, video streams, VPN, etc., and then in blue there’s another network built for the highest-speed data transfer. How does this relate to the clusters? Because I think that’s why most folks are here, wondering how we operate and work with clusters on campus. If you look at the top, there’s the cluster [inaudible] called R2, which is located downtown at the City Center Plaza building, and for those of you familiar with the clusters, the back-end network is represented in purple at the top; that’s where your home directories and your scratch space are, and all of this is connected so it’s available from both the campus network and the science network… The endpoint names that I brought up earlier are all highlighted in yellow, and then our Idaho Falls facility, the C3 data center, is represented down here. You can think of this as the internet connection outside of the university, and the Science DMZ is optimized so there’s always a path available without contending with other traffic. Think of it as, say, an eight-lane highway where the large things go down it, like big semis, whereas the campus network is lots of small streams, like motorcycles running around. So again, different networks for different purposes, and you don’t need to worry too much about the details other than that there are two separate networks, and those are integrated into both of our cluster environments, both Borah and R2.

[Went to next slide – The Basics]

Okay, the basics of Globus, which is really what we’re here to learn: how to move some data. Let’s say you have a large data set in the hundreds of gigabytes or even terabytes. If you were to try to copy that with a legacy utility like FileZilla or scp and you had a network interruption, say 900 GB into a one-terabyte transfer, what would you have to do? You’d have to restart that data transfer all the way from the beginning. Globus is a fire-and-forget data transfer service with automatic recovery. It has checksum verification of your data from source to destination, and it can deal with what are called elephant flows, which are large singular files, for example terabytes per file. It’s also really good with small files: if you have a data set of maybe tens of thousands of files that might only be 4K each, trying to move those with a legacy utility could be problematic. With Globus, once you start the job, it goes, and if there are any network interruptions, it will pick up where it left off. Another really good feature of Globus is sharing: if you have data you need to share with someone, at another institution or not, you’re empowered within the Globus environment to say, "Take this data, whatever data I’m connected to in Globus, and share it with any other user." All you need to know is their Globus ID, which is usually their email address at their institution, or anyone with a Gmail account can get on Globus. So if you needed to share data with somebody in some other country or on another continent, as long as they can create a Gmail account, they could use Globus to get this large data set from you. We manage a Globus Plus subscription, and that only comes into play with a product called Globus Connect Personal, which we’ll talk a little more about in the advanced section; it allows you to share from your personal computer as an endpoint, essentially. Globus itself is browser-based; it’s a service, not storage, just a software service to transfer data between existing storage repositories. The graphical interface runs in a web browser, and there’s also a command-line client that you can do some advanced things with. Then the other utility I’ll talk about, Rclone, is a command-line utility that interfaces with just about any cloud storage provider. So if you have storage in Amazon S3, Google Drive, or any of the other numerous cloud providers, Rclone has connectors for that, and if you’ve ever used rsync, Rclone is really similar to rsync.
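Since the command-line client gets mentioned here, below is a minimal sketch of what a scripted, fire-and-forget transfer might look like with the Globus CLI; the collection UUIDs and paths are placeholders rather than real endpoints.

```bash
# Minimal sketch of a fire-and-forget transfer with the Globus CLI.
# UUIDs and paths are placeholders; find real ones with "globus endpoint search".
globus login                                  # bounces you out to federated sign-on
globus endpoint search "ESnet"                # look up a collection's UUID by name

SRC="11111111-1111-1111-1111-111111111111"    # placeholder source collection UUID
DST="22222222-2222-2222-2222-222222222222"    # placeholder destination collection UUID

globus transfer "$SRC:/path/to/10G.dat" "$DST:/scratch/10G.dat" \
    --label "example transfer"
globus task list                              # check on the job later; it recovers on its own
```

The web interface shown in the demo below does the same thing; the CLI is just handy when you want to script or automate transfers.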

[Went to next slide – Common Use Cases]

Okay, here are some common use cases… Something Globus is really good at is transferring a large data set from one institution to another. I’ll use the example of, say, weather data from NCAR: you can transfer large data sets, and you can go grab public data; anyone who’s got a Globus endpoint can share data with you. So it’s good for getting data between Boise State and NCAR, back and forth. Another example I’ll talk about later is getting data from a laptop in Portugal to Boise State, over 17,000 files on spotty internet connections. So it works between institutions, and between end users and institutions, and it also facilitates internal campus transfers. If you have data sitting in a research share and you need to work with it on the cluster, we can leverage Globus to move that data back and forth… Then Rclone: I use Rclone primarily with Google Drive, getting data back and forth to Google Drive, and that’s less important now than it was a year ago because Google has changed its model; it was free and unlimited at one point. We were recommending people use something like Rclone: if you had archival data, you could shove it into Google Drive with Rclone. It’s not necessarily free and unlimited anymore, and I don’t think our university has settled on any kind of chargeback model yet, so I don’t know if there’s an upper cap, but it’s a great utility if you need to move some data back and forth, you want to do it from the command line, you want to automate it, and you’re working with something like Google Drive cloud storage. Some background… those are the two utilities, and I’m going to do some demonstrations of using them. I tried to model these after a workflow that you might have using a cluster.

[Went to next slide- Campus Network: Demo 1GB and 10GB Transfers]

For those of you on remote, I’m going to stop the share because I’ve got to change screens,

[Turned screen share off and turned on GlobalProtect VPN after entering user info. Logged into borah-dtn, dtn-01, and globus-vm]

and I’m gonna have to drop off a VPN, so it’ll take me a second to reattach.

[Opened RC Days AGENDA & ZOOM Links showing around 28 minutes are left for the fast data transfer session.]

First things first, the Globus interface is available at the URL globus.org, and when you land there and select Log In, you’ll

[Opened Globus page – globus.org. Entered "Boise State University" into "Use your existing organizational login."]

come to a page where you can log in with your institution’s ID. So we would select Boise State

[Pressed “Continue” leading to the next page requesting username and password.]

and continue.

[Went back to the landing page of Globus]

So there’s the globus.org landing page, and go to login

[Continued to the login page by pressing “Log in”]

and those of you with Boise State accounts can actually follow along,

[Pressed continue moving to a BSU site for entering your username and password. Entered username and password, then pushed “Login”]

or even if you’re not at Boise State, if your institution has a Globus subscription, you can select your institution, and then it’ll bounce you out to your single sign-on federated authentication.

[Opened Duo login page and pushed "Send Me a Push"]

It wouldn’t be a good day to forget your login.

[File manager page loaded]

Okay, so this is what the Globus file manager interface looks like. If you

[Switched layout by clicking on the first of three options for “Panels.”]

guys are following along and you landed on a page that looks like that with only one pane, up in that upper right-hand corner you can select the two-pane view… and you can think of this as something similar to FileZilla, right? You’ve got a source and a destination.

So what we’re going to do is, let’s say, a colleague

[Clicked onto Collection text box and pushed enter]

has said, “Hey, the data you need to work with is at”…

[Entered into the text box Esnet and clicked ESnet Houston DTN (Anonymous read-only testing)]

we’re going to go to the ESnet server.

[Files loaded in File Manager]

This is just test data, but a colleague would give you a real link to a real collection name where you could go get data, say from NCAR or the University of Idaho or Falcon or whatever, and then we’re going to select a 10 GB file and a 1 GB file.

[Clicked Search and enter, transitioning to the Collection Search page]

For our data set, we need to get those to Boise State and drop them on our lab group’s research share,

[Entered into the “Collection’ text box “globus-vm” and pushed the search icon. Then clicked onto “Globus-VM” link. Files opened into “File Manager.”]

and you would use something like a resource we’ve got on campus called Globus VM,

[Went into “data” folder in the Globus-VM directory. Three folders are visible: “cruz-lab-intermediate”, “demo3”, and “jwatt-rcdays-demo”.]

and the names of all these endpoints are less important than having some familiarity with them. What you would do, if you had a real use case, is come to Research Computing and say, "Hey, we’ve got this use case where we’re trying to get data from X to Y." Then any of us in Research Computing would help you out and say, "Use these endpoints, and this is what it looks like."

[Went into “jwatt-rcdays-demo”]

This is just kind of a representative example of what you could expect to use… and so on the left side, I’ve got a source of the data that I want to bring to Boise State

[Clicked onto “Transfer & Timer Options,” revealing settings: a “Label This Transfer” text box, checkboxes for “Sync – only transfer new or changed files,” “Delete files on destination that do not exist on source,” “Preserve source file modification times,” “Do NOT verify file integrity after transfer,” “Encrypt transfer,” “Skip files on source with errors,” “Fail on quota errors,” and a note stating, “These settings will persist during this session unless changed.” There are also “Notification Settings” – selections include: “Disable success notification,” “Disable failure notification,” “Disable inactive notification.”]

and on the right side, I’ve got an endpoint at Boise State. A couple of things of note: by default, Globus will checksum the data in flight and make sure that, bit for bit, you’re getting what you asked for. If you don’t care about that and you really care about speed, and you want the stuff to come across as fast as it can, you can select "do not verify file integrity

[Clicked onto “Transfer & Timer Options,” closing the settings. Also pushed “Start,” initiating the transfer and displaying an icon giving a success confirmation.]

after transfer," and I do that here just for the speed aspect of it; these are small files, so it won’t matter much (there’s a small command-line sketch of these same options a bit further down). Then you click on Start. While we’re waiting for that transfer to start, you’ll see there’s a Start button on either side; either endpoint can be the source or the destination. It doesn’t matter, it’s just relative: you can select the left side to be the destination and the right the source, and then you would just use the Start button on the other side. You can see there are little arrows pointing which way the transfer will go, and then in the interface, once the transfer is submitted, you can go watch

[Clicked on “ACTIVITY” and the right arrow icon to display an overview of the transfer’s current state. This includes the task label, source, source host, destination, task ID, owner, condition, requested, deadline, duration, and transfer settings. On the right, there are buttons to edit the label and terminate the task, along with information that includes the number of files, directories, files transferred, bytes transferred, effective speed, skipped files due to sync, and skipped files due to errors.]

the activity of that job, and this will usually update every 60 seconds, so on a short job like this, by the first update it’ll probably already be done. When you’re transferring, say, a 240-terabyte data set from INL to Boise State, like we did this summer for one of our PIs, that was a two-week project and required checking in periodically on the transfers to make sure they were still chugging along, because errors do happen… In this case, we’re just bringing data from what’s called an ESnet test endpoint in Houston

[Clicked on “ACTIVITY,” showing that the transfer has started and is halfway complete. Clicked back on the job, displaying new information – 7.10 GB transferred and an effective speed of 99.8 MB/s.]

to Boise State… and I used to have a slide in the deck about doing live demos, right, because they always go wrong; I removed it for some reason, but the slide I used was Elon Musk showing how indestructible the Cybertruck’s windows were, if anyone remembers that.

[Clicked on “ACTIVITY,” showing that the transfer is still halfway complete. Clicked back on the job, displaying an effective speed of 69.92 MB/s. Clicked on globus-vm, entered “cd /data” followed by “ls,” displaying cruz-lab-intermediate, demo3, and jwatt-rcdays-demo. Navigated into the directory jwatt-rcdays-demo and entered “ls,” displaying files 10G.dat and 1G.dat. Then entered “ls -alh,” displaying:
total 13G
drwxr-xr-x 2 jwatt   g-jwatt   35 Mar 28 09:36 .
drwxrwxrwx 5 root    root      73 Mar 27 18:24 ..
-rw-r--r-- 1 jwatt   g-jwatt 8.4G Mar 28 09:39 10G.dat
-rw-r--r-- 1 jwatt   g-jwatt 954M Mar 28 09:37 1G.dat
Went back to the activity page, reloading the website and displaying new information – 9.73 GB transferred and an effective speed of 68.31 MB/s.]

It was nine gigs. What’s the transfer rate? The speed of the network card in the endpoints and the disk I/O on the attached storage are usually the limiting factors. If I go back to that slide that had the DMZ, our DMZ is built at 100-gigabit network speed, but the DTNs, the data transfer nodes in there, have 10-gig cards, so the fastest theoretical rate you’d ever see is 10 gigabits, which is still pretty fast. In practice, for local or regional transfers, you’ll see around 9.4 gigabits per second. As you start increasing distance, that’ll diminish a little; in a test I ran last night, and I’ve got the data in a slide, I think I got 6.9 gigabits from Houston to Boise State using the Science DMZ, which we’re not using for this example.

[Clicked on “ACTIVITY,” showing that the transfer has finished. Clicked back on the job, displaying new information – 11 GB transferred and an effective speed of 71.87 MB/s. Also, a duration of 2 minutes 33 seconds]

So we can see here this is done, 11 gigs total. The page gives you the start and end time and how long it took; in this case, two minutes and 33 seconds. Then you asked about speed, and over here it shows the effective speed of basically 72 megabytes per second. This is another point, and it’s in a slide a little further down, but in general, when you’re talking about speeds on networks, you’re talking in bits per second, and when you live in the storage world, you’re talking bytes, and when those collide you can have the wrong thing stated. I’ll explain that in a slide, but in general I wouldn’t talk about network speed in megabytes per second, as it’s listed here; I talk about it in megabits per second, because network engineers live in that world. You buy a 10-gigabit card; it’s not a 10-megabyte card, it’s 10 gigabits per second.
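Incidentally, those Transfer & Timer Options from a couple of steps back (sync-only transfers, skipping checksum verification) have command-line equivalents; a small sketch, assuming the Globus CLI and placeholder collection UUIDs:

```bash
SRC="11111111-1111-1111-1111-111111111111"   # placeholder source collection UUID
DST="22222222-2222-2222-2222-222222222222"   # placeholder destination collection UUID

# These flags mirror the "Transfer & Timer Options" panel in the web UI:
#   --sync-level mtime      only transfer new or changed files
#   --no-verify-checksum    skip integrity verification, trading safety for speed
globus transfer "$SRC:/data/" "$DST:/scratch/data/" --recursive \
    --sync-level mtime --no-verify-checksum \
    --label "fast, unverified sync"
```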

[Clicked onto “FILE MANAGER” then collection text box search and pressed enter. Then opened Globus-VM and opened /data/jwatt-rcdays-demo/ directory.]

So this transfer is done, so now if we go back to our file manager and look in the right place for that data, you can see I’ve got those two files. Now, if you’re a researcher and you’ve brought that data down, or you’ve had a student or someone in the lab start amassing the data set, it’s sitting in your research share in the Windows world, which is where a lot of our researchers live, as a Windows share mapped as a drive on their desktop. Let’s say you’ve amassed the data, and now it’s time to get it to a cluster because you want to do something with it, so the next step is we’re going to take these two files and move them to scratch space

[Clicked collection text box search and pressed enter, opening recent searches. Then clicked BORAH-DTN retrieving its directory contents.]

on the cluster. So what we’ll use here, if you’re using Borah, is the endpoint Borah-DTN, and then I’m going to go to my scratch.

[Entered into “Path” textbox “/bsuscratch/jwatt” and pressed enter. Then entered into jwatt-rcdays-demo folder.]

Did I delete my files? I did. So now what I would do is…

[Selected 10G.dat and 1G.dat file in Globus-VM directory]

I select the files that I want to transfer, let’s say, from my Windows share, which is attached to this endpoint called Globus VM, and now I need to get them to scratch on the cluster because I’m ready to run a job. So I’ll highlight the data I want to move; we’ll say data integrity is important, so we’re not going to skip the checksumming, and then the source is on the right and the destination on the left this time, and we click the Start button…

[Clicked "ACTIVITY" and the right arrow icon to display an overview of the ongoing transfer. Initially, the transfer overview showed 2 files, 0 directories, 0 files transferred, 0 bytes transferred, 0 effective speed (MB/s), 0 skipped files due to sync, and 0 skipped files due to error.]

and that job has started. That should go a little quicker than pulling that down remotely, because that remote job was kind of slow; when I did this in last night’s testing, I had better throughput… and you asked about speed. If there’s network congestion anywhere along the way between Houston and Boise State, that could affect the speed of that transfer as well. So what we can do is build those networks and build the storage to handle the capacity of a certain speed, and then there could be anything in the path along the way that also affects it.

[Clicked on “ACTIVITY” then clicked back onto Globus-VM to BORAH-DTN displaying new information – 11 GB transferred and an effective speed of 259.55 MB/s.]

Okay, and that’s already done, something like that fast, and we’ll see that it flew by at 260 megabytes per second. Okay, so now you’ve got it on the cluster, and if I were to go look at it…

[Opened borah-login and entered credentials.]

So if I’m command line on the cluster,

[Entered “ls”, “mlxconfig-9.txt”, “cd scratch/”, “cd jwatt-rcdays-demo/”, “ls”, “clear”, “ls -alh”]

yeah, so now you can see it there.

[“total 14G
drwxr-xr-x  6 jwatt jwatt    215 Mar 28 09:43 .
drwxrwx--- 19 jwatt jwatt    664 Mar 27 18:39 ..
-rw-r--r--  1 jwatt g-jwatt 9.4G Mar 28 09:43 10G.dat
-rw-r--r--  1 jwatt g-jwatt 954M Mar 28 09:37 1G.dat
drwxr-xr-x  2 jwatt jwatt     26 Mar 27 19:13 dtn-01--to--borah-dtn
drwxr-xr-x  2 jwatt jwatt     26 Mar 27 19:21 dtn-01--to--bronconet02
drwxr-xr-x  2 jwatt jwatt     26 Mar 27 19:42 globus-vm--to--borah-dtn
drwxr-xr-x  2 jwatt jwatt     26 Mar 27 19:29 globus-vm--to--bronconet02”]

So now those are available in the command line in the cluster, and I can go run my batch jobs on those and do whatever I need to do.

[Opened Globus-VM to BORAH-DTN overview]

So now, for the sake of argument, let’s say that the 10-gig file was actually the source data and that the 1-gig file is my product. Let’s say I’m done with that research project, and I need to shove that into Google Drive; well, I might use something like Rclone to do that. So here’s where a little demo of Rclone comes in.

[Opened borah-login and entered “rclone”. Then entered “clear”.]

So Rclone is a utility like rsync. I won’t go into a lot of detail about how to use it. We can talk about that after class and how to configure that, but let’s say I want to see what’s in my Google Drive.

[Entered into terminal “rclone lsd gd:”]

I can do rclone lsd, and then my shortcut name for my Google Drive is gd, and then a colon. What Rclone is doing here is going out and talking to Google Drive, and it just gave me a directory listing of everything in my Google Drive. A side note: with Rclone, you can also access any Google Drive that’s been shared with you; there are a bunch of other command-line parameters you need to pass to get to those directories, but you can do that. So for the purpose of this example, you can see I’ve got a directory here, jwatt-rcdays-demo, and let’s say that’s the destination where I need to put the product that I’ve created on Borah.
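On that side note about shared drives: a small sketch, assuming the Google Drive remote is named gd as in this demo; --drive-shared-with-me is the Google Drive backend flag for items other people have shared with you, and the folder name in the copy line is purely hypothetical.

```bash
# List the top-level directories in my own Google Drive (the "gd" remote from the demo)
rclone lsd gd:

# List items that other people have shared with me instead of my own drive
rclone lsd gd: --drive-shared-with-me

# Pull a file out of one of those shared folders; "SharedProjectFolder" is hypothetical
rclone copy "gd:SharedProjectFolder/results.tar.gz" ./downloads/ \
    --drive-shared-with-me --progress
```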

[Entered command “rclone copy 1G.dat gd:/jwatt-rcdays-demo”]

I could say rclone copy 1G.dat gd:/jwatt-rcdays-demo, and I also meant to do dash-dash progress, so you guys can see… if you have a longer job and you want to watch Rclone and make sure

[Entered command “rclone copy 1G.dat gd:/jwatt-rcdays-demo --progress”]

it’s doing something, add in --progress. It will tell you what it’s doing; otherwise, it just sits there and you have no idea what’s happening. Now, be careful in Google Drive: if I were to do an ls right there, since Google Drive is object storage, it doesn’t print just the objects in my base directory; it gives me every object I have in Google Drive, so we’d be sitting here for minutes while that spun through until I canceled it. So with the ls command, you want to tell it which directory you really want to look in.

[Entered command “rclone ls gd:jwatt-rcdays-demo” and this displayed : 1000000000 1G.dat and 10000000000 10G.dat]

Here’s what’s in that directory in my Google Drive. Actually, both of those were already there; I forgot to delete them after I tested this last night, so that’s beside the point. So now you’ve got some products in Google Drive, and I think a lot of people use Google Drive to share with collaborators, maybe outside the institution, because it’s really easy, right? So what does that really look like? Well, if I go into my Google Drive… All right, so I’ve taken Globus and brought data into the institution, got it ready to work on the cluster, moved it to the cluster, ran my job on the cluster, and now I have an output file I need to share with somebody. I used Rclone to shove that up to Google Drive, and as we go to Google Drive and look in My Drive, there are those files.

[In folder jwatt-rcdays-demo in Google Drive]

So essentially, what I’ve demoed for you guys is just the path a researcher might take: using Globus to ingest data, Globus to get it onto the cluster, and Rclone to put it somewhere else; you could just as well use Globus to shove it back onto your research share too. So let’s go back to another point.

[Opened presentation slides]

I’m just going to keep it on
this screen so I don’t have to toggle back on

[Went to next slide – Boise State University Research Data Storage and Globus Transfer Infrastructure]

for the remote users. So essentially if you
wanted to see the path that we just did there, yes…

[Went to next slide – Science DMZ Network: Demo File Transfer]

I’ll start this demo, too; this demo will show you the speed difference, since you were asking about speed, between using the Science DMZ and using the campus network, and I will go back to… yeah, we’ve got 10 minutes; that should be enough time to get one fired up here.

[Transitioned back to file manager and selected “ESnet Starlight DTN (Anonymous read only testing)” for collection.]

I’m going to go to, I think, Starlight, which is what I used last night for this test, and I’m going to place it on…

[Selected DTN-01 for the second collection and pressed transfer & timer options to select “do NOT verify file integrity after transfer”.]

We’re going to grab this 100-gig file. We’re going to tell it not to verify the integrity because we just want to see what kind of throughput we can get today.

[Current path in DTN-01: /data/everyone/jwatt-rcdays-demo/]

Good, so now we’re moving a file of fairly significant size, 100 gigs.

[Started transfer of the 100G.dat file from ESnet Starlight DTN to DTN-01 and received a drop-down display of “Transfer request submitted successfully”.]

That job is now submitted across the Science DMZ, and the Science DMZ’s endpoints are connected at 10 gigs, so that’s the fastest theoretical throughput we’d ever expect to see…

[Clicked onto “ACTIVITY” button and selected right arrow icon on ESnet Starlight DTN transfer.]

and then I’m going to hop back to the slides.

[Transitioned to the presentation slide – “Test Run Transfer Results”]

If I go to the results of a test I ran last night, we can talk about the difference in units. Last night I moved that 100-gig file across the campus network and got a throughput of 93 megabytes per second, which, if we do the math, is about 0.75 gigabits per second, and that took nearly 18 minutes; I’ve got those times flipped on the slide. When I did the same transfer of the 100-gigabyte file across the Science DMZ, the throughput increased by nearly a full order of magnitude to 873 megabytes per second, which is about 7 gigabits per second, in just under two minutes. And here’s where the units matter: storage tools report megabytes per second, while networking is quoted in bits per second, so something like last night’s run of around 873 megabytes per second works out to roughly 6.9 gigabits per second on our 10-gigabit links.
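A quick sketch of that megabytes-to-gigabits conversion, using the numbers from the slide (the formula is just MB/s × 8 ÷ 1000):

```bash
# Convert an effective speed reported in megabytes per second (MB/s), as Globus shows it,
# into the gigabits per second (Gb/s) that network engineers quote: MB/s * 8 / 1000.
mb_per_s=873
echo "scale=2; $mb_per_s * 8 / 1000" | bc    # ~6.98 Gb/s, the Science DMZ run

mb_per_s=93
echo "scale=2; $mb_per_s * 8 / 1000" | bc    # ~0.74 Gb/s, the campus network run
```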

Now let’s go check on that transfer and see if we got throughput somewhere near that…

[Refreshed page and new information was displayed – 1 file transferred, 0 directories transferred, 100 GB transferred, 986.29 MB/s, 0 skipped files on sync, and 0 skipped files on error.]

yeah, so even faster. Today that moved in one minute, 41 seconds, at 986 megabytes per second. Do the rough math on that: round up to a thousand megabytes, multiply by eight, and we’ve got nearly eight gigabits of throughput. In the interest of time, so I can cover a few more things, I won’t run that same test from that node across the campus network, but it would be on the order of a magnitude slower, because it goes through things like the campus firewall, it has contention with other campus traffic, and it’s not a network dedicated to research traffic. That’s really the moral of the story: if you’ve got large data sets that you need to acquire, come see us; we can help optimize the path you use to get them to campus.

[Transitioned back to the presentation at slide 12 out of 16 titled “Links:”]

Here are some links. I’ll make the slides available in the Etherpad link for this class; they’re not on there now, but I will do that. If you’re interested in Rclone, you can go there and walk through the Rclone config (there’s a small sketch of that just below). Once your Rclone config file is created, there are tokens in there, and you can take that config file around with you to different hosts where you want to run Rclone; for instance, I run Rclone on my phone to back it up to Google Drive. There are also links for Globus Connect Personal, which I’ll briefly discuss in a moment, and for the Globus CLI if you’re interested in that. One of the things, let’s see, we’ll talk about in five minutes; I’ll leave a minute or two for questions. So, Globus Connect Personal: what does that do for me? It turns your laptop into an endpoint that’s discoverable in the Globus ecosystem. So briefly, if I go to File Manager and I search for… I’m on the VPN, so it’s offline right now, but my laptop is in the Globus ecosystem, so if I had a file that I needed to share with somebody in Globus, I could do it directly from my laptop.
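On the Rclone config point from the links slide, a minimal sketch; the remote name gd matches the demo, and where the config file lives varies by platform:

```bash
# Interactive walkthrough that creates a remote, e.g. a Google Drive remote named "gd";
# it handles the OAuth dance and stores the resulting tokens in the config file.
rclone config

# Print the location of that config file so you can copy it to other hosts
# (a laptop, a phone, a cluster login node) where you want to run rclone.
rclone config file
```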

[Transitioned around and landed back on slide 12 and went to slide 14 showing a postcard.]

And the story I’ll tell about that, a little personal one: we had a student collecting camera trap data in Mozambique, and it was a relatively large data set, 200 to 250 gigabytes, roughly 17,000 files, and she was trying to use Dropbox to get that data to the university. Can you imagine trying to move that stuff up to Dropbox while you’re moving around? She was actually in Portugal, moving from one hotel to a coffee shop to another. Try to imagine micromanaging 17,000 files in Dropbox, moving from one place to another and keeping track of what you’ve already done. That’s where Globus really shines: we put Globus Connect Personal on her laptop, fired up a job to transfer that data to the university, and at that point it was carefree for her. It didn’t matter whether her laptop was on or off, or whether she was at a hotel with really slow internet or at a coffee shop; whenever she turned the laptop on and found a network, it would start transferring more files. At the end of the day, those 17,000-plus files took 12 days and three hours, between her running around Portugal, to get back to us. But the moral of that story is that she just had to click Start on the job on her laptop, and the data ended up at the institution 12 days later without her having to micromanage any of it through hotel connections and spotty connections, being another continent away.
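For reference, a rough sketch of what putting Globus Connect Personal on a laptop looks like, assuming the Linux command-line package; the download URL and version are whatever globus.org currently publishes, so treat the exact names here as illustrative:

```bash
# Fetch and unpack the Linux command-line build of Globus Connect Personal
# (URL shown is the generic "latest" link; check globus.org for the current package)
wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
tar xzf globusconnectpersonal-latest.tgz
cd globusconnectpersonal-*

# One-time setup: registers this machine as a personal endpoint under your Globus account
./globusconnectpersonal -setup

# Start the endpoint; while this is running, the laptop appears as a collection
# you can pick as a source or destination in the Globus web interface
./globusconnectpersonal -start &
```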

[Went back one slide – “Misc”]

I think I already briefly mentioned that Globus is good for big files, like bringing 240 terabytes back from INL, and for bringing back lots of smaller files, like those camera trap files were. It’s also worth noting the endpoint name for Falcon: if any of you are Falcon users on the C3+3 cluster, you can access your Falcon scratch

[Went to next slide – “Q&A”]

and with that, we’re about out of time, so I’ll end it there with a couple of minutes for questions. Reach out to us by email if you’re interested in either of those utilities, want to learn more about them, or want help using them.

Globus could. There’s a Google Drive connector for Globus, and Research Computing tried to acquire that; it’s been three or four years ago now. As it went through our software review board, the office of the CSO had some concerns, so we kind of stopped pursuing it, but now that Google has gone to "it’s no longer unlimited storage," I don’t know if we’ll look at pursuing it again. When OIT discovered that there was a cap and they were going to be billed, they asked everyone over a certain amount of data to start bringing data back to the institution. So we ended up opening a bunch more research shares for faculty and PIs, and they started bringing data back; the last I’ve heard, they don’t know how much to charge people to use it, so effectively I don’t think there really is a limit right now. But I can tell you, when it was unlimited, Google put a cap on people of, I think, 750 gigabytes per day that you could put into Google, and if you went over that, you ended up in Google jail for 24 hours. To test that, I made a five-terabyte dummy file, because that’s the largest single file you can store in Google Drive, and I Rcloned that over to Google Drive. It took about 28 hours to get that five-terabyte file into Google. Once I did that, I had exceeded my 750 gigs per day, and I was in Google jail for 24 hours; I couldn’t even create a new calendar event or add files to Google Drive because I was locked out. Yeah, so all PIs at the institution can have up to 25 terabytes of no-cost storage on a Windows share or an NFS export. Most are Windows shares, and that’s where that Globus VM becomes important, because you can put Globus Connect Personal on your own computer, map a drive, and connect to it that way, or we can connect it to Globus VM, depending on your use case, what it looks like, and how often you’re going to use it, because there’s some back-end work that goes on on Globus VM to make that connection.

Anyone else? Yeah, how is it different than what? Now you’re empowered to share data. If you have a research share within the institution… If you want to share data, and this is for unprotected data, not controlled data but open research data, you can do it from wherever any of your storage is in Globus… I’ll use a better example, your Borah scratch: if you had data in there that you wanted to share with anyone else in the world, you could do that today. You go into your scratch, find the directory, click Share, and add the email address of another user. They’ll receive an email that says so-and-so has shared data with you in the Globus ecosystem, here’s a link. They click on that link and it takes them to Globus; once they authenticate, they open your endpoint, or that shared collection, and they can grab data straight from your Borah scratch, and Research Computing isn’t even involved in that. You’re empowered to do that.
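That same sharing step can also be scripted; a sketch below, assuming the Globus CLI and a placeholder UUID for an existing shared (guest) collection, with the path and email purely illustrative:

```bash
# Grant a collaborator read access to a directory on an existing shared/guest collection.
# The UUID, path, and email address below are placeholders.
globus endpoint permission create \
    "33333333-3333-3333-3333-333333333333:/jwatt-rcdays-demo/" \
    --permissions r \
    --identity collaborator@example.edu
# They receive the "so-and-so has shared data with you" email with a link into Globus.
```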

I think there’s another class here. Okay.

What? Oh, I don’t know. Yeah, I know nothing about that, sorry.

[Went to file manager webpage]

That’s it. I hope you guys found it useful; it’s a very brief introduction, and it’s a lot of information, but if you’re interested in these tools, or in using them with your data at Boise State, or in sharing data with anyone or pulling data down, just reach out to either me directly at jwatt@boisestate.edu or ResearchComputing@boisestate.edu, and we can all help you.