
Unix Shell Commands and File Creation Video Recap Transcript

[Two webpages are opened – Jupyter terminal and swcarpentry GitHub info page: Creating Files a Different Way | We have seen how to create text files using the nano editor. Now, try the following command: | Bash $ touch my_file.txt |

  1. What did the touch command do? When you look at your current directory using GUI file explorer, does the file show up?
  2. Use ls -l to inspect the files. How large is my_file.txt?
  3. When might you want to create a file this way? | Solution drop-down box | To avoid confusion later on, we suggest removing the file you’ve just created before proceeding with the rest of the episode; otherwise, future outputs may vary from those given in the lesson. To do this, use the following command:]

Eric: I’m going to pull our attention back up here for a second; we got a great question.

[Clicked onto the terminal – Jupyter:

0:$ pwd

/home/jovyan

0:~$ whoami

jovyan

0:~$ cd -

/

0:/$ ls ~/Desktop/data-shell/

creatures north-pacific-gyre solar.pdf data notes.txt writing molecules pizza.cfg

0:/$ VARIABLE=greg

0:/$ echo $VARIABLE

greg

0:/$ nano

0:/$]

So in the tutorial that we’re working through right now, many of us have made a file with something like touch my_file.txt, and let’s… I’ve got an error that says permission denied. The reason I’ve gotten this error is that I’m still in the root directory, where I don’t have permission to make stuff. So I’m going to navigate back to a directory where I can make stuff, like my home directory. I’ll do cd; now I’ll do touch. First, though, I’m going to confirm that I’ve gotten to the right place; so pwd.

[Output – /home/jovyan]

I’m in my home directory, I’ll do touch…

[Jupyter page is gone and replaced with Google Doc – Shell – RC Days 2023]

Hey everybody, sorry for those Zoom difficulties. To reorient us to where we are and where we’re going: right now, we are working through the Working with Files and Directories exercises.

[Link – https://swcarpentry.github.io/shell-novice/03-create/index.html]

The link to this is pasted over in our Google Doc under Working with Files and Directories, and our kind of collection point is the Creating Files a Different Way exercise. So once we have

[Started typing in Google Doc – Shell – RC Days 2023]

finished with the Creating Files a Different Way exercise… this is a blue flag. We’ll take our notes up there, but this is kind of our collection point, and we’re getting some great questions from working through these examples. So right now, in the room, I’m kind of circulating and answering some questions. For those of you that are online, put your questions in the chat, and Jenny will address those, and I’m going to pop back up here to the mic and answer some of the questions for the group as a whole. As an example, one of the questions that we had is “can we edit or view the contents of non-text files using the command line?”, and the answer to that is… I’ve got two answers. Firstly, yes, we can use text editors to view the contents of any file. That’s a technically correct but annoyingly pedantic answer that doesn’t give a lot of utility, because the second answer is that for certain kinds of files

[Opened terminal and started entering commands – ls  outputted Desktop myfile.txt python | cd Desktop/ | ls outputted data-shell]

the way that file is written

[Entered cd data-shell/]

or the way that file… so every single file on your file system is a collection of bytes that can be interpreted in different ways; some of those collections of bytes can be interpreted as plain text, and text editors can read that just fine.

[Entered ls and the terminal outputted creatures north-pacific-gyre solar.pdf data notes.txt writing molecules pizza.cfg]

So like as an example, if I do nano notes.txt;

[Entered nano tool]

here we’ve got a file that has some plain text that has some notes inside of it. I’ll do Ctrl x to exit, but if I do nano solar.pdf, this PDF file is a file that is not a plain old plain text file. It’s a PDF, and if we try to open that as a text file, nano works. It’s like, okay, I will try to open this as if it were plain text, and there is some text that makes sense, like I understand the words color and space, but there’s also stuff in here that is actually gobbledygook.

What is going on over here is that the encoding for this PDF file format has a combination of bits that can be interpreted as text and some bits that are to be interpreted in a different way. This reveals to us that under the hood, everything is just bits in some sense, and those bits can be encoded in a way that we can read with text editors, but sometimes that’s not useful. Using a text editor isn’t a useful way of having a look at this particular kind of file.
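One way to see the “everything is just bits” idea directly is with od (octal dump), a standard utility that isn’t part of this lesson; the file name here is made up for the sketch.

```shell
# Write a tiny file, then dump it byte by byte.
# od -c prints each byte as the character it decodes to,
# including the invisible newline at the end.
printf 'Hi\n' > /tmp/bytes-demo.txt
od -c /tmp/bytes-demo.txt
```

Running the same command on a PDF shows readable runs of text mixed with bytes that only a PDF reader knows how to decode.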

Using different programs for interpreting those files in the way that they were meant to be interpreted, in the way that they were meant to be decoded, like using a PDF reader to open that PDF, is a more useful way of having a look at that PDF.

Does that answer your question?

[Terminal page changed, and a glitch occurred – Jupyter page is open]

When we don’t give a file name to word count, it will listen for input, and we need to use the special keystroke combination Ctrl+D to tell word count to stop listening. This is a convention used across multiple programs. For example, if I type in python and start running the Python interpreter, I can input six plus seven

[Input 6+7 – output 13]

to do a little bit of math. I can use Ctrl+D to tell Python to stop listening for input. If I type in word count and start typing in some nonsense, I can use Ctrl+D

[Entered h, a, asdf, asdf, and then asgh – output 5 5 19]

to tell it to stop listening for input. So if you run a command that has stolen the focus of your command prompt, if it is listening for input, the way to tell it to stop listening is Ctrl+D, and this happens all of the time. So if you ever feel like you’ve lost your command prompt and you want it back, try Ctrl+D to get it back. Okay, so we have

[Moved the terminal to 50 percent width and the GitHub info page to half – GitHub page includes: Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command: Bash $ wc -l *.pdb > lengths.txt | The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. This command prints no screen output because everything that wc would have printed has gone into the file lengths.txt instead. If the file doesn’t exist prior to issuing the command, the shell will create the file. If the file exists already, it will be silently overwritten, which may lead to data loss. Thus, redirect commands require caution | ls lengths.txt confirms that the file exists: Bash $ ls lengths.txt]

side by side listed

[Entered wc -l cubane.pdb, then wc -l *.pdb, outputting 20 cubane.pdb, 12 ethane.pdb, 9 methane.pdb, 30 octane.pdb, 21 pentane.pdb, 15 propane.pdb, 107 total.]

the number of lines of all of these pdb files with word count: wc -l *.pdb. Now we’re going to show how to take the output of one command and use it for something else. The first way that we’re going to do this is with redirection. We’re going to use the greater than sign to redirect the output from this command, and we’re going to put it into a file.

To do so, I’m going to press up to get my wc -l *.pdb and then add greater than lengths.txt; when I press enter, I get no output. I got a new command prompt; it’s listening to me, but what happened was the output that we got before, instead of being printed to our screen, has been redirected and stored in this file lengths.txt.

Let’s do cat lengths.txt, and we can see that the stuff that’s in there is exactly the stuff that we had output before; pretty cool.
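Here’s a minimal, self-contained sketch of that redirection; the directory and .pdb files are made up so it can run anywhere without touching the lesson data.

```shell
# Set up a scratch directory with two fake "molecule" files
mkdir -p /tmp/redirect-demo && cd /tmp/redirect-demo
printf 'C\nC\n' > ethane.pdb    # 2 lines
printf 'C\n' > methane.pdb      # 1 line

# > sends wc's output into lengths.txt instead of the screen
wc -l *.pdb > lengths.txt       # prints nothing

# the stored counts come back out with cat
cat lengths.txt
```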

[Started scrolling down on the GitHub info page]

This will give us the ability to store the information that came out from a command in another file. So now we’re going to start combining multiple commands together to do some cool stuff.

Alright, so we’re going to use the sort command to sort the contents of the file of the length, and first, we’re going to work through this exercise to learn a little bit about what the sort command is.

So if we’ve got these numbers in a file: 10 2 19 22 6 and we run sort on this file, the output is 10 19 2 22 6; is that right? Does 2 come after 19? I’m seeing some heads shake no. Well, you’re wrong, because this is computer land, and everything doesn’t make sense.

There are two different ways of ordering numbers and letters. This order doesn’t make sense for numbers, but these aren’t numbers: these are words made from characters, and the sort command is not sorting them numerically. The sort command is sorting them lexicographically. It’s putting them in alphabetical order, using the lexicographical ordering of numbers.

The word 2 comes after the word 1 alphabetically. Here’s kind of an analogy I’m going to make. I’m going to make a file; we’ll call this ordering. Let’s do nano ordering, and in this file, I’m going to put “ab” and “b”. If we want to put these words into alphabetical order, what order do they go in? “ab” comes first, and then “b” comes second.

This is the exact same logic behind 19 coming before 2, or 19 coming before 9: the word “19” comes before the word “9” in this extended alphabet, and that’s why it’s ordered this way. This idea that the numbers are included in the alphabet is something someone needs to tell us the first time we see the sort command: sort is sorting things alphabetically by default, but we can use the -n flag to enforce numerical sorting rather than lexicographical sorting. So let me make a specific example. I’m gonna do nano numbers.txt and paste in the numbers.

[Copied and pasted in 10, 2, 19, 22, and 6]

then paste. Let’s get

[Added a new line between each number]

some new lines, and let’s write it out and save it. So now if I do cat numbers.txt, we’ve got 10 2 19 22 6, and if I do sort numbers.txt, we’ve got that 10 19 2 22 6. If I do sort -n numbers.txt, now we get them in numerical order: 2 6 10 19 22.

So the purpose of this exercise is to show that there’s a difference between the alphabetical ordering of numbers and numerical ordering of numbers, and computers are the worst, and this is something that someone needs to show us the first time that we’re working with sort.
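The numbers.txt demo, condensed into a reproducible snippet (written to /tmp so it won’t collide with the lesson files):

```shell
# The same five numbers as in the demo
printf '10\n2\n19\n22\n6\n' > /tmp/numbers.txt

sort /tmp/numbers.txt      # lexicographic: 10 19 2 22 6
sort -n /tmp/numbers.txt   # numeric:       2 6 10 19 22
```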

Okay, cool. Great question. So the question is, shouldn’t 19 come before 10 in the alphabetical order of numbers, and the answer is no. Zero comes before 9 in the lexicographical ordering of numbers, so 10 comes before 19. The ordering here would go: 10 11 12 13 14 15 16 17 18 19, and then 2 would come after 19, while 01 would come before 10, and 019 would come before 10.

Yeah, no, that’s a good question. Is this like a stem and leaf plot? I think maybe not, but maybe the right takeaway here is: if we forget for a second that numbers are numbers, we add them to the alphabet before the letters. So if we add the numbers 0 1 2 3 4 5 6 7 8 9 and then a b c d e f g, that is the ordering of numbers that computers use. It’s a convention, because file names have digits and other characters in addition to letters.

Pop quiz… what comes first in the alphabet, A or a? You’re right. So this is a convention that humans have decided on, in ASCII, the American Standard Code for Information Interchange. There is a standard for how we should sort characters, and we’ve decided that numbers come before capital letters, and capital letters come before lowercase letters, and then there are a bunch of other weird characters that are used in computers too. There’s a character for beep, and beep has to go somewhere in the alphabet if it’s a character that we can put into a file.
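That digits-then-uppercase-then-lowercase convention can be checked directly. Forcing the C locale with LC_ALL=C makes sort use raw ASCII byte order; in other locales, sort may fold case together, which is why the prefix is there.

```shell
# ASCII order: digits < uppercase < lowercase
printf 'b\nB\n9\nA\na\n0\n' | LC_ALL=C sort
# → 0, 9, A, B, a, b
```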

We want to be able to sort files that contain it, after all. Yeah? We have a question over there. Okay, yeah, cool, all right. So now that we’ve got the ability to sort things numerically, let’s have a look at combining some of these ideas together. Let’s do sort -n lengths.txt, and I’m going to redirect it with > to sorted-lengths.txt.

[Entered command – sort -n lengths.txt > sorted-lengths.txt]

Hit enter,

[Output – 9 methane.pdb, 12 ethane.pdb, 15 propane.pdb, 20 cubane.pdb, 21 pentane.pdb, 30 octane.pdb, 107 total]

and now if we cat sorted-lengths.txt, we see that our files: methane, ethane, propane, cubane, pentane, and octane have now been sorted numerically instead of in the order that they were sorted before, which was the alphabetical order.

Now, we’re going to introduce a new command called head that’s pretty handy. head is a command that we can use to print off the header or the first lines of a file. We can specify how many lines to print with the -n flag. So if we want two lines or 20 lines, we can put the number here after the -n. I’m going to ask for the first line from sorted-lengths.txt. So head -n 1 sorted-lengths.txt, and it gives me the very first line of sorted-lengths.txt, which is “9 methane” or technically some spaces, or maybe that’s a tab, and then “9 methane.pdb”.

This is a way of figuring out which of my files had the fewest atoms in it or which molecule has the fewest atoms out of all these molecules. We’ve expressed this idea in code. If we wanted to determine which of these molecules has the fewest atoms, we can count the number of lines in each of these files and sort those numbers numerically. Whichever one comes out on top with the fewest number of lines is the one with the fewest atoms.
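The whole recipe, with its intermediate file, can be sketched against two made-up .pdb files:

```shell
mkdir -p /tmp/fewest-demo && cd /tmp/fewest-demo
printf 'C\nC\nC\n' > octane.pdb   # 3 lines
printf 'C\n' > methane.pdb        # 1 line

wc -l *.pdb > lengths.txt                  # count lines per file
sort -n lengths.txt > sorted-lengths.txt   # smallest count first
head -n 1 sorted-lengths.txt               # file with the fewest lines
```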

Similarly, if we wanted to ask for the very last line, we can use the tail command…

[Inputted tail -n 1 sorted-lengths.txt and outputted 107 total]

tail on sorted-lengths, in this particular case, gives me the total number of lines. So if I ask for the last two lines, let’s go tail -n 2, I can see that octane.pdb is the one that has the most lines. So we can use these command line tools to start expressing increasingly complex ideas. Okay, so now we have used redirection to take the output from a file and store it in another file; that was the greater-than sign. But we can also redirect it in a slightly different way. I’m going to clear the screen if things are looking a little bit messy, and I’m going to do… let’s see, this exercise… I’m going to skip it. Let’s talk our way through it, so

[Entered ls command and outputted: cubane.pdb, numbers.txt, propane.pdb, ethane.pdb, octane.pdb, sorted-lengths.txt, lengths.txt, ordering, methane.pdb,pentane.pdb]

In our current directory, we’ve got all of our pdb files that we’ve been working with. We have our lengths.txt and our sorted-lengths.txt. I should pause and address that because I have scaled my screen down small, we have some text wrapping quirks that, the first time we see them, are disorienting. But our terminal is doing the best it can, so when I ask my command prompt to list the contents of this directory, it’s here. Let’s give “cubane” a little bit of space. “numbers.txt”, let’s give you a little bit of space. There’s “propane”, but “propane” doesn’t fit on this line, so it’s going to wrap the name of that file to the next line. And then, it was expecting that to be the end of the line, so it starts a new line underneath with “ethane”. It makes things look a little bit weird, but when we slow down and have a look at what’s there, it is still parsable in principle.

Okay, so we have used the greater than sign to redirect the output of a command to a file. If I run sort -n lengths.txt > sorted-lengths.txt and I redirect the output to sorted-lengths.txt, the contents of sorted-lengths.txt are the same as we had before.

[Inputted cat sorted-lengths.txt and got the output: 9 methane.pdb, 12 ethane.pdb, 15 propane.pdb, 20 cubane.pdb, 21 pentane.pdb, 30 octane.pdb, 107 total]

Now I’m going to show what happens if we add on another one. I’ve got kind of a funny quirk with how things are being displayed here, so if I run back and forth with my cursor a couple of times, it makes it visible. Now we can see what I’m typing: sort -n lengths.txt. I’m going to put in a double greater than sign, and this is a slightly different version of output redirection. Instead of writing over the file sorted-lengths.txt, it appends to the file,

[Entered command sort -n lengths.txt >> sorted-lengths.txt]

so now when I do cat sorted-lengths.txt

[Command outputted – 9 methane.pdb, 12 ethane.pdb, 15 propane.pdb, 20 cubane.pdb, 21 pentane.pdb, 30 octane.pdb, 107 total, 9 methane.pdb, 12 ethane.pdb, 15 propane.pdb, 20 cubane.pdb, 21 pentane.pdb, 30 octane.pdb, 107 total]

we’ve got the first time that we redirected our output there, and this time, when we appended to the file, we’ve got another copy of the same thing. So anytime that we use the single greater than sign for redirection, it will overwrite what is in that file, but if we want to append to that, we can use the greater than greater than redirection operator. Okay, we’ve got a couple of orange flags. I’m going to circulate, and then we’ll come back and talk about pipes.
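A quick self-contained sketch of the difference (the file name is made up):

```shell
echo first  >  /tmp/append-demo.txt   # > creates/truncates the file
echo second >> /tmp/append-demo.txt   # >> appends below "first"
echo third  >  /tmp/append-demo.txt   # a single > wipes the file again
cat /tmp/append-demo.txt              # only "third" survives
```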

[Left camera view and reappeared]

We had such a good question about sorting over here. We were talking about the difference between numerical ordering and alphabetical ordering, and so if we’ve got our sorted-lengths.txt right now and I sort it… sort sorted-lengths.txt.

[Entered sort sorted-lengths.txt and the terminal outputted 9 methane.pdb, 9 methane.pdb, 12 ethane.pdb, 12 ethane.pdb, 15 propane.pdb, 15 propane.pdb, 20 cubane.pdb, 20 cubane.pdb, 21 pentane.pdb, 21 pentane.pdb, 30 octane.pdb, 30 octane.pdb, 107 total, 107 total]

We get them in what looks like numerical order now: 9, 12, 15, 20. And if we look at this, we should ask, “Eric, you told me that if we don’t sort numerically, that 9 comes after 12 in the alphabet. Why is this sorted numerically now, even though we didn’t use the -n flag for this particular example?”

The answer is that what’s being sorted here includes spaces. Space is also a character that has to go in the alphabet, and “space space 9” comes in the alphabet before “space 12”. So there’s a really great observation. The answer to this particular question is that a space is in the alphabet, and it comes before the numbers.
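That can be reproduced in isolation. LC_ALL=C forces plain byte ordering, where the space character sorts before every digit, so right-padded numbers come out looking numerically sorted:

```shell
# Two leading spaces on "  9" put it first, before " 12" and "107"
printf '  9 methane.pdb\n 12 ethane.pdb\n107 total\n' | LC_ALL=C sort
```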

Okay, so now let’s put things together with some pipes. The pipe character is the shifted backslash, the character above your Return or Enter key. This pipe character is one that you may have never typed before, because it is basically only used for this purpose, and if you’re having trouble finding it, give me an orange flag. We’re going to use this pipe character, and on some keyboards, it can look like a really tall colon,

[Entered and deleted character “|”]

like there might be a little space in between the vertical lines, but the pipe character is what we’re going to be using here, and we’re going to use it to stitch together two programs. So if we do sort -n lengths.txt and pipe that to head -n 1, what this is going to do is take the output from the sort command and send it as the input to the head command.

[Entered sort -n lengths.txt | head -n 1 and the terminal outputted 9 methane.pdb]

So this is a way of getting the file that has the fewest lines from all of our pdb files without involving an intermediary. Before, what we did was sort our lengths.txt and redirect that to sorted-lengths.txt,

[Entered command sort -n lengths.txt > sorted-lengths.txt]

and then we said head -n 1. I’m pausing here because I have not given a file name. I’ve made that stumble that I talked about earlier; I need to tell head to stop listening. All right, let’s press Ctrl+D. Let’s do head -n 1 sorted-lengths.txt.

[Command outputted 9 methane.pdb]

Before, we used this sorted-lengths.txt as an intermediary: we put the output from sort into sorted-lengths, and then we used head -n 1 to see which file had the fewest lines. We don’t need that intermediary with pipes; with the pipe, we can take the output directly from sort and pipe it as the input to head -n 1, and we can pipe together multiple things. I can pipe that to word count

[Entered command sort -n lengths.txt | head -n 1 | wc and the terminal outputted 1 2 22]

and we see that the output of head was one line with two words and 22 characters. This particular output is information-dense. We can count 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 characters between the 9 and the end of “pdb”, while word count tells us that there are 22 characters in this line. This tells us that a whole bunch of spaces are being included over here. These spaces are real characters that take up space.

Incidentally, when I teach programming classes, sometimes there are assignments where you can find some code on the internet that will do the homework for you, right? I’ll have some problems and ask my students to write some code to solve a little puzzle, and I’ll write some code and solve the puzzle. But sometimes, students will copy each other and change the names of variables to claim the assignment as different from their friends’. I don’t look at the variable names, though; I look at the spaces in the file. If you have all the exact same spaces in the exact same places, then I know that you cheated, because the probability of that happening by chance is essentially zero. So, heads up if you’re trying to cheat: also change the number of spaces you’re using to throw off the computer detectives.

Okay, so when we’re combining multiple programs together with pipes like this, we refer to each of these program invocations as a filter. Each filter takes the output from one program, changes it in some way, and passes it off to the next program with a pipe. By chaining multiple filters together, we can express pretty complicated, realistic workflows. This is how we start building automated workflows that do complicated tasks.
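For instance, a three-filter chain using only commands from this lesson can answer “what’s the third-smallest number?” with no intermediate files (the numbers are made up):

```shell
# sort numerically, keep the three smallest, then keep the last of those
printf '30\n9\n107\n12\n' | sort -n | head -n 3 | tail -n 1
# → 30
```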

For example, let’s do word count. We can get the number of lines from all of our pdb files. word count…

[Entered clear command]

my goodness wc -l *.pdb | sort -n.

[Output: 9 methane.pdb, 12 ethane.pdb, 15 propane.pdb, 20 cubane.pdb, 21 pentane.pdb, 30 octane.pdb, 107 total]

So we’ll get the numerical ordering of all of those pdb files, and we can pipe that to head -n 1

[Entered command – wc -l *.pdb | sort -n | head -n 1 and got the output 9 methane.pdb]

to get the shortest file all in one line. Now we’ve got a one-liner that we can use in any directory of pdb files to figure out which file is the shortest in that directory. That’s a nice transferable magic power that we’ve put together using three different commands.
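The transferable one-liner can be checked in a throwaway directory with made-up files:

```shell
mkdir -p /tmp/oneliner-demo && cd /tmp/oneliner-demo
printf 'C\n' > methane.pdb        # 1 line
printf 'C\nC\nC\n' > octane.pdb   # 3 lines

# shortest .pdb file in the current directory, no intermediate file
wc -l *.pdb | sort -n | head -n 1
```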

Okay, here is a nice visualization of some of the different versions of the word count invocation that we’ve seen. The first time, when we invoked wc -l *.pdb, the output came back to the shell; we say that it’s returned to standard output, and that’s what we see when we get output from our command prompt. We can redirect that to lengths.txt, and then we get no output. We can also pipe together multiple filters, expressing this big idea in one line at the command prompt.

What questions exist right now? Remember, your questions are a gift to your classmates because somebody else has the same question that you have.

[Scrolled down to Pipe Reading Comprehension – A file called animals.csv (in the shell-lesson-data/exercise-data/animal-counts folder) contains the following data: Code | 2012-11-05,deer,5 | 2012-11-05,rabbit,22 | 2012-11-05,raccoon,7 | 2012-11-06,deer,2 | 2012-11-06,fox,4 | 2012-11-07,rabbit | 2012-11-07,bear,1 | What text passes through each of the pipes and the final redirect in the pipeline below? Note, the sort -r command sorts in reverse order.]

Okay, if there are not any questions right now, let’s pop over to the pipe reading comprehension exercise and give me a blue flag when you’re done with pipe reading comprehension and give me an orange flag if you’ve got some questions, and we’ll come to answer them.

[Added info into Shell – RC Days Google Doc and worked through some commands that will be covered in the future discussion]

For those of you that are looking for the specific data file that’s used in this example

[Highlighted: 2012-11-05,deer,5 | 2012-11-05,rabbit,22 | 2012-11-05,raccoon,7 | 2012-11-06,deer,2 | 2012-11-06,fox,4 | 2012-11-07,rabbit | 2012-11-07,bear,1]

there’s been a small change to some file names between the tutorial content and what we’ve got on our docker container. So in the animals.txt file, we’ve got the same file contents as in this example. This is the kind of theoretical exercise that you could do in your head, but we do have this file, and it is hidden in this

particular path.

[The path – ~/Desktop/data-shell/data/animal-counts/animals.txt]

I’ll paste the path in here so that you can find it: ~/Desktop/data-shell/data/animal-counts/animals.txt. This should be the right one. Thank you.

[Left camera to answer questions and returned to then scroll to the next activity – Nelle’s Pipeline: Checking Files: Nelle has run her samples through the assay machines and created 17 files in the north-pacific-gyre directory described earlier. As a quick check, starting from the shell-lesson-data directory, Nelle types: Bash $ cd north-pacific-gyre $ wc -l *.txt | The output is 18 lines that look like this: 300 NENE01729A.txt | 300 NENE01729B.txt | 300 NENE01736A.txt | 300 NENE01751A.txt | 300 NENE01751B.txt]

Okay, for our last activity, we’re going to work through a realistic data analysis task. We’ve got a hypothetical graduate student, Nelle, and Nelle has done a bunch of experiments and collected information about them in a directory called north-pacific-gyre. So I’m going to head over to that directory, and to do that, with caps lock off, I’m going to do cd to take me back to my home directory.

[Entered command ls and terminal outputted Desktop and python]

I’m going to navigate into the desktop directory

[Entered command ls and terminal outputted data-shell python-novice-gapminder swc-python]

and into the data shell directory.

[Entered command ls and terminal outputted creatures north-pacific-gyre solar.pdf data notes.txt writing molecules pizza.cfg]

Then there is north-pacific-gyre, so cd north-pacific-gyre. The path to where I am is ~/Desktop/data-shell/north-pacific-gyre, and inside this directory, I’ve got one directory, 2012-07-03. I’m going to navigate to that directory: cd 2012-07-03. If you’ve navigated here, give me a blue flag, and if you’ve got problems, give me an orange flag. Okay, I’m interpreting the absence of orange flags as blue flags. So put those orange flags up if there is a problem, all right, okay.

[Went off camera to help and returned]

One nice tip that popped up in that last round of chatting was if you want to see what can be auto-completed in your current directory, you can hit tab two times really fast, and it’ll list… your operating system will list all of the things that could be auto-completed. So if I don’t type anything and I hit tab twice really fast, it tells me all of the different programs I could possibly run. Which is a little bit overwhelming, but if I do ls space tab twice, it’ll tell me

[Entered ls and pressed tab twice – Files outputted: NENE01729A.txt NENE01729B.txt NENE01736A.txt NENE01751A.txt NENE01751B.txt NENE01812A.txt NENE01843A.txt NENE01843B.txt NENE01971Z.txt NENE01978A.txt NENE01978B.txt NENE02018B.txt NENE02040A.txt NENE02040B.txt NENE02040Z.txt NENE02043A.txt NENE02043B.txt goodiff goostats]

the names of all of the files in this directory, and so these files are like NENE01971Z.txt, and these are the files that Nelle has created during her experiments. Let’s have a look at one of them. Let’s do cat NENE01729A.txt, enter. So there’s a bunch of numbers in here. These are some measurements that have come off of her instrument in the North Pacific Gyre; maybe she’s a hydrologist of some variety. It’s a big long list of numbers, and she wants to do some data analysis that involves all of the files that have “A” and “B” in this directory. So there’s the prefix NENE, there’s a number that says which experiment it is, and a letter that says if it’s experiment A or experiment B. We’re going to try to put together some pipes that involve all of these

[Entered command ls and terminal outputted – NENE01729A.txt NENE01729B.txt NENE01736A.txt NENE01751A.txt NENE01751B.txt NENE01812A.txt NENE01843A.txt NENE01843B.txt NENE01971Z.txt NENE01978A.txt NENE01978B.txt NENE02018B.txt NENE02040A.txt NENE02040B.txt NENE02040Z.txt NENE02043A.txt NENE02043B.txt goodiff goostats]

data files. This is a task that you might have someday, where you’re trying to analyze statistics of some goo that you’ve sampled, and we want to make sure that we’re averaging the right data together and that there aren’t any problems with our data. So let’s do a wc -l *.txt.

[Output on terminal – 300 NENE01729A.txt 300 NENE01729B.txt 300 NENE01736A.txt 300 NENE01751A.txt 300 NENE01751B.txt 300 NENE01812A.txt 300 NENE01843A.txt 300 NENE01843B.txt 300 NENE01971Z.txt 300 NENE01978A.txt 300 NENE01978B.txt 240 NENE02018B.txt 300 NENE02040A.txt 300 NENE02040B.txt 300 NENE02040Z.txt 300 NENE02043A.txt 300 NENE02043B.txt 5040 total]

We can see that we’ve got 300 lines in most of these files, but what’s up with this one? This particular file, NENE02018B, has 240 lines instead of 300 lines. So that’s weird; it tells us something different happened with that particular experiment. So let’s get some more information. Let’s do a… hey, Elizabeth. Let’s do a wc -l *.txt, sort the output of all of those numerically, and get a sense of the shortest five files. So head -n 5

[Entered command wc -l *.txt | sort -n | head -n 5 | outputted: 240 NENE02018B.txt 300 NENE01729A.txt 300 NENE01729B.txt 300 NENE01736A.txt 300 NENE01751A.txt]

and indeed that 2018B is a file that’s 60 lines shorter than the other ones. So let’s also make sure that there are not any files that have too much data. We can use tail instead of head to get the last five files, and… oh, my command prompt is freezing. Let’s see if I’ve lost everything or if it’ll come back. Looks like it’s toast, so I’m gonna open up a new JupyterLab for myself. Open up a new terminal

[From a new terminal entered the directory ~/Desktop/data-shell/north-pacific-gyre/2012-07-03]

and then navigate to Desktop/data-shell/north-pacific-gyre/2012-07-03, and we’re right back where we were.

[Entered command ls – outputted: NENE01729A.txt NENE01729B.txt NENE01736A.txt NENE01751A.txt NENE01751B.txt NENE01812A.txt NENE01843A.txt NENE01843B.txt NENE01971Z.txt NENE01978A.txt NENE01978B.txt NENE02018B.txt NENE02040A.txt NENE02040B.txt NENE02040Z.txt NENE02043A.txt NENE02043B.txt goodiff goostats]

All right, hooray. So wc -l *.txt, pipe that to sort, and pipe that to tail to get the last five files; and we need to include the number of lines for tail to display. The error that I got up here is tail saying, “Hey, the option -n requires an argument; you didn’t give it one.” Normally we would use tail with the -n option, a number, and a file name, but I didn’t give it the argument to -n. So now I’ve added the five, and let’s press enter

[Entered command wc -l *.txt | sort -n | tail -n 5 – outputted 300 NENE02040B.txt 300 NENE02040Z.txt 300 NENE02043A.txt 300 NENE02043B.txt 5040 total ]

and the last five lines of that sorting tell us that these four files all have 300 lines. Okay, cool. But wait a second: we want to include all of the data files that have the “A” and “B” in our analysis, and this one has a “Z”. What is going on here? We’ve got to exclude this data file with the “Z” in it.

So, how are we going to include just the ones with the “A”s and the “B”s and not the “Z”s? Well, we can use the wildcard character that we learned about before. If we do ls *Z.txt, we can see whether there are any other files that have a “Z”, and there are two of them. We could delete these files, but maybe they’re important for something. So instead of deleting them, let’s put together a workflow that analyzes all of the files that end with “A” and all the files that end with “B” but doesn’t include the “Z”s.

So if we do, for example, ls NENE*A.txt, this will give us…

[Entered ls NENE*A.txt and outputted all of the files with NENE#A.txt]

That’ll give us all of the files that end with “A”. If we do ls NENE*B.txt, that’ll

[Entered ls NENE*B.txt and outputted all of the files with NENE#B.txt]

give us all the files that end with “B”, and there might even be a way of including all of the files that have an “A” or a “B” in them with one command. Let’s see if I remember how to do that. This isn’t actually part of this lesson, but I’m curious whether it works if we put “A” and “B” inside square brackets.

[Entered command ls NENE*[AB].txt – outputted all of the files with NENE#A.txt and NENE#B.txt]

Now this expands to all of the files that start with “NENE,” have anything in the middle between “NENE” and an “A” or “B,” and end with “.txt.” How cool is that?
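As a self-contained sketch of that bracket expression (the scratch directory and file names here are invented for the demo, not Nelle’s real data):

```shell
# Recreate a few of the lesson's file names in a scratch directory so the
# bracket wildcard can be shown in isolation.
mkdir -p /tmp/gyre-demo
cd /tmp/gyre-demo
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt

# [AB] matches exactly one character: either "A" or "B", never "Z".
ls NENE*[AB].txt
```

The listing shows NENE01729A.txt and NENE01729B.txt; the “Z” file is skipped without having to delete it.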

Okay, what questions exist right now? We’ve got some in the chat. Cool, alright, so let me mute myself. Y’all in the… okay, I think we might be good now.

Okay, so in this section, we have learned about pipes and filters. We’ve put together the ability to take a whole bunch of files that are in a directory and learn something about them, like the number of lines in each file. We’ve taken that information and sorted it to make decisions. We’ve used redirection to put the output of a command into a file. We’ve used pipes to take the output of a command and use it as the input for another command. And we’ve started to dip our toes into Nelle’s data, where we’re trying to get our bearings with some data that we’re going to analyze.
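The pattern this whole section is built on can be sketched end to end with made-up files (the names and scratch directory here are invented for the demo):

```shell
# Build three small files of different lengths in a scratch directory.
mkdir -p /tmp/pipes-demo
cd /tmp/pipes-demo
seq 1 5 > a.txt
seq 1 3 > b.txt
seq 1 9 > c.txt

# Count lines per file, sort numerically, keep the smallest entry:
# the shortest file (b.txt) sorts to the top.
wc -l *.txt | sort -n | head -n 1
```

Swap head for tail -n 1 and the same pipeline finds the longest entry instead, which is exactly the head/tail switch used on Nelle’s files above.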

So far, we’ve seen how to use wildcards to express the idea that we want certain kinds of files included in our analysis. We’re going to build on that in the next section. So I’m going to paste the link to our last section for today, loops,

into our Google Doc, and we’ll get started on loops. Maybe we’ll finish this, maybe we won’t, but whether or not we’re done, I want to take a moment to summarize. We’ve seen a lot of stuff today. At the beginning of the day, command line interfaces were a new idea, and we have learned how to navigate file systems, create files, edit those files, remove those files, move those files, run programs on those files, and pipe together the outputs from a command to the inputs of another command to start expressing complicated ideas.

Combining the skills that we’ve started practicing with loops, the ability to do the same thing multiple times on different files or in different places, gives us the building blocks for automating repetitive tasks. Our job as scientists is to think about stuff and figure out what the data means. We shouldn’t be spending all of our time copying data from one Excel column into another and doing the analysis on that column of data over and over and over again for every single data file. We can automate these repetitive tasks away.

Anytime you find yourself doing the same thing over and over and over again, you should think, “Okay, can I do this with the command line? Can I write a recipe for how to do this so that I don’t have to do it over and over and over again?” A heuristic that I use for myself is: if I can guarantee I’m going to do this thing 20 times or fewer in my life, I’ll do it by hand. Just get it done; I won’t write any code for it. But if I think I might have to do this more than 20 times, and it’s important that I do it right all of those times, it’s time to sit down and figure out how I can write some loops, some scripts, programs, pipes, and filters that I can put together to express this data analysis workflow in a way that I can use over and over and over again.

So maybe that number is different for you, but if I can guarantee I only have to do it 20 times, I just do it by hand. If it’s going to come up more often than that, it’s time to sit down and write some code, because I’m going to use that over and over and over again, and it’ll save me time in the long run.

Okay, any questions about where we are and where we’re going?

[Scrolled down to code showing for loops –

Bash

for thing in list_of_things
do
    operation_using $thing    # Indentation within the loop is not required, but aids legibility
done

And we can apply this to our example like this:

Bash

$ for filename in basilisk.dat minotaur.dat unicorn.dat
> do
>     echo $filename
>     head -n 2 $filename | tail -n 1
> done

Output:
basilisk.dat

CLASSIFICATION: basiliscus vulgaris]

So, as a little bit of orientation to this loops lesson: it culminates in writing some loops to do analysis on these NENE files in Nelle’s analysis directory. For these North Pacific Gyre measurements that she’s done, we’re going to put together some loops over these data files to get some mathematical analyses, like averages and standard deviations of the numbers inside these files. This is a pretty common workflow that we might be dealing with in our day-to-day analyses.
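A rough sketch of where that lesson is headed, with invented scratch files and wc -l standing in for the lesson’s per-file analysis step (the real lesson runs its goostats program here):

```shell
# Scratch copies of a few data file names; in the real lesson these live
# in north-pacific-gyre/2012-07-03 and contain actual measurements.
mkdir -p /tmp/loop-demo
cd /tmp/loop-demo
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt

# Loop over the A and B files only, printing each name before running a
# per-file command (wc -l here as a stand-in for the analysis step).
for datafile in NENE*[AB].txt
do
    echo $datafile
    wc -l $datafile
done
```

The loop body runs once per matching file, and because the wildcard uses [AB], the “Z” files never enter the analysis.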

We’re not going to finish that today, so I wanted to orient you toward where this lesson is going and give you a little bit of orientation toward the Unix Shell and Software Carpentry lessons more broadly. The rest of the Unix Shell lesson will build on these ideas with loops and pipes and filters to show how to combine these things into recipes, into shell scripts that we can use over and over again. We’ll also learn how to find different files and programs on our operating systems. So that’s some of the content that we’re leaving out today, and that’s okay.

You can come back to this after class today and work through these tutorials with other folks in your lab or by yourself. You can also talk to other folks that are involved with Software Carpentry, which includes myself and the kind folks at Research Computing. It is their job to help you learn this content for doing stuff on the high-performance computing clusters that we have here at Boise State, or that we have access to through the Collaborative Computing Center at Idaho National Lab. You are not alone in learning how to use these skills to do your research, and anytime you find yourself spinning your wheels alone, you’ve made a mistake. Doing stuff alone and learning this stuff alone is the worst.

So, send an email to OIT Research Computing (ResearchComputing@boisestate.edu), or come find people at events like this and write down their email addresses from the sign-in sheet, so that you can go talk to them later and figure out where to get help with this stuff: how do I make my workflow go faster, or be more correct, or be more transferable? Okay, so that’s kind of my rant about the other content in this particular workshop and Software Carpentry writ large: software-carpentry.org.

[Went to https://software-carpentry.org in Firefox]

There’s a whole bunch of other lessons that you should know about. Where did they go? Not License; I want Lessons.

[Clicked on “Lessons” and then “Other Lessons” to get to the current page]

There’s Version Control with Git, Programming with Python, and Programming with R; these are the canonical lessons of the Software Carpentry curriculum. In addition to Software Carpentry, there is Data Carpentry and Library Carpentry; let me pull up links to all of those and paste them into our Google Doc. All of these different Carpentries lessons have slightly

[Pasted into the document the other website – https://datacarpentry.org/lessons/ & https://software-carpentry.org/lessons & https://librarycarpentry.org/lessons]

different orientations for different populations. Unix Shell is a really great cross-cutting lesson that supports workflows in the data science community, the research scientist community, and the library sciences community, but these communities are different enough that having Data Carpentry, Software Carpentry, and Library Carpentry is a logical division of curricula. Depending upon the kind of work that you do, you might find tutorials in some of these other Carpentries that help you level up your skills in tasks authentic to your lab.

So I encourage you to check these things out and see what you should be learning; maybe it’s Python, or RStudio, or something different for the things that are happening in your lab. These are all resources that exist that you should know about. And the reason we call this Software Carpentry or Library Carpentry, in case somebody asks you about this material that you really like and are finding useful, is that it evokes the idea that you don’t need an architecture degree or a building engineering degree to build a coffee table. If you have a bookshelf in your house or your apartment, sometimes you need to manage the books, the data, the information that is in your apartment or in your lab. You don’t need a full-fledged engineering degree to build a bookshelf, but you do need some carpentry skills.

Analogously, you don’t need a whole computer science Ph.D. to do computational science. You don’t need a Ph.D. to use computers to make your life easier; you need some software carpentry skills, and this is your introduction to some of the tutorials and communities that exist out there. I should mention that the Carpentries have regular community meetings where you can find other birds of a feather working on problems in your domain area, and meeting those people and having those conversations to build out your skill sets and level up is really healthy.

All right, what questions exist before we go into our last 10 minutes? Great, all right, let’s pop back over to loops. I think what we’ll do for these last 10 minutes is start working through some of these loop exercises. Carolyn and I will circulate and answer any questions that pop up, and if folks have questions about getting terminals working on their own machines, you can put those in the Google Doc and we can try to answer them asynchronously.

Whoa, I’ve pasted these links in a weird spot. I’m gonna move these to the bottom, cool. So let’s pop over to the loops exercises, and we’ll write our own loop as our gathering point for finishing up today. If you’ve finished writing your own loop, you can put up your blue flag and call it a day, and if you’ve got questions on how to write your own loop, put up your orange flag and we’ll come answer those questions. Thanks so much for attending today. We’ve got a really great program today and tomorrow of talks and other workshops and poster sessions, and yeah, it’s lovely to have you all here back in person. It’s been a while since we’ve had an RC Days where we’ve gotten together like this, and it’s very refreshing. Thank you.