Linux/Unix and the Command Line
Today we will learn how to get around a Linux computer that you are logged in to remotely through a terminal, also called the command line. We will also learn how to run commands on this remote server and see the results.
First, we will log in to the CGRB cluster computer that we will run our software on. If a terminal window is not already open, click the PuTTY link on your desktop, or ask a course TA for help. The terminal is a black screen with white text and a blinking cursor.
Each command is typed at the end of the last line, after the prompt, and the output from that command is printed to the screen.
Once you have logged in, the terminal is connected to the CGRB cluster, so the commands you type run on the cluster rather than on your own computer.
First, let's find out where we are. To display the path of the folder you are currently in, or your "current working directory", type the command:
pwd
This will print out the full path to where you are in the computer. The path includes all folders containing your folder, separated by "/". For example, the path "/home/bpp/franklinr" means the folder "franklinr" is in the folder "bpp", which is in the folder "home".
The program "ls" will list all of the files and folders inside your current directory:
ls
When you run a program, you can also give it additional information, such as an input file or options that control how it runs. To do this, you add what are called "arguments" after the program name. For example, without any arguments, the "ls" command prints out the contents of the current working directory. If you give it a path to a folder as an argument, it prints out the contents of that folder instead. Try this command:
ls /usr/bin/
This will print out the contents of the folder "/usr/bin/", which contains many of the programs installed on the cluster. Programs in this folder can be run just by typing their name, such as "ls". If you do not know the options for a program, most programs accept the argument "--help", and some also accept "-h", to list all of the arguments they recognize. For "ls", use "--help", because "-h" has a different meaning there:
ls --help
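To make the idea of arguments concrete, here is a minimal sketch that runs "ls" two ways on a throwaway folder. The scratch folder and the file names are made up for this example and are not part of the lab. "mktemp -d" creates a new empty folder, "touch" creates empty files, and "-l" is a standard "ls" option that prints details for each entry.

```shell
# Create a throwaway folder and two empty files (hypothetical names).
scratch=$(mktemp -d)
touch "$scratch/notes.txt" "$scratch/reads.fasta"

# One argument: the path of the folder to list.
ls "$scratch"

# An option plus an argument: -l adds permissions, owner, size, and date.
ls -l "$scratch"
```

The order matters only in that the program name comes first. Options such as "-l" and paths such as the folder name are all arguments to "ls".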
Next we can create a directory. A directory is the same thing as a folder: a place to keep files and other directories. You can create one with the command "mkdir", short for "make directory". For example, create a directory to store your data:
mkdir data
You can move into a directory with the command "cd", short for "change directory". For example, change into your data directory:
cd data
Run the pwd command again to show your new current working directory:
pwd
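If you would like to practice these three commands without touching your lab folders, the sketch below repeats them inside a scratch directory and then returns to where it started. The scratch directory comes from "mktemp -d", and the variable names "start" and "scratch" are assumptions made for this example only.

```shell
# Remember where we started so we can come back at the end.
start=$(pwd)

# mktemp -d creates a new, empty scratch directory and prints its path.
scratch=$(mktemp -d)
cd "$scratch"

mkdir data      # make a directory named "data" here
cd data         # step inside it
pwd             # the printed path now ends in /data

cd "$start"     # return to the directory we started in
pwd
```

After the last line you are back in the same directory as before, so the rest of the lab continues unchanged.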
Now let's get ready to copy a file. The next command lists all files in the folder "/nfs1/Teaching/CGRB/dbbc_s16/data/" that end with ".fasta":
ls /nfs1/Teaching/CGRB/dbbc_s16/data/*.fasta
The * character in the last command is called a wildcard. It matches any number of characters, including none. For example,
ap* would match:
apple
apricot
apramycin
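The wildcard can be tried safely on a few empty files. This is a small sketch in a scratch folder; the file names are invented for the example.

```shell
# Create a throwaway folder with three empty files (hypothetical names).
scratch=$(mktemp -d)
touch "$scratch/apple" "$scratch/apricot" "$scratch/banana"

# The shell replaces ap* with every name that starts with "ap"
# before ls runs, so ls only sees apple and apricot.
ls "$scratch"/ap*

# A pattern can also match the end of a name.
ls "$scratch"/*a
```

The second command lists only "banana", because it is the only name that ends in "a".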
Now we will copy one of the files from that folder to your current folder. We will use the "cp" program to copy files. The first argument is the file you want to copy, and the second argument is the name of the directory to store the copy. Here "./" means your current directory:
ls
cp /nfs1/Teaching/CGRB/dbbc_s16/data/labfile.fasta ./
ls
You can also rename this file or move it to another folder without making a copy by using the "mv" or move command:
mv labfile.fasta myfile.fasta
ls
If you give a filename instead of a directory to the cp command, it will make a copy of the file with that name:
cp myfile.fasta mysecondfile.fasta
ls
ls my*
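The difference between copying to a directory, copying to a new name, and moving can be seen side by side in a scratch folder. This is only a sketch; the folder "backup" and the file names are made up for the example.

```shell
# Set up a throwaway folder with one file and one subfolder.
scratch=$(mktemp -d)
mkdir "$scratch/backup"
touch "$scratch/a.fasta"

# Second argument is a directory: the copy keeps its name, inside backup/.
cp "$scratch/a.fasta" "$scratch/backup/"

# Second argument is a filename: the copy gets that new name.
cp "$scratch/a.fasta" "$scratch/b.fasta"

# mv renames without copying: b.fasta is gone, c.fasta takes its place.
mv "$scratch/b.fasta" "$scratch/c.fasta"

ls "$scratch" "$scratch/backup"
```

The final listing shows "a.fasta", "backup", and "c.fasta" in the scratch folder, and "a.fasta" inside "backup".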
The "rm" command can be used to delete files. Be VERY careful with this command: there is no "undo" or recycle bin on the Linux command line, and deleted files are gone forever. Now remove the copy of the file you created:
rm mysecondfile.fasta
ls
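One cautious habit, shown here as a sketch rather than a rule, is to run "ls" with the same pattern first so you can see exactly what "rm" would delete. The scratch folder and file names below are invented for the example.

```shell
# Set up a throwaway folder with one file to keep and two to delete.
scratch=$(mktemp -d)
touch "$scratch/keep.fasta" "$scratch/old1.tmp" "$scratch/old2.tmp"

# Preview: list exactly the files the pattern matches.
ls "$scratch"/*.tmp

# Only after checking the list, give the same pattern to rm.
rm "$scratch"/*.tmp

# Confirm what is left.
ls "$scratch"
```

Only "keep.fasta" remains after the last command.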
Now we will look at the contents of this file using the program nano. Nano lets you open and read/write files from the command line. The myfile.fasta file is in fasta format, which is very common for storing DNA and protein sequences.
nano myfile.fasta
Press ctrl+x to quit nano.
You can also search within a file from the command line using the "grep" program, which prints every line that contains the text you search for. To find every line containing the word "virulence" in the file myfile.fasta, use the following command:
grep "virulence" myfile.fasta
Most commands print output to the screen. You can save the screen output of any command to a file using the ">" character. To save the output of the previous grep to a new file, rerun the command and use the ">" character:
grep "virulence" myfile.fasta > virlist.txt
The "head" program prints the first few lines of a text file (10 by default) without having to open the entire file:
head virlist.txt
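The same three steps (search, save, preview) can be rehearsed on a tiny file. The sketch below builds a toy FASTA file with "printf"; its sequence names and sequences are invented for illustration and do not come from the lab data. "head -n 1" asks for only the first line.

```shell
# Build a toy FASTA file with three made-up entries.
scratch=$(mktemp -d)
printf '>p1 virulence protein\nMKT\n>p2 hypothetical protein\nMSA\n>p3 virulence regulator\nMLL\n' > "$scratch/toy.fasta"

# Save every line containing "virulence" to a new file instead of the screen.
grep "virulence" "$scratch/toy.fasta" > "$scratch/hits.txt"

# Look at just the first saved line.
head -n 1 "$scratch/hits.txt"
```

Note that ">" overwrites the output file if it already exists, so choose the output name with care.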
Now try to solve the following exercises on your own.
Exercises:
1. Use the grep command to find all lines in the file myfile.fasta with the word "copper".
2. How many times does the word "transporter" appear in the file? (Hint: use grep --help first and look in the "Output control" section for counting options.)
3. How many protein sequences are in the file "/nfs1/Teaching/CGRB/dbbc_s16/data/Agrobacterium_proteins.fasta"? Hint: The description for each sequence in a fasta format file starts with the ">" character.
This file contains all of the proteins from a single Agrobacterium genome.