UNIX tutorial

Useful Material

A UNIX cheat sheet like this one might be helpful as a reference.

Directory structure

A computer file system is laid out as a hierarchical multifurcating tree structure. This may sound confusing but it is easy to think of it as boxes of boxes where each box is a directory.
There is one big box called the root. All other boxes are contained in this one big box. Boxes have labels such as ‘Users’ or ‘Applications’. Each box may contain more boxes (like Desktop or Downloads or Work) or files (like ‘file1.txt’ or ‘draft.docx’).
Thus it is a hierarchical (boxes in boxes), multifurcating (each box can contain multiple boxes or files) tree structure (similar to how a tree has branches and leaves).

[Figure: directoryStruct, the example directory tree used in the questions below]

There are two ways to refer to directories and their positions in this hierarchy and relationship to other directories.
Absolute path
The absolute path is the list of all directories starting from the root that lead to the current directory
Directories are separated using a /
For example, the path to the directory ‘Microsoft’ is /Applications/Microsoft/

  • What is the absolute path to the directory ‘currentWork’?

Relative path
A directory can also be referred to by its relative location from some other directory (usually where you are working from).
The parent of a directory is referred to using ..
For example, if I am in ‘currentWork’ and want to get to ‘Microsoft’ the relative path is ../../../Applications/Microsoft/

  • What is the relative path from ‘Adobe’ to ‘Desktop’?

 Important obscure keys

There are some keys that are used a lot in UNIX commands but can be difficult to find on many keyboards. Try to find the following keys now:

  • ~ (tilde)
  • / (forward slash)
  • \ (back slash)
  • | (vertical bar or pipe)
  • # (hash or number or gate sign)
  • $ (dollar sign)
  • * (asterisk)

Basic syntax of shell commands

UNIX or shell commands have a basic structure of

command -options target

The command comes first (such as cd or ls as we will see later), then any options (also called flags, and preceded by a -) and then the target (such as the file to move or the directory to list).
These commands are written on the prompt (terminal command line).

Panic button

If you are running a process or program and it is stuck or doing something you don’t want, hold control and press c. This will kill the current process and return you to your prompt.

Starting the terminal on the Amazon instance

On your Amazon instance desktop there is a ‘MATE terminal’ icon which you can double click to launch. Alternatively, at the top of your screen there is an icon for the terminal (black box with a > inside) which you click to start the terminal. Alternatively again, you can select Applications->System Tools->MATE terminal.

If you do not wish to use the graphical X2Go interface you can also use ssh to connect to the instance, resulting in a terminal window connected to your Amazon instance through your terminal or putty. There are instructions on how to do this in the “Starting your Amazon virtual machine” document.

Navigation

In order to navigate around the directory structure you first need to know where you are in that structure currently. This is done using the command

pwd

This stands for ‘print working directory’
If we are in the Desktop folder in the above example image then pwd will print
/User/Desktop

  • Type pwd into your Amazon cloud instance terminal. What comes up?

You can list the contents of the directory using

ls

(This is short for list)

  • Use this in the home folder (the folder you see when you launch the terminal). What do you see?

You can list any folder in the directory structure by using ls followed by the path to the folder

  • Can you list the contents of the wme_jan2015 folder without first changing directory?

You can use flags to modify this view. For example

ls -l

will give you a list view with each item on its own line.

ls -a

gives you all the files, including hidden ones such as those that point to the parent (..)
You can often combine flags such as

ls -la

which gives you all the files, including hidden ones, in a list format.
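As a minimal sketch of what these flags change, the commands below build a throwaway directory (made with mktemp, so none of your real folders are touched) holding one ordinary file and one hidden file. The names visible.txt and .hidden.txt are made up for illustration, and touch simply creates empty files.

```shell
# Work in a scratch directory so no real files are affected
scratch=$(mktemp -d)
cd "$scratch"

# One ordinary file and one hidden file (hidden names start with a dot)
touch visible.txt .hidden.txt

plain=$(ls)      # hidden files are left out
all=$(ls -a)     # hidden files, plus . and .., are included

echo "ls shows:"
echo "$plain"
echo "ls -a shows:"
echo "$all"
```

The first listing should contain only visible.txt, while the second also shows .hidden.txt together with the . and .. entries.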

Getting around the directory hierarchy is done using the cd command (change directory).
The syntax is:

cd <absolute or relative path>

For example, to go to the Microsoft folder in the above hierarchy we would use
cd /Applications/Microsoft/
or if we were currently in the ‘currentWork’ folder we would use

cd ../../../Applications/Microsoft/

  • On the amazon cloud, starting in the home directory (/home/ubuntu/ which should have a ~ on the prompt) navigate to the Desktop folder.
  • From the Desktop folder navigate to the wme_jan2015 folder. How is this done using both the absolute and the relative path methods?
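If you would like to practise cd without touching the course folders, the sketch below builds a small practice tree in a scratch directory. The folder names projects, alpha and beta are invented for this example, and mkdir -p (which creates folders, including any missing parents) is only used here to set the tree up.

```shell
# Build a small practice tree inside a scratch directory
scratch=$(mktemp -d)
mkdir -p "$scratch/projects/alpha" "$scratch/projects/beta"

# Absolute path: spelled out in full from the top, works from anywhere
cd "$scratch/projects/alpha"
pwd

# Relative path: .. climbs up to projects, then we step down into beta
cd ../beta
pwd

here=$(basename "$(pwd)")   # last part of the current path
echo "now in: $here"
```

Both cd commands could reach either folder; the relative form is shorter when the destination is close to where you already are.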

If you get lost in the directory structure you can return to your home directory by typing

cd ~

Creating directories and files

Creating a folder is simple and follows the syntax

mkdir <foldername>

Thus, to create a folder called ‘folder1’ you type

mkdir folder1

  • Create a folder in the wme_jan2015/activities folder called unix now.
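mkdir on its own expects the parent folder to exist already. As a hedged aside, the -p flag creates any missing parent folders along the way. The sketch below tries this in a scratch directory; the folder name day1 is hypothetical.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Without -p this would fail, because activities does not exist yet
mkdir -p activities/unix/day1

# With -p, repeating the command is harmless: an existing folder is not an error
mkdir -p activities/unix/day1

# Show the whole tree that was created
ls -R activities
```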

Creating files can be done in many ways. The most common method is to use a command line text editor. Here we will use nano but there are many others (emacs, vi etc) that are more powerful.
Create a file called file1.txt in the unix folder by typing

nano file1.txt

This will launch the nano text editor and allow you to edit file1.txt. Write in here ‘this is the contents of file1.txt’.
Save the file by pressing control-o. This will prompt you at the bottom of the screen to confirm the file name and you can press return to confirm this.
Exit nano by pressing control-x (and hitting return). This will return you to your prompt.
If you list the contents of your current directory (using ls) you should see file1.txt is now there.

Copying and moving files

Files can be copied using the command

cp <filename> <new filename>

For example, to copy file1.txt to a new file called file2.txt you type

cp file1.txt file2.txt

This will leave file1.txt intact and copy its contents to the new file.
You can also copy files between folders by putting the path before the target. For example, to copy file1.txt to the home directory you can type

cp file1.txt /home/ubuntu/

The system knows that this is a folder and thus will copy the file there and use the same file name (file1.txt). You can designate a new filename by putting it at the end of the new directory path (e.g. /home/ubuntu/file3.txt).
This can also be used to copy files to the current directory (by putting the path to the file into the first filename section of the cp command)

NOTE: if a file already exists with that name in the directory you are copying to, it will be overwritten by the new one you are copying in. There is no confirmation warning so be careful when copying or moving files.
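One way to guard against accidental overwriting is to check whether the target exists before copying. The sketch below is only an illustration in a scratch directory: the file names notes.txt and backup.txt are invented, and echo with > is used simply to create files with some content. For interactive work, cp -i and mv -i ask for confirmation before overwriting.

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo "original notes" > notes.txt
echo "important results" > backup.txt

# Only copy if nothing with the target name exists yet
if [ -e backup.txt ]; then
    echo "backup.txt already exists, not copying"
else
    cp notes.txt backup.txt
fi

# backup.txt still holds its earlier contents
cat backup.txt
```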

Files can be moved to a new location using the command

mv <filename> <new location>

For example, file2.txt can be moved to the parent directory (in this case activities if you are in the unix directory) by typing

mv file2.txt ..

This will remove the file from the current directory and place it in the new one. This will overwrite any other file with the same name in that directory as happens with cp.
The mv command can also be used to rename files in the current directory. For example

mv file1.txt file4.txt

will keep the file in the current directory but rename it to file4.txt. This can be combined with a directory path to move files between directories and rename at the same time.
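As a small sketch of moving and renaming in a single step, the commands below recreate the activities/unix layout inside a scratch directory (so the real course folders are not touched) and move file1.txt up one level under a new name.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/activities/unix"
cd "$scratch/activities/unix"
touch file1.txt

# Move file1.txt up into activities and rename it to file4.txt in one step
mv file1.txt ../file4.txt

# The parent folder now holds file4.txt alongside the unix folder
ls ..
```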

Viewing file contents

You can use a text editor like nano to view the contents of text files but this can be tedious and difficult for very large files.
There are several commands for viewing files without an editor, depending on how you wish to view them.
To print the entire contents of a file to screen you can use the command

cat <filename>

This will display all the contents at once. This is ok for small files (a few lines) but larger files will run on and on.
For large files it is better to use interactive viewers. One such viewer is less, invoked by typing

less <filename>

The file is then displayed one screen length at a time. Certain commands are then used to go back and forward through the file:

  • space bar: display the next page
  • b: display the previous page
  • enter/return: display the next line
  • k: display the previous line
  • q: quit the viewer

If a file contains very long lines, these lines will wrap to fit the screen width. This can result in a confusing display, especially if there are, for example, long sequences in your file. To stop this we can use

less -S <filename>

which will stop the text wrapping. You can scroll horizontally across lines using the arrow keys.

To view a certain number of lines at the start or end of a file we use head and tail. For example, to view the first 50 lines of a file type

head -n 50 <filename>

Here we see the flag/option -n is used to denote the number of lines we want followed by the number itself. The same can be done using tail to view the last lines of a file.
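To see head and tail side by side, the sketch below makes a 100-line test file and prints its first three and last two lines. It assumes the seq command is available (it is on typical Linux and Mac systems); numbers.txt is a made-up file name.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# seq prints the numbers 1 to 100, one per line; here they become a test file
seq 1 100 > numbers.txt

first=$(head -n 3 numbers.txt)   # lines 1, 2 and 3
last=$(tail -n 2 numbers.txt)    # lines 99 and 100

echo "$first"
echo "$last"
```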

Starting programs

Most programs you will use here are already installed in the Amazon instance you are using. Thus you just type the program name (for example raxml), like you would cd or cat, followed by the required options.

Naming files

As you can see from the above tutorial, spaces are used to separate commands from options from targets. Thus, it is good practice not to put spaces into file or directory names as this will make it difficult to run commands. It is better to use capitals to separate words in filenames. For example:
‘raxmlOutputFile’ instead of ‘raxml output file’ or ‘workDirectory’ instead of ‘work directory’
You should also not use ‘weird’ symbols such as \/?* etc. as these are used by the system to do special commands.

Cyberduck

Copying files between the Amazon instance and the local computer is done through a method called scp (secure copy). This can be done in two main ways: on the command line and through a GUI. Due to differences in operating system implementations of scp we will use the GUI version through the program Cyberduck. You may also like to refer to slides from the Workshop on Genomics, here, about using FileZilla.
Download Cyberduck from https://cyberduck.io/ and install as per usual for your operating system. Note on Mac you may need to change your security permissions to allow you to open a program downloaded from the internet.
Once you load Cyberduck you will see an open connection button in the top left corner.
Click this and select SFTP from the dropdown menu.
Fill in your Amazon public DNS address where it says server, ubuntu for the username and the password provided for the password (in the same way as setting up the X2Go).
Click ‘Connect’ and the screen should now display your home directory on the Amazon instance.
From here you can drag and drop files both to and from the instance and many other options in the right click menu. You can also save this connection information by creating a new bookmark in the bookmark menu at the top.

Unix (or alternatives)

If using a unix based laptop (i.e. Linux or Mac) you can try the command line version of scp within the terminal. To copy a file from your Amazon instance to your computer you can use the following syntax:

scp ubuntu@<public dns address>:./file.txt .

This will copy a file called file.txt that is in the Amazon home directory to the current directory on your computer. To copy a file from a different directory, give its path relative to the home folder. For instance if file.txt is in the wme_jan2015 folder you would use

scp ubuntu@<public dns address>:./wme_jan2015/file.txt .

To send files to the instance you reverse the order of the targets. Thus sending a file local.txt that is on your laptop to the home directory of the instance you would use the command

scp local.txt ubuntu@<public dns address>:./

Advanced shell commands and options

NOTE: this section is not required for the current course, but these are useful tools for general working within the unix environment.

Redirect output to file

If a command prints information to the screen (standard out) such as cat, echo, ls, grep etc. this output can instead be redirected to a file.
The > symbol is used to overwrite the contents of the file with whatever output you specify to redirect to it.
The >> is used to append instead of overwrite.
Thus, to place the sentence “this is redirected output” into a file ‘redir.txt’ we type

echo "this is redirected output" >redir.txt

(echo is a command that prints whatever is after it to the screen)
If we then want to add the contents of a file called ‘file1.txt’ to that file we type

cat file1.txt >>redir.txt

This is very useful for concatenating multiple files (e.g. sequence files) into one large file by using:

cat sequenceFile1.txt >>largeFile.txt
cat sequenceFile2.txt >>largeFile.txt

etc.
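Because cat accepts several filenames at once, the same result can usually be had in one command. The sketch below uses two tiny stand-in files in a scratch directory; their contents (>seq1 and >seq2) are placeholders rather than real sequence data.

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo ">seq1" > sequenceFile1.txt
echo ">seq2" > sequenceFile2.txt

# cat prints the files one after another,
# so a single > is enough to build the combined file
cat sequenceFile1.txt sequenceFile2.txt > largeFile.txt

cat largeFile.txt
```

Note that > starts the output file afresh each time, whereas the >> form above keeps adding to whatever is already there.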

Tab completion

The tab button can be used to complete file/directory names and do quick lookup of commands.
If, for example, a file or directory has a long name, you can save time by using tab completion.
Let’s say we wanted to copy a file named ‘reallyLongFilename.txt’ to the parent directory. We would use the command

cp reallyLongFilename.txt ..

Instead of having to type out the whole filename, you can type the first few letters and hit the tab button. This will fill in the rest of the name for you, providing that the file is in the current directory and there is no other file/directory that starts with that name.

  • To test this, create a file called reallyLongFilename.txt and use tab completion to fill out the name within a cat command.

If there are two or more files that start the same way (for instance if you have a file reallyLongFilename.txt and a file realDataTest.txt) then tab completion after typing ‘rea’ will not fill in the whole name as there is ambiguity to which file you mean. In this instance pressing tab will fill in as much as it can (in this case ‘real’) and stop. Pressing the tab button twice will now display all the options of files that start with those letters, allowing you to see what extra letters you must type to complete the file.
In this case you can type an extra l (giving you ‘reall’) and then hit tab and it will complete it for you.

  • Create a file named ‘realDataTest.txt’ in the same folder as the ‘reallyLongFilename.txt’ file and try this double-tab completion method.

Tab completion can be used on directories to show their contents as well. Say, for instance, you wish to copy a file to the wme_jan2015 folder but don’t know what else is in the folder. You start the command by typing

cp file1.txt /home/ubuntu/wme_jan2015/

and then hit tab twice. This will then list the folder contents as per the ls command, and allow you to see what options are available to you for subdirectories etc.
The same will work for commands such as cd, less, etc. and programs that are installed such as raxml and paup.

Repeating commands using loops

The real power of the shell is the ability to repeat commands on multiple targets. This is useful for example for creating multiple folders, moving files into each folder, running pipeline on multiple samples etc.
This is accomplished by using a tool called the for loop. In order to use these properly, two features of the shell need to be understood: variables and wildcards.

Variable

A variable is a placeholder for some text such as a directory name, filename, number, sentence etc. These allow for the contents of the variable to be changed within a loop without manually having to do so yourself.
A variable is always initialised using the name you designate for the variable (e.g. file, direc, superman, x, etc.). It can be whatever you want as long as it is a single word without spaces or special characters.
The variable is then called using a $ in front of the name. Thus if the variable is named direc it is referenced using $direc.

Wildcards

The asterisk (*) is referred to as a wildcard symbol in unix. This allows for matching of filenames, directories etc. that all have a certain section of their name in common.
For example, if all your files start with ‘result’ (e.g. result.txt, result.tree, result.nexus, resultFile, result) these will all be recognised using result*.
Alternatively if they all end with .txt you can loop over them all using the *.txt
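A quick way to check what a wildcard will match before using it in a loop is to hand the pattern to echo. The sketch below does this in a scratch directory with a few empty, made-up files.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch result.txt result.tree resultFile notes.txt

# The shell expands each pattern into the matching names before the command runs
starts=$(echo result*)
ends=$(echo *.txt)

echo "result* matches: $starts"
echo "*.txt matches:   $ends"
```

Here result* should pick up the three names beginning with result, and *.txt the two names ending in .txt.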

Both variables and wildcards are used in for loops to maximise their power. A for loop has the syntax:

for <variable> in <list>
do
<tasks to repeat for each item in list>
done

Each section is written on a separate line (e.g. after ‘do’ hit enter) and instead of a prompt the terminal will display a > to designate you are in a multi-line command.
Alternatively you can place a loop all on one line using ; to separate the commands (except for the line break after the ‘do’ where there is no ; included).

For example, we can use the command ‘echo’ to print something to the screen. This is used like
echo 'hello'
which will print hello to the terminal.
We can use a for loop to print the numbers 1 to 10 to screen by typing

for num in {1..10}
do
echo $num
done

This loop starts at 1, places the number in the variable num which can then be accessed inside the loop through $num.
(this can be done on one line by writing ‘for num in {1..10};do echo $num;done;’ Note the lack of semi-colon after ‘do’)

This becomes more useful when we want to create, move, modify etc. files and directories.
Let’s use a for loop to create 3 directories which will be named run1, run2 and run3

for num in {1..3}
do
mkdir "run"$num
done

We place ‘run’ before the variable to tell the system we want this string to be placed before the variable as part of the directory name. If we wanted it placed after (e.g. create 1run etc.) we could use $num"run".
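An alternative way of joining text to a variable, shown here only as a sketch, is to put the whole name inside one pair of double quotes and, where text follows the variable, wrap the variable name in braces. The list 1 2 3 is written out by hand here, and everything happens in a scratch directory.

```shell
scratch=$(mktemp -d)
cd "$scratch"

for num in 1 2 3
do
    # Text before the variable: the shell can tell where $num starts
    mkdir "run$num"
    # Text after the variable: $numrun would be read as a variable called numrun,
    # so braces mark where the variable name ends
    mkdir "${num}run"
done

ls
```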

Loops can also be used to affect a set of folders or files that have some portion of their name in common.
We can then use a loop to go into each run folder (created above) and create a blank file called ‘result.txt’ within the folder. This is done using cd commands to go in and out of directories in a list and the touch command to create blank files.
This is done by the following loop:

for direc in run*
do
cd $direc
touch result.txt
cd ..
done

This loop goes into each directory that starts with ‘run’, creates a file called ‘result.txt’, goes back out of the directory and then to the next in the list etc. Thus, the loop is stepping in (using cd $direc) and out (using cd ..) of each folder and issuing commands within the folders without you having to do so manually.
NOTE: this loop will operate on every folder or file that starts with ‘run’. Thus if you happen to have a file that starts with ‘run’ in the same directory, the loop will attempt to step into this file and print an error saying it can’t, but it will then continue, creating ‘result.txt’ in the current directory and running cd .. which moves it into the directory above. Therefore, when stepping in and out of folders with these loops, you must be careful that no files match the wildcard and end up in your list. The best way to test this is to first write the loop with pwd commands in place of the real commands and check that the path is right at each step.
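One possible way to make such a loop more forgiving is sketched below. It relies on two standard shell features: a pattern ending in / only matches directories, and commands wrapped in parentheses run in a subshell, so the cd never changes the directory of the loop itself. The file runNotes.txt is a made-up example of a stray file that would trip up the plain run* pattern.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir run1 run2
touch runNotes.txt   # a file that also matches run*

# run*/ (with the trailing slash) matches directories only
for direc in run*/
do
    # The cd only applies inside the parentheses,
    # and touch is skipped if the cd fails
    ( cd "$direc" && touch result.txt )
done

ls run1 run2
pwd   # still in the starting directory
```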

Loops are then most useful when running a pipeline on multiple samples. For example if you wish to run mafft and raxml on files contained in folders that start with ‘sample’ you could use a command such as

for x in sample*
do
cd $x
mafft <put mafft command options here>
raxml <put raxml command options here>
cd ..
done

This will then run the programs on each sample in sequence, saving you from having to manually start these programs on every sample yourself.
Another use is to take all the .txt files in a directory and concatenate their contents into one large file using a for loop. Try this as an exercise.

Grep

Grep is a tool for searching files for specific content. It has many powerful applications, the basics of which will be explained here.
The basic syntax of grep is

grep <search pattern> <filename>

For example if we want to find every line that contains the word ‘result’ in the file output.txt we should type

grep "result" output.txt

This will print the lines to the screen (or you can redirect these to a file using the > or >> methods).

Flags can be used to modify these results in many useful ways. For example:
grep -n <search pattern> <filename> will print the line number of the result beside each matching line
grep -v <search pattern> <filename> will find the lines which don’t contain the search pattern
grep -c <search pattern> <filename> does not return the lines that match but instead returns a count of the number of lines that contained a hit
grep -i <search pattern> <filename> uses a case insensitive match (meaning B and b are the same thing)
grep -A 5 <search pattern> <filename> will print the 5 lines that come after a line that matches the pattern
grep -B 5 <search pattern> <filename> will print the 5 lines that come before a line that matches the pattern
These can then be combined so that, for example, grep -vc <search pattern> <filename> will return a count of the lines that don’t contain the provided search pattern
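The sketch below tries a few of these flags on a four-line test file built in a scratch directory. The file contents are invented purely so that the counts are easy to check by eye.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'result one\nnothing here\nRESULT two\nfinal result\n' > output.txt

matches=$(grep -c "result" output.txt)    # lines containing result (case sensitive)
anycase=$(grep -ci "result" output.txt)   # the same, ignoring case
without=$(grep -vc "result" output.txt)   # lines NOT containing result

echo "$matches $anycase $without"

# Matching lines with their line numbers in front
grep -n "result" output.txt
```

Only lines 1 and 4 contain lower-case result, so the three counts come out as 2, 3 and 2.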

Grep can also be used to search multiple files using wildcards as seen in the for loop
For example, to search for the pattern “result” (without worrying about case) in all files that end in .txt we could use

grep -i "result" *.txt

You can display the filenames that match instead of the matching lines by using -l.
Thus, to create a file that lists all the text files which contain the word “result” (case insensitive) you could use

grep -il "result" *.txt >>found.txt

Grep can also be used with regular expression patterns, which allow for wildcards and other special characters to be used to match a variety of pattern combinations.
For example, to find lines which have a b followed by any single character followed by a g (e.g. matching big, bog, etc.) we would write

grep "b.g" file.txt

Here the . is used as a special character designating ‘any character’. If we want to look for the . specifically (e.g. b.g only) we have to ‘escape’ the . to tell grep we want to find the period, not the special function the period has. This is done with a \.
Thus to look for b.g only (and not big etc.) we type

grep "b\.g" file.txt

If we want to find b followed by any number of characters and then a g we put the * symbol (meaning ‘zero or more of the preceding character’) after the . like so:

grep "b.*g" file.txt

which will find big, brig, berg, bloomberg etc.
If we wanted to have only a selection of characters matched we can use the [] (square brackets).
For instance, if we wanted to search for only big or bog we could use

grep "b[io]g" file.txt

or if we wanted specific ranges, like only 1,2,3 after the word ‘result’ we could use

grep "result[1-3]" file.txt

We can also specify if we want matches only at the start of the line using ^ or the end of the line using $.
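For example, suppose we wanted a line that started with ‘Salmonella’ and ended with any number between 400 and 600, and we didn’t care about the characters in between. Square brackets only ever match a single character, so [400-600] would match just one digit from 0 to 6 rather than a number in that range. Instead each digit is described in turn, using extended regular expressions (the -E flag) so that | can be used to mean ‘or’:

grep -E '^Salmonella.*[^0-9]([45][0-9][0-9]|600)$' file.txt

Here [45][0-9][0-9] matches 400 to 599, 600 is added as an alternative, and [^0-9] (any character that is not a digit) stops the match from starting in the middle of a longer number such as 1450.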

Thus you can see how grep and regular expressions are useful for searching large files for certain information like specific blast results in a tabular file, likelihood scores in phylogenetic analysis outputs etc.
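Since a character class such as [1-3] matches exactly one character, a numeric range has to be built up digit by digit. The sketch below searches a made-up tab-separated file, hits.txt, for lines that start with Salmonella and end in a number from 400 to 600. It uses grep -E (extended regular expressions) so that | can mean ‘or’; the species names and scores are invented test data.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'Salmonella enterica\t450\nSalmonella bongori\t399\nSalmonella enterica\t600\nEscherichia coli\t500\nSalmonella enterica\t1450\n' > hits.txt

# [45][0-9][0-9] covers 400 to 599, and 600 is added as an alternative with |
# [^0-9] requires a non-digit before the number, so 1450 is not accepted
found=$(grep -E '^Salmonella.*[^0-9]([45][0-9][0-9]|600)$' hits.txt)

echo "$found"
```

Only the Salmonella lines ending in 450 and 600 should be printed; 399 and 1450 fall outside the range and the 500 line does not start with Salmonella.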

Sed

Another useful tool for file manipulation is sed. This tool has many powerful applications including the replacement of one block of text with another.
The syntax for this is

sed 's/<pattern to find>/<text to replace it with>/g' <filename>

This will output the changed file contents to the screen, which can then be redirected to a new file using the > or >> as above. For example, if we wanted to replace every instance of the word ‘species’ with the abbreviation ‘sp’ in the file tax.txt and place it in a new file called newTax.txt we could use

sed 's/species/sp/g' tax.txt >newTax.txt

Remember: if a file is overwritten by a bad sed command (for example by redirecting the output back onto the file being read, which empties it) there is no ‘undo’: the file is now permanently changed. Thus, use sed with caution and practise.
Sed has many other powerful applications such as deletion of text and lines and regular expression pattern matching. These are very useful to learn but are to be used with caution.
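A cautious habit, sketched below, is to always send the output of sed to a new file and then inspect it before doing anything with the original. The two lines of text placed in tax.txt are invented test data in a scratch directory.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'Homo species\nspecies unknown, species pending\n' > tax.txt

# Write the edited text to a NEW file; tax.txt itself is only read
sed 's/species/sp/g' tax.txt > newTax.txt

cat newTax.txt
```

The g at the end of the sed expression is what replaces both occurrences on the second line; without it only the first occurrence on each line would change.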

Pipe

The output from one unix command can be sent as input to another using a pipe. The symbol for this pipe is the vertical bar |.
For example, let’s say you have a directory that has lots of files and folders, making the output of the ls command very long. If you want to search for a specific file prefix (let’s say ‘result’) within the ls output we can pipe the output of ls to grep like so:

ls | grep 'result'

Note we do not specify anything as the input to grep since the pipe takes the output of ls and automatically puts it as the input to grep.
Piping commands together as input to grep, sed, less, etc can become very useful for sorting, modifying and searching files and folders, especially within loops.
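Pipes can be chained, with each command working on the output of the one before. As a small sketch in a scratch directory with three made-up files, the line below lists the directory, keeps the names containing result, and counts them with wc -l (which counts lines).

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch result1.txt result2.txt notes.txt

# ls lists the names, grep keeps those containing result, wc -l counts what is left
count=$(ls | grep 'result' | wc -l)

echo "files matching result: $count"
```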

.bash_profile, alias and PATH

The behaviour of the terminal shell can be modified by adding commands to one of two hidden files: .bashrc and .bash_profile, which are found in the home directory (/home/ubuntu on the Amazon instance). Bash reads .bash_profile when a login shell starts (for example when you log in over ssh) and .bashrc when any other interactive shell starts (for example a new terminal window on a Linux desktop). The Mac OSX terminal starts every window as a login shell, so there .bash_profile is read each time a window is opened.
Here we will learn how to modify and use the .bash_profile file as the .bashrc is already populated with many commands we do not wish to interfere with.
The two main things we will use the .bash_profile file for are creating aliases and modifying the PATH.

Alias

An alias allows you to create shortcut commands that point to longer commands, thus saving on time.
For example, if we used ls -al often and wanted a shortcut for this we could create an alias ‘lal’ that would run this command for us.
Create a .bash_profile file in the /home/ubuntu directory by typing

nano .bash_profile

Within this file we will create a new alias for ls -al by typing

alias lal='ls -al'

Save and close the file as per the nano instructions above.
Now, each time you log in to the Amazon instance, you can type lal and it will run the command ls -al. However, because .bash_profile is only read at login, the change will not take effect in the terminal you already have open; we would have to log out and back in (or restart the instance) to enact it.
To save on having to do this each time we modify and test the .bash_profile, we can manually load the .bash_profile file by typing

source .bash_profile

Now if you type ‘lal’ the command should run as specified.
Aliases can be used to create command shortcuts for any task. This is useful particularly for repeated tasks such as ssh and scp to locations with long addresses. For instance, let’s say in order to ssh to my computer in work I had to type

ssh conor@work.address.com

This would become tedious and hard to remember if addresses are long and complicated. Instead I can create an alias ‘work’ to enact this command for me. E.g.

alias work='ssh conor@work.address.com'

Thus I just type ‘work’ on the terminal and the ssh command is run for me.

PATH

Most commands that are run within the terminal, such as ls, less and cat, are executable files that have been created and stored in specific folders in the system (a few, such as cd, are built into the shell itself). The system then knows where to look for these programs by searching in directories specified in the environment variable PATH.
You can view your current PATH by typing

echo $PATH

which will likely output something like
/usr/local/bin:/usr/bin:/bin
This means that the system can run executable programs that are stored in these three folders (separated by the : ) without the user having to specify the absolute path to the folder.
A user may wish to modify this path to add other folders where such programs can be stored. This is useful for downloaded programs which you want to be able to run.
For instance, if you download the BLAST+ package from NCBI and store it in your home directory, each time you wish to run blastn in the terminal you would have to type

/home/ubuntu/blast+/bin/blastn <options etc>

This is because the system does not have that folder in the PATH and thus you must specify the path to the program manually each time.
Alternatively, it is good practice to create a bin folder in your home folder (i.e. /home/ubuntu/bin/) and store all downloaded programs (like BLAST) in this folder. You then add this folder to the path in the .bash_profile.
Thus, if a folder /home/ubuntu/bin/ exists and blastn is inside this we can add the folder to the path by editing the .bash_profile file and writing

export PATH=$PATH:/home/ubuntu/bin

This command says to set the PATH to whatever is currently in the PATH ($PATH:) followed by the new addition (/home/ubuntu/bin)
Save the .bash_profile file and exit. You can reload the .bash_profile file manually again by typing

source .bash_profile

Now if you use ‘echo $PATH’ you will see the /home/ubuntu/bin added to the end of the PATH. Therefore, any executable program placed in this folder (e.g. blastn, raxml, mafft etc.) can be called directly from the command line, similar to ls etc.
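The effect of adding a folder to the PATH can be tried out safely without editing .bash_profile. In the sketch below a tiny script called hellotool (a hypothetical stand-in for a downloaded program) is placed in a scratch bin folder, made executable with chmod +x, and then run by name once the folder is on the PATH. The change made by export here only lasts for the current terminal session.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/bin"

# A stand-in for a downloaded program: a tiny script called hellotool
printf '#!/bin/sh\necho "hello from hellotool"\n' > "$scratch/bin/hellotool"
chmod +x "$scratch/bin/hellotool"    # mark it as executable

# Append the folder to PATH for this shell session only
export PATH="$PATH:$scratch/bin"

location=$(command -v hellotool)     # where the shell found the program
message=$(hellotool)                 # run it by name alone

echo "$location"
echo "$message"
```

command -v reports the full path the shell resolved, which is a handy check when a program does not seem to be found.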

Other useful commands

Below is a brief overview of some other useful commands. I suggest looking at these in more detail yourself.
wc counts words or lines in a file or output.
e.g.
wc -m file.txt will output the number of characters in the file
wc -l file.txt will output the number of lines in the file
ls|wc -l will take the output from ls and then pipe to wc, resulting in a count of items in the directory

cut undertakes basic text processing by cutting a text file in specific ways.
e.g.
cut -c2 file.txt will return the second character of each line in the file
cut -c3-5 file.txt will return the 3rd, 4th and 5th character of each line in the file
cut -c2- file.txt will return from the 2nd character to the end of the line for each line in the file
cut -d':' -f1 file.txt will take a file, cut each line by the : character and return the first field resulting from this split on each line
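To see cut at work, the sketch below builds a two-line file whose fields are separated by colons and pulls out one field and one range of characters. The file name table.txt and its contents are invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'sample1:Salmonella:450\nsample2:Escherichia:500\n' > table.txt

names=$(cut -d':' -f2 table.txt)   # second field of every line
prefix=$(cut -c1-6 table.txt)      # characters 1 to 6 of every line

echo "$names"
echo "$prefix"
```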

rm is the remove command. It will completely remove the file from the system and does not ask for confirmation before doing so.
NOTE: rm removes the file and does not send it to the trash. Thus, once you use rm on a file it is gone and irretrievable.
NOTE2: be VERY careful using rm inside loops. If you make a mistake it may end up removing multiple files you did not want removed without asking for confirmation. This is especially dangerous in loops that use cd to step in and out of directories.
Always test loops you wish to use for rm with echo commands instead first.
The rm format is
rm <filename>
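A dry run of this kind might look like the sketch below, which works in a scratch directory with made-up file names. Putting echo in front of rm means each command is printed rather than carried out, so the list can be checked first.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch old1.tmp old2.tmp keep.txt

# Dry run: echo prints each rm command instead of running it
for f in *.tmp
do
    echo rm "$f"
done

# Nothing has been deleted yet; once the printed commands look right,
# the echo can be removed (rm -i would additionally ask before each deletion)
ls
```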

Remember: rm is like alcohol: use responsibly.

UNIX shortcuts

The UNIX terminal has many shortcuts using the keyboard to make it easier to edit commands. For example:
ctrl-a will bring you to the start of a line on the prompt
ctrl-e will bring you to the end of the line
Also, if you are unsure what flags etc. can be passed to a command there are manual pages built in to UNIX.
For example, to find all the options that can be passed to ls you can type

man ls

The manual pages can be scrolled through and quit in the same way as in less.