Basic UNIX, some tips and tricks, and more!

Alexandra Weber & Julia M.I. Barth, 22 January 2018



Background and Objectives

Welcome everyone!
We have prepared this UNIX activity to get all of you to the same level, as you will use the terminal a lot during this workshop.
Depending on your level, this activity will take more or less time. In case you are already very comfortable with UNIX and working in the command line, you may directly proceed to an additional activity on VCF (variant call format) files and the first analyses of genomic data: The VCF format, filtering, and first analyses.

Learning goals:

    • navigate in the UNIX environment
    • create, move and delete directories
    • create, move, delete and edit files
    • use basic UNIX commands and know where to find help

Why would we use the terminal / shell in the first place?

Scripting: We can write down a sequence of commands to perform particular tasks or analyses;
when working with genomic data, a task usually takes minutes, sometimes hours or even days – it’s no fun to sit and wait in front of your computer that long just for a mouse click to initiate the next task.

Powerful Tools: In UNIX, powerful tools are available that enable you to work through large amounts of files, data, and tasks quite quickly and in an automated (that is, programmatic) way.

Easy remote access: In most cases, it is not possible anymore to deal with genomic data on a desktop computer. You will usually run analyses on clusters at high performance computing facilities at your university, or – like in this course – on the Amazon cluster.

A GUI (Graphical User Interface) is not available for many programs: Genomics is a fast evolving field and developing a graphical interface takes time and effort.

Compatibility: The terminal can (remotely) be accessed with computers running on different operating systems

Basic syntax of shell commands

UNIX or shell commands have a basic structure of:
command -options target
The command comes first (such as cd or ls, as we will see later), then any options (also called flags, usually preceded by a -), and then the target (such as the file to move or the directory to list). These commands are typed at the prompt (the terminal command line).
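
As a small illustration of this structure, the sketch below runs ls first without and then with options. The scratch directory and the file name notes.txt are made up for the example and are not part of the workshop materials.

```shell
# Create a scratch directory with one file so there is something to list
# (demo_dir and notes.txt are hypothetical names used only in this sketch)
demo_dir=$(mktemp -d)
touch "$demo_dir/notes.txt"

# command + target: list the content of the directory
ls "$demo_dir"

# command + options + target: -l (long format) and -h (human-readable sizes)
ls -l -h "$demo_dir"

# single-letter options can usually be combined behind one dash
listing=$(ls -lh "$demo_dir")
echo "$listing"
```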

How to do this activity

Questions or tasks are indicated with Q .
All text in red underlined with gray color indicates commands that you can type or copy to the terminal.
If you get stuck, check the answer-box:

Show me the answer!

Oh no, only if you get stuck!! First try to find the answer yourself!






1) Find your keys!

There are some keys that are used a lot in UNIX commands but can be difficult to find on some keyboards.

Q Open a text editor and type the following keys:

~ tilde

/ forward slash

\ back slash or escape

| vertical bar or pipe

# hash or number or gate sign

$ dollar sign

* asterisk

' single quote

" double quote

` backtick

ctrl c The panic button: If you are running a process or program and it is stuck or doing something you don’t
want it to do: then hold control and press c. This will kill the current process and return you to your prompt.



2) Getting help

A UNIX cheat sheet like this one here might be helpful as a reference.
Also, never forget that Google is your best friend!
Most UNIX commands and many other programs have help pages accessed through: command_name --help, or command_name -h, which also describe different ways to run a program.
Most programs also have a more exhaustive manual page accessed by typing man command_name.

Q Access the bcftools help page.

Show me the answer!

if you haven’t done it yet, open a terminal window and login to your wpsg account using ssh
bcftools --help

Q What do the cp and vi commands do?

Show me the answer!

man cp: copy files
man vi: Vi IMproved, a programmers text editor
To exit the man page press q


3) Navigation I

A computer file system is laid out as a hierarchical multifurcating tree structure. This may sound confusing but it is easy to think of it as boxes of boxes where each box is a directory.
There is one big box called the root. All other boxes are contained in this one big box. Boxes have labels such as ‘Users’ or ‘Applications’. Each box may contain more boxes (like Desktop or Downloads or Work) or files (like ‘file1.txt’ or ‘draft.docx’)
Thus it is hierarchical (boxes in boxes), multifurcating (each box can contain multiple boxes or files) tree structure (similar to how a tree has branches and leaves).

[Figure: example directory structure]

There are two ways to refer to directories and their positions in this hierarchy and relationship to other directories: absolute and relative paths:

A) Absolute path
The absolute path is the list of all directories starting from the root that lead to the current directory. Directories are separated using a /.

For example, the path to the directory ‘wpsg’ is /home/wpsg/

Q What is the absolute path to the directory ‘software’?

Show me the answer!

/home/wpsg/software

B) Relative path
A directory can also be referred to by its relative location from some other directory (usually where you are working from). The parent of a directory is referred to using .. (two dots).
The current directory is referred to using . (a single dot).
For example, if I am in ‘software’ and want to get to ‘workshop_materials’ the relative path is ../workshop_materials/

Q What is the relative path from ‘software’ to ‘wpsg’?

Show me the answer!

..


4) Navigation II

Q Go from your home directory to ‘beast’ (in the ‘software’ directory)
The Home directory is where you are upon login (/home/wpsg/).

Show me the answer!

cd stands for ‘change directory’:
cd software/beast/

Q Go back to your home directory

Show me the answer!

Using the absolute path:
cd /home/wpsg/
Using the relative path:
cd ../../
Using a really useful shortcut:
cd ~

Q Go to ‘software’

Show me the answer!

cd software

Q Check where you are (print the absolute path)

Show me the answer!

pwd
pwd stands for ‘print working directory’

Q List all the items in current directory

Show me the answer!

ls
ls stands for ‘list directory content’

Q Check the file sizes of items in the current directory

Show me the answer!

ls -l

Q Check them in human readable format

Show me the answer!

ls -lh
For more information on ls check the man pages: man ls
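
The navigation commands from this section and the path notation from the previous one can be tried out safely in a scratch location. The following is only a sketch: it rebuilds a tiny copy of the example tree under a temporary directory, so the directory names mimic the workshop layout but the location does not.

```shell
# Build a small directory tree in a scratch location
# (the names mimic the example above; the location is temporary)
root=$(mktemp -d)
mkdir -p "$root/wpsg/software" "$root/wpsg/workshop_materials"

# Move into 'software' using an absolute path
cd "$root/wpsg/software"
pwd

# Go to the sibling directory using a relative path: up one level, then down
cd ../workshop_materials/
here=$(basename "$PWD")
echo "$here"

# .. on its own refers to the parent directory
cd ..
parent=$(basename "$PWD")
echo "$parent"

# List what the parent contains
ls -lh
```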



5) Managing your directories and files

A new folder can be created using mkdir “make directory”.

Q From your home directory, create a new directory called unix_tutorial

Show me the answer!

mkdir unix_tutorial

Q Create an empty text file called file1.txt using touch

Show me the answer!

touch file1.txt

Editing files in the terminal is a bit tedious but you’ll learn quickly!
Nano and vi/vim are useful text editors:

Q Edit file1.txt using vi or vim and write ‘Hello Workshop team’

Show me the answer!

vi file1.txt or vim file1.txt
i [for insert mode] Type: ‘Hello Workshop team’
ESC to escape the ‘insert mode’
:x to exit vi/vim while saving modifications to the file

Q Edit file1.txt using nano and write ‘Hello fellow participants’ in the second line

Show me the answer!

nano file1.txt
ENTER to access the second line
Write: ‘Hello fellow participants’
ctrl o to save –> yes, ^ corresponds to ctrl in case you were wondering 🙂
ENTER to validate saving
ctrl x to exit

Q Copy file1.txt into the unix_tutorial directory and name the copy file2.txt

Show me the answer!

cp file1.txt unix_tutorial/file2.txt
cp stands for ‘copy’

Q Rename file1.txt to myfile1.txt

Show me the answer!

mv file1.txt myfile1.txt
the command mv stands for ‘move’
it is the same command to move or to rename a file (‘move’ a file in the current directory with a different output name)

Q Move myfile1.txt to unix_tutorial

Show me the answer!

mv myfile1.txt unix_tutorial/

Q Go to unix_tutorial and delete the file file2.txt

Show me the answer!

cd unix_tutorial
rm file2.txt
rm stands for ‘remove’

Q Go one directory up and delete unix_tutorial

Show me the answer!

cd ..
rm -ri unix_tutorial
To delete an empty directory, you can also use rmdir.

There is no ‘undo’ or ‘trash folder’ in the terminal, so be very careful when deleting files or directories!
It is good practice to use the -i flag as a safety step with rm (e.g., rm -i file1.txt), which asks for confirmation before each deletion.
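
As a recap, the whole create / copy / rename / move / delete cycle of this section might look like the sketch below. It runs inside a temporary scratch directory (an assumption made here so that nothing in your home directory is touched) and leaves out the interactive -i flag so that it runs without prompting.

```shell
# Work in a scratch directory so nothing important can be deleted by accident
work=$(mktemp -d)
cd "$work"

mkdir unix_tutorial                     # make a directory
touch file1.txt                         # create an empty file
cp file1.txt unix_tutorial/file2.txt    # copy the file under a new name
mv file1.txt myfile1.txt                # rename the file
mv myfile1.txt unix_tutorial/           # move it into the directory

contents=$(ls unix_tutorial)            # remember what the directory holds
echo "$contents"

rm unix_tutorial/file2.txt              # delete a single file
rm -r unix_tutorial                     # delete the directory and its content
```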



6) View and manipulate files

For the next exercise, you will need the text file Test_file_genomics_data.txt in the directory ~/workshop_materials/01_unix_intro/basic_unix/
Go to the ‘basic_unix’ directory

There are several ways to view the content of a file:

cat will print the whole file on the prompt. It can be useful for small files but is not suited to large files.
You can always use ctrl c to kill the task.

less prints the content of a file one screen length at a time

Within ‘less’:

ENTER displays the next line
k displays the previous line
SPACE displays the next page
b displays the previous page
shift g jumps to the end of the file
q to exit

If a file contains very long lines, these lines will wrap to fit the screen width.
This can result in a confusing display, especially if there are, for example, long sequences in your file.
To stop this we can use:

less -S

which will stop the text wrapping. You can scroll horizontally across lines using the arrow keys.

Q explore the example file using less and cat

Show me the answer!

cat Test_file_genomics_data.txt
less Test_file_genomics_data.txt
This file contains population pairwise genome-wide statistics (FST, DXY, nucleotide diversity per population) calculated on 10 Kb windows.
‘scaffold’ specifies the linkage group or chromosome
‘Start’ & ‘End’ specify the start and end position of the window
‘FST’ & ‘DXY’ represent relative and absolute genomic differentiation measures
‘Set1_pi’ & ‘Set2_pi’ correspond to the nucleotide diversity for population 1 and population 2, respectively

head will print the first 10 lines of a file on the prompt
tail will print the last 10 lines of a file on the prompt

Q print the first 25 lines of the example file

Show me the answer!

head -n 25 Test_file_genomics_data.txt

Q print the last 50 lines of the example file

Show me the answer!

tail -n 50 Test_file_genomics_data.txt
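
To see the effect of -n without the workshop file at hand, the following sketch uses a made-up 100-line file (one number per line) created with seq.

```shell
# Make a 100-line example file in a scratch location
tmp_file=$(mktemp)
seq 1 100 > "$tmp_file"

# Without -n, head prints the first 10 lines
head "$tmp_file"

# With -n you choose how many lines are printed
first3=$(head -n 3 "$tmp_file")
last2=$(tail -n 2 "$tmp_file")
echo "$first3"
echo "$last2"
```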

Grep is a tool for searching files for a specific content. It has many powerful applications, the basics of which will be explained here.

The basic syntax of grep is

grep 'search pattern' 'filename'

Q Print all the lines that contain LG13 from the file Test_file_genomics_data.txt

Show me the answer!

grep 'LG13' Test_file_genomics_data.txt

Q How many lines contain LG20 in the file Test_file_genomics_data.txt?

Show me the answer!

grep -c 'LG20' Test_file_genomics_data.txt

Q Print three lines that come after the pattern ‘scaffold’ in the file Test_file_genomics_data.txt

Show me the answer!

grep -A 3 'scaffold' Test_file_genomics_data.txt

Q Print all the lines that do not contain LG in the file Test_file_genomics_data.txt

Show me the answer!

grep -v 'LG' Test_file_genomics_data.txt
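
The grep options used above can be compared side by side on a tiny table. The four lines below are invented for illustration and only imitate the layout of the test file.

```shell
# Build a small tab-separated example table (values are made up)
demo=$(mktemp)
printf 'scaffold\tStart\tEnd\tFST\n'     > "$demo"
printf 'LG13\t1\t10000\t0.12\n'         >> "$demo"
printf 'LG13\t10001\t20000\t0.30\n'     >> "$demo"
printf 'LG20\t1\t10000\t0.05\n'         >> "$demo"

grep 'LG13' "$demo"                  # print the lines that contain LG13
n_lg13=$(grep -c 'LG13' "$demo")     # -c: count matching lines instead
header=$(grep -v 'LG' "$demo")       # -v: lines that do NOT match (the header)
grep -A 1 'scaffold' "$demo"         # -A 1: the match plus one line after it

echo "$n_lg13"
echo "$header"
```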

cut allows you to extract a specific column from a file. By default, the column delimiter is TAB. You can change this using -d

Q Print the column 5 of the test file

Show me the answer!

cut -f 5 Test_file_genomics_data.txt

wc counts the number of lines, characters or words in a file

Q How many lines has the test file?

Show me the answer!

wc -l Test_file_genomics_data.txt

sort will sort lines of a text file

Q Sort the test file by increasing Fst value

Show me the answer!

sort -g -k 4 Test_file_genomics_data.txt
-g applies for a general numeric sort
-k specifies the column by which the lines should be sorted
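
A short sketch of cut, wc and sort on a made-up three-line table. Two details here go slightly beyond the text above and are choices made for this example: -k 4,4 restricts the sort key to column 4 only (a plain -k 4 uses everything from column 4 to the end of the line), and wc reads from standard input so that it prints only the number.

```shell
# Small tab-separated example table without header (values are made up)
demo=$(mktemp)
printf 'LG1\t1\t10000\t0.30\n'       > "$demo"
printf 'LG1\t10001\t20000\t0.05\n'  >> "$demo"
printf 'LG2\t1\t10000\t0.12\n'      >> "$demo"

cut -f 4 "$demo"                     # extract column 4

# Count lines; reading via < makes wc print the number without the filename
n_lines=$(wc -l < "$demo")

# Sort numerically by column 4 and keep the smallest value
lowest=$(sort -g -k 4,4 "$demo" | head -n 1 | cut -f 4)

# -d changes the delimiter, e.g. for comma-separated text
second=$(printf 'a,b,c\n' | cut -d ',' -f 2)

echo "$n_lines" "$lowest" "$second"
```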

sed has many powerful applications including the replacement of one block of text with another.
The syntax for this is

sed 's/'pattern to find'/'text to replace it with'/g' 'filename'

This will output the changed file contents to the screen

If we want to redirect the output to a new file we can use > , for example:

sed 's/'pattern to find'/'text to replace it with'/g' 'filename' > 'new_filename.txt'

Q Change LG to LinkageGroup in the test file and redirect it to a new file called Test_file_genomics_data_renamed.txt, and check that it worked

Show me the answer!

sed 's/LG/LinkageGroup/g' Test_file_genomics_data.txt > Test_file_genomics_data_renamed.txt
head Test_file_genomics_data_renamed.txt
tail Test_file_genomics_data_renamed.txt
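
The same replacement on a two-line made-up file shows that sed writes the changed text to the new file while leaving the original untouched.

```shell
# Two-line example input (values are made up)
demo=$(mktemp)
printf 'LG1\t0.30\nLG2\t0.12\n' > "$demo"

# Replace every LG with LinkageGroup and redirect the result to a new file
renamed=$(mktemp)
sed 's/LG/LinkageGroup/g' "$demo" > "$renamed"

first_line=$(head -n 1 "$renamed")
echo "$first_line"

# The original file still contains the old names
n_orig=$(grep -c '^LG' "$demo")
echo "$n_orig"
```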

Pipe | is a very useful key that sends the output from one unix command as input into another command

example:

grep 'LG12' Test_file_genomics_data.txt | head

Q Create a new file containing the last five lines of the column two of the example file using a single command line

Show me the answer!

cut -f 2 Test_file_genomics_data.txt | tail -n 5 > New_file.txt
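
Pipes can chain more than two commands. As a sketch on invented data, the first pipeline below counts how many different scaffolds occur in column 1; it uses sort -u (sort and keep unique lines), an option that is not covered elsewhere in this activity.

```shell
# Example table with two columns (values are made up)
demo=$(mktemp)
printf 'LG1\t1\nLG1\t10001\nLG2\t1\nLG3\t1\n' > "$demo"

# Column 1 -> unique sorted values -> number of lines
n_scaffolds=$(cut -f 1 "$demo" | sort -u | wc -l)
echo "$n_scaffolds"

# Column 2 -> last two lines -> written to a new file
out=$(mktemp)
cut -f 2 "$demo" | tail -n 2 > "$out"
cat "$out"
```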

It is often useful to copy a file from a remote system (e.g. the Amazon server) to a local system (e.g. your computer), and vice versa.
To do this, a useful command is scp, which stands for ‘secure copy’. It works like cp in the sense that both commands require a source and a destination filesystem location for the copy operation;
the big difference is that with scp, one or both of the locations are on a remote system.
This example would copy a file from your personal computer to the Amazon server:

scp 'source_path/FILE_NAME.txt' 'wpsg@YOUR_IP.compute-1.amazonaws.com:/destination_path/'

Q Copy Test_file_genomics_data.txt to your local system (home directory) [hint: you will have to do that from your local system]

Show me the answer!

first you have to exit your ssh connection or open a new terminal window
scp wpsg@YOUR_IP.compute-1.amazonaws.com:/home/wpsg/workshop_materials/01_unix_intro/basic_unix/Test_file_genomics_data.txt ~/



7) Tips and tricks

7.1 Tab completion & up-arrow.
The tab button can be used to complete a file or directory name and to do a quick lookup of commands. If, for example, a file has a very long name, you can save time by using the tab completion.
Let’s say you wanted to copy a file named ‘reallyLongFilename.txt’ to the parent directory, type:
cp rea… – hit the tab key …

Q Create a file called reallyreallyreally_long_filename.txt, then use less and the tab completion to fill in the filename with the command.
Q Create another file called reallyreallyreally_extralong_filename.txt, then use less and the tab completion to fill in the filename with the command.

If there are two or more files that start the same way then tab completion after typing ‘rea’ will not fill in the whole name as there is ambiguity to which file you mean. In this instance pressing tab will fill in as much as it can (in this case ‘reallyreallyreally_’) and stop. Pressing the tab button twice will now display all the options of files that start with those letters, allowing you to see what extra letters you must type to complete the file.
In this case you can type an extra ‘e’ (giving you ‘reallyreallyreally_e’) and then hit tab and it will complete it for you.

Finally, the up-arrow key is very useful because it lets you step back through the commands you previously typed in the terminal, one at a time.

7.2 Don’t lose your job – use screen
While working on a remote server, screen is very helpful for running multiple jobs in multiple ‘windows’ at the same time without losing them if your local computer crashes or you lose the connection.

Q Type screen in your terminal. Press ‘space’ to get to the prompt of the screen. Start a long process (e.g., a ‘word count’ on a VCF file: zcat ~/workshop_materials/01_unix_intro/vcf_activity/data.vcf.gz | wc). Detach from the screen by pressing ctrl a, then d. Do the same again – i.e. open another screen and start a long job.
Type screen -r (or screen -ls) to list all running screens. To re-attach a screen, type screen -r followed by the first digits listed in the first column of that output. If only one screen is open, screen -r will directly re-attach you to it. To exit a screen, type ‘exit’ within the screen. After your ‘long jobs’ are finished, exit both screens.

7.3 Changing folder permissions

Q Type ls -l in the folder ~/workshop_materials/01_unix_intro/basic_unix. What information does the first column of the output contain? What is the meaning of the 3rd and 4th column?

Show me the answer!

-rw-r--r-- 1 wpsg workshop 5003911 Jan 18 17:50 Test_file_genomics_data.txt

The first column tells you about the permission rights in symbolic notation:
The first character just indicates the file type (‘-‘ regular file, ‘d’ directory file). The remaining nine characters are in three sets, each representing a class of permissions as three characters. The first set represents the ‘user’ class (what the owner can do), the second the ‘group’ class (what the group members can do), and the third class the ‘others’ class (what other users can do). Within each triad, the first character ‘r’ indicates read access, the second ‘w’ write access, and the third ‘x’ executable.

The third column tells you the name of the file ‘owner’, and the 4th the name of the file ‘group’.

Q Which rights do you have as an ‘owner’ for the file: -rwxr-xr--. Which rights do you have as group member, or as ‘other’?

Show me the answer!

The owner has full permissions (rwx), group members can read and execute the file, and all others can only read the file.

Access rights can be changed using the chmod command, followed by a numerical code for the file permissions, and then the file name. You can learn the numerical code, or use a chmod calculator.

Q Create a file touch some_file.txt, type ls -l to see who is the owner and the group, and which rights they have. Change the access rights of this file to -rw-rw----. Type ls -l again to see if the command took effect.

Show me the answer!

chmod 660 some_file.txt
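
The numerical code has one digit per class (owner, group, others), and each digit is the sum of 4 (read), 2 (write) and 1 (execute). So 660 means rw- for the owner (4+2), rw- for the group (4+2) and --- for others (0). The sketch below checks this on a scratch file; cut -c 1-10 simply keeps the first ten characters of the ls -l output.

```shell
# Scratch file to experiment on
perm_file=$(mktemp)

# 6 = 4+2 (rw-), 6 = 4+2 (rw-), 0 = no rights (---)
chmod 660 "$perm_file"
perms=$(ls -l "$perm_file" | cut -c 1-10)
echo "$perms"

# 7 = 4+2+1 (rwx), 5 = 4+1 (r-x), 4 = read only (r--)
chmod 754 "$perm_file"
perms2=$(ls -l "$perm_file" | cut -c 1-10)
echo "$perms2"
```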

7.4 awk – a stream programming language

awk has the general structure of:

awk 'pattern {action}' filename

awk is column (field) aware:
$0 corresponds to the whole line
$1 corresponds to the first column
$2 corresponds to the second column
etc


‘pattern’ can be any logical statement:
$3 > 0 – if column 3 is greater than 0
$1 == 32 – if column 1 equals 32
$1 == $3 – if column 1 equals column 3
$1 == "consensus" – if column 1 equals the string “consensus”

If ‘pattern’ is true, then everything in {…} is executed; if no {action} is given, the whole line is printed.

Q Print the lines corresponding to the first 100kb of LG7 from the file Test_file_genomics_data.txt using awk

Show me the answer!

awk ' $1 == "LG7"' Test_file_genomics_data.txt | awk '$2 < 100000'
We first select the lines corresponding to LG7 and then print those whose window start position (column 2) is below 100000 bp.
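
The two conditions can also be combined in a single awk call using && (logical AND). The following is a sketch on three invented lines, not the workshop file; the second command adds an action so that only column 4 is printed.

```shell
# Example table: scaffold, start, end, FST (values are made up)
demo=$(mktemp)
printf 'LG7\t1\t10000\t0.10\n'          > "$demo"
printf 'LG7\t150001\t160000\t0.40\n'   >> "$demo"
printf 'LG8\t1\t10000\t0.20\n'         >> "$demo"

# Pattern only: matching lines are printed in full
awk '$1 == "LG7" && $2 < 100000' "$demo"

# Pattern plus action: print only column 4 of the matching lines
fst=$(awk '$1 == "LG7" && $2 < 100000 {print $4}' "$demo")
echo "$fst"
```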


7.5 Repeating commands using loops

The real power of the shell is the ability to repeat commands on multiple targets. This is useful for example for creating multiple folders, moving files into each folder, running pipeline on multiple samples etc.
This is accomplished by using a tool called the 'for loop'. In order to use these properly, two features of the shell need to be understood: variables and wildcards.

7.5.1 Variable

A variable is a placeholder for some text such as a directory name, filename, number, sentence etc. These allow for the contents of the variable to be changed within a loop without manually having to do so yourself.
A variable is always initialised using the name you designate for the variable (e.g. file, direc, superman, x, etc.). It can be whatever you want as long as it is a single word without spaces or special characters.
The variable is then called using $ in front of its name. For example, if the variable is named direc it is referenced using $direc.
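
Outside of a loop, a variable can be set directly with = (without spaces around it). A minimal sketch, using direc as in the text and a made-up value:

```shell
# Assign a value: no spaces around the = sign
direc="run1"

# Reference the variable with $
echo "$direc"

# Curly braces mark where the variable name ends when text follows directly
outfile="${direc}_results.txt"
echo "$outfile"
```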

7.5.2 Wildcards

The asterisk (*) is referred to as a wildcard symbol in UNIX. This allows for matching of filenames, directories etc. that all have a certain section of their name in common.
For example, if all your files start with ‘result’ (e.g. result.txt, result.tree, result.nexus, resultFile, result) these will all be recognised using result*.
Alternatively, if all files of interest end with .txt, you can loop over all of them using *.txt

Both variables and wildcards are used in for loops to maximise their power. A for loop has the syntax:

for 'variable' in 'list'
do
'tasks to repeat for each item in list'
done

Each section is written on a separate line (e.g. after ‘do’ hit enter) and instead of a prompt the terminal will display a > to designate you are in a multi-line command.
Alternatively, you can place a loop into a single line using the ; symbol to separate the commands (except for the line break after the 'do' where there should not be a ;).

For example, we can use the command ‘echo’ to print something to the screen. This is used like

echo 'hello'

which will print hello to the terminal.
We can use a for loop to print the number 1 to 10 to screen by typing

for num in {1..10}
do
echo $num
done

This loop starts at 1, places the number in the variable num which can then be accessed inside the loop through $num.

Q Write the same loop as above on one line

Show me the answer!

for num in {1..10};do echo $num;done;
Note the lack of semi-colon after ‘do’

Q Use a loop to create 3 directories which will be named run1, run2 and run3

Show me the answer!

for num in {1..3};do mkdir "run"$num;done;
We place 'run' before the variable to tell the system we want this string to be placed before the variable as part of the directory name. If we wanted it placed after (e.g. create 1run etc.) we could use $num"run"
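
Wildcards and variables come together when looping over files. The sketch below works in a scratch directory with made-up file names and makes a backup copy of every file ending in .txt; the counter variable is only there to show how many files the wildcard matched.

```shell
# Scratch directory with two .txt files and one file that should not match
work=$(mktemp -d)
cd "$work"
touch result1.txt result2.txt notes.log

count=0
for file in *.txt
do
    echo "Processing $file"
    cp "$file" "$file.bak"       # e.g. result1.txt -> result1.txt.bak
    count=$((count + 1))         # simple counter using shell arithmetic
done

echo "$count"
ls
```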


7.6 Output redirection
You already learned how to redirect the output of a command to a file using the >. Instead of writing to a new file, you can also append to an existing file by using >>

Q Append 'This is cool!' to the 'some_file.txt' that you created above. Check if the content has been changed using less.

Show me the answer!

echo 'This is cool!' >> some_file.txt

The default output that is written to the screen and that you redirect to a file is called the 'stdout' (standard out). In addition, there are two more default standard files: 'stdin' (standard input) - the default place where commands listen for information, and 'stderr' (standard error) - used to write error messages.

Q Produce a stderr output.

Show me the answer!

For example, trying to open a file that does not exist with less writes an error message to 'stderr':
less some_file_that_does_not_exist
some_file_that_does_not_exist: No such file or directory

To save the 'stderr' message, you need to redirect the 'stderr' to a file by addressing it using a stream number.
1> will redirect your 'stdout',
2> will redirect your 'stderr',
&> will redirect 'stdout' and 'stderr'.

Q Save the stderr of the 'less' example above to a file.

Show me the answer!

less some_file_that_does_not_exist 2> error_msg.txt

Instead of writing the stdout/stderr to the screen or to a file, you can also immediately discard it by redirecting it to /dev/null.
less some_file_that_does_not_exist 2> /dev/null
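
To see both streams at once, the following sketch runs ls on one file that exists and one that does not, and sends 'stdout' and 'stderr' to separate files. All file names are made up, and the || true at the end is only there so that the non-zero exit status of ls does not stop a script.

```shell
# Scratch directory for the example files
work=$(mktemp -d)
cd "$work"

echo 'first line'  >  out.txt    # > creates or overwrites the file
echo 'second line' >> out.txt    # >> appends to the file
cat out.txt

# One existing and one missing file: the listing goes to stdout,
# the error message goes to stderr
touch exists.txt
ls exists.txt missing.txt > stdout.txt 2> stderr.txt || true

cat stdout.txt
cat stderr.txt
```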



8) Working with genomic(-sized) files - The VCF format, filtering, and first analyses

If you are already comfortable with working on the terminal, or you already finished the UNIX exercise, you can jump to this additional exercise on genomic files in VCF format and first genomic analyses:

The VCF format, filtering, and first analyses.