Basic UNIX, some tips and tricks, and more!

Angelica Cuevas, Alexandra Weber & Julia M.I. Barth, 20 January 2020



Background and Objectives

Welcome everyone!
We have prepared this UNIX activity to get all of you to the same level, as you will use the terminal a lot during this workshop.
Depending on your level, this activity will take more or less time.

Learning goals:

    • navigate in the UNIX environment
    • create, move and delete directories
    • create, move, delete and edit files
    • use basic unix commands and know where to find help

Why would we use the terminal / shell in the first place?

Scripting: We can write down a sequence of commands to perform particular tasks or analyses;
when working with genomic data, a task usually takes minutes, sometimes hours or even days – it’s no fun to sit and wait in front of your computer this long just for a mouse-click to initiate the next task.

Powerful Tools: In UNIX, powerful tools are available that enable you to work through large amounts of files, data, and tasks quite quickly and in an automated (that is, programmatic) way.

Easy remote access: In most cases, it is not possible anymore to deal with genomic data on a desktop computer. You will usually run analyses on clusters at high performance computing facilities at your university, or – like in this course – on the Amazon cluster.

A GUI (Graphical User Interface) is not available for many programs: Genomics is a fast evolving field and developing a graphical interface takes time and effort.

Compatibility: The terminal can (remotely) be accessed with computers running on different operating systems

Basic syntax of shell commands

UNIX or shell commands have a basic structure of:
command -options target
The command comes first (such as cd or ls, as we will see later), then any options (always preceded by a - and also called flags), and then the target (such as the file to move or the directory to list). These commands are typed at the prompt (the terminal command line).

How to do this activity

  • Connect to your Amazon instance via the terminal using SSH.
  • Questions or tasks are indicated with Q .
  • All text in red underlined with gray color indicates commands that you can type or copy to the terminal.
  • If you get stuck, check the answer-box:

Show me the answer!

Oh no, only if you get stuck!! First try to find the answer yourself!



1) Find your keys!

There are some keys that are used a lot in UNIX commands but can be difficult to find on some keyboards.

Q Open a text editor and type the following keys:

~ tilde

/ forward slash

\ back slash or escape

| vertical bar or pipe

# hash or number or gate sign

$ dollar sign

* asterisk

' single quote

" double quote

` backtick

ctrl c The panic button: If you are running a process or program and it is stuck or doing something you don’t want it to do: then hold control and press c. This will kill the current process and return you to your prompt.



2) Getting help

A UNIX cheat sheet like this one here might be helpful as a reference.
Also, never forget that Google is your best friend!
Most UNIX commands and many other programs have help pages accessed through: command_name --help, or command_name -h, which also describe different ways to run a program.
Most programs also have a more exhaustive manual page accessed by typing man PROGRAM_NAME.

Q Access the ls help page and the ls manual page.

Show me the answer!

If you haven’t done it yet, open a terminal window and login to your Amazon instance using ssh:
ssh [email protected]

Type ls --help to access the “list” help page.
Type man ls to access the “list” manual page.

Note that if you type ls -h you don’t get the help page for ls. That’s because -h is the option that prints file sizes in a human-readable format (like 1K, 234M, 2G etc.) when combined with the -l option, as in ls -lh. Look for the -h option when you access the ls help page with ls --help.

Q What do the cp, vim and nano commands do?

Show me the answer!

man cp: copy files
man vim: “Vi IMproved” – a text editor
man nano: “Nano’s ANOther editor” – a text editor
To exit the man page press q


3) Navigation I

A computer file system is laid out as a hierarchical multifurcating tree structure. This may sound confusing but it is easy to think of it as boxes of boxes where each box is a directory.
There is one big box called the root. All other boxes are contained in this one big box. Boxes have labels such as ‘Users’ or ‘Applications’. Each box may contain more boxes (like Desktop or Downloads or Work) or files (like ‘file1.txt’ or ‘draft.docx’)
Thus it is a hierarchical (boxes in boxes), multifurcating (each box can contain multiple boxes or files) tree structure (similar to how a tree has branches and leaves).

[Figure: example directory structure]

There are two ways to refer to directories and their positions in this hierarchy and relationship to other directories: absolute and relative paths.

A) Absolute path
The absolute path is the list of all directories starting from the root that lead to the current directory. Directories are separated using a /.

For example, the path to the directory ‘popgen’ is /home/popgen/

Q What is the absolute path to the directory ‘software’?

Show me the answer!

/home/popgen/software

B) Relative path
A directory can also be referred to by its relative location from some other directory (usually where you are working from). The parent of a directory is referred to using ..
The current directory is referred to using .
For example, if I am in ‘software’ and want to get to ‘workshop_materials’ the relative path is ../workshop_materials/

Q What is the relative path from ‘software’ to ‘popgen’?

Show me the answer!

..


4) Navigation II

Q Go from your home directory to ‘beast’ (in the ‘software’ directory)
The Home directory is where you are upon login (/home/popgen/).

Show me the answer!

cd stands for ‘change directory’:
cd software/beast/

Q Go back to your home directory

Show me the answer!

Using the absolute path:
cd /home/popgen/
Using the relative path:
cd ../../
Using a really useful shortcut:
cd ~

Q Go to ‘software’

Show me the answer!

cd software

Q Check where you are (print the absolute path)

Show me the answer!

pwd
pwd stands for ‘print working directory’

Q List all the items in current directory

Show me the answer!

ls
ls stands for ‘list directory content’

Q Check the file sizes of items in the current directory

Show me the answer!

ls -l
The flag -l specifies a ‘long listing format’. It returns the columns: permissions, number of hardlinks, file owner, file group, file size in bytes, modification date, filename.

Q Check them in human readable format

Show me the answer!

ls -lh
For more information on ls check the man pages: man ls
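
If you would like to try absolute and relative paths without touching the workshop directories, something like the following sketch might help. The directory names (project, data, scripts) are made up for illustration, and it assumes that mktemp is available to create a throwaway location.

```shell
# Build a small, made-up directory tree in a temporary location
# (mkdir creates directories; it is introduced in the next section)
base=$(mktemp -d)
mkdir -p "$base/project/data" "$base/project/scripts"

# Absolute path: spelled out in full from the root, works from anywhere
cd "$base/project/data"
first_location=$(pwd)

# Relative path: '..' is the parent ('project'), then down into 'scripts'
cd ../scripts
second_location=$(pwd)

echo "$first_location"
echo "$second_location"
```

Both printed paths share the same parent directory; only the last part differs.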



5) Managing your directories and files

A new folder can be created using mkdir “make directory”.

Q From your home directory, create a new directory called unix_tutorial

Show me the answer!

mkdir unix_tutorial

Q Create an empty text file called file1.txt using touch

Show me the answer!

touch file1.txt
Tip: touch can also be used to update the access and modification times of an existing file or directory.

Editing files in the terminal is a bit tedious but you’ll learn quickly!
nano and vim are useful text editors:

Q Edit file1.txt using vim and write ‘Hello Workshop team’

Show me the answer!

vi file1.txt or vim file1.txt
i [for insert mode] Type: ‘Hello Workshop team’
ESC to escape the ‘insert mode’
:wq to save the file and exit vim; use :q! to exit without saving modifications to the file.

Q Edit file1.txt using nano and write ‘Hello fellow participants’ in the second line

Show me the answer!

nano file1.txt
ENTER to access the second line
Write: ‘Hello fellow participants’
ctrl o to save –> yes, ^ corresponds to ctrl in case you were wondering 🙂
ENTER to validate saving
ctrl x to exit

Q Copy file1.txt to the ‘unix_tutorial’ directory, name this copy file2.txt

Show me the answer!

cp file1.txt unix_tutorial/file2.txt
cp stands for ‘copy’

Q Rename file1.txt to myfile1.txt

Show me the answer!

mv file1.txt myfile1.txt
the command mv stands for ‘move’
it is the same command to move or to rename a file (‘move’ a file in the current directory with a different output name)

Q Move myfile1.txt to the ‘unix_tutorial’ folder

Show me the answer!

mv myfile1.txt unix_tutorial/

Q Go to ‘unix_tutorial’ and delete the file file2.txt

Show me the answer!

cd unix_tutorial
rm file2.txt
rm stands for ‘remove’

Q Go one directory up and delete the ‘unix_tutorial’ folder

Show me the answer!

cd ..
rm unix_tutorial
This returns an error: rm on its own cannot remove a directory. A flag that allows deleting a directory and its contents is needed.
rm -ri unix_tutorial
The -r flag removes directories and their contents recursively, and -i tells the command to ask for confirmation before each deletion.
To delete an empty directory, you can also use rmdir.

There is no ‘undo’ or ‘trash folder’ in the terminal, so be very careful when deleting files or directories!
It is good practice to use the -i flag as a safety step with rm (e.g., rm -i file1.txt).
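
The whole sequence from this section can be replayed in a throwaway directory, roughly as sketched below. The names sandbox, notes.txt, copy.txt and renamed.txt are invented for this example, and the -i flag is left out only so that the sketch runs without stopping to ask questions.

```shell
# Work inside a temporary directory so nothing important is touched
workdir=$(mktemp -d)
cd "$workdir"

mkdir sandbox                   # create a directory
touch notes.txt                 # create an empty file
cp notes.txt sandbox/copy.txt   # copy into the directory under a new name
mv notes.txt renamed.txt        # rename the file in place
mv renamed.txt sandbox/         # move it into the directory
rm sandbox/copy.txt             # delete a single file

remaining=$(ls sandbox)         # list what is left in the directory
echo "$remaining"

rm -r sandbox                   # delete the directory and its contents
```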



6) View and manipulate files

For the next exercise, you are going to need the text file Test_file_genomics_data.txt, which is located in the directory ~/workshop_materials/20_unix_intro/basic_unix/.
Go to the ‘basic_unix’ directory.

There are several ways to view the content of a file:

cat will print the whole file. It can be useful for viewing small files and as a part of computational processing using the pipe |, but it is not suitable for viewing large files.
Remember, you can always use ctrl c to kill the task.

less prints the content of a file on one screen length at a time

Within ‘less’:

ENTER displays the next line
k displays the previous line
SPACE displays the next page
b displays the previous page
shift g prints the end of the file
q to exit

If a file contains very long lines, these lines will wrap to fit the screen width. This can result in a confusing display, especially if there are, for example, long sequences in your file. To chop the lines and only display the beginning of each line we can use:

less -S

You can scroll horizontally across lines using the arrow keys.

Q explore the example file using less and cat

Show me the answer!

cat Test_file_genomics_data.txt
less Test_file_genomics_data.txt
This file contains population pairwise genome-wide statistics (FST, DXY, nucleotide diversity per population) calculated on 10 Kb windows.
‘scaffold’ specifies the linkage group or chromosome
‘Start’ & ‘End’ specify the start and end position of the window
‘FST’ & ‘DXY’ represent relative and absolute genomic differentiation measures
‘Set1_pi’ & ‘Set2_pi’ correspond to the nucleotide diversity for population 1 and population 2, respectively

head will print the first 10 lines of a file on the prompt
tail will print the last 10 lines of a file on the prompt

Q print the first 25 lines of the example file Test_file_genomics_data.txt

Show me the answer!

head -n 25 Test_file_genomics_data.txt

Q print the last 50 lines of the example file Test_file_genomics_data.txt

Show me the answer!

tail -n 50 Test_file_genomics_data.txt
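
To see exactly which lines head and tail return, it can help to try them on a file whose content is predictable. The sketch below uses a toy file with the numbers 1 to 100 (created with seq, assumed to be installed) instead of the workshop data.

```shell
# A toy file with the numbers 1 to 100, one per line
tmpfile=$(mktemp)
seq 1 100 > "$tmpfile"

first_three=$(head -n 3 "$tmpfile")    # lines 1, 2 and 3
last_two=$(tail -n 2 "$tmpfile")       # lines 99 and 100
# With a plus sign, tail starts at the given line number
# instead of counting from the end of the file
from_line_99=$(tail -n +99 "$tmpfile")

echo "$first_three"
echo "$last_two"
echo "$from_line_99"
rm "$tmpfile"
```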

grep is a tool for searching files for a specific content. It has many powerful applications, the basics of which will be explained here.

The basic syntax of grep is

grep 'search pattern' 'filename'

Q Print all the lines that contain LG13 from the file Test_file_genomics_data.txt

Show me the answer!

grep 'LG13' Test_file_genomics_data.txt

Q How many lines contain LG20 in the file Test_file_genomics_data.txt?

Show me the answer!

grep -c 'LG20' Test_file_genomics_data.txt
Use grep --help to obtain information about the flag -c. It stands for “count”.

Q Print three lines that come before and three lines after the pattern ‘scaffold’ in the file Test_file_genomics_data.txt

Show me the answer!

grep -B 3 'scaffold' Test_file_genomics_data.txt
grep -A 3 'scaffold' Test_file_genomics_data.txt
You can also use the -C flag to print the lines before and after simultaneously.

Q Print all the lines that do not contain LG in the file Test_file_genomics_data.txt

Show me the answer!

grep -v 'LG' Test_file_genomics_data.txt
Use grep --help to obtain information about the flag -v. It stands for “invert-match”.
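
One thing to keep in mind is that grep matches the pattern anywhere in the line, so a search for LG1 would also find LG12. The sketch below illustrates this on a small made-up file (not the workshop data) and uses the -w flag, which restricts matches to whole words.

```shell
# Toy tab-separated file with a header and three rows (values are made up)
toy=$(mktemp)
printf 'scaffold\tStart\tFST\nLG1\t1\t0.10\nLG2\t1\t0.25\nLG12\t1\t0.05\n' > "$toy"

n_loose=$(grep -c 'LG1' "$toy")      # counts LG1 and also LG12
n_exact=$(grep -c -w 'LG1' "$toy")   # -w: LG1 must be a whole word
no_lg=$(grep -v 'LG' "$toy")         # only the header line has no LG

echo "$n_loose"   # prints: 2
echo "$n_exact"   # prints: 1
echo "$no_lg"
rm "$toy"
```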

cut allows you to extract a specific column from a file. By default, the column delimiter is TAB. You can change this using -d

Q Print the column 5 of the test file

Show me the answer!

cut -f 5 Test_file_genomics_data.txt

wc counts the number of lines, characters or words in a file

Q How many lines does the test file have?

Show me the answer!

wc -l Test_file_genomics_data.txt

sort will sort lines of a text file

Q Sort the test file by increasing Fst value

Show me the answer!

sort -g -k 4 Test_file_genomics_data.txt
-g applies a general numeric sort
-k specifies the column by which the lines should be sorted
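
Two details can matter when sorting a table like this: the header line gets sorted along with the data, and -k 4 uses everything from column 4 to the end of the line as the sort key. A possible way to handle both is sketched below on a small made-up table; writing -k 4,4 limits the key to column 4 only, and tail -n +2 skips the header.

```shell
# Toy table with a header line and three data rows (values are made up)
toy=$(mktemp)
printf 'scaffold\tStart\tEnd\tFST\nLG1\t1\t10000\t0.30\nLG1\t10001\t20000\t0.05\nLG2\t1\t10000\t0.12\n' > "$toy"

# Skip the header (start at line 2), then sort numerically on column 4 only
sorted=$(tail -n +2 "$toy" | sort -g -k 4,4)

# The first line of the sorted output holds the lowest FST value
lowest=$(echo "$sorted" | head -n 1 | cut -f 4)

echo "$sorted"
echo "$lowest"   # prints: 0.05
rm "$toy"
```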

sed has many powerful applications including the replacement of one block of text with another.
The syntax for this is

sed 's/pattern_to_find/replacement_text/g' filename

This will output the changed file contents to the screen; the original file is not modified.

If we want to redirect the output to a new file we can use > , for example:

sed 's/pattern_to_find/replacement_text/g' filename > new_filename.txt

Q Change the text ‘LG’ to ‘LinkageGroup’ on every line of the test file and redirect the output to a new file called Test_file_genomics_data_renamed.txt. Then visually inspect the file to check if it worked.

Show me the answer!

sed 's/LG/LinkageGroup/g' Test_file_genomics_data.txt > Test_file_genomics_data_renamed.txt
head Test_file_genomics_data_renamed.txt
tail Test_file_genomics_data_renamed.txt
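
The g at the end of the sed expression stands for ‘global’: without it, only the first match on each line is replaced. The sketch below shows the difference on a made-up line of text, and also that the input file itself stays unchanged when the output is redirected to a new file.

```shell
# Without g only the first LG on the line is replaced, with g all of them
first_only=$(echo 'LG1 LG2' | sed 's/LG/LinkageGroup/')
every_match=$(echo 'LG1 LG2' | sed 's/LG/LinkageGroup/g')
echo "$first_only"    # prints: LinkageGroup1 LG2
echo "$every_match"   # prints: LinkageGroup1 LinkageGroup2

# Redirecting to a new file leaves the original file as it was
toy=$(mktemp)
renamed=$(mktemp)
printf 'LG1\t0.10\nLG2\t0.25\n' > "$toy"
sed 's/LG/LinkageGroup/g' "$toy" > "$renamed"
original_first=$(head -n 1 "$toy" | cut -f 1)
renamed_first=$(head -n 1 "$renamed" | cut -f 1)
echo "$original_first"
echo "$renamed_first"
rm "$toy" "$renamed"
```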

The pipe | is a very useful key that sends the output from one unix command as input into another command

example:

grep 'LG12' Test_file_genomics_data.txt | head

Q Create a new file containing the last five lines of column two of the example file Test_file_genomics_data.txt using a single command line.

Show me the answer!

cut -f 2 Test_file_genomics_data.txt | tail -n 5 > New_file.txt
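
Pipes can be chained as often as needed, with each command working on the output of the one before it. As a rough illustration, the sketch below combines four of the commands from this section to find the lowest value in column three for one linkage group. The table is made up for the example.

```shell
# Toy table: scaffold, window start, FST (values are made up)
toy=$(mktemp)
printf 'LG1\t1\t0.30\nLG2\t1\t0.12\nLG1\t10001\t0.05\nLG1\t20001\t0.20\n' > "$toy"

# 1) keep the LG1 rows, 2) take column 3, 3) sort numerically,
# 4) keep only the first (smallest) value
smallest=$(grep -w 'LG1' "$toy" | cut -f 3 | sort -g | head -n 1)

echo "$smallest"   # prints: 0.05
rm "$toy"
```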

It is often useful to copy a file from a remote system (e.g., the Amazon server) to a local system (e.g., your computer), and vice versa.
A useful command for this is scp, which stands for ‘Secure Copy Protocol’. It works like cp in the sense that both commands require a source and a destination location for the copy operation; the difference is that with scp, one or both of the locations are on a remote system and require authentication.
This example would copy a file from your personal computer to the Amazon server:

scp 'source_path/FILE_NAME.txt' '[email protected]:/destination_path/'

Q Copy Test_file_genomics_data.txt to your local system (home directory) [hint: you will have to do that from your local system]

Show me the answer!

First you have to open a new terminal window.
Then type: scp [email protected]:/home/popgen/workshop_materials/20_unix_intro/basic_unix/Test_file_genomics_data.txt ~

! Please remember that we ask you NOT to download larger files from the Amazon instances to your computer since this is quite expensive (downloading smaller files like scripts or personal notes, as well as uploading to the Amazon instance is OK). We will provide a link to download the workshop material (including e.g., input files, templates, etc) at the end of the workshop.



7) Tips and tricks

7.1 Tab completion & up-arrow.
The tab button can be used to complete a file or directory name and to do a quick lookup of commands. If, for example, a file has a very long name, you can save time by using the tab completion.
Let’s say you wanted to copy a file named ‘reallyLongFilename.txt’ to the parent directory, type:
cp rea… – hit the tab key …

Q Create a file called reallyreallyreally_long_filename.txt, then use ls -l and the tab completion to fill in the filename
Q Create another file called reallyreallyreally_extralong_filename.txt and then again type ls -l followed by trying to utilise the tab completion to fill in this even longer filename.

As you probably found out by doing the above, if there are two or more files that start the same way then tab completion after typing ‘rea’ will not fill in the whole name as there is ambiguity to which file you mean. In this instance pressing tab will fill in as much as it can (in this case ‘reallyreallyreally_’) and stop. Pressing the tab button twice will now display all the options of files that start with those letters, allowing you to see what extra letters you must type to complete the file.
In this case you can type an extra ‘e’ (giving you ‘reallyreallyreally_e’) and then hit tab and it will complete it for you.

Finally, the up-arrow key is very useful because pressing it repeatedly shows you the history of the commands you typed in the terminal previously.

7.2 Don’t lose your job – use screen
While working on a remote server, screen is very helpful for running multiple jobs in multiple ‘windows’ at the same time without losing them in case your local computer crashes or you lose the connection.

Q Type screen in your terminal. Press ‘space’ to get to the prompt of the screen.
Start a long process, e.g., a ‘word count’ on a large(ish) VCF file: zcat ~/workshop_materials/20_unix_intro/basic_unix/data.vcf.gz | wc.
Detach from the screen by pressing ‘control-a’ and then ‘d’. Do the same again, i.e., open another screen and start a long job.

If you intend to have multiple screens and run different jobs in them, it is useful to give each session a meaningful name so that you know which job is running in which screen. You can do that with the -S flag followed by a name of your choice, like screen -S my_job (here my_job is just an example name).

To list all running screens you can use screen -list or screen -r

To re-attach your screen, type screen -r followed by the screen name, if you named it when starting the screen session, otherwise you can use the first digits that are listed in the first column after using the ‘screen -r’ command.

To exit the screen, type ‘exit’ within the screen. If only one screen is open, screen -r will directly re-attach you with this screen. After your ‘long jobs’ are finished, exit both screens by typing exit while being attached to the screen.

7.3 Changing folder permissions

Q Type ls -l in the folder ~/workshop_materials/20_unix_intro/basic_unix. What information does the first column of the output contain? What is the meaning of the 3rd and 4th column?

Show me the answer!

-rw-r--r-- 1 popgen workshop 5003911 Jan 16 18:07 Test_file_genomics_data.txt
-rw-r--r-- 1 popgen workshop 184547450 Jan 17 15:52 data.vcf.gz

The first column tells you about the permission rights in symbolic notation:
The first character indicates the file type (‘-’ regular file, ‘d’ directory). The remaining nine characters are in three sets of three characters, each representing a class of permissions. The first set represents the ‘user’ class (what the owner can do), the second the ‘group’ class (what the group members can do), and the third the ‘others’ class (what other users can do). Within each triad, the first character ‘r’ indicates read access, the second ‘w’ write access, and the third ‘x’ execute access. A ‘-’ in any of these positions means that the permission is not granted.

The third column tells you the name of the file ‘owner’, and the 4th the name of the file ‘group’.

Q Which rights do you have as an ‘owner’ for the file: -rwxr-xr--. Which rights do you have as group member, or as ‘other’?

Show me the answer!

The owner has full permissions (rwx), group members can read and execute the file, and all others can only read the file.

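Access rights can be changed using the chmod command, followed by a numerical code for the file permissions, and then the file name. In the numerical code, each of the three digits describes one class (user, group, others) and is the sum of 4 (read), 2 (write) and 1 (execute); 754, for example, corresponds to rwxr-xr--. You can learn the numerical code, or use a chmod calculator.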

Q Create a file touch some_file.txt, type ls -l to see who is the owner and the group, and which rights they have. Change the access rights of this file to -rw-rw----. Type ls -l again to see if the command took effect.

Show me the answer!

chmod 660 some_file.txt
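
Besides the numerical code, chmod also accepts a symbolic form in which the classes (u, g, o) and the rights (r, w, x) are spelled out. The sketch below sets permissions on a temporary file both ways and reads them back from the first ten characters of the ls -l output. The temporary file only stands in for a real file such as some_file.txt.

```shell
# A temporary file to experiment on
demo=$(mktemp)

# Numerical form: 6 = 4+2 (rw-) for user and group, 0 (---) for others
chmod 660 "$demo"
perms_numeric=$(ls -l "$demo" | cut -c 1-10)

# Symbolic form of 754: user rwx, group r-x, others r--
chmod u=rwx,g=rx,o=r "$demo"
perms_symbolic=$(ls -l "$demo" | cut -c 1-10)

echo "$perms_numeric"    # prints: -rw-rw----
echo "$perms_symbolic"   # prints: -rwxr-xr--
rm "$demo"
```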

7.4 awk – a stream programming language

awk has the general structure of:

awk 'pattern {action}' filename

awk is column (field) aware:
$1 corresponds to the first column
$2 corresponds to the second column
$3 corresponds to the third column
etc.
$0 corresponds to the whole line

‘pattern’ can be any logical statement:
$3 > 0 – if column 3 is greater than 0
$1 == 32 – if column 1 equals 32
$1 == $3 – if column 1 equals column 3
$1 == "consensus" – if column 1 equals the string “consensus”

if ‘pattern’ is true, then everything in {…} is executed; if no {action} is given, the whole line is printed

Q Print the lines corresponding to the first 100kb of LG7 from the file Test_file_genomics_data.txt using awk

Show me the answer!

awk ' $1 == "LG7"' Test_file_genomics_data.txt | awk '$2 < 100000'

We can first select the lines corresponding to LG7 and then print the ones corresponding to the windows whose start coordinates are smaller than 100000 bp
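
The two conditions could also be combined in a single awk call using && (‘and’), and an action can be added to print only some of the columns. The sketch below shows one way this might look, using a small made-up table in place of the workshop file.

```shell
# Toy table: scaffold, start, end, FST (values are made up)
toy=$(mktemp)
printf 'LG7\t1\t10000\t0.10\nLG7\t150001\t160000\t0.40\nLG8\t1\t10000\t0.20\n' > "$toy"

# Both conditions must be true; the action prints columns 1, 2 and 4
selected=$(awk '$1 == "LG7" && $2 < 100000 {print $1, $2, $4}' "$toy")

echo "$selected"   # prints: LG7 1 0.10
rm "$toy"
```

The comma in the print statement separates the printed columns with a single space.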


7.5 Repeating commands using loops

One of the really powerful things you can do in the terminal is the ability to repeat commands on multiple targets. For example creating many folders, moving multiple files into each folder, running commands and pipelines (sequences of commands) on multiple files/samples etc.
This can be accomplished by using the so called 'for loop'. For loops use two features of the shell that have not been covered before: variables and wildcards.

7.5.1 Variables

A variable in the terminal shell can be seen as a placeholder for some text, such as a directory name, filename, number, sentence etc.
A variable is initialised using the name you designate for it. The name can be whatever you want (e.g. 'direc', 'superman', 'x', 'file' etc.) as long as it is a single word without spaces or special characters. The variable is then addressed using $ in front of its name. For example, if the variable is named 'direc' it is referenced using $direc.

The contents of the variable can be changed within a loop automatically without having to do so yourself manually.
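
As a small sketch of how this looks in practice: a variable is set with name=value (no spaces around the = sign) and read with a $ in front of its name. The curly braces in the second example mark where the variable name ends, which is needed when more text follows directly. The names and values here are only examples.

```shell
# Set a variable; there must be no spaces around the = sign
direc='unix_tutorial'
echo "$direc"               # prints: unix_tutorial

# Braces separate the variable name from the text that follows it
backup="${direc}_backup"
echo "$backup"              # prints: unix_tutorial_backup
```

Without the braces, the shell would look for a variable called direc_backup, which does not exist.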

7.5.2 Wildcards

The asterisk * is referred to as a 'wildcard' symbol in unix. It allows for matching of filenames, directories etc. that have certain sections of their name in common.
For example, if the names of all the files you are interested in start with ‘result’ (e.g. result.txt, result.tree, result.nexus, resultFile, result) these will all be recognised using result*.
Alternatively, if all files of interest end with .txt, you can loop over all of them using *.txt

Both variables and wildcards are used in for loops to maximise their power.

A for-loop has the syntax:

for 'variable' in 'list'
do
'tasks to repeat for each item in list'
done

Each section is written on a separate line (e.g. after ‘do’ hit enter), and instead of a prompt the terminal will display a > to indicate that you are in a multi-line command.
Alternatively, you can place a loop into a single line using the ; symbol to separate the commands, except for the line break after the 'do' where there should not be a ;

For example, we can use the command ‘echo’ to print something to the screen. This is used like

echo 'hello'

which will print hello to the terminal.

We can also define a variable whose content is the word 'hello':

my_variable='hello'

and print the content of the variable to the screen with echo $my_variable

We can use a for-loop to print the number 1 to 10 to screen by typing

for num in {1..10}
do
echo $num
done

This loop goes through the numbers 1 to 10 one at a time; in each round it places the current number in the variable num, which can then be accessed inside the loop through $num.

Q Write the same loop as above on one line

Show me the answer!

for num in {1..10};do echo $num;done;

Note the lack of semi-colon after ‘do’

Q Use a loop to create 3 directories which will be named run1, run2 and run3

Show me the answer!

for num in {1..3};do mkdir "run"$num;done;

We place 'run' before the variable to tell the system we want this string to be placed before the variable as part of the directory name. If we wanted it placed after (e.g. create 1run etc.) we could use $num"run"
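
A loop can also run over files selected with a wildcard instead of over numbers. The sketch below makes a backup copy of every .txt file in a temporary directory. The file names are invented, and the quotes around "$file" are there so that names containing spaces would also work.

```shell
# Set up a temporary directory with three made-up files
workdir=$(mktemp -d)
cd "$workdir"
touch sampleA.txt sampleB.txt notes.log

# *.txt is expanded to the matching file names before the loop starts
for file in *.txt
do
    cp "$file" "backup_$file"
done

ls
```

Only the two .txt files are copied; notes.log does not match the wildcard and is left alone.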


7.6 Output redirection
You already learned how to redirect the output of a command to a file using the >. Instead of writing to a new file, you can also append to an existing file by using >>

Q Append 'This is cool!' to the 'some_file.txt' that you created above. Check if the content has been changed using less.

Show me the answer!

echo 'This is cool!' >> some_file.txt

The default output that is written to the screen and that you redirect to a file is called the 'stdout' (standard out). In addition, there are two more default standard files: 'stdin' (standard input) - the default place where commands listen for information, and 'stderr' (standard error) - used to write error messages.

Q Produce a stderr output.

Show me the answer!

For example, a 'stderr' can be produced by less some_file_that_does_not_exist
some_file_that_does_not_exist: No such file or directory

To save the 'stderr' message, you need to redirect the 'stderr' to a file by addressing it using a stream number.
1> will redirect your 'stdout',
2> will redirect your 'stderr',
&> will redirect 'stdout' and 'stderr'.

Q Save the stderr of the 'less' example above to a file.

Show me the answer!

less some_file_that_does_not_exist 2> error_msg.txt

Instead of writing the stdout/stderr to the screen or to a file, you can also discard it immediately by redirecting it to /dev/null:
less some_file_that_does_not_exist 2> /dev/null
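
A single command can write to both streams at once, and each stream can be sent to its own file. In the sketch below, ls is asked to list one file that exists and one that does not, so it produces both normal output and an error message. The file names are made up, and the '|| true' at the end is only there so the sketch keeps going in shells that are set to stop on errors.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch exists.txt

# stdout (the listed file name) goes to out.txt,
# stderr (the complaint about missing.txt) goes to err.txt
ls exists.txt missing.txt 1> out.txt 2> err.txt || true

cat out.txt
cat err.txt
```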


8) Writing a script

Now what if we want to save some of the commands and run them automatically without the need to type them all over again, or, for example, to apply the same commands later to new files? For that we can write a bash script.

Q Start by creating an empty file and name it 'my_script.sh' then modify the content of the file, using nano, by including the for loop we used before:

Show me the answer!

touch my_script.sh
nano my_script.sh

You can write the following in your script:

#!/bin/bash
# a for-loop to print text and numbers from 1 to 10
for num in {1..10}; do
echo 'day '$num ':has been cool!'
done

Using the '#' we can write notes in our script. They can be a small description of what the script is about.
Save and exit the script. ctrl o to save. ENTER to validate saving. ctrl x to exit

Q This script will print out a series from 1 to 10 with the text 'day 1 :has been cool!', 'day 2 :has been cool!' and so on. Let's execute the script and see the output. To execute our script we can use the command:

bash my_script.sh

This will print the result of our for loop to the screen.
Instead of printing it to the screen, we could save the output in a file by using >> to redirect the output.

Show me the answer!

Modify your script by adding >> after the loop and specify a file where the output would be saved.

#!/bin/bash
# a for-loop to print text and numbers from 1 to 10
for num in {1..10}; do
echo 'day '$num ':has been cool!'
done >> my_output.txt

Check the output file that was created with less my_output.txt

We can go a bit further and modify our script to make it interactive by allowing the script to get some of its arguments directly in the command line.
Modify the script by replacing ':has been cool!' by ${1} and my_output by ${2}:

#!/bin/bash
# a for-loop to print numbers from 1 to 10 followed by the text given as first argument
for num in {1..10}; do
echo 'day '$num "${1}"
done >> "${2}.txt"

Save the changes and exit the script.

The ${1} and ${2} variables correspond to the first and second argument we will specify in the command line when running the script. Let's say we want to specify a different text for the result of the day ':has been cool!' and change it to 'great' and to save it in a different file called 'output2.txt', then we need to provide those arguments in the command line.

bash my_script.sh great output2

Now the new output file 'output2.txt' should have the word 'great' instead of the initial ':has been cool!'.

Q Compare 'my_output.txt' and 'output2.txt'

Show me the answer!

cat my_output.txt
cat output2.txt
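
If an argument consists of more than one word, it needs to be quoted on the command line, and it is safest to also quote ${1} and ${2} inside the script. The sketch below writes a shortened version of the script to a file and runs it with a two-word text. The script name greet_days.sh and the output name output3 are made up for this example, and the 'cat > file << EOF' construct is simply a way to create the script file without opening an editor.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write the script to a file; everything up to the line EOF ends up in it
cat > greet_days.sh << 'EOF'
#!/bin/bash
# print 'day 1', 'day 2', 'day 3' followed by the first argument,
# and save the result in a file named after the second argument
for num in 1 2 3; do
    echo "day $num $1"
done > "$2.txt"
EOF

# The quotes keep 'was great' together as one single argument
bash greet_days.sh 'was great' output3
cat output3.txt
```

Note that this sketch uses > rather than >>, so running it again replaces the output file instead of appending to it.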