Basic UNIX, some tips and tricks, and more!
Alexandra Weber & Julia M.I. Barth, 22 January 2018
Background and Objectives
Welcome everyone!
We have prepared this UNIX activity to get all of you to the same level, as you will use the terminal a lot during this workshop.
Depending on your level, this activity will take more or less time. In case you are already very comfortable with UNIX and working in the command line, you may directly proceed to an additional activity on VCF (variant call format) files and the first analyses of genomic data: The VCF format, filtering, and first analyses.
Learning goals:
- navigate in the UNIX environment
- create, move and delete directories
- create, move, delete and edit files
- use basic UNIX commands and know where to find help
Why would we use the terminal / shell in the first place?
Scripting: We can write down a sequence of commands to perform particular tasks or analyses. When working with genomic data, a task usually takes minutes, sometimes hours or even days. It is no fun to sit and wait in front of your computer this long just for a mouse click to initiate the next task.
Powerful Tools: In UNIX, powerful tools are available that enable you to work through large amounts of files, data, and tasks quite quickly and in an automated (that is, programmatic) way.
Easy remote access: In most cases, it is not possible anymore to deal with genomic data on a desktop computer. You will usually run analyses on clusters at high performance computing facilities at your university, or – like in this course – on the Amazon cluster.
A GUI (Graphical User Interface) is not available for many programs: Genomics is a fast evolving field and developing a graphical interface takes time and effort.
Compatibility: The terminal can (remotely) be accessed with computers running on different operating systems
Basic syntax of shell commands
UNIX or shell commands have a basic structure of:
command -options target
The command comes first (such as cd or ls as we will see later), then any options (always preceded by a - and also called flags), and then the target (such as the file to move or the directory to list). These commands are written on the prompt (terminal command line).
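As a minimal sketch of the structure described above, the example below runs ls (the command) with the options -l and -h on a target directory. The temporary directory and the file name example.txt are made up for illustration and are not part of the workshop materials.

```shell
# Create a throwaway directory with one file in it (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/example.txt"

# command: ls | options: -l (long format) and -h (human-readable sizes) | target: "$demo_dir"
listing=$(ls -l -h "$demo_dir")
echo "$listing"
```

Options can usually be combined, so ls -lh gives the same result as ls -l -h.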
How to do this activity
Questions or tasks are indicated with
All text in red underlined with gray color
indicates commands that you can type or copy to the terminal.
If you get stuck, check the answer-box:
Show me the answer!
There are some keys that are used a lot in UNIX commands but can be difficult to find on some keyboards.
~ : tilde
# : hash or number or gate sign
$ : dollar sign
* : asterisk
` : backtick
ctrl c : The panic button. If you are running a process or program and it is stuck or doing something you don’t want it to do, hold control and press c. This will kill the current process and return you to your prompt.
A UNIX cheat sheet like this one here might be helpful as a reference.
Also, never forget that Google is your best friend!
Most UNIX commands and many other programs have help pages accessed through command_name --help or command_name -h, which also describe different ways to run a program.
Most programs also have a more exhaustive manual page accessed by typing man command_name.
Show me the answer!
bcftools --help
What do the cp and vi commands do?
Show me the answer!
man cp : copy files
man vi : Vi IMproved, a programmer’s text editor
To exit the man page press q
A computer file system is laid out as a hierarchical multifurcating tree structure. This may sound confusing but it is easy to think of it as boxes of boxes where each box is a directory.
There is one big box called the root. All other boxes are contained in this one big box. Boxes have labels such as ‘Users’ or ‘Applications’. Each box may contain more boxes (like Desktop or Downloads or Work) or files (like ‘file1.txt’ or ‘draft.docx’)
Thus it is a hierarchical (boxes in boxes), multifurcating (each box can contain multiple boxes or files) tree structure (similar to how a tree has branches and leaves).
There are two ways to refer to directories and their positions in this hierarchy and relationship to other directories: absolute and relative paths:
A) Absolute path
The absolute path is the list of all directories starting from the root that lead to the current directory. Directories are separated using a / (forward slash).
For example, the path to the directory ‘wpsg’ is /home/wpsg/
Show me the answer!
/home/wpsg/software
B) Relative path
A directory can also be referred to by its relative location from some other directory (usually where you are working from). The parent of a directory is referred to using .. (two dots).
The current directory is referred to using . (a single dot).
For example, if I am in ‘software’ and want to get to ‘workshop_materials’ the relative path is ../workshop_materials/
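The difference between the two kinds of path can be tried out safely in a scratch directory. The sketch below assumes nothing about the workshop server: it builds a small stand-in with a ‘software’ and a ‘workshop_materials’ directory inside a temporary directory, then reaches the same place once with a relative and once with an absolute path.

```shell
# Build a small stand-in for the workshop layout (hypothetical temporary directory).
base=$(mktemp -d)
mkdir -p "$base/software" "$base/workshop_materials"

# Relative path: start in 'software', go up one level (..), then down into 'workshop_materials'.
cd "$base/software"
cd ../workshop_materials/
via_relative=$(pwd -P)

# Absolute path: the full list of directories from the root, independent of where we are.
cd "$base/workshop_materials"
via_absolute=$(pwd -P)

# Both lines show the same directory.
echo "$via_relative"
echo "$via_absolute"
```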
Show me the answer!
..
The Home directory is where you are upon login (/home/wpsg/).
Show me the answer!
cd stands for ‘change directory’:
cd software/beast/
Show me the answer!
cd /home/wpsg/
Using the relative path:
cd ../../
Using a really useful shortcut:
cd ~
Show me the answer!
cd software
Show me the answer!
pwd
pwd stands for ‘print working directory’
Show me the answer!
ls
ls stands for ‘list directory content’
Show me the answer!
ls -l
Show me the answer!
ls -lh
For more information on ls check the man pages: man ls
5) Managing your directories and files
A new folder can be created using mkdir (“make directory”).
Show me the answer!
mkdir unix_tutorial
touch
Show me the answer!
touch file1.txt
Editing files in the terminal is a bit tedious but you’ll learn quickly!
Nano and vi/vim are useful text editors:
vi
or vim
and write ‘Hello Workshop team’
Show me the answer!
vi file1.txt or vim file1.txt
i : [for insert mode]
Type: ‘Hello Workshop team’
ESC : to escape the ‘insert mode’
:x : to exit vi/vim while saving modifications to the file
nano and write ‘Hello fellow participants’ in the second line
Show me the answer!
nano file1.txt
ENTER : to access the second line
Write: ‘Hello fellow participants’
ctrl o : to save (yes, ^ corresponds to ctrl in case you were wondering 🙂)
ENTER : to validate saving
ctrl x : to exit
Show me the answer!
cp file1.txt unix_tutorial/file2.txt
cp
stands for ‘copy’
Show me the answer!
mv file1.txt myfile1.txt
The command mv stands for ‘move’.
It is the same command to move or to rename a file (‘move’ a file in the current directory with a different output name).
Show me the answer!
mv myfile1.txt unix_tutorial/
Show me the answer!
cd unix_tutorial
rm file2.txt
rm
stands for ‘remove’
Show me the answer!
cd ..
rm -ri unix_tutorial
To delete an empty directory, you can also use rmdir.
There is no ‘undo’ or ‘trash folder’ in the terminal, so be very careful when deleting files or directories!
It is good practice to use the -i flag as a safety step with rm (e.g., rm -i file1.txt).
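The steps of this section can be replayed in one go. The following is only a sketch: it works inside a temporary directory so that nothing in your home directory is touched, and it leaves out the -i flag because -i asks for confirmation and would pause a non-interactive run.

```shell
# Work in a scratch directory (hypothetical, created just for this example).
cd "$(mktemp -d)"

mkdir unix_tutorial                      # create a directory
touch file1.txt                          # create an empty file
cp file1.txt unix_tutorial/file2.txt     # copy it into the directory under a new name
mv file1.txt myfile1.txt                 # rename the original file
mv myfile1.txt unix_tutorial/            # move the renamed file into the directory
after_move=$(ls unix_tutorial)           # the directory now holds file2.txt and myfile1.txt
echo "$after_move"

rm unix_tutorial/file2.txt               # delete the files one by one ...
rm unix_tutorial/myfile1.txt
rmdir unix_tutorial                      # ... then remove the now empty directory
```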
For the next exercise, you will need the text file Test_file_genomics_data.txt in the directory ~/workshop_materials/01_unix_intro/basic_unix/
Go in the ‘basic_unix’ directory
There are several ways to view the content of a file:
cat will print the whole file on the prompt. It can be useful for small files but is not suited to large files.
You can always use ctrl c to kill the task.
less prints the content of a file one screen length at a time
Within ‘less’:
ENTER : displays the next line
k : displays the previous line
SPACE : displays the next page
b : displays the previous page
shift g : prints the end of the file
q : to exit
If a file contains very long lines, these lines will wrap to fit the screen width.
This can result in a confusing display, especially if there are, for example, long sequences in your file.
To stop this we can use:
less -S
which will stop the text wrapping. You can scroll horizontally across lines using the arrow keys.
Show me the answer!
cat Test_file_genomics_data.txt
less Test_file_genomics_data.txt
This file contains population pairwise genome-wide statistics (FST, DXY, nucleotide diversity per population) calculated on 10 Kb windows.
‘scaffold’ specifies the linkage group or chromosome
‘Start’ & ‘End’ specify the start and end position of the window
‘FST’ & ‘DXY’ represent relative and absolute genomic differentiation measures
‘Set1_pi’ & ‘Set2_pi’ correspond to the nucleotide diversity for population 1 and population 2, respectively
head will print the first 10 lines of a file on the prompt
tail will print the last 10 lines of a file on the prompt
Show me the answer!
head -n 25 Test_file_genomics_data.txt
Show me the answer!
tail -n 50 Test_file_genomics_data.txt
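To see how head, tail and the -n option behave without needing the workshop file, here is a small sketch on a made-up file that simply contains the numbers 1 to 100, one per line (generated with seq).

```shell
# A hypothetical 100-line test file: the numbers 1 to 100, one per line.
numbers=$(mktemp)
seq 1 100 > "$numbers"

first_three=$(head -n 3 "$numbers")   # lines 1, 2, 3
last_two=$(tail -n 2 "$numbers")      # lines 99, 100

# Without -n, head prints 10 lines; wc -l counts them.
default_head=$(head "$numbers" | wc -l | tr -d ' ')

echo "$first_three"
echo "$last_two"
echo "$default_head"
```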
grep is a tool for searching files for specific content. It has many powerful applications, the basics of which will be explained here.
The basic syntax of grep is
grep 'search pattern' 'filename'
Show me the answer!
grep 'LG13' Test_file_genomics_data.txt
Show me the answer!
grep -c 'LG20' Test_file_genomics_data.txt
Show me the answer!
grep -A 3 'scaffold' Test_file_genomics_data.txt
Show me the answer!
grep -v 'LG' Test_file_genomics_data.txt
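The grep options used in the answers above can be compared side by side on a tiny file. The file below is invented for illustration (four lines, tab-separated, loosely imitating the layout of the workshop file); the values carry no meaning.

```shell
# Hypothetical miniature data file: one header line and three data lines.
demo=$(mktemp)
printf 'scaffold\tStart\tEnd\tFST\n' >  "$demo"
printf 'LG13\t1\t10000\t0.12\n'      >> "$demo"
printf 'LG13\t10001\t20000\t0.30\n'  >> "$demo"
printf 'LG20\t1\t10000\t0.05\n'      >> "$demo"

grep 'LG13' "$demo"                    # print every line containing LG13
n_lg20=$(grep -c 'LG20' "$demo")       # -c: count matching lines instead of printing them
grep -A 1 'scaffold' "$demo"           # -A 1: also print 1 line after each match
header=$(grep -v 'LG' "$demo")         # -v: print the lines that do NOT match

echo "$n_lg20"
echo "$header"
```

Note that a pattern such as 'LG1' would also match LG13, because grep looks for the pattern anywhere in the line.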
cut allows you to extract a specific column from a file. By default, the column delimiter is TAB. You can change this using -d.
Show me the answer!
cut -f 5 Test_file_genomics_data.txt
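As a short sketch of the -d option, the example below extracts the second column from a comma-separated file. The file and its contents are made up for this example.

```shell
# Hypothetical comma-separated file with three columns.
csv=$(mktemp)
printf 'LG1,1,10000\nLG2,10001,20000\n' > "$csv"

# -d ',' sets the delimiter to a comma, -f 2 selects the second column.
second_col=$(cut -d ',' -f 2 "$csv")
echo "$second_col"
```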
wc
counts the number of lines, characters or words in a file
Show me the answer!
wc -l Test_file_genomics_data.txt
sort
will sort lines of a text file
Show me the answer!
sort -g -k 4 Test_file_genomics_data.txt
-g : applies a general numeric sort
-k : specifies the column by which the lines should be sorted
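The effect of -g and -k is easiest to see on a few lines. This sketch sorts an invented three-line file by its fourth column, first ascending and then, with the additional -r flag, descending. LC_ALL=C is set here only so that the decimal point is interpreted the same way on every system.

```shell
# Hypothetical tab-separated file; column 4 holds the values to sort by.
demo=$(mktemp)
printf 'LG1\t1\t10000\t0.30\n' >  "$demo"
printf 'LG2\t1\t10000\t0.05\n' >> "$demo"
printf 'LG3\t1\t10000\t0.12\n' >> "$demo"

# Ascending numeric sort on column 4; the first line has the smallest value.
lowest=$(LC_ALL=C sort -g -k 4 "$demo" | head -n 1 | cut -f 4)

# -r reverses the order; the first line now has the largest value.
highest=$(LC_ALL=C sort -g -r -k 4 "$demo" | head -n 1 | cut -f 4)

echo "$lowest"
echo "$highest"
```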
sed
has many powerful applications including the replacement of one block of text with another.
The syntax for this is
sed 's/pattern to find/text to replace it with/g' 'filename'
This will output the changed file contents to the screen.
If we want to redirect the output to a new file we can use >, for example:
sed 's/pattern to find/text to replace it with/g' 'filename' > 'new_filename.txt'
Show me the answer!
sed 's/LG/LinkageGroup/g' Test_file_genomics_data.txt > Test_file_genomics_data_renamed.txt
head Test_file_genomics_data_renamed.txt
tail Test_file_genomics_data_renamed.txt
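The same substitution can be tested on a two-line file before running it on real data. In this sketch the file names demo.txt and demo_renamed.txt are hypothetical; the original file is left unchanged and the modified text goes to the new file.

```shell
# Work in a scratch directory with a hypothetical two-line input file.
cd "$(mktemp -d)"
printf 'LG1\t0.1\nLG2\t0.2\n' > demo.txt

# s/old/new/g replaces every occurrence of 'LG' on each line; > writes the result to a new file.
sed 's/LG/LinkageGroup/g' demo.txt > demo_renamed.txt

first_renamed=$(head -n 1 demo_renamed.txt | cut -f 1)
first_original=$(head -n 1 demo.txt | cut -f 1)
echo "$first_original -> $first_renamed"
```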
The pipe | is a very useful operator that sends the output of one UNIX command as input to another command.
Example:
grep 'LG12' Test_file_genomics_data.txt | head
Show me the answer!
cut -f 2 Test_file_genomics_data.txt | tail -n 5 > New_file.txt
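Pipes can be chained as often as needed. As a sketch that does not depend on any workshop file, the following chain generates the numbers 1 to 100 with seq, keeps those containing the digit 7 with grep, and counts them with wc -l.

```shell
# seq prints 1..100, grep keeps lines containing a 7, wc -l counts the remaining lines.
# tr -d ' ' strips the padding that some versions of wc put in front of the number.
count=$(seq 1 100 | grep '7' | wc -l | tr -d ' ')
echo "$count"
```

Each command only sees the output of the command to its left, so no intermediate files are needed.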
It is often useful to copy a file from a remote system (e.g. the Amazon server) to a local system (e.g. your computer), and vice versa.
To do this, a useful command is scp, which stands for ‘secure copy’. It works like cp in the sense that both commands require a source and a destination filesystem location for the copy operation; the big difference is that with scp, one or both of the locations are on a remote system.
This example would copy a file from your personal computer to the amazon server:
scp 'source_path/FILE_NAME.txt' 'wpsg@YOUR_IP.compute-1.amazonaws.com:/destination_path/'
Show me the answer!
scp wpsg@YOUR_IP.compute-1.amazonaws.com:/home/wpsg/workshop_materials/unix_activity/Test_file_genomics_data.txt ~/
7.1 Tab completion & up-arrow.
The tab button can be used to complete a file or directory name and to do a quick lookup of commands. If, for example, a file has a very long name, you can save time by using the tab completion.
Let’s say you wanted to copy a file named ‘reallyLongFilename.txt’ to the parent directory, type:
cp rea
… – hit the tab key …
reallyreallyreally_long_filename.txt
, then use less
and the tab completion to fill in the filename with the command.
reallyreallyreally_extralong_filename.txt
, then use less
and the tab completion to fill in the filename with the command.
If there are two or more files that start the same way then tab completion after typing ‘rea’ will not fill in the whole name as there is ambiguity to which file you mean. In this instance pressing tab will fill in as much as it can (in this case ‘reallyreallyreally_’) and stop. Pressing the tab button twice will now display all the options of files that start with those letters, allowing you to see what extra letters you must type to complete the file.
In this case you can type an extra ‘e’ (giving you ‘reallyreallyreally_e’) and then hit tab and it will complete it for you.
Finally, the up-arrow key is very useful because it lets you step back through the commands you previously typed in the terminal.
7.2 Don’t lose your job – use screen
While working on a remote server, screen is very helpful: it lets you have multiple running jobs in multiple ‘windows’ at the same time, and you don’t lose them in case your local computer crashes or you lose the connection.
Type screen in your terminal. Press ‘space’ to get to the prompt of the screen. Start a long process (e.g., a ‘word count’ on a VCF file: zcat ~/workshop_materials/01_unix_intro/vcf_activity/data.vcf.gz | wc). Detach from the screen by typing ‘control-a-d’ (hold control and press a, then press d). Do the same again, i.e. open another screen and start a long job.
Type screen -r to list all running screens. To re-attach your screen, type screen -r followed by the first digits that are listed in the first column after using the ‘screen -r’ command. To exit the screen, type ‘exit’ within the screen. If only one screen is open, screen -r will directly re-attach you to this screen. After your ‘long jobs’ are finished, exit both screens.
7.3 Changing folder permissions
Type ls -l in the folder ~/workshop_materials/01_unix_intro/basic_unix. What information does the first column of the output contain? What is the meaning of the 3rd and 4th column?
Show me the answer!
-rw-r--r-- 1 wpsg workshop 5003911 Jan 18 17:50 Test_file_genomics_data.txt
The first column tells you about the permission rights in symbolic notation:
The first character just indicates the file type (‘-’ regular file, ‘d’ directory). The remaining nine characters are in three sets, each representing a class of permissions as three characters. The first set represents the ‘user’ class (what the owner can do), the second the ‘group’ class (what the group members can do), and the third the ‘others’ class (what other users can do). Within each triad, the first character ‘r’ indicates read access, the second ‘w’ write access, and the third ‘x’ execute access.
The third column tells you the name of the file ‘owner’, and the 4th the name of the file ‘group’.
-rwxr-xr--
. Which rights do you have as group member, or as ‘other’?
Show me the answer!
Access rights can be changed using the chmod command, followed by a numerical code for the file permissions, and then the file name. You can learn the numerical code, or use a chmod calculator.
touch some_file.txt, type ls -l to see who is the owner and the group, and which rights they have. Change the access rights of this file to -rw-rw----. Type ls -l again to see if the command took effect.
Show me the answer!
chmod 660 some_file.txt
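The numerical code has one digit each for user, group and others, and each digit is the sum of 4 (read), 2 (write) and 1 (execute). The sketch below applies two codes to a file in a scratch directory and reads back the symbolic notation from the first ten characters of the ls -l output; the scratch directory is an assumption made only for this example.

```shell
# Work in a scratch directory so that no real file permissions are changed.
cd "$(mktemp -d)"
touch some_file.txt

# 6 = 4 + 2 (read + write) for user and group, 0 (nothing) for others.
chmod 660 some_file.txt
perms_660=$(ls -l some_file.txt | cut -c 1-10)

# 7 = 4 + 2 + 1 for user, 5 = 4 + 1 for group, 4 = read only for others.
chmod 754 some_file.txt
perms_754=$(ls -l some_file.txt | cut -c 1-10)

echo "$perms_660"
echo "$perms_754"
```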
7.4 awk – a stream programming language
awk has the general structure of:
awk 'pattern {action}' filename
awk is column (field) aware:
$0 : corresponds to the whole line
$1 : corresponds to the first column
$2 : corresponds to the second column
etc.
‘pattern’ can be any logical statement:
$3 > 0 : if column 3 is greater than 0
$1 == 32 : if column 1 equals 32
$1 == $3 : if column 1 equals column 3
$1 == "consensus" : if column 1 equals the string “consensus”
If ‘pattern’ is true, then everything in {…} is executed. If no {…} is given, the matching line is printed.
awk
Show me the answer!
awk ' $1 == "LG7"' Test_file_genomics_data.txt | awk '$2 < 100000'
We can first select the lines corresponding to LG7 and then print the ones whose window start position (column 2) is below 100000 bp
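The two awk calls can also be written as a single call by joining the conditions with && and adding an action. This is only a sketch on an invented five-line file; the column layout (scaffold, Start, End) is an assumption modelled on the workshop file.

```shell
# Hypothetical tab-separated file with a header and four windows.
demo=$(mktemp)
printf 'scaffold\tStart\tEnd\n'   >  "$demo"
printf 'LG7\t1\t10000\n'          >> "$demo"
printf 'LG7\t90001\t100000\n'     >> "$demo"
printf 'LG7\t100001\t110000\n'    >> "$demo"
printf 'LG8\t1\t10000\n'          >> "$demo"

# pattern: column 1 is LG7 AND column 2 is below 100000
# action:  print column 1 and column 3 of the matching lines
selected=$(awk '$1 == "LG7" && $2 < 100000 {print $1, $3}' "$demo")
echo "$selected"
```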
7.5 Repeating commands using loops
The real power of the shell is the ability to repeat commands on multiple targets. This is useful for example for creating multiple folders, moving files into each folder, running a pipeline on multiple samples etc.
This is accomplished by using a tool called the 'for loop'. In order to use these properly, two features of the shell need to be understood: variables and wildcards.
7.5.1 Variable
A variable is a placeholder for some text such as a directory name, filename, number, sentence etc. These allow for the contents of the variable to be changed within a loop without manually having to do so yourself.
A variable is always initialised using the name you designate for the variable (e.g. file, direc, superman, x, etc.). It can be whatever you want as long as it is a single word without spaces or special characters.
The variable is then called using $ in front of its name. For example, if the variable is named direc it is referenced using $direc.
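A brief sketch of setting and reading a variable outside of a loop. The variable names direc and message and their contents are arbitrary examples; note that the shell does not allow spaces around the = sign.

```shell
# Assign a value: name=value, without spaces around '='.
direc="unix_tutorial"
echo "$direc"

# The content can be replaced at any time ...
direc="run1"

# ... and the variable can be used inside other text.
message="Working in $direc"
echo "$message"
```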
7.5.2 Wildcards
The asterisk (*) is referred to as a wildcard symbol in UNIX. This allows for matching of filenames, directories etc. that all have a certain section of their name in common.
For example, if all your files start with ‘result’ (e.g. result.txt, result.tree, result.nexus, resultFile, result) these will all be recognised using result*.
Alternatively, if all files of interest end with .txt, you can loop over all of them using *.txt
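Both kinds of match can be checked with ls before using them in a loop. The sketch below creates four empty files with made-up names in a scratch directory and counts how many of them each wildcard pattern picks up.

```shell
# Scratch directory with four hypothetical files.
cd "$(mktemp -d)"
touch result.txt result.tree result.nexus notes.txt

# result* matches every name that starts with 'result'.
starts_with_result=$(ls result* | wc -l | tr -d ' ')

# *.txt matches every name that ends with '.txt'.
ends_with_txt=$(ls *.txt | wc -l | tr -d ' ')

echo "$starts_with_result"
echo "$ends_with_txt"
```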
Both variables and wildcards are used in for loops to maximise their power. A for loop has the syntax:
for 'variable' in 'list'
do
'tasks to repeat for each item in list'
done
Each section is written on a separate line (e.g. after ‘do’ hit enter) and instead of a prompt the terminal will display a > to designate you are in a multi-line command.
Alternatively, you can place a loop into a single line using the ; symbol to separate the commands (except for the line break after the 'do' where there should not be a ;).
For example, we can use the command ‘echo’ to print something to the screen. This is used like
echo 'hello'
which will print hello to the terminal.
We can use a for loop to print the numbers 1 to 10 to the screen by typing
for num in {1..10}
do
echo $num
done
This loop starts at 1 and places the number in the variable num, which can then be accessed inside the loop through $num. It then repeats the same for every following number up to 10.
Show me the answer!
for num in {1..10};do echo $num;done;
Note the lack of semi-colon after ‘do’
Show me the answer!
for num in {1..3};do mkdir "run"$num;done;
We place 'run' before the variable to tell the system we want this string to be placed before the variable as part of the directory name. If we wanted it placed after (e.g. create 1run etc.) we could use $num"run"
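Variables and wildcards can be combined so that a loop visits every matching file. In this sketch, two small text files with invented names are created in a scratch directory, and the loop makes a copy of each with the prefix backup_.

```shell
# Scratch directory with two hypothetical input files.
cd "$(mktemp -d)"
printf 'a\nb\n' > sample1.txt
printf 'c\n'    > sample2.txt

# *.txt is expanded once, before the loop starts; $file holds one file name per round.
for file in *.txt
do
    cp "$file" "backup_$file"
done

n_backups=$(ls backup_* | wc -l | tr -d ' ')
echo "$n_backups"
```

Putting "$file" in double quotes keeps the loop working even if a file name contains spaces.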
7.6 Output redirection
You already learned how to redirect the output of a command to a file using >. Instead of writing to a new file, you can also append to an existing file by using >>.
less
.
Show me the answer!
echo 'This is cool!' >> some_file.txt
The default output that is written to the screen and that you redirect to a file is called the 'stdout' (standard out). In addition, there are two more default standard files: 'stdin' (standard input) - the default place where commands listen for information, and 'stderr' (standard error) - used to write error messages.
Show me the answer!
less some_file_that_does_not_exist
less some_file_that_does_not_exist
some_file_that_does_not_exist: No such file or directory
.
To save the 'stderr' message, you need to redirect the 'stderr' to a file by addressing it using a stream number.
1> will redirect your 'stdout',
2> will redirect your 'stderr',
&> will redirect 'stdout' and 'stderr'.
Show me the answer!
less some_file_that_does_not_exist 2> error_msg.txt
Instead of writing the stdout/stderr to the screen or to a file, you can also immediately delete it by redirecting it to /dev/null.
less some_file_that_does_not_exist 2> /dev/null
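The redirections of this section can be seen together in one sketch. It uses ls instead of less, since ls produces both regular output and an error message in a single call when one of two files is missing; the scratch directory and all file names are assumptions made for this example.

```shell
# Scratch directory with one existing file.
cd "$(mktemp -d)"
touch real_file.txt

# stdout (the listing of real_file.txt) goes to out.txt,
# stderr (the complaint about the missing file) goes to err.txt.
# '|| true' lets the example continue although ls reports a failure.
ls real_file.txt some_file_that_does_not_exist > out.txt 2> err.txt || true

# > overwrites a file, >> appends to it.
echo 'first line'  >  log.txt
echo 'second line' >> log.txt
n_lines=$(wc -l < log.txt | tr -d ' ')

cat out.txt
cat err.txt
echo "$n_lines"
```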
8) Working with genomic(-sized) files - The VCF format, filtering, and first analyses
If you are already comfortable with working on the terminal, or you already finished the UNIX exercise, you can jump to this additional exercise on genomic files in VCF format and first genomic analyses: