Basic UNIX, some tips and tricks, and more!
Angelica Cuevas, Alexandra Weber & Julia M.I. Barth, 20 January 2020
Background and Objectives
Welcome everyone!
We have prepared this UNIX activity to get all of you to the same level, as you will use the terminal a lot during this workshop.
Depending on your level, this activity will take more or less time.
Learning goals:
- navigate in the UNIX environment
- create, move and delete directories
- create, move, delete and edit files
- use basic unix commands and know where to find help
Why would we use the terminal / shell in the first place?
Scripting: We can write down a sequence of commands to perform particular tasks or analyses;
when working with genomic data, a task usually takes minutes, sometimes hours or even days – it’s no fun to sit and wait in front of your computer this long just for a mouse-click to initiate the next task.
Powerful Tools: In UNIX, powerful tools are available that enable you to work through large amounts of files, data, and tasks quite quickly and in an automated (that is, programmatic) way.
Easy remote access: In most cases, it is not possible anymore to deal with genomic data on a desktop computer. You will usually run analyses on clusters at high performance computing facilities at your university, or – like in this course – on the Amazon cluster.
A GUI (Graphical User Interface) is not available for many programs: Genomics is a fast evolving field and developing a graphical interface takes time and effort.
Compatibility: The terminal can (remotely) be accessed with computers running on different operating systems
Basic syntax of shell commands
UNIX or shell commands have a basic structure of:
command -options target
The command comes first (such as cd or ls, as we will see later), then any options (usually preceded by a - and also called flags), and then the target (such as the file to move or the directory to list). These commands are written at the prompt (the terminal command line).
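As a minimal sketch of this structure, the lines below run ls with two options on one target. The target / (the root directory) is an assumption chosen only because it exists on every UNIX system; it is not part of the workshop material.

```shell
# command: ls   options: -l and -h   target: /
ls -l -h /

# Single-letter options can usually be combined behind one dash
ls -lh /

# $? holds the exit status of the last command: 0 means it succeeded
status=$?
echo "exit status: $status"
```

Both ls lines should print the same listing, since -l -h and -lh are two ways of writing the same options.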
How to do this activity
- Connect to your Amazon instance via the terminal using SSH.
- Questions or tasks are indicated with Q.
- All text in red underlined with gray color indicates commands that you can type or copy to the terminal.
- If you get stuck, check the answer-box: Show me the answer!
There are some keys that are used a lot in UNIX commands but can be difficult to find on some keyboards.
- ~ : tilde
- # : hash or number or gate sign
- $ : dollar sign
- * : asterisk
- ` : backtick
- ctrl c : The panic button. If you are running a process or program and it is stuck or doing something you don’t want it to do, hold control and press c. This will kill the current process and return you to your prompt.
A UNIX cheat sheet like this one here might be helpful as a reference.
Also, never forget that Google is your best friend!
Most UNIX commands and many other programs have help pages accessed through command_name --help or command_name -h, which also describe different ways to run a program.
Most programs also have a more exhaustive manual page accessed by typing man PROGRAM_NAME.
ls
help page and the ls
manual page.
Show me the answer!
ssh [email protected]
Type ls --help to access the “list” help page.
Type man ls to access the “list” manual page.
Note that if you type ls -h you don’t get the help page for ls. That’s because -h is the option that prints file sizes in a human-readable format (like 1K, 234M, 2G etc.) when combined with the -l option, as in ls -lh. You can find the -h option in the ls help page with ls --help.
Q What do the cp, vim and nano commands do?
Show me the answer!
man cp: copy files
man vim: “Vi IMproved” – a text editor
man nano: “Nano’s ANOther editor” – a text editor
To exit the man page press q
A computer file system is laid out as a hierarchical multifurcating tree structure. This may sound confusing but it is easy to think of it as boxes of boxes where each box is a directory.
There is one big box called the root. All other boxes are contained in this one big box. Boxes have labels such as ‘Users’ or ‘Applications’. Each box may contain more boxes (like Desktop or Downloads or Work) or files (like ‘file1.txt’ or ‘draft.docx’)
Thus it is hierarchical (boxes in boxes), multifurcating (each box can contain multiple boxes or files) tree structure (similar to how a tree has branches and leaves).
There are two ways to refer to directories and their positions in this hierarchy and relationship to other directories: absolute and relative paths.
A) Absolute path
The absolute path is the list of all directories starting from the root that lead to the current directory. Directories are separated using a /.
For example, the path to the directory ‘popgen’ is /home/popgen/
Show me the answer!
/home/popgen/software
B) Relative path
A directory can also be referred to by its relative location from some other directory (usually where you are working from). The parent of a directory is referred to using ..
The current directory is referred to using .
For example, if I am in ‘software’ and want to get to ‘workshop_materials’ the relative path is ../workshop_materials/
Show me the answer!
..
The Home directory is where you are upon login (/home/popgen/).
Show me the answer!
cd stands for ‘change directory’:
cd software/beast/
Show me the answer!
cd /home/popgen/
Using the relative path:
cd ../../
Using a really useful shortcut:
cd ~
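The difference between absolute and relative paths can be tried out safely with the sketch below. It assumes that mktemp is available and builds a throw-away directory tree whose names (software, beast, workshop_materials) only mimic the workshop layout; nothing in your real home directory is touched.

```shell
# Build a throw-away directory tree in a temporary location
base=$(mktemp -d)
mkdir -p "$base/software/beast" "$base/workshop_materials"

cd "$base/software/beast"       # absolute path: starts at the root /
pwd

cd ../../workshop_materials     # relative path: two levels up, then down
pwd

cd ..                           # .. is the parent directory
here=$(pwd)
echo "$here"
```

After the last cd, pwd reports the temporary base directory again, which contains both software and workshop_materials.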
Show me the answer!
cd software
Show me the answer!
pwd
pwd stands for ‘print working directory’
Show me the answer!
ls
ls stands for ‘list directory content’
Show me the answer!
ls -l
The flag -l specifies a ‘long listing format’. It returns the columns: permissions, number of hard links, file owner, file group, file size in bytes, modification date, filename.
Show me the answer!
ls -lh
For more information on ls check the man pages: man ls
5) Managing your directories and files
A new folder can be created using mkdir (“make directory”).
Show me the answer!
mkdir unix_tutorial
touch
Show me the answer!
touch file1.txt
Tip: touch can also be used to update the access and modification times of an existing file or directory.
Editing files in the terminal is a bit tedious but you’ll learn quickly! nano and vim are useful text editors:
vim
and write ‘Hello Workshop team’
Show me the answer!
vi file1.txt or vim file1.txt
i [for insert mode]
Type: ‘Hello Workshop team’
ESC to escape the ‘insert mode’
:q! to exit vim without saving modifications to the file, or :wq to save and exit.
nano
and write ‘Hello fellow participants’ in the second line
Show me the answer!
nano file1.txt
ENTER to access the second line
Write: ‘Hello fellow participants’
ctrl o to save (yes, ^ corresponds to ctrl in case you were wondering 🙂)
ENTER to validate saving
ctrl x to exit
Show me the answer!
cp file1.txt unix_tutorial/file2.txt
cp
stands for ‘copy’
Show me the answer!
mv file1.txt myfile1.txt
The command mv stands for ‘move’. The same command is used to move or to rename a file (renaming ‘moves’ a file within the current directory to a different name).
Show me the answer!
mv myfile1.txt unix_tutorial/
Show me the answer!
cd unix_tutorial
rm file2.txt
rm
stands for ‘remove’
Show me the answer!
cd ..
rm unix_tutorial
This returns an error: rm on its own cannot remove a directory. A flag that allows deleting a directory and its contents is needed:
rm -ri unix_tutorial
The -r flag removes directories and their contents recursively, and -i tells the command to ask for permission before each deletion.
To delete an empty directory, you can also use rmdir.
There is no ‘undo’ or ‘trash folder’ in the terminal, so be very careful when deleting files or directories!
It is good practice to use the -i flag as a safety step with rm (e.g., rm -i file1.txt).
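The steps of this section can be replayed in one go with the sketch below. It uses the same file and directory names as the exercises but runs inside a temporary directory created with mktemp (an assumption, so that no existing files are affected). The -i flag is left out here only so that the commands run without prompting.

```shell
# Work in a throw-away directory
work=$(mktemp -d)
cd "$work"

mkdir unix_tutorial                     # create a directory
touch file1.txt                         # create an empty file
cp file1.txt unix_tutorial/file2.txt    # copy under a new name
mv file1.txt myfile1.txt                # rename
mv myfile1.txt unix_tutorial/           # move into the directory

rm unix_tutorial/file2.txt              # remove a single file
remaining=$(ls unix_tutorial)           # what is left in the directory
echo "$remaining"

rm -r unix_tutorial                     # remove the directory and its contents
```

When working on real data, adding -i to the rm commands is the safer choice.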
For the next exercise, you are going to need the text file Test_file_genomics_data.txt, which is located in the directory ~/workshop_materials/20_unix_intro/basic_unix/.
Go to the ‘basic_unix’ directory.
There are several ways to view the content of a file:
cat will print the whole file. It can be useful for viewing small files and as a part of computational processing using the pipe |, but it is not suitable for viewing large files.
Remember, you can always use ctrl c to kill the task.
less
prints the content of a file on one screen length at a time
Within ‘less’:
ENTER
displays the next line
k
displays the previous line
SPACE
displays the next page
b
displays the previous page
shift g
jumps to the end of the file
q
to exit
If a file contains very long lines, these lines will wrap to fit the screen width. This can result in a confusing display, especially if there are, for example, long sequences in your file. To chop the lines and only display the beginning of each line we can use:
less -S
You can scroll horizontally across lines using the arrow keys.
Show me the answer!
cat Test_file_genomics_data.txt
less Test_file_genomics_data.txt
This file contains population pairwise genome-wide statistics (FST, DXY, nucleotide diversity per population) calculated on 10 Kb windows.
‘scaffold’ specifies the linkage group or chromosome
‘Start’ & ‘End’ specify the start and end position of the window
‘FST’ & ‘DXY’ represent relative and absolute genomic differentiation measures
‘Set1_pi’ & ‘Set2_pi’ correspond to the nucleotide diversity for population 1 and population 2, respectively
head
will print the first 10 lines of a file on the prompt
tail
will print the last 10 lines of a file on the prompt
Test_file_genomics_data.txt
Show me the answer!
head -n 25 Test_file_genomics_data.txt
Test_file_genomics_data.txt
Show me the answer!
tail -n 50 Test_file_genomics_data.txt
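To see how the -n flag changes what head and tail print, here is a small sketch on a generated file. The file is an assumption made for illustration: it is created with seq and simply holds the numbers 1 to 100, one per line.

```shell
# Create a 100-line demo file: line 1 contains 1, line 100 contains 100
demo=$(mktemp)
seq 1 100 > "$demo"

head "$demo"            # first 10 lines (the default)
head -n 3 "$demo"       # first 3 lines
tail -n 2 "$demo"       # last 2 lines

# Store single lines in variables for later use
first=$(head -n 1 "$demo")
last=$(tail -n 1 "$demo")
echo "first: $first, last: $last"
```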
grep
is a tool for searching files for specific content. It has many powerful applications, the basics of which will be explained here.
The basic syntax of grep is
grep 'search pattern' 'filename'
Test_file_genomics_data.txt
Show me the answer!
grep 'LG13' Test_file_genomics_data.txt
Test_file_genomics_data.txt
?
Show me the answer!
grep -c 'LG20' Test_file_genomics_data.txt
Use grep --help to obtain information about the flag -c. It stands for “count”.
Test_file_genomics_data.txt
Show me the answer!
grep -B 3 'scaffold' Test_file_genomics_data.txt
grep -A 3 'scaffold' Test_file_genomics_data.txt
You can also use the -C flag to print the lines before and after simultaneously.
Test_file_genomics_data.txt
Show me the answer!
grep -v 'LG' Test_file_genomics_data.txt
Use grep --help to obtain information about the flag -v. It stands for “invert-match”.
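The grep flags used above can be compared side by side on a tiny file. The sketch below assumes a made-up four-line, tab-separated table that only imitates the layout of the workshop file; the values are invented.

```shell
# Build a small tab-separated demo table (invented values)
demo=$(mktemp)
printf 'scaffold\tStart\tEnd\tFST\n'   > "$demo"
printf 'LG1\t1\t10000\t0.10\n'        >> "$demo"
printf 'LG1\t10001\t20000\t0.25\n'    >> "$demo"
printf 'LG2\t1\t10000\t0.05\n'        >> "$demo"

grep 'LG1' "$demo"                 # print all lines containing LG1
n_lg1=$(grep -c 'LG1' "$demo")     # -c: count matching lines instead
header=$(grep -v 'LG' "$demo")     # -v: lines that do NOT contain LG
grep -A 1 'scaffold' "$demo"       # -A 1: the match plus one line after it

echo "lines with LG1: $n_lg1"
echo "header line: $header"
```

Keep in mind that the pattern 'LG1' would also match LG10, LG11 and so on in a larger file, because grep looks for the pattern anywhere in the line.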
cut
allows you to extract a specific column from a file. By default, the column delimiter is TAB. You can change this using -d
Show me the answer!
cut -f 5 Test_file_genomics_data.txt
wc
counts the number of lines, characters or words in a file
Show me the answer!
wc -l Test_file_genomics_data.txt
sort
will sort lines of a text file
Show me the answer!
sort -g -k 4 Test_file_genomics_data.txt
-g applies a general numeric sort
-k specifies the column by which the lines should be sorted
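As a rough illustration of cut, wc and sort working on the same data, the sketch below uses an invented three-line, tab-separated table without a header. Setting LC_ALL=C is an extra precaution (not part of the exercise) so that the decimal point is read as '.' on any system.

```shell
# Three invented rows: scaffold, start, end, FST
demo=$(mktemp)
printf 'LG1\t1\t10000\t0.30\nLG1\t10001\t20000\t0.05\nLG2\t1\t10000\t0.12\n' > "$demo"

cut -f 4 "$demo"                          # extract column 4 only

n_lines=$(wc -l < "$demo" | tr -d ' ')    # count lines (tr strips padding spaces)
echo "number of lines: $n_lines"

LC_ALL=C sort -g -k 4 "$demo"             # sort rows by the value in column 4

# The first row after sorting holds the lowest value of column 4
lowest=$(LC_ALL=C sort -g -k 4 "$demo" | head -n 1 | cut -f 4)
echo "lowest value: $lowest"
```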
sed
has many powerful applications including the replacement of one block of text with another.
The syntax for this is
sed 's/pattern_to_find/replacement_text/g' filename
This will output the changed file contents to the screen.
If we want to redirect the output to a new file we can use >, for example:
sed 's/pattern_to_find/replacement_text/g' filename > new_filename.txt
Test_file_genomics_data_renamed.txt
. Then visually inspect the file to check if it worked.
Show me the answer!
sed 's/LG/LinkageGroup/g' Test_file_genomics_data.txt > Test_file_genomics_data_renamed.txt
head Test_file_genomics_data_renamed.txt
tail Test_file_genomics_data_renamed.txt
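The same replacement can be tested on a throw-away file before running it on real data. The sketch below assumes a made-up two-line input and writes the result next to it; the original file is left unchanged because sed prints to the screen (or to the redirected file) by default.

```shell
# Two invented input lines
demo=$(mktemp)
printf 'LG1\t1\nLG2\t1\n' > "$demo"

# Replace every LG with LinkageGroup and save the result in a new file
sed 's/LG/LinkageGroup/g' "$demo" > "$demo.renamed"

head "$demo.renamed"    # inspect the new file
head "$demo"            # the original still contains LG
```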
The pipe | is a very useful operator that sends the output of one UNIX command as input to another command.
Example:
grep 'LG12' Test_file_genomics_data.txt | head
Test_file_genomics_data.txt
using a single command line.
Show me the answer!
cut -f 2 Test_file_genomics_data.txt | tail -n 5 > New_file.txt
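Pipes can be chained, so that each command works on the output of the previous one. The sketch below is one possible chain on an invented three-line table: it keeps the LG1 rows, extracts the second column, keeps the last value and writes it to a file.

```shell
# Invented rows: scaffold, start, end
demo=$(mktemp)
printf 'LG1\t1\t10000\nLG1\t10001\t20000\nLG2\t1\t10000\n' > "$demo"

# grep selects rows, cut selects a column, tail keeps the last line
grep 'LG1' "$demo" | cut -f 2 | tail -n 1 > "$demo.out"

result=$(cat "$demo.out")
echo "last start position on LG1: $result"
```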
It is often useful to copy a file from a remote system (e.g., the Amazon server) to a local system (e.g., your computer), and vice versa.
To do this, a useful command is scp, which stands for ‘Secure Copy Protocol’. It works like cp in the sense that both commands require a source and a destination location for the copy operation; the difference is that with scp, one or both of the locations are on a remote system and require authentication.
This example would copy a file from your personal computer to the amazon server:
scp 'source_path/FILE_NAME.txt' '[email protected]:/destination_path/'
Show me the answer!
Then type:
scp [email protected]:/home/popgen/workshop_materials/20_unix_intro/basic_unix/Test_file_genomics_data.txt ~
! Please remember that we ask you NOT to download larger files from the Amazon instances to your computer since this is quite expensive (downloading smaller files like scripts or personal notes, as well as uploading to the Amazon instance is OK). We will provide a link to download the workshop material (including e.g., input files, templates, etc) at the end of the workshop.
7.1 Tab completion & up-arrow.
The tab button can be used to complete a file or directory name and to do a quick lookup of commands. If, for example, a file has a very long name, you can save time by using the tab completion.
Let’s say you wanted to copy a file named ‘reallyLongFilename.txt’ to the parent directory, type:
cp rea
… – hit the tab key …
reallyreallyreally_long_filename.txt
, then use ls -l
and the tab completion to fill in the filename
reallyreallyreally_extralong_filename.txt
and then again type ls -l
followed by trying to utilise the tab completion to fill in this even longer filename.
As you probably found out by doing the above, if there are two or more files that start the same way then tab completion after typing ‘rea’ will not fill in the whole name as there is ambiguity to which file you mean. In this instance pressing tab will fill in as much as it can (in this case ‘reallyreallyreally_’) and stop. Pressing the tab button twice will now display all the options of files that start with those letters, allowing you to see what extra letters you must type to complete the file.
In this case you can type an extra ‘e’ (giving you ‘reallyreallyreally_e’) and then hit tab and it will complete it for you.
Finally, the up-arrow key is very useful because pressing it repeatedly shows you the history of the commands you typed in the terminal previously.
7.2 Don’t lose your job – use screen
While working on a remote server, screen is very helpful for keeping multiple jobs running in multiple ‘windows’ at the same time without losing them if your local computer crashes or you lose the connection.
screen
in your terminal. Press ‘space’ to get to the prompt of the screen.
Start a long process e.g., a ‘word count’ on a large(ish) VCF file: zcat ~/workshop_materials/20_unix_intro/basic_unix/data.vcf.gz | wc
Detach from the screen by holding control and pressing a, then pressing d (‘ctrl a’ followed by ‘d’). Do the same again – i.e. open another screen and start a long job.
If you intend to have multiple screens and run different jobs on them, it can be useful to give each session a meaningful name, so that you know which job is running in which screen. You can do that with the -S flag followed by the name of your choice when starting screen, like screen -S
.
To list all running screens you can use screen -list or screen -r.
To re-attach your screen, type screen -r followed by the screen name if you named it when starting the screen session; otherwise you can use the first digits that are listed in the first column after using the ‘screen -r’ command.
To exit the screen, type ‘exit’ within the screen. If only one screen is open, screen -r will directly re-attach you to this screen. After your ‘long jobs’ are finished, exit both screens by typing exit while being attached to the screen.
7.3 Changing folder permissions
ls -l in the folder ~/workshop_materials/20_unix_intro/basic_unix. What information does the first column of the output contain? What is the meaning of the 3rd and 4th column?
Show me the answer!
-rw-r--r-- 1 popgen workshop 5003911 Jan 16 18:07 Test_file_genomics_data.txt
-rw-r--r-- 1 popgen workshop 184547450 Jan 17 15:52 data.vcf.gz
The first column tells you about the permission rights in symbolic notation:
The first character just indicates the file type (‘-‘ regular file, ‘d’ directory file). The remaining nine characters are in three sets, each representing a class of permissions as three characters. The first set represents the ‘user’ class (what the owner can do), the second the ‘group’ class (what the group members can do), and the third the ‘others’ class (what other users can do). Within each triad, the first character ‘r’ indicates read access, the second ‘w’ write access, and the third ‘x’ execute permission. A ‘-‘ in any of these positions means that the permission is not granted.
The third column tells you the name of the file ‘owner’, and the 4th the name of the file ‘group’.
-rwxr-xr--
. Which rights do you have as group member, or as ‘other’?
Show me the answer!
Access rights can be changed using the chmod command, followed by a numerical code for the file permissions, and then the file name. You can learn the numerical code, or use a chmod calculator.
touch some_file.txt, type ls -l to see who is the owner and the group, and which rights they have. Change the access rights of this file to -rw-rw----. Type ls -l again to see if the command took effect.
Show me the answer!
chmod 660 some_file.txt
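The numerical code is built from one digit per class (user, group, others), where each digit is the sum of 4 for read, 2 for write and 1 for execute. So 6 = 4 + 2 means ‘rw-’ and 0 means ‘---’. The sketch below checks this on a temporary file created with mktemp (an assumption; any file you own would work the same way).

```shell
# A throw-away file to experiment on
f=$(mktemp)

chmod 660 "$f"                            # user rw- (4+2), group rw- (4+2), others --- (0)
perms_660=$(ls -l "$f" | cut -c 1-10)     # first 10 characters: type + 9 permission flags
echo "$perms_660"

chmod 754 "$f"                            # user rwx (4+2+1), group r-x (4+1), others r-- (4)
perms_754=$(ls -l "$f" | cut -c 1-10)
echo "$perms_754"
```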
7.4 awk – a stream programming language
awk has the general structure of:
awk 'pattern {action}' filename
awk is column (field) aware:
- $1 corresponds to the first column
- $2 corresponds to the second column
- $3 corresponds to the third column
- etc.
- $0 corresponds to the whole line
‘pattern’ can be any logical statement:
- $3 > 0 – if column 3 is greater than 0
- $1 == 32 – if column 1 equals 32
- $1 == $3 – if column 1 equals column 3
- $1 == "consensus" – if column 1 equals the string “consensus”
If ‘pattern’ is true, then everything in {…} is executed. If no {…} is given, the whole line is printed.
awk
Show me the answer!
awk ' $1 == "LG7"' Test_file_genomics_data.txt | awk '$2 < 100000'
We can first select the lines corresponding to LG7 and then print the ones corresponding to the windows whose start coordinates are smaller than 100000 bp
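As an alternative sketch, the two conditions can also be combined in a single awk call with && (logical AND). The input below is an invented three-line table, not the workshop file.

```shell
# Invented rows: scaffold, start, end
demo=$(mktemp)
printf 'LG7\t1\t10000\nLG7\t150001\t160000\nLG8\t1\t10000\n' > "$demo"

# Print the whole line ($0) if column 1 is LG7 AND column 2 is below 100000
awk '$1 == "LG7" && $2 < 100000 {print $0}' "$demo"

# Without an action the default is to print the line, so this is equivalent
hits=$(awk '$1 == "LG7" && $2 < 100000' "$demo" | wc -l | tr -d ' ')
echo "matching windows: $hits"
```

Only the first row satisfies both conditions: the second row is on LG7 but starts above 100000, and the third row is on LG8.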
7.5 Repeating commands using loops
One of the really powerful things you can do in the terminal is the ability to repeat commands on multiple targets. For example creating many folders, moving multiple files into each folder, running commands and pipelines (sequences of commands) on multiple files/samples etc.
This can be accomplished by using the so called 'for loop'. For loops use two features of the shell that have not been covered before: variables and wildcards.
7.5.1 Variables
A variable in the terminal shell can be seen as a placeholder for some text, such as a directory name, filename, number, sentence etc.
A variable is initialised using the name you designate for it. The name can be whatever you want (e.g. 'direc', 'superman', 'x', 'file' etc.) as long as it is a single word without spaces or special characters. The variable is then addressed using $ in front of its name. For example, if the variable is named 'direc' it is referenced using $direc.
The contents of the variable can be changed within a loop automatically without having to do so yourself manually.
7.5.2 Wildcards
The asterisk * is referred to as a 'wildcard' symbol in UNIX. It allows for matching of filenames, directories etc. that have certain sections of their name in common.
For example, if the names of all the files you are interested in start with ‘result’ (e.g. result.txt, result.tree, result.nexus, resultFile, result) these will all be recognised using result*.
Alternatively, if all files of interest end with .txt, you can loop over all of them using *.txt
Both variables and wildcards are used in for loops to maximise their power.
A for-loop has the syntax:
for 'variable' in 'list'
do
'tasks to repeat for each item in list'
done
Each section is written on a separate line (e.g. after ‘do’ hit enter) and instead of a prompt the terminal will display a > to indicate that you are in a multi-line command.
Alternatively, you can place a loop into a single line using the ; symbol to separate the commands, except for the line break after the 'do' where there should not be a ;
For example, we can use the command ‘echo’ to print something to the screen. This is used like
echo 'hello'
which will print hello to the terminal.
We can also define a variable whose content is the word 'hello'
my_variable='hello'
and print the content of the variable to the screen with echo $my_variable
We can use a for-loop to print the numbers 1 to 10 to the screen by typing
for num in {1..10}
do
echo $num
done
This loop starts at 1, places the number in the variable num which can then be accessed inside the loop through $num.
Show me the answer!
for num in {1..10};do echo $num;done;
Note the lack of semi-colon after ‘do’
Show me the answer!
for num in {1..3};do mkdir "run"$num;done;
We place 'run' before the variable to tell the system we want this string to be placed before the variable as part of the directory name. If we wanted it placed after (e.g. create 1run etc.) we could use $num"run"
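Variables and wildcards can be combined in one loop. The sketch below is an illustration under assumed file names (result1.txt to result3.txt, created in a temporary directory): the first loop creates the files, the second uses the wildcard result*.txt to visit each of them.

```shell
# Work in a throw-away directory
work=$(mktemp -d)
cd "$work"

# Create three empty files: result1.txt, result2.txt, result3.txt
for num in 1 2 3
do
    touch "result$num.txt"
done

# The wildcard expands to every matching filename; $file holds one per round
count=0
for file in result*.txt
do
    echo "found $file"
    count=$((count + 1))
done
echo "number of files: $count"
```

Putting the variable inside double quotes, as in "result$num.txt", keeps the name together even if a value contains spaces.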
7.6 Output redirection
You already learned how to redirect the output of a command to a file using >. Instead of writing to a new file, you can also append to an existing file by using >>
less
.
Show me the answer!
echo 'This is cool!' >> some_file.txt
The default output that is written to the screen and that you redirect to a file is called the 'stdout' (standard out). In addition, there are two more default standard files: 'stdin' (standard input) - the default place where commands listen for information, and 'stderr' (standard error) - used to write error messages.
Show me the answer!
less some_file_that_does_not_exist
some_file_that_does_not_exist: No such file or directory
To save the 'stderr' message, you need to redirect 'stderr' to a file by addressing it using a stream number.
1> will redirect your 'stdout',
2> will redirect your 'stderr',
&> will redirect 'stdout' and 'stderr'.
Show me the answer!
less some_file_that_does_not_exist 2> error_msg.txt
Instead of writing the stdout/stderr to the screen or to a file, you can also immediately discard it by redirecting the output to /dev/null.
less some_file_that_does_not_exist 2> /dev/null
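The redirection operators can be compared in one short session. The sketch below makes a few assumptions: it uses ls on a file name that does not exist (no_such_file) to trigger an error message, it writes into a temporary directory, and it uses the portable form > file 2>&1 where bash would also accept &> file. The || true after each failing command only prevents the error status from stopping a script.

```shell
# Work in a throw-away directory
work=$(mktemp -d)
cd "$work"

echo 'first line'  > out.txt      # > creates the file (or overwrites it)
echo 'second line' >> out.txt     # >> appends to the existing file

ls no_such_file 2> err.txt || true           # stderr goes to err.txt
ls no_such_file > all.txt 2>&1 || true       # stdout and stderr go to all.txt
ls no_such_file 2> /dev/null || true         # stderr is discarded

n_lines=$(wc -l < out.txt | tr -d ' ')
echo "lines in out.txt: $n_lines"
cat err.txt
```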
Now what if we want to save some of the commands and run them automatically without the need to type them all over again, or, for example, apply the same commands later to new files? For that we can write a bash script.
nano
, by including the for loop we used before:
Show me the answer!
touch my_script.sh
nano my_script.sh
You can write the following in your script:
#!/bin/bash
# a for-loop to print text and numbers from 1 to 10
for num in {1..10}; do
echo 'day '$num ':has been cool!'
done
Using the '#' we can write notes in our script. They can be a small description of what the script is about.
Save and exit the script: ctrl o to save, ENTER to validate saving, ctrl x to exit.
bash my_script.sh
This will print the result of our for loop to the screen.
Instead of printing it to the screen we could save the output in a file just by using >> to redirect the output.
Show me the answer!
Modify your script by adding >> after the loop and specify a file where the output would be saved.
#!/bin/bash
# a for-loop to print text and numbers from 1 to 10
for num in {1..10}; do
echo 'day '$num ':has been cool!'
done >> my_output.txt
Check the output file that was created with less my_output.txt
We can go a bit further and modify our script to make it interactive by allowing the script to get some of its arguments directly in the command line.
Modify the script by replacing ':has been cool!' with ${1} and my_output with ${2}:
#!/bin/bash
# a for-loop to print numbers from 1 to 10 followed by the text given as first argument
for num in {1..10}; do
echo 'day '$num "${1}"
done >> "${2}.txt"
Save the changes and exit the script.
The ${1} and ${2} variables correspond to the first and second argument we will specify in the command line when running the script. Let's say we want to specify a different text for the result of the day ':has been cool!' and change it to 'great' and to save it in a different file called 'output2.txt', then we need to provide those arguments in the command line.
bash my_script.sh great output2
Now the new output file 'output2.txt' should have the word 'great' instead of the initial ':has been cool!'.
Show me the answer!
cat my_output.txt
cat output2.txt
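For reference, the whole workflow (write the script, run it with two arguments, inspect the output) can be reproduced without opening an editor. The sketch below is only an illustration: the script name demo_script.sh is made up, the loop is shortened to three rounds, and the script is written with a ‘here document’ (cat > file <<'EOF' ... EOF), a technique not covered in this activity that simply saves the enclosed lines into a file.

```shell
# Work in a throw-away directory
work=$(mktemp -d)
cd "$work"

# Write the script; the quoted 'EOF' keeps ${1} and ${2} from being expanded now
cat > demo_script.sh <<'EOF'
#!/bin/bash
# usage: bash demo_script.sh TEXT OUTPUT_PREFIX
for num in {1..3}; do
    echo "day $num ${1}"
done >> "${2}.txt"
EOF

# First argument: the text to print; second argument: the output file prefix
bash demo_script.sh great output2

cat output2.txt
first=$(head -n 1 output2.txt)
n_lines=$(wc -l < output2.txt | tr -d ' ')
```

Because the script appends with >>, running it a second time with the same arguments adds three more lines to output2.txt instead of replacing the file.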