Connecting to your Amazon instance
Michael Matschiner, Milan Malinsky, and the Workshop Team; 20th January 2020
Amazon Web Services and Amazon instances
For this workshop we are going to use a high-performance cloud platform called Amazon Web Services (AWS). This is a flexible resource to get computational work done quickly and relatively inexpensively. We have set up individual virtual machines called “Amazon instances” for each participant. These have all the software and data you will need, along with sufficient CPU, RAM, and storage resources. Your instance IP address can be found on this list, which is also linked from the main workshop page (in the box with links).
The IP addresses are going to change regularly during the workshop, as we’ll stop the instances overnight and resume them the next day to reduce the cost that we need to pay to Amazon. Whenever instances are restarted, the IP addresses will change; however, the content of the instance will remain the same. So the files that you write on your instance on any given day should usually still be present on the next. We are going to announce when we stop instances overnight, and we will accept requests to leave instances running in case you’d like to keep working on an analysis.
In addition to stopping instances overnight, we may also need to issue completely new instances at some point, for example if we need to install additional software or add datasets for activities. In such cases, we’ll terminate the old instances, meaning that the files on these instances will no longer be accessible to you. We are going to announce if and when that happens, and we will give you time to finish analyses or download data if you request it.
Generally, however, we discourage downloading data (at least large datasets) from the Amazon instances, because we have to pay for download volume. We are going to provide a Dropbox directory at the end of the workshop, which will contain the datasets and scripts used during the workshop, and we ask you to download this Dropbox directory (if interested) instead of copying large volumes of data from the Amazon instances.
Logging in to your Amazon instance
There are different ways to interact with a running Amazon instance, and the most convenient method usually differs between activities and depends on your own operating system. Below, we describe the three most important ways of accessing the instance: via programs that support SSH and SCP, and through web browsers with Apache Guacamole and RStudio. At the beginning of each activity description, we specify the most convenient method for that activity.
Using SSH / SCP
A number of the activities require only or mostly command-line access to the Amazon instance. In these cases we recommend that you access the instance directly from your machine via a terminal program and the “shell”. You will learn more about how to use the shell in the Unix tutorial. The shell is often the only way to run genomic analyses and interact with high-performance compute clusters. The interaction with a remote server through the shell takes place primarily by means of two programs, called SSH (Secure Shell) and SCP (Secure Copy Protocol). In brief, the difference between the two programs is that SSH allows access to a remote server while SCP allows copying of files from and to a remote server. The shell, including SSH and SCP, can be used through terminal programs that are available for all operating systems, natively or through additional installation.
Identify a suitable terminal program installed or installable on your own machine.
Help needed? The following terminal programs are available on Mac OS X, Linux, and Windows:
Mac OS X comes preinstalled with an application called “Terminal”; you can search for it with Spotlight or in Finder.
All Linux distributions come with a terminal application. It may be in a different place depending on which type of Linux system you have. If you are using Linux you probably already know how to find it; if not, please ask one of the instructors for help.
Most Windows systems unfortunately do not have a terminal program that would be useful for our purposes. There are several free options that you could use, including PuTTY and MobaXterm. The newest Windows operating system also comes with Linux Bash Shell and Windows Terminal; however, we have been unable to test these terminal programs. Below, we provide instructions on the download and installation of PuTTY, a terminal program that has previously been used successfully in the Workshop.
To download PuTTY, go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html and select either the 32-bit or 64-bit Windows installer, depending on the version of Windows on your machine.
Double-click on the file after downloading. In the field below “Host Name” (marked with an orange rectangle), insert the IP address of the Amazon instance, as shown in the screenshot below. Then click “Open”.
If the warning message shown below appears, click “Yes”. This is only a check that you trust the computer you are connecting to.
Make sure you know how to find these programs quickly on your computer; you’ll need them frequently during the workshop.
- Apple Mac
Open your favourite terminal application and connect to your instance via SSH, using the username “popgen” and the following command after replacing “XX-XXX-XXX-XXX” with your instance IP address:
ssh popgen@ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com
You will be asked for a password, which will be provided by us on the whiteboard in the House of Prelate.
To close the connection to the Amazon instance, just type the command
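As a rough sketch of the SSH and SCP workflow described in this section, the snippet below assembles the host name from an instance IP address and prints the corresponding commands rather than running them. The IP address, the helper function `instance_host`, and the file name `results.txt` are made up for illustration; the host name pattern is assumed to be the same one used in the browser addresses further down this page.

```shell
# Hypothetical example address; use the one listed for your own instance.
INSTANCE_IP="54.123.45.67"

# Build the instance host name (assumed pattern: dots in the IP address
# become dashes, as in the "XX-XXX-XXX-XXX" placeholder on this page).
instance_host() {
    printf 'ec2-%s.compute-1.amazonaws.com\n' "$(printf '%s' "$1" | tr '.' '-')"
}

HOST="$(instance_host "$INSTANCE_IP")"

# Print the commands instead of executing them, so the sketch is safe to try.
# Log in to the instance as user "popgen":
echo "ssh popgen@${HOST}"
# Copy a file (placeholder name) from the instance's home directory to the
# current local directory:
echo "scp popgen@${HOST}:results.txt ."
# Copy a local file (placeholder name) to the home directory on the instance:
echo "scp results.txt popgen@${HOST}:"
```

Copy the printed commands into your terminal once the address is correct. Inside a running SSH session, the standard shell built-in `exit` normally ends the session and returns you to your own machine.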
Via the web browser and Apache Guacamole
Some activities are going to require the graphical user interface (GUI) of one or more programs. To be able to use GUIs on the Amazon instance, we can access the instance through the “Apache Guacamole” remote desktop gateway. We have already set up Guacamole on the Amazon instances, allowing you to access the instance’s desktop through your web browser. Access through Guacamole should work reliably for all participants; however, using GUIs through Guacamole may not feel quite as smooth as on your own desktop, because you may experience short lags when you type text, scroll, or move objects on the desktop. Because of these lags, it may be preferable to run activities without GUI requirements through SSH and SCP (see above) and to download and run GUI programs on your own machine if possible. Another potential issue with Guacamole is that copy-pasting from outside of Guacamole into it may not work on all operating systems. You may need to test both options (Guacamole and SSH/SCP) for some activities to find the best solution in each case.
- To connect to an Amazon instance through Guacamole, open a new browser window or tab (as far as we know, all major browsers seem to work just fine). Then enter “http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8080/guacamole/” into the browser’s address bar, again replacing “XX-XXX-XXX-XXX” with your Amazon instance IP address. After pressing “Enter”, you should see the login page as shown in the screenshot below.
- As when connecting through SSH or SCP, the username is “popgen” and the password can be found on the whiteboard in the House of Prelate. After you hit the “Login” button, you get to choose whether you want a connection to a terminal provided by Guacamole (this may be useful for Windows users and for people whose terminal application on their laptop is for some reason not working well), or a connection to the desktop (for activities requiring a GUI), as shown in the next screenshot.
- If you click on the “Desktop” option, you’ll see another login window, where you’ll need to enter the same username and password as before.
- After a few seconds, this should bring you to the desktop of the Amazon instance, as shown below.
- To close the connection to the Amazon instance, just close the browser tab or window.
Via the web browser and RStudio
For some (actually many) activities, we are going to use the R environment for statistical computing. If you don’t know what R is or how to use it, you are going to learn more about R later today. A good way to interact with the R installation on your Amazon instance is via the RStudio server installed there. You can connect to it with your web browser, similar to how you connected to Guacamole above.
- Open a new browser window (or tab) and enter http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787 into the browser’s address bar, again replacing “XX-XXX-XXX-XXX” with your Amazon instance IP address. This should bring you to the login page of RStudio, as shown below.
- Enter again the “popgen” username and the usual password (see the whiteboard), hit “Sign In”, and that’s it, you are in RStudio. The browser window should then look similar to the screenshot below.
- As with Guacamole, just close the browser tab or window to close the connection to the Amazon instance.
More information on how to use RStudio will be given in the Introduction to R activity.
Via the web browser and Jupyter Notebooks
For a few activities that are heavily based on Python scripts, we are also going to interact with the Amazon instances through Jupyter Notebooks. This interaction will be very similar to the access through Guacamole or RStudio, except that in this case a Jupyter server will need to be activated prior to the connection through the browser.
- To activate the Jupyter server on your instance, you’ll first need to connect to the instance with SSH as explained above. Thus, use “popgen” as the username and the password from the whiteboard.
Move into a directory for a Python activity, such as
Execute the command
conda activate conda
- Enter http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8888 into the browser’s address bar, again replacing “XX-XXX-XXX-XXX” with your Amazon instance IP address. You should then be asked for the password as shown in the next screenshot. Type the same password as before.
After hitting the “Log In” button, you should get to a list of files, including some that end in “.ipynb”. These are the Jupyter Notebooks. Click on the file named 202001_jupyter_pandas_tutorial.ipynb, which should open the notebook shown in the next screenshot.
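The browser-based access routes on this page (Guacamole, RStudio, and Jupyter) differ only in the port and path appended to the instance host name. A minimal sketch that prints all three addresses for a given instance could look as follows; the function name `service_url` and the example IP address are made up for illustration, while the ports and paths are the ones given on this page.

```shell
# Hypothetical example address; use the one listed for your own instance.
INSTANCE_IP="54-123-45-67"

# Print the browser address for one of the services described on this page.
# Accepts the IP address with either dots or dashes as separators.
service_url() {
    ip_dashed="$(printf '%s' "$1" | tr '.' '-')"
    host="ec2-${ip_dashed}.compute-1.amazonaws.com"
    case "$2" in
        guacamole) printf 'http://%s:8080/guacamole/\n' "$host" ;;
        rstudio)   printf 'http://%s:8787\n' "$host" ;;
        jupyter)   printf 'http://%s:8888\n' "$host" ;;
        *)         echo "unknown service: $2" >&2; return 1 ;;
    esac
}

# List the addresses for all three services.
for service in guacamole rstudio jupyter; do
    service_url "$INSTANCE_IP" "$service"
done
```

Because the IP addresses change whenever the instances are restarted, regenerating the addresses this way each morning may be less error-prone than editing bookmarks by hand.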