A brief tutorial on the use of jupyter notebooks and the python data analysis library pandas for genomic data analysis.

Workshop on Population and Speciation Genomics, Český Krumlov, June 2022.
By Hannes Svardal

Starting a jupyter notebook on your web instance and connecting to it

For a few activities that are heavily based on Python scripts we are also going to interact with the Amazon instances through Jupyter Notebooks. This interaction will be very similar to the access through Guacamole or RStudio, except that in this case a Jupyter server will need to be activated prior to the connection through the browser.

  • To activate the Jupyter server on your instance, you’ll first need to connect to the instance with SSH, as explained above. Use “wpsg” as the username and the password from the whiteboard.

 

  • Navigate into the tutorial directory: cd ~/workshop_materials/a06_python_jupyter_intro
  • Start a screen session by typing: screen
  • Confirm with Return
  • Start the conda virtual environment: conda activate sweeps-env (Which conda environment to activate depends on the session you are in. This environment, set up for Derek Setter’s Selective Sweeps exercise, works well for this basic intro because it contains all required Python packages.)
  • Start the notebook server: jupyter notebook --no-browser --port=8889
  • The command blocks the terminal. That is normal; keep it running. To get back to a functional terminal, detach from the screen session by pressing Ctrl + a, then d (you can reattach later with screen -r).
  • In your local browser, navigate to the web address: http://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8889 (replace XXX with your Amazon instance IP address)
  • You will be asked to enter a password (see whiteboard).

  • You should then see the contents of the folder in which the notebook server is running.
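
The address in the step above can be assembled by hand, but as a small illustration, here is a sketch in Python that builds it from an IP address. It assumes that the four XXX groups stand for the four numbers of the instance's public IPv4 address; the function name notebook_url and the example address 203.0.113.7 (a reserved documentation address) are made up for this example.

```python
def notebook_url(public_ip: str, port: int = 8889) -> str:
    """Build the browser address of the notebook server from a public IPv4 address.

    Assumption: the XXX groups in the workshop address are the four numbers
    of the IP address, joined by dashes instead of dots.
    """
    octets = public_ip.strip().split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {public_ip!r}")
    return f"http://ec2-{'-'.join(octets)}.compute-1.amazonaws.com:{port}"


# Hypothetical example address, replace it with the IP of your own instance.
example_url = notebook_url("203.0.113.7")
print(example_url)  # prints http://ec2-203-0-113-7.compute-1.amazonaws.com:8889
```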

 

Note: Running the tutorial after the workshop, on your local machine

If you have python and jupyter installed, you can simply run the notebook in the following way:

    • Start the notebook server: jupyter notebook --no-browser --port=8889
    • In your local browser, navigate to the web address: localhost:8889
    • If you have not set up a password, you might also need to copy the token from the address that is shown in the terminal where you started the notebook server.

Further resources about jupyter notebooks can be found here:

Jupyter notebook basics

Note: Advanced users who are familiar with jupyter and python can jump directly to the section “Genome data analysis with python in jupyter notebooks”.

You should now see the jupyter interface in your browser showing you a list of folder contents.

In this interface you can navigate through folders and files and open files (jupyter notebooks or text) by clicking on them. Careful: Don’t try to open very large, compressed, or binary files.

To get familiar with jupyter notebooks, I first want you to create a new jupyter notebook. Click on the New button and select Python 3 (ipykernel).

This should start a new jupyter notebook in a new tab running a Python 3 kernel. Jupyter notebooks can run different kernels: Python 2/3, R, Julia, bash. (R and Julia are not installed on the AWS instances, which is why you did not see these options.)

 

Now you should have a notebook like this in front of you.

  • At the top of the webpage, the notebook environment has a header and a toolbar, which can be used to change settings, formatting, and interrupt or restart the kernel that interprets the notebook cells.
  • The body of the notebook is built up of cells of two major types: markdown cells and code cells. You can set the type for each cell either using the toolbar or with keyboard commands. The right-most button in the toolbar shows all keyboard shortcuts.
  • To create more cells you can use either the “plus” button in the toolbar above or a keyboard shortcut (on most operating systems, press ESC, then b; see also Help). Try to define variables and do calculations in these cells.

Rename this jupyter notebook to a name of your choice. I will refer to it as your notebook.

Code cells

Code cells contain computer code (in our case written in Python 3). By default, new cells are code cells. Code cells have an input field in which you type code. A cell is evaluated by pressing Ctrl + Enter or Shift + Enter while the cursor is in the cell; the difference is that the latter also advances to the next cell. Evaluation produces an output field with the result that would be returned to stdout in a normal python (or R) session. Note that all lines are evaluated, but by default only the result of the last operation is output, and only if it is not assigned to a variable.

Enter the code cell you have in front of you and type some python code; let’s start with 1+1. Execute the code cell by pressing Ctrl + Enter. What happened and what output do you get?


Once you evaluate, your code is interpreted by the python kernel. The result of the last line of code is returned as output, unless it is assigned to a variable. Hence, you should have received the expected result 2.
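
To make the output rule more concrete, here is a small sketch of a single cell that mixes assignments, print() calls, and bare expressions. The variable name total is made up for this example, and the comments describe what I would expect to see in a notebook.

```python
total = 2 + 3   # assignment: evaluated, but nothing is displayed
total * 2       # bare expression that is not the last line: evaluated, not displayed
print(total)    # print() always writes to the output field, wherever it appears
total + 1       # last line and not assigned: its value is displayed as the cell output
```

If you run the same lines as a plain python script instead of a notebook cell, only the print() call produces output.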

You can also type multiple commands in a single cell. Create a new cell, type and execute:

a = 5
b = 3
c = a * b

a, b, and c are variables; their assigned values are remembered by the python kernel. Test this by typing and executing in a new cell:

print('a is', a)
print('b is', b)
print('c is a*b, which is', c)

What output do you get?


a is 5
b is 3
c is a*b, which is 15
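
Because the kernel remembers values rather than formulas, changing a later on does not update c automatically. The sketch below illustrates this; the comments marking separate cells are only for illustration, and the code behaves the same when run in one piece.

```python
# --- first cell ---
a = 5
b = 3
c = a * b

# --- a later cell ---
a = 10
print('c is still', c)   # c keeps the value computed earlier

# --- re-run the calculation to update c ---
c = a * b
print('c is now', c)
```

This is worth keeping in mind when you execute cells out of order: the kernel state reflects the order in which cells were run, not the order in which they appear in the notebook.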

 

Markdown cells

Markdown cells contain text that can be formatted using Markdown, a lightweight markup syntax (HTML tags also work). See here for more info: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html Double-click into a markdown cell to get into edit mode.

In your notebook, create a markdown cell above your first code cell. Add a title and some explanatory text of what the cells below do. Tip: If you are not familiar with markdown, check the link above.
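
As a starting point, the sketch below holds some example markdown text in a Python string (the title and wording are made up). You can copy the text between the triple quotes into a markdown cell, or run the block in a code cell, where IPython's Markdown display class should render it. Outside a notebook it falls back to printing the raw text.

```python
# Example text for a markdown cell (title and wording are invented for illustration).
markdown_source = """\
# Getting to know Jupyter

The cells below are my first experiments with **code cells**:

- the first one evaluates `1+1`
- the next ones define the variables *a*, *b* and *c* and print them
"""

try:
    # Available wherever a Jupyter Python kernel is installed.
    from IPython.display import Markdown, display
    display(Markdown(markdown_source))
except ImportError:
    # Plain python without IPython: just show the markdown source.
    print(markdown_source)
```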

Using python in Jupyter notebooks

Needless to say, in order to take full advantage of jupyter notebooks and python for genome data analysis, you need to have or gain some understanding of how python works. (However, remember that you can also run R kernels and use R code in jupyter notebooks if R is installed.)

Unfortunately, we do not have the time to give a thorough python introduction here. If you want to learn python from the ground up (which I encourage if you do not know it yet), follow one of the many tutorials available online at some point. I recommend the one from w3schools.

Here we are going to give a very quick introduction to some of the main python principles and some useful features, which will be important for several of the practicals this week.

For this, please go back to your jupyter server at http://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8889 and click on 2022-05_python_jupyter_tutorial.ipynb to open the notebook and start a python kernel.

  • Follow the exercises in the jupyter notebook. If you have questions please ask.

Genome data analysis with python in jupyter notebooks

This part here is meant for more advanced users that already know the basics of python and jupyter notebooks. As a beginner you might not be able to finish this during the practical, but you can come back to it at a later time if you are interested.

Please open the jupyter notebook 2022-05_python_jupyter_genome_data_analysis.ipynb from your jupyter server interface. Follow the instructions and interpretations there.
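
To give a flavour of what pandas offers for this kind of analysis, here is a minimal sketch that is not taken from the notebook. It uses a tiny, made-up genotype table (the column names, positions, and values are all hypothetical) and computes the alternate allele frequency per variant, assuming diploid samples and genotypes coded as the number of alternate alleles.

```python
import io

import pandas as pd

# Hypothetical toy data, not from the workshop: one row per variant,
# one column per sample, entries = number of alternate alleles (0, 1 or 2).
toy_tsv = """chrom\tpos\tsample_A\tsample_B\tsample_C
chr1\t1050\t0\t1\t2
chr1\t2300\t1\t1\t0
chr2\t780\t2\t2\t2
"""

# Read the tab-separated text into a DataFrame indexed by chromosome and position.
genotypes = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col=["chrom", "pos"])

# Alternate allele frequency per variant:
# number of alternate alleles / (2 * number of diploid samples).
alt_freq = genotypes.sum(axis=1) / (2 * genotypes.shape[1])
print(alt_freq)

# Mean alternate allele frequency per chromosome.
mean_by_chrom = alt_freq.groupby(level="chrom").mean()
print(mean_by_chrom)
```

Real data would typically be read from a file rather than from a string, but the same pandas operations (sum along an axis, groupby on an index level) apply.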

You can download the whole material from this tutorial as a zip file here.