AWS instructions
Running AWS: instances, JupyterLab, file transfers
(c) 2018 Justin Bois and Julian Wagner. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
As we start to do more involved statistical inference (and image processing), you will want to run your calculations on more powerful machines. There are many options for this, including Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and Caltech's own high performance computing center. In this tutorial, we will show you how to get up and running with AWS. While it looks like a lot of steps, some of the steps are done only once, so it is not much more work to launch instances after the initial setup.
Operating on AWS
1. Create an Amazon Web Services account or obtain IAM login credentials for an existing account
The first thing to do is create an AWS account by clicking on the link on the upper right corner of this page. Once you have your account you can go back to it by clicking on Sign In To Console, which should now be in the upper right corner of the same site. If you are working in the Parker lab, instead of doing this, you should contact the AWS account manager for the lab to get credentials to log into the lab account. As of now, this is Julian Wagner.
2. Launch your instance
After you have an account, you can launch your instance. We have set up an Amazon Machine Image (AMI), which has the software you need for the course installed. The AMI is available in Oregon. Be sure to select this region from the top right corner of the console. You should use the same region throughout the course, since that is physically where your machine will live.
- To launch an instance with this AMI, choose EC2 among the services available from your AWS console. You can select EC2 from the `Services` pulldown menu at the top of your screen.
- After selecting EC2, you will see the EC2 Dashboard on the left pane. Under the `Images` heading there, click `AMIs`.
- The resulting menu will default to AMIs `Owned by me` (you likely do not have any). Select instead `Public images`.
- In the search menu, search for `bi160`, and the class AMI should appear. If it does not, double check to make sure your region is Oregon.
- You will see the bi160_ami AMI listed. Right click on it and select `Launch`. (You may also select `Spot request` if you want to save some money, but when you stop a spot instance, you will lose whatever you stored there.)
- You will then be given many choices of instance types. The `t2.micro` type will work for testing purposes and small tasks that are not computationally intensive. This instance is free, provided your storage does not exceed 30 GB, which it shouldn't (yet). When we use AWS for more involved calculations that require more RAM, you should look into the memory optimized instances (the `M5` series). These are much more expensive (dollars per hour of use), so you probably shouldn't launch them until you have a particular task that needs more compute.
- After you select which instance you want, click `Next: Configure Instance Details` at the bottom of the screen.
- You can leave the defaults in the Step 3: Configure Instance Details page and click `Next: Add Storage` at the bottom of the page.
- You can change the root storage size in the `Size (GiB)` box. 30 GB is the default for this AMI, which falls within the free tier and should be enough. EBS storage (the type of storage used by instances) is quite expensive (10 cents per GB per month). This might not sound so expensive, but a 1 terabyte volume costs about \$3.30 **per day** it is registered, or **\$100 per month**. Rather than designating a large root volume, it is better to attach additional volumes for transitory storage needs; it is a much less expensive way of working. How to do this is described in a later section of this document.
- Click `Next: Add Tags` at the bottom of the page.
- Leave the defaults and click `Next: Add Security Group`.
- Select `Create a new security group`. You can name the group and add a description so you can use it again later. Leave the default SSH rule as it is.
- Click `Add Rule`, and select `HTTPS` from the pulldown menu. Leave the defaults that appear for that rule.
- Click `Add Rule`, and select `Custom TCP Rule`. Change the `Port Range` to `8888`.
- For the `Source` column, you may want to adjust the source to be `Anywhere` for each pulldown menu. That allows you to access your instance from anywhere, but it is not very secure. It is convenient because you can run the instance from your work machine and then from home later on. However, you may want to be more secure and instead select `My IP` or provide a custom rule for IP addresses that may access the instance.
- Click `Review and Launch` at the bottom of the page and make sure everything is specified as you like.
- Click `Launch`.
- You will be prompted for a key pair. For your first time launching, you will need to create a key pair. This will be provided to you only once, so download it and store it locally on your machine. DO NOT, I repeat, DO NOT store it in any git repository or anything that is backed up to the cloud, like Dropbox. ONLY store it locally on your machine and never let it out to the internet. If it is your second time launching, you can use an existing key pair.
- Finally, click `Launch Instance`.
After following those steps, your instance will spin up. You can view your running instances by clicking on `Instances` on the dashboard on the left part of the screen.

It will take a while for your instance to spin up. When the `Instance State` says `running` and the `Status Checks` are complete, your instance is ready for you.
- As a last step, mouse over your instance in the list on the dashboard and click on the little pencil icon to rename your instance. Give it a name that includes your name and the purpose of the instance so we can keep track of what's allocated and why.
3. Prepare your keypair for use in connecting to your instance
Now that your instance is launched, you can connect to it using your computer and the ssh protocol. The instructions work for Windows, macOS, or Linux, assuming you have a terminal running bash. In Windows, this is accomplished using GitBash. For macOS, use Terminal.
- Identify where you put your keypair file. For the purposes of this exercise, I will assume that you have a directory in your home directory called `key_pairs/` and that your keypair file is `~/key_pairs/julian_aws_keypair.pem`.
- Change permissions on your keypair for security. Do this in the terminal using

chmod 400 ~/key_pairs/julian_aws_keypair.pem
4. Connect to your instance
- Open a new GitBash (Windows) or Terminal (macOS) window.
- SSH into your instance in the terminal. To do this, click on your instance on the `Instances` page in the Management Console. At the bottom of the webpage will appear information about your instance, including the IPv4 Public IP. It will look something like `54.92.67.113`. Copy this. In what follows, I refer to this as `<IPv4 Public IP>`.
- SSH into your instance by doing

ssh -i ~/key_pairs/julian_aws_keypair.pem ubuntu@<IPv4 Public IP>
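If you get tired of typing this long command, you can optionally add an entry to the SSH configuration file on your local machine. This is a minimal sketch, not something the AMI sets up for you; the alias `aws-instance` is hypothetical, and you must update `HostName` with the current IPv4 Public IP each time you stop and restart the instance.

# ~/.ssh/config on your local machine
Host aws-instance
    HostName <IPv4 Public IP>
    User ubuntu
    IdentityFile ~/key_pairs/julian_aws_keypair.pem

With this in place, `ssh aws-instance` connects you to the instance.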
5. Launch JupyterLab
When you launch JupyterLab, you want to use `screen`. By running screen, your JupyterLab session will not get interrupted if you disconnect from your instance. So, on the command line in your instance, execute
screen
Then, you can launch JupyterLab by executing
jupyter lab --no-browser
on the command line. This will launch JupyterLab. You will see output like this:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4
Keep this window open.
In order to use JupyterLab through a browser on your machine, you need to set up a socket. To do so, open up another GitBash or Terminal window and execute the following.
ssh -i ~/key_pairs/julian_aws_keypair.pem -L 8000:localhost:8888 ubuntu@<IPv4 Public IP>
This sets up a socket connecting port `8888` on your EC2 instance to port `8000` on your local machine. You can change these numbers as necessary. For example, in the URL listed above that you got when you launched JupyterLab, the port may be `localhost:8889`, in which case you need to substitute `8889` for `8888` in your ssh command. You may also want a different local port if you already have a JupyterLab instance running on port `8000`, e.g., `8001`. In what follows, I will use port numbers `8000` and `8888`, which you will probably use 90% of the time, but you can make changes as you see fit.
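For example, if JupyterLab reported `localhost:8889` and you already have something running on local port `8000`, the socket command (run from your local machine, with the same keypair and IP as before) would look like

ssh -i ~/key_pairs/julian_aws_keypair.pem -L 8001:localhost:8889 ubuntu@<IPv4 Public IP>

and you would point your browser at `localhost:8001` instead of `localhost:8000`.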
After you have set up the socket, you can paste the URL given when you launched JupyterLab on your EC2 instance into your browser, but substitute `8000` for `8888`. That is, direct your browser to
http://localhost:8000/?token=4eea0b108226fe68b7ahdf219d5efe85d16a0648a3f3f4
You will now have JupyterLab up and running!
6. If you get detached
- If you lose your internet connection, you can reconnect to your session, with JupyterLab running, by reattaching your screen. To do this, start by SSH-ing back into your instance with the `-L 8000:localhost:8888` flag. You must be SSH-ed into the instance with this flag to tell your local computer and the instance where to expect the JupyterLab session to port; without it you won't be able to connect again.
- Execute `screen -r` on the command line after SSH-ing back in to your EC2 instance to reconnect to your session. Any log information that was pushed to the JupyterLab session should show up in your screen session once you reconnect.

You can see what screens are active by doing `screen -ls` on the command line. You can also detach the current screen by using `screen -d`.
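If you end up running more than one screen, giving each session a name makes it easier to reattach to the right one. A short sketch; the session name `jupyter` is just an example:

screen -S jupyter
screen -ls
screen -r jupyter

`screen -S jupyter` starts a named session, `screen -ls` lists all sessions (e.g., `12345.jupyter (Detached)`), and `screen -r jupyter` reattaches by name.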
7. Copying results between AWS and your local machine
As you work on notebooks and create new files you want to save, you may want to move them to your local machine. For these files, you can use `scp`. Within your GitBash or Terminal window on your local machine (you probably have to open yet another), you can copy files as follows.
scp -i ~/key_pairs/julian_aws_keypair.pem ubuntu@<IPv4 Public IP>:~/my_file.csv ./
This command will copy files from your EC2 instance to your present working directory. Simply put the full path to the file you want to transfer after the colon above (remember `~/` means "home directory"). The second argument of `scp` is where you want to copy the file.
Similarly, you can upload files to your EC2 instance as follows (in this example to the home directory in your instance).
scp -i ~/key_pairs/julian_aws_keypair.pem my_file.txt ubuntu@<IPv4 Public IP>:~/
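`scp` can also copy entire directories if you pass the `-r` flag. For example, to download a hypothetical `results/` directory from your instance's home directory to your present working directory:

scp -i ~/key_pairs/julian_aws_keypair.pem -r ubuntu@<IPv4 Public IP>:~/results ./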
8. Exiting
When you are finished with your session, you can shut down your notebook in the browser. Then, in the terminal window, you can shut down JupyterLab by pressing `Ctrl-c`. After Jupyter is terminated, you should detach your screen by doing `screen -d`. Finally, you should quit your screen by doing `screen -X quit`.

In the past, I have had students whose instances were littered with detached screens. You should clean house from time to time and run `screen -X quit`.
After you are finished with your work on your instance, you should stop your instance. To do this, go back to the `Instances` page on your EC2 console. Right click your instance, and navigate to `Instance State` → `Stop`. Do not terminate your instance unless you really want to. Terminating an instance will get rid of any changes you made to it.
9. Seriously. Stop your instances if you are not using them.
If your instance is not stopped and you leave it running, you will get charged. You can rack up a massive bill with idle, but running, instances. You should stop your instances whenever you are not using them. It is a minor pain to wait for them to spin up again, but forgetting about a running instance will cause more pain than that to your pocketbook.
10. Terminate your instance when you are done using it
After the class is over, you might want to terminate your instance. This is because the storage in your instance (stored using AWS's EBS, which is what keeps your repository, installations, etc., all intact) is not free. Once your free tier eligibility expires (a year after signing up, if you are new to AWS) and/or your promo codes expire, you will start getting bills for your EBS usage. If you terminate your instance, this storage gets wiped and you will not get billed.
Working with larger files on AWS via attaching volumes
In order to avoid wasting money on large EBS root volumes (as described previously), a better workflow is to create and attach a large volume to work with your files only as long as you need it. To make and attach a volume,
1) Access `EC2` from the `Services` tab at the top of the AWS console.

2) Before you create a volume, you need to check the `Availability zone` of your instance. If a volume and instance have different availability zones, you won't be able to attach the storage to the instance. To check this, click on the `Instances` menu item under the `Instances` heading on the left of the screen. Click on the instance that you want to attach the volume to, and look at the `Availability zone` listed in the `Description` tab at the bottom of the screen. This will look something like `us-west-2a`. Make note of this for later use.

3) From the menu options on the left, under the `Elastic Block Storage` heading, select `Volumes`.

4) Click `Create Volume`.

5) You can change the size of the volume depending on your needs in the `Size (GiB)` box. Make sure that the `Availability zone` selected in the drop down menu matches that of your instance, which you noted above. If not, select the correct zone. Click `Create Volume` at the bottom of the screen to finish the process.

6) I recommend that you change the name of the volume that you just created to keep track of it. After doing this, select your volume, open `Actions` at the top of the box, and select `Attach Volume`. Select your instance, and click `Attach`.
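If you prefer the command line, the same create-and-attach steps can also be done with the AWS CLI (installing and configuring the CLI is described later in this document). This is a minimal sketch, not the workflow used above; the size, availability zone, and the volume and instance IDs are placeholders you would substitute with your own.

aws ec2 create-volume --size 100 --availability-zone us-west-2a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf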
Before you can access the volume on your instance, you will need to make a file system for it on that instance and mount it to a folder. To do this, perform the following steps:
1) Connect to your instance that has the volume attached.
2) First check that the volume shows up but is not set up. To do this, run the command `lsblk` to list block devices. This shows something like this:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 18M 1 loop /snap/amazon-ssm-agent/930
loop1 7:1 0 91M 1 loop /snap/core/6350
loop2 7:2 0 89.4M 1 loop /snap/core/6818
loop3 7:3 0 18M 1 loop /snap/amazon-ssm-agent/1335
nvme1n1 259:0 0 30G 0 disk
└─nvme1n1p1 259:2 0 30G 0 part /
nvme0n1 259:1 0 10G 0 disk
This indicates that there are two volumes attached. One is called `nvme1n1` and is the root volume; you can tell because it is mounted at the root folder, i.e., `/`. The other is a disk without a file system that is not mounted anywhere, called `nvme0n1`. This is the device we will now set up. It may have a slightly different name on your instance, but it will be something like that. You can check that it is the volume of interest because its `SIZE` will match what you designated when making the volume.
3) To make a file system and attach it, you will need to run some unix commands in the terminal that is connected to your instance. Make sure to input the name of the volume that you got from the `lsblk` command above where relevant. I am using the name from the above call as an example. You can run these three commands separately, or copy and paste the block all at once.
sudo mkfs -t xfs /dev/nvme0n1
sudo mkdir /data
sudo mount /dev/nvme0n1 /data
You also will probably want to change the permissions on this `/data` folder. If you do not, you may need to add `sudo` before the commands I instruct you to run later. To change the folder permissions, run

sudo chmod -R 777 /data

Note that this mount will not persist if you stop and restart your instance; see the caveat after this list.
4) If you run `lsblk` again, you will now see that this device has a `MOUNTPOINT`, namely the `/data` folder that you just created. This will look like
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 18M 1 loop /snap/amazon-ssm-agent/930
loop1 7:1 0 91M 1 loop /snap/core/6350
loop2 7:2 0 89.4M 1 loop /snap/core/6818
loop3 7:3 0 18M 1 loop /snap/amazon-ssm-agent/1335
nvme1n1 259:0 0 30G 0 disk
└─nvme1n1p1 259:2 0 30G 0 part /
nvme0n1 259:1 0 10G 0 disk /data
5) Once your volume is mounted on the `/data` folder (or whatever folder name you substituted in the commands above), you can copy your files from your computer to your instance volume. You can do this with the same type of command listed previously, i.e., `scp` commands. To put the files into this folder, just make sure to indicate that you want the files in the `/data` folder. Note that the `/data` folder is in the root of your instance, not in your home directory. This means that `cd ~/data` will not bring you there, but `cd /data` will. `~/` and `/` are very different file locations.
6) In general, you can now access the files in the `/data` folder. You can work with this data as you would any other file on the machine.
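One caveat: a volume mounted with `sudo mount` will not be re-mounted automatically if you stop and restart your instance; you would need to run the `mount` command again. If you want the mount to persist, you can add an entry to `/etc/fstab`. The following is a hedged sketch assuming the device from the example above; since device names like `nvme0n1` can change between reboots, it is safer to reference the volume by the UUID reported by `blkid`.

sudo blkid /dev/nvme0n1
# copy the UUID="..." value from the output into the line below
echo 'UUID=<your-uuid> /data xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab

The `nofail` option lets the instance boot even if the volume happens to be detached at the time.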
Working with files on Amazon S3
Amazon's primary cloud storage solution is called S3, which stands for Simple Storage Service. It can take a while to transfer large files to your instance, so if you are working with large amounts of image or sequence data, you will probably want to transfer your files to S3 before trying to access them on your instance. You can get around an order of magnitude speedup in the transfer speed from "S3 -> instance" as compared to "local computer -> instance." Transfer to S3 may still take a long time, but if you plan properly you can run this in the background or overnight and then access the files from S3 to your instance pretty rapidly.
1. Working with the AWS CLI (command line interface) on your local computer
- Install the CLI. If you are transferring files from your local machine to S3, you will need to install the command line interface on your computer. To do this, follow the relevant instructions in Amazon's documentation.
- Configure credentials for the AWS CLI on your local machine. To do this, run `aws configure` once you have installed the CLI. This will prompt you to input the credentials for your IAM role that you got from the account administrator in the very first step above. This will look something like:

aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
Note that us-west-2 is the correct region to enter and that json is good for the output format. The access key and secret key will be specific to your IAM role.
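You can verify that your credentials are set up correctly by asking AWS to identify you; this prints the account number and the ARN associated with your keys.

aws sts get-caller-identity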
2. Copying files to S3
S3 works on the principle of buckets. These are locations where folders and files are stored in Amazon's cloud. You can check what buckets there are by navigating to `Services -> S3` on the Amazon online console. Once you know what buckets you have to put files in, copying files to S3 is very similar to the copy commands listed above, except in this case the commands are called through the AWS CLI. To copy a file to S3, you would call a command like this
aws s3 cp ~/where/the/file/is/locally s3://s3bucket/
where `s3bucket` is the particular bucket that you want to put your files in. If you want to put the file into a subfolder in the bucket, you can create a folder by telling the `aws s3 cp` command what folder (whether it exists or not) you want the file in. This would look like
aws s3 cp ~/where/the/file/is/locally s3://s3bucket/new_or_old_folder/
If you want to copy a whole folder of files over to S3, you can do so with a `sync` command. This looks like the following:
aws s3 sync ~/where/the/folder/is/locally/ s3://s3bucket/new_or_old_folder/
This will copy all the files in the indicated folder onto S3.
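To check what actually landed in the bucket, you can list its contents with `aws s3 ls`. The `sync` command also accepts a `--dryrun` flag, which reports what would be transferred without copying anything; this is a useful sanity check before a large transfer.

aws s3 ls s3://s3bucket/new_or_old_folder/
aws s3 sync ~/where/the/folder/is/locally/ s3://s3bucket/new_or_old_folder/ --dryrun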
3. Copying files from the shared network attached file system
We have a shared, network attached storage device where many of the large files for the lab are stored. Information on how to work with this server is elsewhere, so reference that to get information on how to work with it on your local machine. In order to transfer files directly from the storage server to S3, you will need to be able to `ssh` into the server, which is restricted to administrators. The AWS CLI is already installed there with an IAM role for accessing S3. If you need to do such file transfers, contact the system administrator (Julian Wagner), as the admin will either need to run these commands for you or give you access to the proper credentials.
4. On an instance, attach an IAM role for S3 access
In order to allow your instance to access the files on S3, you will need to attach an IAM role allowing it to do so.
- Navigate to `Services -> EC2` on the AWS console, and select `Instances` from the tab on the side.
- Select the instance that you want to access the S3 files with, right click on it, and select `Attach/Replace IAM Role`.
- From the dropdown menu, click on the `acesss3` role that has already been created. If you want, you could also create an IAM role here to meet your particular needs, but the existing one will do the trick for S3 access.
- Click `Apply`, and you should be good to go. You do not need to restart the instance for the changes to take effect; they should work right away.
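A quick way to confirm the role took effect is to list one of your buckets from the instance's command line (this assumes the AWS CLI is installed on the instance; if it is not, install it as described above):

aws s3 ls s3://s3bucket/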
5. Copy files to your instance from S3
If you want to grab files from S3 and put them on your instance, you simply reverse the order of the arguments in the commands listed above, so that the files are taken from S3 and put in a location on your instance. This will look something like
aws s3 cp s3://s3bucket/file_you_want ~/
Note that any of the above commands can be reversed to take files from S3 and put them locally. Note that if you are trying to copy the files to an attached volume at, say, the `/data/` folder, do not put the `~` in front of the folder location. As mentioned before, `~` indicates your home folder, which is not the same as the root directory, which is denoted by `/`.
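As with uploads, `sync` works in this direction too. For example, to pull a whole folder from S3 directly onto an attached volume mounted at `/data`, you could run the following on your instance:

aws s3 sync s3://s3bucket/new_or_old_folder/ /data/new_or_old_folder/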
Building the AMI
You should now have the power to use AWS for your computing. If you are curious, to build the AMI, I ran the steps below on the command line of my instance when I first SSH'd into it. I made a shell script to do all the installs and pasted it into a file called `setup.sh`, which you will see in the home directory on the instance. The contents of this file are:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
bash ./miniconda.sh -b -p $HOME/miniconda;
export PATH="$HOME/miniconda/bin:$PATH";
echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> ~/.bashrc;
source ~/.bashrc;
conda update -q -y conda;
conda install -y numpy cython pystan;
conda config --add channels conda-forge;
conda config --add channels defaults;
conda install -y scipy pandas numba scikit-image scikit-learn statsmodels bokeh altair holoviews watermark tqdm matplotlib seaborn ipython jupyter jupyterlab nodejs xarray netcdf4;
pip install arviz;
pip install altair-catplot;
pip install bebi103;
jupyter labextension install --no-build jupyterlab_bokeh;
jupyter labextension install --no-build @ijmbarr/jupyterlab_spellchecker;
jupyter labextension install --no-build @jupyter-widgets/jupyterlab-manager;
jupyter labextension install --no-build @pyviz/jupyterlab_pyviz;
jupyter lab build;
echo 'export PS1="\[\e[1;32m\]\u\[\e[0m\]@\[\e[1;36m\]\h\[\e[0m\] [\w]\n% "' >> ~/.bashrc;
echo 'export LSCOLORS="gxfxcxdxCxegedabagacad"' >> ~/.bashrc;
conda install -y -c bioconda bwa;
conda install -y -c bioconda picard;
conda install -y -c bioconda samtools;
conda install -y -c bioconda bedtools;
sudo apt-get install -y unzip;
wget https://github.com/broadinstitute/gatk/releases/download/4.1.2.0/gatk-4.1.2.0.zip -P ~/;
unzip ~/gatk-4.1.2.0.zip -d ~/;
echo 'alias gatk=$HOME/gatk-4.1.2.0/gatk' >> ~/.bashrc;
After putting this into `setup.sh`, I simply called
bash ~/setup.sh