Literate programming, Markdown & Colab

Let’s start with the definition of literate programming (from Wikipedia):

Literate programming is a programming paradigm introduced by Donald Knuth in which a computer program is given an explanation of its logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which compilable source code can be generated.

A Jupyter notebook is a method of literate programming and a powerful way of presenting your analysis that is easy to code and easy to read. The “snippets of source code” are your Python syntax. The “explanation of its logic in a natural language” is your analysis report.

This lesson will draw heavily from the following excellent resources:

Markdown - what is it?

Markdown is a simple plain-text language that can be turned into formatted text, e.g., webpages and forum posts (Stack Exchange, GitHub).

Markdown allows you to incorporate code within the final document. It can be turned into HTML, PDFs, Word documents, books, theses and more.

This course is written in Markdown.
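
As a quick illustration (the content of the string below is just an example), you can render a Markdown string directly inside a notebook cell using IPython’s display tools:

from IPython.display import Markdown

# Render a small Markdown string: a heading, emphasis and a bullet list
Markdown("""
# A heading
Some *emphasised* and some **bold** text.

- a bullet point
- another bullet point
""")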

Markdown - what’s it good for?

Consider your normal data analysis workflow. It’s probably something like this:

1. Conventional workflow

  1. Do some data analysis in a Python script. Use comments to help the reader understand.

  2. Save plots, tables, etc. in separate files (fig1.png, tab1.csv, etc.).

  3. Write report in Word and insert figures, tables etc.

Updating/changing analysis: Edit the Python scripts and the Word document, re-insert figures, and potentially rename figure files and labels (fig1.png -> fig2.png, etc.).

Sharing analysis: Share the data, Python scripts, plots, tables, and Word/LaTeX files. Also explain how all the parts relate to each other in a README.

With Colab your workflow will be more like this:

2. Literate programming workflow

  1. Do some data analysis in a Colab notebook. All figures and tables are displayed inline - no need to save separate files.

  2. Write your report around the code in the Colab file. The narrative of the analysis helps the reader understand the code.

  3. One click to create a formatted Word/HTML/PDF report, or one click to publish to the web (requires setup - not covered in this course).

Updating/changing analysis: Edit the single Colab file and click to create report.

Sharing analysis: Share data and the single Colab file.

This difference can be seen in the figure below, which shows two directory structures: the top with the conventional workflow, the bottom with the literate programming workflow.

Of course for larger projects, you may have more than one Colab file and you may have separate analysis scripts as well.

Goals for this notebook

This notebook will familiarise you with all the main features of working with Jupyter Notebooks or Google Colab.

We will be demonstrating the following features:

  1. Setting up a notebook on Google Colab

  2. Exporting this notebook as a PDF file

Notebooks depend on kernels

If you’re using JupyterLab locally or through a hosting provider, kernels are important because they allow you to use the same notebook infrastructure across different programming languages (e.g. R, Python, Bash, JavaScript).

At the moment, Colab does not expose this capability and only supports Python, but it’s useful to know about.
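
If you’re curious which kernels are available in your own environment, you can ask Jupyter directly; in a stock Colab session this will typically list only the Python kernel, while a local JupyterLab install may list more:

# List the kernels Jupyter knows about in this environment
!jupyter kernelspec list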

Analysis is contained within notebooks

The notebook keeps track of settings tied to a particular data analysis project. It consists of a list of cells containing either explanatory text or executable code with its output.

When starting a new project, it is common to create a folder hierarchy that contains all the relevant data so that others can replicate your work.

When creating notebooks, it is good practice to use a consistent naming convention that encodes useful information, such as the topic of the analysis, the date of creation, and the notebook creator.
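
As a purely illustrative sketch (the topic and creator names here are placeholders), such a naming convention could be generated like this:

from datetime import date

topic = "oasis"       # topic of the analysis (placeholder)
creator = "jane"      # notebook creator (placeholder)
notebook_name = f"{topic}_{date.today():%Y-%m-%d}_{creator}.ipynb"
print(notebook_name)  # e.g. oasis_2024-01-01_jane.ipynb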

You can keep working on your project by just opening up the relevant Jupyter notebook in the corresponding folder.

Setting up a new Python project using Colab

  1. Just click here to spawn a new Jupyter notebook in Colab

  2. Change the notebook name at the top to e.g. oasis.ipynb

  3. Then you are ready for the main exercise!

Before diving into that, we need to make sure that we can export the PDF properly.

Working on a specific folder in drive

The following step is technically not necessary if you intend to just use Colab right away as a normal Jupyter notebook.

However, since Colab stores its work in your Google Drive, it is a good idea to specify which folder it will be using.

We will create a folder in your drive and then move our new Colab notebook inside that folder.

First, mount your drive in the Colab container.

The following code mounts your entire Google Drive so it is accessible from your notebook (this requires a Google account and your approval):

from google.colab import drive

drive.mount('/content/gdrive')

Let’s run it ourselves:

from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive

Use IPython magic commands

  • Magic commands (initiated with % for line magics or %% for cell magics) can be used in a notebook to accomplish a variety of tasks; we’ll see an example of a line magic just after the bash cell below.

  • In this example, a %%bash cell will create a folder in our mounted Google Drive and move our new Colab notebook into it.

  • You can learn more about magic commands by visiting link

%%bash
mkdir -p gdrive/MyDrive/DEMON/
if [ -f gdrive/MyDrive/Colab\ Notebooks/oasis.ipynb ]; then 
  mv gdrive/MyDrive/Colab\ Notebooks/oasis.ipynb gdrive/MyDrive/DEMON/oasis.ipynb
fi
ls gdrive/MyDrive/DEMON/
Reproducible_Reporting.ipynb
Reproducible_Reporting.pdf
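
The %%bash cell above is a cell magic. Line magics, by contrast, start with a single % and act on one line only; for example (the exact output depends on your session):

# A line magic uses a single %, unlike cell magics such as %%bash.
# %pwd returns the current working directory:
%pwd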

Saving the OASIS Data

Once we have our Drive folder mounted in the virtual session, we can save the data that we’ll be using.

To do this, we start by changing our current working directory using the %cd magic command.

%cd gdrive/MyDrive/DEMON/
/content/gdrive/MyDrive/DEMON

Then we’ll load the data into memory in this notebook and use to_csv() to save it to the current working directory.

import pandas as pd

df = pd.read_csv('http://www.oasis-brains.org/pdf/oasis_longitudinal.csv')
df.to_csv('oasis_raw.csv')
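
As a quick, optional sanity check, you can confirm that the file was written and preview the first few rows of the data:

import os

print(os.path.exists('oasis_raw.csv'))  # True if the save above worked
df.head()                               # preview the first five rows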

Notebooks make analysis easier

Among the advantages of a notebook is the ability to pick up an analysis where you left off last time: the notebook keeps track of the kernel state, the variables you have created, the packages you have imported, and the commands you have executed.
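
A handy way to see exactly what the running kernel currently holds is the %whos magic, which lists every variable defined in the session along with its type:

# List all variables currently defined in the running kernel
%whos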

Let’s investigate this with the following two exercises:

Starting where you left off

  1. Go to your open notebook

  2. Ensure the top cell has all the packages you need available.

  3. Press the play button to launch the cell (or press shift+enter).

  4. Add the following to the empty cell below (or just copy and paste!):

import numpy as np
x = np.random.randn(10) # Sample a vector of 10 numbers from the normal distribution
total = 0 
for value in x:
    total += value
print(total)
  5. Rename the notebook start_again.ipynb

  6. Close the browser tab

  7. Reopen the notebook

  8. Type x or total in a cell and run it - these should be recognised by the Colab interpreter and have the same value as before.

x
array([ 0.15872969, -0.25680134, -0.1102667 ,  0.56692093,  0.1241041 ,
        1.51965837,  1.62610112, -0.92023229, -1.51431169,  0.36782038])

Colab is memoryless

  1. Importantly, this only works as long as the notebook runtime stays on, i.e. while the variables are stored in runtime memory.

  2. To emphasise this important issue, let’s do another exercise:

  3. In your Colab tab, open the Runtime menu and select Restart runtime

  4. Now rerun the cell, which will raise an error

  5. This is because the variables are no longer in memory.

  6. Furthermore, if you open the Runtime menu and select Run all, you will see that the results change

  7. This is because we are sampling random numbers from a different Generator instance (you can read more about this here)

  8. See the following code:

rng = np.random.default_rng(42) 
x = rng.normal(size=10) 
total = 0 
for value in x:
    total += value
print(total)

Now if you restart your runtime and rerun the cells, the results will be the same every time.
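
To convince yourself of this, you can create two generators with the same seed and check that they produce identical samples (a minimal, self-contained illustration):

import numpy as np

# Two generators seeded identically produce exactly the same samples
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
print(np.array_equal(rng1.normal(size=10), rng2.normal(size=10)))  # prints True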

Generate PDF reports from a notebook

  • One of the goals of this session was to show how you can export your notebook to a PDF report

  • To do this, you need to install some essential software on your virtual machine

  • We will use the %%capture magic command to suppress the cell output (it’s a lot!)

  • And use the ! symbol to run a bash command without the need to declare a bash cell

%%capture
!apt update
!apt install texlive-xetex texlive-fonts-recommended texlive-generic-recommended

Export your notebook as a PDF report

Below we will convert a notebook named oasis.ipynb (that we made earlier) to a PDF:

!jupyter nbconvert oasis.ipynb --to pdf --output oasis.pdf
[NbConvertApp] Converting notebook oasis.ipynb to pdf
[NbConvertApp] Support files will be in oasis_files/
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Making directory ./oasis_files
[NbConvertApp] Writing 59602 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] WARNING | bibtex had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 344981 bytes to oasis.pdf
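
If you also want a copy of the report on your local machine (rather than only in the Colab session or your Drive folder), Colab’s files helper can trigger a browser download; this assumes you are running inside Colab:

from google.colab import files

# Trigger a browser download of the generated PDF report
files.download('oasis.pdf')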