Python Worksheet¶
Setup¶
Import packages¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Remove unrelated warning
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing as tm
Download OASIS Dataset¶
We’ll be downloading the data from http://www.oasis-brains.org/pdf/oasis_longitudinal.csv
.
A full explanation of the data can be found in:
Open Access Series of Imaging Studies (OASIS): Longitudinal MRI Data in Nondemented and Demented Older Adults. Marcus, DS, Fotenos, AF, Csernansky, JG, Morris, JC, Buckner, RL, 2010. Journal of Cognitive Neuroscience, 22, 2677-2684. doi: 10.1162/jocn.2009.21407
df = pd.read_csv('http://www.oasis-brains.org/pdf/oasis_longitudinal.csv')
# Print out the top 5 rows of the DataFrame
df.head(5)
Subject ID | MRI ID | Group | Visit | MR Delay | M/F | Hand | Age | EDUC | SES | MMSE | CDR | eTIV | nWBV | ASF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | OAS2_0001 | OAS2_0001_MR1 | Nondemented | 1 | 0 | M | R | 87 | 14 | 2.0 | 27.0 | 0.0 | 1987 | 0.696 | 0.883 |
1 | OAS2_0001 | OAS2_0001_MR2 | Nondemented | 2 | 457 | M | R | 88 | 14 | 2.0 | 30.0 | 0.0 | 2004 | 0.681 | 0.876 |
2 | OAS2_0002 | OAS2_0002_MR1 | Demented | 1 | 0 | M | R | 75 | 12 | NaN | 23.0 | 0.5 | 1678 | 0.736 | 1.046 |
3 | OAS2_0002 | OAS2_0002_MR2 | Demented | 2 | 560 | M | R | 76 | 12 | NaN | 28.0 | 0.5 | 1738 | 0.713 | 1.010 |
4 | OAS2_0002 | OAS2_0002_MR3 | Demented | 3 | 1895 | M | R | 80 | 12 | NaN | 22.0 | 0.5 | 1698 | 0.701 | 1.034 |
Pre-processing the Data¶
Dementia Status from CDR¶
We’re told in the associated paper (linked above) that the CDR values correspond to the following levels of severity:
0 = No dementia
0.5 = Very mild Dementia
1 = Mild Dementia
2 = Moderate Dementia
3 = Severe Dementia
So let’s derive this “Status” from the CDR values:
# Create a mapper from values to status/severity
cdr_mapper = {
0: "No Dementia",
0.5: "Very Mild Dementia",
1: "Mild Dementia",
2: "Moderate Dementia",
3: "Severe Dementia",
}
# Create a column, mapping CDR values to this status
# Convert to categorical type, with severity as order
df["Status"] = pd.Categorical(
df["CDR"].map(cdr_mapper),
categories=cdr_mapper.values(),
ordered=True
)
# Create colours for the different 'Status' categories
status_colours = {
i:j for i,j in zip(cdr_mapper.values(), sns.color_palette("colorblind", 5))
}
Now we can look at this status, and how many records there are both overall, and at each of the visits (or time points).
# Let's look at how many records we have of each severity
df["Status"].value_counts()
No Dementia 206
Very Mild Dementia 123
Mild Dementia 41
Moderate Dementia 3
Severe Dementia 0
Name: Status, dtype: int64
# There are no records for "Severe Dementia", so we can tidy things up by removing this unused category
df["Status"] = df["Status"].cat.remove_unused_categories()
# How many of each status do we have at each visit?
df.groupby(["Visit", "Status"]).size()
Visit Status
1 No Dementia 85
Very Mild Dementia 52
Mild Dementia 13
Moderate Dementia 0
2 No Dementia 73
Very Mild Dementia 49
Mild Dementia 19
Moderate Dementia 3
3 No Dementia 34
Very Mild Dementia 18
Mild Dementia 6
Moderate Dementia 0
4 No Dementia 10
Very Mild Dementia 3
Mild Dementia 2
Moderate Dementia 0
5 No Dementia 4
Very Mild Dementia 1
Mild Dementia 1
Moderate Dementia 0
dtype: int64
Variable Types¶
df.dtypes
Subject ID object
MRI ID object
Group object
Visit int64
MR Delay int64
M/F object
Hand object
Age int64
EDUC int64
SES float64
MMSE float64
CDR float64
eTIV int64
nWBV float64
ASF float64
Status category
dtype: object
Note: Further coercing of types (e.g. “Visit” is ordinal) is good practice, but the necessity depends on your analysis.
Convert Categorical Data¶
df["Sex"] = df["M/F"].map({
"M": 0,
"F": 1
})
Visits & Baseline Data¶
df["Visit"].value_counts()
1 150
2 144
3 58
4 15
5 6
Name: Visit, dtype: int64
Filter for baseline data, and look at their status on arrival
df_baseline = df.loc[df["Visit"]==1]
df_baseline["Status"].value_counts()
No Dementia 85
Very Mild Dementia 52
Mild Dementia 13
Moderate Dementia 0
Name: Status, dtype: int64
Data Exploration & Analysis¶
Summarize Data¶
df.describe(include="all")
Subject ID | MRI ID | Group | Visit | MR Delay | M/F | Hand | Age | EDUC | SES | MMSE | CDR | eTIV | nWBV | ASF | Status | Sex | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 373 | 373 | 373 | 373.000000 | 373.000000 | 373 | 373 | 373.000000 | 373.000000 | 354.000000 | 371.000000 | 373.000000 | 373.000000 | 373.000000 | 373.000000 | 373 | 373.000000 |
unique | 150 | 373 | 3 | NaN | NaN | 2 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | NaN |
top | OAS2_0048 | OAS2_0179_MR2 | Nondemented | NaN | NaN | F | R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | No Dementia | NaN |
freq | 5 | 1 | 190 | NaN | NaN | 213 | 373 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 206 | NaN |
mean | NaN | NaN | NaN | 1.882038 | 595.104558 | NaN | NaN | 77.013405 | 14.597855 | 2.460452 | 27.342318 | 0.290885 | 1488.128686 | 0.729568 | 1.195461 | NaN | 0.571046 |
std | NaN | NaN | NaN | 0.922843 | 635.485118 | NaN | NaN | 7.640957 | 2.876339 | 1.134005 | 3.683244 | 0.374557 | 176.139286 | 0.037135 | 0.138092 | NaN | 0.495592 |
min | NaN | NaN | NaN | 1.000000 | 0.000000 | NaN | NaN | 60.000000 | 6.000000 | 1.000000 | 4.000000 | 0.000000 | 1106.000000 | 0.644000 | 0.876000 | NaN | 0.000000 |
25% | NaN | NaN | NaN | 1.000000 | 0.000000 | NaN | NaN | 71.000000 | 12.000000 | 2.000000 | 27.000000 | 0.000000 | 1357.000000 | 0.700000 | 1.099000 | NaN | 0.000000 |
50% | NaN | NaN | NaN | 2.000000 | 552.000000 | NaN | NaN | 77.000000 | 15.000000 | 2.000000 | 29.000000 | 0.000000 | 1470.000000 | 0.729000 | 1.194000 | NaN | 1.000000 |
75% | NaN | NaN | NaN | 2.000000 | 873.000000 | NaN | NaN | 82.000000 | 16.000000 | 3.000000 | 30.000000 | 0.500000 | 1597.000000 | 0.756000 | 1.293000 | NaN | 1.000000 |
max | NaN | NaN | NaN | 5.000000 | 2639.000000 | NaN | NaN | 98.000000 | 23.000000 | 5.000000 | 30.000000 | 2.000000 | 2004.000000 | 0.837000 | 1.587000 | NaN | 1.000000 |
Missing Data¶
# Count amount of missing data
df.isna().sum(0)
Subject ID 0
MRI ID 0
Group 0
Visit 0
MR Delay 0
M/F 0
Hand 0
Age 0
EDUC 0
SES 19
MMSE 2
CDR 0
eTIV 0
nWBV 0
ASF 0
Status 0
Sex 0
dtype: int64
Visualize Data¶
# Histograms
fig, axes = plt.subplots(2, 2, figsize=(12,8))
axes = axes.flatten()
for i, x_col in enumerate(["Age", "EDUC", "MMSE"]):
sns.histplot(data=df, x=x_col, ax=axes[i])
fig.delaxes(axes[-1])
# Histograms stratified by dementia status
fig, axes = plt.subplots(2, 2, figsize=(12,8))
# Make it easier to loop over
axes = axes.flatten()
# Loop through the columns we want to plot a histogram for
for i, x_col in enumerate(["Age", "EDUC", "MMSE"]):
sns.histplot(data=df, x=x_col, hue="Status", multiple="stack", ax=axes[i])
# Manually tidy up the legends
axes[0].get_legend().remove()
axes[1].get_legend().remove()
# Delete the extra axis (as we only plot 3 columns)
fig.delaxes(axes[-1])
MMSE trajectories over time¶
We can visualize the MMSE trajectories of each subject over time to get an overall view of the cohort.
fig, ax = plt.subplots(figsize=(12,6))
for _, indiv_data in df.groupby("Subject ID"):
ax = sns.lineplot(
data=indiv_data,
x="MR Delay",
y="MMSE",
ax=ax,
color="grey",
alpha=0.5,
linewidth=1.25,
linestyle="solid"
)
We can make this graph a little more interesting by adding colour to each line, corresponding to the dementia status:
fig, ax = plt.subplots(figsize=(12,6))
# Loop over the individual subjects
for group, indiv_data in df.groupby(["Subject ID"]):
# Extract xy data
xy = indiv_data[["MR Delay", "MMSE"]].values
# Plot normally for patients that do not change status
if len(indiv_data["Status"].unique()) == 1:
ax.plot(
xy[:, 0],
xy[:, 1],
color=status_colours[indiv_data["Status"].unique()[0]],
alpha=0.5,
linewidth=1.25
)
else:
# Duplicate points to construct line segments of different colours
for i, (start, stop) in enumerate(zip(xy[:-1], xy[1:])):
x, y = zip(start, stop)
ax.plot(
x, y,
color=status_colours[indiv_data["Status"].iloc[i]],
alpha=0.5,
linewidth=1.25
)
# Add the labels
ax.set_xlabel("MR Delay")
ax.set_ylabel("MMSE")
Text(0, 0.5, 'MMSE')
Regression Analysis¶
Simple Linear Regression¶
# Select the columns to iterate over
x_cols = ["Age", "Sex", "EDUC", "CDR"]
# Choose nicer names to display for each column
x_labels = ["Age", "Sex", "Yrs of Education", "CDR"]
fig, axes = plt.subplots(2,2, figsize=(16,10))
# Select the dependent variable
y = df_baseline["MMSE"].values
# Loop over the columns to fit each regression
for x_col, label, ax in zip(x_cols, x_labels, axes.flatten()):
# Select x data
x = df_baseline[x_col].values
# Create the model
model = sm.OLS(
y, sm.add_constant(x)
)
# Fit the model
res = model.fit()
# Add a plot with regression line
ax.plot(x, res.predict())
# and the raw data
ax.scatter(x, y, c=[status_colours[i] for i in df_baseline["Status"]])
ax.set_title(label)
ax.set_ylabel("MMSE")
# Print the R^2
print(f"{label} vs MMSE: R-squared = {res.rsquared:.3f}")
Age vs MMSE: R-squared = 0.001
Sex vs MMSE: R-squared = 0.048
Yrs of Education vs MMSE: R-squared = 0.047
CDR vs MMSE: R-squared = 0.479
Then we can do the same again, but this time stratify by the dementia status (not including CDR this time):
x_cols = ["Age", "Sex", "EDUC"]
x_labels = ["Age", "Sex", "Yrs of Education"]
fig, axes = plt.subplots(2,2, figsize=(16,10))
axes = axes.flatten()
for x_col, label, ax in zip(x_cols, x_labels, axes):
for name, group_data in df_baseline.groupby("Status", observed=True):
y = group_data["MMSE"]
x = group_data[x_col]
model = sm.OLS(
y, sm.add_constant(x)
)
res = model.fit()
ax.plot(x, res.predict())
ax.scatter(x, y, color=status_colours[name])
ax.set_title(label)
ax.set_ylabel("MMSE")
# Delete the extra axis (as we only plot 3 columns)
fig.delaxes(axes[-1])
Multiple Linear Regression¶
We can create a regression model using age, sex, education, and CDR as our independent variables together:
# Create string for our independent variables
x = "Age + EDUC + Sex + C(CDR)"
# And select the dependent one
y = "MMSE"
mod = smf.ols(formula = f"{y} ~ {x}", data=df_baseline)
res = mod.fit()
# Print out a nice summary of results
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MMSE R-squared: 0.493
Model: OLS Adj. R-squared: 0.475
Method: Least Squares F-statistic: 27.95
Date: Thu, 08 Jul 2021 Prob (F-statistic): 1.02e-19
Time: 10:03:10 Log-Likelihood: -324.67
No. Observations: 150 AIC: 661.3
Df Residuals: 144 BIC: 679.4
Df Model: 5
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 29.1743 2.140 13.632 0.000 24.944 33.404
C(CDR)[T.0.5] -2.9600 0.411 -7.209 0.000 -3.772 -2.148
C(CDR)[T.1.0] -6.0383 0.649 -9.305 0.000 -7.321 -4.756
Age -0.0194 0.024 -0.825 0.411 -0.066 0.027
EDUC 0.0734 0.065 1.137 0.257 -0.054 0.201
Sex 0.5643 0.374 1.511 0.133 -0.174 1.303
==============================================================================
Omnibus: 23.128 Durbin-Watson: 1.929
Prob(Omnibus): 0.000 Jarque-Bera (JB): 53.918
Skew: -0.623 Prob(JB): 1.96e-12
Kurtosis: 5.660 Cond. No. 943.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Call statsmodel function for partial regression plot
fig = sm.graphics.plot_partregress_grid(res)
# Making the output nicer
fig.set_figheight(10)
fig.set_figwidth(12)
fig.tight_layout(pad=3)
Results Table¶
Either using some of the analysis above, or through additional analysis of your own, create a summary results table below. What it includes is up to you, the overall aim is to get familiar with creating tables in markdown.
Export to PDF¶
Now we can export this report to PDF, using the steps below.
Note: You may need to change the paths and filenames depending your Google Drive structure, and what you have named this notebook.
# Install what we need to create the PDF
%%capture
!apt update
!apt install texlive-xetex texlive-fonts-recommended texlive-generic-recommended
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
# Create a DEMON folder and move this notebook into it
%%bash
mkdir -p gdrive/MyDrive/DEMON/
if [ -f gdrive/MyDrive/Reproducible_Reporting.ipynb ]; then
mv gdrive/MyDrive/Reproducible_Reporting.ipynb gdrive/MyDrive/DEMON/Reproducible_Reporting.ipynb
fi
ls gdrive/MyDrive/DEMON/
Reproducible_Reporting.ipynb
# Change to the drive we created above
%cd gdrive/MyDrive/DEMON/
/content/gdrive/MyDrive/DEMON
!jupyter nbconvert Reproducible_Reporting.ipynb --to pdf --output ./Reproducible_Reporting.pdf
[NbConvertApp] Converting notebook Reproducible_Reporting.ipynb to pdf
[NbConvertApp] Writing 49953 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] WARNING | bibtex had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 59264 bytes to ./Reproducible_Reporting.pdf