Regressions
Linear regressions on spiracle data, non-parametric and Bayesian
To begin, we import the necessary Python packages.
import numpy as np
import pandas as pd
import statsmodels.api
import scipy.stats
import bokeh.io
import bokeh.plotting
from Bio import Phylo
import io
import cmdstanpy
import arviz as az
import logging
import bebi103
#local .py file for some plotting functions and non-parametric bootstrapping utils
import plotting_utils
bokeh.io.output_notebook()
We can now read the data into a dataframe for analysis.
df = pd.read_csv("./20190322_supp_table_2.csv")
We add a column with the species names formatted with underscores in place of spaces, and take a look at the format of the data.
df['species_underscore'] = [spec.replace(" ", "_") for spec in df['species']]
df.head()
For some of this analysis, we will look at per-species averages of our measurements. To get these, we use a simple aggregation on the dataframe and take a look at the results.
df_averages = df.groupby(['species', 'species_underscore', 'spiracle'], as_index=False).aggregate(np.average)
df_averages['subfamily'] = df.groupby(['species', 'species_underscore', 'spiracle'], as_index=False).aggregate(max)['subfamily']
df_averages.head()
Let's take a look at the number of species per subfamily in the data.
species_per_subfam = (
    df_averages.groupby(['subfamily', 'spiracle'], as_index=False)
    .count()
    .groupby('subfamily')
    .aggregate(max)
    .reset_index()[['subfamily', 'species']]
)
species_per_subfam.columns = ('subfamily', 'subfam_count')
species_per_subfam
df_averages = df_averages.merge(species_per_subfam, on='subfamily')
For our plots and regressions, we will work with log-transformed data, so we add columns with the log-transformed measurements as well as the derived quantities area/depth and area^2/depth.
df_averages['log area (mm^2)'] = np.log10(df_averages['area (mm^2)'])
df_averages['log dist'] = np.log10(df_averages['depth (mm)'])
df_averages['log mass (g)'] = np.log10(df_averages['mass (g)'])
df_averages['log area/dist'] = np.log10(df_averages['area (mm^2)']/df_averages['depth (mm)'])
df_averages['log area^2/dist'] = np.log10(df_averages['area (mm^2)']**2/df_averages['depth (mm)'])
df_averages.head()
In addition to log-transforming the species-averaged data, we do the same for the full data set.
df['log area (mm^2)'] = np.log10(df['area (mm^2)'])
df['log dist'] = np.log10(df['depth (mm)'])
df['log mass (g)'] = np.log10(df['mass (g)'])
df['log area/dist'] = np.log10(df['area (mm^2)']/df['depth (mm)'])
df['log area^2/dist'] = np.log10(df['area (mm^2)']**2/df['depth (mm)'])
df.head()
With our data in this format, we can build a simple model for our regression. We write a statistical model for the data: a normal likelihood whose mean is a linear function of log mass (with slope $a$ and intercept $b$) and whose standard deviation $\sigma$ is homoscedastic.

For the prior on the slope, we take a normal distribution centered on the slope that corresponds to isometric scaling in log-log space, with a standard deviation of $0.3$. This encodes our prior expectation that isometry is the most common outcome for morphological features: if spiracle dimensions scale uniformly with body size, we see isometry. The width of this prior admits a broad but reasonable range of slopes around the isometric value, so it is weakly informative, fairly encoding our expectation without constraining the ability of the data to inform the parameter.

For the intercept, which corresponds to the spiracle dimensions of a 1 g insect (log mass of zero), we expect a small value, so we center this prior at roughly a tenth of a square millimeter for area (a mean of $-1$ in log space), with a generous standard deviation (3 in log units in the model below). This broad distribution admits intercepts orders of magnitude smaller or larger than the mean, so it should not drag the parameter estimates, while still encoding our intuition about the scale of beetle spiracle dimensions.

For the prior on $\sigma$, which must be positive, we take a half-normal centered at zero with scale 1. This again allows a wide range of values while encoding our expectation that morphological measurements should not show extreme variation on a log scale.
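Before writing the model down, it helps to make the isometric slope values concrete. Under isometric (geometric) scaling, and assuming roughly constant body density so that mass $M$ scales as length $L$ cubed, the expected log-log slopes against mass follow directly:

\begin{align}
\mathrm{area} &\propto L^2 \propto M^{2/3}, \\
\mathrm{depth} &\propto L \propto M^{1/3}, \\
\mathrm{area}/\mathrm{depth} &\propto M^{1/3}, \\
\mathrm{area}^2/\mathrm{depth} &\propto M^{1}.
\end{align}

These exponents are the isometry values used as prior means for the slope. Putting everything together, the model is: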
\begin{align}
f_{\mu}(x_i; a, b) &= a x_i + b \\
b &\sim \text{Norm}(-1, 3) \\
a &\sim \text{Norm}(\mathrm{isometry}, 0.3) \\
[\mathrm{isometry_{area}} = 2/3,\; \mathrm{isometry_{depth}} &= 1/3,\; \mathrm{isometry_{area/depth}} = 1/3,\; \mathrm{isometry_{area^2/depth}} = 1] \\
\sigma &\sim \text{HalfNorm}(0, 1) \\
y_i &\sim \text{Norm}\left(f_\mu(x_i; a, b), \sigma\right)
\end{align}

The Stan code for the model is as follows:
functions {
  real f(real x_i, real a, real b) {
    real y_i = x_i*a + b;
    return y_i;
  }
}


data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  real priora;
  int<lower=1> N_ppc;
  vector[N_ppc] x_ppc;
}


parameters {
  real a;
  real b;
  real<lower=0> sigma;
}


transformed parameters {
  vector[N] mu;
  for (i in 1:N) {
    mu[i] = f(x[i], a, b);
  }
}


model {
  a ~ normal(priora, 0.3);
  b ~ normal(-1.0, 3.0);
  sigma ~ normal(0.0, 1.0);
  y ~ normal(mu, sigma);
}


generated quantities {
  vector[N_ppc] y_ppc;
  vector[N_ppc] mu_ppc;
  real coef_var = (sigma)/(10^b);

  for (i in 1:N_ppc) {
    mu_ppc[i] = f(x_ppc[i], a, b);
  }
  for (i in 1:N_ppc) {
    y_ppc[i] = normal_rng(mu_ppc[i], sigma);
  }
}
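In addition to evaluating the mean function and drawing posterior predictive samples on a grid of N_ppc mass values, the generated quantities block returns a derived scatter measure that appears in the summary plots and tables below,

\begin{align}
\mathrm{coef\_var} = \frac{\sigma}{10^{b}},
\end{align}

i.e. the log-space standard deviation divided by the back-transformed intercept, the predicted spiracle dimension at 1 g body mass.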
We will do a quick prior-predictive check of this model to make sure it gives reasonable potential datasets.
x = df_averages.sort_values('mass (g)').loc[df_averages['spiracle'] == '1', 'log mass (g)'].values
size = 1000
lines = []
# Draw (a, b, sigma) from the priors and generate a prior predictive dataset for each draw
for a, b, sigma in zip(np.random.normal(0.67, 0.3, size=size),
                       np.random.normal(-1, 3, size=size),
                       np.abs(np.random.normal(0, 1, size=size))):
    lines.append(np.random.normal(a*x + b, sigma))
p = bokeh.plotting.figure(plot_height=250, x_axis_label='mass', y_axis_label='spiracle dimension')
[p.line(x, y, alpha=0.05) for y in lines]
bokeh.io.show(p)
This looks good! The priors produce a gentle upward trend with a wide but plausible range of slopes and intercepts, so the data should be able to inform the parameter values well. We are now ready to sample to get parameter estimates, which we do below. In addition to sampling, we produce plots of the model outputs given the sampled parameters (the posterior samples as well as samples from the posterior predictive distribution) and store them for later visualization. Alongside these visual checks, we run diagnostics on the parameter samples to confirm that sampling went well.
logging.getLogger("cmdstanpy").setLevel(logging.WARNING)
all_summaries = {}
all_regressions = {}
N_ppc = 200
for morphs, iso in zip(['log area (mm^2)', 'log dist', 'log area/dist', 'log area^2/dist'], [0.67, 0.33, 0.33, 1.0]):
    sigmas = []
    slopes = []
    intercepts = []
    coef_vars = []
    summaries = {}
    plots = []
    for spir in ['S', 'T', '1', '2', '3', '4', '5', '6']:
        x = df_averages.sort_values('mass (g)').loc[df_averages['spiracle'] == spir, 'log mass (g)'].values
        y = df_averages.sort_values('mass (g)').loc[df_averages['spiracle'] == spir, morphs].values
        x_ppc = np.linspace(x.min(), x.max(), N_ppc)
        data = {
            "N": len(x),
            "x": x,
            "y": y,
            "priora": iso,
            "x_ppc": x_ppc,
            "N_ppc": N_ppc,
        }
        sm = cmdstanpy.CmdStanModel(stan_file='spiracle_regression.stan')
        samples = sm.sample(data=data, sampling_iters=1000, chains=4)
        samples = az.from_cmdstanpy(posterior=samples, posterior_predictive=["y_ppc"])
        print("________________________________________________________________________________________________________________\n"
              "Checking diagnostics for samples for " + morphs + ' spiracle ' + spir + '\n'
              "________________________________________________________________________________________________________________\n")
        bebi103.stan.check_all_diagnostics(samples, var_names=['b', 'a', 'sigma'])
        sigmas.append(samples.posterior['sigma'].values.ravel())
        coef_vars.append(samples.posterior['coef_var'].values.ravel())
        slopes.append(samples.posterior['a'].values.ravel())
        intercepts.append(samples.posterior['b'].values.ravel())
        summaries[spir] = az.summary(samples, var_names=['b', 'a', 'sigma', 'coef_var'])
        summaries[spir]['median'] = [np.median(samples.posterior['b'].values.ravel()),
                                     np.median(samples.posterior['a'].values.ravel()),
                                     np.median(samples.posterior['sigma'].values.ravel()),
                                     np.median(samples.posterior['coef_var'].values.ravel())]
        plots.append(plotting_utils.plot_regression_comparison(df, df_averages, x, spir, morphs, samples, iso, x_ppc, circle_size=9, line_width=1, circle_alpha=1, width=225, height=225))
    all_regressions[morphs] = plots
    all_summaries[morphs] = summaries
logging.getLogger("cmdstanpy").setLevel(logging.INFO)
Nice! The sampling appears to have gone well, with no major issues flagged by the diagnostics. With samples of all the parameters for every spiracle and morphological feature of interest in hand, we can now plot some summary statistics. For comparison, we also compute analogous estimates for the regression quantities of interest using a non-parametric bootstrapping approach on ordinary least squares fits, sketched below.
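The bootstrap intervals come from plotting_utils.make_CIs in the local module, which is not shown in this notebook. As a rough, hypothetical sketch of the kind of pairs bootstrap on an OLS fit that such a utility performs (the function name, arguments, and return values here are illustrative only, not the actual implementation):

def bootstrap_ols_ci(x, y, n_boot=10000, ci=(2.5, 97.5)):
    """Pairs bootstrap of an OLS line: resample (x, y) with replacement, refit, take percentiles."""
    slopes = np.empty(n_boot)
    intercepts = np.empty(n_boot)
    for i in range(n_boot):
        # Resample species (x, y pairs) with replacement
        idx = np.random.randint(0, len(x), size=len(x))
        # Ordinary least squares fit to the resampled pairs
        slopes[i], intercepts[i] = np.polyfit(x[idx], y[idx], 1)
    return np.percentile(slopes, ci), np.percentile(intercepts, ci)

The actual make_CIs call below returns confidence intervals and point estimates for the slope, intercept, σ, and the non-parametric analogue of σ/10^b in one pass.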
plots = []
summaries_nonparametric = {}
for morphs, iso, heading in zip(['log area (mm^2)', 'log dist', 'log area/dist', 'log area^2/dist'], [0.67, 0.33, 0.33, 1.0], ['Area', 'Depth', 'Diffusive', 'Advective']):
    #print(morphs)
    summaries = all_summaries[morphs]
    summaries_nonparametric[morphs] = {}
    for var, yax, var_title in zip(['a', 'b', 'sigma', 'coef_var'], ['log-log slope', 'intercept (mm^2)', 'log-log sigma', 'σ/10^b'], ['slope', 'intercept', 'variance', 'σ/10^b']):
        p = bokeh.plotting.figure(plot_height=200, plot_width=300, x_range=['S', 'T', '1', '2', '3', '4', '5', '6'], title=heading + ', Median ± HPD for ' + var_title, y_axis_label=yax, x_axis_label='Spiracle')#, y_axis_type='log')
        #p = bokeh.plotting.figure(plot_height=200, plot_width=300, x_range=['S', 'T', '1', '2', '3', '4', '5', '6'])
        p.xaxis.visible = False
        p.outline_line_color = None
        p.yaxis.minor_tick_line_color = None
        p.xaxis.minor_tick_line_color = None
        if var == 'a':
            p.line([-10, 100], [iso, iso], color='black', line_width=2, line_alpha=1)
        for i, spir in enumerate(['S', 'T', '1', '2', '3', '4', '5', '6']):
            CI_a, CI_b, CI_σ, CI_co_σ, slope, intercept, σ, co_σ = plotting_utils.make_CIs(df_averages, morphs, spir)
            summaries_nonparametric[morphs][spir] = [CI_a, CI_b, CI_σ, CI_co_σ, slope, intercept, σ, co_σ]
            if var == 'b':
                c = 1
                if morphs == 'log area/dist':
                    c = 0.178*404*(1/10)
                elif morphs == 'log area^2/dist':
                    c = (1/(3.1415*8*1.86*(10**(-8))))*(0.1**3)
                p.line([spir, spir], c*(10**np.array([summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']])), color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], c*(10**CI_b), color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], c*(10**summaries[spir].loc[var]['median']), color='white', size=9, alpha=1)
                p.diamond([spir,], c*(10**intercept), color='black', size=5)
            elif var == 'a':
                p.line([spir, spir], [summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']], color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], CI_a, color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], summaries[spir].loc[var]['median'], color='white', size=9, alpha=1)
                p.diamond([spir,], slope, color='black', size=5)
            elif var == 'sigma':
                p.line([spir, spir], [summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']], color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], CI_σ, color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], summaries[spir].loc[var]['median'], color='white', size=9, alpha=1)
                p.diamond([spir,], σ, color='black', size=5)
            elif var == 'coef_var':
                p.line([spir, spir], [summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']], color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], CI_co_σ, color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], summaries[spir].loc[var]['median'], color='white', size=9, alpha=1)
                p.diamond([spir,], co_σ, color='black', size=5)
        p.xgrid.grid_line_color = None
        #p.ygrid.grid_line_color = None
        p.output_backend = 'svg'
        plots.append(p)
        #bokeh.io.show(p)
print("FOR ALL PLOTS: White diamond is median Bayesian parameter sample, grey box is 3%-97% highest posterior density interval for parameter samples")
print("FOR ALL PLOTS: Black diamond is median summary statistic for non-parametric bootstrapping on OLS, black line is 2.5%-97.5% percentile for bootstrapped summary stat\n")
print('Area regressions (log-log slope, intercept (mm^2), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[0:4], ncols=2))
print('Depth regressions (log-log slope, intercept (mm), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[4:8], ncols=2))
print('Area/Depth regressions (log-log slope, intercept (G_diff), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[8:12], ncols=2))
print('Area^2/Depth regressions (log-log slope, intercept (G_adv), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[12:16], ncols=2))
#print('Area regressions (log-log slope, intercept (mm^2), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[0:3], ncols=3))
#print('Depth regressions (log-log slope, intercept (mm), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[3:6], ncols=3))
#print('Area/Depth regressions (log-log slope, intercept (G_diff), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[6:9], ncols=3))
#print('Area^2/Depth regressions (log-log slope, intercept (G_adv), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[9:12], ncols=3))
#To save individual svgs of a plot:
#bokeh.io.export_svgs(plots[7], filename='C:/Users/jwagne/Downloads/coef_var_depth.svg')
#bokeh.io.export_svgs(plots[3], filename='C:/Users/jwagne/Downloads/coef_var_area.svg')
Nice! With these plots finished, we also convert the regression results into a dataframe for a summary table.
#CI_a, CI_b, CI_σ, CI_co_σ, slope, intercept, σ, co_σ
header = ['morphology', 'converted_b_units', 'spiracle', 'a', 'a_2.5', 'a_97.5', 'b', 'b_2.5', 'b_97.5', 'b_converted', 'b_converted_2.5', 'b_converted_97.5', 'σ', 'σ_2.5', 'σ_97.5', 'co_σ', 'co_σ_2.5', 'co_σ_97.5',
'a_median_bayes', 'a_hpd_3_bayes', 'a_hpd_97_bayes', 'b_median_bayes', 'b_hpd_3_bayes', 'b_hpd_97_bayes', 'σ_median_bayes', 'σ_hpd_3_bayes', 'σ_hpd_97_bayes', 'co_σ_median_bayes', 'co_σ_hpd_3_bayes', 'co_σ_hpd_97_bayes']
m = 'log area (mm^2)'
s = 'S'
index = 0
for m in summaries_nonparametric:
    for s in summaries_nonparametric[m]:
        c = 1
        morph = m[4:]
        if morph == 'dist':
            morph = 'dist (mm)'
        if m == 'log area/dist':
            c = 0.178*404*(1/10)
            morph = 'g_diff (nmol/(sec*kPa))'
        elif m == 'log area^2/dist':
            c = (1/(3.1415*8*1.86*(10**(-8))))*(0.1**3)
            morph = 'g_adv (cm^3/(sec*kPa))'
        row = {header[0]: m,
               header[1]: morph,
               header[2]: s,
               #a
               header[3]: summaries_nonparametric[m][s][4],
               header[4]: summaries_nonparametric[m][s][0][0],
               header[5]: summaries_nonparametric[m][s][0][1],
               #b
               header[6]: summaries_nonparametric[m][s][5],
               header[7]: summaries_nonparametric[m][s][1][0],
               header[8]: summaries_nonparametric[m][s][1][1],
               header[9]: c*(10**summaries_nonparametric[m][s][5]),
               header[10]: c*(10**summaries_nonparametric[m][s][1][0]),
               header[11]: c*(10**summaries_nonparametric[m][s][1][1]),
               #σ
               header[12]: summaries_nonparametric[m][s][6][0],
               header[13]: summaries_nonparametric[m][s][2][0],
               header[14]: summaries_nonparametric[m][s][2][1],
               #σ_co
               header[15]: summaries_nonparametric[m][s][7][0],
               header[16]: summaries_nonparametric[m][s][3][0],
               header[17]: summaries_nonparametric[m][s][3][1],
               header[18]: all_summaries[m][s].loc['a']['median'],
               header[19]: all_summaries[m][s].loc['a']['hpd_3%'],
               header[20]: all_summaries[m][s].loc['a']['hpd_97%'],
               header[21]: all_summaries[m][s].loc['b']['median'],
               header[22]: all_summaries[m][s].loc['b']['hpd_3%'],
               header[23]: all_summaries[m][s].loc['b']['hpd_97%'],
               header[24]: all_summaries[m][s].loc['sigma']['median'],
               header[25]: all_summaries[m][s].loc['sigma']['hpd_3%'],
               header[26]: all_summaries[m][s].loc['sigma']['hpd_97%'],
               header[27]: all_summaries[m][s].loc['coef_var']['median'],
               header[28]: all_summaries[m][s].loc['coef_var']['hpd_3%'],
               header[29]: all_summaries[m][s].loc['coef_var']['hpd_97%'],
               }
        if index == 0:
            df = pd.DataFrame(row, index=[index])
        else:
            df = df.append(pd.DataFrame(row, index=[index]))
        index += 1
df.head(5)
#To copy the table to the clipboard for pasting into other software:
#df.to_clipboard()
Great! With these summary statistics in hand, we can also plot the regressions themselves. For all of these plots, the grey ranges show the spread of regression output values (from posterior sampling or from bootstrapping) between the 2.5th and 97.5th percentiles. The central line is the 45th-55th percentile band for the Bayesian regressions, or the OLS line fit to the non-resampled data for the bootstrap regression plots. The two grey regions in the Bayesian regression plots are the 95% and 80% percentile bands of the posterior predictive distribution for the model. The black dots are the measured data points.
Plot of species-averaged mass vs. species-averaged spiracle area (log transformed)
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log area (mm^2)'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log area (mm^2)', 2/3, width=225, height=225, point_size=9)
#To save individual svgs of a plot:
#p = b_plots[0]
#p.outline_line_color = None
#p.yaxis.minor_tick_line_color = None
#p.xaxis.minor_tick_line_color = None
#bokeh.io.export_svgs(p, filename='C:/Users/jwagne/Downloads/S_area_nonparametric.svg')
Species-averaged mass vs. species-averaged spiracle depth (log transformed)
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log dist'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log dist', 1/3, width=225, height=225, point_size=9)
Species-averaged mass vs. $\frac{\mathrm{species\,averaged\,area}}{\mathrm{species\,averaged\,depth}}$ (log transformed)
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log area/dist'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log area/dist', 1/3, width=225, height=225, point_size=9)
Species-averaged mass vs. $\frac{(\mathrm{species\,averaged\,area})^2}{\mathrm{species\,averaged\,depth}}$ (log transformed)
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log area^2/dist'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log area^2/dist', 1, width=225, height=225, point_size=9)
%reload_ext watermark
%watermark -p bokeh,cmdstanpy