bayesian_nonparametric_linear_regression

Linear regressions on spiracle data, non-parametric and Bayesian

To begin with, we need to import necessary python packages.

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api
import scipy.stats

import bokeh.io
import bokeh.plotting

from Bio import Phylo
import io

import cmdstanpy
import arviz as az
import logging

import bebi103

#local .py file for some plotting functions and non-parametric bootstrapping utils
import plotting_utils

bokeh.io.output_notebook()
Loading BokehJS ...

We can now read in the data into a dataframe for analyis.

In [2]:
df = pd.read_csv("./20190322_supp_table_2.csv")

We take a look at the format for the data.

In [3]:
df['species_underscore'] = [spec.replace(" ", "_") for  spec in df['species']]
df.head()
Out[3]:
Unnamed: 0 subfamily species sex mass (g) spiracle area (mm^2) depth (mm) species_underscore
0 0 Cetoniinae Goliathus goliathus M 16.280 6 0.274408 2.512648 Goliathus_goliathus
1 1 Cetoniinae Goliathus goliathus F 18.150 6 0.134949 1.606189 Goliathus_goliathus
2 2 Cetoniinae Coelorrhina hornimani M 1.130 6 0.212131 0.553833 Coelorrhina_hornimani
3 3 Cetoniinae Dicronorrhina derbyana M 2.120 6 0.039532 0.473369 Dicronorrhina_derbyana
4 4 Cetoniinae Dicronorrhina derbyana F 2.145 6 0.049701 0.496320 Dicronorrhina_derbyana

For some of this analysis, we will look at the per-species averages for our measurements. To get this, we use a simple aggregate function on the dataframe and take a look at the results.

In [4]:
df_averages = df.groupby(['species', 'species_underscore', 'spiracle'], as_index=False).aggregate(np.average)
df_averages['subfamily'] = df.groupby(['species', 'species_underscore', 'spiracle'], as_index=False).aggregate(max)['subfamily']
df_averages.head()
Out[4]:
species species_underscore spiracle Unnamed: 0 area (mm^2) depth (mm) mass (g) subfamily
0 Coelorrhina hornimani Coelorrhina_hornimani 1 87.0 0.135347 0.416717 1.13 Cetoniinae
1 Coelorrhina hornimani Coelorrhina_hornimani 2 70.0 0.084207 0.451409 1.13 Cetoniinae
2 Coelorrhina hornimani Coelorrhina_hornimani 3 53.0 0.106693 0.325444 1.13 Cetoniinae
3 Coelorrhina hornimani Coelorrhina_hornimani 4 36.0 0.115574 0.481558 1.13 Cetoniinae
4 Coelorrhina hornimani Coelorrhina_hornimani 5 19.0 0.119145 0.506751 1.13 Cetoniinae

Let's take a look at the number of species per subfamily in the data.

In [5]:
species_per_subfam=df_averages.groupby(['subfamily', 'spiracle'], as_index=False).count().groupby('subfamily').aggregate(max).reset_index()[['subfamily', 'species']]
species_per_subfam.columns = ('subfamily', 'subfam_count')
species_per_subfam
Out[5]:
subfamily subfam_count
0 Cetoniinae 6
1 Dynastinae 3
2 Rutelinae 1
In [6]:
df_averages = df_averages.merge(species_per_subfam, on='subfamily')

For our plots, we will log transform the data. We will add a column to the dataframe with the log transformed data. We will also need some transforms of our data, which we will do here.

In [7]:
df_averages['log area (mm^2)'] = np.log10(df_averages['area (mm^2)'])
df_averages['log dist'] = np.log10(df_averages['depth (mm)'])
df_averages['log mass (g)'] = np.log10(df_averages['mass (g)'])
df_averages['log area/dist'] = np.log10(df_averages['area (mm^2)']/df_averages['depth (mm)'])
df_averages['log area^2/dist'] = np.log10(df_averages['area (mm^2)']**2/df_averages['depth (mm)'])

df_averages.head()
Out[7]:
species species_underscore spiracle Unnamed: 0 area (mm^2) depth (mm) mass (g) subfamily subfam_count log area (mm^2) log dist log mass (g) log area/dist log area^2/dist
0 Coelorrhina hornimani Coelorrhina_hornimani 1 87.0 0.135347 0.416717 1.13 Cetoniinae 6 -0.868551 -0.380159 0.053078 -0.488392 -1.356943
1 Coelorrhina hornimani Coelorrhina_hornimani 2 70.0 0.084207 0.451409 1.13 Cetoniinae 6 -1.074651 -0.345430 0.053078 -0.729221 -1.803872
2 Coelorrhina hornimani Coelorrhina_hornimani 3 53.0 0.106693 0.325444 1.13 Cetoniinae 6 -0.971862 -0.487524 0.053078 -0.484339 -1.456201
3 Coelorrhina hornimani Coelorrhina_hornimani 4 36.0 0.115574 0.481558 1.13 Cetoniinae 6 -0.937142 -0.317351 0.053078 -0.619790 -1.556932
4 Coelorrhina hornimani Coelorrhina_hornimani 5 19.0 0.119145 0.506751 1.13 Cetoniinae 6 -0.923923 -0.295205 0.053078 -0.628717 -1.552640

In addition to log transforming the species averaged data, we will do the same for the whole data set.

In [8]:
df['log area (mm^2)'] = np.log10(df['area (mm^2)'])
df['log dist'] = np.log10(df['depth (mm)'])
df['log mass (g)'] = np.log10(df['mass (g)'])
df['log area/dist'] = np.log10(df['area (mm^2)']/df['depth (mm)'])
df['log area^2/dist'] = np.log10(df['area (mm^2)']**2/df['depth (mm)'])
df.head()
Out[8]:
Unnamed: 0 subfamily species sex mass (g) spiracle area (mm^2) depth (mm) species_underscore log area (mm^2) log dist log mass (g) log area/dist log area^2/dist
0 0 Cetoniinae Goliathus goliathus M 16.280 6 0.274408 2.512648 Goliathus_goliathus -0.561603 0.400132 1.211654 -0.961735 -1.523338
1 1 Cetoniinae Goliathus goliathus F 18.150 6 0.134949 1.606189 Goliathus_goliathus -0.869831 0.205797 1.258877 -1.075628 -1.945459
2 2 Cetoniinae Coelorrhina hornimani M 1.130 6 0.212131 0.553833 Coelorrhina_hornimani -0.673395 -0.256621 0.053078 -0.416774 -1.090169
3 3 Cetoniinae Dicronorrhina derbyana M 2.120 6 0.039532 0.473369 Dicronorrhina_derbyana -1.403054 -0.324800 0.326336 -1.078254 -2.481309
4 4 Cetoniinae Dicronorrhina derbyana F 2.145 6 0.049701 0.496320 Dicronorrhina_derbyana -1.303635 -0.304238 0.331427 -0.999397 -2.303033

With our data in this format, we can build simple model for our regression. To do this, we will write a statistical model for the data. We choose a normal distribution with mean given by a simple linear function (with slope $a$ and intercept $b$), and homoscedastic variance $\sigma$. For our priors, we will take a normal distribution for the slope with mean given by the slope value representing isometric scaling in the log-log space, and standard deviation of $0.3$. This encodes our prior expectation that isometry is the most common/likely outcome for morphological features. If they scale uniformly with one another, we would see isometry. The variance term on this prior gives a broad but reasonable range of potential values for the parameter around the isometric value. This prior is hence weakly informative, fairly encoding our prior expectation without constraining the ability of our data to inform the parameter value. For the intercept, we expect a small value for spiracle dimensions of a 1g insect (what the intercept of the log mass x axis represents), so we put this mean at around 1/10th of a square mm for area (mean of -1 in log space), with a standard deviation of 1. Hence ~95% of the probability mass for this prior goes from ~100 times smaller to ~100 times larger than this mean. Once again, this is a pretty broad distribution so shouldn't drag the parameter estimates much, but does encode our prior intuition of what we expect from beetle spiracle dimensions. For the prior on the $\sigma$ term for the normal distribution, we take a half normal ($\sigma$ must be positive), with mean at zero and standard deviation of 1, once again this allows for a wide range of values for sigma while also encoding our expectation that morphological measures shouldn't have extreme variation on a log scale. This gives the following model:

\begin{align} f_{\mu}(x_i; a, b) &= ax_i + b \\ b &\sim \text{Norm}(-1, 3) \\ a &\sim \text{Norm}(\mathrm{isometry}, 0.3)\\ [\mathrm{isometry_{area}=2/3, \, isometry_{depth}=1/3},& \, \mathrm{isometry_{area/depth}}=1/3, \, \mathrm{isometry_{area^2/depth}}=1]\\ \sigma &\sim \text{HalfNorm}(0, 1)\\ y_i &\sim \text{Normal}\left(f_\mu(x_i; a, b), \sigma\right) \end{align}

The Stan code for the model is as follows:

functions {
  real f(real x_i, real a, real b) {
    real y_i = x_i*a + b;
    return y_i;
  }
}

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  real priora;
  int<lower=1> N_ppc;
  vector[N_ppc] x_ppc;
}

parameters {
  real a;
  real b;
  real<lower=0> sigma;
}

transformed parameters {
  vector[N] mu;

  for (i in 1:N) {
    mu[i] = f(x[i], a, b);
  }

}

model {
  a ~ normal(priora, 0.3);
  b ~ normal(-1.0, 3.0);

  sigma ~ normal(0.0, 1.0);
  y ~ normal(mu, sigma);

}

generated quantities {
  vector[N_ppc] y_ppc;
  vector[N_ppc] mu_ppc;
  real coef_var = (sigma)/(10^b);

  for (i in 1:N_ppc) {
    mu_ppc[i] = f(x_ppc[i], a, b);
  }
  for (i in 1:N_ppc) { 
      y_ppc[i] = normal_rng(mu_ppc[i], sigma);
  }

}

We will do a quick prior-predictive check of this model to make sure it gives reasonable potential datasets.

In [9]:
x = df_averages.sort_values('mass (g)').loc[df_averages['spiracle'] == '1', 'log mass (g)'].values
size = 1000
lines = []
for a, b, sigma in zip(np.random.normal(0.67,0.3, size=size), np.random.normal(-1,1,size=size), np.abs(np.random.normal(1,size=size))):
    lines.append(np.random.normal(a*x+b, sigma))
    
p = bokeh.plotting.figure(plot_height=250, x_axis_label='mass', y_axis_label='spiracle dimension')
[p.line(x, y, alpha=0.05) for y in lines]
bokeh.io.show(p)

This looks good! The proposed model gives a slight upward slope and a nice wide range of values for both the slope and intercept based on the priors. The data should be able to inform the parameter values well based on this model. We are now ready to sample to get parameter estimates! We do this below. In addition to sampling, we also produce some plots of the model outputs given the sampled parameters (i.e. the posterior samples as well as samples from the posterior predictive distribution). We store these plots for later visualization. In addition to the visual checks, we also run diagnostics on our parameter samples to check that sampling went well.

In [10]:
logging.getLogger("cmdstanpy").setLevel(logging.WARNING)
all_summaries = {}
all_regressions = {}
N_ppc = 200
for morphs, iso in zip(['log area (mm^2)', 'log dist', 'log area/dist', 'log area^2/dist'], [0.67, 0.33, 0.33, 1.0]):

    sigmas = []
    slopes = []
    intercepts = []
    coef_vars = []
    summaries = {}
    plots = []
    for spir in ['S', 'T', '1', '2', '3', '4', '5', '6']:
        x = df_averages.sort_values('mass (g)').loc[df_averages['spiracle'] == spir, 'log mass (g)'].values
        y = df_averages.sort_values('mass (g)').loc[df_averages['spiracle'] == spir, morphs].values

        x_ppc = np.linspace(x.min(), x.max(), N_ppc)
        
        data = {
            "N": len(x),
            "x": x,
            "y": y,
            "priora": iso,
            "x_ppc": x_ppc,
            'N_ppc': N_ppc,
        }

        sm = cmdstanpy.CmdStanModel(stan_file='spiracle_regression.stan')

        samples = sm.sample(data=data, sampling_iters=1000, chains=4)

        samples = az.from_cmdstanpy(posterior=samples, posterior_predictive=["y_ppc"])
        print("________________________________________________________________________________________________________________\n" \
              "Checking diagnostics for samples for " + morphs + ' spiracle ' + spir + '\n' \
              "________________________________________________________________________________________________________________\n")
        bebi103.stan.check_all_diagnostics(samples, var_names=['b', 'a', 'sigma'])
        sigmas.append(samples.posterior['sigma'].values.ravel())
        coef_vars.append(samples.posterior['coef_var'].values.ravel())
        slopes.append(samples.posterior['a'].values.ravel())
        intercepts.append(samples.posterior['b'].values.ravel())
        summaries[spir] = az.summary(samples, var_names=['b', 'a', 'sigma', 'coef_var'])
        summaries[spir]['median'] = [np.median(samples.posterior['b'].values.ravel()),
                                     np.median(samples.posterior['a'].values.ravel()),
                                     np.median(samples.posterior['sigma'].values.ravel()),
                                     np.median(samples.posterior['coef_var'].values.ravel())]
        plots.append(plotting_utils.plot_regression_comparison(df, df_averages, x, spir, morphs, samples, iso, x_ppc, circle_size=9, line_width=1, circle_alpha=1, width=225, height=225))
        
    all_regressions[morphs] = plots
    all_summaries[morphs] = summaries
logging.getLogger("cmdstanpy").setLevel(logging.INFO)
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle S
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle T
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle 1
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle 2
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle 3
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle 4
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle 5
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area (mm^2) spiracle 6
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle S
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle T
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle 1
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle 2
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle 3
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle 4
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle 5
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log dist spiracle 6
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle S
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle T
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle 1
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle 2
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle 3
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle 4
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle 5
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area/dist spiracle 6
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle S
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle T
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle 1
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle 2
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle 3
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle 4
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle 5
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
________________________________________________________________________________________________________________
Checking diagnostics for samples for log area^2/dist spiracle 6
________________________________________________________________________________________________________________

Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.

Nice! the sampling looks to have gone well, with no major issues based on diagnostics. With samples for all the parameters and for all the spiracles/morphological features of interest in hand, we can now generate a plot for some summary stats about the data. We will also compute similar estimates for regression features of interest using a non-parametric bootstrapping approach for comparison.

In [11]:
plots = []
summaries_nonparametric = {}
for morphs, iso, heading in zip(['log area (mm^2)', 'log dist', 'log area/dist', 'log area^2/dist'], [0.67, 0.33, 0.33, 1.0], ['Area', 'Depth', 'Diffusive', 'Advective']):
    #print(morphs)
    summaries = all_summaries[morphs]
    summaries_nonparametric[morphs] = {}
    for var, yax, var_title in zip(['a', 'b', 'sigma', 'coef_var'], ['log-log slope', 'intercept (mm^2)', 'log-log sigma', 'σ/10^b'], ['slope', 'intercept', 'variance', 'σ/10^b']):
        p = bokeh.plotting.figure(plot_height=200, plot_width=300, x_range=['S', 'T', '1', '2', '3', '4', '5', '6'], title=heading + ', Median ± HPD for ' + var_title, y_axis_label=yax, x_axis_label='Spiracle')#, y_axis_type='log')
        #p = bokeh.plotting.figure(plot_height=200, plot_width=300, x_range=['S', 'T', '1', '2', '3', '4', '5', '6'])
        p.xaxis.visible = False
        p.outline_line_color = None
        p.yaxis.minor_tick_line_color = None
        p.xaxis.minor_tick_line_color = None
        if var == 'a':
                p.line([-10, 100], [iso, iso], color='black', line_width=2, line_alpha=1)
        for i, spir in enumerate(['S', 'T', '1', '2', '3', '4', '5', '6']):
            CI_a, CI_b, CI_σ, CI_co_σ, slope, intercept, σ, co_σ = plotting_utils.make_CIs(df_averages, morphs, spir)
            summaries_nonparametric[morphs][spir] = [CI_a, CI_b, CI_σ, CI_co_σ, slope, intercept, σ, co_σ]
            if var == 'b':
                c = 1
                if morphs == 'log area/dist':
                    c = 0.178*404*(1/10)
                elif morphs == 'log area^2/dist':
                    c = (1/(3.1415*8*1.86*(10**(-8))))*(0.1**3)
                p.line([spir, spir], c*(10**np.array([summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']])), color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], c*(10**CI_b), color='black', line_width=2, alpha=0.75)
                p.diamond([spir,],   c*(10**summaries[spir].loc[var]['median']), color='white', size=9, alpha=1)
                p.diamond([spir,],   c*(10**intercept), color='black', size=5)
            elif var == 'a':
                p.line([spir, spir], [summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']], color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], CI_a, color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], summaries[spir].loc[var]['median'], color='white', size=9, alpha=1)
                p.diamond([spir,], slope, color='black', size=5)
            elif var == 'sigma':
                p.line([spir, spir], [summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']], color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], CI_σ, color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], summaries[spir].loc[var]['median'], color='white', size=9, alpha=1)
                p.diamond([spir,], σ, color='black', size=5)
                
            elif var == 'coef_var':
                p.line([spir, spir], [summaries[spir].loc[var]['hpd_3%'], summaries[spir].loc[var]['hpd_97%']], color='grey', line_width=7, alpha=0.75)
                p.line([spir, spir], CI_co_σ, color='black', line_width=2, alpha=0.75)
                p.diamond([spir,], summaries[spir].loc[var]['median'], color='white', size=9, alpha=1)
                p.diamond([spir,], co_σ, color='black', size=5)

        p.xgrid.grid_line_color = None
        #p.ygrid.grid_line_color = None
        p.output_backend = 'svg'
        plots.append(p)
        #bokeh.io.show(p)
        
print("FOR ALL PLOTS: White diamond is median Bayesian parameter sample, grey box is 3%-97% highest posterior density interval for parameter samples")
print("FOR ALL PLOTS: Black diamond is median summary statistic for non-parametric bootstrapping on OLS, black line is 2.5%-97.5% percentile for bootstrapped summary stat\n")
print('Area regressions (log-log slope, intercept (mm^2), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[0:4], ncols=2))
print('Depth regressions (log-log slope, intercept (mm), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[4:8], ncols=2))
print('Area/Depth regressions (log-log slope, intercept (G_diff), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[8:12], ncols=2))
print('Area^2/Depth regressions (log-log slope, intercept (G_adv), log-log σ)')
bokeh.io.show(bokeh.layouts.gridplot(plots[12:16], ncols=2))

#print('Area regressions (log-log slope, intercept (mm^2), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[0:3], ncols=3))
#print('Depth regressions (log-log slope, intercept (mm), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[3:6], ncols=3))
#print('Area/Depth regressions (log-log slope, intercept (G_diff), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[6:9], ncols=3))
#print('Area^2/Depth regressions (log-log slope, intercept (G_adv), log-log σ)')
#bokeh.io.show(bokeh.layouts.gridplot(plots[9:12], ncols=3))
FOR ALL PLOTS: White diamond is median Bayesian parameter sample, grey box is 3%-97% highest posterior density interval for parameter samples
FOR ALL PLOTS: Black diamond is median summary statistic for non-parametric bootstrapping on OLS, black line is 2.5%-97.5% percentile for bootstrapped summary stat

Area regressions (log-log slope, intercept (mm^2), log-log σ)
Depth regressions (log-log slope, intercept (mm), log-log σ)
Area/Depth regressions (log-log slope, intercept (G_diff), log-log σ)
Area^2/Depth regressions (log-log slope, intercept (G_adv), log-log σ)
In [12]:
#To save our individual svgs of some a plot:
#bokeh.io.export_svgs(plots[7], filename='C:/Users/jwagne/Downloads/coef_var_depth.svg')
#bokeh.io.export_svgs(plots[3], filename='C:/Users/jwagne/Downloads/coef_var_area.svg')

Nice! With these plots finished, we will also convert the regression results into a dataframe format for a table.

In [13]:
#CI_a, CI_b, CI_σ, CI_co_σ, slope, intercept, σ, co_σ

header = ['morphology', 'converted_b_units', 'spiracle', 'a', 'a_2.5', 'a_97.5', 'b', 'b_2.5', 'b_97.5', 'b_converted', 'b_converted_2.5', 'b_converted_97.5', 'σ', 'σ_2.5', 'σ_97.5', 'co_σ', 'co_σ_2.5', 'co_σ_97.5',
          'a_median_bayes', 'a_hpd_3_bayes', 'a_hpd_97_bayes', 'b_median_bayes', 'b_hpd_3_bayes', 'b_hpd_97_bayes', 'σ_median_bayes', 'σ_hpd_3_bayes', 'σ_hpd_97_bayes', 'co_σ_median_bayes', 'co_σ_hpd_3_bayes', 'co_σ_hpd_97_bayes']
m = 'log area (mm^2)'
s = 'S'
index = 0
for m in summaries_nonparametric:
    for s in summaries_nonparametric[m]:

        c = 1
        morph = m[4:]
        if morph =='dist':
            morph = 'dist (mm)'
        if m == 'log area/dist':
            c = 0.178*404*(1/10)
            morph = 'g_diff (nmol/(sec*kPa))'
        elif m == 'log area^2/dist':
            c = (1/(3.1415*8*1.86*(10**(-8))))*(0.1**3)
            morph='g_adv (cm^3/(sec*kPa))'
        
        row = {header[0]:   m,
               header[1]:   morph,
               header[2]:   s,
               #a
               header[3]:   summaries_nonparametric[m][s][4],
               header[4]:   summaries_nonparametric[m][s][0][0],
               header[5]:   summaries_nonparametric[m][s][0][1],
               #b
               header[6]:   summaries_nonparametric[m][s][5],
               header[7]:   summaries_nonparametric[m][s][1][0],
               header[8]:   summaries_nonparametric[m][s][1][1],
               
               header[9]:   c*(10**summaries_nonparametric[m][s][5])   ,
               header[10]:   c*(10**summaries_nonparametric[m][s][1][0]),
               header[11]:   c*(10**summaries_nonparametric[m][s][1][1]),
               
               #σ
               header[12]:   summaries_nonparametric[m][s][6][0],
               header[13]:   summaries_nonparametric[m][s][2][0],
               header[14]:  summaries_nonparametric[m][s][2][1],
               #σ_co
               header[15]:  summaries_nonparametric[m][s][7][0],
               header[16]:  summaries_nonparametric[m][s][3][0],
               header[17]:  summaries_nonparametric[m][s][3][1],
               
               header[18]:  all_summaries[m][s].loc['a']['median'],
               header[19]:  all_summaries[m][s].loc['a']['hpd_3%'],   
               header[20]:  all_summaries[m][s].loc['a']['hpd_97%'],
               header[21]:  all_summaries[m][s].loc['b']['median'],   
               header[22]:  all_summaries[m][s].loc['b']['hpd_3%'],   
               header[23]:  all_summaries[m][s].loc['b']['hpd_97%'],   
               header[24]:  all_summaries[m][s].loc['sigma']['median'],
               header[25]:  all_summaries[m][s].loc['sigma']['hpd_3%'],
               header[26]:  all_summaries[m][s].loc['sigma']['hpd_97%'],
               header[27]:  all_summaries[m][s].loc['coef_var']['median'],
               header[28]:  all_summaries[m][s].loc['coef_var']['hpd_3%'], 
               header[29]:  all_summaries[m][s].loc['coef_var']['hpd_97%'],
              }
        if index == 0:
            df = pd.DataFrame(row, index=[index])
        else:
            df = df.append(pd.DataFrame(row, index=[index]))
        index += 1
df.head(5)
Out[13]:
morphology converted_b_units spiracle a a_2.5 a_97.5 b b_2.5 b_97.5 b_converted ... a_hpd_97_bayes b_median_bayes b_hpd_3_bayes b_hpd_97_bayes σ_median_bayes σ_hpd_3_bayes σ_hpd_97_bayes co_σ_median_bayes co_σ_hpd_3_bayes co_σ_hpd_97_bayes
0 log area (mm^2) area (mm^2) S 0.705576 0.562905 0.851876 -0.695591 -0.783258 -0.602272 0.201562 ... 0.857 -0.695337 -0.853 -0.553 0.204583 0.115 0.347 0.997717 0.523 1.914
1 log area (mm^2) area (mm^2) T 0.751667 0.536277 0.991070 -1.198240 -1.365652 -1.072011 0.063352 ... 0.928 -1.194020 -1.349 -1.028 0.227815 0.133 0.373 3.478430 1.797 6.451
2 log area (mm^2) area (mm^2) 1 0.764034 0.503852 1.014768 -1.155312 -1.351592 -0.988989 0.069934 ... 0.969 -1.147330 -1.373 -0.933 0.298381 0.179 0.489 4.122820 2.225 8.186
3 log area (mm^2) area (mm^2) 2 0.759870 0.458095 1.040989 -1.275840 -1.523675 -1.056977 0.052986 ... 0.987 -1.272285 -1.501 -1.039 0.324187 0.191 0.537 5.942770 2.842 12.237
4 log area (mm^2) area (mm^2) 3 0.588662 0.341499 0.792693 -1.328242 -1.519145 -1.146570 0.046963 ... 0.806 -1.332050 -1.524 -1.162 0.257950 0.151 0.421 5.418770 2.689 10.314

5 rows × 30 columns

In [14]:
#To copy the table to the clipboard to paste into another software!
#df.to_clipboard()

Great! With these summary stats in hand, we can also do some plots of all of these regressions themselves. Note that for all of these plots the grey ranges show the range of regression output values (either from sampling or bootstrapping) at the 2.5th-97.5th percentiles. Central line is the 45th-55th percentile range for Bayesian regression values, or the OLS line with non-resampled data for the the bootstrap regression plots. The two grey regions for the Bayesian regressions are the 95th and 80th percentile ranges for the posterior predictive distribution for the model. The black dots are the measured data points.


Plot for species averaged mass vs species averaged spiracle area (log transformed)

In [15]:
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log area (mm^2)'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log area (mm^2)', 2/3, width=225, height=225, point_size=9)
Bayesian regression:
Non-parametric bootstrapping:
In [16]:
#To save our individual svgs of some a plot:
#p = b_plots[0]
#p.outline_line_color = None
#p.yaxis.minor_tick_line_color = None
#p.xaxis.minor_tick_line_color = None
#bokeh.io.export_svgs(p, filename='C:/Users/jwagne/Downloads/S_area_nonparametric.svg')

Species averaged mass vs species averaged spiracle depth (log transformed)

In [17]:
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log dist'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log dist', 1/3, width=225, height=225, point_size=9)
Bayesian regression:
Non-parametric bootstrapping:

Species averaged mass vs $\frac{\mathrm{species\,averaged\,area}}{\mathrm{species\,averaged\,depth}}$ (log transformed)

In [18]:
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log area/dist'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log area/dist', 1/3, width=225, height=225, point_size=9)
Bayesian regression:
Non-parametric bootstrapping:

Species averaged mass vs $\frac{(\mathrm{species\,averaged\,area})^2}{\mathrm{species\,averaged\,depth}}$ (log transformed)

In [19]:
print("Bayesian regression:")
bokeh.io.show(bokeh.layouts.gridplot(all_regressions['log area^2/dist'], ncols=4))
print("Non-parametric bootstrapping:")
b_plots = plotting_utils.make_plot(df_averages, 'log area^2/dist', 1, width=225, height=225, point_size=9)
Bayesian regression:
Non-parametric bootstrapping:
In [20]:
%reload_ext watermark
%watermark -p bokeh,cmdstanpy
bokeh 1.4.0
cmdstanpy 0.8.0
In [ ]: