1. INTRODUCTION

(Erp, Oberski, and Mulder 2018) provided an interesting comparison of different shrinkage priors in a regularized penalization task. The work visualized and described the shrinkage priors well and compared them in a variety of conditions. However, the paper also contained some deficiencies regarding the simulated data, convergence diagnostics and predictive performance presentation. The simulation conditions were not sparse enough to sufficiently investigate the shrinkage abilities of the priors. Moreover, convergence problems were examined only through trace plots and in some cases even by excluding simulation draws. The paper also used a large step size (1) and a low target acceptance rate (0.85). In addition, the number of computed prediction mean square error (PMSE) values in each simulation condition was inadequate, and their presentation made prior comparison difficult.

The notebook presented here is a special assignment made under the supervision of Prof. Aki Vehtari at Aalto University, aimed at providing tools and ideas to fix these shortcomings. The organization is as follows. Firstly, the data used is described in Section 2. The data is as presented in (Erp, Oberski, and Mulder 2018) but with a sparser representation and a higher number of test samples. Secondly, after the Stan models are set up in Section 3, the notebook describes in Section 4 how parallel coordinate and scatter plots can be used to diagnose convergence. The section concentrates on the Horseshoe and Regularized Horseshoe priors as example models, based on their convergence difficulties in (Erp, Oberski, and Mulder 2018). Moreover, statistics generated from multiple replications of the data are presented. In Section 5, model comparison is presented using PMSE and log pointwise predictive density (LPPD) values for test data. Originally, the variable selection procedure described in (Erp, Oberski, and Mulder 2018) was also planned to be replaced with the projpred package (Piironen and Vehtari 2017), but the high k-values prevented using it in this assignment. Finally, the conclusions of the assignment are discussed.

Another goal of the assignment is to provide a set of functions which can be utilized not only in the specific case described in this notebook, but also with a variety of different models and data sets by only renaming the directory paths. Moreover, the assignment details how Aalto University students can utilize Triton for Stan model fitting in tasks which are practically infeasible on personal computers. The code for Triton is not illustrated in the notebook, but it can be obtained from the Triton folder along with usage instructions.

Next, let us start by including the necessary libraries and functions and by choosing a working directory:

# Acquire packages
library(rstan)
library(gridExtra)
library(grid)
library(bayesplot)
library(knitr)
library(ggplot2)
library(parallel)
library("reshape2")
options(mc.cores = parallel::detectCores())
knitr::opts_knit$set(root.dir = normalizePath("~/Documents/Rbayes/Bayes_project_files/notebook"))
# Import from utilities folder
source("model_diagnostics.R")
source("data_generation.R")
source("model_generator.R")
source("choose_fits.R")

2. DATA

In (Erp, Oberski, and Mulder 2018), six different data sets, called simulation conditions, were illustrated. This notebook uses the first condition but with two modifications. Firstly, the number of coefficients is increased to \(38\) by padding \(30\) zeros at the end of the original coefficients. The reason for this is to improve the comparison of the shrinkage abilities of the priors. In addition, the test set is enlarged to \(2000\) samples to attain more accurate distributions for PMSEs and LPPDs. From here on, the data set will be referred to as modified simulation condition 1 and can be generated with the following code:

# Creates "project_data.RData"
data_generation()
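For illustration, data of this shape could be generated along the following lines. This is a minimal sketch: the actual coefficient values, noise level and training-set size are defined in data_generation.R, and the concrete values below are assumptions.

```r
# Sketch of modified simulation condition 1 (illustrative values only):
# a few hypothetical non-zero coefficients padded with 30 zeros, 2000 test samples
set.seed(1)
n_train <- 50                          # hypothetical training-set size
n_test  <- 2000                        # enlarged test set
beta    <- c(rep(0.5, 8), rep(0, 30))  # 38 coefficients in total
p       <- length(beta)

X_train <- matrix(rnorm(n_train * p), n_train, p)
X_test  <- matrix(rnorm(n_test * p), n_test, p)
y_train <- as.vector(X_train %*% beta + rnorm(n_train))
y_test  <- as.vector(X_test %*% beta + rnorm(n_test))
```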

3. STAN MODELS AND FITS

As discussed in the introduction, the special assignment has two main goals: to provide a convergence diagnosis for the Horseshoe prior and to carry out a model comparison. The former is obtained by experimenting with three different parametrizations of the Horseshoe prior and with the Regularized Horseshoe. The parametrizations focus on reforming the Half-Cauchy distribution, which is the main reason for the convergence problems of the Horseshoe prior. The latter is performed with the same 9 priors (including the Horseshoe prior) as in (Erp, Oberski, and Mulder 2018) for simulation condition 1 with a full Bayes approach. In total, 11 Stan models are created. The specifics of the Horseshoe prior parametrizations are as follows:

Horseshoe 1: The same as given in (Erp, Oberski, and Mulder 2018) but with an added prior for the intercept \[\beta_0 \sim \mathcal{N}(0, 10) \] and with the test set log likelihood added as a generated quantity for the LPPD computation. The original Stan code by Sara van Erp is slightly modified by using vectorized representations to speed up computations and by swapping the variable names of \(\tau\) and \(\lambda\).

Horseshoe 2: The same as Horseshoe 1 but with the tangent-function parametrization of the Half-Cauchy distribution as introduced in (Betancourt 2018).

Horseshoe 3: The same as Horseshoe 1 but with the inverse-Gamma parametrization of the Half-Cauchy distribution, again described in (Betancourt 2018).
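Both reformulations can be illustrated by direct simulation: the tangent form draws a half-Cauchy variable through its inverse CDF, while the inverse-Gamma form uses the scale-mixture representation of the half-Cauchy. The snippet below is a standalone check of these identities, not the Stan code itself.

```r
# Simulate a standard half-Cauchy (scale s = 1) with both parametrizations
set.seed(2)
n <- 1e5
s <- 1

# Tangent parametrization: inverse CDF of the half-Cauchy
u <- runif(n)
lambda_tan <- s * tan(pi / 2 * u)

# Inverse-Gamma mixture: lambda^2 | a ~ Inv-Gamma(1/2, 1/a), a ~ Inv-Gamma(1/2, 1/s^2);
# an Inv-Gamma(shape, b) draw is obtained as 1 / rgamma(n, shape, rate = b)
a <- 1 / rgamma(n, shape = 0.5, rate = 1 / s^2)
lambda_ig <- sqrt(1 / rgamma(n, shape = 0.5, rate = 1 / a))

# Both medians should be close to the half-Cauchy median s * tan(pi / 4) = 1
c(median(lambda_tan), median(lambda_ig))
```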

The other 8 models are otherwise the same as in (Erp, Oberski, and Mulder 2018) but, like the Horseshoe parametrizations, include the computation of test set log-likelihoods and vectorized computing of \(y_{test}\) values. All the models can be created with the following code block:

# Import from stan_files folder
stan_files <- list.files()[grep(".stan", list.files())]
model_generator(project_data, stan_files, 0)

Stan fits for all models are generated using the \(sampling\) function with the number of iterations as 8000, number of chains as 4, adapt_delta as 0.999 and max_treedepth as 15. The high adaptation delta value is necessary not only for the Horseshoe prior, which is prone to convergence problems, but also for the other models to ensure proper convergence. The number of iterations is higher than in (Erp, Oberski, and Mulder 2018) in order to prevent occurrences of low effective sample size (\(N_{eff}\)) values. Fits can be generated by again exploiting the \(model\_generator\) function. In this notebook, only fits for the Horseshoe parametrizations are necessary for the divergent transition visualization, and they are obtained as follows:

load("project_data.RData")
model_files <- c("horseshoe_1.SM", "horseshoe_2.SM", "horseshoe_3.SM")
model_generator(project_data, model_files, 1)
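For reference, the settings described above correspond to a sampling call of roughly the following form; `stan_model` and `stan_data` are placeholder names for a compiled model and its data list, so the call itself is left commented out.

```r
# Control settings used for all fits in this notebook
ctrl <- list(adapt_delta = 0.999, max_treedepth = 15)

# fit <- sampling(stan_model, data = stan_data,
#                 iter = 8000, chains = 4, control = ctrl)
```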

4. CONVERGENCE DIAGNOSTICS FOR HORSESHOE AND REGULARIZED HORSESHOE PRIORS

This section consists of two parts. The first demonstrates how parallel coordinate and scatter plots can be utilized in divergent transition visualization. In addition, Rhat and \(N_{eff}\) values extracted from the Stan fit are illustrated. The second part performs a wider statistical diagnosis with multiple replications of the data. As previously discussed, all experiments are performed with the Horseshoe prior, and the main objective of the section is to use the discussed methods to obtain an optimal parametrization for the prior.

4.1 Experiments with 1 replication

Rhat and \(N_{eff}\) values are standard measures for the convergence of Stan models (Gelman et al. 1995). However, in some cases the models may also suffer from divergent transitions, which need to be examined with other tools. In (Erp, Oberski, and Mulder 2018), the authors investigated trace plots and in some cases dropped out simulations with a high number of divergent transitions. This assignment proposes the use of parallel coordinate and scatter plots, as suggested in (Gabry et al. 2017), in order to find out the nature of the transitions, that is, whether they are false positives or not.

In parallel coordinate plots, the MCMC draws of a Stan model are illustrated in two dimensions, which are the parameters and their draw samples. The plots can be generated using the \(mcmc\_parcoord\) function, and the starting points of divergent transitions can be made visible using NUTS parameters (\(nuts\_params\)). Generally, parallel coordinates enable investigating divergent trajectories which are concentrated at the position of a certain parameter and do not follow the same patterns as the other draw trajectories. However, parallel coordinates become hard to read when the model has a high number of parameters and the parameter values have high variance. Such is the case in this assignment, and in order to counter these problems, all draws in each chain are normalized between 0 and 1 and only the 6 variables with the lowest \(N_{eff}\) values are chosen for investigation. To ensure a clearer visualization of trajectories, the draw values are finally transformed logarithmically with an added bias of 0.5 to ensure positivity. Non-divergent and divergent trajectories are plotted in black and green, respectively.
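The transformation applied before plotting can be sketched as follows for a generic draws matrix (iterations by parameters); the per-chain handling and the \(N_{eff}\)-based variable selection of the actual plotting code are omitted here.

```r
# Normalize each parameter's draws to [0, 1], then take logs with bias 0.5
set.seed(3)
draws <- matrix(rnorm(4000 * 6), nrow = 4000, ncol = 6)  # one chain, 6 parameters
normalized  <- apply(draws, 2, function(x) (x - min(x)) / (max(x) - min(x)))
transformed <- log(normalized + 0.5)  # values stay within [log(0.5), log(1.5)]
```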

Another option for examining divergent transitions is to generate scatter plots for the draw samples of two chosen parameters. As previously, \(nuts\_params\) can then be exploited to detect the samples which are linked to divergent transitions. If these samples are concentrated in some location of the scatter plot, model convergence can be questioned. Here, the same draw values, parameters and coloration of divergent transitions as in the parallel coordinate plots are used in three different scatter plots for each Horseshoe parametrization.

fitfls <- list.files()[grep(".FIT", list.files())]
dt_visual(fitfls[1])
[1] "Model name: horseshoe_1 "
[1] "Number of RHAT values above 1.1: 0"
[1] "Number of divergent transitions: 325"
[1] "Minimum N_eff value and corresponding variable: 68, lambda[1]"

dt_visual(fitfls[2])
[1] "Model name: horseshoe_2 "
[1] "Number of RHAT values above 1.1: 0"
[1] "Number of divergent transitions: 31"
[1] "Minimum N_eff value and corresponding variable: 3875, lp__"

dt_visual(fitfls[3])
[1] "Model name: horseshoe_3 "
[1] "Number of RHAT values above 1.1: 0"
[1] "Number of divergent transitions: 16"
[1] "Minimum N_eff value and corresponding variable: 2168, tau"

The experiments on the three different parametrizations show that in each case, divergent transitions are present and form concentrations in the scatter plots to some extent. Therefore, the transitions are clearly not false positives. Still, the visualizations illustrate that the results differ between parametrizations. The first parametrization is clearly the worst, being the only one which exhibits clear divergent trajectories and concentrations, and its lowest \(N_{eff}\) value is over a hundred times lower than the number of iterations. The third parametrization performs the best, as it includes the lowest number of divergent transitions.

However, as all the used data is simulated, it might be the case that for this particular simulation the first parametrization happens to perform worse and the third parametrization better. In addition, only 1 replication of the data might be insufficient to give reliable convergence diagnostics. Consequently, in the setting of this assignment, parallel coordinate and scatter plots provide some insight into the problem but are insufficient for selecting the optimal parametrization. A wider statistical analysis with multiple data replications is required, and it is performed next.

4.2 Experiments with 500 replications

To further compare the three parametrizations, a more statistical approach is proposed. In this approach, each model is fitted to 500 different replications of the data, and for each fit, Rhat values, minimum \(N_{eff}\) values with their corresponding parameters, and the numbers of divergent transitions are collected. This time, results for the Regularized Horseshoe are also included. Furthermore, PMSE and LPPD values are generated for each fit in order to study the models' predictive performance. All the values were computed with Triton and can be collected by loading the following Gzip archive, which also includes statistics for all the other 7 shrinkage priors mentioned in Section 3.

load("fit_diagnostics")
chosen_fit_diagn <- choose_fits(fit_diagn, c("horseshoe_1.SM", "horseshoe_2.SM", "horseshoe_3.SM", "regularized_horseshoe.SM"), 500)
show_diagnostics(chosen_fit_diagn,0,show_neff = TRUE)

Number of RHAT values above 1.1
mean sd max min
horseshoe_1.SM 0.1 2.0 44 0
horseshoe_2.SM 0.5 8.3 163 0
horseshoe_3.SM 0.0 0.1 2 0
regularized_horseshoe.SM 0.0 0.0 0 0
Statistics of divergent transitions
mean sd max min
horseshoe_1.SM 52.6 66.2 984 10
horseshoe_2.SM 53.7 107.0 1607 10
horseshoe_3.SM 41.4 43.6 644 7
regularized_horseshoe.SM 0.0 0.1 1 0
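The mean, sd, max and min rows of the tables above can be reproduced from the collected per-replication values with a small helper; the divergence counts below are hypothetical.

```r
# Summarize a per-replication diagnostic vector as in the tables above
summarize_reps <- function(x) {
  c(mean = mean(x), sd = sd(x), max = max(x), min = min(x))
}

divergences <- c(10, 25, 984, 40, 35)  # hypothetical counts for 5 replications
round(summarize_reps(divergences), 1)
```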

Two main conclusions can be drawn from the statistics. Firstly, each Horseshoe parametrization suffers from convergence problems. The mean number of divergent transitions is over 40 for each model, minimum \(N_{eff}\) values may fall even below 100 and in some fits, Rhat values may exceed the threshold 1.1. Contrary to the results in the previous section, parametrizations 1 and 2 now perform equally. The outcome clearly emphasizes that parallel coordinate and scatter plots have limitations regarding convergence diagnosis. Considering the parametrization ranking, Horseshoe 3 has the best statistics in terms of the Rhat, \(N_{eff}\) and divergent transition distributions. The result is as expected, as the parametrization was suggested in (Piironen and Vehtari 2016).

More importantly, the statistics show that in terms of convergence, none of the Horseshoe parametrizations can match the Regularized Horseshoe. In all replications, no Rhat values above 1.1 emerge, the smallest \(N_{eff}\) values are all higher than 1000 and only one divergent transition is found. The result regarding divergent transitions is also better than in (Erp, Oberski, and Mulder 2018), which implies that the high adapt_delta value and number of iterations are necessary for model convergence. However, it is also important to examine how the models perform with unseen data. As previously discussed, this assignment uses PMSE and LPPD values for the predictive performance comparison, and their distributions are shown next.

show_diagnostics(chosen_fit_diagn,1,show_neff = FALSE)

The above distributions show that the models have very similar predictive performance. In the case of the Horseshoe parametrizations, the distributions are almost identical, which is expected as all the parametrizations describe the same shrinkage prior. The similarity of the distributions implies that if only the predictive performance of the model is considered, the choice of model makes no drastic difference. Moreover, the similarity of the predictive performances also calls into question the effect of the divergent transitions. One possible explanation is that the mapping from the parameters to the predictive distributions is such that the bias in the parameter posterior indicated by the divergences does not have an effect. This can happen, for example, if the unexplored part of the parameter posterior maps to very similar predictions as the other parts of the posterior.
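For reference, the two predictive measures used in this notebook follow the standard definitions: with \(n_{test}\) test points, \(S\) posterior draws \(\theta^{(s)}\) and posterior mean predictions \(\hat{y}_i\),

\[ \text{PMSE} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left( y_i - \hat{y}_i \right)^2, \qquad \text{LPPD} = \sum_{i=1}^{n_{test}} \log \left( \frac{1}{S} \sum_{s=1}^{S} p\left(y_i \mid \theta^{(s)}\right) \right). \]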

5. MODEL COMPARISON

In (Erp, Oberski, and Mulder 2018), model comparisons were provided using the mean and standard deviation values of the PMSEs. As the mean values of the different models were very close to each other, the comparison was somewhat difficult. In addition, no further visualization was provided. This section aims to provide a tool for a clearer comparison, using the PMSE and LPPD value distributions. Based on the experiments in the previous section, the chosen parametrization for the Horseshoe prior is the third one. All results shown in this section are again computed from 500 simulation replications with Triton. Let us start by visualizing the distributions:

# Choose the fits of the 9 compared models
chosen_fit_diagn <- choose_fits(fit_diagn, fit_diagn$models[c(1, 4:11)], 500)
show_diagnostics(chosen_fit_diagn,1,show_neff = FALSE)

As can be seen from the above figures, the predictive performances of the shrinkage priors do differ. Again, the outcome deviates from the results illustrated in (Erp, Oberski, and Mulder 2018). However, since so many models are involved, it is difficult to assess which performs best. In order to provide a more informative model comparison, statistics for the rankings of the PMSE and LPPD values are also presented. In other words, ranking values 1 to 9 are assigned to the models in each replication in such a manner that value 1 is assigned to the lowest PMSE value or lowest absolute LPPD value and 9 to the highest. The results are presented in the two tables below:

show_diagnostics(chosen_fit_diagn,2,show_neff = FALSE)

Ranking values of PMSES (lower is better)
mean sd max min
elastic_net_cauchy.SM 8.2 1.2 9 1
horseshoe_3.SM 2.2 1.6 9 1
hyperlasso_cauchy.SM 4.7 1.3 7 1
lasso_cauchy.SM 5.3 1.1 9 1
normal_mixture_bernoulli.SM 3.8 1.7 9 1
normal_mixture_uniform.SM 3.9 1.6 9 1
regularized_horseshoe.SM 2.0 1.4 9 1
ridge_cauchy.SM 7.8 1.1 9 2
student_cauchy.SM 7.1 1.5 9 1
Ranking values of LPPDS (lower is better)
mean sd max min
elastic_net_cauchy.SM 8.4 1.1 9 2
horseshoe_3.SM 2.2 1.3 9 1
hyperlasso_cauchy.SM 4.6 1.3 8 1
lasso_cauchy.SM 5.3 1.1 9 1
normal_mixture_bernoulli.SM 3.8 1.7 9 1
normal_mixture_uniform.SM 3.9 1.6 9 1
regularized_horseshoe.SM 1.9 1.5 9 1
ridge_cauchy.SM 7.8 0.9 9 2
student_cauchy.SM 7.0 1.4 9 1
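The ranking procedure behind the tables above can be sketched as follows for a small hypothetical matrix of PMSE values, with one row per replication and one column per model.

```r
# Rank models within each replication (1 = lowest PMSE), then average the ranks
pmse <- matrix(c(1.0, 1.2, 0.9,
                 1.1, 1.4, 0.9),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("model_a", "model_b", "model_c")))
rankings <- t(apply(pmse, 1, rank))  # one ranking per replication
colMeans(rankings)                   # mean ranking per model
```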

The tables depict that the Horseshoe and Regularized Horseshoe have the best predictive performance. The success of the Horseshoe priors is expected, since their shrinkage abilities have been well covered in the literature (Carvalho, Polson, and Scott 2010; Piironen, Vehtari, and others 2017). However, it must be noted that, just as in (Erp, Oberski, and Mulder 2018), the sparsity parameter of the Regularized Horseshoe prior has a prior which is centered on the correct number of non-zero parameters in the model. The worst predictive performance is attained with the Elastic Net and the second worst with Ridge Cauchy. All models are ranked in a similar order in both the LPPD and PMSE cases. To ensure that the results are valid, Rhat values and divergent transitions are also illustrated for each model:

show_diagnostics(chosen_fit_diagn,0,show_neff = FALSE)

Number of RHAT values above 1.1
mean sd max min
elastic_net_cauchy.SM 0 0.0 0 0
horseshoe_3.SM 0 0.1 2 0
hyperlasso_cauchy.SM 0 0.0 0 0
lasso_cauchy.SM 0 0.0 0 0
normal_mixture_bernoulli.SM 0 0.0 0 0
normal_mixture_uniform.SM 0 0.0 0 0
regularized_horseshoe.SM 0 0.0 0 0
ridge_cauchy.SM 0 0.0 0 0
student_cauchy.SM 0 0.0 0 0
Statistics of divergent transitions
mean sd max min
elastic_net_cauchy.SM 0.0 0.2 4 0
horseshoe_3.SM 41.4 43.6 644 7
hyperlasso_cauchy.SM 0.4 4.3 86 0
lasso_cauchy.SM 0.0 0.0 0 0
normal_mixture_bernoulli.SM 0.0 0.0 0 0
normal_mixture_uniform.SM 0.0 0.0 0 0
regularized_horseshoe.SM 0.0 0.1 1 0
ridge_cauchy.SM 0.0 0.0 0 0
student_cauchy.SM 0.0 0.2 3 0

6. CONCLUSIONS

The conclusions of the assignment can be briefly summarized as follows:

1: When a model suffers from convergence difficulties and includes divergent transitions, the adaptation delta should be set as high as possible (0.999) and the number of iterations to 8000, and if the transitions still exist, scatter and parallel coordinate plots should be used. However, using only plots may be insufficient, and therefore wider statistics for the transitions as well as for the Rhat and \(N_{eff}\) values are also required.

2: Using the modified simulation condition 1, the best parametrization for the Horseshoe prior is obtained by reforming the Half-Cauchy distribution with the inverse-Gamma parametrization. However, the Regularized Horseshoe prior is still a better choice, as it produces virtually no divergent transitions in the setting of this assignment. The predictive performances of the priors are still very close to each other.

3: When comparing different shrinkage priors with the modified simulation condition 1, the differences in model performance can be seen visually from the PMSE and LPPD distributions, and the best model can be determined based on their values. The Horseshoe and Regularized Horseshoe were found to be the top shrinkage priors for this task.

REFERENCES

Betancourt, Michael. 2018. “Fitting the Cauchy Distribution.” https://betanalpha.github.io/assets/case_studies/fitting_the_cauchy.html.

Carvalho, Carlos M, Nicholas G Polson, and James G Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97 (2). Oxford University Press: 465–80.

Erp, Sara van, Daniel L Oberski, and Joris Mulder. 2018. “Shrinkage Priors for Bayesian Penalized Regression.” Open Science Framework.

Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2017. “Visualization in Bayesian Workflow.” arXiv Preprint arXiv:1709.01449.

Gelman, Andrew, John B Carlin, Hal S Stern, and Donald B Rubin. 1995. Bayesian Data Analysis. Chapman; Hall/CRC.

Piironen, Juho, and Aki Vehtari. 2016. “On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior.” arXiv Preprint arXiv:1610.05559.

———. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3). Springer: 711–35.

Piironen, Juho, Aki Vehtari, and others. 2017. “Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors.” Electronic Journal of Statistics 11 (2). The Institute of Mathematical Statistics; the Bernoulli Society: 5018–51.

---
title: "CONVERGENCE DIAGNOSIS AND COMPARISON OF SHRINKAGE PRIORS"
output: html_notebook
bibliography: ref.bib
author: Tuomas Kaseva
date: 14.10.2018
---

# 1. INTRODUCTION

[@van2018shrinkage] provided an interesting comparison of different shrinkage priors in a regularized penalization task. The work visualized and described shrinkage priors well and compared them in a variety of conditions. However, the paper contained also some deficiencies regarding simulated data, convergence diagnostics and predictive performance presentation. The simulation conditions were not sparse enough to sufficiently investigate shrinkage abilities of the priors. Moreover, convergence problems were investigated only by investigating the trace plots and in some cases even by excluding simulation draws. The paper also used a large step size (1) and a low target acceptance rate (0.85). In addition, number of computed prediction mean square error (PMSE) values was inadequate in each simulation condition and their presentation difficult for a prior comparison.

The notebook presented here is a special assignment made under supervision of **Prof. Aki Vehtari** in Aalto university, aimed to provide tools and ideas to fix these shortcomings. The organization is as follows. Firstly, the data used is described in the section 2. The data is as presented in [@van2018shrinkage] but with more sparse representation and with a higher number of test samples.
Secondly, the notebook describes how parallel coordinate and scatter plots can be used to diagnose convergence in section 3.  The section concentrates on using Horseshoe and Regularized horseshoe priors as example models based on their convergence difficulties in  [@van2018shrinkage]. Moreover, statistics generated from multiple replications of the data is presented.  In the section 4, model comparison is presented using PMSE and log pointwise predictive density (LPPD) values for test data. Originally, also the variable selection procedure described in [@van2018shrinkage] was planned to replace with projpred-package [@piironen2017comparison] but the high k-values prevented using it in this assignment. Finally, the conclusions of the assignment are discussed.

Another goal of the assignment is to provide a set of functions which can be utilized not only with the specific case described in this notebook, but also with a variety of different models and data sets by only renaming the directory paths. Moreover, the assignment details how Aalto-university students can utilize Triton in Stan model fittings in tasks which are practically infeasible with personal computers. The codes for Triton are not illustrated in the notebook but they can be attained from Triton folder along with usage instructions. 

Next, let us start by including necessary libraries, functions and by choosing a working directory: 

```{r}
# Acquire packages
library(rstan)
library(gridExtra)
library(grid)
library(bayesplot)
library(knitr)
library(ggplot2)
library(parallel)
library("reshape2")
options(mc.cores = parallel::detectCores())
```

```{r setup}
knitr::opts_knit$set(root.dir = normalizePath("~/Documents/Rbayes/Bayes_project_files/notebook"))
```


```{r}
# Import from utilities folder
source("model_diagnostics.R")
source("data_generation.R")
source("model_generator.R")
source("choose_fits.R")

```

# 2. DATA

In [@van2018shrinkage] six different data sets, called as simulation conditions were illustrated. This notebook uses the first condition but with two modifications. Firstly, the number of coefficients is increased to $38$ by padding $30$ zeros in the end of the original coefficients. The reason for this is to improve the comparison of shrinkage abilities of the priors. In addition, the test set is enlarged to $2000$ samples to attain more accurate distributions for PMSEs and LPPDs. From here on, the data set will be referred as **modified simulation condition 1** and can be generated with the following code:

```{r}
# Creates "project_data.RData"
data_generation()
```

# 3. STAN MODELS AND FITS

As discussed into the introduction, the special assignment has two main goals: Provide a convergence diagnosis for Horseshoe prior and to carry out a model comparison. The former is obtained by experimenting with three different parameterizations for Horseshoe prior and with Regularized Horseshoe. The parametrizations focus on reforming the Half-Cauchy distribution which is the main reason for convergence problems in Horseshoe prior. The latter is performed with the same 9 priors (including Horseshoe prior) as in [@van2018shrinkage] for simulation condition 1 with full Bayes approach. In total, 11 Stan models is created. The specifics of Horseshoe prior parametrizations are as follows:

**Horseshoe 1:** The same as given in [@van2018shrinkage] but with added prior for intercept $$\beta_0 \sim  \mathcal{N}(0, 10) $$
and with added generated quantity as test set log likelihood for LPPD computation. The original Stan code by Sara van Erp is slightly modified by using vectorized representations for speeding up computations and swapping the variable names of $\tau$ and $\lambda$. 

**Horseshoe 2.** The same as Horseshoe 1 but with Tan-function parametrization for Half-Cauchy distribution as introduced by [@Betancourt].  

**Horseshoe 3.** The same as Horseshoe 1 but with inverse-Gamma parametrization for Half-Cauchy distribution again described by [@Betancourt]


**Other 8 models** are otherwise the same as in [@van2018shrinkage] but as Horseshoe parametrizations, include computation of test set log-likelihoods and vectorized computing of $y_{test}$ values. All the models can be created with following code block:

```{r}
# Import from stan_files folder
stan_files <- list.files()[grep(".stan", list.files())]
model_generator(project_data, stan_files, 0)
```

**Stan fits** for all models are generated using $sampling$ function with number of iterations as 8000, number of chains as 4, adapt_delta as 0.999 and max_treedepth ad 15. The high value of adaptation delta is necessary for not only Horseshoe prior which is prone to convergence problems but to also for the other models to ensure proper convergence. Number of iterations is higher than in [@van2018shrinkage] in order to prevent occurences of low effective sample size ($N_{eff}$) values. Fits can be generated exploiting again $model.generator$ function. In this notebook, only fits for horseshoe parametrizations are necessary for divergent transition visualization and they are obtained as follows:

```{r}
load("project_data.RData")
model_files <- c("horseshoe_1.SM", "horseshoe_2.SM", "horseshoe_3.SM")
model_generator(project_data, model_files, 1)
```



# 4. CONVERGENCE DIAGNOSTICS FOR HORSESHOE AND REGULARIZED HORSESHOE PRIORS

This section consists of two parts. The first demonstrates how parallel coordinate and scatter plots can be utilized in divergent transition visualization. In addition, Rhat and $N_{eff}$ values extracted from the Stan fit are illustrated. The second part performs wider statistical diagnosis with multiple replications of the data. As previously discussed, all experiments are performed with Horseshoe prior and the main objective of the section is to used methods discussed to obtain optimal parametrization for the prior.

## 4.1 Experiments with 1 replication

Rhat and $N_{eff}$ values are standard measures for convergence of Stan models [@gelman1995bayesian]. However, in some cases the models may also suffer from divergent transitions which need to be examined with other tools. In [@van2018shrinkage], authors were investigating trace plots and in some cases dropping out simulations with high number of divergent transitions. This assignment proposes the use of parallel coordinate and scatter plots as suggested in [@gabry2017visualization], in  order to find out the nature of the transitions, that is, if they are false positives or not. 

In **parallel coordinate plots**, the MCMC draws of Stan model are illustrated in two dimensions, which are parameters and their draw samples. The plots can be generated using $mcmc.parcoord$ function and starting points of divergent transitions can be made visible using $nuts \enspace parameters$. Generally, the parallel coordinates enable investigating divergent trajectories which are concentrated in the position of certain parameter and do not follow the same patterns as other draw trajectories. However, the parallel coordinates have trouble to visualize in the situations when the model has high number of parameters and the parameter values have high variance. Such is the case in this assignment, and in order to counter these problems, all draws in each chain are normalized between 0 and 1 and only 6 variables with lowest $N_{eff}$ values are chosen for investigation. To ensure more clear visualization of trajectories, draw values are finally computed in logarithmic form with an added bias as 0.5 to ensure positivity. Non-divergent and divergent trajectories are plotted in black and green, respectively.

Other option for examining divergent transitions is to generate **scatter plots** for draw samples of two chosen parameters from distribution. As previously, $nuts \enspace parameters$ can then be exploited to detect samples which are linked to divergent transitions. If these samples are concentrated in some location in scatter plot, model convergence could be questioned. Here, same draw values, parameters and coloration for divergent transitions as in parallel coordinate plots are presented in three different scatter plots for each Horseshoe parametrization.


```{r}
fitfls <- list.files()[grep(".FIT", list.files())]
dt_visual(fitfls[1])
```

```{r}
dt_visual(fitfls[2])
```

```{r}
dt_visual(fitfls[3])
```

The experiments on three different parametrizations show that in each case, divergent transitions are present and form concentrations in scatter plots to some extend. Therefore, the result clearly leaves out the discussion of false positives. Still, the visualizations illustrate that the results differ between parametrization. The first parametrization is clearly the worse, being the only one which exhibits clear divergent trajectories and concentrations and has lowest $N_{eff}$ value over hundred times lower than number of iterations. The third parametrization performs the best, as it includes the lowest number of divergent transitions. 

However, as all the used data is simulated, it might be the case that for this particular simulation the first parametrization happens to perform worse and the third parametrization better. In addition, only 1 replication of the data might be insufficient to give reliable convergence diagnostics.  Consequently, in the setting of this assignment, **parallel coordinate and scatter plots provide some insight to the problem but are insufficient for selecting optimal parametrization**. Wider statistical analysis with multiple data replications is required and it is performed next.

## 3.2 Experiments with 500 replications

To further compare the three parametrizations, a more statistical approach is proposed. In this approach, each model is fitted to 500 different replications of the data, and for each fit the Rhat values, the minimum $N_{eff}$ value with its corresponding parameter, and the number of divergent transitions are collected. This time, results for the Regularized Horseshoe are also included. Furthermore, PMSE and LPPD values are generated for each fit in order to study the models' predictive performance. All the values were computed with Triton and can be loaded from the following Gzip archive, which also includes statistics for all the other 7 shrinkage priors mentioned in section 3.
```{r}
load("fit_diagnostics")
```
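For a single fit, the diagnostics gathered into the archive could be extracted along the following lines. This is a sketch assuming a fitted `stanfit` object `fit`; the actual collection script run on Triton differs:

```{r}
# Sketch only: extract per-fit diagnostics from an assumed stanfit object `fit`.
library(rstan)
s <- summary(fit)$summary
max_rhat <- max(s[, "Rhat"], na.rm = TRUE)              # worst Rhat over parameters
min_neff <- min(s[, "n_eff"], na.rm = TRUE)             # smallest effective sample size
min_neff_par <- rownames(s)[which.min(s[, "n_eff"])]    # parameter attaining it
# Total number of divergent transitions over all chains
n_div <- sum(sapply(get_sampler_params(fit, inc_warmup = FALSE),
                    function(x) sum(x[, "divergent__"])))
c(max_rhat = max_rhat, min_neff = min_neff, divergences = n_div)
```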

```{r}
chosen_fit_diagn <- choose_fits(fit_diagn, c("horseshoe_1.SM", "horseshoe_2.SM", "horseshoe_3.SM", "regularized_horseshoe.SM"), 500)
show_diagnostics(chosen_fit_diagn,0,show_neff = TRUE)


```
Two main conclusions can be drawn from the statistics. Firstly, each Horseshoe parametrization suffers from convergence problems. The mean number of divergent transitions is over 40 for each model, $N_{eff}$ can drop even below 100, and in some fits the Rhat values exceed the threshold 1.1. Contrary to the results of the previous section, parametrizations 1 and 2 now perform equally. This outcome emphasizes that parallel coordinate and scatter plots have limitations as convergence diagnostics. Regarding the parametrization ranking, Horseshoe 3 has the best statistics in terms of the Rhat, $N_{eff}$, and divergent transition distributions. This is as expected, since the parametrization was suggested in [@piironen2016hyperprior].

More importantly, the statistics show that in terms of convergence, **none of the Horseshoe parametrizations can match the Regularized Horseshoe**. Across all replications, no Rhat values above 1.1 emerge, the smallest $N_{eff}$ values are all higher than 1000, and only one divergent transition is found. This result is also better than in [@van2018shrinkage], which suggests that a high `adapt_delta` value and a large number of iterations are necessary for model convergence. However, it is also important to examine how the models perform on unseen data. As previously discussed, this assignment uses PMSE and LPPD values for predictive performance comparison, and their distributions are shown next.
```{r}
show_diagnostics(chosen_fit_diagn,1,show_neff = FALSE)
```
The distributions above show that **the models have very similar predictive performance.** In the case of the Horseshoe parametrizations, the distributions are almost identical, which is expected as all the parametrizations describe the same shrinkage prior. The similarity of the distributions implies that if only the predictive performance of the model is considered, the choice of model makes no drastic difference. Moreover, the similarity of the predictive performances also calls into question the effect of the divergent transitions. One possible explanation is that the mapping from the parameters to the predictive distributions is such that the bias in the parameter posterior indicated by the divergences has no effect. This can happen, for example, if the unexplored part of the parameter posterior maps to predictions very similar to those of the other parts of the posterior.
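For concreteness, the two predictive criteria used throughout can be computed as follows for a Gaussian observation model. This is a minimal sketch; `y_test`, the draw matrix `mu_draws` (draws x test points), and the residual-sd draws `sigma_draws` are assumed inputs, not objects defined in this notebook:

```{r}
# Sketch: PMSE and LPPD for test data under a Gaussian likelihood.
# y_test: test responses; mu_draws: S x N posterior draws of test means;
# sigma_draws: length-S posterior draws of the residual sd (all assumed).
pmse_lppd <- function(y_test, mu_draws, sigma_draws) {
  # PMSE: squared error of the posterior mean prediction
  pmse <- mean((y_test - colMeans(mu_draws))^2)
  # LPPD: sum over test points of the log of the draw-averaged density
  lik <- sapply(seq_along(y_test), function(n)
    mean(dnorm(y_test[n], mu_draws[, n], sigma_draws)))
  lppd <- sum(log(lik))
  c(PMSE = pmse, LPPD = lppd)
}
```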

## 4. MODEL COMPARISON

In [@van2018shrinkage], model comparisons were provided using the means and standard deviations of the PMSE values. As the mean values of the different models were very close to each other, comparison was difficult, and no further visualization was provided. This section aims to provide a tool for a clearer comparison using the PMSE and LPPD value distributions. Based on the experiments in the previous section, the chosen parametrization for the Horseshoe prior is the third one. All results shown in this section are again computed from 500 simulation replications with Triton. Let us start by visualizing the distributions:

```{r}
# Creates fit_diagn file
chosen_fit_diagn <- choose_fits(fit_diagn, fit_diagn$models[c(1, 4:11)], 500)
show_diagnostics(chosen_fit_diagn,1,show_neff = FALSE)
```
As can be seen from the figures above, **the predictive performances of the shrinkage priors do differ**. Again, the outcome deviates from the results illustrated in [@van2018shrinkage]. However, since so many models are involved, it is difficult to assess which performs best. In order to provide a more informative model comparison, statistics for the rankings of the PMSE and LPPD values are also presented. In other words, ranking values 1 to 9 are assigned to the models in each replication, such that rank 1 corresponds to the lowest PMSE value or absolute LPPD value and rank 9 to the highest. The results are presented in the two tables below:


```{r}
show_diagnostics(chosen_fit_diagn,2,show_neff = FALSE)
```
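The ranking procedure behind these tables can be sketched in a few lines. The matrix `pmse` (one row per replication, one column per model) is a hypothetical input standing in for the values stored in the diagnostics archive:

```{r}
# Sketch: per-replication ranking of models by PMSE (rank 1 = lowest).
# `pmse` is an assumed 500 x 9 matrix: rows = replications, cols = models.
rank_matrix <- t(apply(pmse, 1, rank))
# Mean rank per model summarizes how consistently it beats the others
colMeans(rank_matrix)
```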
The tables show that **Horseshoe and Regularized Horseshoe have the best predictive performance**. The success of the Horseshoe priors is expected, since their shrinkage abilities have been well covered in the literature [@carvalho2010horseshoe; @piironen2017sparsity]. However, it must be noted that, just as in [@van2018shrinkage], the sparsity parameter of the Regularized Horseshoe prior has a prior which is centered on the correct number of non-zero parameters in the model. The worst predictive performance is attained with the Elastic Net and the second worst with Ridge Cauchy. The models are ranked in a similar order in both the LPPD and PMSE cases. To verify that the results are valid, the Rhat values and divergent transitions are also illustrated for each model:

```{r}
show_diagnostics(chosen_fit_diagn,0,show_neff = FALSE)
```


## 5. CONCLUSIONS

The conclusions of the assignment can be briefly summarized as follows:

**1:** When a model suffers from convergence difficulties and produces divergent transitions, the `adapt_delta` parameter should be set as high as possible (0.999) and the number of iterations to 8000; if the transitions still occur, scatter and parallel coordinate plots should be used. However, the plots alone may be insufficient, and therefore broader statistics for the divergent transitions as well as the Rhat and $N_{eff}$ values are also required.

**2:** Under the modified simulation condition 1, the best parametrization for the Horseshoe prior is obtained by reformulating the Half-Cauchy distribution through an Inverse-Gamma parametrization. However, the Regularized Horseshoe prior is still a better choice, as it produces virtually no divergent transitions in the setting of this assignment. The predictive performances of the priors are nevertheless very close to each other.

**3:** When comparing different shrinkage priors under the modified simulation condition 1, the differences in model performance can be seen visually from the PMSE and LPPD distributions, and the best model can be determined based on them. Horseshoe and Regularized Horseshoe were found to be the top shrinkage priors for the task.

## REFERENCES