Last updated: 2022-04-06

Checks: 7 passed, 0 failed

Knit directory: logistic-susie-gsea/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220105) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8944883. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    library/
    Ignored:    renv/library/
    Ignored:    renv/staging/
    Ignored:    staging/

Untracked files:
    Untracked:  _targets.R
    Untracked:  _targets.html
    Untracked:  _targets.md
    Untracked:  _targets/
    Untracked:  _targets_r/
    Untracked:  analysis/deng_example.Rmd
    Untracked:  analysis/fetal_reference_cellid_gsea.Rmd
    Untracked:  analysis/fixed_intercept.Rmd
    Untracked:  analysis/iDEA_examples.Rmd
    Untracked:  analysis/latent_gene_list.Rmd
    Untracked:  analysis/latent_logistic_susie.Rmd
    Untracked:  analysis/libra_setup.Rmd
    Untracked:  analysis/linear_method_failure_modes.Rmd
    Untracked:  analysis/linear_regression_failure_regime.Rmd
    Untracked:  analysis/logistic_susie_veb_boost_vs_vb.Rmd
    Untracked:  analysis/logistic_susie_vis.Rmd
    Untracked:  analysis/references.bib
    Untracked:  analysis/simulations.Rmd
    Untracked:  analysis/single_cell_pbmc_l1.Rmd
    Untracked:  analysis/test.Rmd
    Untracked:  analysis/wenhe_baboon_example.Rmd
    Untracked:  baboon_diet_cache/
    Untracked:  build_site.R
    Untracked:  cache/
    Untracked:  code/latent_logistic_susie.R
    Untracked:  code/logistic_susie_data_driver.R
    Untracked:  code/marginal_sumstat_gsea_collapsed.R
    Untracked:  data/adipose_2yr_topsnp.txt
    Untracked:  data/deng/
    Untracked:  data/fetal_reference_cellid_gene_sets.RData
    Untracked:  data/pbmc-purified/
    Untracked:  data/wenhe_baboon_diet/
    Untracked:  docs.zip
    Untracked:  index.md
    Untracked:  latent_logistic_susie_cache/
    Untracked:  simulation_targets/
    Untracked:  single_cell_pbmc_cache/
    Untracked:  single_cell_pbmc_l1_cache/
    Untracked:  summary_stat_gsea_exploration_cache/
    Untracked:  summary_stat_gsea_sim_cache/

Unstaged changes:
    Modified:   _simulation_targets.R
    Modified:   _targets.Rmd
    Modified:   analysis/baboon_diet.Rmd
    Modified:   analysis/gseabenchmark_tcga.Rmd
    Deleted:    analysis/summary_stat_gsea_univariate_simulations.Rmd
    Modified:   code/fit_baselines.R
    Modified:   code/fit_logistic_susie.R
    Modified:   code/fit_mr_ash.R
    Modified:   code/fit_susie.R
    Modified:   code/load_gene_sets.R
    Modified:   code/simulate_gene_lists.R
    Modified:   code/utils.R
    Modified:   target_components/factories.R
    Modified:   target_components/methods.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/summary_stat_gsea_sim.Rmd) and HTML (docs/summary_stat_gsea_sim.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 8944883 karltayeb 2022-04-06 wflow_publish(c(“analysis/index.Rmd”, “analysis/summary_stat_gsea_sim.Rmd”))
html 281c7d2 karltayeb 2022-04-06 Build site.
Rmd eb8b2ae karltayeb 2022-04-06 wflow_publish(“analysis/summary_stat_gsea_sim.Rmd”)

Introduction

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.6     ✓ dplyr   1.0.8
✓ tidyr   1.2.0     ✓ stringr 1.4.0
✓ readr   2.1.2     ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(latex2exp)
source('code/marginal_sumstat_gsea.R')
#' simulate summary statistics from two component enrichment model
#' params is a list with elements theta, sigma0, alpha
#' r.se is a function for sampling standard errors; defaults to runif(1e-3, 5)
simulate.gene.summary.stats = function(params, x = NULL, gene.set.prop=0.01, n.genes=10000, r.se=NULL) {
  theta0 = params$theta[1]
  theta1 = params$theta[2]
  sigma0 = params$sigma0
  alpha = params$alpha
  # simulate x if not provided
  if(is.null(x)){
    x <- rbinom(n.genes, 1, gene.set.prop)  # gene set
  } else{
    n.genes <- length(x)
  }
  
  if(is.null(r.se)){
    se <- runif(n.genes, 1e-3, 5)  # simulate standard errors, reasonable?
  } else {
    se <- r.se(n.genes)
  }
  sigma <- se ^ alpha * sigma0
  gamma <- rbinom(n.genes, 1, sigmoid(theta0 + theta1 * x))  # null/non-null
  beta <- rnorm(n.genes, mean = 0, sd = se)
  beta <- beta + (rnorm(n.genes, mean=0, sd=sigma) * gamma)
  return(list(beta=beta, se=se, gamma=gamma))
}
roll = function(v, n=1){
  if(n==0){
    return(v)
  } else{
    return(c(tail(v, n), head(v, -n)))
  }
}
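For intuition, roll is a cyclic right-shift; sliding the block of 1s this way is what generates gene sets with controlled overlap below. A quick check (roll is repeated here so the snippet stands alone):

```r
# cyclic right-shift of a vector by n positions (copied from above)
roll = function(v, n=1){
  if(n==0) return(v)
  c(tail(v, n), head(v, -n))
}

roll(1:5, 2)         # 4 5 1 2 3
roll(c(1,1,0,0), 1)  # 0 1 1 0
```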

#' make sequence of simulated gene sets of fixed size
#' and decreasing overlap with first gene set
#' gene.set.size = size of gene set
#' by = step size for the shift (the overlap decreases by this much per set)
sim.X.base = function(n.genes=1e4, gene.set.size=1e2, from=0, to=NULL, by=5){
  to <- ifelse(is.null(to), gene.set.size, to)
  x = c(rep(1, gene.set.size), rep(0, n.genes-gene.set.size))
  u <- map(seq(from, to, by=by), ~roll(x, n=.x))
  X <- matrix(unlist(u), nrow = n.genes)
  return(X)
}

#' make sequence of gene sets overlapping base gene set
sim.X.other = function(gene.set.size, other.size, by=5){
  sim.X.base(
    gene.set.size=other.size,
    from=max(0, gene.set.size - other.size), to=gene.set.size, by=by
  )
}

#' put it all together
#' returns matrix X
#' first column is the gene set of interest
#' other columns are gene sets of varying sizes that overlap the first
sim.X = function(gene.set.size=50, set.sizes = NULL, by=5){
  X <- sim.X.base(gene.set.size = gene.set.size, by=by)
  if(!is.null(set.sizes)){
    Xs <- do.call('cbind', map(set.sizes, ~sim.X.other(gene.set.size, .x, by=by)))
    X <- cbind(X, Xs)
  }
  return(X)
}
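As a sanity check on this construction, here is a self-contained sketch repeating roll and sim.X.base from above (with purrr::map swapped for base lapply): every column keeps the gene-set size fixed, while the overlap with the first column decreases by `by` per column.

```r
roll = function(v, n=1) if(n==0) v else c(tail(v, n), head(v, -n))

sim.X.base = function(n.genes=1e4, gene.set.size=1e2, from=0, to=NULL, by=5){
  to <- if(is.null(to)) gene.set.size else to
  x <- c(rep(1, gene.set.size), rep(0, n.genes - gene.set.size))
  u <- lapply(seq(from, to, by=by), function(n) roll(x, n=n))
  matrix(unlist(u), nrow = n.genes)
}

X <- sim.X.base(n.genes=1000, gene.set.size=100, by=20)
colSums(X)           # all 100: every column is a 100-gene set
(X[, 1] %*% X)[1, ]  # 100 80 60 40 20 0: overlap with the first set
```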


X.overlap <- sim.X(gene.set.size=100, by=20)
X.nest <- cbind(
  sim.X.base(gene.set.size = 50, by=50),
  sim.X.base(gene.set.size = 100, by=100),
  sim.X.base(gene.set.size = 200, by=200),
  sim.X.base(gene.set.size = 500, by=500),
  sim.X.base(gene.set.size = 1000, by=1000),
  sim.X.base(gene.set.size = 2000, by=2000),
  sim.X.base(gene.set.size = 5000, by=5000)
)

X <- cbind(X.overlap, X.nest)

X <- sim.X(gene.set.size = 100, by=100)
X.true <- X.overlap[, 1]

Here’s just a reminder of our optimization options. We can either optimize the marginal likelihood directly or perform EM. At the scale of problems we’re working on, they run at about the same speed; optim is maybe 20% faster, but I’ve also found it harder to initialize well. In general we need a good initialization strategy.

library(tictoc)
params = list(
  theta = c(-2, 1),
  alpha = 0.5,
  sigma0 = 3.0
)

x = c(rep(1, 100), rep(0, 9900))
sim <- simulate.gene.summary.stats(params, x)
gsea <- summary.stat.gsea(x, sim$beta, sim$se)

params.init = list(
  theta = c(0, 0),
  alpha = 1.0,
  sigma0 = 1.0
)

tic('optimizing marginal likelihood')
params.fit1 <- gsea$optimize.marginal.likelihood(
  params.init, update.alpha = T, update.sigma0 = T)
optimize all parameters
toc()
optimizing marginal likelihood: 5.068 sec elapsed
tic('optimize via EM')
params.fit2 <- gsea$expectation.maximiztion(
  params.init, n.iter=1000, update.alpha = T, update.sigma0 = T)
toc()
optimize via EM: 6.298 sec elapsed
#' fit the model and return parameter list
#' ... arguments to pass to EM
fit.sumstat.gsea = function(beta, se, x, params.init=NULL, theta=NULL, alpha=NULL, sigma0=NULL, ...){
  gsea <- summary.stat.gsea(x, beta, se)
  # TODO: figure out how to initialize parameters
  if(is.null(params.init)){
    params.init = list(
      theta = c(-3, 0),
      alpha = 1.0,
      sigma0 = 1
    )
  }
  if(!is.null(theta)){
    params.init$theta = theta
  }
  if(!is.null(alpha)){
    params.init$alpha = alpha
  }
  if(!is.null(sigma0)){
    params.init$sigma0 = sigma0
  }
  params.fit = gsea$expectation.maximiztion(params.init, ...)
  return(params.fit)
}

#' driver function for simulations
#' fit model for all gene sets
#' assumes the first column of X is the gene set used to generate sim
sim.driver = function(X, sim, params.init=NULL, update.alpha=T, update.sigma0=T){
  # fit model for each gene set
  res <- map(1:dim(X)[2], ~fit.sumstat.gsea(
    sim$beta, sim$se, X[,.x],
    params.init=params.init,
    update.alpha=update.alpha, update.sigma0=update.sigma0
  ))
  
  # clean up data
  for (i in 1:length(res)){
    res[[i]]$responsibilities <- NULL
    names(res[[i]]$theta) <- paste0('theta', c(0, 1))
  }

  # dump into table with some useful stats
  res_tbl <- tibble(
      overlap=(X[,1] %*% X)[1,],
      active = 1:dim(X)[2] == 1,
      set.size=colSums(X),
      p.overlap = overlap/set.size,
      res=res
    ) %>% 
    unnest_wider(res) %>%
    unnest_wider(theta) %>%
    rename(
      theta0_hat=theta0,
      theta1_hat=theta1,
      alpha_hat=alpha,
      sigma0_hat=sigma0) %>%
    mutate(likelihood = map_dbl(lik.history, ~tail(.x, 1)))
  return(res_tbl)
}

Simulation

We have a null and non-null simulation. \(\theta_0 \sim \text{Unif}[-6, 0]\)

\(\theta_1 \sim \text{Unif}[0.1, 4]\) for the non-null, and \(\theta_1 = 0\) for the null simulation.

\(\alpha \sim \text{Unif}[0, 1]\)

\(s_i \sim \log\mathcal{N}(-1, 0.5^2)\) to approximate the histogram that Peter showed me.

\(\sigma_0 \in \{0.1, 0.5, 1.0, 2.0, 4.0\}\)

From these we sample the latent indicator of component membership

\(\gamma_i \sim \text{Bernoulli}(p_i)\), where \(p_i = \sigma(\theta_0 + \theta_1 x_i)\)

And finally

\(\hat \beta_i \sim \mathcal{N}(0, s_i^2 + \gamma_i s_i^{2\alpha} \sigma_0^2)\)

It seems to me that the relative values of \(s_i\) and \(\sigma_0\) are what’s critical, so I picked a range of values for \(\sigma_0\) so we can study how well the model estimates our enrichment parameters when effect sizes are strong or weak.
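Marginally, then, \(\text{Var}(\hat\beta_i \mid \gamma_i) = s_i^2 + \gamma_i s_i^{2\alpha}\sigma_0^2\). A quick standalone Monte Carlo check of this variance decomposition (illustrative parameter values, not the simulation settings used below):

```r
set.seed(1)
n <- 1e5
s <- 0.5; alpha <- 0.5; sigma0 <- 2           # a single fixed standard error

gamma <- rbinom(n, 1, 0.5)                    # component membership
beta  <- rnorm(n, 0, s) + gamma * rnorm(n, 0, s^alpha * sigma0)

var(beta[gamma == 0])  # ~ s^2 = 0.25
var(beta[gamma == 1])  # ~ s^2 + s^(2*alpha) * sigma0^2 = 2.25
```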

The “true” gene set is taken to be the first 100 genes. We simulate 10000 genes. For each non-null simulation, we also fit the model to a gene set of 100 genes with no overlap with the true gene set (the second 100 genes).

For the null simulations we fit just the one gene set consisting of the first hundred genes.

Of course, we would like to know how well we estimate the parameters when the model is fit against subsets and supersets of the gene set, and at various degrees of overlap. We’d also like to know how often we prioritize the true gene set over these other sets (which are all marginally enriched). But we do not explore that here.

library(future)
plan(multisession)

r.se = function(n) exp(rnorm(n, mean=-1, sd=0.5))
hist(r.se(100000))

Version Author Date
281c7d2 karltayeb 2022-04-06
# make simulation
# sigma0=0 is the null simulation
sim <- xfun::cache_rds({
  n <- 100
  map_dfr(c(0.1, 0.5, 1.0, 2.0, 4.0),
  ~tibble(  # sample parameters
    theta0 = runif(n, -6, 0),
    theta1 = runif(n, 0.1, 4),
    alpha = runif(n, 0, 1),
    sigma0 = rep(.x, n)
  ) %>% 
  rowwise() %>%
  mutate(  # simulate gene lists
    params = list(list(
      theta = c(theta0, theta1),
      alpha = alpha,
      sigma0 = sigma0
    )),
    sim = list(simulate.gene.summary.stats(params, X.true, r.se=r.se))
  ) %>% ungroup())}, dir='cache/summary_stat_gsea_sim/')

possibly.sim.driver = possibly(sim.driver, NA)
# fix alpha = 0 
exchangeable.effects.fit <- xfun::cache_rds({
  params.init = list(
    theta = c(0, 0),
    alpha = 0.,
    sigma0 = 1
  )
  sim %>%
    mutate(
      fit = furrr::future_map(sim, ~possibly.sim.driver(
        X, .x, params.init, update.alpha=F, update.sigma0=T))
    )
}, hash = list(sim), dir='cache/summary_stat_gsea_sim/')
exchangeable.z.fit <- xfun::cache_rds({
  params.init <- list(
    theta = c(0, 0),
    alpha = 1.,
    sigma0 = 1
  )
  sim %>%
    mutate(
      fit = furrr::future_map(sim, ~possibly.sim.driver(
        X, .x, params.init, update.alpha=F, update.sigma0=T))
    )
}, hash = list(sim), dir='cache/summary_stat_gsea_sim/')
estimate.alpha.fit <- xfun::cache_rds({
  params.init <- list(
    theta = c(0, 0),
    alpha = 1.,
    sigma0 = 1
  )
  sim %>%
    mutate(
      fit = furrr::future_map(sim, ~possibly.sim.driver(
        X, .x, params.init, update.alpha=T, update.sigma0=T))
    )
}, hash = list(sim), dir='cache/summary_stat_gsea_sim/')
# simulate at various levels of sigma0
null.sim <- xfun::cache_rds({
  n <- 20
  map_dfr(c(0.1, 0.5, 1.0, 2.0, 4.0),
  ~tibble(  # sample parameters
    theta0 = runif(n, -6, 0),
    theta1 = rep(0, n),
    alpha = runif(n, 0, 1),
    sigma0 = rep(.x, n)
  ) %>% 
  rowwise() %>%
  mutate(  # simulate gene lists
    params = list(list(
      theta = c(theta0, theta1),
      alpha = alpha,
      sigma0 = sigma0
    )),
    sim = list(simulate.gene.summary.stats(params, X[, 1], r.se=r.se))
  ) %>% ungroup())}, dir='cache/summary_stat_gsea_sim/')
# fix alpha = 0 
null.exchangeable.effects.fit <- xfun::cache_rds({
  params.init = list(
    theta = c(0, 0),
    alpha = 0.,
    sigma0 = 1
  )
  null.sim %>%
    mutate(
      fit = furrr::future_map(sim, ~possibly.sim.driver(
        X[, 1, drop=FALSE], .x, params.init, update.alpha=F, update.sigma0=T))
    )
}, hash = list(null.sim), dir='cache/summary_stat_gsea_sim/')
null.exchangeable.z.fit <- xfun::cache_rds({
  params.init <- list(
    theta = c(0, 0),
    alpha = 1.,
    sigma0 = 1
  )
  null.sim %>%
    mutate(
      fit = furrr::future_map(sim, ~possibly.sim.driver(
        X[, 1, drop=FALSE], .x, params.init, update.alpha=F, update.sigma0=T))
    )
}, hash = list(null.sim), dir='cache/summary_stat_gsea_sim/')
null.estimate.alpha.fit <- xfun::cache_rds({
  params.init <- list(
    theta = c(0, 0),
    alpha = 1.,
    sigma0 = 1
  )
  null.sim %>%
    mutate(
      fit = furrr::future_map(sim, ~possibly.sim.driver(
        X[, 1, drop=FALSE], .x, params.init, update.alpha=T, update.sigma0=T))
    )
}, hash = list(null.sim), dir='cache/summary_stat_gsea_sim/')
#' file is the stem of the file name
#' search the cache directory for file_HASH.rds
read_cache = function(file, dir='cache/summary_stat_gsea_sim/'){
  files <- list.files(path = dir, pattern=paste0('^', file, '_'), full.names = TRUE)
  if(length(files) == 1){
    print(paste('loading from:', files[1]))
    return(readRDS(files[1]))
  }
}

sim <- read_cache('sim')
exchangeable.z.fit <- read_cache('sim.ex.z')
exchangeable.effects.fit <- read_cache('sim.ex.effects')
estimate.alpha.fit <- read_cache('sim.alpha.fit')

null.sim <- read_cache('null.sim')
null.exchangeable.z.fit <- read_cache('null.sim.ex.z')
null.exchangeable.effects.fit <- read_cache('null.sim.ex.effect')
null.estimate.alpha.fit <- read_cache('null.sim.alpha.fit')
do.true.hat.scatter = function(tbl, par){
  plot <- tbl %>% 
    unnest(fit) %>% 
    filter(active) %>%
    ggplot(aes(x=!!sym(par), y=!!sym(paste0(par, '_hat')))) +
    geom_line(aes(x=!!sym(par), y=!!sym(par), color='r')) + 
    geom_point() + facet_wrap(vars(sigma0), nrow=1)
  return(plot)
}

do.true.hat.violin = function(tbl, par){
  plot <- tbl %>% 
  unnest(fit) %>% 
  filter(active) %>%
  ggplot(aes(x=factor(!!sym(par)), y=!!sym(paste0(par, '_hat')))) +
  geom_violin() +
  geom_hline(aes(yintercept=!!sym(par))) +
  facet_wrap(vars(!!sym(par)), scale='free', nrow=1)
  return(plot)
}

Non-null simulations

How well are we estimating our effect size and other (nuisance) parameters?

Effect size

p1 <- do.true.hat.scatter(exchangeable.z.fit %>% filter(sigma0 >0), 'theta1') +
  labs(title='Exchangeable z')
p2 <- do.true.hat.scatter(exchangeable.effects.fit %>% filter(sigma0 >0), 'theta1') +
  labs(title='Exchangeable effects')
p3 <- do.true.hat.scatter(estimate.alpha.fit %>% filter(sigma0 >0), 'theta1') +
  labs(title='Estimate')
cowplot::plot_grid(p1, p2, p3, ncol = 1)


Intercept

p1 <- do.true.hat.scatter(exchangeable.z.fit %>% filter(sigma0 >0), 'theta0') +
  labs(title='Exchangeable z')
p2 <- do.true.hat.scatter(exchangeable.effects.fit %>% filter(sigma0 >0), 'theta0') +
  labs(title='Exchangeable effects')
p3 <- do.true.hat.scatter(estimate.alpha.fit %>% filter(sigma0 >0), 'theta0') +
  labs(title='Estimate')
cowplot::plot_grid(p1, p2, p3, ncol = 1)


\(\alpha\)

p3 <- do.true.hat.scatter(estimate.alpha.fit %>% filter(sigma0 >0), 'alpha') +
  labs(title='Estimate')
p3


\(\sigma_0\)

p1 <- do.true.hat.violin(exchangeable.z.fit %>% filter(sigma0 >0), 'sigma0') +
  labs(title='Exchangeable z')
p2 <- do.true.hat.violin(exchangeable.effects.fit %>% filter(sigma0 >0), 'sigma0') +
  labs(title='Exchangeable effects')
p3 <- do.true.hat.violin(estimate.alpha.fit %>% filter(sigma0 >0), 'sigma0') +
  labs(title='Estimate')
cowplot::plot_grid(p1, p2, p3, ncol = 1)


Null simulation

Interestingly, we sometimes aggressively deplete an irrelevant gene set. The gene sets we simulated here are small (1% of genes), so when the background rate is small there is a high probability that none of the genes in a null gene set are non-null. The model then pays a likelihood cost for keeping non-zero weight on the non-null component within the gene set.

Indeed, if we plot \(\hat\theta_1\) against \(\theta_0\) we see that as the background rate decreases, the frequency of these strong estimated depletions increases.
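A back-of-the-envelope calculation makes this concrete (hypothetical \(\theta_0\) values, using the 100-gene set size from these simulations): the chance that a null gene set of 100 genes contains no non-null gene is \((1 - \sigma(\theta_0))^{100}\), which grows quickly as the background rate shrinks.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

theta0  <- c(-2, -4, -6)
p0      <- sigmoid(theta0)        # background non-null rate
p.empty <- (1 - p0)^100           # P(no non-null gene among 100)
round(rbind(theta0, p0, p.empty), 4)
# p.empty rises from ~0 at theta0 = -2 to ~0.78 at theta0 = -6
```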

null.fit <- rbind(
  null.exchangeable.effects.fit %>% mutate(model='exchangeable.effects'),
  null.exchangeable.z.fit %>% mutate(model='exchangeable.z'),
  null.estimate.alpha.fit %>% mutate(model='estimate.alpha')
)

null.fit %>%
  unnest(fit) %>%
  ggplot(aes(x=theta1_hat)) +
  geom_histogram() +
  facet_wrap(vars(model))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

null.fit %>%
  unnest(fit) %>%
  ggplot(aes(y=theta1_hat, x=theta0)) +
  geom_point() +
  facet_wrap(vars(model))

knitr::knit_exit()