Last updated: 2022-04-27


Knit directory: logistic-susie-gsea/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.



The results in this page were generated with repository version 853102f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    library/
    Ignored:    renv/library/
    Ignored:    renv/staging/
    Ignored:    staging/

Untracked files:
    Untracked:  .ipynb_checkpoints/
    Untracked:  _targets.R
    Untracked:  _targets.html
    Untracked:  _targets.md
    Untracked:  _targets/
    Untracked:  _targets_r/
    Untracked:  analysis/alpha_ash_v_point_normal.Rmd
    Untracked:  analysis/de_droplet_noshrink.Rmd
    Untracked:  analysis/de_droplet_noshrink_logistic_susie.Rmd
    Untracked:  analysis/fetal_reference_cellid_gsea.Rmd
    Untracked:  analysis/fixed_intercept.Rmd
    Untracked:  analysis/iDEA_examples.Rmd
    Untracked:  analysis/latent_gene_list.Rmd
    Untracked:  analysis/linear_method_failure_modes.Rmd
    Untracked:  analysis/linear_regression_failure_regime.Rmd
    Untracked:  analysis/logistic_susie_veb_boost_vs_vb.Rmd
    Untracked:  analysis/logistic_susie_vis.Rmd
    Untracked:  analysis/logsitic_susie_template.Rmd
    Untracked:  analysis/references.bib
    Untracked:  analysis/roadmap.Rmd
    Untracked:  analysis/simulations.Rmd
    Untracked:  analysis/test.Rmd
    Untracked:  build_site.R
    Untracked:  cache/
    Untracked:  code/html_tables.R
    Untracked:  code/latent_logistic_susie.R
    Untracked:  code/load_data.R
    Untracked:  code/logistic_susie_data_driver.R
    Untracked:  code/marginal_sumstat_gsea_collapsed.R
    Untracked:  code/point_normal.R
    Untracked:  code/sumstat_gsea.py
    Untracked:  code/susie_gsea_queries.R
    Untracked:  data/adipose_2yr_topsnp.txt
    Untracked:  data/de-droplet/
    Untracked:  data/deng/
    Untracked:  data/fetal_reference_cellid_gene_sets.RData
    Untracked:  data/human_chimp_eb/
    Untracked:  data/pbmc-purified/
    Untracked:  data/wenhe_baboon_diet/
    Untracked:  docs.zip
    Untracked:  export/
    Untracked:  index.md
    Untracked:  references.bib
    Untracked:  simulation_targets/

Unstaged changes:
    Modified:   .Rprofile
    Modified:   _simulation_targets.R
    Modified:   _targets.Rmd
    Modified:   analysis/alpha_for_single_cell.Rmd
    Modified:   analysis/baboon_diet.Rmd
    Modified:   analysis/gseabenchmark_tcga.Rmd
    Modified:   analysis/human_chimp_eb_de_example.Rmd
    Modified:   analysis/single_cell_pbmc.Rmd
    Modified:   analysis/single_cell_pbmc_l1.Rmd
    Deleted:    analysis/summary_stat_gsea_univariate_simulations.Rmd
    Modified:   analysis/the_big_geneset.Rmd
    Modified:   code/enrichment_pipeline.R
    Modified:   code/fit_baselines.R
    Modified:   code/fit_logistic_susie.R
    Modified:   code/fit_mr_ash.R
    Modified:   code/fit_susie.R
    Modified:   code/load_gene_sets.R
    Modified:   code/logistic_susie_vb.R
    Modified:   code/marginal_sumstat_gsea.R
    Modified:   code/simulate_gene_lists.R
    Modified:   code/tccm_ebnm.R
    Modified:   target_components/factories.R
    Modified:   target_components/methods.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gsea_overview.Rmd) and HTML (docs/gsea_overview.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author     Date        Message
Rmd   853102f  karltayeb  2022-04-27  wflow_publish(c("analysis/lit_review.Rmd", "analysis/gsea_overview.Rmd"))
html  0033f11  karltayeb  2022-04-27  Build site.
Rmd   3905c90  karltayeb  2022-04-27  wflow_publish(c("analysis/lit_review.Rmd", "analysis/gsea_overview.Rmd"))

Introduction

Given a set of gene-level statistics, gene set enrichment analysis (GSEA) tests whether “interesting” or “significant” genes are enriched in a pre-defined set of genes. Gene sets are typically groups of genes known to belong to the same biological pathway, to be implicated in the same disease, etc. There are many sources of gene sets: the Gene Ontology (GO), the Molecular Signatures Database (MSigDB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), etc.

Researchers typically conduct GSEA to characterize the biological signal in their experiment. Consequently, GSEA is carried out in a highly parallel fashion across thousands of gene sets from multiple databases, often leading to hundreds (or thousands!) of enrichments. These results often exhibit considerable redundancy: many overlapping or nested gene sets are significantly enriched.

Interpreting these results is challenging. Given a long list of enriched gene sets, it can be difficult to pick out the distinct enrichments among many redundant gene sets. Worse, when tasked with manually curating and summarizing results, a researcher might inadvertently select “nice” enrichments that confirm their prior beliefs about the experiment, giving an incomplete description of the results at best. Existing ways to mitigate this include (1) curating a smaller collection of gene sets before conducting GSEA (see GO Slims) to avoid abundant and redundant results and (2) post-hoc clustering of enriched gene sets (DAVID, WebGestalt, etc.).

Here, we propose a different approach. Rather than testing for enrichment in each gene set separately, we compete gene sets against each other in a (sparse) multiple regression. Among many overlapping gene sets, all of which would have significant marginal enrichment, our model can identify the single gene set that best describes the enrichment signal. In the presence of multiple unrelated enrichments, we can identify the combination of gene sets with the strongest evidence of enrichment in a principled fashion.
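The contrast between marginal testing and competing gene sets in one regression can be illustrated with a small simulation. This is only a sketch under stated assumptions: the gene sets (`core`, `parent`, `overlap`), the effect sizes, and the number of genes are all hypothetical, and an ordinary (non-sparse) logistic regression fit with `glm` stands in for the sparse model proposed here.

```r
set.seed(1)

# Hypothetical toy data: 5000 genes and three nested/overlapping gene sets.
n_genes <- 5000
gene_index <- seq_len(n_genes)
X <- cbind(
  core    = as.integer(gene_index <= 400),        # the truly enriched set
  parent  = as.integer(gene_index <= 800),        # superset of core
  overlap = as.integer(gene_index %in% 201:600)   # half inside core
)

# Membership in the gene list depends on the core set only (assumed effect sizes).
y <- rbinom(n_genes, size = 1, prob = plogis(-2 + 2 * X[, "core"]))

# Marginal analysis: one Fisher's exact test per gene set.
marginal_p <- apply(X, 2, function(x) {
  fisher.test(table(factor(x, 0:1), factor(y, 0:1)))$p.value
})

# Joint analysis: all gene sets compete in a single logistic regression.
dat <- data.frame(y = y, X)
joint_fit <- glm(y ~ ., data = dat, family = binomial())
joint_p <- coef(summary(joint_fit))[-1, "Pr(>|z|)"]

print(signif(marginal_p, 3))  # every overlapping set looks enriched on its own
print(signif(joint_p, 3))     # the joint fit attributes the signal to core
```

Because `parent` and `overlap` share genes with `core`, their marginal tests inherit its signal. In the joint fit, their coefficients measure enrichment beyond what `core` already explains.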

Accommodating different input data

The gene-level statistics we have access to depend on the particular analysis upstream of GSEA. For example, in differential expression experiments we might have effect sizes (log fold changes) and standard errors for each gene. If we wish to perform enrichment analysis on clusters of genes, we only have access to binary indicators of gene membership in each cluster. Further still, the adventurous quantitative biologist might concoct a more exotic analysis yielding gene-level statistics to which there is no straightforward way of attaching uncertainty estimates. In this case, performing GSEA with gene ranks would be most appropriate.

Regardless of the input data, one can often binarize the inputs into a gene list (e.g. by thresholding on p-value or taking the top \(k\) genes from a ranking). Indeed, among the most common approaches to GSEA are simple contingency table tests on binary gene lists. We treat the binary case with a sparse multivariate logistic regression. We extend this basic model to accommodate summary statistics, recovering a covariate-moderated version of the empirical Bayes normal means (EBNM) problem. I still need to decide how (or if) to address rankings…
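The binarization step and the contingency table test mentioned above can be sketched as follows. Everything here is hypothetical and for illustration only: the simulated p-values, the 0.01 threshold, the choice of `k`, and the gene set are assumptions, not values used in this analysis.

```r
set.seed(2)

# Hypothetical gene-level p-values for 1000 genes.
n_genes <- 1000
gene_ids <- paste0("gene", seq_len(n_genes))
p_values <- setNames(runif(n_genes), gene_ids)
p_values[1:50] <- p_values[1:50] * 1e-3  # make the first 50 genes look significant

# Option 1: binarize by thresholding on the p-value (assumed cutoff).
in_list_threshold <- p_values < 0.01

# Option 2: binarize by taking the top k genes from the ranking.
k <- 100
in_list_topk <- rank(p_values, ties.method = "first") <= k

# Hypothetical gene set: 40 of the significant genes plus 60 others.
gene_set <- gene_ids[c(1:40, 501:560)]
in_set <- gene_ids %in% gene_set

# 2 x 2 contingency table of gene set membership against gene list membership.
tab <- table(
  in_set  = factor(in_set, levels = c(FALSE, TRUE)),
  in_list = factor(in_list_topk, levels = c(FALSE, TRUE))
)

# One-sided Fisher's exact test for over-representation of the set in the list.
enrichment_test <- fisher.test(tab, alternative = "greater")
print(tab)
print(enrichment_test$p.value)
```

The two binarization rules generally give different gene lists. Thresholding lets the list size vary with the strength of the signal, while the top \(k\) rule fixes the list size in advance.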