fishHook Tutorial

Introduction

The fishHook R package enables agile statistical analysis of coding and non-coding mutational recurrence in cancer through generalized linear modeling (GLM) of somatic mutation densities and their heterogeneity along the genome. fishHook can be applied to the analysis of any collection of genomic intervals (e.g. genes, enhancers, promoters, genomic tiles) or complex sets of intervals (e.g. genes sets representing pathways, enhancer sets known to interact with a gene). The fishHook package is integrated with GenomicRanges and data.table packages, allowing easy incorporation into bioinformatics workflows that employ the R/Bioconductor ecosystem.

fishHook enables nomination of loci following the correction of known covariates of neutral mutation, e.g. chromatin state, replication timing, and nucleotide context. The goal of fishHook is to identify cancer drivers, i.e. loci that are under positive somatic selection and accumulate mutations above “background”. This analysis hinges on the application of a correct null / background model, i.e. one that yields near-uniform Q-Q plots for P value distributions.

Though we provide pre-computed covariates and a “black box” command-line tool that applies several generic exome and whole genome analyses, the key power of fishHook lies in its customizability. This includes the ability to easily incorporate custom covariates and provide a framework for the generation and fitting of bespoke models to nominate loci, e.g. modeling variant- and tumor-type specific background mutational processes.

For installation instructions, please visit the fishHook github page. For background, it may help to have some familiarity with data.table, GenomicRanges, and gUtils packages.

If you use fishHook in your work, please cite: Imielinski, Guo, Meyerson. Cell. 2017 Jan 26;168(3):460-472.

Driver discovery in cancer whole exomes

We will demonstrate a quick whole exome analysis using public TCGA lung adenocarcinoma mutation data. Additional packages like gTrack and rtracklayer will help with data import and visualization, but are not necessary to run fishHook.

library(fishHook)    
library(gTrack)
library(rtracklayer)
library(skitools)

Read in data

Read in the mutation data and additional tracks that we will use in this analysis.

## mutation calls cached from public GDAC Broad firehose https://bit.ly/2sFxWY6
mutations = dt2gr(fread('http://mskilab.com/fishHook/hg19/luad.maf')) ## using data.table::fread to read maf and gUtils::dt2gr to convert to GRanges

## GENCODE v19 genes these are our "hypotheses"
genes = gr.sub(import('http://mskilab.com/fishHook/hg19/gencode.v19.genes.gtf')) # rtracklayer::import reads gtf and gr.sub replaces chr
genes = genes %Q% (gene_type == 'protein_coding') %Q% (level<3)  # %Q% is a gUtils subsetting operator for GRanges

## protein coding CDS definitions
cds = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/gencode.v19.cds.rds')))

## bigWig file of fractional coverage of hg19 positions by Agilent exome
## we will use this in combination with cds to define eligible positions
exomecov = import('http://mskilab.com/fishHook/hg19/exome_coverage.bw')

Take a peek at our mutations GRanges object:

head(mutations[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
## GRanges object with 6 ranges and 5 metadata columns:
##       seqnames    ranges strand |         Tumor_Sample_Barcode
##          <Rle> <IRanges>  <Rle> |                  <character>
##   [1]        1    905907      + | TCGA-05-4249-01A-01D-1105-08
##   [2]        1   1192480      + | TCGA-05-4249-01A-01D-1105-08
##   [3]        1   1854885      + | TCGA-05-4249-01A-01D-1105-08
##   [4]        1   9713992      + | TCGA-05-4249-01A-01D-1105-08
##   [5]        1  12908093      + | TCGA-05-4249-01A-01D-1105-08
##   [6]        1  17257855      + | TCGA-05-4249-01A-01D-1105-08
##       Variant_Type Variant_Classification Reference_Allele
##        <character>            <character>      <character>
##   [1]          SNP      Missense_Mutation                A
##   [2]          SNP                 Silent                C
##   [3]          SNP                    IGR                G
##   [4]          SNP                 Intron                G
##   [5]          SNP      Missense_Mutation                C
##   [6]          SNP      Missense_Mutation                C
##       Tumor_Seq_Allele2
##             <character>
##   [1]                 T
##   [2]                 A
##   [3]                 C
##   [4]                 A
##   [5]                 A
##   [6]                 T
##   -------
##   seqinfo: 24 sequences from an unspecified genome

Instantiate FishHook object from events and eligible territory

First we define the “eligible territory”. This is a key component of all somatic mutational recurrence analyses, since much of the genome is not covered in sequencing studies. For example, in a whole exome sequencing dataset, less than 2% of the genome is reliably captured. In a targeted sequencing panel, this fraction will be even smaller. Even in whole genome sequencing using Illumina short reads, only a subset (70%) of the genome is reliably callable (Li Bioinformatics 2014 Oct 15;30(20):2843–8751).

Eligible territory coverage will influence the “denominator” of our recurrence analysis, i.e. the number of positions in each hypothesis interval where a mutation could have possibly been detected. If we do not take eligible territory into account we will mis-estimate the background mutation rate in a given region.

To define eligible territory for this whole exome analysis, we will choose the portion of cds (protein coding) bases that are captured in at least 95% of whole exomes, which represents about 24MB of genome.

eligible = exomecov %Q% which(score>0.95) # %Q% is a gUtils operator for subsetting on GRanges metadata
eligible = reduce(GenomicRanges::intersect(eligible, cds, ignore.strand = TRUE)) # we intersect and reduce / collapse our prelim eligible intervals with CDS boundaries

We define “events” as nonsynonymous mutations. In this simple model, we will lump together SNVs (of different flavors), and indels (of different flavors). (We discuss more complex models that subdivide mutation types later in the tutorial).

events = mutations %Q% (Variant_Classification != 'Silent')  ## using gUtils operator %Q% to subset mutations GRanges

Now that we have loaded our hypotheses (i.e. genes), events, and eligible, we are ready to create and analyze a basic FishHook object. Under the hood, object creation triggers counting of how many events are in the eligible portion of each hypothesis interval. We provide the idcol parameter so that each tumor sample (as defined by the Tumor_Sample_Barcode column in the events GRanges) will provide at most one event to the counts of each interval.

fish = Fish(hypotheses = genes[, 'gene_name'], events = events, eligible = eligible, idcol = 'Tumor_Sample_Barcode')

## FishHook object spanning 19796 hypotheses, 0 covariates, and 56809 events. 
##  Covering 24.58 MB of eligible territory.

Run basic model without covariates

We can score this basic FishHook object, i.e. compute p values for every hypothesis, using a simple glm that models a uniform mutation density along the genome, i.e. the glm fits only an intercept and applies no covariates (after correcting for the number of eligible bases in each interval).

The $res field of the FishHook contains a data.table of scoring results. $res has one row per input hypothesis, with P values, FDRs, effect sizes (fold enrichment above background), observed and predicted event counts and densities, and additional interval annotations provided by the user in the hypotheses GRanges.

fish$score()
head(fish$res %Q% order(p))
fish$qqp(plotly = FALSE)

seqnames	start	end	width	gene_name	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7565097	7590856	25760	TP53	14970	1.1e-29	0.00000	99	5.58	2.070	0.0878	0.00184	1127	1	1
2	79384132	79386879	2748	REG3A	2447	9e-08	0.00079	16	4.05	0.964	0.0305	0.00184	525	1	1
19	10596796	10614417	17622	KEAP1	16561	1.2e-07	0.00079	30	3.60	2.480	0.0222	0.00184	1350	1	1
8	88882973	88886296	3324	DCAF4L2	8280	4.3e-06	0.01800	21	3.27	2.180	0.0177	0.00184	1185	1	1
14	19553365	19590078	36714	POTEG	12711	4.5e-06	0.01800	13	3.59	1.080	0.0221	0.00184	587	1	1
7	53103349	53104617	1269	POM121L12	7260	5.9e-06	0.01900	15	3.48	1.340	0.0205	0.00184	730	1	1

## $rect
## $rect$w
## [1] 7.479725
## 
## $rect$h
## [1] 5.233331
## 
## $rect$left
## [1] 23.15723
## 
## $rect$top
## [1] 4.054987
## 
## 
## $text
## $text$x
## [1] 24.81427
## 
## $text$y
## [1] 0.835282

You will notice that this Q-Q plot appears curved and inflated, though its slope (lambda) is reasonably near 1. The low alpha value (MLE of the alpha parameter of the Gamma distribution), suggests that the GLM is detecting additional variance in the data that is unmodeled by a pure Poisson regression. Adding covariates to the model should improve the quality of the fit.

The top hits in the plot (you can hover over them) include TP53 but also many unlikely cancer gene candidates. Among these are olfactory receptors, which are located in late replicating regions of the human genome and thus accumulate neutral mutations more frequently (Lawrence et al 2013 Nature Jul 11;499(7457):214-218).

Add covariates to FishHook model

To address these issues, we will load in data specifying replication timing, chromatin state, and nucleotide context. These are all important determinants of somatic neutral mutation density. We load these data in as GRanges objects using functions in data.table, rtracklayer, and gUtils packages (however you are free to use any GRanges import utility of your choice).

We first load in replication timing data as a GRanges then instantiate it as a Covariate Replication timing information is contained in the $score metadata field of reptime. We instantiate it as a covariate of type “numeric” by specifying field score.

## replication timing for NHEK obtained from  https://bit.ly/2sRsXT9 and
## converted to rds via rtracklayer::import
reptimedata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/RT_NHEK_Keratinocytes_Int92817591_hg19.rds')))

## instantiate covariate around 'score' field, name the Covariate "replication timing"
reptime = Cov(data = reptimedata, field = 'score', name = 'ReplicationTiming')

Below, context is a GRanges object with 98 columns representing tri, di, and mononucleotide context counts in the hg19 genome. Code for computing context (e.g. for another genome) is provided here.

We instantiate a numeric Covariate object from context, choosing only two of the columns here to take into account G and C content. Note that the covariate object can be vectorized (concatenated, subsetted) and instantiated around several columns of an input GRanges. As a result, contextcov will be length 2 (representing G and C nucleotide fraction).

context = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/nucleotide.context.rds')))
gc = Cov(data = context, field = c('C', 'G')) ## instantiate Covariate around G and C fields

Finally we load in chromHMM data for cell line A549 from Epigenomics Roadmap. We will want to create a covariate that will model the fraction of heterochromatic and quiescent regions in each query interval.

To do so, we will create an “interval” covariate by not specifying a metadata field.

### data cached from https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E114_15_coreMarks_mnemonics.bed.gz
chromhmm = gr.sub(import('http://mskilab.com/fishHook/hg19/E114_15_coreMarks_mnemonics.bed.gz'), 'chr', '') ## import from bed then gUtils::gr.sub to strip 'chr' identifier
hetchromdata = chromhmm %Q% (name %in% c('8_ZNF/Rpts', '9_Het', '15_Quies')) # %Q% is gUtils operator for subsetting on GRanges metadata, in this case selecting for heterochromatic regions
hetchrom = Cov(hetchromdata, name = 'Heterochromatin') ## instantiate interval covariate

We now add these covariates to the model. For type numeric covariates, e.g. replication timing, this will trigger the calculation of the average value of each covariate within the eligible subset of each hypothesis interval. and the fractional overlap of of its eligible subset This annotation is the most computationally intensive and slowest aspect of fishHook analyses, though occurs within a few seconds for this small number of covariates.

## note how we can concatenate Covariate objects with c() operator
fish$covariates = c(reptime, gc, hetchrom)
## FishHook 2019-12-05 15:19:12: Aggregating covariates over eligible subset of hypotheses
## FishHook 2019-12-05 15:19:12: Overlapping with covered intervals
## FishHook 2019-12-05 15:19:14: Finished overlapping with covered intervals
## FishHook 2019-12-05 15:19:14: Annotating track ReplicationTiming
## FishHook 2019-12-05 15:19:16: Annotating track C
## FishHook 2019-12-05 15:19:17: Annotating track G
## FishHook 2019-12-05 15:19:18: Annotating track Heterochromatin

Now that we’ve added covariates, we can re-score fish and compute p values. Looking at these results, we see an improvement in lambda (closer to one) and alpha (increased), and the nominated gene list (no more olfactory receptors, We see also a reasonable number of significant (fdr<0.25) genes at the top of the list (or at the top right of the QQ plot) that have been biologically implicated in lung adenocarcinoma tumorigenesis.

fish$score()
fish$res %Q% (fdr<0.25) %Q% order(p)
fish$qqp(plotly = FALSE)

seqnames	start	end	width	gene_name	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7565097	7590856	25760	TP53	14970	1.8e-55	0.000	99	6.21	1.340	0.08780	0.001190	1127	1	1
19	10596796	10614417	17622	KEAP1	16561	4.7e-13	0.000	30	4.05	1.810	0.02220	0.001340	1350	1	1
7	55086714	55324313	237600	EGFR	7263	3.8e-06	0.019	33	2.68	5.140	0.00901	0.001400	3661	1	1
1	157800704	157868046	67343	CD5L	1360	3.9e-06	0.019	14	3.11	1.620	0.01340	0.001560	1041	1	1
7	142457319	142460923	3605	PRSS1	7746	7e-06	0.026	15	3.03	1.840	0.02020	0.002480	741	1	1
14	19553365	19590078	36714	POTEG	12711	8e-06	0.026	13	3.06	1.560	0.02210	0.002650	587	1	1
21	44513066	44527697	14632	U2AF1	18408	1.2e-05	0.034	8	3.41	0.754	0.01180	0.001110	677	1	1
7	140419127	140624564	205438	BRAF	7728	2.8e-05	0.069	20	2.64	3.220	0.00937	0.001510	2134	1	1
1	175284330	175712906	428577	TNR	1552	0.00014	0.210	31	2.25	6.530	0.00761	0.001600	4073	1	1
14	20215587	20216528	942	OR4Q3	12714	0.00014	0.210	17	2.44	3.140	0.01810	0.003350	939	1	1
19	43256838	43359843	103006	PSG8	17153	0.00014	0.210	12	2.69	1.860	0.00929	0.001440	1292	1	1
2	113730780	113743242	12463	IL36G	2626	0.00015	0.210	5	3.36	0.486	0.00986	0.000958	507	1	1
2	131486643	131487909	1267	GPR148	2697	0.00015	0.210	10	2.86	1.380	0.00992	0.001360	1008	1	1
6	136578001	136610989	32989	BCLAF1	6827	0.00015	0.210	19	2.42	3.550	0.00689	0.001290	2758	1	1
1	175036994	175117202	80209	TNN	1550	0.00016	0.210	24	2.27	4.960	0.00685	0.001420	3502	1	1
7	27132612	27135615	3004	HOXA1	7126	0.00019	0.240	9	2.91	1.200	0.00896	0.001190	1005	1	1

## $rect
## $rect$w
## [1] 14.02698
## 
## $rect$h
## [1] 9.814243
## 
## $rect$left
## [1] 43.42754
## 
## $rect$top
## [1] 7.604454
## 
## 
## $text
## $text$x
## [1] 46.53505
## 
## $text$y
## [1] 1.566433

Note that this model is still quite rudimentary (e.g. we have lumped together SNVs and indels, we have not substratified SNVs by mutational context, we have employed very few covariates) but we still get a reasonable gene list and minimal statistical inflation.

Merge additional covariates

We may notice that some of the significantly mutated genes are not known to be expressed in lung adenocarcinoma tissues. We also know that genes that are more highly expressed in a given tissue have lower mutation rates. We want to merge in gene expression into the model, but not start from scratch (i.e. create a new model). We can do this using the $merge function.

## load GRanges of lung adenocarcinoma average gene experssion
exprdata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/lung.expr.rds'))) 

# log transform expression values
exprdata$log.tpm = log10(exprdata$tpm+0.01) 

## create Covariate for log gene expression, using log.tpm field in the exprdata object
expr = Cov(exprdata, field = 'log.tpm', name = 'LungExpression') 

## merge / append this new covariate into the FishHook object
fish$merge(expr) 
## FishHook 2019-12-05 15:19:22: Appending new Covariates to this fishHook object
## FishHook 2019-12-05 15:19:22: Overlapping with covered intervals
## FishHook 2019-12-05 15:19:22: Finished overlapping with covered intervals
## FishHook 2019-12-05 15:19:22: Annotating track LungExpression

## look at the $data field to see the new merged covariate data as an additional column
head(fish$data)
## GRanges object with 6 ranges and 10 metadata columns:
##       seqnames        ranges strand |       hid     count  eligible
##          <Rle>     <IRanges>  <Rle> | <integer> <numeric> <numeric>
##   [1]        1   69091-70008      + |         1      <NA>         0
##   [2]        1 367640-368634      + |         2      <NA>         0
##   [3]        1 621059-622053      - |         3      <NA>         0
##   [4]        1 860260-879955      + |         4         0       193
##   [5]        1 879584-894689      - |         5         1       634
##   [6]        1 895967-901095      + |         6      <NA>         0
##        query.id ReplicationTiming                 C                 G
##       <integer>         <numeric>         <numeric>         <numeric>
##   [1]         1              <NA>              <NA>              <NA>
##   [2]         2              <NA>              <NA>              <NA>
##   [3]         3              <NA>              <NA>              <NA>
##   [4]         4  1.69339861733204            0.3204             0.353
##   [5]         5  1.69542176470588 0.333717647058824 0.289047058823529
##   [6]         6              <NA>              <NA>              <NA>
##       Heterochromatin    LungExpression frac.eligible
##             <numeric>         <numeric>     <numeric>
##   [1]            <NA>              <NA>             0
##   [2]            <NA>              <NA>             0
##   [3]            <NA>              <NA>             0
##   [4]               0 0.645780413675521      0.009799
##   [5]               0  1.76972436511895       0.04197
##   [6]            <NA>              <NA>             0
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Subset the FishHook object

Now that we’ve merged gene expression into this model, we would like to apply it both as a covariate and as a gene filter. We would like to exclude from the analysis genes that are known to have poor expression in lung adenocarcinoma tissue, because these genes are unlikely to harbor driver alterations.

The dimension of the fish object is hypotheses by covariates. To subset rows (i.e. hypotheses) from this model we can use the subsetting feature of the FishHook object via the [ operator. We will use this functionality to apply a strict “expression filter” and keep only genes that are expressed >10 TPM in lung adenocarcinoma. We will then re-score this subsetted model.

fish = fish[which(fish$data$LungExpression>1), ] ## subset for high lung expression

## FishHook object spanning 8765 hypotheses, 5 covariates, and 56809 events. 
##  Covering 24.58 MB of eligible territory.
## 5 Covariates with features:
##                 name     type   class   field signature na.rm pad grep
## 1: ReplicationTiming  numeric GRanges   score      <NA>    NA   0   NA
## 2:                 C  numeric GRanges       C      <NA>    NA   0   NA
## 3:                 G  numeric GRanges       G      <NA>    NA   0   NA
## 4:   Heterochromatin interval GRanges    <NA>      <NA>    NA   0   NA
## 5:    LungExpression  numeric GRanges log.tpm      <NA>    NA   0   NA

This object now contains 8765 hypotheses (i.e. genes), which we can re-score to obtain P values and FDRs.

fish$score()
fish$res %Q% (fdr<0.25) %Q% order(p)
fish$qqp(plotly = FALSE)

seqnames	start	end	width	gene_name	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7565097	7590856	25760	TP53	6855	3.7e-62	0.00000	99	6.23	1.320	0.08780	0.00117	1127	1	1
19	10596796	10614417	17622	KEAP1	7532	5.5e-15	0.00000	30	4.21	1.620	0.02220	0.00120	1350	1	1
7	55086714	55324313	237600	EGFR	3414	7.8e-08	0.00023	33	2.86	4.540	0.00901	0.00124	3661	1	1
7	140419127	140624564	205438	BRAF	3636	9.1e-07	0.00200	20	2.89	2.690	0.00937	0.00126	2134	1	1
21	44513066	44527697	14632	U2AF1	8281	4.4e-06	0.00770	8	3.50	0.709	0.01180	0.00105	677	1	1
6	136578001	136610989	32989	BCLAF1	3197	2.9e-05	0.04200	19	2.55	3.250	0.00689	0.00118	2758	1	1
7	41724712	41742706	17995	INHBA	3384	5.3e-05	0.06600	13	2.60	2.150	0.01020	0.00169	1273	1	1
6	291630	351355	59726	DUSP22	2814	7.2e-05	0.07900	10	2.90	1.340	0.01750	0.00235	572	1	1
X	47004268	47046212	41945	RBM10	8583	9.9e-05	0.09600	7	3.05	0.844	0.01330	0.00160	526	1	1
4	177604689	177713881	109193	VEGFC	2412	0.00012	0.10000	10	2.74	1.500	0.00928	0.00139	1078	1	1
21	27838528	27945603	107076	CYYR1	8224	0.00013	0.10000	6	3.27	0.621	0.01250	0.00130	479	1	1
5	17065707	17276943	211237	BASP1	2463	0.00024	0.18000	4	3.87	0.274	0.01810	0.00124	221	1	1

## $rect
## $rect$w
## [1] 15.72487
## 
## $rect$h
## [1] 11.0022
## 
## $rect$left
## [1] 48.6842
## 
## $rect$top
## [1] 8.524932
## 
## 
## $text
## $text$x
## [1] 52.16786
## 
## $text$y
## [1] 1.756041

To inspect the parameters of this model and see which features it is using, we can employ the$modelaccessor:

summary(fish$model)
## 
## Call:
## glm.nb(formula = formula, data = as.data.frame(tdt), maxit = iter, 
##     init.theta = 4.955847251, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8185  -0.9986  -0.2746   0.4495  16.4831  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -6.86048    0.07620 -90.029  < 2e-16 ***
## ReplicationTiming -0.28667    0.01850 -15.497  < 2e-16 ***
## C                  0.01459    0.57485   0.025  0.97975    
## G                  1.04825    0.58157   1.802  0.07148 .  
## Heterochromatin    0.36417    0.03067  11.874  < 2e-16 ***
## LungExpression     0.08789    0.02673   3.289  0.00101 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.9558) family taken to be 1)
## 
##     Null deviance: 9604.1  on 8764  degrees of freedom
## Residual deviance: 9082.1  on 8759  degrees of freedom
## AIC: 27326
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.956 
##           Std. Err.:  0.268 
## 
##  2 x log-likelihood:  -27312.243

We can see from the Estimate and Pr(|>z|) columns of the Coefficients table that replication timing and lung gene expression are significantly negatively correlated and heterochromatin is significantly positively correlated with mutational density (as expected). However, this table shows that G and C content are not significantly correlated with mutation density. We can use the column subsetting function to remove these covariates and re-score to see how the results change.

fish2 = fish[, -c(2:3)] ## remove the 2nd and 3rd covariates ('G', 'C')
fish2$score() ## re-score
## FishHook 2019-12-05 15:19:25: Setting up problem
## FishHook 2019-12-05 15:19:25: Fitting model with 8,765 data points and 3 covariates
## FishHook 2019-12-05 15:19:25: Computing p values for 8765 hypotheses.

## check if our new top genes are identical to the previous - they are
identical((fish2$res %Q% (fdr<0.25))$gene_name, (fish2$res %Q% (fdr<0.25))$gene_name)
## [1] TRUE

The results are identical with and without G and C covariates. This further suggests that these covariates are not necessary in the model and can likely be excluded.

Analyze Reactome pathways

Read in and parse reactome pathways into list of gene symbols, then match against genes in our model.

## parse Reactome pathways from .gmt format
pathways = strsplit(readLines('http://mskilab.com/fishHook/hg19/ReactomePathways.gmt'), '\t')
pathways = structure(lapply(pathways, '[', -1), names = sapply(pathways, '[', 1))

## match them to create sets of indices as a named list
sets = sapply(sapply(pathways, match, fish$hypotheses$gene_name), setdiff, NA)

Here is what the pathways and sets look like:

head(pathways[1:2])
## $`2-LTR circle formation`
##  [1] "R-HSA-164843" "BANF1"        "HMGA1"        "LIG4"        
##  [5] "PSIP1"        "XRCC4"        "XRCC5"        "XRCC6"       
##  [9] "gag"          "gag-pol"      "rev"          "vif"         
## [13] "vpr"          "vpu"         
## 
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "R-HSA-73843" "PRPS1"       "PRPS1L1"     "PRPS2"

head(sets[1:2])
## $`2-LTR circle formation`
## [1] 5005 2973 5852 4018 2573 1523 8471
## 
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] 8672 8535

The list sets contains integer vectors that index fish$hypotheses.

To run a set analysis, we just set the $sets variable in the FishHook object. This triggers scoring of hypothesis sets (in this case gene sets). The results of the set analysis are shown in $setres variable.

The set analysis is a bit more computationally intensive. We can speed things up through parallelization (setting `fish$mc.cores = 5).

fish$mc.cores = 5

## this triggers scoring of gene sets using covariate corrected model
fish$sets = sets

## list table of results for top sets
values(fish$setres %Q% order(p)[1:5])

setname	n	p	effect	estimate	p.left
TP53 Regulates Transcription of Genes Involved in G1 Cell Cycle Arrest	9	2.42e-37	10.4 [7.26-15]	10.40	1
TP53 regulates transcription of several additional cell death genes whose specific roles in p53-dependent apoptosis remain uncertain	10	4.63e-37	8.94 [6.37-12.6]	8.94	1
Regulation of TP53 Activity through Association with Co-factors	8	4.8e-37	10.2 [7.12-14.6]	10.20	1
RUNX3 regulates CDKN1A transcription	5	2.83e-36	17.3 [11.1-27]	17.30	1
PI5P Regulates TP53 Acetylation	8	3.88e-36	11 [7.53-16]	11.00	1

Examining the results table, we can see that most of the significant pathways appear related to TP53. Indeed, if we inspect the hypotheses (i.e. genes) contributing to these these gene sets, we will see that they are dominated by TP53 and 1-2 additional genes. For example:

## pick top set in setres, 
## setres is a GRangesList, containing supporting hypotheses sorted by p value
fish$setres %Q% order(p)[1]

seqnames	start	end	width	gene_name	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7565097	7590856	25760	TP53	6855	3.7e-62	0.00	99	6.23	1.320	0.08780	0.00117	1127	1.00	1
19	30302805	30315215	12411	CCNE1	7670	0.016	0.97	5	1.74	1.500	0.00478	0.00143	1046	0.99	1
12	12867992	12875305	7314	CDKN1B	5324	0.097	0.97	2	1.95	0.517	0.00403	0.00104	496	0.98	1
4	122737599	122745087	7489	CCNA2	2344	0.2	0.97	2	0.59	1.330	0.00184	0.00122	1086	0.83	1
3	51991470	52002426	10957	PCBP4	1792	0.39	0.97	2	0.33	1.600	0.00182	0.00145	1098	0.77	1
20	32263489	32274210	10722	E2F1	8044	0.56	0.97	0	-Inf	0.875	0.00000	0.00110	795	0.45	1
12	56360553	56366568	6016	CDK2	5473	0.66	0.97	0	-Inf	0.865	0.00000	0.00109	795	0.45	1
12	54762917	54785082	22166	ZNF385A	5466	0.73	0.97	0	-Inf	0.842	0.00000	0.00123	687	0.46	1
6	36644305	36655116	10812	CDKN1A	2996	0.93	0.99	0	-Inf	0.445	0.00000	0.00115	388	0.65	1

This is a common challenge with pathway analysis of mutations, since many cancer pathways are usually driven by a single “celebrity” gene. We can dig a little deeper to identify significant gene sets that do not have TP53 . Here is one approach:

## these are gene sets with TP53
has.tp53 = which(grl.eval(fish$setres, 'TP53' %in% gene_name))

## we subset on gene sets that do not have TP53
values(fish$setres[-has.tp53, ] %Q% order(p)[1:5])

setname	n	p	fdr	effect	estimate	p.left	p.twosided
Extracellular matrix organization	173	4.61e-09	1.0e-07	1.36 [1.23-1.51]	1.36	1	0e+00
GRB2 events in EGFR signaling	6	4.06e-08	1.1e-06	4.11 [2.45-6.88]	4.11	1	1e-07
GRB2 events in ERBB2 signaling	9	6.48e-08	1.7e-06	3.33 [2.13-5.2]	3.33	1	1e-07
EGFR Transactivation by Gastrin	7	1.34e-07	3.4e-06	3.73 [2.26-6.15]	3.73	1	3e-07
SHC1 events in EGFR signaling	7	1.59e-07	4.1e-06	3.56 [2.19-5.8]	3.56	1	3e-07

These sets appear interesting and are related to additional biological processes known to be important to lung adenocarcinoma biology, such EGFR signaling. The top gene set “Extracellular matrix organization” appears especially interesting, because it is not obviously associated with known targets of driver mutations in this disease. Let’s examine genes contributing to this top gene set:

## inspect the non-TP53 associated gene set (GRangesList)
fish$setres[-has.tp53, ] %Q% order(p)[1]

gene_name	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
FGB	2382	0.0034	0.84	9	2.09	2.11	0.00662	0.00155	1359	1.00	1
COL5A2	1459	0.0036	0.84	19	1.66	6.01	0.00495	0.00157	3838	1.00	1
ITGAL	6570	0.0047	0.84	16	1.64	5.15	0.00478	0.00154	3344	1.00	1
CAPN6	8682	0.0056	0.86	11	1.69	3.40	0.00580	0.00180	1895	0.99	1
COL11A1	484	0.0088	0.96	36	1.36	14.00	0.00778	0.00303	4629	0.99	1
SERPINE1	3539	0.0095	0.96	6	2.05	1.45	0.00498	0.00120	1206	1.00	1

Interesting! This significant gene set is composed of extracellular matrix genes (COL5A2, ITGAL) that are not significant on their own. This constitutes a true pathway level hit.

Analyze truncating mutations

We can run a similar analysis but choosing only truncating mutations (by subsetting the mutation GRanges). Scoring this new model, we obtain a different mutation list, containing likely candidate drivers with enrichment of frameshift, nonsense, or nonstop indels and SNVs.

## replace events with new subset of mutations (using %Q% subsetting operator from gUtils)
fish$events = mutations %Q% 
   (grepl('(Frame_Shift_)|(Nonsense)|(OutOfFrame)|(Nonstop)', Variant_Classification))

## re-score model and inspect results
fish$score()
fish$res %Q% (fdr<0.25) %Q% order(p)
fish$qqp(plotly = FALSE)

seqnames	start	end	width	gene_name	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7565097	7590856	25760	TP53	6855	1.7e-32	0.00000	33	7.54	0.1770	0.02930	0.000157	1127	1	1
X	47004268	47046212	41945	RBM10	8583	2.7e-08	0.00012	6	6.06	0.0897	0.01140	0.000171	526	1	1
3	47057919	47205457	147539	SETD2	1740	6.4e-07	0.00190	15	4.08	0.8840	0.00245	0.000145	6111	1	1
16	3115298	3131908	16611	IL32	6448	2.1e-05	0.04600	4	5.46	0.0906	0.00905	0.000205	442	1	1
1	27022524	27108595	86072	ARID1A	166	4.7e-05	0.08200	10	3.62	0.8120	0.00176	0.000143	5673	1	1
7	855528	936072	80545	SUN1	3275	0.00015	0.22000	5	4.01	0.3110	0.00224	0.000140	2228	1	1

## $rect
## $rect$w
## [1] 8.193441
## 
## $rect$h
## [1] 5.732696
## 
## $rect$left
## [1] 25.36689
## 
## $rect$top
## [1] 4.441914
## 
## 
## $text
## $text$x
## [1] 27.18205
## 
## $text$y
## [1] 0.9149847

Though TP53, RBM10, SETD2, and ARID1A are well known targets of truncating mutations in lung adenocarcinoma, IL32, TRIB1, SUN1 are interesting candidates that have not previously been associated with lung adenocarcinoma.

Driver discovery in cancer whole genomes

Read in data

As with whole exome sequencig read in the mutation calls and gene annotations. We can separately analyze different event types. Here we create two GRanges for point mutations and Indels.

## makeGRangesFromDataFrame
## rtracklayer::import reads gtf and gr.sub replaces chr
genes = gr.sub(import('http://mskilab.com/fishHook/hg19/gencode.v19.genes.gtf'))

## %Q% is a gUtils subsetting operator for GRanges
genes = genes %Q% (gene_type == 'protein_coding') %Q% (level<3)

## Reading in mutation calls from WGS

mutations.wgs = readRDS("/gpfs/commons/groups/imielinski_lab/projects/FishHook/CurrentProtocols/db/de_identified_LUAD_mutations.rds")

## Separating SNVs and INDELS
snv.wgs = mutations.wgs %Q% (Variant_Type == 'SNP')
indel.wgs = mutations.wgs %Q% (Variant_Type %in% c('DEL', 'INS'))

eligible.wgs = readRDS('~/DB/fishHook/eligible/wgs.eligible.rds')

Looking into the GRanges object.

head(snv.wgs)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames    ranges strand | Variant_Classification Variant_Type
##          <Rle> <IRanges>  <Rle> |            <character>  <character>
##   [1]        1    927349      * |                    IGR          SNP
##   [2]        1   2134667      * |                 Intron          SNP
##   [3]        1   2454284      * |                 Intron          SNP
##   [4]        1   2623503      * |                 Intron          SNP
##   [5]        1   2726119      * |                    IGR          SNP
##   [6]        1   2875213      * |                    IGR          SNP
##                                       uuid
##                                <character>
##   [1] 86188246-6189-4332-9f74-6790fac3c3c9
##   [2] 86188246-6189-4332-9f74-6790fac3c3c9
##   [3] 86188246-6189-4332-9f74-6790fac3c3c9
##   [4] 86188246-6189-4332-9f74-6790fac3c3c9
##   [5] 86188246-6189-4332-9f74-6790fac3c3c9
##   [6] 86188246-6189-4332-9f74-6790fac3c3c9
##   -------
##   seqinfo: 64 sequences from an unspecified genome
head(indel.wgs)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames              ranges strand | Variant_Classification
##          <Rle>           <IRanges>  <Rle> |            <character>
##   [1]        1   22042515-22042520      * |                 Intron
##   [2]        1            30450801      * |                    IGR
##   [3]        1   58913032-58913037      * |                 Intron
##   [4]        1   72171118-72171135      * |                 Intron
##   [5]        1   97150123-97150126      * |                    IGR
##   [6]        1 107471278-107471277      * |                    IGR
##       Variant_Type                                 uuid
##        <character>                          <character>
##   [1]          DEL 86188246-6189-4332-9f74-6790fac3c3c9
##   [2]          INS 86188246-6189-4332-9f74-6790fac3c3c9
##   [3]          DEL 86188246-6189-4332-9f74-6790fac3c3c9
##   [4]          DEL 86188246-6189-4332-9f74-6790fac3c3c9
##   [5]          DEL 86188246-6189-4332-9f74-6790fac3c3c9
##   [6]          DEL 86188246-6189-4332-9f74-6790fac3c3c9
##   -------
##   seqinfo: 64 sequences from an unspecified genome

Creating tiles and supertiles

Since we want to consider the whole genome, we can create contigous tiles or overlapping tiles viz. super tiles of a given size (e.g. 10 Kb). Each tile/supertile can then be annotaed by gene closest to it.


# We can use gr.tile function from gUtils package to divide genome into equal sized tiles
tiles = gr.tile(seqinfo(mutations.wgs), 1e4)

# %+% is a gUtils operator that shifts ech GRange by given distance
# Here we create additional GRanges by shifting all tiles by 5000 Kb
supertiles = c(tiles, tiles %+% 5000)

## Annotating by nearest gene
supertiles$nearest.gene = genes$gene_name[nearest(supertiles, gr.stripstrand(genes))]

tiles$nearest.gene = genes$gene_name[nearest(tiles, gr.stripstrand(genes))]

Instantiate FishHook object from events and eligible territory and run basic model

We can run a basic model wtihout the covariates. Here we use point mutations to demonstrate. We use the tiles and supertiles as the hypotheses. We can score this model and take a look at the reults

For SNVs


## Supertiles
fish.supertiles.snv = Fish(hypotheses = supertiles, events = snv.wgs, eligible = eligible.wgs, idcol = 'uuid', use_local_mut_density = TRUE, mc.cores = 40)
fish.supertiles.snv = fish.supertiles.snv[fish.supertiles.snv$data$frac.eligible > 0.75, ]
fish.supertiles.snv$score()
gr2dt(fish.supertiles.snv$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	strand	width	query.id	tile.id	nearest.gene	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7570001	7580000		10000	9	107636	TP53	86964	3.8e-10	9.6e-05	31	2.69	4.81	0.00331	0.000514	9367	1	1

p1 = fish.supertiles.snv$qqp(plotly = FALSE)

For Indels


## Supertiles
fish.supertiles.indel = Fish(hypotheses = supertiles, events = indel.wgs, eligible = eligible.wgs, idcol = 'uuid', use_local_mut_density = TRUE, mc.cores = 40)
fish.supertiles.indel = fish.supertiles.indel[fish.supertiles.indel$data$frac.eligible > 0.75, ]
fish.supertiles.indel$score()
gr2dt(fish.supertiles.indel$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	width	query.id	tile.id	nearest.gene	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
2	85880001	85890000	10000	12	137301	SFTPB	111940	1.3e-12	3.0e-07	10	5.47	0.226	0.001100	2.50e-05	9057	1	1
17	7570001	7580000	10000	9	107636	TP53	86964	5.3e-10	6.7e-05	8	5.24	0.211	0.000854	2.25e-05	9367	1	1
10	81370001	81380000	10000	2	33063	SFTPA1	27038	4.6e-07	3.9e-02	5	5.23	0.133	0.000626	1.67e-05	7993	1	1

p1 = fish.supertiles.indel$qqp(plotly = FALSE)

Loading covariates

As with whole exome example, we will add covariates that influence the occurence of neutral mutations. These include replication timing, chromatin state, and nucleotide context.

## load GRanges corresponding to lung covariates
chmm = readRDS('~/DB/fishHook/covariates/A549_ChromHMM.rds')
hetchrom = chmm %Q% (type %in% c('Repress','Het', 'Repet'))
active = chmm %Q% (type %in% c('ActEnh', 'ActEnhG', 'ActProm'))
gc = readRDS('~/DB/fishHook/covariates/gc500.rds')
reptime = readRDS('~/DB/fishHook/covariates/replication.timing.rds')
dnase = readRDS('~/DB/fishHook/covariates/A549_DNAsa.rds')

covariates = c(
    Cov(hetchrom, name = 'Heterochromatin'),
    Cov(active, name = 'ActiveChromatin'),
    Cov(gc, 'score', name = 'GC'),
    Cov(reptime, 'val', name = 'ReplicationTiming'),
    Cov(dnase, name = 'OpenChromatin'))
covariates$pad = 1e5

Adding covariates to the models

For SNVs


## Supertiles
fish.supertiles.snv$covariates = covariates
fish.supertiles.snv = fish.supertiles.snv[fish.supertiles.snv$data$frac.eligible > 0.75 & fish.supertiles.snv$data$Heterochromatin<0.95,]
fish.supertiles.snv$score()
gr2dt(fish.supertiles.snv$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	width	query.id	tile.id	nearest.gene	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
17	7570001	7580000	10000	9	107636	TP53	52674	1.3e-08	0.0018	31	2.81	4.42	0.00331	0.000472	9367	1	1
20	9290001	9300000	10000	13	153961	PLCB4	77354	4.2e-06	0.2400	21	2.50	3.70	0.00220	0.000387	9548	1	1
15	26950001	26960000	10000	7	90292	GABRB3	43092	5.2e-06	0.2400	25	2.31	5.05	0.00262	0.000530	9535	1	1

p1 = fish.supertiles.snv$qqp(plotly = FALSE)

For Indels


## Supertiles
fish.supertiles.indel$covariates = covariates 
fish.supertiles.indel = fish.supertiles.indel[fish.supertiles.indel$data$frac.eligible > 0.75 & fish.supertiles.indel$data$Heterochromatin<0.95, ]
fish.supertiles.indel$score()
gr2dt(fish.supertiles.indel$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	width	query.id	tile.id	nearest.gene	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
2	85880001	85890000	10000	12	137301	SFTPB	68968	7.8e-15	0.0e+00	10	6.49	0.1110	0.001100	1.23e-05	9057	1	1
17	7570001	7580000	10000	9	107636	TP53	52674	4.2e-11	3.0e-06	8	6.04	0.1210	0.000854	1.29e-05	9367	1	1
10	81370001	81380000	10000	2	33063	SFTPA1	16893	2.2e-06	1.0e-01	5	5.04	0.1510	0.000626	1.89e-05	7993	1	1
1	241000001	241010000	10000	1	24101	RGS7	12353	5.1e-06	1.5e-01	5	4.77	0.1830	0.000544	1.99e-05	9190	1	1
2	85890001	85900000	10000	12	137302	SFTPB	68969	5.3e-06	1.5e-01	4	5.17	0.1110	0.000434	1.21e-05	9206	1	1
8	22020001	22030000	10000	21	261477	SFTPC	124499	7e-06	1.6e-01	4	5.34	0.0987	0.000444	1.10e-05	9009	1	1
1	114540001	114550000	10000	1	11455	OLFML3	6306	9.6e-06	1.9e-01	4	4.85	0.1390	0.000425	1.47e-05	9421	1	1

p1 = fish.supertiles.indel$qqp(plotly = FALSE)

Choice of hypotheses

Since we are dealing with WGS, we can query different types of events that can be interesting for understanding the genomic characterisitic of samples under consideration. For ecample, we can query just the accessible regions of the genome by using Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) peaks. For this demonstration we will use the lung specific ATAC peaks described in [REF] and Indels for the sample set

Loading in the peaks ans lifting to hg19


# Reading the ATAC peak annotations 
LUAD = readRDS("/gpfs/commons/groups/imielinski_lab/data/GDAN/ATAC/LUAD.fwp.filter.non_overlapping.annotated.rds")
atac.peaks.luad = gr2dt(LUAD)
atac.peaks.luad[ , peakid := 1:nrow(atac.peaks.luad)] # Adding peak IDs for query later
atac.peaks.luad[ , diff := atac.peaks.luad$end - atac.peaks.luad$start]
peaks.pos = atac.peaks.luad[atac.peaks.luad$diff > 0,] 
gr.atac.luad = dt2gr(peaks.pos)
chain38to19 = rtracklayer::import.chain("~/DB/UCSC/hg38ToHg19.over.chain")
hg19.atac = gr.sub(dt2gr(as.data.table(rtracklayer::liftOver(gr.atac.luad, chain38to19))))
hg19.atac$gene = hg19.atac$SYMBOL %>% as.character
hg19.atac$ann = hg19.atac$annotation %>% as.character
hg19.atac$pk = hg19.atac$peakid %>% as.character


## Indels in ATAC peaks
fish.atac = Fish(hypotheses = hg19.atac, events = indel.wgs, covariates = covariates, eligible = eligible.wgs, idcol = 'uuid', use_local_mut_density = TRUE, mc.cores = 40)
fish.atac = fish.atac[fish.atac$data$frac.eligible > 0.75 & fish.atac$data$Heterochromatin<0.95,]
fish.atac$score()
gr2dt(fish.atac$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	strand	width	group	group_name	name	score	annotation	geneId	distanceToTSS	ENSEMBL	SYMBOL	GC	AT	N	CTCF	peakid	diff	gene	ann	pk	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
2	85885274	85885774		501	369	NA	LUAD_18411	31.86810	3’ UTR	6439	5077	ENSG00000168878	SFTPB	0.4730539	0.5269461	0	FALSE	369	500	SFTPB	3’ UTR	369	343	4.5e-07	0.03	3	7.80	0.0134	0.00599	2.68e-05	501	1	1
2	85894071	85894571		501	59516	NA	LUAD_18423	3.06975	Intron	6439	974	ENSG00000168878	SFTPB	0.5508982	0.4491018	0	FALSE	59516	500	SFTPB	Intron	59516	44553	5.5e-07	0.03	3	7.82	0.0133	0.00599	2.65e-05	501	1	1

p1 = fish.atac$qqp(plotly = FALSE)

In order to gain power, we can collapse the neighboring ATAC peaks and run fishHook model using the collpase ATAC peaks as hypotheses.


pad = 1e3 # distance within which all peaks are merged; set to 1000 bp
reduced.atac = GenomicRanges::reduce(hg19.atac+pad, ignore.strand = T)-pad

Running a new model


fish.reduced.atac = Fish(hypotheses = reduced.atac, covariates = covariates,
               events = indel.wgs,
               eligible = eligible.wgs, idcol = "uuid", mc.cores = 40)
fish.reduced.atac = fish.reduced.atac[fish.reduced.atac$data$frac.eligible > 0.75 & fish.reduced.atac$data$Heterochromatin<0.95,]          
fish.reduced.atac$score()
gr2dt(fish.reduced.atac$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	strand	width	hid	p	fdr	count	effectsize	count.pred	count.density	count.pred.density	eligible	p.neg	fdr.neg
2	85883214	85896852		13639	32965	8e-11	5.1e-06	12	5.90	0.2010	0.000955	1.6e-05	12572	1	1
10	81374144	81375818		1675	8401	1.3e-07	4.2e-03	4	6.93	0.0329	0.002430	2.0e-05	1646	1	1

gr2dt(genes %&% (fish.reduced.atac$res %Q% (order(p)) %Q% (fdr < 0.25))) %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
    scroll_box(width = "100%", height = "200px")

seqnames	start	end	strand	width	source	type	score	phase	gene_id	transcript_id	gene_type	gene_status	gene_name	transcript_type	transcript_status	transcript_name	level	havana_gene	tag
2	85884437	85895864		11428	HAVANA	gene	NA	NA	ENSG00000168878.12	ENSG00000168878.12	protein_coding	KNOWN	SFTPB	protein_coding	KNOWN	SFTPB	2	OTTHUMG00000130181.5	NA
10	81370695	81375196		4502	HAVANA	gene	NA	NA	ENSG00000122852.10	ENSG00000122852.10	protein_coding	KNOWN	SFTPA1	protein_coding	KNOWN	SFTPA1	2	OTTHUMG00000018565.3	NA

p1 = fish.reduced.atac$qqp(plotly = FALSE)

fishHook Tutorial

Zoran Gajic and Marcin Imielinski

Introduction

Driver discovery in cancer whole exomes

Read in data

Instantiate FishHook object from events and eligible territory

Run basic model without covariates

Add covariates to FishHook model

Merge additional covariates

Subset the FishHook object

Analyze Reactome pathways

Analyze truncating mutations

Driver discovery in cancer whole genomes

Read in data

Creating tiles and supertiles

Instantiate FishHook object from events and eligible territory and run basic model

Loading covariates

Adding covariates to the models

Choice of hypotheses