Introduction

The Chromunity R package enables the nomination and statistical evaluation of high order interactions among genomic loci using Pore-C data. The basic unit of Pore-C data is the concatemer, which encodes the multiway interactions of multiple monomers. Chromunity can be applied to the de novo detection of high order interacting loci using graph-based community detection of concatemers. Chromunity also provides a platform to test the enrichment of high order interactions as compared to the pairwise background (a phenomenon we call "synergy"). The Chromunity package is integrated with the GenomicRanges and data.table packages, allowing easy incorporation into bioinformatics workflows that employ the R/Bioconductor ecosystem.

For installation instructions, please visit the Chromunity GitHub page. For background, it may help to have some familiarity with the data.table, GenomicRanges, and gUtils packages.

If you use Chromunity in your work, please cite: Deshpande et al., Nature Biotechnology, 2022.

Preliminary QC

To understand the nature of the run, the following metrics can be used to assess the quality of the data.

  1. Base-level information can be found under basecall/{enzyme}_{run_id}.rd.summary.csv. This file gives aggregate statistics, from which you can retrieve the yield in Gb and the read-length N50.
  2. The pairwise contact yield can be extracted from contacts/{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.contacts.parquet. This file is equivalent to a pairs file, so the number of rows indicates the number of pairwise contacts. If the run was divided into chunks, you can write a loop to gather the number of rows from each file and aggregate them (see the sketch after this list).
  3. The pairwise density can then be calculated as (number of pairwise contacts in millions) / (yield in Gb). This is a measure of the contact density of the raw reads.
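
As an illustration of points 2 and 3, a minimal sketch along the following lines tallies the pairwise contacts across Parquet chunks and computes the density; it assumes the arrow R package is installed, and the paths and yield value are placeholders to replace with values from your own run.

library(arrow)

## Placeholder path; point this at your contacts/ directory
contact_files = list.files("contacts", pattern = "\\.contacts\\.parquet$", full.names = TRUE)

## One row per pairwise contact, summed over all chunks
n_contacts = sum(sapply(contact_files, function(f) nrow(read_parquet(f))))

## Yield in Gb, taken from the basecall summary CSV (placeholder value)
yield_gb = 50

## Pairwise density in millions of contacts per Gb of yield
pairwise_density = (n_contacts / 1e6) / yield_gb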

Empirically, values in the following ranges are indicative of a successful run.

Metric Value
Yield 30-100 Gb
N50 3-6 kb
Number of pairwise contacts 200-500 M
Pairwise density 3-6 M/Gb

Converting Parquet files to GRanges

Here we demonstrate the Chromunity workflow, from building a GRanges object to calling communities and testing them for synergy. The first step in the workflow is the conversion of Parquet files to GRanges for downstream analyses. To read more about the Parquet format and for alternate ways of processing these files, visit the Apache Parquet page. This step requires the R implementation of Apache Arrow; for installation instructions, visit the Apache Arrow page for the R package.

The Pore-C Snakemake pipeline creates Parquet files in chunks. The parquet2gr function compiles them into a single GRanges object.

Note

If you use some other way to concatenate Parquet files, make sure to re-assign the index in the read_idx column so that each index is uniquely matched to a read_name. Additionally, if there is more than one run for the same sample, apply the same procedure after merging. For the example Parquet files, please download the folder from here and replace the path below with the local folder path.
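
If you do need to re-derive read_idx yourself, a minimal data.table sketch is shown below; this_dt is assumed to be your merged monomer table (held as a data.table) with a read_name column matching the outputs shown in this tutorial.

library(data.table)

## Assign one integer index per unique read_name so that read_idx is
## unique per concatemer across chunks and runs
this_dt[, read_idx := as.integer(factor(read_name, levels = unique(read_name)))]

## Sanity check: every read_name maps to exactly one read_idx
stopifnot(this_dt[, uniqueN(read_idx), by = read_name][, all(V1 == 1)])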

library(rtracklayer)
library(skitools)
library(chromunity)
library(MASS)
## This is the path to directory with all chunks
example_dir1 = "/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs"

## Generating GRanges
this_gr = parquet2gr(example_dir1, mc.cores = 2)

## Removing low quality monomers, using gUtils functions 
this_gr = this_gr %Q% (pass_filter == TRUE)

head(this_gr)
## GRanges object with 6 ranges and 3 metadata columns:
##                     seqnames            ranges strand |
##                        <Rle>         <IRanges>  <Rle> |
##   [1] chr9_KI270717v1_random         2283-3462      * |
##   [2]                   chr9 63403784-63404412      * |
##   [3]                   chr9 63388544-63389031      * |
##   [4]                   chr3 44088326-44090787      * |
##   [5]                   chr3 45455924-45456924      * |
##   [6]                   chr3 45442793-45443532      * |
##                                  read_name pass_filter  read_idx
##                                <character>   <logical> <integer>
##   [1] bcb51469-f2e9-41df-b4e2-fdd9eb63b039        TRUE         1
##   [2] bcb51469-f2e9-41df-b4e2-fdd9eb63b039        TRUE         1
##   [3] bcb51469-f2e9-41df-b4e2-fdd9eb63b039        TRUE         1
##   [4] e754f6cc-804a-4d14-b46c-a652723d9524        TRUE         2
##   [5] e754f6cc-804a-4d14-b46c-a652723d9524        TRUE         2
##   [6] e754f6cc-804a-4d14-b46c-a652723d9524        TRUE         2
##   -------
##   seqinfo: 118 sequences from an unspecified genome

Converting CSV files to GRanges

The data for the paper are uploaded to GEO in CSV format, with one file per run per sample. The csv2gr function reads all the files and generates a single GRanges object with a unique read_idx for each concatemer. Please substitute the directory path in the following code with the path to the CSV files.

## This is the path to directory with all chunks
example_dir1 = "/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/csv_file"

## Generating GRanges
this_gr = csv2gr(example_dir1, mc.cores = 2)

## Removing low quality monomers, using gUtils functions 
this_gr = this_gr %Q% (pass_filter == TRUE)

head(this_gr)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames              ranges strand |
##          <Rle>           <IRanges>  <Rle> |
##   [1]    chr11   74929590-74929845      * |
##   [2]    chr11   73908636-73908936      * |
##   [3]    chr11 102426931-102427103      * |
##   [4]     chrX   18488003-18488611      * |
##   [5]    chr10   16354569-16354834      * |
##   [6]     chr8   47609593-47609775      * |
##                                  read_name pass_filter  read_idx
##                                <character>   <logical> <integer>
##   [1] c7631d1d-6eeb-402e-8f53-d084692ed451        TRUE         1
##   [2] c7631d1d-6eeb-402e-8f53-d084692ed451        TRUE         1
##   [3] c7631d1d-6eeb-402e-8f53-d084692ed451        TRUE         1
##   [4] 99c09b00-f7a2-4074-8dbf-c0424de01533        TRUE         2
##   [5] 99c09b00-f7a2-4074-8dbf-c0424de01533        TRUE         2
##   [6] 99c09b00-f7a2-4074-8dbf-c0424de01533        TRUE         2
##   -------
##   seqinfo: 147 sequences from an unspecified genome

Using a graph-based approach to call communities

To enable de novo discovery of high order interacting bin-sets we developed an algorithm for community detection of concatemers. Since concatemer community detection involves building a massive graph, computational complexity can be limiting for the analysis of very large concatemer datasets. This may limit the portion of the dataset that can be used for community detection and hence limit the power for de novo high order bin-set discovery. We implemented two optimization strategies to overcome this problem.

"Sliding window" Chromunity

"Sliding window" Chromunity where the genome is defined around (10 kb, 25 kb, and/or 50 kb) tiles in each 2Mb width, 1Mb stride sliding window to study regional high order structure. The benefit of this approach is that it enables a relatively unbiased discovery of high order structure or cooperativity (i.e. without needing to know specific annotations); however, it does not scale to the analysis of reference-distal or interchromosomal cooperativity.

Following are details of various parameters for sliding_window_chromunity:

Parameters Description
concatemers GRanges with $cid
resolution bin size for community detection [5e4]
region region to run on [si2gr(concatemers)]
windows GRanges or GRangesList of windows to test; more appropriate for testing specific windows with RE Chromunity, e.g. targets such as promoter/enhancer windows. If the sliding window approach is desired, keep this parameter NULL and use the piecewise argument
window.size window size to do community detection within
max.size max size of the problem (concatemers x bins) to consider (default is 2^31-1). If this problem size is hit, concatemers will be subsampled so that concatemers x bins < max.size; this step is done downstream of shaving
k.knn KNN parameter specifying how many nearest neighbors to sample when building KNN graph
peak.thresh peak threshold with which to call a peak
take_sub_sample logical flag specifying whether to take a subsample of concatemers for community detection
subsample.frac proportion of concatemers to subsample
k.min minimal number of nearest neighbors an edge in KNN graph needs to have before community detection
chr chromosome to run on
pad integer pad to use when computing the footprint of each chromunity and finding peak regions which become bin-sets

Calling sliding window chromunity


## First using sliding window. Run separately for each chromosome

## This is demo GRange
## this_gr = readRDS( "/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/chr8.single.run.sub.rds")
this_gr = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/chr8.single.run.sub.rds")))

##
this_gr$cid = this_gr$read_idx

##
this_sliding_chrom = sliding_window_chromunity(concatemers = this_gr, resolution = 5e4, window.size = 2e6, take_sub_sample = TRUE, chr  = "chr8", subsample.frac = 0.5, mc.cores = 1)
## Chromunity 2022-08-17 20:21:31: Generated 61776 bins across 146 windows
## Chromunity 2022-08-17 20:21:32: Taking sub-sample with fraction 0.5
## Chromunity 2022-08-17 20:22:31: Analyzing gr.sums associated with 37 concatemer communities to generate binsets

The output of sliding_window_chromunity is a Chromunity object with a concatemers GRanges, in which each row is a monomer indexed by read_idx and labeled with a community id, or chid. The bin-sets are created and stored in a binsets GRanges object that maps each community id to a bin-set id, or bid.
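
For example, to pull out the monomers assigned to a single community together with its bin-set, both GRanges can be subset with the gUtils %Q% operator (the chid value here is illustrative).

## Monomers and bin-set belonging to one community
this_sliding_chrom$concatemers %Q% (chid == 17)
this_sliding_chrom$binsets %Q% (chid == 17)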


## Outputs
## Concatemers GRanges
head(this_sliding_chrom$concatemers)
## GRanges object with 6 ranges and 8 metadata columns:
##       seqnames              ranges strand |       cid  read_idx     binid
##          <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##   [1]     chr8 125390785-125391227      * |     24515     24515     30338
##   [2]     chr8 125114129-125114349      * |     24515     24515     30333
##   [3]     chr8 125346026-125346181      * |     24515     24515     30337
##   [4]     chr8 125318791-125319499      * |     24986     24986     30337
##   [5]     chr8 125370091-125370333      * |     24986     24986     30338
##   [6]     chr8 124860528-124860658      * |     24986     24986     30328
##             tix     count      chid   support     winid
##       <numeric> <integer> <numeric> <integer> <integer>
##   [1]        38         3         1         1       125
##   [2]        33         3         1         1       125
##   [3]        37         3         1         1       125
##   [4]        37         3         2         1       125
##   [5]        38         3         2         1       125
##   [6]        28         3         2         1       125
##   -------
##   seqinfo: 1 sequence from an unspecified genome

## Bin-Sets GRanges
head(this_sliding_chrom$binsets)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames              ranges strand |      chid       bid     winid
##          <Rle>           <IRanges>  <Rle> | <numeric> <numeric> <integer>
##   [1]     chr8 124750001-125000000      * |        17        17       125
##   [2]     chr8 124850001-125000000      * |        21        21       125
##   [3]     chr8 124750001-124800000      * |        42        42       125
##   [4]     chr8 125100001-125250000      * |        42        42       125
##   [5]     chr8 125300001-125350000      * |        42        42       125
##   [6]     chr8 125400001-125450000      * |        42        42       125
##   -------
##   seqinfo: 1 sequence from an unspecified genome

Creating covariate object

Once communities have been called, the next step is to test them for synergy. We define synergy as a significant enrichment of high order interactions as compared to the pairwise background. This involves a few steps. If a set of covariates is to be used, it is necessary to create a covariate object. In this demonstration we will use GC content and the number of fragments as covariates. We load these data in as GRanges objects using functions in the data.table, rtracklayer, and gUtils packages (however, you are free to use any GRanges import utility of your choice). See our fishHook tutorial to gain more intuition about this implementation.


## Frag counts
## frags = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/Nlaiii.frags.hg38.rds") %>% dt2gr()
frags = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/Nlaiii.frags.hg38.rds"))) %>% dt2gr()

## GC content
## gc5b = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/gc.38.rds")
gc5b = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/gc.38.rds")))

## 
cov_list = list(gc5b, frags)

## Specify what kind of covariate it is. Score will be aggregated over bins while the number of intervals will be calculated otherwise.
names(cov_list) <- c("score:gc.cov", "interval:frag.cov")


gc_frag_cov = covariate(name = c("gc", "frag"), type = c("numeric", "interval"), field = c("score", NA), data = cov_list)

gc_frag_cov
## 2 Covariates with features:
##    name     type   class field signature na.rm pad  log count
## 1:   gc  numeric GRanges score      <NA>    NA   0 TRUE  TRUE
## 2: frag interval GRanges  <NA>      <NA>    NA   0 TRUE  TRUE

This covariate object will be used to assign the respective values to each bin in bin-sets.

Annotation of bin-sets with concatemer counts and covariates

The next step is to aggregate the count of concatemers interacting with a given bin, along with all the other features of the genome to be used as covariates. This is done using the annotate function. In short, given n bin-sets, annotate labels their k-power set (i.e. all sets in the power set with cardinality <= k) with basic covariates (min/median/max distance, cardinality, width) and with the optional numeric and interval covariates provided as input (a toy illustration follows the parameter table below). Following are the details of the parameters of this function.

Parameters Description
binsets GRanges of bins with field bid specifying bin-set id
concatemers GRanges of monomers with fields seqnames, start, end, and cid specifying concatemer id, which will be counted across each bin-set
covariates Covariate object specifying numeric or interval covariates to merge with the bin-sets
k the max cardinality of sets to annotate from the power set of each bin-set
gg optional gGraph input specifying an alternate distance function
interchromosomal.dist numeric scalar of "effective" distance for interchromosomal bins [1e8]
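
To make the k-power set concrete, the toy example below (base R only) enumerates all subsets of cardinality 2 to k for a hypothetical bin-set of four bins; annotate performs the analogous enumeration for every input bin-set.

bins = c("b1", "b2", "b3", "b4")  ## toy bin-set
k = 3

subsets = unlist(lapply(2:k, function(m) combn(bins, m, simplify = FALSE)),
                 recursive = FALSE)
length(subsets)  ## choose(4, 2) + choose(4, 3) = 10 sub bin-sets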

## Concatemers used for community detection
this.chrom = gr2dt(this_sliding_chrom$concatemers)

## Concatemers for testing
this_gr_testing = dt2gr(gr2dt(this_gr)[!read_idx %in% unique(this.chrom$read_idx)]) 

annotated_chrom = annotate(binsets = this_sliding_chrom$binsets,
           k = 5,
           concatemers = this_gr_testing,
           covariates = gc_frag_cov, resolution = 5e4,
           mc.cores = 3) 
## Synergy 2022-08-17 20:22:56: Overlapping 104 bins with 163837 monomers
## Synergy 2022-08-17 20:22:57: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:22:57: Making sub binsets
## Synergy 2022-08-17 20:22:57: Made 3525 sub-binsets
## Synergy 2022-08-17 20:22:57: Counting concatemers across sub-binsets across 3 cores
## Synergy 2022-08-17 20:23:23: Computing min median max distances per setid
## Synergy 2022-08-17 20:23:25: Computing marginal sum per bid
## Synergy 2022-08-17 20:23:25: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:23:25: Merging counts, distance, and width
## Synergy 2022-08-17 20:23:26: Adding covariates
## Synergy 2022-08-17 20:23:26: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:23:26: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:23:26: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:23:26: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:23:27: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:23:27: Merging data from interval covariate frag

head(annotated_chrom)
##      bid setid count max.dist min.dist  width cardinality sum.counts        gc
## 1: 10216     3    24    50001    50001 100001           2       2526 0.4455106
## 2: 10291     4    29   150001   150001 200001           2       6942 0.4274106
## 3: 10291     5     8   350001   350001 100001           2       6942 0.4474706
## 4: 10291     6    28    50001    50001 200001           2       6942 0.4241306
## 5: 10291     7     0   350001    50001 250001           3       6942 0.4301106
## 6: 10524     5     5   200001   200001 100001           2       8724 0.3781506
##        frag
## 1:  510.259
## 2: 1020.259
## 3:  544.259
## 4: 1012.259
## 5: 1288.259
## 6:  450.259

Generating background bin-sets based on called communities

Next, we construct a data matrix to train the Synergy background model. To provide examples for model training, we generate a large collection of random bin-sets based on the bin-sets called by the chromunity function, such that the empirical distribution of the core covariates of the random sets matches the empirical distribution of the core covariates associated with the called bin-sets. This is achieved by the SampleRandomBinsets procedure (described in the methods of the paper), which randomly samples sets of intervals that match the input collection with respect to cardinality, interval widths, bin-to-bin distances, and interchromosomal jumps. The output of this algorithm is a background bin-set collection, which is used for model training.

Following are the parameters for the sliding_window_background function:

Parameters Description
chromosome the chromosome to work on
binsets GRanges of bins with fields seqnames, start, end, and $bid specifying bin-set id
resolution The resolution to use for distance simulation
n Integer scalar specifying number of sets to sample (default = nrow(binsets))
genome.to.use genome build to use for the simulation [hg38]

set.seed(198)
back_gr = sliding_window_background(chromosome= "chr8", binsets = this_sliding_chrom$binsets, n = 1000,
                                 resolution = 5e4)
## Generating distributions
## Generating GRanges
back_gr[, V1 := NULL]
back_gr = na.omit(back_gr) 
## Adding a few filters to remove outlier simulations; these can be customized
## Removing bins smaller than the resolution and bins lying out of bounds

## Getting seqlengths of each chromosome
upper.bound = as.data.table(hg_seqlengths(genome = "BSgenome.Hsapiens.UCSC.hg38::Hsapiens"), keep.rownames = T) 
setkeyv(back_gr, c("seqnames", "start"))
back_gr = back_gr[!bid %in% back_gr[width < (5e4-1)]$bid]
back_gr = gr2dt(gr.reduce(dt2gr(back_gr), by = "bid"))
back_gr$bid <- as.factor(back_gr$bid)
back_gr = merge(back_gr, upper.bound, by.x = "seqnames", by.y = "V1", all.x = T, allow.cartesian = T)
back_gr = back_gr[end < V2][start < V2]
back_gr[, overall.cardinality := .N, by = bid]
back_gr = back_gr[overall.cardinality > 1]
back_gr = dt2gr(back_gr)

head(back_gr)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames            ranges strand |      bid        V2
##          <Rle>         <IRanges>  <Rle> | <factor> <integer>
##   [1]     chr8 54856203-55056202      * |   rand_1 145138636
##   [2]     chr8 55106203-55256202      * |   rand_1 145138636
##   [3]     chr8 55606203-55806202      * |   rand_1 145138636
##   [4]     chr8 56706203-56756202      * |   rand_1 145138636
##   [5]     chr8 33893925-33943924      * |  rand_10 145138636
##   [6]     chr8 34243925-34393924      * |  rand_10 145138636
##       overall.cardinality
##                 <integer>
##   [1]                   4
##   [2]                   4
##   [3]                   4
##   [4]                   4
##   [5]                   2
##   [6]                   2
##   -------
##   seqinfo: 1 sequence from an unspecified genome

Similar to the chromunity calls, we now annotate the background bins with concatemer counts and covariates.


set.seed(198)
annotated_back = annotate(binsets = back_gr,
           k = 5,
           concatemers = this_gr_testing,
           covariates = gc_frag_cov, resolution = 5e4,
           mc.cores = 3) 
## Synergy 2022-08-17 20:23:47: Overlapping 3396 bins with 163837 monomers
## Synergy 2022-08-17 20:23:47: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:23:48: Making sub binsets
## Synergy 2022-08-17 20:23:50: Made 105998 sub-binsets
## Synergy 2022-08-17 20:23:50: Counting concatemers across sub-binsets across 3 cores
## Synergy 2022-08-17 20:24:55: Computing min median max distances per setid
## Synergy 2022-08-17 20:25:20: Computing marginal sum per bid
## Synergy 2022-08-17 20:25:20: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:25:24: Merging counts, distance, and width
## Synergy 2022-08-17 20:25:24: Adding covariates
## Synergy 2022-08-17 20:25:24: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:25:25: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:25:34: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:25:34: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:25:36: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:25:44: Merging data from interval covariate frag
## Removing edge cases with no counts
annotated_back = annotated_back[!bid %in% annotated_back[, .(sum(count)), by = bid][V1 == 0]$bid]

head(annotated_back)
##         bid setid count max.dist min.dist  width cardinality sum.counts
## 1: rand_105     7    25   250001   250001 250001           2      10718
## 2: rand_105     8    20   650001   650001 200001           2      10718
## 3: rand_105     9     0   950001   950001 100001           2      10718
## 4: rand_105    10     0  1700001  1700001 100001           2      10718
## 5: rand_105    11     0  2150001  2150001 200001           2      10718
## 6: rand_105    12   137   200001   200001 350001           2      10718
##           gc      frag
## 1: 0.4095764 1168.3814
## 2: 0.4213357 1030.3814
## 3: 0.4145672  500.3814
## 4: 0.3825482  469.3814
## 5: 0.3960412  918.3814
## 6: 0.4217752 1764.3814

Training the first model using background bin-sets and getting expectation

The background bin-sets are then used to populate a data matrix whose rows represent each size-2-to-k subset of each background bin-set. The count of concatemers overlapping each subset is predicted from the covariates by fitting a gamma-Poisson generalized linear model (including the dispersion parameter theta). We use the fit function within the package to do so.

Parameters Description
annotated.binsets data.table of annotated bin-sets with at least the fields $bid (bin-set id), $setid (subset id), count, min.dist, med.dist, max.dist, width, and cardinality, +/- additional covariates; note: every additional column in this data.table must be a covariate
nb logical flag specifying whether to run negative binomial vs. Poisson model
return.model logical flag whether to return the model or the scored input
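
Conceptually, this step is similar in spirit to a negative binomial (gamma-Poisson) regression of subset counts on the covariates. The sketch below uses MASS::glm.nb on the annotated background table purely as an illustration; the formula is an assumption for demonstration, and the model actually used is the one implemented in fit.

library(MASS)

## Illustration only; fit() encapsulates the model Chromunity actually uses
nb_sketch = glm.nb(count ~ log(max.dist) + log(width) + cardinality + gc + frag,
                   data = annotated_back)
summary(nb_sketch)$theta  ## estimated dispersion parameter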

set.seed(198)
back_model = fit(annotated_back)

annotated_chrom = sscore(annotated_chrom, model = back_model) 
head(annotated_chrom)
##      bid setid count max.dist min.dist  width cardinality sum.counts        gc
## 1: 10216     3    24    50001    50001 100001           2       2526 0.4455106
## 2: 10291     4    29   150001   150001 200001           2       6942 0.4274106
## 3: 10291     5     8   350001   350001 100001           2       6942 0.4474706
## 4: 10291     6    28    50001    50001 200001           2       6942 0.4241306
## 5: 10291     7     0   350001    50001 250001           3       6942 0.4301106
## 6: 10524     5     5   200001   200001 100001           2       8724 0.3781506
##        frag count.predicted enrichment      pval
## 1:  510.259      18.5746262  1.2920852 0.2872729
## 2: 1020.259      28.0294544  1.0346259 0.3674657
## 3:  544.259       4.3629933  1.8336035 0.1664706
## 4: 1012.259      66.6446031  0.4201390 0.7119675
## 5: 1288.259       0.3417598  0.0000000 0.3839606
## 6:  450.259       8.7721394  0.5699864 0.6268825

Second model to test for synergy

In the second stage of the Synergy model, we apply the background model to the covariates associated with bin-subsets to obtain expected counts. We use these results to test each individual set for enrichment of counts in high-order vs. pairwise contacts. To do so, we populate a second data matrix comprising observed counts, expected counts, and an indicator variable encoding whether a given set is pairwise (0) or high-order (1). Given this matrix, a second gamma-Poisson generalized linear model is then fit to assess whether the ratio of observed to expected counts is significantly higher among high-order relative to pairwise interactions.
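
As a rough illustration of this second stage (the synergy call below performs the actual per-bin-set test), one could regress observed counts on a high-order indicator with the log expected count as an offset; the exponentiated indicator coefficient is then the pooled observed/expected ratio of high-order relative to pairwise sets. The formula and pooling here are assumptions for demonstration only.

library(MASS)
library(data.table)

## Illustration only; synergy() implements the actual test
dt = copy(annotated_chrom)
dt[, highorder := as.integer(cardinality > 2)]  ## 0 = pairwise, 1 = high-order

nb2_sketch = glm.nb(count ~ highorder + offset(log(count.predicted)),
                    data = dt[count.predicted > 0])
exp(coef(nb2_sketch)["highorder"])  ## observed/expected ratio, high-order vs pairwise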


set.seed(198)
this_synergy = na.omit(synergy(binsets = this_sliding_chrom$binsets,
                annotated.binsets = annotated_chrom, model = back_model))
## Synergy 2022-08-17 20:25:44: Scoring binsets

this_synergy$fdr = signif(p.adjust(this_synergy$p, "BH"), 2)

head(this_synergy[order(p)])
##     bid                   method       p estimate  ci.lower   ci.upper
## 1: 4911 Negative Binomial(1.351) 0.00875 3.661974 1.3876249   9.664031
## 2: 7511 Negative Binomial(1.351) 0.16700 7.128654 0.4393602 115.662972
## 3: 5301 Negative Binomial(1.351) 0.19500 1.357184 0.8553877   2.153349
## 4: 9832 Negative Binomial(1.351) 0.21800 1.981554 0.6678601   5.879308
## 5:  170 Negative Binomial(1.351) 0.25100 5.106705 0.3155017  82.657039
## 6: 4960 Negative Binomial(1.351) 0.26000 1.939704 0.6129328   6.138443
##               effect  fdr
## 1:  3.66 [1.39-9.66] 0.16
## 2:  7.13 [0.439-116] 0.70
## 3: 1.36 [0.855-2.15] 0.70
## 4: 1.98 [0.668-5.88] 0.70
## 5: 5.11 [0.316-82.7] 0.70
## 6: 1.94 [0.613-6.14] 0.70

"Regulatory element" (RE) Chromunity

"Regulatory element" (RE) Chromunity where the genomic region to be queried is defined around regulatory elements such as promoters and enhancers. RE Chromunity is useful for the study reference-distal and interchromosomal cooperativity but requires a hypothesis set of regulatory elements.

Defining targets and running community detection

With this approach, we narrow the analysis down to bins of interest. In this case we use promoters and enhancers, with annotations obtained from the 15-state chromHMM model. We add a pad around the annotations and merge annotations that overlap. Once we have a footprint of the targets, we subset the concatemers overlapping it. Using this reduced concatemer set, we run the same steps as above. Here we demonstrate a chromosome-wide implementation.

The shave algorithm adds another layer of optimization: it iteratively removes bins that overlap fewer than bthresh concatemers and concatemers that hit fewer than cthresh bins, repeating until a fixed point is reached (a minimal sketch follows below).
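
A minimal sketch of this shaving idea (not the package's internal implementation) is shown below; it assumes a data.table dt of monomers with cid and binid columns like those in the concatemer output above.

library(data.table)

shave_sketch = function(dt, bthresh = 3, cthresh = 3) {
  repeat {
    n0 = nrow(dt)
    ## keep bins supported by at least bthresh distinct concatemers
    keep_bins = dt[, uniqueN(cid), by = binid][V1 >= bthresh]$binid
    dt = dt[binid %in% keep_bins]
    ## keep concatemers hitting at least cthresh of the remaining bins
    keep_cids = dt[, uniqueN(binid), by = cid][V1 >= cthresh]$cid
    dt = dt[cid %in% keep_cids]
    if (nrow(dt) == n0) break  ## fixed point reached
  }
  dt
}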

Following are details of various parameters for re_chromunity:

Parameters Description
concatemers GRanges with $cid
resolution bin size for community detection [5e4]
region region to run on [si2gr(concatemers)]
windows GRanges or GRangesList of windows to test; more appropriate for testing specific windows with RE Chromunity, e.g. targets such as promoter/enhancer windows. If the sliding window approach is desired, keep this parameter NULL and use the piecewise argument
shave logical flag specifying whether to iteratively "shave" concatemers and bins to subsets C and B where every bin in B has at least bthresh concatemer support across C and every concatemer in C has order/cardinality of at least cthresh across B; this is done by iteratively removing bins and concatemers until a fixed point is reached. Appropriate for use with RE Chromunity
window.size window size to do community detection within
max.size max size of the problem (concatemers x bins) to consider (default is 2^31-1). If this problem size is hit, concatemers will be subsampled so that concatemers x bins < max.size; this step is done downstream of shaving
k.knn KNN parameter specifying how many nearest neighbors to sample when building KNN graph
peak.thresh peak threshold with which to call a peak
k.min minimal number of nearest neighbors an edge in KNN graph needs to have before community detection
pad integer pad to use when computing the footprint of each chromunity and finding peak regions which become bin-sets

set.seed(198)
## Chr2 single run
## chr2_parq_gr = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/chr2.single.run.sub.rds")
chr2_parq_gr = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/chr2.single.run.sub.rds")))

## getting E-P annotations
chain19to38 = rtracklayer::import.chain("~/DB/UCSC/hg19ToHg38.over.chain")
chmm_19 = import("https://mskilab.s3.amazonaws.com/chromunity/wgEncodeBroadHmmGm12878HMM.bed")
chmm = chmm_19 %>% rtracklayer::liftOver(chain19to38) %>% grl.unlist %Q% (!duplicated(grl.ix))
chmm_ep = dt2gr(gr2dt(chmm)[grepl("Promoter", name) | grepl("Enhancer", name)])

## resolution of the pad
resolution = 2.5e4
tiles = gr.tile(hg_seqlengths(genome = "BSgenome.Hsapiens.UCSC.hg38::Hsapiens"), resolution)

## Targets
this_EP = (chmm_ep %Q% (grepl("Active_Promoter|Strong_Enhancer", name)))
targets_EP = gr.reduce(this_EP+resolution)

## Restrict to concatemers overlapping the target footprint
chr2_parq_gr = chr2_parq_gr %&% targets_EP

## Subsampling concatemers
set.seed(125)
all_concatemers = unique(chr2_parq_gr$read_idx)
training_concatemers = sample(all_concatemers, length(all_concatemers)/2)
testing_concatemers = setdiff(all_concatemers, training_concatemers)

this_gr_training = chr2_parq_gr %Q% (read_idx %in% training_concatemers)
this_gr_testing = chr2_parq_gr %Q% (read_idx %in% testing_concatemers)

##
this_gr_training$cid = this_gr_training$read_idx
this_gr_testing$cid = this_gr_testing$read_idx


## Running Chromunity
this_re_chrom = re_chromunity(concatemers = this_gr_training, windows = targets_EP, piecewise = FALSE, shave = TRUE, resolution = 2.5e4, mc.cores = 5)
## Chromunity 2022-08-17 20:25:58: Generated 40288 bins across 8555 windows
## Chromunity 2022-08-17 20:25:58: Matching concatemers with bins, and bins with windows using gr.match with max.slice 1e+06 and 5 cores
## Chromunity 2022-08-17 20:26:01: Shaving concatemers with bthresh = 3 and cthresh = 3
## Chromunity 2022-08-17 20:26:02: shaving --> Diff: 864155, Old: 1126127, New:261972
## Chromunity 2022-08-17 20:26:02: shaving --> Diff: 97, Old: 261972, New:261875
## Chromunity 2022-08-17 20:26:02: shaving --> Diff: 72, Old: 261875, New:261803
## Chromunity 2022-08-17 20:26:03: Running concatemer communities with 65755 concatemers and 3106 bins
## Chromunity 2022-08-17 20:26:05: Matching reads to tiles
## Chromunity 2022-08-17 20:26:05: Matrices made
## Chromunity 2022-08-17 20:26:05: Making Pairs object
## Chromunity 2022-08-17 20:26:15: Pairs object made
## Chromunity 2022-08-17 20:28:04: Pairs made
## Chromunity 2022-08-17 20:28:07: KNN done
## Chromunity 2022-08-17 20:28:07: Communities made
## Chromunity 2022-08-17 20:28:08: Analyzing gr.sums associated with 207 concatemer communities to generate binsets

Creating covariate object



## Frag counts
## frags = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/Nlaiii.frags.hg38.rds") %>% dt2gr()
frags = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/Nlaiii.frags.hg38.rds"))) %>% dt2gr()

## GC content
## gc5b = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/gc.38.rds")
gc5b = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/gc.38.rds")))


## 
cov_list = list(gc5b, frags)

## Specify what kind of covariate it is. Score will be aggregated over bins while the number of intervals will be calculated otherwise.
names(cov_list) <- c("score:gc.cov", "interval:frag.cov")


gc_frag_cov = covariate(name = c("gc", "frag"), type = c("numeric", "interval"), field = c("score", NA), data = cov_list)

gc_frag_cov
## 2 Covariates with features:
##    name     type   class field signature na.rm pad  log count
## 1:   gc  numeric GRanges score      <NA>    NA   0 TRUE  TRUE
## 2: frag interval GRanges  <NA>      <NA>    NA   0 TRUE  TRUE

Annotation of bin-sets with concatemer counts and covariates


set.seed(198)
annotated_re_chrom = annotate(binsets = this_re_chrom$binsets,
           k = 3,
           concatemers = this_gr_testing,
           covariates = gc_frag_cov, resolution = 2.5e4,
           mc.cores = 5) 
## Synergy 2022-08-17 20:28:33: Overlapping 520 bins with 1126124 monomers
## Synergy 2022-08-17 20:28:33: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:28:33: Making sub binsets
## Synergy 2022-08-17 20:28:34: Made 50472 sub-binsets
## Synergy 2022-08-17 20:28:34: Counting concatemers across sub-binsets across 5 cores
## Synergy 2022-08-17 20:29:15: Computing min median max distances per setid
## Synergy 2022-08-17 20:29:23: Computing marginal sum per bid
## Synergy 2022-08-17 20:29:23: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:29:25: Merging counts, distance, and width
## Synergy 2022-08-17 20:29:25: Adding covariates
## Synergy 2022-08-17 20:29:25: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:29:26: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:29:26: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:29:26: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:29:27: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:29:28: Merging data from interval covariate frag

head(annotated_re_chrom)
##    bid setid count max.dist min.dist width cardinality sum.counts        gc
## 1:   1    27    11    68601    68601 75001           2      18638 0.3964695
## 2:   1    28     8    97001    97001 75001           2      18638 0.3884678
## 3:   1    29     1   172001   172001 63401           2      18638 0.3812581
## 4:   1    30     4   255801   255801 75001           2      18638 0.3770913
## 5:   1    31     7   405801   405801 75001           2      18638 0.4029149
## 6:   1    32     6   553001   553001 75001           2      18638 0.3968287
##        frag
## 1: 297.2067
## 2: 304.2067
## 3: 256.2067
## 4: 323.2067
## 5: 320.2067
## 6: 345.2067

Generating background bin-sets based on called communities

This function differs from the sliding window implementation in that it considers interchromosomal chromunities. Following are the parameters for re_background:

Parameters Description
binsets GRanges of bins with fields seqnames, start, end, and $bid specifying bin-set id
resolution The resolution to use for distance simulation
n Integer scalar specifying number of sets to sample (default = nrow(binsets))
interchromosomal.dist integer inferred average inter-chr distance to be used in case of whole genome simulation
interchromosomal.table data.table optional table that contains specific chr-chr inferred distances
gg gGnome object to be used for simulating bin-sets in the presence of structural variation

set.seed(198)
back_re_gr = gr2dt(dt2gr(re_background(binsets = this_re_chrom$binsets, n = 1000,
                                 resolution = 2.5e4)))
## Synergy 2022-08-17 20:29:28: Making kernels
## Synergy 2022-08-17 20:29:28: Making transition matrices
## Synergy 2022-08-17 20:29:28: Generating random sets

## Adding a few filters to remove outlier simulations; these can be customized
## Removing bins smaller than the resolution and bins lying out of bounds

## Getting seqlengths of each chromosome
upper.bound = as.data.table(hg_seqlengths(genome = "BSgenome.Hsapiens.UCSC.hg38::Hsapiens"), keep.rownames = T) 
setkeyv(back_re_gr, c("seqnames", "start"))
back_re_gr = back_re_gr[!bid %in% back_re_gr[width < (2.5e4-1)]$bid]
back_re_gr = gr2dt(gr.reduce(dt2gr(back_re_gr), by = "bid"))
back_re_gr$bid <- as.factor(back_re_gr$bid)
back_re_gr = merge(back_re_gr, upper.bound, by.x = "seqnames", by.y = "V1", all.x = T, allow.cartesian = T)
back_re_gr = back_re_gr[end < V2][start < V2]
back_re_gr[, overall.cardinality := .N, by = bid]
back_re_gr = back_re_gr[overall.cardinality > 1]
back_re_gr = dt2gr(back_re_gr)

head(back_re_gr)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames              ranges strand |      bid        V2
##          <Rle>           <IRanges>  <Rle> | <factor> <integer>
##   [1]     chr2   48060335-48110335      * |        1 242193529
##   [2]     chr2   48135335-48210335      * |        1 242193529
##   [3]     chr2 122979392-123004392      * |        2 242193529
##   [4]     chr2 123129392-123204392      * |        2 242193529
##   [5]     chr2 123404392-123454392      * |        2 242193529
##   [6]     chr2 123629392-123704392      * |        2 242193529
##       overall.cardinality
##                 <integer>
##   [1]                   2
##   [2]                   2
##   [3]                   4
##   [4]                   4
##   [5]                   4
##   [6]                   4
##   -------
##   seqinfo: 1 sequence from an unspecified genome

Similar to the chromunity calls, we now annotate the background bins with concatemer counts and covariates.


set.seed(198)
annotated_re_back = annotate(binsets = back_re_gr,
           k = 3,
           concatemers = this_gr_testing,
           covariates = gc_frag_cov, resolution = 2.5e4,
           mc.cores = 5) 
## Synergy 2022-08-17 20:29:35: Overlapping 1821 bins with 1126124 monomers
## Synergy 2022-08-17 20:29:36: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:29:36: Making sub binsets
## Synergy 2022-08-17 20:29:38: Made 65964 sub-binsets
## Synergy 2022-08-17 20:29:38: Counting concatemers across sub-binsets across 5 cores
## Synergy 2022-08-17 20:30:01: Computing min median max distances per setid
## Synergy 2022-08-17 20:30:19: Computing marginal sum per bid
## Synergy 2022-08-17 20:30:19: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:30:22: Merging counts, distance, and width
## Synergy 2022-08-17 20:30:22: Adding covariates
## Synergy 2022-08-17 20:30:22: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:30:23: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:30:24: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:30:24: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:30:25: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:30:27: Merging data from interval covariate frag

## Removing edge cases with no counts
annotated_re_back = annotated_re_back[!bid %in% annotated_re_back[, .(sum(count)), by = bid][V1 == 0]$bid]

head(annotated_re_back)
##    bid setid count max.dist min.dist  width cardinality sum.counts        gc
## 1:   1     3     5    25000    25000 125003           2       1055 0.4036367
## 2: 103     3    21    50000    50000  75003           2       1207 0.4127890
## 3: 115     4     0    75000    75000 200003           2       2180 0.4003545
## 4: 115     5     0   700000   700000  75003           2       2180 0.4548158
## 5: 115     6    12   450000   450000 225003           2       2180 0.4015623
## 6: 115     7     0   700000    75000 250004           3       2180 0.4090673
##        frag
## 1:  617.173
## 2:  333.173
## 3:  873.173
## 4:  418.173
## 5: 1027.173
## 6: 1159.173

Training the first model using background bin-sets and getting expectation


set.seed(198)
back_re_model = fit(annotated_re_back)

annotated_re_chrom = sscore(annotated_re_chrom, model = back_re_model) 
head(annotated_re_chrom)
##    bid setid count max.dist min.dist width cardinality sum.counts        gc
## 1:   1    27    11    68601    68601 75001           2      18638 0.3964695
## 2:   1    28     8    97001    97001 75001           2      18638 0.3884678
## 3:   1    29     1   172001   172001 63401           2      18638 0.3812581
## 4:   1    30     4   255801   255801 75001           2      18638 0.3770913
## 5:   1    31     7   405801   405801 75001           2      18638 0.4029149
## 6:   1    32     6   553001   553001 75001           2      18638 0.3968287
##        frag count.predicted enrichment       pval
## 1: 297.2067      11.6107569  0.9473973 0.22530725
## 2: 304.2067       6.8392673  1.1697159 0.19421392
## 3: 256.2067       2.6380810  0.3790634 0.32268332
## 4: 323.2067       1.7950637  2.2283332 0.12783515
## 5: 320.2067       1.4569959  4.8044061 0.05873120
## 6: 345.2067       0.9119446  6.5793473 0.04661607

Second model to test for synergy


set.seed(198)
this_synergy = na.omit(synergy(binsets = this_re_chrom$binsets,
                annotated.binsets = annotated_re_chrom, model = back_re_model))
## Synergy 2022-08-17 20:30:29: Scoring binsets

this_synergy$fdr = signif(p.adjust(this_synergy$p, "BH"), 2)

head(this_synergy[order(p)])
##    bid                    method        p estimate  ci.lower  ci.upper
## 1:   3 Negative Binomial(0.1597) 0.000345 2.877102 1.6128597  5.132322
## 2:  18 Negative Binomial(0.1597) 0.057200 4.202671 0.9571404 18.453346
## 3:  26 Negative Binomial(0.1597) 0.059900 4.368017 0.9401296 20.294621
## 4:  10 Negative Binomial(0.1597) 0.063400 4.033579 0.9252857 17.583494
## 5:   7 Negative Binomial(0.1597) 0.069600 4.925413 0.8798582 27.572281
## 6:  14 Negative Binomial(0.1597) 0.074700 8.708385 0.8059100 94.099792
##               effect   fdr
## 1:  2.88 [1.61-5.13] 0.023
## 2:  4.2 [0.957-18.5] 0.780
## 3:  4.37 [0.94-20.3] 0.780
## 4: 4.03 [0.925-17.6] 0.780
## 5:  4.93 [0.88-27.6] 0.780
## 6: 8.71 [0.806-94.1] 0.780

Points to note and potential issues

  1. When using RE Chromunity, depending on the number of concatemers and the bin size, the problem may be too large to fit in memory. In such cases it is advisable to use higher values for the shave thresholds (bthresh and cthresh).

  2. It is important to make sure that sliding window Chromunity is run on individual chromosomes. A for loop can be used to run each chromosome and save the outputs before moving to the next chromosome (see the sketch after this list).

  3. If counts are too sparse at a higher resolution, it is advisable to rerun the analyses with larger bin sizes.

  4. An empty output from Chromunity may be indicative of sparseness in the data. Apart from changing the resolution as noted above, the KNN parameters can be increased to relax the number of neighbors considered.
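
For point 2, a minimal sketch of such a loop is shown below; the output directory and chromosome list are assumptions to adapt to your data, and this_gr is assumed to already carry $cid as set earlier in the tutorial.

out_dir = "sliding_window_results"  ## assumed output directory
dir.create(out_dir, showWarnings = FALSE)
chromosomes = paste0("chr", c(1:22, "X"))

for (this_chr in chromosomes) {
  this_chrom_res = sliding_window_chromunity(concatemers = this_gr,
                                             resolution = 5e4,
                                             window.size = 2e6,
                                             chr = this_chr,
                                             mc.cores = 1)
  ## Save per-chromosome output before moving on
  saveRDS(this_chrom_res, file.path(out_dir, paste0(this_chr, ".chromunity.rds")))
}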