The Chromunity R package enables the nomination and statistical evaluation of high order interactions among genomic loci using Pore-C data. The basic unit of Pore-C data is called a concatemer
that encodes the multiway interactions of multiple monomers
. Chromunity can be applied to de novo detection of high order interacting loci using the graph-based community detection of concatemers. Chromunity also provides a platform to test the enrichment of high order interactions as compared the pairwise background (a phenomenon we call "synergy"). The Chromunity package is integrated with GenomicRanges
and data.table
packages, allowing easy incorporation into bioinformatics workflows that employ the R/Bioconductor
ecosystem.
For installation instructions, please visit the Chromunity github page. For background, it may help to have some familiarity with data.table
, GenomicRanges
, and gUtils
packages.
If you use Chromunity in your work, please cite: Deshpande et.al. Nature Biotechnology 2022.
To understand the nature of the run, following are some metrics to assess the quality of the data.
basecall/{enzyme}_{run_id}.rd.summary.csv
. This file gives aggregate stats and you can retrieve the yield in Gb and N50 of read length from this file.contacts/{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.contacts.parquet
. This file is equivalent to pairs file so the number of rows in this file should indicate the number of pairwise contacts. If the run was divided into chunks, you can write a loop to gather number of rows from each file and aggregate them.(pairwise contacts in Million)/(Yield in Gb)
. This is a measure of the contact density of the raw reads.Empirically, the values as follows are indicative of a successful run.
Metric | Value |
---|---|
Yield | 30 to 100 Gb |
N50 | 3-6kb |
Number of Pairwise Contacts | 200 - 500 M |
The Pairwise Density | 3 to 6M/Gb |
Here we demonstrate the Chromunity workflow from making a GRange object to calling communities and testing them for synergy. The first step in the workflow is the conversion of parquet files to GRanges for downstream analyses. To read more about Parquet format and for alternate processing of files visit the Apache page. This step needs R implementation of Apache arrow. In order to install apache arrow, visit their page for arrow implementation in R and installation instructions.
The Pore-C snakemake creates parquet files in chunks. This function allows to compile them in one GRange.
Note
If using some other way to concatenate parquet files, make sure to re-assign index under read_idx
column such that they are uniquely matched to read_name
. Additionally, if there are more than one runs for the same sample, apply similar procedure after merging. For the example parquet file, please download the folder from here and replace the path below with that folder path locally.
library(rtracklayer)
library(skitools)
library(chromunity)
library(MASS)
## This is the path to directory with all chunks
example_dir1 = "/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs"
## Generating GRanges
this_gr = parquet2gr(example_dir1, mc.cores = 2)
## Removing low quality monomers, using gUtils functions
this_gr = this_gr %Q% (pass_filter == TRUE)
head(this_gr)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr9_KI270717v1_random 2283-3462 * |
## [2] chr9 63403784-63404412 * |
## [3] chr9 63388544-63389031 * |
## [4] chr3 44088326-44090787 * |
## [5] chr3 45455924-45456924 * |
## [6] chr3 45442793-45443532 * |
## read_name pass_filter read_idx
## <character> <logical> <integer>
## [1] bcb51469-f2e9-41df-b4e2-fdd9eb63b039 TRUE 1
## [2] bcb51469-f2e9-41df-b4e2-fdd9eb63b039 TRUE 1
## [3] bcb51469-f2e9-41df-b4e2-fdd9eb63b039 TRUE 1
## [4] e754f6cc-804a-4d14-b46c-a652723d9524 TRUE 2
## [5] e754f6cc-804a-4d14-b46c-a652723d9524 TRUE 2
## [6] e754f6cc-804a-4d14-b46c-a652723d9524 TRUE 2
## -------
## seqinfo: 118 sequences from an unspecified genome
The data for the paper is uploaded in CSV format to GEO. One file for each run, per sample. This function allows for readig all the files and generate a single GRanges file with a unique read_idx
to each concatemer. Please substitute the directory path in the following code with the path with all the CSV files.
## This is the path to directory with all chunks
example_dir1 = "/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/csv_file"
## Generating GRanges
this_gr = csv2gr(example_dir1, mc.cores = 2)
## Removing low quality monomers, using gUtils functions
this_gr = this_gr %Q% (pass_filter == TRUE)
head(this_gr)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr11 74929590-74929845 * |
## [2] chr11 73908636-73908936 * |
## [3] chr11 102426931-102427103 * |
## [4] chrX 18488003-18488611 * |
## [5] chr10 16354569-16354834 * |
## [6] chr8 47609593-47609775 * |
## read_name pass_filter read_idx
## <character> <logical> <integer>
## [1] c7631d1d-6eeb-402e-8f53-d084692ed451 TRUE 1
## [2] c7631d1d-6eeb-402e-8f53-d084692ed451 TRUE 1
## [3] c7631d1d-6eeb-402e-8f53-d084692ed451 TRUE 1
## [4] 99c09b00-f7a2-4074-8dbf-c0424de01533 TRUE 2
## [5] 99c09b00-f7a2-4074-8dbf-c0424de01533 TRUE 2
## [6] 99c09b00-f7a2-4074-8dbf-c0424de01533 TRUE 2
## -------
## seqinfo: 147 sequences from an unspecified genome
To enable de novo discovery of high order interacting bin-sets we developed an algorithm for community detection of concatemers. Since concatemer community detection involves building a massive graph, computational complexity can be limiting for the analysis of very large concatemer datasets. This may limit the portion of the dataset that can be used for community detection and hence limit the power for de novo high order bin-set discovery. We implemented two optimization strategies to overcome this problem.
"Sliding window" Chromunity where the genome is defined around (10 kb, 25 kb, and/or 50 kb) tiles in each 2Mb width, 1Mb stride sliding window to study regional high order structure. The benefit of this approach is that it enables a relatively unbiased discovery of high order structure or cooperativity (i.e. without needing to know specific annotations); however, it does not scale to the analysis of reference-distal or interchromosomal cooperativity.
Following are details of various parameters for sliding_window_chromunity
:
Parameters | Description |
---|---|
concatemers | GRanges with $cid |
resolution | bin size for community detection [5e4] |
region | region to run on [si2gr(concatemers)] |
windows | GRanges or GRangesList of windows to test, more appropriate for testing specific windows. To be used with RE chromunity, e.g. targets such as promoters/enhancers windows. If sliding window approach is desired, keep this parameter as NULL and use piecewise argument |
window.size | window size to do community detection within |
max.size | max size of problem (concatemers x bins) to consider (default is 2^31-1). If we hit this problem size, will subsample concatemers so that concatemers * bins is < max.dim, this step is done downstream of shaving |
k.knn | KNN parameter specifying how many nearest neighbors to sample when building KNN graph |
peak.thresh | peak threshold with which to call a peak |
take_sub_sample | take subsample to be used for training |
subsample.frac | proprtion of concatemers to subsample |
k.min | minimal number of nearest neighbors an edge in KNN graph needs to have before community detection |
chr | chromosome to run on |
pad | integer pad to use when computing the footprint of each chromunity and finding peak regions which become bin-sets |
## First using sliding window. Run separately for each chromosome
## This is demo GRange
## this_gr = readRDS( "/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/chr8.single.run.sub.rds")
this_gr = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/chr8.single.run.sub.rds")))
##
this_gr$cid = this_gr$read_idx
##
this_sliding_chrom = sliding_window_chromunity(concatemers = this_gr, resolution = 5e4, window.size = 2e6, take_sub_sample = TRUE, chr = "chr8", subsample.frac = 0.5, mc.cores = 1)
## Chromunity 2022-08-17 20:21:31: Generated 61776 bins across 146 windows
## Chromunity 2022-08-17 20:21:32: Taking sub-sample with fraction 0.5
## Chromunity 2022-08-17 20:22:31: Analyzing gr.sums associated with 37 concatemer communities to generate binsets
The outputs of this_sliding_chrom
function is a Chromunity object with a concatemer
GRanges where each row is a monomer indexed by read_id
as well as community id or chid
. The bin-sets are created and stored in bin-sets
GRanges object that maps community id to bin-id or bid
.
## Outputs
## Concatemers GRanges
head(this_sliding_chrom$concatemers)
## GRanges object with 6 ranges and 8 metadata columns:
## seqnames ranges strand | cid read_idx binid
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## [1] chr8 125390785-125391227 * | 24515 24515 30338
## [2] chr8 125114129-125114349 * | 24515 24515 30333
## [3] chr8 125346026-125346181 * | 24515 24515 30337
## [4] chr8 125318791-125319499 * | 24986 24986 30337
## [5] chr8 125370091-125370333 * | 24986 24986 30338
## [6] chr8 124860528-124860658 * | 24986 24986 30328
## tix count chid support winid
## <numeric> <integer> <numeric> <integer> <integer>
## [1] 38 3 1 1 125
## [2] 33 3 1 1 125
## [3] 37 3 1 1 125
## [4] 37 3 2 1 125
## [5] 38 3 2 1 125
## [6] 28 3 2 1 125
## -------
## seqinfo: 1 sequence from an unspecified genome
## Bin-Sets GRanges
head(this_sliding_chrom$binsets)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | chid bid winid
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer>
## [1] chr8 124750001-125000000 * | 17 17 125
## [2] chr8 124850001-125000000 * | 21 21 125
## [3] chr8 124750001-124800000 * | 42 42 125
## [4] chr8 125100001-125250000 * | 42 42 125
## [5] chr8 125300001-125350000 * | 42 42 125
## [6] chr8 125400001-125450000 * | 42 42 125
## -------
## seqinfo: 1 sequence from an unspecified genome
Once you have communities called, next step is to test them for synergy. We define synergy as significant enrichment of high order interactions as compared to pairwise background. This involves a few steps. If a set of covariates are desired to be used, it is necessary to create a covariate object. In this demonstration we will use GC content and number of fragments as covariates. We load these data in as GRanges objects using functions in data.table
, rtracklayer
, and gUtils
packages (however you are free to use any GRanges import utility of your choice). See out fishHook
tutorial to gain more intuition about this implementation
## Frag counts
## frags = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/Nlaiii.frags.hg38.rds") %>% dt2gr()
frags = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/Nlaiii.frags.hg38.rds"))) %>% dt2gr()
## GC content
## gc5b = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/gc.38.rds")
gc5b = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/gc.38.rds")))
##
cov_list = list(gc5b, frags)
## Specify what kind of covariate it is. Score will be aggregated over bins while the number of intervals will be calculated otherwise.
names(cov_list) <- c("score:gc.cov", "interval:frag.cov")
gc_frag_cov = covariate(name = c("gc", "frag"), type = c("numeric", "interval"), field = c("score", NA), data = cov_list)
gc_frag_cov
## 2 Covariates with features:
## name type class field signature na.rm pad log count
## 1: gc numeric GRanges score <NA> NA 0 TRUE TRUE
## 2: frag interval GRanges <NA> <NA> NA 0 TRUE TRUE
This covariate object will be used to assign the respective values to each bin in bin-sets.
Next step is to aggregate the count of concatemers interacting with a given bin along with all the other features of the genome to be used as covariates. This is done using the annotate
function. In short, given n bin-sets annotates their k-power set (i.e. all sets in the power set with cardinality <= k) with basic (min median max distance, cardinality, width) and optional numerical and interval covariates provided as input. Following are the details of the parameters of this function.
Parameters | Description |
---|---|
bin-sets | GRanges of bins with fields bid specifying bin-set id |
concatemers | GRanges of monomers with fields seqnames, start, end, and cid specifying concatemer id, which will be counted across each bin-set |
covariates | Covariate object specifying numeric or interval covariates to merge with this bin-set k the max cardinality of sets to annotate from the power set of each binmer |
gg | optional gGraph input specifying alternate distance function |
interchromosomal.dist | numeric scalar of "effective" distance for inter chromosomal bins [1e8] |
## concatemers used for testing
this.chrom = gr2dt(this_sliding_chrom$concatemers)
## Concatemers for testing
this_gr_testing = dt2gr(gr2dt(this_gr)[!read_idx %in% unique(this.chrom$read_idx)])
annotated_chrom = annotate(binsets = this_sliding_chrom$binsets,
k = 5,
concatemers = this_gr_testing,
covariates = gc_frag_cov, resolution = 5e4,
mc.cores = 3)
## Synergy 2022-08-17 20:22:56: Overlapping 104 bins with 163837 monomers
## Synergy 2022-08-17 20:22:57: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:22:57: Making sub binsets
## Synergy 2022-08-17 20:22:57: Made 3525 sub-binsets
## Synergy 2022-08-17 20:22:57: Counting concatemers across sub-binsets across 3 cores
## Synergy 2022-08-17 20:23:23: Computing min median max distances per setid
## Synergy 2022-08-17 20:23:25: Computing marginal sum per bid
## Synergy 2022-08-17 20:23:25: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:23:25: Merging counts, distance, and width
## Synergy 2022-08-17 20:23:26: Adding covariates
## Synergy 2022-08-17 20:23:26: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:23:26: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:23:26: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:23:26: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:23:27: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:23:27: Merging data from interval covariate frag
head(annotated_chrom)
## bid setid count max.dist min.dist width cardinality sum.counts gc
## 1: 10216 3 24 50001 50001 100001 2 2526 0.4455106
## 2: 10291 4 29 150001 150001 200001 2 6942 0.4274106
## 3: 10291 5 8 350001 350001 100001 2 6942 0.4474706
## 4: 10291 6 28 50001 50001 200001 2 6942 0.4241306
## 5: 10291 7 0 350001 50001 250001 3 6942 0.4301106
## 6: 10524 5 5 200001 200001 100001 2 8724 0.3781506
## frag
## 1: 510.259
## 2: 1020.259
## 3: 544.259
## 4: 1012.259
## 5: 1288.259
## 6: 450.259
Next, we construct a data matrix to train the Synergy background model. To provide examples for model training, we generate a large collection of random bin-sets based on bin-sets called from chromunity
function, such that the empirical distribution of the core covariates of subsets matches the empirical distribution of core covariates associated with called bin-sets. This is achieved by the SampleRandombin-sets
procedure (described in methods), which randomly samples sets of intervals that match the input collection with respect to their cardinality, interval widths, bin-to-bin distances, and interchromosomal jumps. The output of this algorithm is the bin-set collection, which is used for model training.
Following are the parameters for the sliding_window_background
function:
Parameters | Description |
---|---|
chromosome | chrtomosome the chromosome to work on |
binsets | GRanges of bins with fields seqnames, start, end, and $bid specifying bin-set id |
resolution | The resolution to use for distance simulation |
n | Integer scalar specifying number of sets to sample (default = nrow(bin-sets)) |
genome.to.use | genome build to use for the simulation [hg38] |
set.seed(198)
back_gr = sliding_window_background(chromosome= "chr8", binsets = this_sliding_chrom$binsets, n = 1000,
resolution = 5e4)
## Generating distributions
## Generating GRanges
back_gr[, V1 := NULL]
back_gr = na.omit(back_gr)
## Adding few filters to remove outlier simulation, can be customized
## Removing bins less than resolution and lying out of bounds
## Getting seqlengths of each chromosome
upper.bound = as.data.table(hg_seqlengths(genome = "BSgenome.Hsapiens.UCSC.hg38::Hsapiens"), keep.rownames = T)
setkeyv(back_gr, c("seqnames", "start"))
back_gr = back_gr[!bid %in% back_gr[width < (5e4-1)]$bid]
back_gr = gr2dt(gr.reduce(dt2gr(back_gr), by = "bid"))
back_gr$bid <- as.factor(back_gr$bid)
back_gr = merge(back_gr, upper.bound, by.x = "seqnames", by.y = "V1", all.x = T, allow.cartesian = T)
back_gr = back_gr[end < V2][start < V2]
back_gr[, overall.cardinality := .N, by = bid]
back_gr = back_gr[overall.cardinality > 1]
back_gr = dt2gr(back_gr)
head(back_gr)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | bid V2
## <Rle> <IRanges> <Rle> | <factor> <integer>
## [1] chr8 54856203-55056202 * | rand_1 145138636
## [2] chr8 55106203-55256202 * | rand_1 145138636
## [3] chr8 55606203-55806202 * | rand_1 145138636
## [4] chr8 56706203-56756202 * | rand_1 145138636
## [5] chr8 33893925-33943924 * | rand_10 145138636
## [6] chr8 34243925-34393924 * | rand_10 145138636
## overall.cardinality
## <integer>
## [1] 4
## [2] 4
## [3] 4
## [4] 4
## [5] 2
## [6] 2
## -------
## seqinfo: 1 sequence from an unspecified genome
Similar to the chromunity calls, we now annotate the background bins with concatemer counts and covariates.
set.seed(198)
annotated_back = annotate(binsets = back_gr,
k = 5,
concatemers = this_gr_testing,
covariates = gc_frag_cov, resolution = 5e4,
mc.cores = 3)
## Synergy 2022-08-17 20:23:47: Overlapping 3396 bins with 163837 monomers
## Synergy 2022-08-17 20:23:47: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:23:48: Making sub binsets
## Synergy 2022-08-17 20:23:50: Made 105998 sub-binsets
## Synergy 2022-08-17 20:23:50: Counting concatemers across sub-binsets across 3 cores
## Synergy 2022-08-17 20:24:55: Computing min median max distances per setid
## Synergy 2022-08-17 20:25:20: Computing marginal sum per bid
## Synergy 2022-08-17 20:25:20: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:25:24: Merging counts, distance, and width
## Synergy 2022-08-17 20:25:24: Adding covariates
## Synergy 2022-08-17 20:25:24: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:25:25: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:25:34: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:25:34: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:25:36: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:25:44: Merging data from interval covariate frag
## Removing edge cases with no counts
annotated_back = annotated_back[!bid %in% annotated_back[, .(sum(count)), by = bid][V1 == 0]$bid]
head(annotated_back)
## bid setid count max.dist min.dist width cardinality sum.counts
## 1: rand_105 7 25 250001 250001 250001 2 10718
## 2: rand_105 8 20 650001 650001 200001 2 10718
## 3: rand_105 9 0 950001 950001 100001 2 10718
## 4: rand_105 10 0 1700001 1700001 100001 2 10718
## 5: rand_105 11 0 2150001 2150001 200001 2 10718
## 6: rand_105 12 137 200001 200001 350001 2 10718
## gc frag
## 1: 0.4095764 1168.3814
## 2: 0.4213357 1030.3814
## 3: 0.4145672 500.3814
## 4: 0.3825482 469.3814
## 5: 0.3960412 918.3814
## 6: 0.4217752 1764.3814
Background bin-sets is then used to populate a data matrix whose rows represents each size 2 to k
subset of each background bin-set. The count of concatemers overlapping each subset is predicted from the covariates by fitting a gamma-Poisson generalized linear model with parameters (including the dispersion parameter theta
). We use the fit
function within the package to do so.
Parameters | Description |
---|---|
annotated.binsets | data.table of annotated binsets with at least the fields \(bid (binset id) \)setid (subset id) count min.dist med.dist max.dist width cardinality +/- additional covariates as well, note: every column in this data.table must be a covariate |
nb | logical flag specifying whether to run negative binomial vs. Poisson model |
return.model | logical flag whether to return the model or the scored input |
set.seed(198)
back_model = fit(annotated_back)
annotated_chrom = sscore(annotated_chrom, model = back_model)
head(annotated_chrom)
## bid setid count max.dist min.dist width cardinality sum.counts gc
## 1: 10216 3 24 50001 50001 100001 2 2526 0.4455106
## 2: 10291 4 29 150001 150001 200001 2 6942 0.4274106
## 3: 10291 5 8 350001 350001 100001 2 6942 0.4474706
## 4: 10291 6 28 50001 50001 200001 2 6942 0.4241306
## 5: 10291 7 0 350001 50001 250001 3 6942 0.4301106
## 6: 10524 5 5 200001 200001 100001 2 8724 0.3781506
## frag count.predicted enrichment pval
## 1: 510.259 18.5746262 1.2920852 0.2872729
## 2: 1020.259 28.0294544 1.0346259 0.3674657
## 3: 544.259 4.3629933 1.8336035 0.1664706
## 4: 1012.259 66.6446031 0.4201390 0.7119675
## 5: 1288.259 0.3417598 0.0000000 0.3839606
## 6: 450.259 8.7721394 0.5699864 0.6268825
In the second stage of the Synergy model, we apply the background model to the covariates associated with bin-subsets to obtain expected counts. We use these results to test each individual set for enrichment of counts in high-order vs. pairwise contacts. To do so, we populate a second data matrix comprising observed counts, expected counts, and an indicator variable encoding whether a given set is pairwise (0) or high-order (1). Given this matrix, a second gamma-Poisson generalized linear model is then fit to assess whether the ratio of observed to expected counts is significantly higher among high-order relative to pairwise interactions.
set.seed(198)
this_synergy = na.omit(synergy(binsets = this_sliding_chrom$binsets,
annotated.binsets = annotated_chrom, model = back_model))
## Synergy 2022-08-17 20:25:44: Scoring binsets
this_synergy$fdr = signif(p.adjust(this_synergy$p, "BH"), 2)
head(this_synergy[order(p)])
## bid method p estimate ci.lower ci.upper
## 1: 4911 Negative Binomial(1.351) 0.00875 3.661974 1.3876249 9.664031
## 2: 7511 Negative Binomial(1.351) 0.16700 7.128654 0.4393602 115.662972
## 3: 5301 Negative Binomial(1.351) 0.19500 1.357184 0.8553877 2.153349
## 4: 9832 Negative Binomial(1.351) 0.21800 1.981554 0.6678601 5.879308
## 5: 170 Negative Binomial(1.351) 0.25100 5.106705 0.3155017 82.657039
## 6: 4960 Negative Binomial(1.351) 0.26000 1.939704 0.6129328 6.138443
## effect fdr
## 1: 3.66 [1.39-9.66] 0.16
## 2: 7.13 [0.439-116] 0.70
## 3: 1.36 [0.855-2.15] 0.70
## 4: 1.98 [0.668-5.88] 0.70
## 5: 5.11 [0.316-82.7] 0.70
## 6: 1.94 [0.613-6.14] 0.70
"Regulatory element" (RE) Chromunity where the genomic region to be queried is defined around regulatory elements such as promoters and enhancers. RE Chromunity is useful for the study reference-distal and interchromosomal cooperativity but requires a hypothesis set of regulatory elements.
With this approach, we are trying to narrow down bins of interests. In this case we are going to use promoters and enhancers, annotations of which are obtained from chromHMM with 15 states. We add a pad around the annotations and merge the annotations together if they overlap. Once we have a footprint of the targets, we subsample the concatemers overlapping it. Using this reduced concatemer set, we run the steps as above. Here we demonstrate chromosome wide implementation.
The shave algorithm is another layer of optimization. It removes bins that overlap fewer than \(b\) concatemers and remove concatemers that hit \(c\). Repeat until a fixed point is reached.
Following are details of various parameters for re_chromunity
:
Parameters | Description |
---|---|
concatemers | GRanges with $cid |
resolution | bin size for community detection [5e4] |
region | region to run on [si2gr(concatemers)] |
windows | GRanges or GRangesList of windows to test, more appropriate for testing specific windows. To be used with RE chromunity, e.g. targets such as promoters/enhancers windows. If sliding window approach is desired, keep this parameter as NULL and use piecewise argument |
shave | logical flag specifying whether to iteratively "shave" concatemers and bins to a subset C and B where every bin in B has at least bthresh concatemer support across C and every concatemer in C has order / cardinality of at least cthres across B, this is done by iteratively removing bins and concatemers until you reach a fixed point. Appropriate to be used with RE chromunity |
window.size | window size to do community detection within |
max.size | max size of problem (concatemers x bins) to consider (default is 2^31-1). If we hit this problem size, will subsample concatemers so that concatemers * bins is < max.dim, this step is done downstream of shaving |
k.knn | KNN parameter specifying how many nearest neighbors to sample when building KNN graph |
peak.thresh | peak threshold with which to call a peak |
k.min | minimal number of nearest neighbors an edge in KNN graph needs to have before community detection |
pad | integer pad to use when computing the footprint of each chromunity and finding peak regions which become bin-sets |
set.seed(198)
## Chr2 single run
## chr2_parq_gr = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/chr2.single.run.sub.rds")
chr2_parq_gr = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/chr2.single.run.sub.rds")))
## getting E-P annotations
chain19to38 = rtracklayer::import.chain("~/DB/UCSC/hg19ToHg38.over.chain")
chmm_19 = import("https://mskilab.s3.amazonaws.com/chromunity/wgEncodeBroadHmmGm12878HMM.bed")
chmm = chmm_19 %>% rtracklayer::liftOver(chain19to38)%>% grl.unlist %Q% (!duplicated(grl.ix))
chmm_ep = dt2gr(gr2dt(chmm)[grepl("Promoter", name) | grepl("Enhancer", name)])
## resolution of the pad
resolution = 2.5e4
tiles = gr.tile(hg_seqlengths(genome = "BSgenome.Hsapiens.UCSC.hg38::Hsapiens"), resolution)
## Targets
this_EP = (chmm_ep %Q% (grepl("Active_Promoter|Strong_Enhancer", name)))
targets_EP = gr.reduce(this_EP+resolution)
## Subsample
chr2_parq_gr = chr2_parq_gr %&% targets_EP
## Subsampling concatemers
set.seed(125)
all_concatemers = unique(chr2_parq_gr$read_idx)
training_concatemers = sample(all_concatemers, length(all_concatemers)/2)
testing_concatemers = setdiff(all_concatemers, training_concatemers)
this_gr_training = chr2_parq_gr %Q% (read_idx %in% training_concatemers)
this_gr_testing = chr2_parq_gr %Q% (read_idx %in% testing_concatemers)
##
this_gr_training$cid = this_gr_training$read_idx
this_gr_testing$cid = this_gr_testing$read_idx
## Running Chromunity
this_re_chrom = re_chromunity(concatemers = this_gr_training, windows = targets_EP, piecewise = FALSE, shave = TRUE, resolution = 2.5e4, mc.cores = 5)
## Chromunity 2022-08-17 20:25:58: Generated 40288 bins across 8555 windows
## Chromunity 2022-08-17 20:25:58: Matching concatemers with bins, and bins with windows using gr.match with max.slice 1e+06 and 5 cores
## Chromunity 2022-08-17 20:26:01: Shaving concatemers with bthresh = 3 and cthresh = 3
## Chromunity 2022-08-17 20:26:02: shaving --> Diff: 864155, Old: 1126127, New:261972
## Chromunity 2022-08-17 20:26:02: shaving --> Diff: 97, Old: 261972, New:261875
## Chromunity 2022-08-17 20:26:02: shaving --> Diff: 72, Old: 261875, New:261803
## Chromunity 2022-08-17 20:26:03: Running concatemer communities with 65755 concatemers and 3106 bins
## Chromunity 2022-08-17 20:26:05: Matching reads to tiles
## Chromunity 2022-08-17 20:26:05: Matrices made
## Chromunity 2022-08-17 20:26:05: Making Pairs object
## Chromunity 2022-08-17 20:26:15: Pairs object made
## Chromunity 2022-08-17 20:28:04: Pairs made
## Chromunity 2022-08-17 20:28:07: KNN done
## Chromunity 2022-08-17 20:28:07: Communities made
## Chromunity 2022-08-17 20:28:08: Analyzing gr.sums associated with 207 concatemer communities to generate binsets
## Frag counts
## frags = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/Nlaiii.frags.hg38.rds") %>% dt2gr()
frags = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/Nlaiii.frags.hg38.rds"))) %>% dt2gr()
## GC content
## gc5b = readRDS("/gpfs/commons/groups/imielinski_lab/projects/PoreC/db/tutorial_inputs/gc.38.rds")
gc5b = readRDS(gzcon(file("https://mskilab.s3.amazonaws.com/chromunity/gc.38.rds")))
##
cov_list = list(gc5b, frags)
## Specify what kind of covariate it is. Score will be aggregated over bins while the number of intervals will be calculated otherwise.
names(cov_list) <- c("score:gc.cov", "interval:frag.cov")
gc_frag_cov = covariate(name = c("gc", "frag"), type = c("numeric", "interval"), field = c("score", NA), data = cov_list)
gc_frag_cov
## 2 Covariates with features:
## name type class field signature na.rm pad log count
## 1: gc numeric GRanges score <NA> NA 0 TRUE TRUE
## 2: frag interval GRanges <NA> <NA> NA 0 TRUE TRUE
set.seed(198)
annotated_re_chrom = annotate(binsets = this_re_chrom$binsets,
k = 3,
concatemers = this_gr_testing,
covariates = gc_frag_cov, resolution = 2.5e4,
mc.cores = 5)
## Synergy 2022-08-17 20:28:33: Overlapping 520 bins with 1126124 monomers
## Synergy 2022-08-17 20:28:33: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:28:33: Making sub binsets
## Synergy 2022-08-17 20:28:34: Made 50472 sub-binsets
## Synergy 2022-08-17 20:28:34: Counting concatemers across sub-binsets across 5 cores
## Synergy 2022-08-17 20:29:15: Computing min median max distances per setid
## Synergy 2022-08-17 20:29:23: Computing marginal sum per bid
## Synergy 2022-08-17 20:29:23: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:29:25: Merging counts, distance, and width
## Synergy 2022-08-17 20:29:25: Adding covariates
## Synergy 2022-08-17 20:29:25: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:29:26: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:29:26: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:29:26: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:29:27: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:29:28: Merging data from interval covariate frag
head(annotated_re_chrom)
## bid setid count max.dist min.dist width cardinality sum.counts gc
## 1: 1 27 11 68601 68601 75001 2 18638 0.3964695
## 2: 1 28 8 97001 97001 75001 2 18638 0.3884678
## 3: 1 29 1 172001 172001 63401 2 18638 0.3812581
## 4: 1 30 4 255801 255801 75001 2 18638 0.3770913
## 5: 1 31 7 405801 405801 75001 2 18638 0.4029149
## 6: 1 32 6 553001 553001 75001 2 18638 0.3968287
## frag
## 1: 297.2067
## 2: 304.2067
## 3: 256.2067
## 4: 323.2067
## 5: 320.2067
## 6: 345.2067
Parameters | Description |
---|---|
binsets | GRanges of bins with fields seqnames, start, end, and $bid specifying bin-set id |
resolution | The resolution to use for distance simulation |
n | Integer scalar specifying number of sets to sample (default = nrow(bin-sets)) |
interchromosomal.dist | integer inferred average inter-chr distance to be used in case of whole genome simulation |
interchromosomal.table | data.table optional table that contains specific chr-chr inferred distances |
gg | gGnome object to be used for simulating bin-sets in the presence of structural variation |
set.seed(198)
back_re_gr = gr2dt(dt2gr(re_background(binsets = this_re_chrom$binsets, n = 1000,
resolution = 2.5e4)))
## Synergy 2022-08-17 20:29:28: Making kernels
## Synergy 2022-08-17 20:29:28: Making transition matrices
## Synergy 2022-08-17 20:29:28: Generating random sets
## Adding few filters to remove outlier simulation, can be customized
## Removing bins less than resolution and lying out of bounds
## Getting seqlengths of each chromosome
upper.bound = as.data.table(hg_seqlengths(genome = "BSgenome.Hsapiens.UCSC.hg38::Hsapiens"), keep.rownames = T)
setkeyv(back_re_gr, c("seqnames", "start"))
back_re_gr = back_re_gr[!bid %in% back_re_gr[width < (2.5e4-1)]$bid]
back_re_gr = gr2dt(gr.reduce(dt2gr(back_re_gr), by = "bid"))
back_re_gr$bid <- as.factor(back_re_gr$bid)
back_re_gr = merge(back_re_gr, upper.bound, by.x = "seqnames", by.y = "V1", all.x = T, allow.cartesian = T)
back_re_gr = back_re_gr[end < V2][start < V2]
back_re_gr[, overall.cardinality := .N, by = bid]
back_re_gr = back_re_gr[overall.cardinality > 1]
back_re_gr = dt2gr(back_re_gr)
head(back_re_gr)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | bid V2
## <Rle> <IRanges> <Rle> | <factor> <integer>
## [1] chr2 48060335-48110335 * | 1 242193529
## [2] chr2 48135335-48210335 * | 1 242193529
## [3] chr2 122979392-123004392 * | 2 242193529
## [4] chr2 123129392-123204392 * | 2 242193529
## [5] chr2 123404392-123454392 * | 2 242193529
## [6] chr2 123629392-123704392 * | 2 242193529
## overall.cardinality
## <integer>
## [1] 2
## [2] 2
## [3] 4
## [4] 4
## [5] 4
## [6] 4
## -------
## seqinfo: 1 sequence from an unspecified genome
Similar to the chromunity calls, we now annotate the background bins with concatemer counts and covariates.
set.seed(198)
annotated_re_back = annotate(binsets = back_re_gr,
k = 3,
concatemers = this_gr_testing,
covariates = gc_frag_cov, resolution = 2.5e4,
mc.cores = 5)
## Synergy 2022-08-17 20:29:35: Overlapping 1821 bins with 1126124 monomers
## Synergy 2022-08-17 20:29:36: Computing bin by bin pairwise distance
## Synergy 2022-08-17 20:29:36: Making sub binsets
## Synergy 2022-08-17 20:29:38: Made 65964 sub-binsets
## Synergy 2022-08-17 20:29:38: Counting concatemers across sub-binsets across 5 cores
## Synergy 2022-08-17 20:30:01: Computing min median max distances per setid
## Synergy 2022-08-17 20:30:19: Computing marginal sum per bid
## Synergy 2022-08-17 20:30:19: Computing total width and cardinality per setid
## Synergy 2022-08-17 20:30:22: Merging counts, distance, and width
## Synergy 2022-08-17 20:30:22: Adding covariates
## Synergy 2022-08-17 20:30:22: Overlapping 15441361 ranges from covariate gc
## Synergy 2022-08-17 20:30:23: Tallying values from numeric covariate gc
## Synergy 2022-08-17 20:30:24: Merging data from numeric covariate gc
## Synergy 2022-08-17 20:30:24: Overlapping 13836523 ranges from covariate frag
## Synergy 2022-08-17 20:30:25: Tallying values from interval covariate frag
## Synergy 2022-08-17 20:30:27: Merging data from interval covariate frag
## Removing edge cases with no counts
annotated_re_back = annotated_re_back[!bid %in% annotated_re_back[, .(sum(count)), by = bid][V1 == 0]$bid]
head(annotated_re_back)
## bid setid count max.dist min.dist width cardinality sum.counts gc
## 1: 1 3 5 25000 25000 125003 2 1055 0.4036367
## 2: 103 3 21 50000 50000 75003 2 1207 0.4127890
## 3: 115 4 0 75000 75000 200003 2 2180 0.4003545
## 4: 115 5 0 700000 700000 75003 2 2180 0.4548158
## 5: 115 6 12 450000 450000 225003 2 2180 0.4015623
## 6: 115 7 0 700000 75000 250004 3 2180 0.4090673
## frag
## 1: 617.173
## 2: 333.173
## 3: 873.173
## 4: 418.173
## 5: 1027.173
## 6: 1159.173
set.seed(198)
back_re_model = fit(annotated_re_back)
annotated_re_chrom = sscore(annotated_re_chrom, model = back_re_model)
head(annotated_re_chrom)
## bid setid count max.dist min.dist width cardinality sum.counts gc
## 1: 1 27 11 68601 68601 75001 2 18638 0.3964695
## 2: 1 28 8 97001 97001 75001 2 18638 0.3884678
## 3: 1 29 1 172001 172001 63401 2 18638 0.3812581
## 4: 1 30 4 255801 255801 75001 2 18638 0.3770913
## 5: 1 31 7 405801 405801 75001 2 18638 0.4029149
## 6: 1 32 6 553001 553001 75001 2 18638 0.3968287
## frag count.predicted enrichment pval
## 1: 297.2067 11.6107569 0.9473973 0.22530725
## 2: 304.2067 6.8392673 1.1697159 0.19421392
## 3: 256.2067 2.6380810 0.3790634 0.32268332
## 4: 323.2067 1.7950637 2.2283332 0.12783515
## 5: 320.2067 1.4569959 4.8044061 0.05873120
## 6: 345.2067 0.9119446 6.5793473 0.04661607
set.seed(198)
this_synergy = na.omit(synergy(binsets = this_re_chrom$binsets,
annotated.binsets = annotated_re_chrom, model = back_re_model))
## Synergy 2022-08-17 20:30:29: Scoring binsets
this_synergy$fdr = signif(p.adjust(this_synergy$p, "BH"), 2)
head(this_synergy[order(p)])
## bid method p estimate ci.lower ci.upper
## 1: 3 Negative Binomial(0.1597) 0.000345 2.877102 1.6128597 5.132322
## 2: 18 Negative Binomial(0.1597) 0.057200 4.202671 0.9571404 18.453346
## 3: 26 Negative Binomial(0.1597) 0.059900 4.368017 0.9401296 20.294621
## 4: 10 Negative Binomial(0.1597) 0.063400 4.033579 0.9252857 17.583494
## 5: 7 Negative Binomial(0.1597) 0.069600 4.925413 0.8798582 27.572281
## 6: 14 Negative Binomial(0.1597) 0.074700 8.708385 0.8059100 94.099792
## effect fdr
## 1: 2.88 [1.61-5.13] 0.023
## 2: 4.2 [0.957-18.5] 0.780
## 3: 4.37 [0.94-20.3] 0.780
## 4: 4.03 [0.925-17.6] 0.780
## 5: 4.93 [0.88-27.6] 0.780
## 6: 8.71 [0.806-94.1] 0.780
While using RE chromunity, based on the number of concatemers and bin size, the problem set may be too large to fit in memory. In such cases it is advisable to use a higher value for the shave()
parameters.
It is important to make sure that sliding window chromunity is run on individual chromosomes. A for loop can be used to run each chromosome and save outputs before moving to next chromosome.
If at higher resolution, counts are too sparse, it is advisable to rerun the analyses with larger bin sizes.
An empty output from the Chromunity may be indicative of sparseness in the data. Apart from changing the resolution as pointed above, the KNN parameters can be increased to relax the number of neighbors to be considered.