Introduction

The gGnome package provides a flexible, queriable R interface to graphs and walks of genomic intervals. gGnome is written in the R6 object oriented standard and built around a powerful GenomicRanges, data.table, and igraph backend, and thus supports agile interaction with graphs consisting of hundreds of thousands of nodes and edges. Because gGnome classes are written in R6, their methods, variables, and “active bindings” are referenced using the $ symbol. This includes methods that (similar to other object oriented languages) enable the object to be modified “in place”. Please see here for more details about the R6 standard.

Our key interest in developing gGnome is to develop a framework to analyze the sorts of graphs that arise in the study of whole genome cancer structural variation. However, we believe that the package can be useful in any context where graphs and paths of reference genomic intervals are generated, such as through the analysis of complex germline haplotypes, population genetics of structural variation, partially assembled genome drafts, comparative genomics, and transcriptome assembly.

For installation instructions, please visit the gGnome github page. For background, it may help to have some familiarity with data.table, GenomicRanges, gUtils, igraph packages.

gGnome is currently in an alpha release and a manuscript in preparation. If you use gGnome in your work, please contact us so that we can update you when the citation becomes available.

The Basics

The key classes in gGnome comprise the following: the Junction, the genome graph (gGraph) (and its nodes (gNode) and edges (gEdge)), and the genomic walk (gWalk). The nodes in a genome graph represent reference genomic intervals, also known in BioConductor parlance as GRanges. Each gNode corresponds to a chunk of reference genome, and represents a double-stranded DNA sequence. Each gEdge represents a pair of complementary 3-5’ phosphodiester bonds (+/- some foreign sequence insertion), each termed an adjacency.

gGraph anatomy

Since DNA is double stranded, every gNode in the gGraph is actually a pair of stranded GRanges, one on the “+” and the other on the “-” reference strand. Similarly, every gEdge is a pair of directed adjacencies. This makes the gGraph a skew-symmetric directed graph.
In this directed graph, every interval has an “anti-interval”, every adjacency has an “anti-adjacency”, and every path (i.e. sequence of intervals or adjacencies) has an “anti-path”.

The vertices in a standard directed graph do not formally have left and right “sides” that edges can both enter and exit. This limits the applicability of standard directed graphs to the modeling of DNA, since double-stranded DNA segments can be rearranged with each other in any of four (left/left, right/left, right/right, left/left) possible orientations. Though, in a directed graph, we may arbitrarily label nodes that send edges to our vertex as being to its “left”, and edges that receive edges from our vertex to its “right”, it is impossible to then connect two vertices in our directed graph on their mutual left.

However, since gGraph nodes represent interval / anti-interval pairs, we can talk about their “right-side” (i.e. the side with adjacencies leaving the “+” interval and entering the “-” interval) and their “left side” (i.e. the side with adjacencies leaving the “-” interval and entering the “+” interval). Adjacencies associated with the left side of one node can connect to the left or right side of any other node. Though we retain the underlying structure of a directed graph (convenient for applying standard graph algorithms) we are able to model the “sidedness” of DNA sequence. We contrast this skew-symmetric structure to the traditional “bidirected graph” formulation of DNA assemblies, which uses nodes to represent double-stranded sequence “ends” and two flavors of (intrasegment, intersegment) edges to represent sequences and their adjacencies, respectively.

Orientation is fundamental

Just to review, strands of any (double-stranded) reference genome are arbitrarily labeled “+” and “-”, where the “+” strand goes from 5’ to 3’ in the direction of increasing genomic coordinates. Conversely, the “-” strand of the genome goes from 5’ to 3’ in the direction of decreasing genomic coordinates. As a convention we refer to the direction of increasing coordinates on the reference as “right” and the direction of decreasing coordinates as “left”. In the human genome, the 1q telomere is thus on the “right” side of chromosome 1, and the 1p telomere is on its “left”.

Using this system of orientation, there is an intrinsic direction associated with every node, i.e. there are edges associated with its “left” and “right” side. Some of those edges may be “REF” edges, i.e. connecting to that node’s reference-adjacent neighbor. The REF edges connecting a node to its right reference neighbor consist of two adjacencies: one leaving the “+” GRanges associated with the node and entering the “+” GRanges of its neighor, and a second (complementary) adjacency leaving the “-” GRanges of the neighbor and entering the node. Other (“ALT”) edges may connect our node to a distant locus that is non-adjacent in the reference, i.e. an adjacency created through rearrangement.

Genomes are not graphs

Real genomes are not graphs, but consist of linear (and possibly circular) alleles. Each allele is a string of nodes and edges, which we term a gWalk. When these alleles are superimposed (i.e. added), the resulting structure is a graph. The gGraph is thus only an abstraction, representing a summary of our (best) knowledge about the latent allelic state of the genomic sequence (also called its phase). When the nodes and edges on the gGraph are associated with an allelic dosage, this notion becomes quantitative - i.e. the graph represents an estimate of the total dosage of sequences and adjacencies in the given genome. Such an inference is produced by a handful of rearrangement callers, including Weaver, PREGO, RemiXT, CouGaR, and (our tool) JaBbA

Though a gGraph may be our “best guess” at the variant structure of a genome (i.e. from short-read whole genome sequencing (WGS)), certain biological questions require us to make inferences about linear (or circular) alleles. These require us to query features of gWalks generated from our gGraph, representing possible alleles in the genome. For example, we may want to query the genome for the existence of a possible high-copy allele (i.e. path) joining EML4 and ALK resulting in a protein-coding gene fusion. We may want to determine whether a particular regulatory element (i.e. “super-enhancer”) is connected via rearrangement to the gene MYC, resulting in its overexpression. We would like to identify the frequency with which EGFR amplified through the formation of circular extra-chromosomal (episomal) DNA. We may want to identify rearrangement signatures (e.g. BFBs) creating recurrent patterns in a compendium of gGraphs.

We may also be able to further deconvolve (i.e. phase) alleles in the graph using long-range sequence or mapping data (e.g. from optical mapping, linked-read sequencing, or long-read sequencing). Given the complexity of certain aneuploid cancer genomes this deconvolution will most likely be partial, i.e. local and probabilistic. gWalks and gGraphs can help us represent and analyze partially phased genomes, either as probability distributions over gWalks or complex gGraphs with “branches” and “bubbles” representing alternate paths over the same reference intervals.

Similarly, the intermediate stages of de novo (i.e. bottom up) genome assembly pipelines produce graphs, which are often broken or arbitrarily linearized at some stage of analysis, losing possibly important genomic signal. Alignment, visualization, and analysis of assembly graphs in reference genomic coordinates can yield significant biological insight and technical intuition, without the need to generate a “reference-grade” linear assembly. We believe that gGraph and gWalks can serve as useful abstraction to help annotate and refine these partial views of altered genome structure.

gGoals

Our goals in building gGnome are to enable agile analysis of structural variation (rearrangement and copy number alterations) in complex and highly rearranged genomes. Though our work relates to the recent movement in the sequence mapping community to build reference-grade “graph genomes” and “variation graphs”, our goal is to serve a different (though related) purpose. Firstly, these frameworks have been designed primarily with the goal of improving alignment and (single or multi-nucleotide) variant calling. Secondly, these frameworks aim to replace the linear reference genome with a novel graph based data strucutre.

In contrast, the vision for gGnome is to sit downstream of read mapping, assembly, and basic variant calling. We have designed gGnome to enable interactive exploration, annotation, and visualization of complex variants for biological discovery by the genomic data scientist. In addition, gGraphs are still firmly tethered to “linear reference” coordinates, where many / most datasets (epigenetic, transcriptome) live. The goal is to enable vetting of inferred patterns against raw alignment data, both to verify the quality of variant calls / assemblies / phases and interrogate their impact on biology. gGnome achieves this by leveraging several excellent packages in the R / Bioconductor universe for its back-end (GenomicRanges, data.table, igraph) as well as several useful mskilab (gTrack, gUtils) packages for data manipulation and visualization. We demonstrate examples below.

The Junction

The atomic unit of genomic structural variation

Before we get to graph, we have to define the Junction. A Junction is a signed pair of reference genomic locations that are brought together through rearrangement. Each junction connecting locations a and b has one of four possible orientations (a-|b+, a+|b+, a-|b-, a+|b-) that represent the “sides” of the breakpoint that are being joined. By convention we refer to - orientation as connecting the left side of the given breakpoint, while the + orientation connects the right side of the breakpoint.

Junctions calls are usually stored and shared in .bedpe or .vcf files. Though most callers usually provide additional “event” annotations to each junction (e.g. DEL, DUP, INV, TRA), such annotations lose their meaning in complex genomes where many junctions cluster near each other on the reference genome. Many complex (and even some simple variants e.g. inversions, balanced translocations) arise through multiple junctions, therefore the mapping of Junctions and “events” is likely not 1-1. Since “event calling” is an open question in structural variant analysis, the most fundamental unit of structural variant calling is the Junction - a signed genomic location pair.

Importing Junctions

We can load Junctions into gGnome from a variety of input formats from common junction / SV callers, including SvaBa, DELLY, Novobreak using the function jJ.

## SvAbA
svaba = jJ(system.file('extdata', "HCC1143.svaba.somatic.sv.vcf", package = "gGnome"))

## DELLY
delly = jJ(system.file('extdata', "delly.final.vcf.gz", package = "gGnome"))

## novobreak
novobreak = jJ(system.file('extdata', "novoBreak.pass.flt.vcf", package = "gGnome"))

## BEDPE
bedpe = jJ(system.file('extdata', "junctions.bedpe", package = "gGnome"))

In general we support BND style .vcf files and .bedpe files, though many callers vary in their details of implementing these formats.

Subsetting junctions

In gGnome, a Junction is a vectorized object that can be subsetted and queries on the basis of its metadata using convenient data.table style (with = TRUE) syntax. Junction metadata is caller dependent.

## can use both row and column subsetting on Junction metadata
head(novobreak[1:2, 1:10])
##                                               junc CHROM      POS paramRangeID
## 1:     10:118984-118984- <-> 12:68903528-68903528-    10   118984         <NA>
## 2: 10:24254039-24254039- <-> 10:28046995-28046995+    10 24254039         <NA>
##    REF   ALT QUAL FILTER   CIPOS IMPRECISE PRECISE
## 1:     <TRA>    1   PASS -10, 10     FALSE    TRUE
## 2:     <DUP>   60   PASS -10, 10     FALSE    TRUE
## can use data.table style expressions on metadata to subset Junctions
## here, filter novobreak translocations with quality greater than 50
novobreak[ALT == "<TRA>" & QUAL>50, 1:10][1:2, 1:5]
##                                                junc CHROM      POS paramRangeID
## 1: 10:29529472-29529472- <-> 1:200903945-200903945-    10 29529472         <NA>
## 2: 10:38816809-38816809+ <-> 2:101736362-101736362-    10 38816809         <NA>
##    REF   ALT
## 1:     <TRA>
## 2:     <TRA>
## subsetting SvAbA junctions with >10 bases of homologous sequence (short templated-sequence insertions or STSI as defined in the SvABA paper)
svaba[nchar(INSERTION)>10, ][1:2, 1:5]
##                                             junc CHROM      POS paramRangeID
## 1:   1:6525379-6525379- <-> 3:85232670-85232670+     1  6525379         <NA>
## 2: 1:51848768-51848768- <-> 1:70758650-70758650+     1 51848768         <NA>
##    REF           ALT
## 1:   C C[3:85232671[
## 2:   A A[1:70758651[
## subsetting SVabA junctions with same sign and nearby breakpoints (i.e. small $span)
svaba[svaba$sign>0 & svaba$span<1e5][1:2, 1:5]
##                                         junc CHROM     POS paramRangeID REF
## 1: 1:6569292-6569292+ <-> 1:6570681-6570681+     1 6569292         <NA>   T
## 2: 1:8397767-8397767- <-> 1:8401373-8401373-     1 8397767         <NA>   C
##             ALT
## 1: [1:6570682[T
## 2: C]1:8401373]
## subsetting junctions with infinite span (ie different chromosome) and homology length >5
delly[is.infinite(delly$span) & HOMLEN>5, ][1:2,1:5]
##                                                junc CHROM       POS
## 1:  6:29997619-29997619+ <-> 1:212475064-212475064+     6  29997619
## 2: 13:111070932-111070932- <-> 6:58483946-58483946+    13 111070932
##    paramRangeID REF            ALT
## 1:         <NA>   C [1:212475064[C
## 2:         <NA>   T  T[6:58483946[

Note that we are showing off the $span and sign active bindings of the Junction class, which return the “span” (i.e. distance between breakpoints) and “sign” (i.e. product of the breakpoint signs, where “+” = 1, “-” = 1) as a vector.

Yes, a Junction is just a glorified GRangesList, which we can extract using the $grl accessor

svaba$grl[1:2]
## GRangesList object of length 2:
## $`1`
## GRanges object with 2 ranges and 0 metadata columns:
##          seqnames    ranges strand
##             <Rle> <IRanges>  <Rle>
##   6536:1        1   6525379      -
##   6536:2        3  85232670      +
##   -------
##   seqinfo: 84 sequences from hg19 genome
## 
## $`2`
## GRanges object with 2 ranges and 0 metadata columns:
##          seqnames    ranges strand
##             <Rle> <IRanges>  <Rle>
##   6537:1        1   6525379      -
##   6537:2        3  85236506      -
##   -------
##   seqinfo: 84 sequences from hg19 genome

Junction overlaps and merges

However, unlike GRangesList we can perform junction specific queries which only allow for overlap between junctions if both breakpoint locations and orientations match.

## subset svaba by those intersect with DELLY using gUtils subset %&% operator
length(svaba %&% delly)
## [1] 7
## increase the overlap substanntially by padding delly calls with 100bp (using + operator)
length(svaba %&% (delly+100))
## [1] 103
## basic set operations also work
length(setdiff(svaba, delly+100))
## [1] 397
## length(union(svaba, delly+100))

We can do more comprehensive coordinate-based merges / joins of junctions using the merge function. The output is a “consensus” junction set, i.e. a Junction object representing the union of the input junctions with logical metadata columns prefixed by $seen.by that keep track of which junction was “seen” by which caller.

## any names work in the arguments to merge, these will be reflected in metadata as a $seen.by column
## (using padding of 1kb and c() to remove existing metadata)
res = merge(svaba = svaba[, c()], delly = delly[, c()], 
            novo = novobreak[, c()], anynameworks = bedpe[,c()], pad = 1e3)
head(res)
##                                             junc tmp.ix merged.ix anynameworks
## 1:     1:6569292-6569292+ <-> 1:6570681-6570681+     NA         1         <NA>
## 2:  1:6569298-6569298+ <-> 21:36220060-36220060+     NA         2         <NA>
## 3: 1:6683447-6683447+ <-> 1:228054031-228054031+     NA         3         <NA>
## 4:     1:8175806-8175806+ <-> 1:8401106-8401106+     NA         4         <NA>
##    delly novo svaba seen.by.anynameworks seen.by.delly seen.by.novo
## 1:  <NA> <NA>     3                FALSE         FALSE        FALSE
## 2:   191 <NA>  <NA>                FALSE          TRUE        FALSE
## 3:     2 <NA>  <NA>                FALSE          TRUE        FALSE
## 4:  <NA>  177     6                FALSE         FALSE         TRUE
##    seen.by.svaba
## 1:          TRUE
## 2:         FALSE
## 3:         FALSE
## 4:          TRUE
## here we can use the $dt (data.table) accessor quickly make an UpSetR plot
library(UpSetR)

## munge res$dt into upset friendly format
df = as.data.frame(sign(as.matrix(res$dt[,.(seen.by.anynameworks, seen.by.delly, seen.by.novo, seen.by.svaba)])))
upset(df)

The gGraph

Instantiation

We can create a gGraph from several input sources, including junctions, node and edges, and a a host of “graph SV callers” (Weaver, PREGO, RemiXT, CouGaR, and JaBbA) using the gG constructor.

From junctions and breaks

This constructor will “break” the genome (defined using seqinfo of the Junction object) at each connection end and create an ALT gEdge from 2 of the 4 resulting interval pairs. It will also leave a REF gEdge connecting the reference adjacent nodes.

### gGraph from svaba input
gg = gG(juncs = svaba)

The gGraph contains 1059 nodes and 1475 edges. The majority of edges are REF edges connecting reference adjacent nodes, while the remaining are ALT edges that inherit metadata from the input SvAbA junctions. The graph also contains 168 “loose ends”, of which all are terminal. Loose ends represent sequences “ends” that lack a left or right neighbor. These occur at the ends of the contigs that make up the Seqinfo or seqlengths of this object. Since this hg19 derived seqinfo(gg) has 84 “chromosomes”, this gGnome object has 168 terminal loose ends ends. In theory, loose ends can represent terminal 3’ and 5’ phosphates at real DNA “ends” (i.e. teleomeres). More generally, loose ends represent possible sites of attachment of missing or annotated sequence, e.g. unknown or repetitive sequence. Technically, even the annotated telomere represents an “end”.

We can plot the resulting graph as well as the junctions using gTrack, a useful Marcin Imielinski Lab package for visualizing GRanges-based genomic tracks. Each gGraph object has an accessor $gt that generates a gTrack object, which can be concatenated with other gTrack objects, e.g. those representing genome-wide coverage, ChIP-seq, gene annotations, etc.

## we use gTrack to plot the gTrack associated with this gGraph
## the second argument to gTrack plot is a string or GRanges representing the
## window to plot, the links argument enables drawing of junctions from GRangesList
plot(gg$gt, '1', links = svaba$grl)

Each sequence on chromosome 1 is drawn as a gray rectangle. Rectangles are connected by gray REF and red ALT edges. Most red ALT connect different rectangles on chromosome 1 (i.e. are intrachromosomal), though some edges go off the graph and only one end is visible. (are interchromosomal). The topmost track demonstrates the input SvAbA junctions, which distributed across the gGraph. If you look close you can see two “hooked” blue loose ends, one at the left terminus and the other at the right terminus of chromosome 1.

In this plot, the y axis position of each rectangle has no meaning - the vertical stacking is purely for visual purposes, i.e. to minimize collisions.

We can create a graph with additional REF edges by supplying a “breaks” GRanges to the gG constructor. This will break the genome and add new gray REF edges (but no ALT edges) at each location.

### generate breaks using gUtils function to tile genome at evenly spaced 1MB intervals
breaks = gr.tile(seqinfo(svaba), 1e6)

### gGraph from svaba input
gg2 = gG(breaks = breaks, juncs = svaba)

### set gGraph metadata
gg2$set(name = 'with breaks')

### compare graphs towards the beginning of chromosome 1
### (gTracks can be concatenated to plot multiple tracks)
plot(c(gg$gt, gg2$gt), '1:1-5e7', links = svaba$grl)

From nodes and edges

To create custom graphs we provide a constructor that takes a GRanges of interval nodesand a data.table of edges specifying their connectivity. The edges data.table contains 4 essential columns: n1 and n2 specifies the nodes indices that are being joined and n1.side and n2.side specify which side of each node is being joined. The value of side can be integer (0 for left, 1 for right) or character (left, right).

## tiling of chromosome 1 
nodes = gr.tile(seqlengths(svaba)["1"], 1e7)

## generate 20 random edges (n1, n2, n1.side, n2.side)
edges = data.table(
           n1 = sample(length(nodes), 20, replace = TRUE),
           n2 = sample(length(nodes), 20, replace = TRUE))
edges[, n1.side := ifelse(runif(.N)>0.5, 'right', 'left')]
edges[, n2.side := ifelse(runif(.N)>0.5, 'right', 'left')]

gg3 = gG(nodes = nodes, edges = edges)

plot(gg3$gt, '1')

Note that all edges in the graph by default are ALT (red) and loose ends (blue) are automatically placed at any interval end that lacks an edge. In this case all the nodes are non-overlapping and equal size but nothing in gGraph requires this.

pad = runif(length(nodes))*width(nodes)

gg3 = gG(nodes = nodes + pad , edges = edges)

plot(gg3$gt, '1')

From external tools

We support several “graph SV caller” formats for gGraph import.

## PREGO is the original cancer SV graph caller from Oesper et al 2012
gg.prego = gG(prego = system.file('extdata/hcc1954', 'prego', package='gGnome'))

## Weaver is from Li et al 2016
gg.weaver = gG(weaver = system.file('extdata/hcc1954', 'weaver', package='gGnome'))

## RemiXt is from McPherson et al 2018
gg.remixt = gG(remixt = system.file('extdata/hcc1954', 'remixt', package='gGnome'))

## JaBbA is from Imielinski Lab, Yao et al (in preparation)
gg.jabba = gG(jabba = system.file('extdata/hcc1954', 'jabba.rds', package="gGnome"))

plot(c(gg.prego$gt, gg.weaver$gt, gg.remixt$gt, gg.jabba$gt), '4')

These are graphs built from the analysis of the Her2+ breast cancer cell line HCC1954. These plots, unlike the previous, have a y-axis, which in this case represents copy number. The objects / tracks named by the caller that generated them.

ThesegGraph metadata features were set during object instantiation and can be accessed using the $metaactive binding and set using using the $set method. These meta fields are used as defaults for certain gGraph methods e.g. $gt, $simplify, $reduce, and others (some of which are described below).

The y.field meta data field, in particular, tells the $gt active binding to plot the gNode metadata "cn" on the y.axis. As a result the y axis position of each node tells us how many copies of that interval is predicted by that algorithm to exist in HCC1954.

If we set y.field to NULL then $gt will not create a y axis and stack the intervals, similar to the previous examples where created a gGraph from breaks or directly from nodes and edges.

## accessing the meta data features of a gGraph
gg.remixt$meta
## $name
## [1] "ReMiXt"
## 
## $y.field
## [1] "cn"
## 
## $y0
## [1] 0
## 
## $by
## [1] "cn"
## not that the y.field points to a column of the node metadata, accessed
## by $nodes$dt
gg.remixt$nodes$dt[1:2, ]
##    seqnames  start    end  width strand    length major_is_allele_a
## 1:        1      1 177417 177417      + 10575.481                 1
## 2:        1 177418 267719  90302      +  2076.497                 1
##    major_readcount minor_readcount readcount allele_ratio major_depth
## 1:             304             124     10672    0.2897196   0.7167629
## 2:              88              10      7041    0.1020408   3.0448065
##    minor_depth total_depth major_0 minor_0 major_1 minor_1 major_2 minor_2
## 1:   0.2923638    1.009127       1       1       3       0       3       1
## 2:   0.3460007    3.390807       1       1       3       0       3       1
##    major_raw minor_raw major_depth_e minor_depth_e total_depth_e   major_e
## 1:  8.829244  3.591415      0.244444    0.04813139     0.2925754 2585.1126
## 2: 37.561385  4.253389      0.244444    0.04813139     0.2925754  507.5871
##      minor_e   total_e major_raw_e minor_raw_e major_diff minor_diff
## 1: 509.01261 3094.1252           3   0.5771584          0          1
## 2:  99.94468  607.5318           3   0.5771584          0          1
##    prob_is_outlier_total prob_is_outlier_allele total_likelihood_mask
## 1:                  0.01                   0.01                     0
## 2:                  0.01                   0.01                     0
##    allele_likelihood_mask cn loose.left loose.right node.id snode.id index
## 1:                      0  3       TRUE       FALSE       1        1     1
## 2:                      0  3      FALSE       FALSE       2        2     2
## setting the y.field to NULL (previously "cn")
gg.remixt$set(y.field = NULL)

## now the gTrack when plotted on chromosome 4 will no longer plot 
## "cn", instead the nodes / segments will stack to stay out of each other's 
## way 
plot(gg.remixt$gt, '4')

Manipulating and browsing gGraphs

One useful feature of the gGraph, enabled by the R6 system, is the ability to easily navigate nodes and edges of the graph by subsetting, querying on metadata, and quickly selecting neighboring nodes and edges of a given query.

Marking edges and node metadata

Since gNode and gEdge objects are pointers to part of a gGraph, their methods can be used to modify the metadata of the gGRaph in place. This is done using the $mark() method.

It may be useful for example for us to highlight where the junctions harboring templated insertions in the JaBbA graph sit in the ReMiXT graph.

## set the metadata column "col" of remixt edges corresponding to long insertion jabba nodes to the value "blue"
eli.remixt$mark(col = 'blue')

## set the metadata column "col" of remixt nodes overlapping long insert jabba edges to the value "green"
nli.remixt$mark(col = 'green')

## we can also mark the analogous regions in the JaBbA model
biginsert$mark(col = 'blue') ## marking gEdge
biginsert$nodes$mark(col = 'green') ## marking gNode associated with these gEdges

## these metadata values will be interpreted by gTrack as segment and connection colors
## we plot both jabba and remixt objects (which have been changed by the above commands)
## near the vicinity of 5 of the biginsert junctions
plot(c(gg.remixt$gtrack(y.field = 'cn'), gg.jabba$gt), unlist(biginsert$grl[1:2])[, c()]+1e6)

We can see this plot shows two blue edges in three (reference) discontiguous windows. One of the blue edges (on the left) is a biginsert edge that is found only in the JaBbA model, while the the second blue edge (in the right window on chromosome 4 is found by both JaBbA and ReMiXT). The nodes attached to these edges have been highlighted, which in the bottom track mark the nodes where we would expect to find the first blue biginsert edge.

Trimming and subsetting gGraphs

In certain cases we may want to isolated pieces of gGraphs for further analysis. For example, we may want to pull the subgraph associated with a particular subset of nodes, trim a graph around a set of GRanges, or pull a subgraph that is within some (base pair) distance of a seed region on the gGraph.

## gGraph uses the bracket syntax for subsetting, where the indices before the comma
## corresponds to nodes and after the comma correspond to edges
## as with gNode and gEdge, both integers and metadata expressions will work 
ggs1 = gg.jabba[cn>100, ]
ggs1$set(name = 'gGraph\nsubset')

## this syntax is equivalent to the above, but uses the `gNode` subgraph command
ggs2 = gg.jabba$nodes[cn>100]$subgraph
ggs2$set(name = 'gNode\nsubgraph')

## here instead of subsetting on nodes, we trim the graph around a set of `GRanges`
## note that trim works in place, so if we want another copy we use $copy to clone the
## gGraph object
ggs3 = gg.jabba$copy$trim(highcopy$gr+1e5)
ggs3$set(name = 'trimmed\nJaBbA')

## since this function uses GRanges as input, we can apply it to the ReMiXT graph
## as well. 
ggs4 = gg.remixt$copy$trim(highcopy$gr+1e5)
ggs4$set(name = 'trimmed\nRemiXT', y.field = 'cn')

plot(c(gencode, ggs1$gt, ggs2$gt, ggs3$gt, ggs4$gt), highcopy$gr+2e5)

The first two (bottom) tracks generate identical graphs, subsetted to neighbors of highcopy nodes. The third track provides a window that includes 100kbp on each side of high copy nodes. The trim() function trims the boundaries of the nodes in the graph to the provided window. The fourth track does the same to the RemiXT graph.

Another useful task in gGraph browsing is to define a subgraph in a given “neighborhood” around a given seed. Though this can be sort of done using the combinations of $nodes and $edges accessors on gNodes, this limits us to seed that are themselves nodes (or edges) in the graph only allows us to expand a certain order (i.e. ego from the given seed).

Though we provide an $ego query on nodes (shown below), a more useful way to define local subgraph is around a specific GRanges seed and some base pair distance away from that seed. Let’s go back to the highcopy region around ERBB2 and try to zoom out a certain distance in the graph.

## define subgraph 100kbp around our "high copy region"
## we copy so we keep our gg.jabba intact
ggs = gg.jabba$copy$subgraph(highcopy$gr, d = 1e5)

## adjust GENCODE track to keep things pretty
gencode = track.gencode(stack.gap = 2e5, cex.label = 0.5, height = 10, name = 'GENCODE')

## we could plot ggs or just plot gg.jabba and use the ggs$footprint to guide us
## we set the upper y axis limit y1 to 20 to visualize low level copy number changes 
plot(c(gencode, gg.jabba$gt), ggs$footprint+1e5, y1 = 20)

Here, we see several low copy junctions to regions of chromosome 1, 11, and 12 internal to locations internal to the ERBB2 amplicon (rightmost window). This suggests late incorporation of distant genetic material into this (likely episomal) amplicion, or (alternatively) possible chromosomal integration sites of 1 or more copies of this amplicon e.g. into chromosome 12.

Advanced gGraph manipulation

We can edit gGraph objects by adding edges or nodes, replicating nodes or paths (i.e. introducing “bubbles”), and merging graphs with each other. We may want to “chop up” (or “disjoin”) the nodes of a graph across implicit internal reference connections, or simplifying the graph by merging reference adjacent nodes (+/- if they share a particular metadata feature). We may also want to combine graphs that represent different (partially phased) haplotypes into a union graph, and reduce or disjoin this graph into one that contains a single set of non-overlapping nodes.

Let’s begin with a simple graph

## define a simple window on chromosome 1
win = GRanges('1:1-1e7')

## create a simpel graph with 3MB bins
tiles1 = gr.tile(win, 3e6);
gg1 = gG(breaks = tiles1, meta = data.table(name = 'gg1'))

## create a second graph tiling the window with 2 MB bins
tiles2 = gr.tile(win, 2e6);
gg2 = gG(breaks = tiles2, meta = data.table(name = 'gg2'))

## this gGraph metadata will tell gTrack to plot the node.id with each node
gg1$set(gr.labelfield = 'node.id')
gg2$set(gr.labelfield = 'node.id')

## plot these two simple graphs
plot(c(gg1$gt, gg2$gt), win)

Concatenate, disjoin, and simplify graphs

We can concatenate any graphs, which takes the union of their nodes and junctions. We can also apply “disjoin”, “simplify”, and “reduce” to split, collapse, and merge graphs

## concatenate gg1 and gg2 
gg3 = c(gg1, gg2)
gg3$set(name = 'c(gg1, gg2)', height = 20)

## disjoin gg3 collapses the graphs into each other
## by taking the disjoin bins of any overlapping nodes
gg3d = gg3$copy$disjoin()
gg3d$set(name = 'disjoined')

## simplify collapses reference adjacent nodes that lack
## an intervening ALT junction or loose end.
gg3ds = gg3d$copy$simplify()
gg3ds$set(name = 'simplified')

## reduce is equivalent to a disjoin followed by a simplify
gg3r = gg3$copy$reduce()
gg3r$set(name = 'reduced')

## plot
plot(c(gg3$gt, gg3d$gt, gg3ds$gt, gg3r$gt), win)

Add nodes and junctions

Disjoining by GRanges is a quick way to add intervals to a graph without changing any connectivity (since we are just “instantiating” the reference edges that are internal to a node and implicitly exist).

## randomly sample 4 width 1 GRanges representing SNV
snv = gr.sample(win, 4, wid = 1)

## disjoin with gr= argument breaks the graph at these SNV
gg1d = gg1$copy$disjoin(gr = snv)

## plot results
plot(gg1d$gt, win)

We can use the $add method to add edges to the graph

## new edges are specified as data.table with n1, n2, n1.side and n2.side
gg1d$add(edges = data.table(n1 = 3, n1.side = 'left', n2 = 7, n2.side = 'right'))

## plot 
plot(gg1d$gt, win)

A simpler syntax is to use $connect.

## connect syntax specifies edges as pairs of "signed" node ids
## this means that the edge leaves the <right> side of the first signed node
## and enters the <left> side of the second signed node

## thus here we create an edge leaving the right side of 5 and entering the right side of 8
## (i.e. the left side of -8)
gg1d$connect(5, -8)

## this connects the right side of 3 and the left side of 9
gg1d$connect(3, 9)

## plot 
plot(gg1d$gt, win)

We can also $add junctions, however since a Junction is a pair of coordinates, they they create ALT edges between every intersecting node pair in the graph. If we want to add these same locations to just one “haplotype” of the input graph, then we will need to use $connect or $add with a specific node pair.

We illustrate using gg3.

## adding a junction to a copy of gg3
gg3j = gg3$copy$add(junctions = svaba[7])
gg3j$set(name = 'add junction')

## note that we have instantiated 4 separate edges, connecting all
## eligible breakpoint pairs on our input graph
gg3j$edges[type == 'ALT', ]
##    sedge.id type    class parent.graph og.edge.id grl.ix CHROM     POS
## 1:       21  ALT INV-like         <NA>         NA      1     1 8397767
## 2:       22  ALT INV-like         <NA>         NA      1     1 8397767
## 3:       23  ALT INV-like         <NA>         NA      1     1 8397767
## 4:       24  ALT INV-like         <NA>         NA      1     1 8397767
##    paramRangeID REF          ALT QUAL FILTER DISC_MAPQ NUMPARTS MATEMAPQ MATEID
## 1:         <NA>   C C]1:8401373]   33   PASS        28        2       NA 8376:2
## 2:         <NA>   C C]1:8401373]   33   PASS        28        2       NA 8376:2
## 3:         <NA>   C C]1:8401373]   33   PASS        28        2       NA 8376:2
## 4:         <NA>   C C]1:8401373]   33   PASS        28        2       NA 8376:2
##    MAPQ REPSEQ READNAMES                     SCTG SVTYPE BX NM MATENM SPAN
## 1:   60   <NA>           c_1_8379001_8404001_163C    BND     1      1 3606
## 2:   60   <NA>           c_1_8379001_8404001_163C    BND     1      1 3606
## 3:   60   <NA>           c_1_8379001_8404001_163C    BND     1      1 3606
## 4:   60   <NA>           c_1_8379001_8404001_163C    BND     1      1 3606
##    INSERTION SUBN HOMLEN                                              HOMSEQ
## 1:      <NA>    3     NA ATCCGCCCGCCTCAGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACC
## 2:      <NA>    3     NA ATCCGCCCGCCTCAGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACC
## 3:      <NA>    3     NA ATCCGCCCGCCTCAGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACC
## 4:      <NA>    3     NA ATCCGCCCGCCTCAGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACC
##    IMPRECISE EVDNC SECONDARY mateid svtype first right     coord    mcoord mix
## 1:     FALSE ASDIS     FALSE 8376:2    BND  TRUE FALSE 1:8397767 1:8401373  10
## 2:     FALSE ASDIS     FALSE 8376:2    BND  TRUE FALSE 1:8397767 1:8401373  10
## 3:     FALSE ASDIS     FALSE 8376:2    BND  TRUE FALSE 1:8397767 1:8401373  10
## 4:     FALSE ASDIS     FALSE 8376:2    BND  TRUE FALSE 1:8397767 1:8401373  10
##    MATECHROM MATEPOS tier edge.id n1.side n2.side n1 n2
## 1:         1 8401373    2      21   right   right 13 17
## 2:         1 8401373    2      22   right   right 13 18
## 3:         1 8401373    2      23   right   right 14 17
## 4:         1 8401373    2      24   right   right 14 18
## alternatively let's create a disjoint graph containing the breakpoints of this junction
bp = unlist(svaba[7]$grl)

## this uses an alternative syntax of disjoin with collapse = FALSE flag (ie where we only
## do a partial disjoin by chopping up reference nodes without collapsing overlapping nodes)
gg3d = gg3$copy$disjoin(gr = bp, collapse = FALSE)
gg3d$set(name = 'disjoin w bp')

## now we can add an ALT edge to just one of the 4 pairs of breakpoints
gg3de = gg3d$copy$connect(18,-14)
gg3de$set(name = 'connect')

## plot results
plot(c(gg3j$gt, gg3d$gt, gg3de$gt), win)

Replicate nodes and paths

An important concept in representing individual or population variation in genome graphs is the concept of a “bubble”, or a variant / ALT path through the graph that contains one or more sequence differences like SNV or indels. We can instantiate such paths in an existing graph using the $rep operation, which “replicates” a given node or path in the graph.

## copy gg2
gg2$set(name = 'original')
gg2c = gg2$copy

## replaces current copy of the first SNV with three separate copies
## i.e. representing different variants
gg2c$rep(2, 3)

## rep adds a metadata field "parent.node.id" to the graph
## which allows us to track the original node.id prior to replication
## we set gr.labelfield here to plot the parent.node.id instead of the node.id
## to see this correspondence
gg2c$set(name = 'replicate', gr.labelfield = 'parent.node.id')

##
plot(c(gg2$gt, gg2c$gt), win)

In addition to replicating individual nodes we can create bubbles from paths in the graph.

## n1 is the third copy of the node previously known as 2
n1 = gg2c$nodes[parent.node.id == 2][1]

## N2 is the node previously known as 4
n2 = gg2c$nodes[parent.node.id == 4]

## retrieve the path from these two nodes and replicate it in the graph
p = gg2c$paths(n1,n2)
gg2c$rep(p, 2)

## replot with current node.id
gg2c$set(gr.labelfield = 'node.id')
plot(c(gg2c$gt), win)

Now that we have a complex haplotype structure with alternate paths, we may want to add an allele specific rearrangement edges, in this case a shortcut from the right side of node 5 to the left side of node 10.

## we can now use $connect with gNode arguments to connect these two alleles
gg2c$connect(5, 10)

## now let's mark the (new) shortest path between nodes 1 and 2
p = gg2c$paths(1, 2)
p$mark(col = 'pink')

plot(c(gg2c$gt), win)

Components and communities

Weakly and strongly connected components of nodes

We can label weakly and strongly connected components in a gGraph with the $clusters method. These allow us to group nodes and define genomic subgraphs based on the connectivity patterns of nodes. This method modifies the $cluster node metadata data field, which can then be used to query and subset the gGraph.

## label weakly connected components in the jabba graph
## the $cluster node metadata field stores the output of running this method
gg.jabba$clusters(mode = 'weak')

## inspecting these shows that most nodes are part of a large weakly-connected component
## and the remaining are part of 1-node clusters
sort(table(gg.jabba$nodes$dt$cluster))
## 
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##   66   67   68    1 
##    1    1    1 1252
## analysis of strongly connected components reveals some more structure
## we peek at one of these clusters, marking its nodes blue
gg.jabba$clusters(mode = 'strong')

gg.jabba$nodes[cluster == 240]$mark(col = 'blue')

## then plotting shows an interesting amplicon
plot(gg.jabba$gt, gg.jabba$nodes[cluster == 240]$gr+1e5)

Strongly connected components require that every node is reachable from every other node via a directed path. As seen above, multi-node strongly connected components result in cyclic structures. In genome graphs, these represent complex amplicons.

Weakly connected components may not reveal interesting structures for a full genome graph representing a highly rearranged cancer genome, but may reveal interesting structure once we remove very wide nodes. The gNode version of the $clusters method allows us to conveniently mark up clusters in our original graph after clustering the subgraph only involving the selected set of nodes

## we first select nodes that are 1 Mbp in width, then compute clusters
gg.jabba$nodes[width<1e6]$clusters('weak')

## note that this syntax still sets the $clusters metadata field of the original 
## graph, giving any >1 Mbp nodes a cluster ID of NA
table(is.na(gg.jabba$nodes$dt$cluster), gg.jabba$nodes$dt$width>1e6)
##        
##         FALSE TRUE
##   FALSE   995    0
##   TRUE      0  324
## we peek at one of these interesting clusters, marking it with a blue color
gg.jabba$nodes[cluster == 111]$mark(col = 'green')

## interestingly, we have re-discovered the ERBB2 BFB-driven amplification highlighted above
gg.jabba$set(height = 30)
plot(c(gencode, gg.jabba$gt), gg.jabba$nodes[cluster == 111]$gr[, c()]+1e5)

Edge clusters

Clusters of quasi-reciprocal ALT edge can reveal “cycles” or “chains” of rearrangements in genome graphs. Such patterns have been dubbed “chromoplexy” (multi-way balanced rearrangements) and also linked to TIC (templated insertion chains). We enable detection of these edge clusters with the gGraph method $eclusters. This takes an argument thresh which is a distance threshold with which to group nearby quasi-reciprocal junctions - i.e. if thresh=0 then we only consider clusters of exactly reciprocal junctions. The default value of thresh is 1 Kbp.

## this will populate the ALT edges of the gGraph with metadata fields $ecluster, $ecycle, and $epath
## where $ecluster is the concatenation of $ecycle and $epath labels
gg.jabba$eclusters()

## paths are labeled by a "p" prefix, and cycles labeled by a "c" prefix
## here we see a multi-junction cluster p52 with 6 edges
sort(table(gg.jabba$edges$dt$ecluster))
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
## 36 38 27 28 29 30 31 32 33 34 35 37 39 40 41 
##  2  2  3  3  3  3  3  3  3  3  3  4  5  7  9
## we can mark these edges (and their associated nodes) with a special color
gg.jabba$edges[ecluster == 41]$mark(col = 'purple')
gg.jabba$edges[ecluster == 41]$nodes$mark(col = 'purple')

## here, the edges and nodes of the cluster that we have discovered
## are highlighted in purple 
plot(c(gg.jabba$gt), gg.jabba$edges[ecluster == 41]$nodes$gr %>% streduce(1e5))

The gWalk

The gWalk allows us to represent and analyze paths of genomic intervals in gGraphs, and annotate these paths with important metadata. Most fundamentally, a gWalk represents a possible linear (or circular) allele that can be inferred from our current graph model. More generally, the analysis of gWalks can help us study and annotate the connectivity of the graph, with the goal of identifying protein-coding gene fusions and non-coding regulatory element “proximities” (see Applications section below). Once we generate a gWalk object, it is very easy to query the gGraph that it belongs to and extract information about nodes and edges belonging to that gWalk or more loosely associated with it.

(Shortest) paths between node pairs

A simple way to generate gWalk objects is to search for paths connecting distant nodes in our graphs. These paths represent possible alleles or haplotypes that may exist in the genome that our graph is modeling. The $paths method of gGraph takes as input vectors of signed node ids or gNode objects corresponding the start (i.e. “source”) and terminus (i.e. “sink”) of paths that we are searching for. If no such paths exist, $paths will return an empty gWalk object. The default behavior of $paths is to operate in a strand-inspecific fashion (demonstrated below), but if we set ignore.strand = FALSE then the query will only return paths that start and end on the specified strand. This is important when examining connections between single-stranded genetic features (e.g. exons).

## path between nodes 1 and 10
p1 = gg.jabba$paths(1, 1000)
p1
##    walk.id name length       wid circular source sink      dist
## 1:       1    1     48 171774043    FALSE      1 1000 112889276
##                                                                                                                                                  gr
## 1: 1:1-6123459+ -> 1:6124594-51906579+ -> 1:51906580-52146887+ -> ... -> 8:106287457-106291908- -> 8:105510526-106287456- -> 14:54588232-107349540+
## $dt accessor gives us walk metadata, which we can set with $set method
## by default it contains the $length which is the number of nodes in the walk, the width which is the
## total base pairs contain in the walk
p1$set(name = 'my first walk')
p1$dt
##    walk.id          name length       wid circular source sink      dist
## 1:       1 my first walk     48 171774043    FALSE      1 1000 112889276
##           snode.id              sedge.id
## 1: 1,3,4,5,6,7,...  2, 4, 6, 7, 8,10,...
## like the gGraph, a gWalk contains $nodes and $edges accessors which correspond to all the nodes that
## and edges that contribute to that walk
p1$nodes
## GRanges object with 4 ranges and 20 metadata columns:
##     seqnames            ranges strand |        cn  start.ix    end.ix eslack.in
##        <Rle>         <IRanges>  <Rle> | <numeric> <integer> <integer> <numeric>
##   1        1         1-6123459      + |         4         1        11         0
##   3        1  6124594-51906579      + |         4        13        68         0
##   4        1 51906580-52146887      + |         5        69        70         0
##   5        1 52146888-89695552      + |         4        71        97         0
##     eslack.out     loose      edges.in     edges.out   tile.id     index
##      <numeric> <logical>   <character>   <character> <integer> <integer>
##   1          0     FALSE          ()-> ->2(3),->3(1)         1         1
##   3          0     FALSE 1(1)->,2(3)->        ->4(4)         3         3
##   4          0     FALSE 3(4)->,4(1)-> ->4(1),->5(4)         4         4
##   5          0     FALSE        4(4)->        ->6(4)         5         5
##      snode.id loose.left loose.right loose.cn.left loose.cn.right   node.id
##     <numeric>  <logical>   <logical>     <numeric>      <numeric> <integer>
##   1         1       TRUE       FALSE             0              0         1
##   3         3      FALSE       FALSE             0              0         3
##   4         4      FALSE       FALSE             0              0         4
##   5         5      FALSE       FALSE             0              0         5
##             col   cluster  pcluster  ncluster
##     <character> <integer> <integer> <integer>
##   1        <NA>      <NA>        99       311
##   3        <NA>      <NA>        97       313
##   4        <NA>        41        96       314
##   5        <NA>      <NA>        95       315
##   -------
##   seqinfo: 84 sequences from an unspecified genome
p1$edges
##    sedge.id cn type CHROM     POS paramRangeID  REF          ALT QUAL FILTER
## 1:        2  1  ALT     1 6123459         <NA>    T T[1:6124594[   40   PASS
## 2:        4  4  REF  <NA>      NA         <NA> <NA>                NA   <NA>
## 3:        6  4  REF  <NA>      NA         <NA> <NA>                NA   <NA>
## 4:        7  4  REF  <NA>      NA         <NA> <NA>                NA   <NA>
##    DISC_MAPQ NUMPARTS MATEMAPQ MATEID MAPQ REPSEQ READNAMES
## 1:        37        2       NA 3293:2   60   <NA>          
## 2:        NA       NA       NA   <NA>   NA   <NA>          
## 3:        NA       NA       NA   <NA>   NA   <NA>          
## 4:        NA       NA       NA   <NA>   NA   <NA>          
##                        SCTG SVTYPE BX NM MATENM SPAN INSERTION SUBN HOMLEN
## 1: c_1_6100501_6125501_136C    BND     0      0 1135      <NA>   NA     NA
## 2:                     <NA>   <NA>    NA     NA   NA      <NA>   NA     NA
## 3:                     <NA>   <NA>    NA     NA   NA      <NA>   NA     NA
## 4:                     <NA>   <NA>    NA     NA   NA      <NA>   NA     NA
##    HOMSEQ IMPRECISE EVDNC SECONDARY mateid svtype first right     coord
## 1:   GCCT     FALSE ASDIS     FALSE 3293:2    BND  TRUE  TRUE 1:6123459
## 2:   <NA>        NA  <NA>        NA   <NA>   <NA>    NA    NA      <NA>
## 3:   <NA>        NA  <NA>        NA   <NA>   <NA>    NA    NA      <NA>
## 4:   <NA>        NA  <NA>        NA   <NA>   <NA>    NA    NA      <NA>
##       mcoord mix MATECHROM MATEPOS tier id edge.id    class  col ecluster hcl.1
## 1: 1:6124594  68         1 6124593    2 47       2 DEL-like <NA>       NA     1
## 2:      <NA>  NA      <NA>      NA   NA NA       4      REF <NA>       NA    NA
## 3:      <NA>  NA      <NA>      NA   NA NA       6      REF <NA>       NA    NA
## 4:      <NA>  NA      <NA>      NA   NA NA       7      REF <NA>       NA    NA
##    hcl.2 ehcl ecycle n1.side n2.side n1 n2
## 1:     2    1   <NA>   right    left  1  3
## 2:    NA   NA   <NA>   right    left  3  4
## 3:    NA   NA   <NA>   right    left  4  5
## 4:    NA   NA   <NA>   right    left  5  6
## these edges and nodes reference the gGraph from which they were derived
## that gGraph can be accessed via the $graph accessor
identical(p1$graph, gg.jabba)
## [1] TRUE
## we can view the nodes of the gWalk as GRangesList
p1$grl
## GRangesList object of length 1:
## $`1`
## GRanges object with 48 ranges and 20 metadata columns:
##        seqnames              ranges strand |        cn  start.ix    end.ix
##           <Rle>           <IRanges>  <Rle> | <numeric> <integer> <integer>
##      1        1           1-6123459      + |         4         1        11
##      3        1    6124594-51906579      + |         4        13        68
##      4        1   51906580-52146887      + |         5        69        70
##      5        1   52146888-89695552      + |         4        71        97
##      6        1   89695553-89796926      + |         5        98        99
##    ...      ...                 ...    ... .       ...       ...       ...
##    603        8 109821451-110027327      + |         7      2559      2560
##   -550        8 106292156-106295295      - |        13      2474      2474
##   -548        8 106287457-106291908      - |        13      2472      2472
##   -547        8 105510526-106287456      - |        11      2469      2471
##   1000       14  54588232-107349540      + |         4      3843      3898
##        eslack.in eslack.out     loose            edges.in          edges.out
##        <numeric>  <numeric> <logical>         <character>        <character>
##      1         0          0     FALSE                ()->      ->2(3),->3(1)
##      3         0          0     FALSE       1(1)->,2(3)->             ->4(4)
##      4         0          0     FALSE       3(4)->,4(1)->      ->4(1),->5(4)
##      5         0          0     FALSE              4(4)->             ->6(4)
##      6         0          0     FALSE    5(4)->,1327(1)->   ->7(4),->1327(1)
##    ...       ...        ...       ...                 ...                ...
##    603         0          0     FALSE            602(7)-> ->604(3),->1869(4)
##   -550         0          0     FALSE   548(4)->,549(9)-> ->551(9),->1922(4)
##   -548         0          0     FALSE  547(11)->,607(2)->  ->549(9),->550(4)
##   -547         0          0     FALSE 546(10)->,2319(1)->          ->548(11)
##   1000         0          0     FALSE  999(3)->,1866(1)->               ->()
##          tile.id     index  snode.id loose.left loose.right loose.cn.left
##        <integer> <integer> <numeric>  <logical>   <logical>     <numeric>
##      1         1         1         1       TRUE       FALSE             0
##      3         3         3         3      FALSE       FALSE             0
##      4         4         4         4      FALSE       FALSE             0
##      5         5         5         5      FALSE       FALSE             0
##      6         6         6         6      FALSE       FALSE             0
##    ...       ...       ...       ...        ...         ...           ...
##    603       603       603       603      FALSE       FALSE             0
##   -550       550      1869      -550      FALSE       FALSE             0
##   -548       548      1867      -548      FALSE       FALSE             0
##   -547       547      1866      -547      FALSE       FALSE             0
##   1000      1000      1000      1000      FALSE        TRUE             0
##        loose.cn.right   node.id         col   cluster  pcluster  ncluster
##             <numeric> <integer> <character> <integer> <integer> <integer>
##      1              0         1        <NA>      <NA>        99       311
##      3              0         3        <NA>      <NA>        97       313
##      4              0         4        <NA>        41        96       314
##      5              0         5        <NA>      <NA>        95       315
##      6              0         6        <NA>        20         1         1
##    ...            ...       ...         ...       ...       ...       ...
##    603              0       603        <NA>         1         1         1
##   -550              0       550        <NA>         1         1         1
##   -548              0       548        <NA>         1         1         1
##   -547              0       547        <NA>         1         1         1
##   1000              0      1000        <NA>      <NA>        93       462
##   -------
##   seqinfo: 84 sequences from an unspecified genome
## sign only matters if we use ignore.strand = FALSE
p2 = gg.jabba$paths(1, -1000)

## so p2 will be virtually identical to p1
identical(p1$grl, p2$grl)
## [1] FALSE
## if we compute path in a strand specific manner, then we may get a different result
gg.jabba$paths(1, -1000, ignore.strand = FALSE)

## this long and winding path involves a completely separate chromosome, and may represent
## an interesting long range allele in this cancer genome.
## like with gNode and gEdge we can "mark" the nodes and edges of the gGraph that comprise
## this gWalk
p1$mark(col = 'pink')
gg.jabba$set(gr.labelfield = 'node.id')

## we can generate a gTrack from this via the $gt method and the GRanges comprising the  genomic "footprint"
## of this gWalk using the $footprint accessor
plot(c(gg.jabba$gt, p1$gt), p1$footprint+1e6)

The gWalk object is vectorized, and can be subsetted much in the same way as a gEdge or gWalk is, either by using indices or data.table style expressions on gWalk metadata. Since every gWalk represents a sequence of double-stranded genomic intervals in a given orientation (+ vs -) it is biologically equivalent to its reverse complement (i.e. the reverse sequence of intervals in the opposite orientation). To reverse complement a gWalk we just index it using a negative integer (this is different from the standard interpretation of R vector indexing, where negative integers means removal of the provided indices).

## the paths method is vectorized, therefore we can provide a vector of sources and sinks
## as well as gNode arguments for either

## all low copy nodes on chromosome 1
n1 = gg.jabba$nodes[seqnames == 2 & cn<=2]

## all low copy nodes on chromosome 21
n2 = gg.jabba$nodes[seqnames == 21 & cn<=2]

## paths between low copy nodes on chromosome 11 and 17
p4 = gg.jabba$paths(n1, n2)

## paths is vectorized so we can subset using integer indices
p4[1:2]
##    walk.id name length      wid circular source  sink    dist
## 1:       1    1     11 19878970    FALSE     61 -1163 4492992
## 2:       2    2      9  6643813    FALSE     61 -1165 4709946
##                                                                                                                                                              gr
## 1:        2:202339385-203161868+ -> 2:203161869-205496567+ -> 3:177097274-177097491+ -> ... -> 21:27229540-27337644+ -> 21:14563496-14779255- -> 21:1-14563495-
## 2: 2:202339385-203161868+ -> 2:203161869-205496567+ -> 3:177097274-177097491+ -> ... -> 21:16266971-16500987- -> 21:15890640-16266970- -> 21:14779256-15890639-
## reverse complement gWalks using negative indices
## note the signs of the intervals in the $gr metadata string
p4[-1]
##    walk.id name length      wid circular source  sink    dist
## 1:       1    1     11 19878970    FALSE     61 -1163 4492992
##                                                                                                                                                       gr
## 1: 21:1-14563495+ -> 21:14563496-14779255+ -> 21:27229540-27337644- -> ... -> 3:177097274-177097491- -> 2:203161869-205496567- -> 2:202339385-203161868-
## can subset gWalks using metadata
## e.g. we canlook for longer walks, eg those shorter than 10MB
p5 = p4[wid<10e6]

## we can mark the nodes and edges of gWalk just as we would for a gNode or gEDge
## this marks the original graph
p5$mark(col = 'purple')

## plot
plot(c(gg.jabba$gt, p5$gt), p5$footprint+1e6)

Walk decompositions of (sub)graphs (TBC)

Genome graphs represent a summary of the all the possible linear or circular alleles that are possible given our model of a rearranged genome. We can generate such walks for a gGraph using the $walks method, though for a large and complicated graph the number of such possible walks can become untenable to compute due to combinatoric issues. However, we can do this reliably for subgraphs, e.g. those defined through the analysis of clusters and commmunities (see previous section).

## this expression subset our graph to the nodes and edges
## that contribute to edge cluster 41 (see previous section on clusters
## and communities)
gg.sub = gg.jabba[, ecluster == 41]

## walk decompositiion of this small subgraph
## generates 28 possible linear alleles
walks = gg.sub$walks()

## we can choose the longest walk (most nodes traversed)
walks = walks[which.max(length)]

## and plot with the walk track up top (with nodes already marked purple from before)
plot(c(gg.jabba$gt, walks$gtrack(name = "Longest walk")), walks$footprint+1e4)

Creating gWalks de novo

Underlying eachgWalk object is a list of signed node ids and a list of signed edge ids in a gGraph.
These can be retrieved using the $snode.id and $sedge.id active binding. Each list item represents an ordered set of nodes and edges comprising the walk. A gWalk can be instantiated directly from signed node id or signed edge id lists and a gGraph input via the gW function. The provided node or edge sequence will have to exist in the provided gGraph, otherwise gW will error out or (if drop = TRUE is set) will ignore invalid inputs and return a gWalk object whose length is smaller than the provided lists.

## retrieve signed node ids associated with a gWalk
nid = p5$snode.id

## retrieve signed edge ids associated with a gWalk
eid = p5$sedge.id

## you can instantiate a gWalk from signed node ids
gW(snode.id = nid, graph = p5$graph)
##    walk.id name length     wid circular
## 1:       1    1      9 6643813    FALSE
##                                                                                                                                                              gr
## 1: 2:202339385-203161868+ -> 2:203161869-205496567+ -> 3:177097274-177097491+ -> ... -> 21:16266971-16500987- -> 21:15890640-16266970- -> 21:14779256-15890639-
## or from signed edge ids
gW(sedge.id = eid, graph = p5$graph)
##    walk.id name length     wid circular
## 1:       1    1      9 6643813    FALSE
##                                                                                                                                                              gr
## 1: 2:202339385-203161868+ -> 2:203161869-205496567+ -> 3:177097274-177097491+ -> ... -> 21:16266971-16500987- -> 21:15890640-16266970- -> 21:14779256-15890639-
## not every node id or edge id sequence will work
## for example the reverse node sequence won't (necessarily) be in the graph
revnid = list(rev(nid[[1]]))

## this will error out
gW(snode.id = revnid, graph = p5$graph)
## Error in private$gWalkFromNodes(snode.id = snode.id, graph = graph, circular = circular, : One or more provided walks refers to a non-existent edge, please check your input or re-instantiate with drop = TRUE to ignore this error and remove those walks
## however the reverse complement (reverse and multiply by -1) of a legal
## sequence will always work
rcnid = list(-rev(nid[[1]]))
gW(snode.id = rcnid, graph = p5$graph)
##    walk.id name length     wid circular
## 1:       1    1      9 6643813    FALSE
##                                                                                                                                                              gr
## 1: 21:14779256-15890639+ -> 21:15890640-16266970+ -> 21:16266971-16500987+ -> ... -> 3:177097274-177097491- -> 2:203161869-205496567- -> 2:202339385-203161868-
## if we use drop = TRUE on a list of node ids we won't error out 
## if some of the walks are illegal, just return a gWalk whose length is shorter
## than the input
nid2 = c(nid, revnid, rcnid)

## the result here is length 2, though the input is length 3
## this is because revnid is "ignored"
p6 = gW(snode.id = nid2, graph = p5$graph, drop = TRUE)

Alternatively, we can use the gW function to instanatiate a gWalk from a GRangesList. This can be done on top of a provided gGraph, however the connectivity of the gGraph must support the input GRangesList. As above, this requires that all reference and non-reference edges implied by the provided GRangesList must exist in the graph.
However this does not require the intervals in the input GRangesList exactly match those in the graph.
For example the GRangesList can contain a hyper-segmented or hypo-segmented set of nodes relative to the gGraph, i.e. one that explicitly instantiates the implicit reference edges that are internal to an interval, or collapses one or more reference adjacent nodes into a larger node.

## we can instantiate from GRangesList
## to demo we extract the grl from the walk above
grl = p6$grl

## this command will thread these provided grl onto the existing graph
p7 = gW(grl = grl, graph = p5$graph)

## reset the colors in our graph
p7$mark(col = 'gray')

## let's say we chop up i.e. hypersegment the p5 graph
## the above instantiation will still work .. the output
## gWalk will however be "chopped up" to be compatible with the
## chopped up graph

## create 500kb tiles on p5's genome
tiles = gr.tile(seqinfo(p5), 5e5);

## disjoin a copy p5$graph by these tiles
ggd = p5$graph$copy$disjoin(gr = tiles)

## the disjoint graph has many more nodes and edges because we have added reference edges
## at every tile breakpoint
dim(p5$graph)
## [1] 1319 1806
dim(ggd)
## [1] 7502 7989
## however instantiation from the grl will still work
p7d = gW(grl = grl, graph = ggd)
p7d$graph$set(name = 'chopt')

## plotting will help visualize the differences
## you can see that the top version of the walk and the top version of
## the graph is more "chopped up"
plot(c(p5$graph$gt, ggd$gt, p7[1]$gt, p7d[1]$gt), p7d$footprint + 1e5)

A gGraph does not have to be provided during gWalk instantiation from GRangesList. However, every gWalk object must be associated with a gGraph. If a gGraph is not provided, a brand new gGraph is created either by treating each interval in each GRangesList as a separate node (if disjoin = FALSE) or by collapsing all input intervals (if disjoin = TRUE) into a single gGraph of non-overlapping nodes.

## let's create a new grl concatenating the original and "chopped" up grl
grl2 = unname(grl.bind(p7$grl, p7d$grl))

## by default, disjoin = FALSE, and thus will not collapse the graphs corresponding to the inputted walks
## each walk will create a separate (linear) subgraph
p8 = gW(grl = grl2)
p8$nodes$mark(col = 'gray') ## reset node color
p8$graph$set(name = 'non-dis')

## disjoin = TRUE will create a single disjoint gGraph that results from "collapsing" the walks represented
## by the input GLR
p9 = gW(grl = grl2, disjoin = TRUE)
p9$graph$set(name = 'disjoint')

## plotting these with the original graph to visualize
## you can see the "induced subgraph" for p8 and p9 only spans
## the footprint of these walks

plot(c(ggd$gt, p8$graph$gt, p8$gt, p9$graph$gt, p9$gt), p9$footprint + 1e5)

Disjoin and simplify gWalks

Just like a gGraph, a gWalk can be simplified and disjoined. These operations are first applied to the underlying gGraph and then the walks are re-threaded on this altered graph. Since simplify, disjoin, and reduce do not restrict the connectivity of the graph (i.e. every path that exists in the original graph will exist in the altered graph), all the walks on the original graph will be “legal” on the altered graph.

## we can use simplify to "unchop" p8
## note that this will not collapse the disjoint paths, only remove reference edges,
## the resulting graph will continue to have four separate components representing each
## "haplotype"
p8s = p8$copy$simplify()
p8s$graph$set(name = 'simp', border = 'black')

## similarly we can use disjoin on the non-disjoint walks instantiated above
## this will collapse the graph to a non-overlapping set of nodes 
p8d = p8$copy$disjoin()
p8d$graph$set(name = 'disj', border = 'black')

## visualizing the results of the graphs
gt = c(p8$graph$gt, p8$gt, p8s$graph$gt, p8s$gt, p8d$graph$gt, p8d$gt)
gt$name = paste(gt$name, c('gG', 'gW'))
plot(gt, p8$footprint + 1e5)

Analyzing gWalks

Each gWalk refers to a string of gNode and gEdge objects, each which contain their own metadata. When analyzing a collection of walks (e.g. those representing rearranged haplotypes connecting regulatory elements and genes), we may want to annotate the walks on the basis of the features of the nodes and edges that contribute to them. This is enabled through the gWalk method $eval, which takes a scalar-valued expression on node or edge metadata (default is to test both) which will return a vector whose length is the number of walks where that expression has been evaluated once per walk. You can think of this as a fast lapply. It useful for doing “feature-engineering” of walks to identify those with interesting or biologically relevant properties.

## revisiting walks traversing ecluster 41
gg.sub = gg.jabba[, ecluster == 41]

## walk decompositiion of this small subgraph
## generates 28 possible linear alleles
walks = gg.sub$walks()

## now we can use eval to annotate walks with how many ALT junctions they contain
## ALT junctions
## the expression evaluates the edge metadata field type and returns a scalar result,
## one for each walk
numalt = walks$eval(sum(type == 'ALT'))

## we can set a new column in the walks metadata to this result
walks$set(nalt = numalt)

## we use eval to identify the number of short intervals contained in this walk
## width is a node metadata 
walks$set(nshort = walks$eval(sum(width<1e4)))

## by default eval tries to evalute the expression on nodes and then on edges
## if nodes and edges share some metadata field then we may want to specify
## exactly which data type we want eval to run on

## first let's use $mark to add a metadata field "type" to the nodes of this walk (edges
## by default already has a metadata field "type")
walks$nodes$mark(type = 'bla')

## now if we rerun the above expression for numalt, it will give us a new result
## this is because the expression is successfully evaluated on the nodes metadata field
## "type"
identical(walks$eval(sum(type == 'ALT')), numalt)
## [1] TRUE
## if we specify edge= argument to $eval then we will get the old result
## i.e. forcing evaluation on the edge metadata
identical(walks$eval(edge = sum(type == 'ALT')), numalt)
## [1] FALSE
## and if we use force nodes evaluation with node=, we will again get a non-identical result
identical(walks$eval(node = sum(type == 'ALT')), numalt)
## [1] TRUE

Applications

OK, now we’ve gotten through all the basics of gGnome, including a survey of the key functionality of gGraph, gWalk, gEdge, gNode, and Junction. Now we’re ready to do some analysis with gGnome. We present two use cases: the first involves searching for protein-coding gene fusions, and the second involves identifying regulatory element-gene proximities induced through genomic rearrangement. Each involve the analysis of a gGraph object and and return a gWalk object for the analyst to inspect, plot, annotate, and/or manipulate.

Protein fusions

A very important reason to be interested in (cancer) genomic rearrangements is that they may bringing together two (or more) coding sequence (CDS) regions into a neomorphic protein fusion. Such fusions may result from simple (i.e. single ALT junction) alleles but may employ more than one junction in cis. We can derive gene fusions as a path search on a gGraph using the fusions() function.

Each output gWalk represents the genomic coordinates of a possible protein-coding gene fusion. The metadata field $gene.pc is a character vector, which describes each protein fusion as a string of protein coordinates delimited by ‘;’, where protein interval is representing using the character notation for a GRanges (seqnames:start-pos). Since fusions() does careful book-keeping of protein frame, certain nodes in the walk may represent “out-of-frame” regions. These are represented by brackets around the gene name, and a fs suffix (e.g. [BRAF]fs). The $gene.pc string can be converted to a bona-fide protein coordinate GRanges via the gUtils::parse.grl function.

Some fusions may duplicate or delete genetic material. The protein coordinates of any deleted or duplicated material is encapsulated in $amp.pc and $del.pc metadata fields of the output. Only a subset of fusions outputted by gGnome are completely in-frame (denoted by metadata field $in.frame == TRUE). Others may start and end at in-frame node, but have one or more out-of-frame regions in the middle - we label these using the metadata field $frame.rescue == TRUE.

Finally, though we usually identify protein fusions by their gene, fusion call is actually made on the coordinates of a specific GENCODE transcript. The “transcript analogue” of the $gene.pc metadata field is tx.cc, which represents the fusion as a string of CDS coordinates, with the same GRanges style character string that can be converted to a cds GRangesList via grl.string. The same naming convention is used to denote in-frame and out-of-frame segments of the neomorphic transcript. This field can be used to reconcile fusion predictions between RNA-seq and WGS.

The output of fusions() also contains out-of-frame fusions (i.e. that have both $in.frame == FALSE and $frame.rescue == FALSE). We include these because searching for in-frame fusions is not perfectly sensitive for complex multi-junction gene fusions. When a fusion gWalk includes a piece of intergenic sequence, the (fast, greedy) graph search algorithm underlying fusions() may not be able to optimize for “frameness”. We thus include these “out-of-frame” fusions to allow users to look harder for more optimal paths that might be in-frame and also to allow for the (perhaps exotic) possibility that out-of-frame fusions may undergo RNA editing or alternate splicing to yield an alternate fusion product. Finally, some users may have interest in bona-fide truncating fusions, though many of these can be discovered through a simpler analysis (e.g. somatic copy number analysis through the search for recurrent deletions).

Though the fusions() function may take many minutes to run in “unbiased mode” on a highly rearranged genome like HCC1954, we can also use it in a targeted fashion using the genes= argument to quickly query for specific gene fusions. The examples below employ this usage for speed, though most users will want to do this analysis transcriptome-wide (i.e. without setting the genes= argument). Use of mc.cores will also speed up computation considerably for those with multicore machines.

## we need a GENCODE (style) object either as a GRanges (cached as an RDS on mskilab.com)
## or directly from GENCODE (https://www.gencodegenes.org/)
gff = readRDS(gzcon(url('http://mskilab.com/gGnome/hg19/gencode.v19.annotation.gtf.gr.rds')))

## we are looking for any fusions connecting the genes CNOT6, ASAP1, and EXT1 using the
## genes= argument
## (we know there are complex fusions here because we've run a previous genome wide analysis,
## i.e. without setting the "genes =" argument, which discovered complex in.frame fusions in these genes)
fus = fusions(gg.jabba, gff, genes = c('CNOT6', 'ASAP1', 'EXT1'))
length(fus)
## [1] 269
## fusions will output many "near duplicates" which just represent various combinations
## of near equivalent transcripts, we can filter these down using gWalk operations
ufus = fus[in.frame == TRUE][!duplicated(genes)]

## there are 5 unique gene in-frame gene combinations, we plot the first
## connecting ASAP1 to CNOT6 with a chunk of intergenic genome in between

## ufus[1] connects the first 20 amino acids of ASAP1 to the downstream 400+
## amino acids of EXT1
ufus[1]$dt$gene.pc
## [1] "ASAP1:1-19;EXT1:321-746"
## this walk has 6 aberrant junctions, as shown by $numab metadata
ufus[1]$dt$numab
## [1] 6
## indeed that is verified by this expression
length(ufus[1]$edges[type == 'ALT'])
## [1] 6
## here we plot the walk on top of the JaBbA-derived  gGraph, which you will notice
## has been "chopped up" to include features of relevant genes. 
plot(c(gencode, ufus$graph$gt, ufus[1]$gt), ufus[1]$footprint+1e4)

We can also find frame-rescues that begin at ASAP1, then incorporate a frameshifted chunk of the gene NSD1, and end up back at ASAP1 in frame. Such variants are a bit strange to interpret, since the frameshifted chunk may likely contain a stop codon (currently not annotated), and may require further investigation to determine whether they produce functioning protein products.

ufus = fus[frame.rescue == TRUE]

## In this fusion model, a frame-shifted chunk of NSD1 spans 35 amino acids 
## and has been essentially inserted into the middle of an unrearranged
## ASAP1 transcript. 
ufus[1]$dt$gene.pc
## [1] "ASAP1:1-1129;[FAM49B]fs:66-100;NSD1:1435-1459"
## there are 4 unique gene in-frame gene combinations, we plot the first
## connecting ASAP1 to CNOT6 with a chunk of intergenic genome in between
## here we plot the walk on top of the JaBbA-derived  gGraph, which you will notice
## has been "chopped up" to include features of relevant genes. 
plot(c(gencode, ufus$graph$gt, ufus[1]$gt), ufus[1]$footprint+1e4)

Proximity analysis

Rearrangements can bring regulatory elements in (1D and 3D) proximity with each other and with genes, perturbing gene expression and chromatin states in cis. To detect / annotate such noncoding consequences of complex rearrangeents, we have developed proximity() analysis, which takes two sets of GRanges (query and subject) and identifies pairs of these that have been brought near each other through rearrangement. For each such “proximity” it emits a gWalk representing the path connecting these genetic elements. We illustrate this analysis by identifying proximities between super-enhancers (also called locus control regions) and certain genes in HCC1954.

The gWalk object outputted by the proximity() function will have several standard metadata fields. These include: $reldist which is the fractional distance of the distance in the ALT graph (i.e. the $altdist) relative to the reference distance (i.e. $refdist). The remaining metadata features are inherited from the GRanges metadata from the pairs of subject and query arguments provided to proximity(). The returned gWalk object is sorted by its $altdist.

## load Hnisz et al 2013 superenhancers mapped to hg19
se = readRDS(gzcon(url('http://mskilab.com/gGnome/hg19/Hnisz2013.se.rds')))

## many of these are redundant / overlapping so we will reduce them with some padding
## to reduce computation
ser = reduce(se)

## read gff (if did not do it above)
## gff = readRDS(gzcon(url('http://mskilab.com/gGnome/hg19/gencode.v19.annotation.gtf.gr.rds')))

genes = gff %Q% (type == 'gene' & gene_name %in% c('TERT', 'BRD9'))

## useful (optional) params to tune for performance include "chunksize" and "mc.cores"
## which will chunk up and parallelize the path search, in this case 1000
## intervals at a time across 5 cores, can also monitor progress with verbose = TRUE
px = proximity(gg.jabba, ser, genes[, 'gene_name'], chunksize = 2000, mc.cores = 1)

## peek at the first proximity, we can see the reldist, altdist, refdist
## and additional metadata features inherited from the genes object
px[1]
##    walk.id name length   wid circular source sink  dist  qid sid      reldist
## 1:       1    1      5 60211    FALSE   1310  910 10704 2990   1 5.995456e-05
##    altdist   refdist gene_name
## 1:   10704 178535209      BRD9
##                                                                                                                    gr
## 1: 5:179442153-179449126+ -> 8:107834047-107836475- -> 8:107833364-107834046- -> 5:842815-850405+ -> 5:850406-892939+
## make a gTrack for the super-enhancers, coloring by tissue
gt.se = gTrack(se, gr.colorfield = 'tissue', name = 'SupEnh')

## plot the first super-enhancer connecting to BRD9
px[1]$mark(col = 'purple')

plot(c(gencode, gt.se, px$graph$gt, px[1]$gt), px[1]$footprint+1e5)

In this case, the superenhancer is on the same chromosome as BRD9 (chromosome 5). We may be specifically interested in proximities involving superenhancers that were previously very distant to their target gene (e.g. Inf base pairs away on the reference). In addition we may want to identify those that use many ALT junctions. We can use the gWalk subsetting and analysis features to quickly find these.

## use $eval to count ALT junctions for each walk
px$set(numalt = px$eval(edge = sum(type == 'ALT')))

## let's look for a superenhancer connecting to TERT
this.px = px[numalt>2 & refdist == Inf & gene_name == 'TERT']

## check out the first proximity
this.px[1]
##    walk.id name length   wid circular source  sink  dist  qid sid reldist
## 1:       1    6      7 98778    FALSE   2084 -1289 33718 4294   2       0
##    altdist refdist gene_name numalt
## 1:   33718     Inf      TERT      4
##                                                                                                                                                     gr
## 1: 8:129178880-129202017+ -> 5:179508545-179509558+ -> 5:176684274-176685261- -> ... -> 5:1297134-1326340- -> 5:1295185-1297133- -> 5:1253262-1295184-
## mark it up
this.px[1]$mark(col = 'purple')

plot(c(gencode, gt.se, this.px$graph$gt, this.px[1]$gt), this.px[1]$footprint+1e5)

Classifying SV events

There are many mutagenesis pathways producing various patterns of SVs in the cancer genomes. By identifying the specific motif that are recurrent in the cancer genoem graphs, we can classify subgraphs that satisfy the set of criteria corresponding to a particular class. These classes range from simple like deletions and tandem duplications, to complex like chromothripsis and breakage-fusion-bridge cycles. As of Hadi et al, Cell 2020, we support the identification of 5 simple event and 8 complex:

For ease of use, all the callers are wrapped into a single function events:

## load the graph for HCC1954
hcc1954 = gG(jabba = system.file("extdata", "hcc1954", "jabba.rds", package = "gGnome"))

## Identify all supported SV event types
hcc1954 = events(hcc1954, verbose = FALSE)

## Summary of identified events
hcc1954$meta$event[, table(type)]
## type
##         bfb chromoplexy         del         dup         inv      invdup 
##           1           1          38          23           1           1 
##       rigma     tyfonas 
##           1           1
## plot the locus of a BFB event
plot(hcc1954$gt, hcc1954[bfb>0]$footprint + 1e6); title("BFB in HCC1954")

## for the following examples load the CCLE models
ccle = dir(system.file("extdata", package = "gGnome"), ".+jabba.simple.rds", full = TRUE)
names(ccle) = gsub(".*gGnome/.*extdata/(.*)\\.jabba\\.simple\\.rds$", "\\1", ccle)

Below we describe each function in detail.

Deletions and Rigma

When simple deletions accumulate in the same locus more than expected from a null distribution (explained below), their cluster is labeled rigma.

hcc1954 = del(hcc1954)
plot(hcc1954$gt, hcc1954$meta$rigma$footprint %>% GRanges %>% streduce(5e5)); title("Rigma in HCC1954")

Tandem duplications and pyrgo

Similar to deletions and rigma, when a particular genomic region attain more than expected tandem duplications, the cluster of duplications is classified as a pyrgo.

mfe280 = gG(jabba = ccle["MFE_280"])
mfe280 = dup(mfe280)
plot(mfe280$gt, mfe280$meta$pyrgo$footprint %>% head(3) %>% GRanges %>% streduce(5e5)); title("Pyrgos in MFE-280")

Chromothripsis

Chromothripsis has been a stereotypical example of complex SV event, since its inception in Stephens et al., Cell 2011, in which it is proposed to originate from a single catastrophic shattering of a relatively focused chromosome region (e.g. arm), then randomly ligated by DNA repair pathway, resulting in high numbers of junctions clustered, and oscillating copy numbers due to the random loss/retention of DNA shards.

h2081 = gG(jabba = ccle["NCI_H2081"])
h2081 = chromothripsis(h2081)
plot(h2081$gt, streduce(h2081$gr %Q% which(chromothripsis>0), 1e6)); title("Chromothripsis in NCI-H2081")

Complex amplicons (bfb, dm, tyfonas)

Long has been known that genomic amplifications can drive carcinogenesis, yet the characterization of the exact structures beyond the increase in dosage are limited. Two mechanisms are well known, double minute (DM, also known as extrachromosomal circular DNA, eccDNA) and breakage-fusion-bridge cycles (BFBC). Genome graphs provides an oppurtunity to systematically differentiate and discover the spectrum of amplicon structures. In Hadi et al., Cell 2020, We describe a genome graph-derived feature space for these subgraphs that contain amplified junctions (JCN>7) and stably clustered into at least 3 classes, two of which correspond to DM and BFBC, but a third class with high numbers of junctions and high numbers of fold-back inversions, which is named tyfonas based on its wide reaching, extensively rearranged appearance.

The function amp identifies

h526 = gG(jabba = ccle["NCI_H526"])
h526 = amp(h526)
plot(h526$gt, streduce(h526$gr %Q% which(tyfonas>0), 1e6))

hara = gG(jabba = ccle["HARA"])
hara = amp(hara)
plot(hara$gt, streduce(hara$gr %Q% which(bfb>0), 1e6))

hcc827 = gG(jabba = ccle["HCC827"])
hcc827 = amp(hcc827)
plot(hcc827$gt, streduce(hcc827$gr %Q% which(dm>0), 1e6))

Chromoplexy and TICs

A series of long-range junctions (jumping over >=10Mbp on reference genome) can form a chain-like topology, when two breakends from two adjacent junctions in the chain are close enough to each other (by default <= 10 kbp).

h2228 = gG(jabba = ccle["NCI_H2228"])
h2228 = chromoplexy(h2228)
plot(h2228$gt, h2228$edges[which(chromoplexy>0)]$shadow %>% streduce(5e6)); title("Chromoplexy in NCI-H2228")

jhos2 = gG(jabba = ccle["JHOS_2"])
jhos2 = tic(jhos2)
plot(jhos2$gt, jhos2$gr %Q% which(tic %in% head(sort(unique(tic)), 3)) %>% streduce(5e4)); title("TICs in JHOS-2")

Simple

We also support identification of 3 other simple event types, namely inversion, inverted duplication, and ranslocation. Here are some examples:

hcc1954 = simple(hcc1954)
plot(hcc1954$gt, hcc1954$edges[grepl("^INV[0-9]+$", simple)]$shadow %>% streduce(1e5)); title("Inversion in HCC1954")

plot(hcc1954$gt, hcc1954$edges[grepl("^INVDUP[0-9]+$", simple)]$shadow %>% streduce(1e5)); title("Inverted duplication in HCC1954")

h526 = simple(h526)
plot(h526$gt, h526$edges[grepl("TRA", simple)]$shadow %>% streduce(1e5))

Make your own caller

Besides running our current implemented event callers, h2228nome serves as a generalized framework for users to write their own event callers. A general template procedure is: - Filter the nodes/edges of interest by their marginal features, e.g. copy number, width, junction span (reference distance between the breakends), orientations - Cluster by the node or edge based on graph topologies to get candidate subgraphs - Or walk the graph and look for paths or cycles as candidates - Summarize a set of features of the candidate subgraphs or walks

Balance

balance is a function for inferring copy number of nodes and edges of genome graphs by integrating read depth together with graph topology and applying the junction balance constraint. balance is the core function used in our package JaBbA. To learn more about junction balance analysis, please refer to the Hadi et al. publication.

balance takes as input a gGraph with segment copy number estimates stored in the node metadata and solves a Mixed Integer Program (MIP) to infer integer node and edge copy numbers. By default, a mixed-integer linear program (MILP) optimization is performed. JaBbA used a mixed integer quadratic program (MIQP) at the time of publication, but has since been modified to use linear programming due to improved speed and performance. To run balance, users will need to install CPLEX.

What’s the difference between JaBbA and balance?

JaBbA is basically a wrapper for the balance function and accepts as input coverage depth, segmentation and junction calls. JaBbA performs various steps prior to generating and balancing a gGraph (for more details, refer to the publication). JaBbA is intended for estimating segment and junction copy numbers for a whole genome.

The balance function could be used for fine-tuning a gGraph or for estimating values for a subgraph, as well as for analysis of more general cases of graphs such as long reads, segmentation data without coverage depth, or any case in which preliminary copy number estimates are available and the junction balance analysis is applied to find the closest balanced graph with integer cn values.

Notice that the usage of balance outside of JaBbA is an experimental functionality, please feel free to reach out to us if you are interested in exploring this functionality and need help.

balancing an example subgraph

Here we demonstrate how to run balance on a small region of the CCLE HCC1954 genome. balance can also be applied genome-wide to fit whole genome gGraphs.

First, we will read the example subgraph and coverage file.

## read subgraph file
sg = readRDS(system.file("extdata", "hcc1954.example.sg.rds", package = "gGnome"))

## read coverage file
sg.cov = readRDS(system.file("extdata", "hcc1954.rigma.sg.cov.rds", package = "gGnome"))

## create a gTrack of the coverage 
sg.cov.gt = gTrack(sg.cov, y.field = 'ratio', max.ranges = 1e4, lwd.border = 0.3, circles = TRUE, name = 'cov')

## define the plotting window
rg = '12:97000000-102000000'
plot(c(sg$gt, sg.cov.gt), rg)

Note that in the plotted image, the “y” location of the subgraph graph nodes is not reflective of copy number.

adding node copy number estimates via binstats

binstats is a function for using read coverage from your coverage file and sample purity/ploidy to produce node copy number estimates. It returns a new gGraph with the fields cn and weight added to the node metadata. cn represents a node copy number estimate. weight is calculated by dividing the number of coverage bins within a node by the coverage variance and reflects our certainty about the node CN estimate (e.g. we are more confident about the copy number estimate of nodes with higher weight).

A few important parameters in binstats: - bins is a GRanges with coverage data in the metadata column specified in field. In this case, the coverage information is stored in a column named “ratio” - field is the name of the column containing coverage data. - purity is the sample purity. It should be between 0 and 1, specifying the proportion of tumor cells. - ploidy is the sample ploidy (width-weighted average copy number).

binstats.sg = binstats(sg,
                       bins = sg.cov,
                       field = "ratio",
                       purity = 1,
                       ploidy = 4.58)

binstats.sg$nodes
## GRanges object with 4 ranges and 9 metadata columns:
##     seqnames            ranges strand | udnode.id loose.left loose.right
##        <Rle>         <IRanges>  <Rle> | <integer>  <logical>   <logical>
##   1       12 89781003-99041012      + |         5       TRUE       FALSE
##   2       12 99041013-99285988      + |         6      FALSE       FALSE
##   3       12 99285989-99291050      + |         7      FALSE       FALSE
##   4       12 99291051-99374971      + |         8      FALSE       FALSE
##           nid   node.id  snode.id     index        cn    weight
##     <numeric> <integer> <numeric> <integer> <numeric> <numeric>
##   1         4         1         1         1   4.67980 7674.0005
##   2         4         2         2         2   3.31639  501.2436
##   3         4         3         3         3   1.92122   20.9395
##   4         4         4         4         4   3.41781  169.2956
##   -------
##   seqinfo: 23 sequences from an unspecified genome

The cn and weight fields of binstats.sg are now populated:

balancing the graph

Now that the subgraph has CN estimates, we are ready to run balance.

There are a few parameters in balance to be aware of: - lp controls whether MILP (TRUE) or MIQP (FALSE) is used and is TRUE by default. We recommend this default because it improves runtime and convergence. - lambda controls the loose end slack penalty. Higher values of lambda make it less likely for balance to incorporate a copy number change without an associated junction. Suggested values are between 10 and 1000. - epgap controls the MIP optimization relative gap tolerance. Optimization will stop when this number is reached. We suggest values of 1e-4 and below. - tilim is the alotted time for MIP optimization in seconds. Optimization will stop when this number is reached. - verbose controls how much output is printed. Setting to 0 will print nothing, setting to 1 will print helpful debugging messages, and setting to 2 will also print the CPLEX output.

lp.bal.sg = balance(binstats.sg,
        lp = TRUE,
        lambda = 100,
        epgap = 1e-6,
        tilim = 60,
        verbose = 2)
## Rcplex: num variables=296 num constraints=284
plot(c(lp.bal.sg$gt, sg.cov.gt), rg)

Note that the “y” location of the subgraph nodes now represent their integer copy numbers as inferred by balance:

Interactive visualization with gGnome.js

We built gGnome.js, an interactive web application that allows browsing the genome graph at arbitrary windows along the reference genome, rich graph annotation options, cohort metadata filtering and more. In addition, the interface allows you to visualize coverage and RPKM data alongside your graphs/walks. We have made public the 2778 genome graphs in our paper here so you can start exploring the interface right away.

The gGnome package includes some methods for preparing the data to be visualized with gGnome.js. We will describe these methods below as a practical guide of how to get your data visualized.

The best way to get familiar with the features and configuration of gGnome.js is to read the gGnome.js documentation.

In order to visualize your own gGraph and gWalk in the browser, first clone the repository from github and follow the instructions in its README. Once you have gGnome.js installed on your machine, you can add your files to the gGnome.js folder in order to visualize your own data.

Generating gGnome.js instance

The easiest way to visualize your own data is to use the gGnome.js() function in the gGnome package. This function accepts a table with paths to RDS files containing the genome graphs (gGraph objects) and coverage files. Notice that the coverage files can be provided either as an RDS containing a GRanges object or as any file that can be parsed into a GRanges (e.g. txt, tsv, bed, bw, or wig).

As an example, we can use this function to generate a gGnome.js project for the hcc1954 and h526 graphs from above. First we need to save these graphs into RDS files:

saveRDS(hcc1954, 'hcc1954.rds')

saveRDS(h526, 'h526.rds')

Let’s load the path to a sample coverage file for HCC1954 (this file only includes coverage for chr8 just for demonstration).

hcc1954.cov.file = system.file('extdata/hcc1954', 'hcc1954.chr8.cov.rds', package="gGnome")

We then generate the table required by gGnome.js:

jsdt = data.table(sample = c('hcc1954', 'h526'), graph = c('hcc1954.rds', 'h526.rds'), coverage = c(hcc1954.cov.file, ''))

We are now ready to generate the gGnome.js project. We would need to provide a path to the location where we want the project to be created. Notice that this must be a new directory. If you wish to add more files to an existing directory you must use append = TRUE.

gGnome.js(data = jsdt,
                      outdir = './demo_gGnome.js',
                      cov.field = 'reads')

That’s it, you are now ready to launch the gGnome.js interface. You can do so by running the BASH script in your gGnome.js project dir (in our example: ./demo_gGnome.js/start.sh). This should open the following adress in your browser: http://localhost:8080/index.html.

And this is what you should see:

Here is an outline of the steps done by gGnome.js():

  1. Clone the github repository of gGnome.js (unless append is set to TRUE).
  2. Generate a JSON format version of the genome graph using the json method of gGraph and store it in the json subdirectory of the gGnome.js directory.
  3. Save a CSV formatted version of the coverage file of each sample to the scatterPlot subdirectory of the gGnome.js directory.
  4. Generate the sample description file: datafiles.csv.

How are files generated

To generate the genome graph JSON files and the coverage CSV files gGnome.js() uses gGnome functions/methods. You can use these functions directly to add files to your gGnome.js folder.

JSON files are generated using the json methods of gGraphs:

gg$json(filename = "gGnome.js/json/[sample_name].json")

gWalks also have a json method that generates a JSON file that could be visualized by gGnome.js. We plan to soon add an option to gGnome.js() to automatically output certain walks along with the genome graph.

Coverage CSV files are generated using the cov2csv function. For example:

cov2csv(hcc1954.cov.file, field = 'reads', output_file = 'demo_gGnome.js/scatterPlot/hcc1954.csv')