Modern biological experiments are increasingly producing interesting binary matrices. These may represent the presence or absence of specific gene mutations, copy number variants, microRNAs, or other molecular or clinical phenomena. We recently developed a tool, CytoGPS [^Abrams and colleagues], that converts conventional karyotypes from the standard text-based notation (the International Standard for Human Cytogenetic/Cytogenomic Nomenclature; ISCN) into a binary vector with three bits (loss, gain, or fusion) per cytoband, which we call the “LGF model”.
The CytoGPS tool is available at the web site http://cytogps.org, where the LGF results of processing karyotype data are returned in JSON format. To complement the web site, we have developed RCytoGPS, an R package to extract, format, and visualize genetic data at the resolution of cytobands. RCytoGPS can parse any JSON file (or set of files) produced by CytoGPS.org.
We have included a pair of JSON files produced at CytoGPS.org as examples in the package. These are found in the following directory:
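A minimal sketch of how that directory can be located and listed, assuming the package is installed. The `Examples/JSONfiles` subfolder is taken from the path printed later in this document; the variable name `sf` is only illustrative.

```r
# Locate the example folder shipped with the package.
# system.file() returns "" if the package or folder cannot be found.
sf <- system.file("Examples/JSONfiles", package = "RCytoGPS")
# List the files in that folder.
dir(sf)
```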
## [1] "CytoGPS_Result1.json" "CytoGPS_Result2.json" "input1.txt"
## [4] "input2.txt"
The two text files contain the inputs that were uploaded to the web site; the two JSON files contain the outputs. You can specify the files and the folder that you want to read. The simplest application is to omit the files argument and read all files in the specified folder (which defaults to the current working directory).
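The read step can be sketched as follows. This assumes that the package's reader function is named `readLGF` and takes `files` and `folder` arguments, as the description above suggests; please check the package manual before relying on it. The names `sf` and `temp` are illustrative, and the guard simply skips the example when the package is not installed.

```r
if (requireNamespace("RCytoGPS", quietly = TRUE)) {
  sf <- system.file("Examples/JSONfiles", package = "RCytoGPS")
  # Omit 'files' to read every JSON file found in the folder (assumed behavior).
  temp <- RCytoGPS::readLGF(folder = sf)
  # Assumed form for reading only selected files:
  # temp <- RCytoGPS::readLGF(files = "CytoGPS_Result1.json", folder = sf)
  print(class(temp))
  print(names(temp))
}
```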
## Reading 2 file(s) from 'C:/Users/KRC/AppData/Local/Temp/RtmpMVgKH1/Rinst4e702e002639/RCytoGPS/Examples/JSONfiles'.
The return value is a list of five elements.
## [1] "list"
## [1] "source" "raw" "frequency" "size" "CL"
The source element documents which JSON file(s) were read.
## [1] "CytoGPS_Result1.json" "CytoGPS_Result2.json"
The size element lists the number of rows returned from each file; each row represents a distinct clone.
## CytoGPS_Result1 CytoGPS_Result2
## 6 4
The CL element is a data frame describing the chromosomal locations of each cytoband.
## Chromosome loc.start loc.end Band
## chr1 : 65 Min. : 0 Min. : 300000 p11.1 : 20
## chr2 : 64 1st Qu.: 29275000 1st Qu.: 33350000 q11.1 : 17
## chr3 : 62 Median : 62950000 Median : 66400000 q21.2 : 17
## chr6 : 50 Mean : 72922120 Mean : 76480034 q22.2 : 17
## chr4 : 47 3rd Qu.:106700000 3rd Qu.:110725000 q22.1 : 16
## chr5 : 47 Max. :243500000 Max. :248956422 p11.2 : 15
## (Other):533 (Other):766
## Stain
## gneg :417
## gpos50 :122
## gpos25 : 89
## gpos75 : 89
## gpos100: 81
## acen : 48
## (Other): 22
The raw element is itself a list, containing the binary LGF data for each JSON file processed. Each file produces a “Status” output along with the LGF data. The Status includes both the input karyotype (in ISCN format) and an indicator of whether CytoGPS could successfully process it. In this example, the first karyotype contained an error. As a result, the LGF component does not contain any rows derived from that karyotype. It does, however, contain three rows derived from the second karyotype, since the forward slashes separate the descriptions of three different clones that were detected in that sample.
## [1] "CytoGPS_Result1" "CytoGPS_Result2"
## [1] "Status" "LGF"
## Status
## RN01 Validation error
## RN02 Success
## RN03 Success
## Karyotype
## RN01 46,XY,-8,+12,der(14)
## RN02 48,XY,t(10;13)(q26;q14),+12,+19/47,XY,t(9;13)(p24;q13),-10,+12,+19/45,XY,-13
## RN03 47,XY,+12,dup(14)(q32q32)
## [1] 4 2750
## [1] "2.1.1" "2.2.1" "2.3.1" "3.1.1"
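To make the clone count concrete, here is a small base R sketch that splits the RN02 karyotype at its forward slashes. It only illustrates the ISCN slash convention described above; it is not how CytoGPS itself parses karyotypes.

```r
# Karyotype RN02, copied from the Status table above.
karyotype <- "48,XY,t(10;13)(q26;q14),+12,+19/47,XY,t(9;13)(p24;q13),-10,+12,+19/45,XY,-13"
# Each "/" separates the description of one clone.
clones <- strsplit(karyotype, "/", fixed = TRUE)[[1]]
length(clones)
clones
```

The three pieces correspond to the three LGF rows contributed by this sample; the fourth row comes from RN03, which describes a single clone.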
Finally, the frequency element contains summary data from each file read. These summaries consist of the frequencies of loss, gain, and fusion events. Each row of this data frame represents a cytoband. There are three columns from each JSON file, one each for loss, gain, and fusion.
## [1] "data.frame"
## [1] 868 6
## [1] "CytoGPS_Result1.Loss" "CytoGPS_Result1.Gain" "CytoGPS_Result1.Fusion"
## [4] "CytoGPS_Result2.Loss" "CytoGPS_Result2.Gain" "CytoGPS_Result2.Fusion"
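As a rough illustration of what such a per-cytoband summary means, the toy sketch below computes the fraction of clones showing a loss at each band. It assumes that "frequency" is the proportion of clones in which the event is present, which we have not verified against the package code; the matrix values and band labels are made up.

```r
# Toy binary matrix: 4 clones (rows) by 3 cytobands (columns), loss events only.
# All values and band labels are invented for illustration.
loss <- matrix(c(1, 0, 0,
                 1, 1, 0,
                 0, 0, 0,
                 1, 0, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("clone", 1:4), c("bandA", "bandB", "bandC")))
# Fraction of clones showing a loss at each cytoband.
freqLoss <- colMeans(loss)
freqLoss
```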
In order to be able to work with the cytoband-level frequency data, we must combine it with the cytoband location data. Here we assemble them into a single data frame.
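The combination step can be sketched with toy stand-ins for the two elements. The column names of the location table follow the summary shown earlier; the values, the `File1.*` column names, and the assumption that both tables list cytobands in the same row order are ours.

```r
# Toy stand-ins for the CL and frequency elements (illustrative values only).
CL <- data.frame(Chromosome = c("chr1", "chr1"),
                 loc.start = c(0, 2300000),
                 loc.end = c(2300000, 5300000),
                 Band = c("p36.33", "p36.32"),
                 Stain = c("gneg", "gpos25"))
frequency <- data.frame(File1.Loss = c(0.25, 0),
                        File1.Gain = c(0, 0.5),
                        File1.Fusion = c(0, 0))
# Assuming the rows are in the same cytoband order, bind the tables side by side.
cytoData <- data.frame(CL, frequency)
dim(cytoData)
```

With the list returned by the reader, the analogous call would presumably be `data.frame(temp$CL, temp$frequency)`, where `temp` is a hypothetical name for that list.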
Next, we transform the cytoData data frame into an S4 object using the function CytobandData. The resulting object will then be used to generate plots and will be available for further analyses.
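A sketch of this step, starting from the example files, might look like the following. It assumes the reader is `readLGF`, that the constructor is spelled `CytobandData`, and that it accepts the combined data frame as its only required argument; none of this is confirmed by the text above, so consult the manual pages. The object names `bandData` and `datacolumns` match those used in the plotting code later on.

```r
if (requireNamespace("RCytoGPS", quietly = TRUE)) {
  library(RCytoGPS)
  sf <- system.file("Examples/JSONfiles", package = "RCytoGPS")
  temp <- readLGF(folder = sf)                     # assumed reader function
  # Cytoband locations plus per-file event frequencies in one data frame.
  cytoData <- data.frame(temp$CL, temp$frequency)
  # Wrap the data frame in the S4 class used by the plotting methods.
  bandData <- CytobandData(cytoData)               # assumed constructor call
  print(class(bandData))
  # Names of the frequency columns, for use in the 'what' argument of the plots.
  datacolumns <- names(temp[["frequency"]])
}
```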
The first graph (using barplot) summarizes the frequency data from one data column along the genome. This provides a broad overview of the changes and can be used to visually contrast the locations of changes in different data sets. Here we use barplot twice, showing losses and gains from the first file.
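# Stack two plots vertically, saving the previous graphics settings.
opar <- par(mfrow=c(2,1))
barplot(bandData, what = "CytoGPS_Result1.Loss", col = "forestgreen")
barplot(bandData, what = "CytoGPS_Result1.Gain", col = "orange")
# Restore the previous graphics settings.
par(opar)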
The next graph allows you to simultaneously compare multiple cytogenetic events one chromosome at a time.
## [1] "CytoGPS_Result1.Loss" "CytoGPS_Result1.Gain" "CytoGPS_Result1.Fusion"
## [4] "CytoGPS_Result2.Loss" "CytoGPS_Result2.Gain" "CytoGPS_Result2.Fusion"
By adding the parameter horiz=TRUE, you can rotate this graph 90 degrees. For more details about the parameters of the image method, see the manual pages and the “gallery” vignette.
We can assemble all of the single-chromosome plots into a single “idiogram” graph that shows all chromosomes at once.
The purpose of this graph is to show the chromosomes alongside a barplot of the cytogenetic abnormalities, making it easier to observe and possibly identify patterns.
This graph allows the user to compare and contrast two or more cytogenetic events simultaneously. Here we show loss (orange), gain (green), and fusion (purple) events from the first JSON file.
image(bandData, what = datacolumns[1:3], chr = "all",
pal=c("orange", "forestgreen", "purple"), horiz=TRUE)
For examples of all the available plots, see the “gallery” vignette.
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RCytoGPS_1.2.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3
## [5] evaluate_0.15 highr_0.9 stringi_1.7.6 rlang_1.0.3
## [9] cli_3.3.0 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.14
## [13] rjson_0.2.21 tools_4.2.1 stringr_1.4.0 xfun_0.31
## [17] yaml_2.3.5 fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.2
## [21] knitr_1.39 sass_0.4.1