Rsamtools基础知识问答 4

在次世代测序结果中,我们经常需要对bam/sam文件进行操作。在命令行中,我们很习惯使用samtools来操作,比如bam至sam,读取文件头,读取前一百行之类的。但是在Rsamtools中,我们应该如何对bam文件进行操作呢?

对于普通的操作,很多都可以一句话完成,比如:bam sam互换, index, idxstats, sort等等。

library(Rsamtools)
## Loading required package: GenomeInfoDb
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, cbind, colnames, do.call, duplicated, eval,
##     evalq, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, mapply, match, mget, order, paste, pmax, pmax.int,
##     pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
#bam to sam
dest <- tempfile()
sam_file_name <- asSam(file = ex1_file, destination = dest)
#sam to bam
bam_file_name <- asBam(file = sam_file_name, destination = dest)
#delete the files just generated.
dir(dirname(dest))
## [1] "filec6055b4e822.bam"     "filec6055b4e822.bam.bai"
## [3] "filec6055b4e822.sam"
unlink(c(sam_file_name, bam_file_name, paste0(bam_file_name, ".bai")))
#sort the bam by position
srt_bam_file_1 <- sortBam(file = ex1_file, destination = paste0(dest, "1"))
#sort the bam by name
srt_bam_file_2 <- sortBam(file = ex1_file, destination = paste0(dest, "2"), byQname=TRUE)
#index the bam
indexBam(srt_bam_file_1)
##       /tmp/RtmpAy9HOH/filec6055b4e8221.bam 
## "/tmp/RtmpAy9HOH/filec6055b4e8221.bam.bai"
tryCatch(indexBam(srt_bam_file_2), error=function(.e) print(.e)) ## will output an error
## <simpleError in FUN(X[[i]], ...): failed to build index
##   file: /tmp/RtmpAy9HOH/filec6055b4e8222.bam>
#merge bams
merged_bam_file <- mergeBam(c(srt_bam_file_1, srt_bam_file_2), destination = paste0(dest, ".merged.bam"))
#idxstats
idxstatsBam(srt_bam_file_1)
##   seqnames seqlength mapped unmapped
## 1     seq1      1575   1482       19
## 2     seq2      1584   1789       17
#count
countBam(srt_bam_file_1)
##   space start end width                 file records nucleotides
## 1    NA    NA  NA    NA filec6055b4e8221.bam    3307      116551
countBam(merged_bam_file)
##   space start end width                       file records nucleotides
## 1    NA    NA  NA    NA filec6055b4e822.merged.bam    6614      233102
#read in bam file
res <- scanBam(ex1_file)[[1]] #always list of lists
names(res)
##  [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"  
##  [8] "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"
sapply(res, head)
## $qname
## [1] "B7_591:4:96:693:509"    "EAS54_65:7:152:368:113"
## [3] "EAS51_64:8:5:734:57"    "B7_591:1:289:587:906"  
## [5] "EAS56_59:8:38:671:758"  "EAS56_61:6:18:467:281" 
## 
## $flag
## [1]  73  73 137 137 137  73
## 
## $rname
## [1] seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $strand
## [1] + + + + + +
## Levels: + - *
## 
## $pos
## [1]  1  3  5  6  9 13
## 
## $qwidth
## [1] 36 35 35 36 35 35
## 
## $mapq
## [1] 99 99 99 63 99 99
## 
## $cigar
## [1] "36M" "35M" "35M" "36M" "35M" "35M"
## 
## $mrnm
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: seq1 seq2
## 
## $mpos
## [1] NA NA NA NA NA NA
## 
## $isize
## [1] NA NA NA NA NA NA
## 
## $seq
##   A DNAStringSet instance of length 6
##     width seq
## [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
## [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
## [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
## [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
## [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
## [6]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
## 
## $qual
##   A PhredQuality instance of length 6
##     width seq
## [1]    36 <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
## [2]    35 <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
## [3]    35 <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
## [4]    36 (-&----,----)-)-),'--)---',+-,),''*,
## [5]    35 <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
## [6]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666

因为bam文件通常都比较大,而R的编程又习惯于将整个文件一下都读入内存,这就对我们操作比较大的bam文件带来了困难。于是在Rsamtools中示例了一些有用的代码,以便将大的bam文件化整为零的操作。

#fileter bam
filter <- FilterRules(list(MinWidth = function(x) width(x$seq) > 35))
res <- scanBam(ex1_file, filter=filter)[[1]]
sapply(res, head)
## $qname
## [1] "B7_591:4:96:693:509"    "EAS54_65:7:152:368:113"
## [3] "EAS51_64:8:5:734:57"    "B7_591:1:289:587:906"  
## [5] "EAS56_59:8:38:671:758"  "EAS56_61:6:18:467:281" 
## 
## $flag
## [1]  73  73 137 137 137  73
## 
## $rname
## [1] seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $strand
## [1] + + + + + +
## Levels: + - *
## 
## $pos
## [1]  1  3  5  6  9 13
## 
## $qwidth
## [1] 36 35 35 36 35 35
## 
## $mapq
## [1] 99 99 99 63 99 99
## 
## $cigar
## [1] "36M" "35M" "35M" "36M" "35M" "35M"
## 
## $mrnm
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: seq1 seq2
## 
## $mpos
## [1] NA NA NA NA NA NA
## 
## $isize
## [1] NA NA NA NA NA NA
## 
## $seq
##   A DNAStringSet instance of length 6
##     width seq
## [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
## [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
## [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
## [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
## [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
## [6]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
## 
## $qual
##   A PhredQuality instance of length 6
##     width seq
## [1]    36 <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
## [2]    35 <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
## [3]    35 <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
## [4]    36 (-&----,----)-)-),'--)---',+-,),''*,
## [5]    35 <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
## [6]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666
#read in a given range
param <- ScanBamParam(what=scanBamWhat(), which=GRanges("seq1", IRanges(1, 1000)))
res <- scanBam(ex1_file, param=param)[[1]]
sapply(res, head)
## $qname
## [1] "B7_591:4:96:693:509"    "EAS54_65:7:152:368:113"
## [3] "EAS51_64:8:5:734:57"    "B7_591:1:289:587:906"  
## [5] "EAS56_59:8:38:671:758"  "EAS56_61:6:18:467:281" 
## 
## $flag
## [1]  73  73 137 137 137  73
## 
## $rname
## [1] seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $strand
## [1] + + + + + +
## Levels: + - *
## 
## $pos
## [1]  1  3  5  6  9 13
## 
## $qwidth
## [1] 36 35 35 36 35 35
## 
## $mapq
## [1] 99 99 99 63 99 99
## 
## $cigar
## [1] "36M" "35M" "35M" "36M" "35M" "35M"
## 
## $mrnm
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: seq1 seq2
## 
## $mpos
## [1] NA NA NA NA NA NA
## 
## $isize
## [1] NA NA NA NA NA NA
## 
## $seq
##   A DNAStringSet instance of length 6
##     width seq
## [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
## [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
## [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
## [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
## [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
## [6]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
## 
## $qual
##   A PhredQuality instance of length 6
##     width seq
## [1]    36 <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
## [2]    35 <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
## [3]    35 <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
## [4]    36 (-&----,----)-)-),'--)---',+-,),''*,
## [5]    35 <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
## [6]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666
#read in by cigar filter
param <- ScanBamParam(what=scanBamWhat(),
                      flag=scanBamFlag(isSecondaryAlignment = FALSE,
                                         isUnmappedQuery=FALSE,
                                         isNotPassingQualityControls = FALSE))
## 可以使用更多的flag来过滤,可以通过?scanBamFlag来获得帮助。如果对flag不是很明白,可以通过https://samtools.github.io/hts-specs/SAMv1.pdf和https://broadinstitute.github.io/picard/explain-flags.html来了解更多的内容。
res <- scanBam(ex1_file, param=param)[[1]]
sapply(res, head)
## $qname
## [1] "B7_591:4:96:693:509"    "EAS54_65:7:152:368:113"
## [3] "EAS51_64:8:5:734:57"    "B7_591:1:289:587:906"  
## [5] "EAS56_59:8:38:671:758"  "EAS56_61:6:18:467:281" 
## 
## $flag
## [1]  73  73 137 137 137  73
## 
## $rname
## [1] seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $strand
## [1] + + + + + +
## Levels: + - *
## 
## $pos
## [1]  1  3  5  6  9 13
## 
## $qwidth
## [1] 36 35 35 36 35 35
## 
## $mapq
## [1] 99 99 99 63 99 99
## 
## $cigar
## [1] "36M" "35M" "35M" "36M" "35M" "35M"
## 
## $mrnm
## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: seq1 seq2
## 
## $mpos
## [1] NA NA NA NA NA NA
## 
## $isize
## [1] NA NA NA NA NA NA
## 
## $seq
##   A DNAStringSet instance of length 6
##     width seq
## [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
## [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
## [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
## [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
## [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
## [6]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
## 
## $qual
##   A PhredQuality instance of length 6
##     width seq
## [1]    36 <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
## [2]    35 <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
## [3]    35 <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
## [4]    36 (-&----,----)-)-),'--)---',+-,),''*,
## [5]    35 <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
## [6]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666

这个时候读入的bam文件是将其视为单端测序来读取的,如果它是双端测序的结果,我们应该怎么办呢?

res <- scanBam(BamFile(ex1_file, asMates = TRUE))[[1]]
sapply(res, head)
## $qname
## [1] "EAS54_61:4:143:69:578"         "EAS54_61:4:143:69:578"        
## [3] "EAS219_FC30151:7:51:1429:1043" "EAS219_FC30151:7:51:1429:1043"
## [5] "EAS1_108:1:65:787:74"          "EAS1_108:1:65:787:74"         
## 
## $flag
## [1]  99 147  83 163  83 163
## 
## $rname
## [1] seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $strand
## [1] + - - + - +
## Levels: + - *
## 
## $pos
## [1]  36 185 209  59 213  61
## 
## $qwidth
## [1] 35 35 35 35 35 35
## 
## $mapq
## [1] 98 98 99 99 88 88
## 
## $cigar
## [1] "35M" "35M" "35M" "35M" "35M" "35M"
## 
## $mrnm
## [1] seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $mpos
## [1] 185  36  59 209  61 213
## 
## $isize
## [1]  184 -184 -185  185 -187  187
## 
## $seq
##   A DNAStringSet instance of length 6
##     width seq
## [1]    35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG
## [2]    35 ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG
## [3]    35 TATTTGTAATGAAAACTATATTTATGCTATTCAGT
## [4]    35 CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG
## [5]    35 TGTAATGAAAACTATATTTATGCTATTCAGTTCTA
## [6]    35 GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG
## 
## $qual
##   A PhredQuality instance of length 6
##     width seq
## [1]    35 ===;=====48=844;=;+=5==*57,2+5&,5+5
## [2]    35 222&<21<<<<12<7<01<<<<<0<<<<<<<20<<
## [3]    35 9<5<<<<<<<<<<<<<9<<<9<<<<<<<<<<<<<<
## [4]    35 <<<<<<<<<<<<<:<<<;<<<<:):;<;;-15)+1
## [5]    35 44848=:1661/66==?:<=:?6><<<<1>><<<<
## [6]    35 <<<<<8-82<2823;-<;822222888,*(2%2-2
## 
## $groupid
## [1] 1 1 2 2 3 3
## 
## $mate_status
## [1] mated mated mated mated mated mated
## Levels: mated ambiguous unmated

如果我们只想读取最开始的几行,怎么做呢?

# read first 10 lines
res <- scanBam(BamFile(ex1_file, yieldSize = 10))[[1]]
sapply(res, head, n=20)
## $qname
##  [1] "B7_591:4:96:693:509"     "EAS54_65:7:152:368:113" 
##  [3] "EAS51_64:8:5:734:57"     "B7_591:1:289:587:906"   
##  [5] "EAS56_59:8:38:671:758"   "EAS56_61:6:18:467:281"  
##  [7] "EAS114_28:5:296:340:699" "B7_597:6:194:894:408"   
##  [9] "EAS188_4:8:12:628:973"   "EAS51_66:7:68:402:50"   
## 
## $flag
##  [1]  73  73 137 137 137  73 137  73  89 137
## 
## $rname
##  [1] seq1 seq1 seq1 seq1 seq1 seq1 seq1 seq1 seq1 seq1
## Levels: seq1 seq2
## 
## $strand
##  [1] + + + + + + + + - +
## Levels: + - *
## 
## $pos
##  [1]  1  3  5  6  9 13 13 15 18 22
## 
## $qwidth
##  [1] 36 35 35 36 35 35 36 35 35 35
## 
## $mapq
##  [1] 99 99 99 63 99 99 99 99 75 99
## 
## $cigar
##  [1] "36M" "35M" "35M" "36M" "35M" "35M" "36M" "35M" "35M" "35M"
## 
## $mrnm
##  [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: seq1 seq2
## 
## $mpos
##  [1] NA NA NA NA NA NA NA NA NA NA
## 
## $isize
##  [1] NA NA NA NA NA NA NA NA NA NA
## 
## $seq
##   A DNAStringSet instance of length 10
##      width seq
##  [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
##  [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
##  [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
##  [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
##  [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
##  [6]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
##  [7]    36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG
##  [8]    35 TGTAAATGTGTGGTTTAACTCGTCCATTGCCCAGC
##  [9]    35 AAATGTGTGGTTTAACTCGTCCATGGCCCAGCATT
## [10]    35 GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG
## 
## $qual
##   A PhredQuality instance of length 10
##      width seq
##  [1]    36 <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
##  [2]    35 <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
##  [3]    35 <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
##  [4]    36 (-&----,----)-)-),'--)---',+-,),''*,
##  [5]    35 <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
##  [6]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666
##  [7]    36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0;
##  [8]    35 <<<<<<<<<7<<;<<<<;<<<7;;<<<*,;;572<
##  [9]    35 ==;=:;:;;:====;=;===:=======;==;===
## [10]    35 <<<<<<<<<<<<<<:<<<9<6;9;;&697;7&<55

如果我们想一行一行读入,怎么做?

con <- BamFile(ex1_file, yieldSize=1)
open(con)
i <- 0
while(i<=5 && isIncomplete(con)){
   print(do.call(cbind, scanBam(con)[[1]])) ##seq and qual are special objects
   i <- i + 1
}
##      qname                 flag rname strand pos qwidth mapq cigar mrnm
## [1,] "B7_591:4:96:693:509" 73   1     1      1   36     99   "36M" NA  
##      mpos isize seq qual
## [1,] NA   NA    ?   ?   
##      qname                    flag rname strand pos qwidth mapq cigar mrnm
## [1,] "EAS54_65:7:152:368:113" 73   1     1      3   35     99   "35M" NA  
##      mpos isize seq qual
## [1,] NA   NA    ?   ?   
##      qname                 flag rname strand pos qwidth mapq cigar mrnm
## [1,] "EAS51_64:8:5:734:57" 137  1     1      5   35     99   "35M" NA  
##      mpos isize seq qual
## [1,] NA   NA    ?   ?   
##      qname                  flag rname strand pos qwidth mapq cigar mrnm
## [1,] "B7_591:1:289:587:906" 137  1     1      6   36     63   "36M" NA  
##      mpos isize seq qual
## [1,] NA   NA    ?   ?   
##      qname                   flag rname strand pos qwidth mapq cigar mrnm
## [1,] "EAS56_59:8:38:671:758" 137  1     1      9   35     99   "35M" NA  
##      mpos isize seq qual
## [1,] NA   NA    ?   ?   
##      qname                   flag rname strand pos qwidth mapq cigar mrnm
## [1,] "EAS56_61:6:18:467:281" 73   1     1      13  35     99   "35M" NA  
##      mpos isize seq qual
## [1,] NA   NA    ?   ?
close(con)

如果我们想读入tags怎么办?

param <- ScanBamParam(what=scanBamWhat(), 
                      tag=c("MF", "H0", "H1", "NM", "UQ"))
res <- scanBam(ex1_file, param=param)[[1]]
names(res)
##  [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"  
##  [8] "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"   "tag"
sapply(res$tag, head)
##       MF H0 H1 NM UQ
## [1,]  18  1  0  0  0
## [2,]  18  1  0  0  0
## [3,]  18  1  0  0  0
## [4,] 130  0  0  5 38
## [5,]  18  1  0  0  0
## [6,]  18  0  1  1  5

如果我们觉得list的格式不好看,怎么做?

library(GenomicAlignments)
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## SE
readGAlignments(ex1_file)
## GAlignments object with 3271 alignments and 0 metadata columns:
##          seqnames strand       cigar    qwidth     start       end
##             <Rle>  <Rle> <character> <integer> <integer> <integer>
##      [1]     seq1      +         36M        36         1        36
##      [2]     seq1      +         35M        35         3        37
##      [3]     seq1      +         35M        35         5        39
##      [4]     seq1      +         36M        36         6        41
##      [5]     seq1      +         35M        35         9        43
##      ...      ...    ...         ...       ...       ...       ...
##   [3267]     seq2      +         35M        35      1524      1558
##   [3268]     seq2      +         35M        35      1524      1558
##   [3269]     seq2      -         35M        35      1528      1562
##   [3270]     seq2      -         35M        35      1532      1566
##   [3271]     seq2      -         35M        35      1533      1567
##              width     njunc
##          <integer> <integer>
##      [1]        36         0
##      [2]        35         0
##      [3]        35         0
##      [4]        36         0
##      [5]        35         0
##      ...       ...       ...
##   [3267]        35         0
##   [3268]        35         0
##   [3269]        35         0
##   [3270]        35         0
##   [3271]        35         0
##   -------
##   seqinfo: 2 sequences from an unspecified genome
## PE
readGAlignmentPairs(ex1_file)
## GAlignmentPairs object with 1572 pairs, strandMode=1, and 0 metadata columns:
##          seqnames strand   :       ranges  --       ranges
##             <Rle>  <Rle>   :    <IRanges>  --    <IRanges>
##      [1]     seq1      +   :     [36, 70]  --   [185, 219]
##      [2]     seq1      +   :     [49, 84]  --   [224, 259]
##      [3]     seq1      +   :     [51, 85]  --   [228, 262]
##      [4]     seq1      +   :     [60, 95]  --   [224, 259]
##      [5]     seq1      +   :     [60, 94]  --   [235, 269]
##      ...      ...    ... ...          ... ...          ...
##   [1568]     seq2      -   : [1513, 1547]  -- [1322, 1356]
##   [1569]     seq2      -   : [1515, 1549]  -- [1332, 1366]
##   [1570]     seq2      -   : [1523, 1555]  -- [1354, 1388]
##   [1571]     seq2      -   : [1528, 1562]  -- [1376, 1410]
##   [1572]     seq2      -   : [1533, 1567]  -- [1349, 1383]
##   -------
##   seqinfo: 2 sequences from an unspecified genome
## with tags
## SE
param <- ScanBamParam(what=c("qname", "flag", "mapq", "isize", "seq", "qual"))
readGAlignments(ex1_file, param=param)
## GAlignments object with 3271 alignments and 6 metadata columns:
##          seqnames strand       cigar    qwidth     start       end
##             <Rle>  <Rle> <character> <integer> <integer> <integer>
##      [1]     seq1      +         36M        36         1        36
##      [2]     seq1      +         35M        35         3        37
##      [3]     seq1      +         35M        35         5        39
##      [4]     seq1      +         36M        36         6        41
##      [5]     seq1      +         35M        35         9        43
##      ...      ...    ...         ...       ...       ...       ...
##   [3267]     seq2      +         35M        35      1524      1558
##   [3268]     seq2      +         35M        35      1524      1558
##   [3269]     seq2      -         35M        35      1528      1562
##   [3270]     seq2      -         35M        35      1532      1566
##   [3271]     seq2      -         35M        35      1533      1567
##              width     njunc |                    qname      flag
##          <integer> <integer> |              <character> <integer>
##      [1]        36         0 |      B7_591:4:96:693:509        73
##      [2]        35         0 |   EAS54_65:7:152:368:113        73
##      [3]        35         0 |      EAS51_64:8:5:734:57       137
##      [4]        36         0 |     B7_591:1:289:587:906       137
##      [5]        35         0 |    EAS56_59:8:38:671:758       137
##      ...       ...       ... .                      ...       ...
##   [3267]        35         0 |     EAS56_63:4:141:9:811       137
##   [3268]        35         0 |  EAS114_30:6:277:397:932        73
##   [3269]        35         0 | EAS139_11:7:50:1229:1313        83
##   [3270]        35         0 |    EAS54_65:3:320:20:250       147
##   [3271]        35         0 |    EAS114_26:7:37:79:581        83
##               mapq     isize                     seq
##          <integer> <integer>          <DNAStringSet>
##      [1]        99      <NA> CACTAGTGGC...GTTTAACTCG
##      [2]        99      <NA> CTAGTGGCTC...TTTAACTCGT
##      [3]        99      <NA> AGTGGCTCAT...TAACTCGTCC
##      [4]        63      <NA> GTGGCTCATT...ACTCTTCTCT
##      [5]        99      <NA> GCTCATTGTA...TCGTCCATGG
##      ...       ...       ...                     ...
##   [3267]        10      <NA> TTTCTTTTCT...TTTTCTACAT
##   [3268]         0      <NA> TTTCTTTTCA...TTTTTTACTT
##   [3269]        77      -187 TTTTTTCTTT...TTGCATGCCA
##   [3270]        77      -200 TTTTTTTTTT...ATGCCAGAAA
##   [3271]        68      -219 TTTTTTTTTT...TGCCAGAAAA
##                             qual
##                   <PhredQuality>
##      [1] <<<<<<<<<<...<<<<<;:<;7
##      [2] <<<<<<<<<<...9<<3/:<6):
##      [3] <<<<<<<<<<...<3;);3*8/5
##      [4] (-&----,--...,+-,),''*,
##      [5] <<<<<<<<<<...<<7<<;:<5%
##      ...                     ...
##   [3267] <<<;<<<<<<...<<<..));;.
##   [3268] <<<<<<<<<<...<<<:8(,0%(
##   [3269] <<<<,<&<7<...<<<<<<<<<<
##   [3270] +'''/<<<<7...<<<<<<<<<<
##   [3271] 3,,,===6==...==========
##   -------
##   seqinfo: 2 sequences from an unspecified genome
## PE
readGAlignmentsList(ex1_file, param=param)
## GAlignmentsList object of length 1699:
## [[1]] 
## GAlignments object with 2 alignments and 6 metadata columns:
##       seqnames strand cigar qwidth start end width njunc |
##   [1]     seq1      +   35M     35    36  70    35     0 |
##   [2]     seq1      -   35M     35   185 219    35     0 |
##                       qname flag mapq isize                     seq
##   [1] EAS54_61:4:143:69:578   99   98   184 GTACATGGCC...GTGGACCCCG
##   [2] EAS54_61:4:143:69:578  147   98  -184 ATTGGGAGCC...ATTTGTAATG
##                          qual
##   [1] ===;=====4...7,2+5&,5+5
##   [2] 222&<21<<<...<<<<<<20<<
## 
## [[2]] 
## GAlignments object with 2 alignments and 6 metadata columns:
##       seqnames strand cigar qwidth start end width njunc |
##   [1]     seq1      -   35M     35   209 243    35     0 |
##   [2]     seq1      +   35M     35    59  93    35     0 |
##                               qname flag mapq isize
##   [1] EAS219_FC30151:7:51:1429:1043   83   99  -185
##   [2] EAS219_FC30151:7:51:1429:1043  163   99   185
##                           seq                    qual
##   [1] TATTTGTAAT...GCTATTCAGT 9<5<<<<<<<...<<<<<<<<<<
##   [2] CTGTGGACCC...GGGGGCGCCG <<<<<<<<<<...;<;;-15)+1
## 
## [[3]] 
## GAlignments object with 2 alignments and 6 metadata columns:
##       seqnames strand cigar qwidth start end width njunc |
##   [1]     seq1      -   35M     35   213 247    35     0 |
##   [2]     seq1      +   35M     35    61  95    35     0 |
##                      qname flag mapq isize                     seq
##   [1] EAS1_108:1:65:787:74   83   88  -187 TGTAATGAAA...TTCAGTTCTA
##   [2] EAS1_108:1:65:787:74  163   88   187 GTGGACCCTG...GGGCACGGGG
##                          qual
##   [1] 44848=:166...<<<1>><<<<
##   [2] <<<<<8-82<...88,*(2%2-2
## 
## ...
## <1696 more elements>
## -------
## seqinfo: 2 sequences from an unspecified genome

如果我们想输出bam文件怎么做?

library(rtracklayer)
res <- readGAlignments(ex1_file)
export(res, paste0(dest, ".res.bam"))
#delete all the temp files
(tobedelete <- dir(dirname(dest), full.names = TRUE))
## [1] "/tmp/RtmpAy9HOH/filec6055b4e822.merged.bam" 
## [2] "/tmp/RtmpAy9HOH/filec6055b4e822.res.bam"    
## [3] "/tmp/RtmpAy9HOH/filec6055b4e822.res.bam.bai"
## [4] "/tmp/RtmpAy9HOH/filec6055b4e8221.bam"       
## [5] "/tmp/RtmpAy9HOH/filec6055b4e8221.bam.bai"   
## [6] "/tmp/RtmpAy9HOH/filec6055b4e8222.bam"
unlink(tobedelete)

#sessionInfo
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS release 6.7 (Final)
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rtracklayer_1.34.0         GenomicAlignments_1.10.0  
##  [3] SummarizedExperiment_1.4.0 Biobase_2.34.0            
##  [5] Rsamtools_1.26.1           Biostrings_2.42.0         
##  [7] XVector_0.14.0             GenomicRanges_1.26.1      
##  [9] GenomeInfoDb_1.10.0        IRanges_2.8.0             
## [11] S4Vectors_0.12.0           BiocGenerics_0.20.0       
## [13] markdown_0.7.7             knitr_1.14                
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5       zlibbioc_1.20.0    BiocParallel_1.8.0
##  [4] lattice_0.20-34    stringr_1.1.0      tools_3.3.0       
##  [7] grid_3.3.0         Matrix_1.2-7.1     bitops_1.0-6      
## [10] RCurl_1.95-4.8     evaluate_0.10      stringi_1.1.2     
## [13] XML_3.98-1.4

4 thoughts on “Rsamtools基础知识问答

  1. Reply Liangcai 2月 24,2017 12:36 下午

    请问老师

    我用Rsamtools 对BAM文件进行sort后建立索引,之后怎么输出这些sort后的结果和index结果呢?谢谢
    >sortBam(“***.BAM”,”./”)
    [bam_sort_core] merging from 14 files…
    [1] “./.bam”

    > indexBam(“./.bam”,”***.BAM.index”)
    ./.bam
    “./.bam.bai”

    • Reply admin 2月 27,2017 9:42 上午

      在sortBam中,第二个参数是destination。你可以先试试下面的代码,假设你的文件名字叫”abc.bam”
      sorted <- sortBam("abc.bam", "abc.srt") ##这里的abc.srt是没有扩展名的 indexBam(sorted) 你之所以看不到输出文件,是因为你给出的文件名刚好是以.起始的文件名,是隐藏文件。

  2. Reply 生信菜鸟团 2月 26,2017 5:35 上午

    看起来是用rmarkdown在写教程,希望可以把一些打印关闭掉,这样不会显得全文无关信息那么多。
    比如我写的http://www.bio-info-trainee.com/bioconductor_China/software/limma.html

Leave a Reply

  

  

  

%d 博主赞过: