Helix, Bits and Codes: 2013

Sequences from the Illumina use a systematic identifier. The sequence header tells x,y-coordinates of the cluster within the tile. (Roche/454 read headers also have some information about the run. As far as I remember, 454 read ids encode the time of a run + region + several random character.)

Plotting the coordinates of HiSeq (2500) and MiSeq (v2) reads:

Surface area of HiSeq tile is greater than MiSeq tile (assuming both platforms have the same image resolution). More interestingly, the tiles in HiSeq are rectangular, while the tiles in MiSeq are round. MiSeq reads are distributed evenly in the circle, and HiSeq reads are also spread evenly. (Are read quality the same in different areas? maybe I can talk about this in another post).

Take a peek at the thumbnail of a MiSeq tile, you can see the tile is actually square. But the edges of the image are darker than the center area. Very likely that only spots (clusters) in the center are selected and used for base-calling.

If the surface of each tile on the flow cell is full off oligos which allows library binding, clusters should be generated evenly on the flow cell. Each tile should have the same cluster density across the whole area, and sequencing by synthesis reactions should take place at the same time with same performance, no matter a cluster is located at the corner or at the center. However, for some reason, imaging is not equally good across the tile. There is a reduction of brightness at the periphery compared to the image center (This effect seems like vignetting in photography). The circle distribution of reads indicates that clusters at the edges/corners (21% of all clusters) are discarded.

MiSeq v3 kit generates more dense clusters than v2 kit, and gives higher yield. But the shape and size of the 'selected' clusters are pretty much the same in both reagents. If it's possible to improve the imaging device, or cluster identification algorithm, all clusters in a tile could be identified and used for base-calling. That would increase the yield of MiSeq runs by ~27%. It's a little pity that those clusters go to waste.

Codes:

# extract tile numbers and x,y coordinates
gunzip -c XXX_S1_L001_R1_001.fastq.gz | sed -n '1~4s/:/ /gp' | cut -d ' ' -f 5-7 > pos.XXX.txt

# read the file (specifying colClasses option instead of using the default can make 'read.table' run MUCH faster)
pos.miseq.all = read.table("pos.miseq.txt", colClasses=c("numeric","numeric","numeric"))
pos.hiseq.all = read.table("pos.hiseq.txt", colClasses=c("numeric","numeric","numeric"))
# set column names
names(pos.miseq.all) = c("t", "x", "y")
names(pos.hiseq.all) = c("t", "x", "y")
# sampling a subset
pos.miseq = pos.miseq.all[sample(nrow(pos.miseq.all), 10000), ]
pos.hiseq = pos.hiseq.all[sample(nrow(pos.hiseq.all), 50000), ]
# plot for each tile
for (i in sort(unique(pos.miseq$t))){
  cat(i, "\n")
  # plot hiseq reads first
  plot(pos.hiseq[pos.hiseq$t==i, 2:3], pch=21, col="blue", xlim=c(0,30000), ylim=c(0,100000), main=paste("tile:",i))
  # add miseq reads
  points(pos.miseq[pos.miseq$t==i, 2:3], pch=23, col="orange")
  legend(x="right", c("HiSeq","MiSeq"), bty="n", col=c("blue","orange"), pch=c(21,23))
  # pause for 2 seconds
  Sys.sleep(time=5)
}

Recently, I constructed an Illumina RNA-Seq library which had non-standar fragment length. I used TruSeq RNA Sample Prep Kit v2 and followed the protocol except one modification: reducing fragmentation time so that the fragments will be longer. The investigator who submit the RNA sample preferred to have longer library size, because they wanted to do de novo assembly of this transcriptome, hoping longer read length of MiSeq platform (2x250 paired end) would help.

In standard TruSeq protocol mRNA is fragmented by incubation at 94 degC for 8 min. Now I used one min and got a Bioanalyzer profile like the figure below. Library length ranged from 250 up to 1500 bp.

The TruSeq manual actually explained what the profiles would look like. And my library profile matched. But I was still curious how this library performed. I was not sure what size to use as mean library length - which was required for dilution when loading the library on the sequencing machine. Overloading makes the clusters too dense, and base-call qualities will get low. But we need to load adequate amount of library to generate enough data. Fortunately, the run was great, although a little overloaded: cluster density more than 1100K/mm2, 2X20M reads, totally 20Gb data. After the sequencing was done, I mapped the reads to the transcriptome reference using BWA-MEM. Without any idea what this sample was from, I just did several blast search and chose one species which had very high similarity. Library insert size (TLEN field in SAM ouput) was then calculated and plotted. Note that adapters ligation adds extra ~120 bp to the insert fragment.

Codes:

What these results tell:

(1) Mapping results shows the majority of library inserts are less than 500 bp. Libraries with shorter length (insert + adapter) have higher chance to generate clusters and get sequenced. This has been reported already.

(2) For a library with such long and broad fragment length, 2x250 PE sequencing seem not necessary, because 2X150 PE is enough to cover the whole insert fragment of most of library DNA.

(3) Does it still make sense to prepare long RNA-Seq library? What's the pros and cons of it? ...hmm Need comparison with short library to get a conclusion.

In the end of this post, let me make this blog longer by dumping more more ugly plots here ...

Codes:

Helix, Bits and Codes

Saturday, December 28, 2013

Tile shapes in HiSeq and MiSeq

Thursday, December 26, 2013

TruSeq library with long and wide fragment length