Sequences from the Illumina use a
systematic identifier. The sequence header tells x,y-coordinates of the cluster within the tile. (Roche/454 read headers also have some information about the run. As far as I remember, 454 read ids encode the time of a run + region + several random character.)
Plotting the coordinates of HiSeq (2500) and MiSeq (v2) reads:
Surface area of HiSeq tile is greater than MiSeq tile (assuming both platforms have the same image resolution). More interestingly, the tiles in HiSeq are rectangular, while the tiles in MiSeq are round. MiSeq reads are distributed evenly in the circle, and HiSeq reads are also spread evenly. (Are read quality the same in different areas? maybe I can talk about this in another post).
Take a peek at the thumbnail of a MiSeq tile, you can see the tile is actually square. But the edges of the image are darker than the center area. Very likely that only spots (clusters) in the center are selected and used for base-calling.
If the surface of each tile on the flow cell is full off oligos which allows library binding, clusters should be generated evenly on the flow cell. Each tile should have the same cluster density across the whole area, and sequencing by synthesis reactions should take place at the same time with same performance, no matter a cluster is located at the corner or at the center. However, for some reason, imaging is not equally good across the tile. There is a reduction of brightness at the periphery compared to the image center (This effect seems like vignetting in photography). The circle distribution of reads indicates that clusters at the edges/corners (21% of all clusters) are discarded.
MiSeq v3 kit generates more dense clusters than v2 kit, and gives higher yield. But the shape and size of the 'selected' clusters are pretty much the same in both reagents. If it's possible to improve the imaging device, or cluster identification algorithm, all clusters in a tile could be identified and used for base-calling. That would increase the yield of MiSeq runs by ~27%. It's a little pity that those clusters go to waste.
Codes:
# extract tile numbers and x,y coordinates
gunzip -c XXX_S1_L001_R1_001.fastq.gz | sed -n '1~4s/:/ /gp' | cut -d ' ' -f 5-7 > pos.XXX.txt
# read the file (specifying colClasses option instead of using the default can make 'read.table' run MUCH faster)
pos.miseq.all = read.table("pos.miseq.txt", colClasses=c("numeric","numeric","numeric"))
pos.hiseq.all = read.table("pos.hiseq.txt", colClasses=c("numeric","numeric","numeric"))
# set column names
names(pos.miseq.all) = c("t", "x", "y")
names(pos.hiseq.all) = c("t", "x", "y")
# sampling a subset
pos.miseq = pos.miseq.all[sample(nrow(pos.miseq.all), 10000), ]
pos.hiseq = pos.hiseq.all[sample(nrow(pos.hiseq.all), 50000), ]
# plot for each tile
for (i in sort(unique(pos.miseq$t))){
cat(i, "\n")
# plot hiseq reads first
plot(pos.hiseq[pos.hiseq$t==i, 2:3], pch=21, col="blue", xlim=c(0,30000), ylim=c(0,100000), main=paste("tile:",i))
# add miseq reads
points(pos.miseq[pos.miseq$t==i, 2:3], pch=23, col="orange")
legend(x="right", c("HiSeq","MiSeq"), bty="n", col=c("blue","orange"), pch=c(21,23))
# pause for 2 seconds
Sys.sleep(time=5)
}