Analysis Challenges of Next-Generation Sequencing Next-generation sequencing has significantly reduced the time and cost of performing sequencing experiments. However, translating the massive next-generation sequencing data into meaningful results faces new challenges. CAMDA2009 will mainly focus on the ChIP-seq assay. The challenges of ChIP-seq assay include: - Experiment design: number of replicates; sequencing depth
Quality assessment: sequencing bias; sequencing errors; batch effects - Data management and visualization
- Sequence alignment: alignment speed; error handling; multiple matches
- Preprocessing: normalization; filtering, peak finding
- Differential analysis: statistical models; multiple testing;
- Annotation: associate ChIP loci with genes; functional analysis (biological pathways)
- Motif identification
To face these analysis challenges, CAMDA2009 adoptes the approach of a community-wide experiment, letting the scientific community analyze the same public ChIP-seq dataset. The details of the dataset will be illustrated in the subsequent sections. Data description (Summary) Data type: ChIP-seq data (Chromatin ImmunoPrecipitation following by high-thoughput tag sequencing) Platform: GA II platform from Illumina / Solexa Samples: 1. The human transcription factor Pol II in HeLa S3 cells (the same protocol as used for the STAT1 samples except that the cells were not stimulated using gamma-interferon prior to formaldehyde fixation) 2. Input DNA dataset for HeLa S3 cells 3. The human transcription factor STAT1 in HeLa S3 cells (The STAT1 ChIP was performed using HeLa S3 cells that are stimulated using gamma-interferon). 4. Input DNA dataset for gamma-interferon stimulated HeLa S3 cells Publication: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls, Nature Biotechnology: (Vol 27-1, Jan 2009) Download the dataset: The dataset can be accessed at GEO (GSE12738) . But it only includes the mapped sequence reads, and does not include the raw sequence read. The paper companion website provides more detailed data information, and includes both mapped and raw sequence data. Summary of the samples Download detailed sample information table (Tab separated txt file). Data formats Illumina Sequence File Format Illumina sequence files are in FASTQ format. It combines the sequence reads and quality score into one file. The quality score is represented as an ascii character. The ascii character is associated with the quality score. The encoding implicit in Solexa-derived fastq files is that each character code corresponds to a score equal to the ASCII character value minus 64 (e.g., ASCII @ is decimal 64, and corresponds to a Solexa quality score of 0). It is different from BioPerl, for instance, which recovers quality scores by subtracting 33 from the ASCII character value (so that, for instance, !, with decimal value 33, encodes value 0). Following is the first four lines of an example sequence file. The first line starts with @ and followed by the sequence identifier. The second line is the sequence and the fourth line is the ASCII encoded quality score of the sequence. First 4 lines of an example FASTQ sequence file: @FC201WVA_20080307:6:1:85:638 GATAATACCTATTAAATAGTCCAAATTA +FC201WVA_20080307:6:1:85:638 [[[[[[[[[[[[[[[[[[[[[[[[[YYY Eland format ELAND stands for Efficient Large-Scale Alignment of Nucleotide Databases. ELAND searches a set of large DNA files for a large number of short DNA reads (≤ 32 bases, upcoming new release of Eland will support longer read lengths (>32 bp)) allowing up to 2 errors per match. The columns from left to right are: 1. Sequence name 2. Sequence 3. Type of match. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches). 4. Number of exact matches found 5. Number of 1-error matches found 6. Number of 2-error matches found (For unique best match, U, only:) 7. Genome file in which match was found 8. Position of match 9. Direction of match (strand of match) (F, forward; R, reverse) 10. How N characters in read were interpreted (., not applicable; D, deletion; I, insertion) 11. Position and type of the first substitution error (For U1 and U2 matches only:) 12. Position and type of the second substitution error (For U2 matches only:) Suggested Tools Sequence alignment tools Maq, Eland, SOAP, RMAP, SHRiMP and others See some comments of analysis tools at: http://massgenomics.wordpress.com/2008/05/14/short-read-aligners-maq-eland-and-others/ Analysis tools ShortRead (R/Bioconductor) PeakSeq CisGenome Basic Tutorial Data input using functions in ShortRead Bioconductor package Input Illumina sequence data in FASTQ format # suppose sequenceDir is the directory keeping the sequence data. Function readFastq can input multiple sequence files together. It uses parameter pattern to select the files in the “sequenceDir” directory. In default, it will input all files in the directory. rfq <- readFastq(sequenceDir, pattern="HeLa*") # retrieve sequence data sequenceRead <- sread(rfq) # retrieve sequence data sequenceID <- id(rfq) # retrieve sequence quality sequenceQuality <- quality(rfq) sequenceQuality[1] # Coerce the quality score as numeric as(sequenceQuality[1], 'numeric') Input mapped sequence data by Eland We can use the “readAligned” function to input the mapped sequence files. Suppose mappedSeqDir is the directory keeping the mapped sequence data. Similar with function readFastq, function readAligned can also input multiple mapped sequence files together. To input the mapped sequence in our dataset, we can set the type parameter as “SolexaResult”. Please read the help of readAligned for other supported file formats. alignedSeq <- readAligned(mappedSeqDir, type="SolexaResult" ) # show the slot names in the data slotNames(alignedSeq) # "chromosome" "position" "strand" "alignQuality" "alignData" "quality" "sread" "id" |