CAMDA 2008 2007 2006 2004 2003 2002 2001 2000 | What is CAMDA | Book | Discussion | Organizers | Contact Us

 

 

Home

Registration

Keynote Speakers

Call for Papers

Contest Datasets

Analysis Tools

Important Dates

Agenda

Sponsors

Scientific Committee

Accomodations

 

Analysis Challenges of Next-Generation Sequencing

Next-generation sequencing has significantly reduced the time and cost of performing sequencing experiments. However, translating the massive next-generation sequencing data into meaningful results faces new challenges. CAMDA2009 will mainly focus on the ChIP-seq assay. The challenges of ChIP-seq assay include:

  • Experiment design: number of replicates; sequencing depth
    Quality assessment: sequencing bias; sequencing errors; batch effects
  • Data management and visualization
  • Sequence alignment: alignment speed; error handling; multiple matches
  • Preprocessing: normalization; filtering, peak finding
  • Differential analysis: statistical models; multiple testing;
  • Annotation: associate ChIP loci with genes; functional analysis (biological pathways)
  • Motif identification

To face these analysis challenges, CAMDA2009 adoptes the approach of a community-wide experiment, letting the scientific community analyze the same public ChIP-seq dataset. The details of the dataset will be illustrated in the subsequent sections.

 

Data description (Summary)

Data type: ChIP-seq data (Chromatin ImmunoPrecipitation following by high-thoughput tag sequencing)

Platform: GA II platform from Illumina / Solexa

Samples:

1.     The human transcription factor Pol II in HeLa S3 cells (the same protocol as used for the STAT1 samples except that the cells were not stimulated using gamma-interferon prior to formaldehyde fixation)

2.     Input DNA dataset for HeLa S3 cells

3.     The human transcription factor STAT1 in HeLa S3 cells (The STAT1 ChIP was performed using HeLa S3 cells that are stimulated using gamma-interferon).

4.     Input DNA dataset for gamma-interferon stimulated HeLa S3 cells

Publication: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls, Nature Biotechnology: (Vol 27-1, Jan 2009)

Download the dataset:

The dataset can be accessed at GEO (GSE12738) . But it only includes the mapped sequence reads, and does not include the raw sequence read.

The paper companion website provides more detailed data information, and includes both mapped and raw sequence data.

 

Summary of the samples

 

Number of biological replicates

Number of lanes for each replicate

Raw sequence reads

Mapped sequence reads

Pol II ChIP-seq

3

2, 5, 4

Download

Download

Input DNA

1

13

Download

Download

STAT1 ChIP-seq (IFNg stimulated)

2

4, 2

Download

Download

Input DNA (IFNg stimulated)

1

6

Download

Download

Download detailed sample information table (Tab separated txt file).

 

Data formats

Illumina Sequence File Format

Illumina sequence files are in FASTQ format. It combines the sequence reads and quality score into one file. The quality score is represented as an ascii character. The ascii character is associated with the quality score. The encoding implicit in Solexa-derived fastq files is that each character code corresponds to a score equal to the ASCII character value minus 64 (e.g., ASCII @ is decimal 64, and corresponds to a Solexa quality score of 0). It is different from BioPerl, for instance, which recovers quality scores by subtracting 33 from the ASCII character value (so that, for instance, !, with decimal value 33, encodes value 0).

Following is the first four lines of an example sequence file. The first line starts with @ and followed by the sequence identifier. The second line is the sequence and the fourth line is the ASCII encoded quality score of the sequence.

First 4 lines of an example FASTQ sequence file:

@FC201WVA_20080307:6:1:85:638

GATAATACCTATTAAATAGTCCAAATTA

+FC201WVA_20080307:6:1:85:638

[[[[[[[[[[[[[[[[[[[[[[[[[YYY

Eland format

ELAND stands for Efficient Large-Scale Alignment of Nucleotide Databases.

ELAND searches a set of large DNA files for a large number of short DNA reads (≤ 32 bases, upcoming new release of Eland will support longer read lengths (>32 bp)) allowing up to 2 errors per match.

The columns from left to right are:

1.     Sequence name

2.     Sequence

3.     Type of match. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).

4.     Number of exact matches found

5.     Number of 1-error matches found

6.     Number of 2-error matches found (For unique best match, U, only:)

7.     Genome file in which match was found

8.     Position of match

9.     Direction of match (strand of match) (F, forward; R, reverse)

10.  How N characters in read were interpreted (., not applicable; D, deletion; I, insertion)

11.  Position and type of the first substitution error (For U1 and U2 matches only:)

12.  Position and type of the second substitution error (For U2 matches only:)

 

Suggested Tools

Sequence alignment tools

Maq, Eland, SOAP, RMAP, SHRiMP and others

See some comments of analysis tools at:

http://massgenomics.wordpress.com/2008/05/14/short-read-aligners-maq-eland-and-others/

Analysis tools

ShortRead (R/Bioconductor)

PeakSeq

CisGenome

 

Basic Tutorial

Data input using functions in ShortRead Bioconductor package

Input Illumina sequence data in FASTQ format

# suppose sequenceDir is the directory keeping the sequence data. Function readFastq can input multiple sequence files together. It uses parameter pattern to select the files in the “sequenceDir” directory. In default, it will input all files in the directory.

rfq <- readFastq(sequenceDir, pattern="HeLa*")

# retrieve sequence data

sequenceRead <- sread(rfq)

# retrieve sequence data

sequenceID <- id(rfq)

# retrieve sequence quality

sequenceQuality <- quality(rfq)

sequenceQuality[1]

# Coerce the quality score as numeric

as(sequenceQuality[1], 'numeric')

Input mapped sequence data by Eland

We can use the “readAligned” function to input the mapped sequence files. Suppose mappedSeqDir is the directory keeping the mapped sequence data. Similar with function readFastq, function readAligned can also input multiple mapped sequence files together. To input the mapped sequence in our dataset, we can set the type parameter as “SolexaResult”. Please read the help of readAligned for other supported file formats.

alignedSeq <- readAligned(mappedSeqDir, type="SolexaResult" )

# show the slot names in the data

slotNames(alignedSeq)

# "chromosome" "position" "strand" "alignQuality" "alignData" "quality" "sread" "id"

 

Last modified on 05/28/2009

 

 

© Northwestern University Biomedical Informatics Center