Greengenes Release 12_8
--

Greengenes is a well-characterized and curated database of small subunit
ribosomal near-full length sequences. With every release, a de novo 
phylogenetic tree is constructed that incorporates the novel branching 
patterns, branch length and candidate divisions of the sequences that have 
been deposited into public databases since the prior release.

Greengenes is operated and maintained by an international consortium 
representing academic and biotech interests. The consortium is composed of:

Phil Hugenholtz, Australian Center for Ecogenomics University of Queensland,
    School of Chemistry and Molecular Biosciences University of Queensland,
    Institute for Molecular Bioscience University of Queensland
Daniel McDonald, Dept. Computer Science University of Colorado, BioFrontiers 
    Institute University of Colorado
Todd DeSantis, Second Genome
Rob Knight, Dept. Chemistry and Biochemistry University of Colorado, Dept. 
    Computer Science University of Colorado, BioFrontiers Institute University
    of Colorado, Howard Hughes Medical Institute

For general inquiries, please contact Daniel McDonald 
(daniel.mcdonald@colorado.edu).

Many improvements and changes have been made to the Greengenes backend in an 
effort to streamline the release cycle. Notably:

- Chimera detection now relies on Chimera Slayer, Bellerophon and 
    curation by Phil Hugenholtz. The detected chimeras are listed in the 
    gg_12_8_chimeras.txt file, which gives the Genbank accession, Greengenes
    ID and a brief justification for each. Furthermore, the reference
    database used is now based on consensus sequences at the 94% OTU level
    from the previous release.
     
- The percent invariant calculation is now performed on a per-domain basis. We
    discovered that a sizable set of sequences was being falsely dropped when
    domain-specific positions were ignored.

- The 2-study heuristic has been dropped.

- Many of the taxonomic differences between Silva and Greengenes have been
    resolved. These groups are now communicating on a regular basis to help 
    maintain consistency between the databases.

Our goal, now and indefinitely into the future, is to release biannually.
This release structure will allow Greengenes to rapidly absorb novel 
diversity as our knowledge of the full tree of life expands.

Methods. Sequences were obtained from Genbank during June and early July 
2012. These records were parsed for viable sequence. All obtained sequences 
were run through SSU-Align 0.1 (Nawrocki 2009). Sequences that aligned were 
then filtered by length, dropping all sequences shorter than 1200 nt. This set
of sequences was run through Chimera Slayer and Bellerophon. The reference 
database used
was composed of consensus sequences from the 94% 4feb2011 Greengenes OTUs 
requiring a minimum of 10 sequences per cluster. Any sequence flagged by both
programs was marked as chimeric. Additionally, any sequence that introduced a
conflict at the class level as identified by Chimera Slayer was dropped, and any
sequence with a divergence ratio > 1.2 as well as a class level divergence as
identified by Bellerophon was dropped. Any sequence with >= 1% non-ACGT 
bases was dropped. Any sequence that matched <= 90% of the invariant 16S
positions was dropped.
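
The length and ambiguity thresholds above can be sketched in Python as
follows. This is an illustrative sketch only, not the pipeline code used to
build the release. The function names are hypothetical, and the chimera and
invariant-position filters are not shown.

```python
def fraction_non_acgt(seq):
    """Return the fraction of bases in seq that are not A, C, G or T."""
    seq = seq.upper()
    if not seq:
        return 1.0
    n_bad = sum(1 for base in seq if base not in "ACGT")
    return n_bad / len(seq)


def passes_basic_filters(seq, min_length=1200, max_non_acgt=0.01):
    """Apply the length and ambiguity thresholds quoted in the Methods.

    Sequences shorter than min_length are dropped, as are sequences in
    which the fraction of non-ACGT bases is >= max_non_acgt.
    """
    if len(seq) < min_length:
        return False
    return fraction_non_acgt(seq) < max_non_acgt
```

The defaults mirror the 1200 nt and 1% cutoffs stated above; both are exposed
as parameters so that other thresholds can be tried.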

OTU picking. Sequences were sorted to increase the stability of the picked 
representative sequences. The following preferences were used:

1) in the previous release
2) correctly named isolate
3) length
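
One way to express this ordering is as a composite sort key. The sketch below
is illustrative rather than the code used for the release. The record fields
(in_previous_release, is_named_isolate) are hypothetical names, and treating
the preferences as strict tie-breakers in the listed order is an assumption.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are illustrative only.
Candidate = namedtuple(
    "Candidate", ["seq_id", "in_previous_release", "is_named_isolate", "length"]
)


def seed_preference_key(rec):
    """Sort key that places preferred seed sequences first.

    Python sorts ascending, so the booleans are negated (False < True)
    and length is negated to put longer sequences ahead of shorter ones.
    """
    return (not rec.in_previous_release, not rec.is_named_isolate, -rec.length)


def sort_for_otu_picking(records):
    """Return records ordered by the three preferences listed above."""
    return sorted(records, key=seed_preference_key)
```

Because sorted() is stable, records that tie on all three criteria keep their
input order.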

OTUs were picked using QIIME (Caporaso, et al 2010) and UCLUST (Edgar 2010).

Initial tree construction was performed using FastTree (Price, et al 2010) with
the representative sequences from the 99% OTUs. A donor taxonomy based on the 
previous Greengenes release was decorated onto the tree. Visual inspection 
of this tree resulted in the identification of additional chimeric sequences. 
These sequences were flagged and dropped from the cleansed sequence set. OTUs
were repicked with the same criteria as before. Trees were constructed with the
full set and 99% OTU set.

Taxonomy curation was performed on the 99% tree and expanded out to the full 
set of sequences.

File listing and descriptions. All files are prefixed with gg_<release>, where
<release> is composed of <year>_<month>. For instance, gg_12_8 refers to 
the Greengenes release for August 2012.
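
As a small illustration of this naming scheme, the release can be recovered
from a file name with a regular expression. This is a hedged sketch: the
function name is hypothetical, and it assumes a two-digit year in the 2000s
and the gg_<year>_<month> form shown in the example above.

```python
import re

# Assumes file names start with gg_<yy>_<m>, as in the gg_12_8 example.
_RELEASE_RE = re.compile(r"^gg_(\d{2})_(\d{1,2})(?:[_.]|$)")


def parse_release(filename):
    """Return (year, month) for a Greengenes file name, or None.

    The two-digit year is assumed to fall in the 2000s.
    """
    match = _RELEASE_RE.match(filename)
    if match is None:
        return None
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return year, month
```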

00README
    - This file
00CHANGELOG
    - Important changes since the last release
00ROADMAP
    - Planned changes and additions including release dates
00STATS
    - Quick stats on the number of sequences included at various similarity 
        levels
gg_12_8_otus_99_annotated.tree.gz
    - 99% OTU rooted tree with taxonomy decorated
    - Phylogenetic reconstruction performed with FastTree (Price, et al 2010)
    - Base taxonomy decorated with tax2tree (McDonald, et al 2011)
gg_12_8_taxonomy.txt.gz
    - Full taxonomy for every tip
    - All taxonomy strings are strictly 7 level and prefixed
gg_12_8.fasta.gz
    - Unaligned sequences corresponding to the tips comprising the full tree
gg_12_8_aligned.fasta.gz
    - Aligned sequences corresponding to the tips comprising the full tree
    - Sequences are in NAST 7682 width
    - Sequences aligned with SSU-Align (Nawrocki 2009). As a consequence
        of the alignment, small structural differences are potentially cut out.
        It is not recommended to use this alignment for probe design.
gg_12_8_genbank.map.gz
    - A mapping from Greengenes IDs to Genbank accessions
gg_12_8_otus.tgz
    - QIIME (Caporaso, et al 2010) compatible OTUs
    - Sequences are sorted to place seed preference on:
        - in the previous release
        - named isolates
        - longer sequences
    - Clusters are determined using QIIME-wrapped UCLUST (Edgar 2010)
    - Code can be obtained from https://github.com/qiime-dev/nested_reference_otus
        - commit 821a98df6773ea4e4d209af20b9a8cf34d00324
gg_12_8.sql.gz
    - Full Greengenes records
    - This is a mysqldump. The user is 'greengenes' without a password. The 
        database is named 'greengenes'
    - NOTE: This database currently contains sequence information only for
        those records included in the release; however, all examined Genbank
        records that contained alignable 16S are described. This is a work in
        progress, with additional record data and functionality to be added in
        subsequent releases.
gg_12_8_chimeras.txt
    - The current chimera blacklist
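
As an example of working with these files, the taxonomy file can be read into
a dictionary keyed on tip ID. This is a minimal sketch under stated
assumptions, not an official parser: it assumes one tab-separated
"<id><TAB><taxonomy>" record per line, levels separated by ";", and the rank
prefixes k__, p__, c__, o__, f__, g__ and s__. This README states only that the
strings are strictly 7 level and prefixed, so check these details against the
file itself.

```python
import gzip

# Assumed rank prefixes, in order from kingdom to species.
RANK_PREFIXES = ("k__", "p__", "c__", "o__", "f__", "g__", "s__")


def parse_taxonomy_string(tax_string):
    """Split a 7-level prefixed taxonomy string into a list of names.

    Raises ValueError if the string does not have exactly 7 levels or a
    level lacks its expected prefix. Unnamed levels (e.g. "s__") are
    returned as empty strings.
    """
    levels = [level.strip() for level in tax_string.split(";")]
    if len(levels) != len(RANK_PREFIXES):
        raise ValueError("expected 7 levels, got %d" % len(levels))
    names = []
    for prefix, level in zip(RANK_PREFIXES, levels):
        if not level.startswith(prefix):
            raise ValueError("level %r lacks prefix %r" % (level, prefix))
        names.append(level[len(prefix):])
    return names


def read_taxonomy(path):
    """Map each tip ID to its parsed taxonomy from a gzipped file.

    Assumes one tab-separated record per line: tip ID, then taxonomy.
    """
    taxonomy = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            tip_id, tax_string = line.split("\t", 1)
            taxonomy[tip_id] = parse_taxonomy_string(tax_string)
    return taxonomy
```

For instance, read_taxonomy("gg_12_8_taxonomy.txt.gz") would return a mapping
from each tip ID to a list of seven names, one per rank.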
    
References:

Caporaso JG, Kuczynski J, Stombaugh S, Bittinger K, Bushman FD, Costello EK, Fierer N, Gonzalez A, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J and Knight R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods. doi:10.1038/nmeth.f.303

Price MN, Dehal PS and Arkin AP. (2010). FastTree 2 -- Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE 5(3): e9490. doi:10.1371/journal.pone.0009490

McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R and Hugenholtz P. (2011). An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME Journal 6: 610-618. doi:10.1038/ismej.2011.139

Nawrocki EP. (2009). Structural RNA Homology Search and Alignment Using Covariance Models. PhD Thesis: Washington University School of Medicine

Edgar RC. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics. doi:10.1093/bioinformatics/btq461