Refined data release for the WoL project
======

Last updated: April 20, 2020

This is a refined version of the original data release of the "Web of Life"
(WoL) project, which is available from the Globus endpoint "WebOfLife". This
directory hosts the data files needed to use WoL in microbiome data analysis:
reference genome sequences, coordinates of protein-coding genes, phylogenetic
trees, and taxonomic and functional annotations.


## Files and directories

 - trees/tree.nwk: Reference phylogeny of 10,575 microbial genomes.
 - genomes/concat.fna.xz: Concatenated DNA sequences of the reference genomes.
 - proteins/coords.txt.xz: Coordinates of protein-coding genes on the genomes.
 - taxonomy: Taxonomic classifications of the genomes from NCBI and GTDB, both
   original and curated based on the tree.
 - function: Functional annotation of the proteins, including UniRef, MetaCyc,
   GO and others.
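
Before building anything, it can help to peek at the compressed files without
unpacking them to disk. The following is a minimal sketch rather than part of
the data release: the function name `count_fasta_records` is hypothetical, and
it assumes only that the genome file is in FASTA format, where every record
starts with a `>` header line.

```shell
# Hypothetical helper (not part of the data release): count FASTA records
# read from standard input by counting header lines, which begin with ">".
count_fasta_records() {
  grep -c '^>'
}
```

It could be used as `xzcat genomes/concat.fna.xz | count_fasta_records`.
Whether the result equals the number of genomes depends on how the sequences
were concatenated, so treat it as a sanity check only. Likewise,
`xzcat proteins/coords.txt.xz | head` previews the coordinates file.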


## Instructions

Using the provided genome sequences, one may build reference databases for
specific metagenomics tools, which can then be used to analyze microbiome
data. For example, the following commands build a Bowtie2 index of the
genomes, which should resemble the pre-built Bowtie2 index provided in the
full data release:

```
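# Create the output directory for the index.
mkdir -p databases/bowtie2
# Decompress the genomes to a temporary FASTA file.
xzcat genomes/concat.fna.xz > /tmp/input.fna
# Build the index with a fixed seed and 16 threads.
bowtie2-build --seed 42 --threads 16 /tmp/input.fna databases/bowtie2/WoLr1
# Remove the temporary FASTA file.
rm /tmp/input.fna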
```
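
After the build finishes, one may want to confirm that the index is complete
before starting a long alignment job. `bowtie2-build` writes six files per
index, with the suffixes `.1`, `.2`, `.3`, `.4`, `.rev.1` and `.rev.2`,
followed by `.bt2` for a small index or `.bt2l` for a large one (used when the
reference is too long for 32-bit offsets). The sketch below checks for either
form; the function name `check_bowtie2_index` is hypothetical, and which form
is produced for this dataset is not stated here.

```shell
# Hypothetical helper: succeed only if all six Bowtie2 index files exist and
# are non-empty for the given prefix, as a small (.bt2) or large (.bt2l) index.
check_bowtie2_index() {
  prefix=$1
  for ext in bt2 bt2l; do
    missing=0
    for part in 1 2 3 4 rev.1 rev.2; do
      [ -s "$prefix.$part.$ext" ] || missing=1
    done
    if [ "$missing" -eq 0 ]; then
      echo "complete $ext index: $prefix"
      return 0
    fi
  done
  echo "incomplete or missing index: $prefix" >&2
  return 1
}
```

For the index built above, the call would be
`check_bowtie2_index databases/bowtie2/WoLr1`. This only checks that the files
are present; `bowtie2-inspect -s` on the same prefix gives a summary of the
indexed sequences.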

Once the database is built, one can run Bowtie2 to align input sequencing data
to the reference genomes. For example, the following commands run Bowtie2 with
parameters optimized for shotgun metagenomic data, as suggested in the SHOGUN
pipeline:

```
bowtie2 -p 16 -x databases/bowtie2/WoLr1 -f input.fa -S output.sam \
  --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" \
  --score-min "L,0,-0.05" --no-head --no-unal --seed 42
```
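
For a quick look at the result before any downstream analysis, the alignments
can be tallied per reference sequence with standard tools. This is a rough
sketch, not a substitute for a proper profiling tool: the function name
`count_alignments_per_ref` is hypothetical, and it relies only on the SAM
format, in which column 3 holds the reference name and header lines start
with `@`.

```shell
# Hypothetical helper: count alignments per reference sequence in a SAM file.
# Column 3 (RNAME) holds the reference name; header lines start with "@".
# Prints "reference<TAB>count", most frequent first, ties sorted by name.
count_alignments_per_ref() {
  awk -F'\t' '$0 !~ /^@/ && $3 != "*" { n[$3]++ }
              END { for (r in n) printf "%s\t%d\n", r, n[r] }' "$1" |
    sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

A call such as `count_alignments_per_ref output.sam | head` lists the most
frequently hit references. Because `-k 16` lets Bowtie2 report up to 16
alignments per read, these numbers count alignments, not reads.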

The resulting alignment file can then be analyzed against the reference tree
and the taxonomic and functional annotations using the program Woltka. A
tutorial is available on the Woltka website:

 - https://github.com/qiyunzhu/woltka


## Contact

 - Project leader: Dr. Qiyun Zhu (qiyun.zhu@asu.edu)
 - Senior PI: Dr. Rob Knight (robknight@ucsd.edu)

 - Knight Lab  
   Department of Pediatrics  
   University of California San Diego  
   9500 Gilman Drive, MC 0763  
   La Jolla, CA 92093-0763 USA  
   Tel: (858) 822-2379  
   Fax: (858) 246-1981