# Author: Daniel McDonald
# Contact: danielmcdonald@ucsd.edu
# Date: 20-Nov-2020

American Gut 16S sequencing data
================================

The files here contain all of the per-sample 16S sequence data from the American Gut Project. All samples were sequenced using the Earth Microbiome Project 16S protocol (https://www.protocols.io/view/emp-16s-illumina-amplicon-protocol-nuudeww). Additional preparation details can be found in the AGP manuscript (https://msystems.asm.org/content/3/3/e00031-18) and in the preparation files in the "preparation-templates" directory. At least one preparation is single-end rather than paired-end.

Most samples were sequenced on the Illumina MiSeq, although some were run on the HiSeq. Almost all runs used a 150-cycle kit, but at least one run used a 125-cycle kit.

The files themselves are in the QIIME 2 QZA format: zip files that contain contextual information as well as the per-sample gzip'd FASTQ data. The contents can be extracted either with QIIME 2's export (https://docs.qiime2.org/2020.8/tutorials/exporting/) or with "unzip". More detail on QIIME 2 can be found in the associated manuscript (https://www.nature.com/articles/s41587-019-0209-9) or at the website (https://qiime2.org/).
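Because a QZA file is a zip archive, the FASTQ data can also be pulled out with a few lines of Python. The following is a minimal sketch, not part of the release: the function names are hypothetical, and it assumes only that the per-sample reads are stored as members ending in ".fastq.gz" somewhere inside the archive.

```python
import zipfile


def list_fastq_members(qza_path):
    """Return the names of the gzip'd FASTQ members inside a QZA archive."""
    with zipfile.ZipFile(qza_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith(".fastq.gz")
        )


def extract_fastq(qza_path, out_dir):
    """Extract only the FASTQ members of a QZA archive into out_dir."""
    members = list_fastq_members(qza_path)
    with zipfile.ZipFile(qza_path) as archive:
        # The internal directory layout of the archive is preserved under out_dir.
        archive.extractall(out_dir, members=members)
    return members
```

QIIME 2's own export remains the safer route when the contextual information in the archive is also needed.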

Demultiplexing of the sequence data was performed using q2-demux (https://github.com/qiime2/q2-demux). A development branch from 2020.8 was used, which contained fixes for two minor bugs discovered during demultiplexing.

Filename structure
==================

The sample metadata are contained in "10317_20201030-084631.tsv".

The sequence data are in the .qza files. Each filename describes the Qiita preparation ID and the run prefix for the sequence data: <number>-<name>.qza.
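The filename structure can be split programmatically. This is a minimal sketch that assumes <number> is the Qiita preparation ID and <name> is the run prefix, and that the run prefix may itself contain hyphens; the function name is hypothetical.

```python
import os


def parse_qza_filename(path):
    """Split '<number>-<name>.qza' into (preparation_id, run_prefix)."""
    base = os.path.basename(path)
    if not base.endswith(".qza"):
        raise ValueError(f"not a .qza file: {base}")
    stem = base[: -len(".qza")]
    # Split on the first hyphen only, so hyphens in the run prefix survive.
    number, sep, name = stem.partition("-")
    if not sep or not number.isdigit() or not name:
        raise ValueError(f"unexpected filename structure: {base}")
    return int(number), name
```

For a made-up file called "1234-example_run.qza", this would return the preparation ID 1234 and the run prefix "example_run".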

The "preparation-templates" directory contains files describing preparation-specific information for a given preparation and run prefix pair.

More information about Qiita can be found here (https://qiita.ucsd.edu/static/doc/html/gettingstartedguide/index.html). In brief, the preparation ID denotes a specific set of raw sequence data and contextual information about those data. The run prefix is used to map sequence filenames to the set of samples they contain. These items are important as:

- sample IDs are assured to be unique within a preparation, but are not assured to be unique among preparations
- a preparation can be composed of multiple sequencing runs
- a preparation can be composed of a partial sequencing run

Generally, a single preparation is a single sequencing run. However, that is not always the case, as samples are sometimes multiplexed with other studies.

In the American Gut, the most common reason for a sample to exist in multiple 16S preparations is that it initially failed to yield sufficient sequence data to produce results for a participant and was sequenced again. For analysis, there are two practical strategies for samples represented multiple times: either merge the sequence data, or select the preparation-specific data with the most reads.
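The second strategy, selecting the preparation with the most reads, can be sketched as follows. This is an illustration under assumptions: the per-preparation read counts are taken as already computed and keyed by (preparation ID, sample ID), the function name is hypothetical, and ties are resolved in favor of whichever preparation is seen first.

```python
def select_deepest(read_counts):
    """Pick, for each sample ID, the preparation with the most reads.

    read_counts maps (preparation_id, sample_id) -> number of reads.
    Returns a dict mapping sample_id -> preparation_id.
    """
    best = {}
    for (prep_id, sample_id), n_reads in read_counts.items():
        current = best.get(sample_id)
        # Strict comparison: on a tie, the first preparation seen is kept.
        if current is None or n_reads > current[1]:
            best[sample_id] = (prep_id, n_reads)
    return {sample_id: prep_id for sample_id, (prep_id, _) in best.items()}
```

Keying on the (preparation ID, sample ID) pair rather than on the sample ID alone keeps repeated sample IDs from silently overwriting each other.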

Important considerations
========================

The sample set here includes fecal, oral, and skin samples, as well as some pet, environmental, food, and control samples.

An individual participant might contribute multiple samples, and some individuals have a large number of samples in the dataset. The "host_subject_id" variable in the sample metadata is a stable, anonymized identifier for a participant.

Most of the collection devices used for these specimens are dry swabs, and fecal samples collected this way have known issues with blooms during shipping (https://msystems.asm.org/content/2/2/e00199-16). Prior to analysis, we advise removing the known blooms from the dataset. The amplicon sequence variants we advise removing can be found here (https://github.com/knightlab-analyses/bloom-analyses/blob/master/data/newbloom.all.fna). For meta-analyses, we advise removing these blooms from all datasets to avoid introducing a technical difference.
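One way to apply the bloom list is sketched below. It is a simplified illustration, not the procedure from the bloom manuscript: it assumes the amplicon sequence variants have already been trimmed to a fixed length (the hypothetical trim_length parameter) that is no longer than the bloom sequences, and it removes only exact matches on that prefix.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_blooms(asv_counts, bloom_seqs, trim_length):
    """Drop ASVs whose first trim_length bases exactly match a bloom sequence.

    asv_counts maps an ASV sequence to its count.
    """
    blooms = {seq[:trim_length].upper() for seq in bloom_seqs}
    return {
        asv: count for asv, count in asv_counts.items()
        if asv[:trim_length].upper() not in blooms
    }
```

The bloom sequences would be read from a local copy of the FASTA file linked above, for example with `[seq for _, seq in read_fasta(open(path))]`, where `path` is wherever that file was saved.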

The sample metadata are noisy for some fields. Any value recorded as "LabControl test" can be treated as a null. A data dictionary for the metadata can be found here (https://msystems.asm.org/content/msys/3/3/e00031-18/DC3/embed/inline-supplementary-material-3.xlsx?download=true).
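The null handling described above can be applied while loading the metadata. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes the metadata file is tab-separated with a header row.

```python
import csv

NULL_MARKER = "LabControl test"


def load_metadata(handle):
    """Read tab-separated sample metadata, mapping the null marker to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            column: (None if value == NULL_MARKER else value)
            for column, value in row.items()
        }
        for row in reader
    ]
```

With the rows loaded this way, grouping by "host_subject_id" gives the set of samples contributed by each participant.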