Tracing the origin of SARS-CoV-2 Omicron-like Spike sequences detected in wastewater

Introduction

Files linked in this portal represent the data generated while investigating the "Wisconsin wastewater lineage" and tracking it down to a single facility. This work is outlined in a medRxiv preprint [Link] and has been accepted by The Lancet Microbe.

Sequences

Over the course of the project, sequencing data were generated in four different ways. A brief overview of the four sequencing techniques follows; the files themselves are linked below it:

1) Illumina - Whole Genome Sequencing (Illumina_WGS)

These sequence reads were generated by the Wisconsin State Laboratory of Hygiene [WSLH] using an Illumina MiSeq instrument [learn more] from 425 amplicons covering the whole SARS-CoV-2 genome, amplified from wastewater samples with the QIAseq DIRECT SARS-CoV-2 Kit A [learn more]. Illumina_WGS fastq files contain these unaligned sequence reads. Each sample should have both an R1 and an R2 fastq file, which result from Illumina's paired-end sequencing technique [learn more]. Illumina reads are short but highly accurate.
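Because each sample is expected to have both an R1 and an R2 file, a quick completeness check after downloading can be useful. The following is a minimal sketch, not part of the project's own tooling. It assumes file names of the form `<sample>_R1...fastq(.gz)` and `<sample>_R2...fastq(.gz)`; the function name `find_unpaired` and the naming pattern are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed naming pattern (hypothetical): <sample>_R1 or <sample>_R2, an optional
# numeric suffix such as _001, then .fastq or .fastq.gz
READ_TAG = re.compile(r"^(?P<sample>.+?)_(?P<read>R[12])(?:_\d+)?\.fastq(?:\.gz)?$")


def find_unpaired(filenames):
    """Return {sample: missing_read} for samples that lack an R1 or an R2 file."""
    seen = defaultdict(set)
    for name in filenames:
        match = READ_TAG.match(name)
        if match:  # names that do not fit the assumed pattern are ignored
            seen[match.group("sample")].add(match.group("read"))
    missing = {}
    for sample, reads in seen.items():
        for read in ("R1", "R2"):
            if read not in reads:
                missing[sample] = read
    return missing


# Hypothetical file names, for illustration only
names = ["siteA_R1.fastq.gz", "siteA_R2.fastq.gz", "siteB_R1.fastq.gz"]
print(find_unpaired(names))  # {'siteB': 'R2'}
```

An empty dictionary means every recognized sample has both mates.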

2) Illumina - Alternate RBD Amplicon (Illumina_altRBD)

These sequence reads were generated by the Marc Johnson Laboratory at the University of Missouri using an Illumina MiSeq instrument (see above). Unlike the whole-genome sequences from WSLH, these sequences cover just one amplicon, which spans the receptor binding domain (RBD) region of the SARS-CoV-2 spike gene. The custom "alternate" primer sets used to generate this RBD amplicon were designed to exclude sequences bearing Omicron lineage-specific mutations. The primer sequences are listed in the primer manifest table here [link].

3) Oxford Nanopore Technology - Whole Genome Sequencing (ONT_WGS)

These sequence reads were generated by the David O'Connor Laboratory at the University of Wisconsin-Madison [learn more] using Oxford Nanopore Technologies' (ONT) SARS-CoV-2 Midnight protocol [learn more] and their GridION and MinION instruments [learn more]. The Midnight protocol amplifies 29 amplicons of about 1.2 kb each that together cover the whole SARS-CoV-2 genome. ONT_WGS fastq files contain these unaligned sequence reads. ONT instruments can sequence much longer amplicons than Illumina instruments and do not produce paired-end reads. ONT reads, however, have a higher error rate than Illumina reads.
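The difference in error rate between platforms shows up in the quality line of each fastq record, where every character encodes a Phred score Q related to the error probability p by Q = -10 log10(p). As a rough sketch, assuming the common Phred+33 encoding and unwrapped four-line fastq records, the mean per-base error probability of a read could be estimated as follows. The function names are illustrative and not part of any project script.

```python
def quality_lines(fastq_lines):
    """Yield the quality string of each record, assuming 4 lines per record."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:  # 4th line of every record holds the qualities
            yield line.rstrip("\n")


def mean_error_probability(quality_line, offset=33):
    """Mean per-base error probability implied by one quality string.

    Assumes Phred+offset encoding (offset 33 by default).
    """
    if not quality_line:
        raise ValueError("empty quality string")
    probabilities = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_line]
    return sum(probabilities) / len(probabilities)
```

Averaging probabilities rather than Q scores is a deliberate choice here: a few low-quality bases dominate the expected number of errors, and a mean of Q scores would hide them.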

4) PacBio - Extended Spike Amplicon (PacBio_extSPIKE)

These sequence reads were generated by the David O'Connor Laboratory and the UW-Madison Biotechnology Center using Pacific Biosciences' (PacBio) Sequel II instrument [learn more]. Custom 1.6 kb and 2.5 kb amplicons covering the SARS-CoV-2 spike gene were sequenced with the instrument's HiFi circular consensus sequencing method [learn more], which can sequence larger amplicons than Illumina instruments while maintaining a high level of fidelity. PacBio_extSPIKE fastq files contain these unaligned sequence reads. The primers used to generate these amplicons can be found in our primer manifest [link].

Files

Linked below are folders containing all files generated by the associated sequencing technique for each collection site. The Wisconsin wastewater lineage was first detected at "Main Plant (POTW)" in January 2022, and eventually traced (using Sub-District lines, City manholes, and Village manholes) to the point source "Facility Line B" in June 2022. See Figure 2 in the manuscript for more information about how individual sites fit into the overall sampling strategy.

Illumina_WGS

Illumina_altRBD

ONT_WGS

PacBio_extSPIKE

Analyses

Facility Line B Illumina whole-genome sequences were run through nf-core/viralrecon to call variants and assess sequence quality. The workflow was initiated with the following code:

# create conda environment
conda create --name nextflow -c bioconda nextflow

# activate environment
conda activate nextflow

# process data
nextflow run nf-core/viralrecon \
--input resources/27660-samples.csv \
--outdir results \
--platform illumina \
--protocol amplicon \
--genome 'MN908947.3' \
--primer_bed resources/QIAseqDIRECTSARSCoV2primersfinal.bed \
--primer_left_suffix '_LEFT' \
--primer_right_suffix '_RIGHT' \
--ivar_trim_offset 5 \
--skip_assembly \
-profile docker

All technical sequencing replicates from each of the three time points (June 2022, August 2022, and September 2022) were processed this way, most recently on February 14th, 2023. The variant call format files (.vcf) and binary alignment map files (.bam) are linked below:

Full logs and file structure for each analysis are linked below (note that these files include all work and results from viralrecon, making these files quite large):

  • Original June and August: [link]
  • June and August replicates: [link]
  • All September files: [link]
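For readers who download the .vcf files mentioned above, the following sketch shows one way to list the called variants without extra dependencies. It relies only on the standard tab-delimited VCF layout (CHROM, POS, ID, REF, ALT as the first five columns); the function name `parse_vcf_records` is illustrative and this is not how the project itself processed the files.

```python
def parse_vcf_records(lines):
    """Yield (chrom, pos, ref, alt) for each data line of an uncompressed VCF.

    Header lines starting with '#' and blank lines are skipped. Multi-allelic
    sites (comma-separated ALT) yield one tuple per alternate allele.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _identifier, ref, alt = fields[:5]
        for allele in alt.split(","):
            yield chrom, int(pos), ref, allele
```

Passing an open text file handle works because the function only iterates over lines; gzip-compressed files would need to be opened with `gzip.open(path, "rt")` first.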

README.md - README file containing instructions for running viralrecon: [link]
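The `--input` option in the viralrecon command above takes a samplesheet CSV. For Illumina data, viralrecon expects the columns `sample`, `fastq_1`, and `fastq_2`. The sketch below shows how such a file could be assembled; the helper name `samplesheet_text` and the example paths are hypothetical and do not reflect the contents of the project's actual `27660-samples.csv`.

```python
import csv
import io


def samplesheet_text(samples):
    """Build samplesheet CSV text from (sample_name, r1_path, r2_path) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names expected by nf-core/viralrecon for Illumina input
    writer.writerow(["sample", "fastq_1", "fastq_2"])
    for name, r1_path, r2_path in samples:
        writer.writerow([name, r1_path, r2_path])
    return buffer.getvalue()


# Hypothetical sample name and paths, for illustration only
rows = [("example_rep1", "reads/example_rep1_R1.fastq.gz", "reads/example_rep1_R2.fastq.gz")]
print(samplesheet_text(rows), end="")
```

Writing the returned text to a file and passing its path to `--input` would complete the step.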

12S and 16S sequencing was performed to determine species contribution to Facility Lines A and B. These data are [linked here].

Scripts

CondenserWI.cur.py - Script used to generate Figure 1 [link to script]

Link to the project’s GitHub page: [link to GitHub]

Figures

fig_1_rbd_evolution.png - .png file for Figure 1: [link]
fig_2_sampling_strategy.ai - .ai file for Figure 2: [link]
fig_3_timepoint_diversity.ai - .ai file for Figure 3: [link]

Supplemental Tables

The folder supplemental_tables contains files that list primer sequences used in this project [link] and SRA Accession IDs for each sequence generated throughout the project [link].