High-Density Amplicon Sequencing Identifies Community Spread and Ongoing Evolution of SARS-CoV-2 in the Southern United States

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is constantly evolving. Prior studies focused on high-case-density locations, such as the northern and western metropolitan areas of the United States. This study demonstrates continued SARS-CoV-2 evolution in a suburban southern region of the United States by high-density amplicon sequencing of symptomatic cases. 57% of strains carry the spike D614G variant, which is associated with higher genome copy numbers, and its prevalence expands with time. Four strains carry a deletion in a predicted stem loop of the 3′ UTR. The data are consistent with community spread within local populations and the larger continental United States. The data instill confidence in current testing sensitivity and validate “testing by sequencing” as an option to uncover cases, particularly nonstandard coronavirus disease 2019 (COVID-19) clinical presentations. This study contributes to the understanding of COVID-19 through an extensive set of genomes from a non-urban setting and informs vaccine design by defining D614G as a dominant and emergent SARS-CoV-2 isolate in the United States.


In Brief
McNamara et al. use next-generation sequencing (NGS) with a high-density tiling array across SARS-CoV-2 to find a deletion and document how the D614G spike protein mutation rapidly swept through a rural/suburban population. D614G is associated with slightly higher viral loads.

INTRODUCTION
The coronavirus disease 2019 (COVID-19) pandemic is an urgent public health emergency, with over 200,000 deaths in the United States alone. COVID-19 is caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Typical symptoms for COVID-19 include fever, cough, shortness of breath, fatigue, myalgias, headache, sore throat, abdominal pain, and diarrhea (Zhou et al., 2020a, 2020b). Patients admitted to the hospital generally have pneumonia and abnormal chest imaging (Bhatraju et al., 2020; Chen et al., 2020). COVID-19 is also associated with other complications, including acute respiratory failure and acute respiratory distress syndrome, which appear to be significant predictors of mortality. Severe COVID-19 is disproportionately observed in the elderly and individuals with underlying comorbidities. COVID-19 has not similarly impacted children (CDC COVID-19 Response Team, 2020; Verdoni et al., 2020; Xu et al., 2020b); however, other SARS-CoV-2 disease manifestations, such as Kawasaki disease, are emerging in this group.
The first reported SARS-CoV-2 clusters appeared in Wuhan, China, and the virus has since rapidly spread across the world (Wu et al., 2020; Zhu et al., 2020). The primary means of transmission is by oral secretions, though viral RNA has also been detected in blood, stool, and semen (Zou et al., 2020). Social distancing, rapid case ascertainment, physical barriers, and quarantine of infected persons have proven successful in limiting the impact of COVID-19. For these public health measures to remain effective and sustainable, it is important to understand the pathways of transmission through aggressive contact tracing and virus testing. Of high concern with regard to SARS-CoV-2 is that the virus may be shed prior to the onset of clinical symptoms, at late times after the cessation of clinical symptoms, and by asymptomatically infected persons (Arons et al., 2020; He et al., 2020; Hijnen et al., 2020; van Doremalen et al., 2020; Wölfel et al., 2020; Xu et al., 2020a). While antibody testing identifies patients with prior exposure (Long et al., 2020), only targeted nucleic acid amplification testing (NAT) or SARS-CoV-2 antigen detection can identify actively transmitting individuals.
The SARS-CoV-2 genome shares 79.6% sequence identity with SARS-CoV, the causative agent of SARS in 2002. It shares 96% sequence identity with a bat coronavirus (BatCoV), RaTG13 (GenBank: MN996532) (Ceraolo and Giorgi, 2020; Lu et al., 2020b; Zhou et al., 2020b). SARS-CoV entry is determined by the spike protein ORF S. ORF S has many interaction surfaces and is the target of neutralizing antibodies. The S protein uses human ACE2 (hACE2) as a receptor and is proteolytically activated by human proteases (Hoffmann et al., 2020; Shang et al., 2020). Comparative analysis shows that, between SARS-CoV-2 and either SARS-CoV or bat-derived SARS-like coronavirus (bat SARS-CoV), sequence identity is lowest for the spike protein gene (S) (Andersen et al., 2020; Wu et al., 2020). SARS-CoV-2 has a longer spike protein than bat SARS-CoV, human SARS-CoV, and Middle East respiratory syndrome coronavirus (MERS-CoV) (Lu et al., 2020b). Although SARS-CoV-2 and SARS-CoV only share 79% identity at the whole-genome scale, their spike protein receptor binding site sequences are more similar to each other than to those of bat SARS-CoV and MERS-CoV (Lu et al., 2020b). Residues at the receptor-binding site have evolved for better association with ACE2 compared to SARS-CoV (Wrapp et al., 2020), which can be attributed to the following molecular features: five of the residues critical for binding to ACE2 are different in SARS-CoV-2 as compared to SARS-CoV (Wrapp et al., 2020), and a functional polybasic cleavage site (RRAR) is present at the S1/S2 boundary of the SARS-CoV-2 spike protein (Andersen et al., 2020; Walls et al., 2020). The polybasic cleavage site allows for effective cleavage by furin and other proteases, which is important for viral infectivity (Letko et al., 2020). The additional proline may also result in the addition of O-linked glycans to S673, T678, and S686, which can be important in shielding key epitopes or residues (Andersen et al., 2020).
Ascertaining whether these key residues remain invariable as the pandemic progresses or evolve over time is crucial to ensure testing accuracy and rational vaccine design.
Phylogenetic analysis translates viral genome sequences into a hierarchical classification based on sequence similarity. Early analyses established SARS-CoV-2 as a Sarbecovirus, in the same clade as BatCoVs, substantiating its use as an outgroup here (Jaimes et al., 2020). Initial analyses of human SARS-CoV-2 genomes established three major variant types worldwide (Forster et al., 2020). Clade B was derived from clade A by a synonymous T8782C mutation in ORF1ab and a nonsynonymous C28144T mutation that changes a leucine to serine in ORF8 (Ceraolo and Giorgi, 2020; Forster et al., 2020). Clade C was derived from clade B by a nonsynonymous G26144T mutation that changes a glycine to valine in ORF3a. A and C types are mainly found in Europe and the United States. B type is mainly found in East Asia. Other analyses arrived at different clades and, unfortunately, different naming conventions (Zhang et al., 2020). Additional clades have since been recognized, including clade G, which is defined by a nonsynonymous single-nucleotide variant (SNV) in spike protein at amino acid position 614. Multiple groups continue to study SARS-CoV-2 sequence evolution based on an ever-increasing set of sequences collected at GISAID (GISAID, 2020; Shu and McCauley, 2017), GenBank, and Nextstrain (Hadfield et al., 2018). The phylogenetic analysis of SARS-CoV-2 is very much in flux. Analyses represent a snapshot of the time of prepublication. The clade designations used here were derived from GISAID at the time of data analysis.
To provide finer granularity about biological changes during SARS-CoV-2 transmission, we employed next-generation sequencing (NGS) as an independent screening modality. This allowed us to reconstruct the mutational landscape of cases seen at a tertiary clinical care center in the southeastern United States from the start of the North Carolina (NC) epidemic on March 3, 2020, until past the peak of the first major wave of infections. The samples cover the period when community spread in NC was established and when the state-wide stay-at-home order was issued (March 30 to May 8, 2020).
SARS-CoV-2 testing remains limited in many countries due to a shortage of personal protective equipment, testing kits, and diagnostic capacity. The Centers for Disease Control and Prevention (CDC) guidelines during the time of sampling prioritized patients with specific clinical symptoms (fever, cough, and shortness of breath) and curtailed testing to only a subset of all probable cases. Individuals not fitting the clinical criteria for testing, as well as asymptomatic individuals, were excluded. To evaluate if any cases were missed because of this triage algorithm, nasopharyngeal (NP) swabs for three groups of patients were evaluated (n = 175 known SARS-CoV-2-positive NP samples, n = 41 known SARS-CoV-2-negative NP samples, and n = 12 NP samples of unknown status [i.e., the patient had symptoms justifying sample collection but was not prioritized for clinical SARS-CoV-2 testing]). “Testing by sequencing” was negative for all negative samples, less sensitive for weakly positive samples, and uncovered five new cases among previously untested samples. The index case in NC was linked to the US outbreak in the state of Washington. Phylogenetic analyses established the dominance of the S protein D614G SNV among this population, which has been increasing over time through community spread and was introduced initially by a person returning from Europe.

Whole-Genome SARS-CoV-2 Sequencing through High-Density Amplicons
The University of North Carolina at Chapel Hill Medical Center (UNCMC) used one of two NATs to test for the presence of SARS-CoV-2 RNA: a laboratory-developed test based on the protocol by Corman et al. (2020) or the commercially available Abbott real-time SARS-CoV-2 assay, both under the emergency use authorization (EUA) provision of the US Food and Drug Administration. Both tests report the presence or absence of SARS-CoV-2 RNA. Remnant NP samples were subjected to targeted sequencing using the Thermo Fisher AmpliSeq SARS-CoV-2 assay and S5 Ion Torrent sequencing platform. A subset of isolates was subjected to 3′ and 5′ rapid amplification of cDNA ends (RACE) followed by Sanger sequencing to verify the sequences in the highly structured untranslated regions of the genome. Individual sequence reads were mapped to the SARS-CoV-2 reference sequence (NC_045512), a strain-specific consensus sequence was generated, and SNVs were recorded. The finished genomes were submitted to GenBank and GISAID and were named according to convention (Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, 2020).
A total of n = 175 known positive samples and a positive control (full-length genomic RNA from strain SARS-CoV-2/human/USA-WA1/2020; GenBank: MN985325) were subjected to NGS. The number of mapped reads varied substantially across samples, reflecting the differences in the amount of virus per sample. The distribution of 10× coverage for all samples is presented in Figure 1A (see also Table S1). As expected, more mapped reads yielded higher coverage. Of the 33 negative controls, none had >10^2 total reads aligned. Of the positive samples, >5 × 10^3 total mapped reads were needed to obtain 1× coverage of the whole genome, and a minimum of 3.1 × 10^4 reads were needed to obtain >90% coverage at 10×. The number of reads aligned varied depending on the viral load, as determined by real-time qPCR using CDC primer N1, but not total RNA, as determined using RNase P, of the samples (Figure 1B). In this assay, any crossing point (CP) <35 for SARS-CoV-2 qPCR yielded reliable coverage, which increased linearly with viral load. At a CP ≥35, most positive samples still yielded reads that mapped to the target genome and thus allowed detection of SARS-CoV-2 sequences; however, the results were less consistent, and coverage was more variable. As expected, total RNA (measured by RNase P) was not associated with sequencing coverage and varied considerably across samples, even though each sample used the same amount of virus transport medium (VTM).
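The coverage thresholds quoted above (1× and 10×) refer to the fraction of genome positions reached by at least that many reads. As a minimal sketch of that calculation, assuming a per-position depth list is already available (the function name and the toy depth values are hypothetical and not taken from the study's pipeline):

```python
def breadth_of_coverage(depths, min_depth=10):
    """Return the fraction of positions covered by at least min_depth reads."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy 10-position "genome"; the depth values are made up for illustration.
toy_depths = [0, 3, 12, 25, 40, 38, 22, 15, 9, 11]

breadth_1x = breadth_of_coverage(toy_depths, min_depth=1)    # positions with >= 1 read
breadth_10x = breadth_of_coverage(toy_depths, min_depth=10)  # positions with >= 10 reads
```

In practice the depth list would come from a per-base depth report of the aligned reads; how that report was produced in this study is not specified here.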
The coverage level distribution is shown in Figures 1C and 1D. Figure 1C represents the subset of samples for which high-quality genomes were submitted to GenBank and GISAID. Figure 1D represents samples with more variable and incomplete coverage. These samples were nevertheless included in SNV calling, as the SNV algorithm relies on local coverage rather than overall coverage. As a result, the variant calls represent a conservative estimate of SNV distribution in this sample set. Figure 1E shows the per nucleotide coverage for all genomes with median coverage ≥5,000×. Median coverage of >5,000× was required to ensure >99% genome coverage without a single amplicon dropout. The nucleotide composition of SARS-CoV-2 was largely balanced and did not contain repeats larger than sequencing length. Hence, coverage was continuous across the genome, except for the 5′ and 3′ UTRs. Targeted amplification using this primer set missed the first 42 nt at the 5′ end and 29 nt, starting at 29,843, at the 3′ end of the viral genome. These regions are conserved across most SARS-CoV-2 sequences in GenBank, many of which are themselves incomplete or known to suffer amplification bias (van Dorp et al., 2020). The limiting factor was not sequencing depth per se; rather, samples of low viral load failed in the targeted amplification step for individual amplicons. Samples with low viral load were resequenced.
A subset of positive samples (n = 33) were independently resequenced and yielded 251 high-confidence SNVs. No new SNVs were uncovered upon resequencing; 180 SNVs were confirmed and 71 SNVs were lost upon pooling multiple sequencing runs for the same sample due to the frequency dropping below 90%. Of the 71 SNVs, 50 possessed a majority vote matching the reference and 21 possessed a majority vote matching the prior SNV call. Target capture efficiency was verified using multiple dilutions and compared to unbiased RNA sequencing (RNA-seq) of the reference strain SARS-CoV-2/human/USA-WA1/2020 (Figure S1). Targeted sequencing coverage was uniform over a 50-fold range of input RNA; it was higher than RNA-seq, except in the terminal regions that were not covered by PCR amplicons. In some cases, as little as 5 µL VTM from a single swab had sufficient virus to obtain a full-length viral genome sequence at 1,000×. These data are consistent with the astonishingly high reported genome copy numbers of SARS-CoV-2 in some cases and demonstrate the suitability, in principle, of testing by sequencing as a diagnostic option for SARS-CoV-2 and other rapidly evolving viruses.
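To illustrate how an SNV can be lost on pooling yet still carry a majority vote for the variant, the sketch below pools read counts from two runs of one site. It assumes that pooling means summing variant-supporting and total reads across runs, and that the majority vote is taken over the pooled reads; neither detail is stated in the text, and the function name and read counts are hypothetical.

```python
def pooled_call(runs, threshold=0.90):
    """Pool (variant_reads, total_reads) pairs from several runs of one site.

    Assumption: pooling sums the read counts across runs before the
    frequency threshold is applied.
    """
    variant_reads = sum(v for v, _ in runs)
    total_reads = sum(t for _, t in runs)
    if total_reads == 0:
        raise ValueError("no reads cover this site")
    frequency = variant_reads / total_reads
    return {
        "frequency": frequency,
        "high_confidence": frequency >= threshold,  # passes the 90% cutoff
        "majority_is_variant": frequency > 0.5,     # simple majority vote
    }


# Hypothetical site: 95% variant reads in run 1, 80% in run 2.
result = pooled_call([(19, 20), (16, 20)])
```

With these made-up counts the pooled frequency is 35/40 = 87.5%, so the call drops below the 90% cutoff although most reads still support the variant.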
The average quality score per read was set to a minimal average phred score of 20, corresponding to a sequencing error rate of 1%, and to a false-positive probability for any individual base of 0.1% and a true-positive probability of 99.9%. Using a theoretical model (Petrackova et al., 2019) based on the binomial distribution, a minimal coverage of 10× was expected to be sufficient to call SNVs with an allele frequency of ≥90%.
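The binomial reasoning can be sketched as follows. This is a simplified illustration, not the model of Petrackova et al. (2019): it assumes independent per-base errors at the phred-20 rate of 1%, and it pessimistically treats every error as producing the same alternate base. The function names are hypothetical.

```python
from math import comb


def phred_to_error(q):
    """Convert a phred quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


error_rate = phred_to_error(20)  # 0.01 for phred 20
depth = 10                       # minimal coverage considered in the text
min_variant_reads = 9            # >= 90% of 10 reads

# Chance that sequencing errors alone produce >= 9 variant reads out of 10.
false_call = prob_at_least(min_variant_reads, depth, error_rate)

# Chance that a truly fixed variant still shows >= 9 variant reads out of 10
# when each read is correct with probability 1 - error_rate.
true_call = prob_at_least(min_variant_reads, depth, 1 - error_rate)
```

Under these assumptions the false-call probability at a single site is vanishingly small, while a fixed variant is still called in more than 99% of cases, which is in line with treating 10× as sufficient for variants at ≥90% frequency.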
Twelve samples were collected during the same calendar period from individuals presenting with respiratory complaints but no indication for SARS-CoV-2 testing according to CDC guidelines. Five of 12 yielded >5% genome coverage (Figures S2A-S2L). The remainder had reads aligned only to regions of the genome that have low complexity; 2 out of 12 had a sequence coverage distribution, at 57% and 34%, respectively, consistent with the presence of the target virus. Three other samples had coverage of 20%, 13%, and 10%. At the time of study, SARS-CoV-2 testing guidelines were extremely restrictive due to a lack of supplies. Patients with clear clinical symptoms of COVID-19 were not tested but treated on the basis of clinical diagnosis alone, and patients with respiratory symptoms not exactly matching CDC/COVID-19 criteria were not tested either. None of the samples in this study originated from asymptomatic patients. Though the number of unknowns tested was small, the results suggest that limiting testing to narrowly defined case criteria misses a significant number of cases and thus transmission events.

Sequence Analysis Reveals the Presence of Two Clades of SARS-CoV-2
Putting individual sequences into context is key to understanding SARS-CoV-2 transmission. Sequencing identified n = 139 samples with at least one high-confidence SNV as compared to the reference sequence. Of these, n = 79 (57%) carried the S protein D614G SNV, a mutation implicated in higher pathogenicity of the virus (Becerra-Flores and Cardozo, 2020). Samples carrying the D614G SNV had higher SARS-CoV-2 genome loads as measured by CDC N3-primer directed real-time RT-qPCR for SARS-CoV-2 (p ≤ 0.002 by Wilcoxon signed rank test). A similar, but not significant, trend emerged using CDC N1-primer directed real-time RT-qPCR for SARS-CoV-2, but not for total RNA levels as measured by CDC RNase P-directed real-time RT-qPCR (Figures S2M-S2O). Figure 2A shows the SNV distribution of the data, color-coded by the week of collection. These data include high-confidence SNVs of genomes with <99% coverage, whereas the phylogenetic reconstructions are only based on complete genomes (≥99% coverage) that were submitted to GenBank (and also present in GISAID). This SNV distribution was dominated by isolates representing clade A and some of clade B, the dominant clades in North America and Europe (Forster et al., 2020). The NC stay-at-home order was enacted on March 30, 2020, and the sample collection concluded on April 11, 2020 (i.e., covering a period of unrestrained local spread). The SNV pattern is consistent with the idea that SARS-CoV-2 was introduced into NC by travelers from the continental United States and that this population was in equilibrium with the general population of the United States. Unlike retroviruses, such as human immunodeficiency virus (HIV), or hepatitis C virus, CoVs do not exist as co-existing sequence swarms within a person, since CoVs employ a proofreading RNA-dependent RNA polymerase (Agostini et al., 2018; Graham et al., 2012).
Rather, a single variant seems to dominate the transmission events. Consistent with the biology of CoV, this study did not find widespread evidence of minor SNVs. Figure S1B shows the analysis of lower frequency variants (down to 70% frequency). The majority of high-quality (phred score ≥20) SNVs called were present at >90% frequency (n = 1,100). Including SNVs with a frequency of 80%-90% added n = 61 additional variants (5.0%). Including SNVs with a frequency of 70%-80% added an additional 40 variants (3.3%).
Cell Reports 33, 108352, November 3, 2020
One limitation of all targeted sequencing efforts is the large number of PCR amplifications that are conducted to enrich for virus sequences prior to building the library. To explore the effect of amplicon-PCR-induced duplications on sequencing accuracy, we repeated our analysis using only unique reads and obtained the same high-prevalence SNVs. Amplicon duplications became prominent at read counts >10^4.5 (Figure S1B). As the SARS-CoV-2 genome is ~3 × 10^4 nt and the median read length was 204 ± 29 nt (mean ± SD), this threshold corresponds to ~200-fold median coverage. This suggests that only deduplicated reads should be used in amplicon sequencing and that requiring extraordinary levels of sequence coverage may introduce a bias of oversampling, which is well recognized in the bacterial 16S sequencing field.
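The conversion from the read-count threshold to fold coverage can be checked with a short calculation. This sketch uses the rounded values quoted in the text and the simplifying assumption that reads are spread evenly across the genome, so it gives a mean rather than a median coverage; the function name is hypothetical.

```python
def mean_fold_coverage(n_reads, read_length, genome_length):
    """Expected fold coverage if reads were spread evenly over the genome."""
    return n_reads * read_length / genome_length


reads_at_threshold = 10 ** 4.5  # read count at which duplicates became prominent
read_length = 204               # nt, value quoted in the text
genome_length = 3e4             # nt, approximate SARS-CoV-2 genome size

fold = mean_fold_coverage(reads_at_threshold, read_length, genome_length)
```

The result is a little above 200-fold, consistent with the approximate figure given above.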
Independently derived consensus genomes from the SARS-CoV-2/human/USA-WA1/2020 isolates showed evidence of divergence between the original isolate, the seed stock, and the commercially distributed standard (Figure 2B). Similar culture-associated changes were recently reported for a second, culture-amplified reference isolate, Hong Kong/VM20001061/2020 (GenBank: MT547814). This is not surprising, given that any large-scale virus amplification in culture is accompanied by virus evolution, but it raises concerns about the utility of using a natural isolate, rather than a molecular clone (Thao et al., 2020), as the standard for sequencing.
The phylogeny based on whole-genome nucleotide sequences revealed several interesting facets. Predictably, all UNC isolates of SARS-CoV-2 were significantly different from SARS-CoV and RaTG13 (Figure 2B, purple). RaTG13 was used as an outgroup for clustering. The first NC case (NC_6999; Figure 2B, arrow labeled “WA”) was a person returning from Washington, and the sequence was confirmed at the CDC (NC-CDC-6999). It initiated a branch of cases related to the initial isolate SARS-CoV-2/human/CHN/Wuhan-01/2019 (NCBI Accession: NC_045512). A second branch of cases (Figure 2B, arrow labeled “cruise”) contains the majority of NC cases, several cases isolated in neighboring Virginia (Figure 2B, black cases), and a cluster of cases reported in Germany (DEU, orange). It also contains several early cases, representing individuals who participated in a cruise.
SARS-CoV entry is determined by the spike protein ORF S, and S is the target of neutralizing antibodies. Figure 2C shows the phylogenetic analysis of the S protein across all samples, the index cases for NC deposited by the NC Department of Health and Human Services, and representative examples from the United States, China, and Germany. Two branches emerged, one containing isolates from China, Washington, and Germany and a second containing United States and German sequences only. Since the S protein is shorter and more conserved across SARS-CoV-2, the limited numbers of SNVs did not support as detailed a lineage mapping as the whole-genome nucleotide sequences.
One large deletion was identified in four independent samples: 14 nt were deleted beginning at position 29745 (indicated in Figure 2C by a delta symbol). This region is within the previously recognized “coronavirus 3′ stem-loop II-like motif (s2m).” This was confirmed in multiple isolates, supported by multiple, independent junction-spanning reads (Figures 3A and 3B). Junctions were mapped to single-nucleotide resolution directly from individual reads. To confirm our deep-sequencing results, we performed 3′ UTR site-specific amplification and Sanger-based sequencing (Figures 3E-3G). The variant 3′ end does not destroy overall folding but introduces a shorter stable hairpin (Figures 3C and 3D). How this mutation affects viral fitness remains to be established.
In sum, this study generated exhaustive SNV information representing the introduction and spread of SARS-CoV-2 across a suburban low-density area in the southern United States. All samples were from symptomatic cases, and the majority of genomes clustered with variants that predominate in the outbreak in the United States, rather than Europe or China. This supports the notion that the majority of United States cases were generated by domestic transmission.

DISCUSSION
This study demonstrates extensive shedding of SARS-CoV-2 in symptomatic patients among a low-density population in the southeastern United States. It is among the largest sequencing studies that focus on a suburban and rural community, rather than a crowded city, like New York City. The SNV distribution was consistent with continuous evolution or genetic drift of this new virus through an immunologically naive host population (Consortium, 2004; Fauver et al., 2020; Lu et al., 2020a).
The first reported SARS-CoV-2 case in NC was a person who previously traveled to the state of Washington (03-03-2020, NC State Health Department; GenBank: MT325591). Additional early cases included persons who became infected while onboard a cruise ship (03-12-2020, NC State Health Department). Each of these introduction events was associated with a distinct clade. More recent cases, and cases in neighboring Virginia, were associated with the cruise case. These data support the hypothesis that the majority of cases in NC originate from persons traveling within the United States rather than internationally, reflecting predominant spread by community transmission within the United States (Fauver et al., 2020).
SNV analysis documents the presence of a presumed high-pathogenicity variant D614G in 57% of the cases (Becerra-Flores and Cardozo, 2020; Ceraolo and Giorgi, 2020; Eaaswarkhanth et al., 2020). It is clear that this variant signifies spread within Europe and the continental United States. Within the limitations presented by measuring viral loads within samples collected at unknown times past infection and with presumably differing clinical sampling efficiency, patients with the D614G SNV presented with higher SARS-CoV-2 genome loads. The association of the D614G SNV with specific clinical presentations, distinct biological properties, and high peak titers seems increasingly likely. While this article was under review, a large number of studies cemented the importance of the D614G SNV and its biological and clinical properties.
Multiple studies demonstrated superior infectivity of D614G-containing viruses or pseudotyped particles.
Cao et al. reported on a clustering of genomes that harbor a D614G mutation in the S gene (Cao et al., 2020). Their analysis of 489 genomes derived from 32 countries reveals that genomes in clades A2 and A2a harboring the D614G mutation originate mainly from European and several South American countries, different from clade B, which contains genomes from mainland China. This observation is mirrored in the extensive analysis by Korber et al. (2020), which was published while this article was under review. The D614G mutation dominates over the initial human strain defined by the SARS-CoV-2/human/CHN/Wuhan-01/2019 isolate. They observed on average higher genome copy numbers for the D614G isolate, similar to this study, but could not make a conclusive association with clinical outcomes. Several other mutations reportedly accompany the D614G mutation on the S gene and include C214T, C3037T, and the C14408T mutations, and together, these form the globally dominant strain of SARS-CoV-2 (Isabel et al., 2020; Korber et al., 2020). The clade G strain of SARS-CoV-2 was reported in Italy as early as February 2020 (Bartolini et al., 2020; Stefanelli et al., 2020; Zehender et al., 2020). Studies of Russian isolates also have identified D614G, as well as additional mutations (Kozlovskaya et al., 2020). These findings are consistent with ours, as most of the genomes containing the D614G mutation also carry additional mutations defining the G clade. Of the 87 sequences that have the D614G mutation, 69 have the C214T mutation, 15 have the C3037T mutation, and 48 have the C14408T mutation. Given the increasing abundance of D614G SNVs, further research into its role in pathogenicity and clinical outcomes is warranted.
Figure 3 (legend, panels B-G)
(B) The same reads mapped to an artificial target sequence with the 29745delta14. Blue indicates forward and red reverse reads (all reads are single reads). Red boxes and black bars indicate mismatches at below 20% of reads (red) or above 20% of reads (black). In this alignment, duplicate mapping reads were removed to guard against PCR amplification bias. Genome positions are shown on top (note that after nucleotide 29,745, genome positions are out of sync due to the deletion). Note that this region is within the CoV 3′ stem-loop II-like motif (s2m), annotated in NC_045512 as a prediction based on profile:Rfam-release-14.1:RF00164,Infernal:1.1.2.
(C) Predominant Mfold prediction of the 3′ end of NC_045512 with deletion bases indicated in yellow.
(D) Predominant Mfold prediction of the 3′ end of NC_045512 delta14.
(E) Sequence alignment of 3′ UTR deletion mutants with other representative SARS-CoV-2 isolates.
(F) Sanger sequencing confirmation of the 3′ UTR deletion mutant UNC_200313_2020/2020.
(G) Sanger sequencing confirmation of the wild-type sequence for isolate UNC_200399_2020/2020.
Four samples had the same 14-nt deletion in the 3′ UTR, and no samples had deletions within the coding region. This deletion is 71 nt away from the stop codon of ORF10 (N protein) and eliminates a predicted stem-loop structure. An analogous bulged stem loop at approximately the same location (right after the stop codon) is important for the replication of mouse hepatitis virus. In bovine CoVs, an analogous RNA structure attenuates viral replication (Williams et al., 1999; Züst et al., 2008). There seems to be partial overlap between the bulged stem loop and the pseudoknot, suggesting that these two structures are mutually exclusive and may serve as a switch to regulate the ratio of full-length RNA and defective RNA (Goebel et al., 2004). These two structures are also present in SARS-CoV. These isolates represent full-length genomes from symptomatic patients rather than disjointed RNA fragments recovered after clinical disease had subsided; thus, we speculate that these deletion mutants are replication competent yet have an altered ratio of full-length genomic and defective interfering RNAs. The biological phenotypes of these and other recent SNVs remain to be established in future studies.
There are limitations to our approach. These are similar to other NGS-based phylogeny reconstructions. Sampling was neither randomized nor exhaustive. At this point, we cannot exclude the presence of a founder effect and a disproportionate impact of particular populations and situations on this dataset. The group of samples with unknown status did not include asymptomatic individuals in the broader sense of being free of any respiratory symptoms. In the current time of limited personal protective equipment, limited sample kits, and limited testing capacity, it would not have been ethical to divert these resources for random population-wide sequencing. As properly randomized cohort studies become available in the future, the SARS-CoV-2 phylogeny will become more representative of SARS biology and less influenced by sample bias.
Some SNVs may be the result of technical bias. For instance, the 5′ end awaits individual confirmation by RACE; the 3′ end likewise requires RACE for genome finishing. The Nextstrain database (Hadfield et al., 2018) suggests that positions 18,529, 29,849, 29,851, and 29,853 may be subject to PCR or sequencing bias. Lastly, targeted sequencing relies on amplification or hybridization capture. Unless the amplicon PCR primers or capture baits are completely removed, a portion of reads will reflect the sequence that these primers/baits were derived from rather than the sample. Most protocols rely on bioinformatic primer pruning alone. AmpliSeq, in addition to bioinformatic removal, enzymatically digests the targeting primers before library construction. Therefore, the sequences and SNVs reported here can be attributed exclusively to the particular clinical sample.
This particular sequencing experiment was not designed to identify minority variants, as current whole-genome amplification primer sets do not include unique molecular identifiers (UMIs). UMIs, sometimes called “primer IDs,” have been pioneered for sequencing small regions of the HIV genome (Jabara et al., 2011), and could likewise be applied to SARS-CoV-2.
This study confirmed the sensitivity of current NATs concerning the specific SARS-CoV-2 strains circulating in the region (and the United States). None of the UNC isolates had mutations in the CDC primer binding sites (Lu et al., 2020c). Three European isolates (MT358642, MT358639, and MT318827) had a GGG>AAC polymorphism in the 5′ terminal end of the forward CDC N3 primer (5′-GGGGAACTTCTCCTGCTAGAAT), which is a CoV consensus primer. Another European isolate (MT35638) had a G>T at 12,725, which is within the nCoV_IP2 forward primer. One European and one Chinese isolate (MT358638 and MT226610) each had an SNV in the nCoV_IP2 reverse primer at positions 12,818 and 12,814. As more and more viral genome sequences are generated, more and more SNVs will be recorded, including SNVs in qPCR primer and probe binding sites. Currently (May 9, 2020), 2.7% and 0.68% of sequences in GISAID contain SNVs in the CDC primer pairs N1 and N2, respectively. These data should be interpreted with caution, since at this point, little standardization exists as to the quality of SNVs reported, and it is unclear how much a given SNV in one of the primer binding sites affects assay performance. Not all mutations in a primer binding site result in catastrophic failure or significant loss of sensitivity (Hilscher et al., 2005); sensitivity here is defined by the sum of all steps in the assay pipeline, including, e.g., proper sample collection from the patient. Periodic retesting of positive and negative samples by whole-genome NGS represents an option to increase sensitivity and specificity and detect any variants emerging in the populations, which may escape detection by NAT.
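Screening genomes for SNVs in primer binding sites amounts to comparing each primer with the matching stretch of the genome. The sketch below does this for the N3 forward primer quoted above. It is an illustration only: the function name is hypothetical, and it assumes that the three changed bases of the GGG>AAC polymorphism are the first three bases of the binding site, with both sequences written in the same 5′ to 3′ orientation.

```python
def primer_mismatches(primer, site):
    """List (index, primer_base, site_base) for every mismatching position."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return [
        (i, p, s)
        for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if p != s
    ]


N3_FORWARD = "GGGGAACTTCTCCTGCTAGAAT"  # primer sequence quoted in the text

# Assumed variant binding site: GGG>AAC at the 5' terminal end of the primer.
variant_site = "AAC" + N3_FORWARD[3:]

mismatches = primer_mismatches(N3_FORWARD, variant_site)
```

A mismatch list like this only flags candidate problems; as noted above, whether a given mismatch reduces assay performance has to be established experimentally.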
Testing by sequencing represents an interesting alternative to NAT in the case of CoVs, which are present at very high genome copy numbers during days of active shedding (Wölfel et al., 2020; Yu et al., 2020). Testing by SARS-CoV-2 targeted sequencing had perfect specificity but lower sensitivity than qPCR (Sellers et al., 2020). Sequence coverage correlated with viral load. The lower sensitivity was expected, as real-time qPCR amplicons can be placed anywhere on the target genome and optimized for sensitivity; shorter amplicons (<100 bp) maximize sensitivity as compared to larger amplicons (>200 bp) (Hilscher et al., 2005; Lock et al., 2010). By contrast, NGS represents a compromise, as the entire viral genome has to be covered with primers that are part of a common pool. Primer design is governed by compatibility under a single set of conditions (annealing temperature) as much as by individual efficiency. The ARTIC network protocol uses n = 96 larger amplicons (https://artic.network/ncov-2019). By comparison, the AmpliSeq protocol deployed here uses n = 237 amplicons of size 204 ± 29 bp (mean ± SD) (i.e., more than twice as many and substantially shorter amplicons, with expected higher sensitivity). In sum, testing by sequencing represents a suitable, albeit expensive, tool for COVID-19 diagnosis.
Approximately half of the specimens not clinically tested for SARS-CoV-2 had a positive result by sequencing. This was not surprising, as to this day, testing capabilities are limited and probable cases are triaged based on clinical and public health indications. These unknown cases were not asymptomatic but represent patients with a clinically indicated need for upper respiratory sampling. Finding additional SARS-CoV-2 cases in this population suggests that case counts based on NAT represent a lower estimate of SARS-CoV-2 prevalence. It may also suggest that the current triage criteria for SARS-CoV-2 testing are too limited to understand spread of this virus. In sum, this study underscores the sensitivity and accuracy of current NAT assays and demonstrates the utility of testing by sequencing. It contributes to the worldwide effort to understand and combat the COVID-19 pandemic by providing an extensive set of full-length SARS-CoV-2 genomes from a non-urban setting.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
This work was funded by NIH public health service grants CA016086, CA019014, and CA239583 to D.P.D. Funding was also provided by the University Cancer Research Fund and the UNC School of Medicine. This project was supported by the North Carolina Policy Collaboratory at UNC with funding from the North Carolina Coronavirus Relief Fund established and appropriated by the North Carolina General Assembly. The authors would like to thank all the members of the Damania and Dittmer labs, Corbin Jones, and Nicole Fischer for critical reading, comments, and suggestions. We also thank the participants and the nurses and physicians at the UNC Pulmonary Intensive Care Unit and Department of Infectious Diseases who, in addition to their heroic patient care, ensure that de-identified excess samples are available for discovery research and rapidly validating novel diagnostic approaches.