Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA

Division of Infectious Diseases, University of Colorado Anschutz Medical Campus, Aurora, CO, USA

Department of Medicine, Stony Brook University, Stony Brook, NY, USA

Department of Medicine, Washington University, St. Louis, MO, USA

Department of Pediatrics, University of North Carolina, Chapel Hill, NC, USA

Abstract

Background

Culture-independent phylogenetic analysis of 16S ribosomal RNA (rRNA) gene sequences has emerged as an incisive method of profiling bacteria present in a specimen. Currently, multiple techniques are available to enumerate the abundance of bacterial taxa in specimens, including Sanger sequencing, ‘next generation’ pyrosequencing, microarrays, quantitative PCR, and the rapidly emerging third- and fourth-generation sequencing methods. An efficient statistical tool is urgently needed for the following tasks: (1) to compare the agreement between these measurement platforms, (2) to select the most reliable platform(s), and (3) to combine different platforms of complementary strengths for a unified analysis.

Results

We present latent variable structural equation modeling (SEM) as a novel statistical application for the comparative analysis of measurement platforms. The latent variable SEM model treats the true (unknown) relative frequency of a given bacterial taxon in a specimen as the latent (unobserved) variable, estimates the reliabilities of, and similarities between, different measurement platforms, and subsequently weights those measurements optimally for a unified analysis of the microbiome composition. The latent variable SEM contains the repeated measures ANOVA (both the univariate and the multivariate models) as special cases and, as a more general and realistic modeling approach, yields superior goodness-of-fit and more reliable analysis results, as demonstrated by a microbiome study of the human inflammatory bowel diseases.

Conclusions

Given the rapid evolution of modern biotechnologies, the tasks of measurement platform comparison, selection and combination are here to stay and to grow, and the latent variable SEM method is readily applicable to other biological settings beyond the microbiome study presented here.

Background

Complex microbial communities, like those of the human gastrointestinal (GI) tract and other environmental specimens, have gained increased attention in recent years, thanks to technological advances in culture-independent methods based on the amplification of 16S rRNA genes

Next-generation sequencing (NGS) technology provides a promising alternative to quantifying the microbiome without the limitations of cloning/Sanger sequencing. For instance, a single run of the 454 Life Sciences pyrosequencing platform can produce 1.2 million sequences in 8 hours

To date, few attempts have been made to systematically compare and combine different measurement modalities for microbiome analysis. Nossa

Here we propose an alternative analytical approach using latent variable structural equation modeling (SEM) to compare and integrate microbiome measurements from different measurement platforms. The latent variable SEM treats the true bacterial composition of a sample as the latent (unobserved) variable, estimates the relations between, and the reliabilities of, different measurement platforms and, if necessary, subsequently combines them for a joint analysis with each platform weighted by its reliability.

In this paper, we demonstrate the latent variable SEM approach through a study of the microbiome in inflammatory bowel diseases (IBD). Our primary goal is to identify the most reliable microbiome measurement platform. A secondary goal is to examine the impact of IBD disease phenotypes (Crohn’s Disease [CD] and ulcerative colitis [UC]) on the enteric microbiota. The measurement platforms compared in this study are: 1) ABI 3730 (Sanger) sequencing of the entire 16S rRNA gene; 2) 454 sequencing of the V1-V3 hypervariable regions; 3) 454 sequencing of the V3-V5 hypervariable region. In the case of a single bacterial taxon,

Methods

In this section, we illustrate the general methodology for platform comparison and combination using latent variable SEM. We start with the simpler latent variable SEM measurement model in which covariates are not involved to better elucidate how latent variable SEM gauges platform reliability and consistency. Subsequently, we introduce latent variable SEM with covariates and describe its two special cases -- repeated measures ANOVA in the univariate and multivariate approaches. To better assist readers with a less mathematical background in this section, each general model is accompanied by the corresponding example from the microbiome study on IBD.

Measurement model of latent variable SEM

In latent variable SEM, a latent variable refers to an unknown true value, such as the true frequency of a bacterial taxon in the microbiome. The latent variable is linked to its various measurements or indicators through a measurement model. Figure 1A shows the general measurement model, in which one latent variable ξ is measured by m observed variables Y_{i} (i = 1, 2, ⋯ , m). With **Y** = (Y_{1}, Y_{2}, ⋯ , Y_{m})^{'}, the latent variable SEM model is a system of linear equations: **Y** = **Λ**ξ + **ε**, where **Λ** = (λ_{1}, λ_{2}, ⋯ , λ_{m})^{'} is the vector of path coefficients showing the expected number of unit changes in the observed variables/measurements for a one-unit change in the true level of ξ. Random errors for the measurements and the latent variable itself are denoted by **ε** = (ε_{1}, ε_{2}, ⋯ , ε_{m})^{'} and ζ, respectively. We further assume that all errors are normally distributed and independent, with Cov(ε_{i}, ξ) = 0 and Cov(ε_{i}, ε_{j}) = 0 for i ≠ j. **Y** is usually centered about its mean, and thus the intercept terms are eliminated.

Path diagram for a latent variable SEM measurement model. (A) The general model with m measurements (observed variables) for one latent variable; (B) The measurement model with four measurements (Sanger, 454_V1V3, 454_V3V5 and qPCR) for the true (logit-transformed) relative frequency of Faecalibacterium


Let **θ** be the vector of the model parameters, including the path coefficients and the error variances and covariances. For the latent variable SEM model illustrated in Figure 1A, the covariance matrix **Σ**(**θ**) of **Y** implied by the SEM model is:
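As a sketch of what the implied covariance looks like under the assumptions stated above (uncorrelated measurement errors that are independent of ξ), the standard form for a one-factor measurement model is shown here. The symbols φ = Var(ξ) and Θ_ε are introduced for illustration and are not taken from the original text.

```latex
\Sigma(\theta) \;=\; \operatorname{Cov}(\mathbf{Y})
  \;=\; \phi\,\Lambda\Lambda' + \Theta_{\varepsilon},
\qquad
\phi = \operatorname{Var}(\xi),
\quad
\Theta_{\varepsilon} = \operatorname{diag}\bigl(\operatorname{Var}(\varepsilon_1),\dots,\operatorname{Var}(\varepsilon_m)\bigr).
```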

Given the multivariate normal distribution of **Y**, one can estimate the model parameters via the traditional maximum likelihood (ML) method, which results in the minimization of the following ML fit function:
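For orientation, the ML fit function commonly used in the SEM literature for m observed variables has the form below. This is the textbook expression, given as an assumption about what the authors minimize rather than a transcription of their equation.

```latex
F_{ML}(\theta) \;=\; \log\bigl|\Sigma(\theta)\bigr|
  \;+\; \operatorname{tr}\!\bigl(S\,\Sigma^{-1}(\theta)\bigr)
  \;-\; \log\bigl|S\bigr| \;-\; m .
```

The function is zero when Σ(θ) equals S exactly and positive otherwise.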

where S is the sample covariance matrix. This in turn reduces to minimizing the difference between S and **Σ**(**θ**).

To fix ideas, we now illustrate the modeling and estimation of the latent variable SEM in detail by setting m = 3 in Figure 1A. The model is then Y_{1} = λ_{1}ξ + ε_{1}, Y_{2} = λ_{2}ξ + ε_{2} and Y_{3} = λ_{3}ξ + ε_{3}, where E(ε_{i}) = 0, Cov(ε_{i}, ξ) = 0 and Cov(ε_{i}, ε_{j}) = 0 for i ≠ j.

The implied covariance matrix of the model (*its upper triangular portion is omitted in the matrix form due to symmetry) is:
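Written out under the assumptions above, with φ = Var(ξ) and θ_i = Var(ε_i) used as illustrative symbols that are not from the original, the lower triangle of the implied covariance matrix for m = 3 would be:

```latex
\Sigma(\theta) \;=\;
\begin{pmatrix}
\lambda_1^{2}\phi + \theta_1 & & \\
\lambda_1\lambda_2\phi & \lambda_2^{2}\phi + \theta_2 & \\
\lambda_1\lambda_3\phi & \lambda_2\lambda_3\phi & \lambda_3^{2}\phi + \theta_3
\end{pmatrix}.
```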

Following convention for latent variable SEM estimation, we set one of the path coefficients to 1 to assign a scale to the latent variable. Setting λ_{1} ≡ 1 in **Σ**(**θ**) and subsequently equating **Σ**(**θ**) with S = [S_{ij}], the sample variance-covariance matrix, yields the maximum likelihood estimators of the model parameters:
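A minimal sketch of the three-measurement case follows, assuming λ_1 is fixed at 1 and that the six distinct entries of S are matched to the six free parameters. The function names and the numerical values are hypothetical, and the closed-form expressions are derived here from the moment equations rather than copied from the original.

```python
def fit_three_indicator_model(S):
    """Closed-form estimates for Y_i = lambda_i * xi + eps_i (i = 1, 2, 3).

    S is a symmetric 3x3 sample covariance matrix given as nested lists.
    lambda_1 is fixed at 1 to set the scale of the latent variable.
    """
    s11, s22, s33 = S[0][0], S[1][1], S[2][2]
    s12, s13, s23 = S[0][1], S[0][2], S[1][2]
    var_xi = s12 * s13 / s23   # from S12 = lam2*phi, S13 = lam3*phi, S23 = lam2*lam3*phi
    lam2 = s23 / s13
    lam3 = s23 / s12
    var_eps = (s11 - var_xi,
               s22 - lam2 ** 2 * var_xi,
               s33 - lam3 ** 2 * var_xi)
    return {"lambda": (1.0, lam2, lam3), "var_xi": var_xi, "var_eps": var_eps}


def implied_covariance(lam, var_xi, var_eps):
    """Model-implied covariance: lam_i * lam_j * Var(xi), plus Var(eps_i) on the diagonal."""
    m = len(lam)
    return [[lam[i] * lam[j] * var_xi + (var_eps[i] if i == j else 0.0)
             for j in range(m)] for i in range(m)]


# Round trip with hypothetical parameter values: build S from known parameters,
# then recover them from S alone.
S = implied_covariance((1.0, 0.8, 1.5), 2.0, (0.5, 0.3, 1.0))
estimates = fit_three_indicator_model(S)
print(estimates)
```

With three measurements the model is just identified, so the estimates reproduce S exactly. With four or more measurements the fit function has to be minimized numerically.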

Platform reliability measure

In order to evaluate the consistency of the measurement platforms, we adopt the concept of reliability, which originated in classical test theory, by assuming that a true score underlies a measure. The squared correlation coefficient between the latent variable ξ and a measure Y_{i} is a good reliability measure representing the percentage of variance in a measure that is explained by the latent variable (true score). It is appropriate under very general conditions and, in simple cases, is equal to some of the traditional techniques such as Cronbach’s alpha. Thus **the reliability measure for the i^{th} measurement** is:
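One way to write this reliability, assuming the measurement model Y_i = λ_i ξ + ε_i with errors uncorrelated with ξ, is shown below. The symbol ρ_i is introduced here for illustration.

```latex
\rho_i \;=\; \bigl[\operatorname{Corr}(Y_i,\xi)\bigr]^{2}
  \;=\; 1 - \frac{\operatorname{Var}(\varepsilon_i)}{\operatorname{Var}(Y_i)}
  \;=\; \frac{\lambda_i^{2}\,\operatorname{Var}(\xi)}{\operatorname{Var}(Y_i)} .
```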

The last term in the equation can be interpreted as the proportion of variance in the measure Y_{i} that is explained by the latent variable ξ. Consider, for example, the reliability of Y_{2} for the simple case of one latent variable with three measurements (Figure

Pearson correlations and Text S1 Reliability in the measurement model.


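Here r_{ij} is the sample Pearson product-moment correlation coefficient between the observed variables Y_{i} and Y_{j}. Similarly, we have the corresponding expressions for the other two measures. For example, suppose that the first two measures are perfectly correlated (r_{12} = 1) while the third measure is poorly correlated with the first two, with r_{13} = r_{23} = 0.5. Then we have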
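The worked example can be sketched as follows. The three expressions are derived here from the just-identified three-measurement model (λ_1 = 1, uncorrelated errors) and are an assumption about the form of the original equations, with ρ_i denoting the reliability of Y_i.

```latex
\rho_1 = \frac{r_{12}\,r_{13}}{r_{23}}, \qquad
\rho_2 = \frac{r_{12}\,r_{23}}{r_{13}}, \qquad
\rho_3 = \frac{r_{13}\,r_{23}}{r_{12}} ;
\qquad
r_{12} = 1,\; r_{13} = r_{23} = 0.5
\;\Longrightarrow\;
\rho_1 = \rho_2 = 1, \quad \rho_3 = 0.25 .
```

In this example the two perfectly correlated measures are judged fully reliable, while the third explains only a quarter of its own variance through the latent variable.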

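The standardized path coefficients are defined as the correlations between the measures Y_{i} and the latent variable ξ. The estimated reliability of the i^{th} measure is therefore the square of its standardized path coefficient.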

Comparison to repeated measures ANOVA

The traditional approach to incorporate multiple repeated measures for the same underlying latent variable is the repeated measures ANOVA. Here we show that the latent variable SEM is a more general model – with the repeated measures ANOVA, both the univariate and the multivariate analysis approaches, as its special cases (Figure

Path diagram for repeated measures ANOVA. In comparison to the latent variable SEM model (Figure 1A), repeated measures ANOVA assumes equal path coefficients for both the multivariate and univariate analysis approaches. In addition, for the univariate approach the measurement error variances, Var(εi), are assumed to be equal


The univariate repeated measures ANOVA model is **Y** = Z**1** + **ε**, where we assume that **Y** = (Y_{1}, Y_{2}, ⋯ , Y_{m})^{'} is centered, analogous to SEM, so that the intercept term is eliminated; Z is the (random) effect of subject; **1** is the m-vector of ones; and **ε** = (ε_{1}, ε_{2}, ⋯ , ε_{m})^{'} are independent and identically distributed random errors independent of Z. Therefore **Y** ~ N_{m}(**0**, **Σ**), where, omitting the upper triangle of the matrix by symmetry, we have
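Under these assumptions the covariance matrix takes the following form, where σ_Z² = Var(Z) and σ² = Var(ε_i) are symbols chosen here for illustration:

```latex
\Sigma \;=\;
\begin{pmatrix}
\sigma_Z^{2} + \sigma^{2} & & & \\
\sigma_Z^{2} & \sigma_Z^{2} + \sigma^{2} & & \\
\vdots & \vdots & \ddots & \\
\sigma_Z^{2} & \sigma_Z^{2} & \cdots & \sigma_Z^{2} + \sigma^{2}
\end{pmatrix}.
```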

This particular structure of the variance-covariance matrix is called “compound symmetry”. The univariate repeated measures ANOVA can be obtained from the more general latent variable SEM shown in Figure 1A by setting λ_{i} ≡ 1 and Var(ε_{i}) ≡ σ^{2} (i = 1, 2, ⋯ , m).

The multivariate approach for repeated measures ANOVA allows different measurement error variances but still imposes equal weights on the path coefficients from the measurements to the latent variable, that is, λ_{i} ≡ 1 (i = 1, 2, ⋯ , m). The resulting covariance matrix of **Y** is:
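A sketch of this structure, with φ = Var(ξ) and θ_i = Var(ε_i) as illustrative symbols not taken from the original: the off-diagonal entries are all equal, while the diagonal entries may differ.

```latex
\Sigma \;=\;
\begin{pmatrix}
\phi + \theta_1 & & & \\
\phi & \phi + \theta_2 & & \\
\vdots & \vdots & \ddots & \\
\phi & \phi & \cdots & \phi + \theta_m
\end{pmatrix}.
```

Setting θ_1 = ⋯ = θ_m recovers the compound symmetry structure of the univariate approach.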

In summary, the repeated measures ANOVA models, both the univariate and the multivariate approaches, are special cases of latent variable SEM with constraints on the error variances and path coefficients. The general latent variable SEM is a more realistic, flexible and better-fitting model to evaluate the latent variable with several measurements, especially when the reliability of each measurement is unclear and the assumption of equal error variances is questionable. This general principle is fully illustrated in the ensuing example of a microbiome study where we compared the latent SEM measurement model with both repeated measures ANOVA models.
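The nesting of the three models can be made concrete by counting free parameters. The sketch below assumes one latent variable, λ_1 fixed at 1 and uncorrelated errors, and the function name is hypothetical. For m = 4 it gives 2, 5 and 8 degrees of freedom, which agree with the values reported in the goodness-of-fit table of the Results section.

```python
def degrees_of_freedom(m, equal_loadings=False, equal_error_variances=False):
    """Degrees of freedom of a one-factor measurement model with m indicators.

    df = number of distinct covariance entries minus number of free parameters.
    """
    n_moments = m * (m + 1) // 2                   # distinct entries of S
    n_loadings = 0 if equal_loadings else m - 1    # lambda_1 is fixed at 1
    n_error_variances = 1 if equal_error_variances else m
    n_parameters = n_loadings + 1 + n_error_variances  # +1 for Var(xi)
    return n_moments - n_parameters


m = 4
df_sem = degrees_of_freedom(m)
df_multivariate = degrees_of_freedom(m, equal_loadings=True)
df_univariate = degrees_of_freedom(m, equal_loadings=True, equal_error_variances=True)
print(df_sem, df_multivariate, df_univariate)
```

Each added constraint frees degrees of freedom at the cost of flexibility, which is why the constrained ANOVA models can only fit as well as, or worse than, the general SEM.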

Latent variable SEM with covariates

While one advantage of the latent variable SEM is the ability to simultaneously incorporate multiple measures for the same underlying latent variable in a measurement model as shown in the previous section, SEM also can integrate multiple covariates for a latent variable in the same model. In the ensuing example of IBD, we simultaneously examine the influence of disease phenotypes and genotypes on the underlying bacterial ensemble while incorporating measures from multiple platforms (e.g., Sanger sequencing, 454 pyrosequencing, and qPCR). As illustrated in Figure

Path diagram for a latent variable SEM with covariates. (A) A general model with m measurements and k covariates for one latent variable ξ. (B) The model with four measurements (Sanger, 454_V1V3, 454_V3V5 and qPCR) and two covariates -- two binary disease indicators: CD (= 1 for subjects with Crohn’s disease, and 0 otherwise), and UC (= 1 for subjects with ulcerative colitis, and 0 otherwise) for the true/latent (logit-transformed) relative frequency of Faecalibacterium


The SEM model for Figure

Here, **Y** is a vector of measurement variables for the latent variable ξ, and **X** is a vector of independent variables (covariates) affecting the latent variable ξ. Both **Y** and **X** have been centered about their means per SEM convention. In addition to the notation in the measurement model, we have **Γ** = (γ_{1}, γ_{2}, ⋯ , γ_{k})^{'} representing the vector of path coefficients from the covariates to the latent variable. The estimation procedure is very similar to that of the measurement model as well. We can break the covariance matrix **Σ**(**θ**) into a block matrix as follows:

Thus the parameters can be estimated through minimizing the ML fitting function, or equivalently, by equating **Σ**(**θ**) and S, the sample covariance matrix for both **X** and **Y**.
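As a sketch of the block structure, assume the structural equation is ξ = Γ'X + ζ with ζ independent of X and of the measurement errors. Then the implied covariance of (Y, X) can be written as below, where Σ_XX = Cov(X), ψ = Var(ζ) and Θ_ε = Cov(ε) are symbols introduced here for illustration.

```latex
\Sigma(\theta) \;=\;
\begin{pmatrix}
\Sigma_{YY} & \Sigma_{YX} \\
\Sigma_{XY} & \Sigma_{XX}
\end{pmatrix},
\qquad
\begin{aligned}
\Sigma_{YY} &= \Lambda\bigl(\Gamma'\Sigma_{XX}\Gamma + \psi\bigr)\Lambda' + \Theta_{\varepsilon},\\
\Sigma_{YX} &= \Lambda\,\Gamma'\Sigma_{XX} \;=\; \Sigma_{XY}' .
\end{aligned}
```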

Nonparametric analysis of latent variable SEM

In the above, we presented the analysis of latent variable SEM based on the most widely used maximum likelihood estimation (MLE) framework, which depends on normality assumptions. In practice, SEM with continuous variables, including ordinal variables with five or more categories, will not have severe problems with non-normality. When the normality assumption is untenable, one cannot directly rely on the parametric hypothesis tests or confidence intervals. Instead, one can employ bootstrap resampling procedures to perform nonparametric significance tests and to construct nonparametric confidence intervals.

In order to fully analyze the following application example on IBD and the microbiome, we developed a modified boot.sem function by adapting the boot.sem function from the R package sem (version 0.9-21) to estimate platform reliability, the standardized latent variable SEM path coefficients and other parameters whenever the normality assumption is untenable. Our modified boot.sem function is available for free download. Confidence intervals based on the 2.5^{th} and the 97.5^{th} percentiles of the resampled data are shown in the following section.
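The percentile bootstrap idea can be sketched in a few lines. This is not the authors' modified boot.sem function: it is a standalone illustration in Python that resamples subjects with replacement, re-estimates the reliability of the first of three measures by the closed-form three-measurement estimator, and reads off the 2.5th and 97.5th percentiles. The data are simulated and all names are hypothetical.

```python
import random


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def reliability_of_first(rows):
    """Reliability of Y1 in a three-measure model: S12 * S13 / (S23 * S11)."""
    y1, y2, y3 = zip(*rows)
    return (covariance(y1, y2) * covariance(y1, y3)
            / (covariance(y2, y3) * covariance(y1, y1)))


def percentile(sorted_values, p):
    """Simple nearest-rank style percentile of an already sorted list."""
    index = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[index]


def bootstrap_ci(rows, statistic, n_boot=100, seed=0):
    """Percentile bootstrap interval: resample subjects (rows) with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    replicates = sorted(
        statistic([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return percentile(replicates, 2.5), percentile(replicates, 97.5)


# Simulated subjects: one latent value per subject, three noisy measurements.
rng = random.Random(1)
rows = []
for _ in range(142):
    xi = rng.gauss(0, 1)
    rows.append((xi + rng.gauss(0, 0.4),
                 0.9 * xi + rng.gauss(0, 0.5),
                 1.1 * xi + rng.gauss(0, 0.8)))

point = reliability_of_first(rows)
low, high = bootstrap_ci(rows, reliability_of_first)
print(point, (low, high))
```

Resampling whole subjects keeps the dependence among a subject's measurements intact, which is what makes the interval meaningful for a reliability estimate.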

Results and discussion

Data and model descriptions

Inflammatory bowel diseases (IBD), including Crohn’s disease (CD) and ulcerative colitis (UC), are chronic inflammatory conditions of the small intestine and/or the colon. The IBD study reported here includes 39 ileal CD patients, 50 UC patients, and 53 non-IBD control subjects, specimens from which were subjected to microbiome analysis. The abundance of the bacterial genus ^{dCT}) so that all four measurements were subjected to the same transformation. The IBD phenotypes (CD and UC) are incorporated as two covariates into the SEM model for an association analysis as well. Path diagrams for the latent variable SEM measurement, and covariate models for

Consistency and reliability of different measurement modalities

Table

| | **Sanger** | **454_V1V3 (p)** | **454_V3V5 (p)** | **qPCR (p)** |
|---|---|---|---|---|
| Sanger | 1 | 0.828 (<.001) | 0.866 (<.001) | 0.642 (<.001) |
| 454_V1V3 | | 1 | 0.887 (<.001) | 0.624 (<.001) |
| 454_V3V5 | | | 1 | 0.610 (<.001) |
| qPCR | | | | 1 |

The reliabilities of these measurement modalities, as estimated by the squared correlation coefficients between measurements and the latent variable, are shown in the Table

| **Four-modality measurement model** | **Sanger** | **454_V1V3** | **454_V3V5** | **qPCR** |
|---|---|---|---|---|
| Reliability | 0.819 | 0.857 | 0.912 | 0.441 |
| (95% CI) | (0.689, 0.907) | (0.774, 0.917) | (0.865, 0.963) | (0.303, 0.553) |
| Correlation to the latent variable | 0.905 | 0.926 | 0.955 | 0.664 |
| (95% CI) | (0.830, 0.952) | (0.880, 0.958) | (0.930, 0.981) | (0.550, 0.744) |

The 95% confidence intervals are obtained using bootstrap resampling with 100 replications.

Because the reliability measure calculated in this model is closely related to the correlations among measurement modalities, and because the two 454 pyrosequencing windows feature the highest correlation (r = 0.887), we also evaluated a three-modality measurement model that dropped the 454 V1V3 data (the less reliable pyrosequencing window). In this independent platform comparison, Sanger sequencing emerged as the most reliable platform among the three modalities with an estimated reliability of 0.911 and an estimated correlation of 0.955 with the underlying

| **Three-modality measurement model** | **Sanger** | **454_V3V5** | **qPCR** |
|---|---|---|---|
| Reliability | 0.911 | 0.822 | 0.452 |
| (95% CI) | (0.775, 1.000) | (0.720, 0.912) | (0.323, 0.610) |
| Correlation to the latent variable | 0.955 | 0.907 | 0.672 |
| (95% CI) | (0.880, 1.000) | (0.849, 0.955) | (0.568, 0.781) |
| | **Sanger** | **454_V1V3** | **qPCR** |
| Reliability | 0.851 | 0.806 | 0.483 |
| (95% CI) | (0.671, 1.000) | (0.645, 0.905) | (0.350, 0.648) |
| Correlation to the latent variable | 0.922 | 0.898 | 0.696 |
| (95% CI) | (0.819, 1.000) | (0.803, 0.951) | (0.592, 0.805) |

The 95% confidence intervals are obtained using bootstrap resampling with 100 replications. Two three-modality models are shown, with Sanger, qPCR and 454_V3V5 in the first model, and Sanger, qPCR and 454_V1V3 in the second model.
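As a rough consistency check, the reliabilities in the three-modality table can be approximated directly from the pairwise Pearson correlations reported earlier. The sketch assumes the just-identified three-measurement model, in which each reliability is a ratio of correlation products. The function name is hypothetical, and small discrepancies are expected because the published correlations are rounded to three decimals.

```python
def three_measure_reliabilities(r12, r13, r23):
    """Reliabilities of measures 1, 2, 3 from their pairwise correlations."""
    return (r12 * r13 / r23, r12 * r23 / r13, r13 * r23 / r12)


# Measure order: (Sanger, 454_V3V5, qPCR); correlations from the Pearson table.
model_1 = three_measure_reliabilities(r12=0.866, r13=0.642, r23=0.610)

# Measure order: (Sanger, 454_V1V3, qPCR).
model_2 = three_measure_reliabilities(r12=0.828, r13=0.642, r23=0.624)

reported_1 = (0.911, 0.822, 0.452)
reported_2 = (0.851, 0.806, 0.483)

for computed, reported in ((model_1, reported_1), (model_2, reported_2)):
    for value, target in zip(computed, reported):
        print(f"computed {value:.3f}  reported {target:.3f}")
```

The computed values agree with the reported reliabilities to within about 0.001, which supports reading the reliability as a function of the pairwise correlations in the three-measurement case.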

Path diagrams for the measurement models with the estimated standardized path coefficients are shown in Figure

The estimated (A) four-modality (B) three-modality (Sanger, 454_V3V5, qPCR) and (C) three-modality (Sanger, 454_V1V3, qPCR) latent variable SEM measurement models for a study of the inflammatory bowel diseases


In addition to

| **Three-measurement modality model** | **Sanger** | **454_V1V3** | **454_V3V5** |
|---|---|---|---|
| (A) | | | |
| Reliability | 0.657 | 0.641 | **0.974** |
| (95% CI) | (0.524, 0.793) | (0.529, 0.724) | **(0.878, 1.000)** |
| Correlation to the latent variable | 0.811 | 0.801 | **0.987** |
| (95% CI) | (0.724, 0.891) | (0.727, 0.851) | **(0.937, 1.000)** |
| (B) Firmicutes/Clostridia/Clostridiales/LachnoIV | | | |
| Reliability | 0.685 | **0.923** | 0.793 |
| (95% CI) | (0.582, 0.804) | **(0.837, 1.000)** | (0.688, 0.903) |
| Correlation to the latent variable | 0.827 | **0.961** | 0.890 |
| (95% CI) | (0.763, 0.897) | **(0.915, 1.000)** | (0.829, 0.950) |
| (C) | | | |
| Reliability | 0.582 | 0.854 | **0.882** |
| (95% CI) | (0.424, 0.700) | (0.743, 0.942) | **(0.765, 0.976)** |
| Correlation to the latent variable | 0.763 | 0.924 | **0.939** |
| (95% CI) | (0.652, 0.837) | (0.862, 0.970) | **(0.875, 0.988)** |
| (D) | | | |
| Reliability | 0.684 | 0.828 | **0.980** |
| (95% CI) | (0.323, 0.922) | (0.652, 1.000) | **(0.941, 1.000)** |
| Correlation to the latent variable | 0.827 | 0.910 | **0.990** |
| (95% CI) | (0.569, 0.960) | (0.808, 1.000) | **(0.970, 1.000)** |
| (E) | | | |
| Reliability | 0.698 | 0.953 | **0.959** |
| (95% CI) | (0.553, 0.797) | (0.888, 1.000) | **(0.913, 0.995)** |
| Correlation to the latent variable | 0.835 | 0.976 | **0.979** |
| (95% CI) | (0.744, 0.893) | (0.942, 1.000) | **(0.956, 0.998)** |

The 95% confidence intervals are obtained using bootstrap resampling with 100 replications.

Comparison to repeated measures ANOVA

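The model goodness-of-fit indices for the four-modality latent variable SEM measurement model are compared with those of the two repeated measures ANOVA models. The comparative fit index (CFI) is defined as 1 − d_{(proposed model)}/d_{(null model)}, where d is equal to the corresponding chi-square minus the degrees of freedom of the model. The CFI ranges from 0 to 1, with a larger value indicating better model fit. Acceptable model fit is indicated by a CFI value of 0.90 or greater. The general latent variable SEM gives χ^{2} = 5.089 (df = 2, p = 0.079) and CFI = 0.994, whereas both repeated measures ANOVA models are rejected (p < .001) with CFI values below 0.90. **In summary, the (general) latent variable SEM is the only model that fits the data well, as neither of the repeated measures ANOVA models is satisfactory.**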

| **Model** | **Model constraint** | **Chi-square** | **RMSEA** | **CFI** |
|---|---|---|---|---|
| **A: Latent variable SEM** | set only λ_{1} = 1 | 5.089 (df = 2), Pr > χ^{2}: 0.079 | 0.105 | 0.994 |
| **B: Equivalent to repeated measures ANOVA (multivariate approach)** | set all indicator path coefficients λ_{i} ≡ 1 (i = 1, 2, 3, 4) | 129.955 (df = 5), Pr > χ^{2}: < .001 | 0.421 | 0.750 |
| **C: Equivalent to repeated measures ANOVA (univariate approach)** | set all indicator path coefficients λ_{i} ≡ 1; set all indicator error variances to be equal, Var(ε_{i}) ≡ σ^{2} (i = 1, 2, 3, 4) | 172.068 (df = 8), Pr > χ^{2}: < .001 | 0.381 | 0.671 |
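The fit indices in the table can be sketched in code. The CFI function follows the definition given in the text, and its null-model chi-square is left as an input because it is not reported here. The RMSEA formula is the common sqrt(max(χ² − df, 0) / (df · (N − 1))) form, which is an assumption about the software's convention. The sample size N = 142 is also an assumption, taken as the sum of the 39 CD, 50 UC and 53 control subjects.

```python
import math


def rmsea(chi_square, df, n):
    """Root mean square error of approximation (N - 1 convention assumed)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def cfi(chi_square, df, chi_square_null, df_null):
    """Comparative fit index: 1 - d(proposed model) / d(null model), d = chi-square - df."""
    d_model = max(chi_square - df, 0.0)
    d_null = max(chi_square_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null


n_subjects = 39 + 50 + 53  # assumed number of subjects entering the model

rmsea_sem = rmsea(5.089, 2, n_subjects)
rmsea_multivariate = rmsea(129.955, 5, n_subjects)
rmsea_univariate = rmsea(172.068, 8, n_subjects)
print(round(rmsea_sem, 3), round(rmsea_multivariate, 3), round(rmsea_univariate, 3))
```

Under these assumptions the three RMSEA values round to the figures reported in the table.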

Estimation of the latent variable SEM model with IBD phenotypes

In this section, we examine the impact of two IBD phenotypes, Crohn’s Disease (CD) and ulcerative colitis (UC), on the relative frequency of

The estimated (A) four- and (B) three-modality latent variable SEM models examining the effect of two covariates: CD and UC phenotypes with their path coefficients and the corresponding p-values (in parentheses)


The estimated values of path coefficients in the association study with IBD phenotype are interpreted as follows. Take the three- modality covariate latent variable SEM for example (Figure

This translates to:

**Therefore, in comparison to the control subjects, CD patients are found to have on average 14.4% less Faecalibacterium (p < .001)**, as the following simple calculation shows:

**Similarly, UC patients are found to have 4.1% less Faecalibacterium than the control subjects (p = 0.048)** because

The mean differences of the logit-transformed relative frequency of

Comparison of logit-transformed relative frequency of Faecalibacterium among CD, UC and control by four measurements (qPCR, 454_V1V3, 454_V3V5 and Sanger sequencing) respectively. Mean and standard error are shown on each bar. Pairwise comparisons between UC, CD and control within each measurement platform are performed using Tukey’s studentized range test and significantly different pairs at the familywise error rate of 0.05 are labeled with the asterisk (*) representing significantly different pairs


Conclusions

In this work, we introduced the latent variable SEM as a versatile and effective analytical tool for measurement platform comparison and combination. While traditional SEM relied on the normality assumption for its parametric based inference, thanks to contemporary nonparametric techniques such as the bootstrap resampling method

In the study of the gastrointestinal microbiome, we demonstrated that latent variable SEM can provide a robust means of integrating datasets derived from different experimental platforms. Moreover, it can gauge effectively the relative merits of different measurement platforms, in this example, Sanger sequencing, 454 pyrosequencing with two different target regions/windows, and qPCR. Joint panel studies

The joint study panel has also recommended sequencing the microbiome with two 454 pyrosequencing windows, such as V1V3 and V3V5, which we can readily combine using the latent variable SEM for a unified joint analysis. Nevertheless, more work needs to be done for a thorough treatment of the platform comparison problem. For example, we have yet to examine the rare taxa issue. Given that data from rare taxa will feature near-zero counts and artificially low or suspiciously high variances, a robust version of the current latent variable SEM method needs to be developed for such cases. We expect to address this issue in a follow-up paper.

To our knowledge, this is the first application of latent variable SEM to the study of human microbiome, and for modern sequencing platform comparison and combination. Since human gastrointestinal microbial communities are typically complex and difficult to study

Competing interests

The authors declare that they have no competing interests.

Authors’ contribution

WZ, KB and XW proposed the statistical methodology. XW carried out the data analyses and drafted the manuscript. DF and EL provided data interpretation; EL and AG provided experimental data. WZ, KB, DF, EL and AG provided critical revision of, and comments on, the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the Crohn's and Colitis Foundation of America (EL, WZ), the Simons Foundation (EL) and National Institutes of Health (HG005964, DNF), UH2 DK083994 (EL), EB007530 (WZ), HL091939 (WZ), and MH090134 (WZ), and the Children's Digestive Health and Nutrition Foundation and the CCFA (ASG). We acknowledge use of the Washington University Digestive Diseases Research Core Center Tissue Procurement Facility (P30 DK52574). We thank Drs. George Weinstock and Erica Sodergren at the Genome Institute of Washington University for generating the sequence data. We also thank Dr. R. Balfour Sartor at the School of Medicine of the University of North Carolina for helpful discussions. Our thanks also go to the BMC Bioinformatics Review Panel for their insightful comments that have improved this work substantially.