If you are familiar with human research, you may have seen examples of group comparisons. For example, to test the effectiveness of an antidepressant, we may want to compare men and women, since depression can manifest differently across genders. However, are you sure that the instrument you are using (e.g., the Beck Depression Inventory) has the same structure for men and women?
Another example comes from cross-cultural research. We could measure differences in subjective well-being between countries and see whether some countries are happier than others. But how can you be confident that your results are accurate if you don't know whether the latent variable scale scores are comparable?
One criterion for comparing scale scores is that the latent variable is understood and measured equivalently across groups or countries. This property is known as measurement invariance, or lack of bias (Svetina et al., 2020).
So, whenever you want to compare test scores between groups, you should first run an invariance analysis on the scale. You may have already come across claims that some countries or cultures have higher or lower IQs than others, made without measurement invariance ever being tested.
A test is invariant if participants belonging to different groups, who have the same score on the factor underlying the test, have on average the same score on an observed item (Lubke & Muthén, 2004).
A common statistical method for establishing evidence of measurement invariance is Multigroup Confirmatory Factor Analysis (CFA-MG). In CFA-MG, we use a hierarchical set of tests in which we impose increasingly restrictive constraints on the parameters of interest across groups.
10.1 Defining Measurement Invariance/Equivalence
A general factor model is defined as
\[
\Sigma = \Lambda \Phi \Lambda' + \Theta
\]
where \(\Sigma\) represents the covariance matrix of the observed variables (or items); \(\Lambda\) is the matrix of factor loadings, which expresses the degree of association between the vector of latent variables (with associated covariance matrix \(\Phi\)) and the observed variables; and \(\Theta\) represents the covariance matrix of the measurement errors of the observed variables.
Consider that \(v\) is the mean structure. The observed variables' means can then be represented by \(E(Y)=E(v+\Lambda \eta + \epsilon)\), where \(\eta\) is the vector of latent variables. This model assumes that \(E(\epsilon)=0\) and \(E(\eta)=k=0\), so that \(E(Y)=v\), where \(Y\) is the vector of observed variables. The model generalizes to multiple populations by permitting a separate covariance matrix for each population or group, i.e., \(\Sigma^{(g)}\) with mean structure \(v^{(g)}\), \(g=1,...,G\).
To establish measurement invariance, if the null hypothesis \(H_0: \Sigma^{(1)}=\Sigma^{(2)}=...=\Sigma^{(G)}\) is rejected, a series of hierarchical tests follows. I list them below.
Configural invariance: tests whether the factorial structure is the same across groups (i.e., the number of factors, and whether the items load on the same factors). Thus, the number and pattern of parameters are assumed to be equal across groups, while the values of the parameters are allowed to differ within identification constraints. If configural invariance is not found, the items load on different factors for different groups.
Metric invariance: tests whether the factor loadings of the items are equal across groups. The null hypothesis states that the pattern and values of the factor loadings are equivalent across groups, i.e., \(H_0: \Lambda^{(1)}=\Lambda^{(2)}=...=\Lambda^{(G)}\). If metric invariance is not found, one or more items of the instrument are being answered with bias by one or more groups, so inferences about group differences may be biased. Although this level already flags response bias in the items, the next step is necessary before groups can be compared on an equal footing.
Scalar invariance: in addition to equal factor loadings, tests whether the intercepts are equal across groups. Thus, the null hypothesis is \(H_0: \Lambda^{(1)}=\Lambda^{(2)}=...=\Lambda^{(G)}\) and \(v^{(1)}=v^{(2)}=...=v^{(G)}\). If scalar invariance is not found, any differences found between groups may reflect not the latent trait but the non-equivalence of the instrument's measurement parameters.
Strict invariance: implies that, in addition to loadings and intercepts, the residual variances are equal across groups. Residual variance is simply the item variance that is not associated with the latent variable being measured.
Some authors advocate strict invariance as a condition for comparing group means (Lubke & Dolan, 2003). In practice, this level of invariance is rarely achieved, and scalar invariance is sufficient to support comparisons of manifest (or latent) variable means on the latent variable of interest (Hancock, 1997; Svetina et al., 2020; Thompson & Green, 2006). Thus, most authors suggest establishing scalar invariance in order to conclude invariance (Svetina et al., 2020).
10.2 Measurement Invariance for Categorical Variables
Previously, I described the concept of invariance under the assumption that the distribution of the observed variables is multivariate normal. However, in psychology, most survey items are binary or ordinal. If we apply the multivariate normal model to categorical variables, there can be consequences for the parameters, model fit, and cross-group comparisons (Beauducel & Herzberg, 2006; Lubke & Muthén, 2004; Muthén & Kaplan, 1985). To overcome this, we can use other estimators, such as the diagonally weighted least squares (DWLS) family of estimators. Categorical measurement invariance proceeds as follows (Svetina et al., 2020).
Imagine we have a \(p \times 1\) vector of observed variables \(Y\) that take ordered values 0, 1, 2, …, \(C\). For each observed variable \(Y_j\), with \(j=1,2,...,p\), it is assumed that there is an underlying continuous latent response variable \(Y^{*}_j\) whose value determines the observed category of \(Y_j\). \(Y^{*}_j\) is related to \(Y_j\) by a set of threshold parameters \(\tau_j=(\tau_{j0},\tau_{j1},...,\tau_{j,C+1})\), where \(\tau_{j0}=-\infty\) and \(\tau_{j,C+1}=\infty\). Thus, the probability that \(Y_j = c\) is:
\[
P(Y_j=c)=P(\tau_{jc} \leq Y^{*}_j < \tau_{j,c+1})
\]
For \(c=0,1,...,C\). The model for the vector of latent response variables is: \[
Y^*=v+\Lambda\eta+\epsilon
\] where the factor loadings and residuals are defined as before, and \(v\) is a vector of latent intercept parameters. In addition, the mean and covariance structure of this model is the same as before: \(E(Y^*)=v\) and \(Cov(Y^*)=\Sigma^*=\Lambda\Phi\Lambda'+\Theta\), where \(v\) is assumed to be zero for identification purposes.
In the typical scenario, the categorical factor model can be expanded to multiple groups by allowing distinct thresholds and covariance matrices for the latent response variables within each population. These are denoted \(\tau^{(k)}\) and \(\Sigma^{*(k)}\), where \(k\) ranges from 1 to \(K\) (with \(v^{(k)}=0\) for all \(k\)). In a similar vein, for ordinal data, assessments are made for "baseline" invariance, equal thresholds, and equal thresholds and loadings, which mirror configural, metric, and scalar invariance, respectively. To ascertain the viability of these invariance assumptions, both overall and difference chi-square tests are employed.
10.3 Measurement Invariance in R
To run a Multigroup Confirmatory Factor Analysis, we must first install and load the lavaan (Rosseel, 2012) and semTools (Jorgensen et al., 2022) packages for the analyses, and psych (Revelle, 2023) for the database.
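A minimal setup sketch (assuming the packages are already installed; the masking messages below appear because psych is attached after lavaan and semTools):

# install.packages(c("lavaan", "semTools", "psych"))  # run once
library(lavaan)
library(semTools)
library(psych)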
The following objects are masked from 'package:semTools':
    reliability, skew
The following object is masked from 'package:lavaan':
    cor2cov
To run the analyses, we will use the BFI database (Big Five Personality Factors Questionnaire) that already exists in the psych package. We will differentiate between genders (1=Male and 2=Female).
dat <- psych::bfi
We will store the results of our models in an empty matrix called all.results, from which we will extract the chi-square, degrees of freedom, p-value, CFI, TLI, and RMSEA.
all.results <- matrix(NA, nrow = 3, ncol = 6)
Let's run the analysis with the five BFI factors. First, we specify the model.
The measEq.syntax function from the semTools package automatically generates the lavaan syntax to perform the confirmatory factor analysis. As can be seen from the baseline model specification, items are treated as ordinal, the delta parameterization is used, and model identification follows Wu and Estabrook (2016).
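The specification chunk itself is not shown above; the following is a minimal sketch of what it would contain, assuming the standard five-factor BFI structure and the semTools::measEq.syntax arguments just described (the factor labels are illustrative):

# Assumed five-factor model for the BFI items
mod.cat <- 'Agreeableness     =~ A1 + A2 + A3 + A4 + A5
            Conscientiousness =~ C1 + C2 + C3 + C4 + C5
            Extraversion      =~ E1 + E2 + E3 + E4 + E5
            Neuroticism       =~ N1 + N2 + N3 + N4 + N5
            Openness          =~ O1 + O2 + O3 + O4 + O5'

# Generate the lavaan syntax for the configural (baseline) model
baseline <- measEq.syntax(configural.model = mod.cat,
                          data = dat,
                          ordered = TRUE,
                          parameterization = "delta",
                          ID.fac = "std.lv",
                          ID.cat = "Wu.Estabrook.2016",
                          group = "gender",
                          group.equal = "configural")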
Let's then fit the baseline model.
# We have to convert the syntax to character before submitting it to lavaan
model.baseline <- as.character(baseline)
# Fit the baseline (configural) model
fit.baseline <- cfa(model.baseline, data = dat, group = "gender", ordered = TRUE)
Now let’s save the results in the matrix we created initially.
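The extraction code is not shown at this point; here is a sketch, mirroring the extraction step that appears later in Section 10.6:

# Store the baseline-model fit indices in the first row of all.results
all.results[1, ] <- round(data.matrix(fitmeasures(fit.baseline,
  fit.measures = c("chisq", "df", "pvalue", "cfi", "tli", "rmsea"))), digits = 3)
colnames(all.results) <- c("chisq", "df", "pvalue", "cfi", "tli", "rmsea")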
To examine the relative fit of the models and compare the chi-square statistics between the baseline model and the model in which threshold constraints are imposed, we use the lavTestLRT function.
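The prop4 and prop7 models referenced below are not defined in the excerpt; here is a sketch of how they could be fit, assuming the model-naming convention (prop4 = equal thresholds; prop7 = equal thresholds and loadings) from Svetina et al. (2020):

# Threshold-invariance model (prop4)
prop4 <- measEq.syntax(configural.model = mod.cat, data = dat, ordered = TRUE,
                       parameterization = "delta", ID.fac = "std.lv",
                       ID.cat = "Wu.Estabrook.2016", group = "gender",
                       group.equal = c("thresholds"))
fit.prop4 <- cfa(as.character(prop4), data = dat, group = "gender", ordered = TRUE)

# Threshold + loading invariance model (prop7)
prop7 <- measEq.syntax(configural.model = mod.cat, data = dat, ordered = TRUE,
                       parameterization = "delta", ID.fac = "std.lv",
                       ID.cat = "Wu.Estabrook.2016", group = "gender",
                       group.equal = c("thresholds", "loadings"))
fit.prop7 <- cfa(as.character(prop7), data = dat, group = "gender", ordered = TRUE)

# Compare the baseline model against the threshold-constrained model
lavTestLRT(fit.baseline, fit.prop4)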
We notice that, in general, the model fit improved as the models became more restricted by imposing equality of thresholds (prop4) and equality of factor loadings (prop7).
We can perform the chi-square difference test between the threshold invariance (prop4) and threshold + factor loading invariance (prop7) models to assess the feasibility of measurement invariance.
lavTestLRT(fit.prop4, fit.prop7)
           Df AIC BIC    Chisq Chisq diff Df diff Pr(>Chisq)
fit.prop4 344  NA  NA 3946.388         NA      NA         NA
fit.prop7 404  NA  NA 3952.985   20.73053      60  0.9999995
10.4 Data Interpretation
For interpretation, see Table 1 of Svetina et al. (2020), which summarizes several simulation studies and the recommended cutoff points. For example, if you have ordinal data and are comparing two groups of 150/150, 150/500, or 500/500 participants on 2 to 4 factors, you should test the levels of measurement invariance through the chi-square difference test, with p < .05 between each level of invariance. If you consider that your data are normally distributed, you are comparing two groups with an N of 150, 250, or 500 per group, and you are comparing only one factor, the CFI difference between adjacent invariance levels should not be greater than or equal to 0.005, while the RMSEA difference should not be greater than or equal to 0.010.
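As a quick sketch of how these differences can be computed for the models fit above (the .scaled index names assume the robust versions that lavaan reports for categorical estimators):

# Differences in CFI and RMSEA between adjacent invariance levels
fitmeasures(fit.prop4, c("cfi.scaled", "rmsea.scaled")) -
  fitmeasures(fit.baseline, c("cfi.scaled", "rmsea.scaled"))
fitmeasures(fit.prop7, c("cfi.scaled", "rmsea.scaled")) -
  fitmeasures(fit.prop4, c("cfi.scaled", "rmsea.scaled"))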
10.5 Comparing Confidence Intervals as a Means to Assess Measurement Invariance
A study by Cheung & Lau (2012) showed that, even if only few items are non-invariant in a scale, it can lead to poor fit for the whole scale. Thus, his idea is to compare Confidence Intervals (CIs) between item parameters, which allows us to exclude or treat specific items that shows non-invariance. In this approach, we should conclude that the scale structure is the same if we achieve configural invariance (by using Hu & Bentler’s (1999) recommendations, for example [i.e., CFI and TLI > .95, and RMSEA < .06]). Then, we will conclude other types of invariances if there’s no overlap between the 95% Confidence Intervals (CIs) between the loadings and intercepts parameters between groups (Cheung & Lau, 2012). The confidence intervals (CIs) can should be estimated using bias-corrected bootstrap samples to assess metric and scalar invariance. For practical purposes, we will exclude non-invariant items.
10.6 R Code for Comparing CIs
First, create your data frame containing the items of a single factor (the five Agreeableness items plus gender).
dat <- psych::bfi[, c(1:5, 26)]
We will need to run the first step shown in the "Baseline Model" part again.
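A sketch of that step, respecified for the single Agreeableness factor (the factor label Agre matches the summary output below):

# Single-factor model for the Agreeableness items
mod.cat <- 'Agre =~ A1 + A2 + A3 + A4 + A5'

baseline <- measEq.syntax(configural.model = mod.cat,
                          data = dat,
                          ordered = TRUE,
                          parameterization = "delta",
                          ID.fac = "std.lv",
                          ID.cat = "Wu.Estabrook.2016",
                          group = "gender",
                          group.equal = "configural")
summary(baseline)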
This lavaan model syntax specifies a CFA with 5 manifest indicators (5 of which are ordinal) of 1 common factor(s).
To identify the location and scale of each common factor, the factor means and variances were fixed to 0 and 1, respectively, unless equality constraints on measurement parameters allow them to be freed.
The location and scale of each latent item-response underlying 5 ordinal indicators were identified using the "delta" parameterization, and the identification constraints recommended by Wu & Estabrook (2016). For details, read:
https://doi.org/10.1007/s11336-016-9506-0
Pattern matrix indicating num(eric), ord(ered), and lat(ent) indicators per factor:
Agre
A1 ord
A2 ord
A3 ord
A4 ord
A5 ord
The following types of parameter were constrained to equality across groups:
configural
# Have to specify as.character to submit to lavaan
model.baseline <- as.character(baseline)
# Fitting baseline model in lavaan via cfa function
fit.baseline <- cfa(model.baseline, data = dat, group = "gender", estimator = "dwls")
# Obtaining results from baseline model
summary(fit.baseline)
# Extracting fit indices into the first row of all.results matrix
all.results[1, ] <- round(data.matrix(fitmeasures(fit.baseline,
  fit.measures = c("chisq", "df", "pvalue", "cfi", "tli", "rmsea"))), digits = 3)
colnames(all.results) <- c("chisq", "df", "pvalue", "cfi", "tli", "rmsea")
all.results
Now, let's move to the analysis of the CIs for a single factor in order to facilitate interpretation.
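The loop below relies on a bootstrapped parameter table (par) and on helper functions that are not defined in the excerpt. The following is a minimal sketch of plausible definitions, assuming the model is refit with bias-corrected bootstrap CIs, as Cheung and Lau (2012) recommend; these helpers are hypothetical reconstructions, and the scalar step further below redefines subset_and_print and show_confidence_intervals for thresholds.

# Assumed: refit the model with bootstrap standard errors and extract
# bias-corrected bootstrap CIs into a parameter table named par
fit.ci <- cfa(model.baseline, data = dat, group = "gender", ordered = TRUE,
              se = "bootstrap", bootstrap = 1000)
par <- parameterEstimates(fit.ci, boot.ci.type = "bca.simple")

# Hypothetical helper: subset the loading rows ("=~") for one indicator
subset_and_print <- function(rhs_value) {
  subset(par, op == "=~" & rhs == rhs_value)
}

# Hypothetical helper: TRUE if the two groups' CIs overlap
overlap_TRUE_FALSE <- function(rng) {
  (rng[1] <= rng[4]) & (rng[2] >= rng[3])
}

# Hypothetical helper: print the two groups' CIs for item i
show_confidence_intervals <- function(rng, i) {
  print(paste0("Item ", i, " Group 1: [", rng[1], ", ", rng[2],
               "]; Group 2: [", rng[3], ", ", rng[4], "]"))
}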
############### METRIC FOR Agreeableness
# Define the vector of items for the Agreeableness factor
rhs_AGRE <- colnames(dat[, 1:5])
# Stack the loading estimates (two rows per item, one per group)
PER_ol <- as.data.frame(do.call(rbind, lapply(rhs_AGRE, subset_and_print)))
# Initialize an empty list to store results
results <- list()
# Iterate over each pair of rows and check for overlap
# (columns 11 and 12 are assumed to hold ci.lower and ci.upper)
for (i in 1:(nrow(PER_ol) / 2)) {
  rng <- cbind(pmin(PER_ol[(2 * i - 1), 11], PER_ol[(2 * i - 1), 12]),
               pmax(PER_ol[(2 * i - 1), 11], PER_ol[(2 * i - 1), 12]),
               pmin(PER_ol[(2 * i), 11], PER_ol[(2 * i), 12]),
               pmax(PER_ol[(2 * i), 11], PER_ol[(2 * i), 12]))
  # Check for overlap and store the result
  results[[i]] <- overlap_TRUE_FALSE(rng)
  print(paste("Overlap for Item", i, ":", results[[i]]))
}
[1] "Overlap for Item 1 : TRUE"
[1] "Overlap for Item 2 : TRUE"
[1] "Overlap for Item 3 : TRUE"
[1] "Overlap for Item 4 : TRUE"
[1] "Overlap for Item 5 : TRUE"
# Iterate over each pair of rows and show the confidence intervals
for (i in 1:(nrow(PER_ol) / 2)) {
  rng <- cbind(pmin(PER_ol[(2 * i - 1), 11], PER_ol[(2 * i - 1), 12]),
               pmax(PER_ol[(2 * i - 1), 11], PER_ol[(2 * i - 1), 12]),
               pmin(PER_ol[(2 * i), 11], PER_ol[(2 * i), 12]),
               pmax(PER_ol[(2 * i), 11], PER_ol[(2 * i), 12]))
  # Show the confidence intervals
  show_confidence_intervals(round(rng, 3), i)
}
The results show that all the loading CIs overlap, which indicates metric invariance.
Now let’s run a scalar invariance analysis. Save the following functions for faster interpretation.
# Function to subset the parameter estimates for a specific item and threshold
subset_and_print <- function(lhs_value, rhs_value) {
  inter <- subset(par, lhs == lhs_value & op == "|" & rhs == rhs_value)
  return(inter)
}

# Function to show confidence intervals between two groups
show_confidence_intervals <- function(rng, item_name, threshold) {
  interval1 <- c(rng[1], rng[2])  # Group 1
  interval2 <- c(rng[3], rng[4])  # Group 2
  print(paste("Confidence intervals for", item_name, "threshold", threshold, ":"))
  print(paste("Group 1: [", interval1[1], ", ", interval1[2], "]", sep = ""))
  print(paste("Group 2: [", interval2[1], ", ", interval2[2], "]", sep = ""))
}

# Main automation function to compare the same item and threshold across two groups
compare_all_thresholds <- function(param_estimates) {
  # Extract all unique items (lhs) and thresholds (rhs) in the parameter estimates
  item_names <- unique(param_estimates$lhs[param_estimates$op == "|"])
  threshold_values <- unique(param_estimates$rhs[param_estimates$op == "|"])
  # Iterate over each item (lhs) and threshold (rhs)
  for (item_name in item_names) {
    for (rhs_value in threshold_values) {
      # Subset parameter estimates for the current item and threshold across both groups
      inter <- subset_and_print(item_name, rhs_value)
      # Check if there are exactly two rows (representing two groups)
      if (nrow(inter) == 2) {
        # Extract the confidence intervals for both groups
        rng <- cbind(pmin(inter[1, "ci.lower"], inter[1, "ci.upper"]),  # Group 1
                     pmax(inter[1, "ci.lower"], inter[1, "ci.upper"]),  # Group 1
                     pmin(inter[2, "ci.lower"], inter[2, "ci.upper"]),  # Group 2
                     pmax(inter[2, "ci.lower"], inter[2, "ci.upper"]))  # Group 2
        # Show the confidence intervals for the current item and threshold
        show_confidence_intervals(round(rng, 3), item_name, rhs_value)
        if ((rng[1] <= rng[4]) & (rng[2] >= rng[3])) {
          print(paste("TRUE overlap for item", item_name, "in threshold", rhs_value))
        } else {
          print(paste("FALSE overlap for item", item_name, "in threshold", rhs_value))
        }
      } else {
        print(paste("Not enough data to compare thresholds for item:",
                    item_name, "and rhs:", rhs_value))
      }
    }
  }
}
Run the analysis by passing the parameter table to the comparison function.
compare_all_thresholds(par)
[1] "Confidence intervals for A1 threshold t1 :"
[1] "Group 1: [-0.883, -0.696]"
[1] "Group 2: [-0.356, -0.234]"
[1] "FALSE overlap for item A1 in theshold t1"
[1] "Confidence intervals for A1 threshold t2 :"
[1] "Group 1: [-0.017, 0.146]"
[1] "Group 2: [0.389, 0.505]"
[1] "FALSE overlap for item A1 in theshold t2"
[1] "Confidence intervals for A1 threshold t3 :"
[1] "Group 1: [0.443, 0.613]"
[1] "Group 2: [0.785, 0.916]"
[1] "FALSE overlap for item A1 in theshold t3"
[1] "Confidence intervals for A1 threshold t4 :"
[1] "Group 1: [0.964, 1.167]"
[1] "Group 2: [1.258, 1.416]"
[1] "FALSE overlap for item A1 in theshold t4"
[1] "Confidence intervals for A1 threshold t5 :"
[1] "Group 1: [1.643, 1.951]"
[1] "Group 2: [1.83, 2.08]"
[1] "TRUE overlap for item A1 in theshold t5"
[1] "Confidence intervals for A2 threshold t1 :"
[1] "Group 1: [-2.1, -1.749]"
[1] "Group 2: [-2.477, -2.118]"
[1] "FALSE overlap for item A2 in theshold t1"
[1] "Confidence intervals for A2 threshold t2 :"
[1] "Group 1: [-1.41, -1.184]"
[1] "Group 2: [-1.794, -1.597]"
[1] "FALSE overlap for item A2 in theshold t2"
[1] "Confidence intervals for A2 threshold t3 :"
[1] "Group 1: [-1.038, -0.842]"
[1] "Group 2: [-1.428, -1.27]"
[1] "FALSE overlap for item A2 in theshold t3"
[1] "Confidence intervals for A2 threshold t4 :"
[1] "Group 1: [-0.284, -0.112]"
[1] "Group 2: [-0.693, -0.577]"
[1] "FALSE overlap for item A2 in theshold t4"
[1] "Confidence intervals for A2 threshold t5 :"
[1] "Group 1: [0.667, 0.867]"
[1] "Group 2: [0.294, 0.412]"
[1] "FALSE overlap for item A2 in theshold t5"
[1] "Confidence intervals for A3 threshold t1 :"
[1] "Group 1: [-2.008, -1.665]"
[1] "Group 2: [-1.982, -1.76]"
[1] "TRUE overlap for item A3 in theshold t1"
[1] "Confidence intervals for A3 threshold t2 :"
[1] "Group 1: [-1.273, -1.073]"
[1] "Group 2: [-1.48, -1.313]"
[1] "FALSE overlap for item A3 in theshold t2"
[1] "Confidence intervals for A3 threshold t3 :"
[1] "Group 1: [-0.867, -0.685]"
[1] "Group 2: [-1.128, -0.986]"
[1] "FALSE overlap for item A3 in theshold t3"
[1] "Confidence intervals for A3 threshold t4 :"
[1] "Group 1: [-0.171, 0]"
[1] "Group 2: [-0.508, -0.388]"
[1] "FALSE overlap for item A3 in theshold t4"
[1] "Confidence intervals for A3 threshold t5 :"
[1] "Group 1: [0.784, 0.964]"
[1] "Group 2: [0.436, 0.56]"
[1] "FALSE overlap for item A3 in theshold t5"
[1] "Confidence intervals for A4 threshold t1 :"
[1] "Group 1: [-1.762, -1.499]"
[1] "Group 2: [-1.787, -1.597]"
[1] "TRUE overlap for item A4 in theshold t1"
[1] "Confidence intervals for A4 threshold t2 :"
[1] "Group 1: [-1.161, -0.959]"
[1] "Group 2: [-1.267, -1.113]"
[1] "TRUE overlap for item A4 in theshold t2"
[1] "Confidence intervals for A4 threshold t3 :"
[1] "Group 1: [-0.811, -0.623]"
[1] "Group 2: [-1.022, -0.887]"
[1] "FALSE overlap for item A4 in theshold t3"
[1] "Confidence intervals for A4 threshold t4 :"
[1] "Group 1: [-0.24, -0.067]"
[1] "Group 2: [-0.543, -0.427]"
[1] "FALSE overlap for item A4 in theshold t4"
[1] "Confidence intervals for A4 threshold t5 :"
[1] "Group 1: [0.446, 0.626]"
[1] "Group 2: [0.034, 0.151]"
[1] "FALSE overlap for item A4 in theshold t5"
[1] "Confidence intervals for A5 threshold t1 :"
[1] "Group 1: [-2.075, -1.736]"
[1] "Group 2: [-2.289, -2.003]"
[1] "TRUE overlap for item A5 in theshold t1"
[1] "Confidence intervals for A5 threshold t2 :"
[1] "Group 1: [-1.298, -1.088]"
[1] "Group 2: [-1.531, -1.355]"
[1] "FALSE overlap for item A5 in theshold t2"
[1] "Confidence intervals for A5 threshold t3 :"
[1] "Group 1: [-0.838, -0.65]"
[1] "Group 2: [-1.08, -0.948]"
[1] "FALSE overlap for item A5 in theshold t3"
[1] "Confidence intervals for A5 threshold t4 :"
[1] "Group 1: [-0.206, -0.028]"
[1] "Group 2: [-0.368, -0.257]"
[1] "FALSE overlap for item A5 in theshold t4"
[1] "Confidence intervals for A5 threshold t5 :"
[1] "Group 1: [0.732, 0.925]"
[1] "Group 2: [0.556, 0.678]"
[1] "FALSE overlap for item A5 in theshold t5"
The function prints TRUE when there is an overlap and FALSE when there is not. We see that most of the thresholds do not overlap, which indicates that the Agreeableness scale does not show scalar invariance across genders in these data.
10.7 References
Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling: A Multidisciplinary Journal, 13(2), 186–203. https://doi.org/10.1207/s15328007sem1302_2
Cheung, G. W., & Lau, R. S. (2012). A direct comparison approach for testing measurement invariance. Organizational Research Methods, 15(2), 167-198. https://doi.org/10.1177/1094428111421987
Hancock, G. R. (1997). Structural equation modeling methods of hypothesis testing of latent variable means. Measurement & Evaluation in Counseling & Development, 30(2), 91–105. https://doi.org/10.1080/07481756.1997.12068926
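Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55. https://doi.org/10.1080/10705519909540118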
Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2022). semTools: Useful tools for structural equation modeling. R package. Retrieved from https://CRAN.R-project.org/package=semTools
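Lubke, G. H., & Dolan, C. V. (2003). Can unequal residual variances across groups mask differences in residualized means? Structural Equation Modeling: A Multidisciplinary Journal, 10(2), 175–192.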
Lubke, G. H., & Muthén, B. O. (2004). Applying multigroup confirmatory factor models for continuous outcomes to Likert scale data complicates meaningful group comparisons. Structural equation modeling, 11(4), 514-534. https://doi.org/10.1207/s15328007sem1104_2
Muthén, B., & Kaplan, D. (1985). A Comparison of Some Methodologies for the Factor Analysis of Non-Normal Likert Variables. British Journal of Mathematical and Statistical Psychology, 38, 171–189. https://doi.org/10.1111/j.2044-8317.1985.tb00832.x
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Revelle, W. (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package. https://CRAN.R-project.org/package=psych
Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02
Svetina, D., Rutkowski, L., & Rutkowski, D. (2020). Multiple-group invariance with categorical outcomes using updated guidelines: An illustration using Mplus and the lavaan/semTools packages. Structural Equation Modeling: A Multidisciplinary Journal, 27(1), 111–130. https://doi.org/10.1080/10705511.2019.1602776
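Thompson, M. S., & Green, S. B. (2006). Evaluating between-group differences in latent variable means. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 119–169). Information Age Publishing.
Wu, H., & Estabrook, R. (2016). Identification of confirmatory factor analysis models of different levels of invariance for ordered categorical outcomes. Psychometrika, 81(4), 1014–1045. https://doi.org/10.1007/s11336-016-9506-0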