Objective: Data about Arab-Americans, a growing ethnic minority, are not routinely collected in vital statistics, registries, or administrative records in the USA. The difficulty of identifying Arab-Americans in publicly available data sources is a barrier to health research about this group. Here, we validate an empirically based probabilistic Arab name algorithm (ANA) for identifying Arab-Americans in health research.
Design: We used data from all Michigan birth certificates between 2000 and 2005. Fathers' surnames and mothers' maiden names were coded as Arab or non-Arab according to the ANA. We calculated the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of Arab ethnicity inferred using the ANA, with self-reported Arab ancestry as the reference.
Results: Statewide, the ANA had a specificity of 98.9%, a sensitivity of 50.3%, a PPV of 57.0%, and an NPV of 98.6%. Both the false-positive and false-negative rates were higher among men than among women. As the concentration of Arab-Americans in a study locality increased, the ANA's false-positive rate increased and its false-negative rate decreased.
Conclusion: The ANA is highly specific but only moderately sensitive as a means of detecting Arab ancestry. Future research should compare health characteristics among Arab-American populations defined by Arab ancestry and those defined by the ANA.
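
The four validation measures named in the Design can be illustrated with a minimal sketch. It assumes a 2x2 cross-tabulation of ANA coding (Arab or non-Arab name) against self-reported ancestry. The names `ConfusionCounts` and `validation_measures` are hypothetical, and the counts in the usage example are invented purely for illustration; they are not the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2x2 table: name-based coding vs. self-reported ancestry."""
    tp: int  # coded Arab by name, self-reported Arab ancestry
    fp: int  # coded Arab by name, no self-reported Arab ancestry
    fn: int  # coded non-Arab by name, self-reported Arab ancestry
    tn: int  # coded non-Arab by name, no self-reported Arab ancestry


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def validation_measures(c: ConfusionCounts) -> dict:
    """Compute the standard validation measures from the 2x2 counts."""
    return {
        # Share of people with self-reported Arab ancestry whom the name coding detects
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        # Share of people without self-reported Arab ancestry coded as non-Arab
        "specificity": _ratio(c.tn, c.tn + c.fp),
        # Share of Arab-coded names that match self-reported Arab ancestry
        "ppv": _ratio(c.tp, c.tp + c.fp),
        # Share of non-Arab-coded names that match no self-reported Arab ancestry
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not taken from the study).
    example = ConfusionCounts(tp=40, fp=10, fn=60, tn=890)
    for name, value in validation_measures(example).items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are conditioned on self-reported ancestry, whereas PPV and NPV are conditioned on the name coding, so the predictive values also depend on how common Arab ancestry is in the population being studied.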