Title | Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study. |
Publication Type | Journal Article |
Year of Publication | 2025 |
Authors | Yang, D, Zhou, D, Cai, S, Gan, Z, Pencina, M, Avillach, P, Cai, T, Hong, C |
Journal | JMIR Med Inform |
Volume | 13 |
Pagination | e54133 |
Date Published | 2025 Jan 22 |
ISSN | 2291-9694 |
Keywords | Algorithms, Cohort Studies, Female, Humans, Machine Learning, Semantics |
Abstract | BACKGROUND: Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large-scale cohort studies are both time- and resource-intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, given differences in variable encoding, accurate variable harmonization is difficult. OBJECTIVE: We propose SONAR (Semantic and Distribution-Based Harmonization) as a method for harmonizing variables across cohort studies to facilitate multicohort studies. METHODS: SONAR used semantic learning from variable descriptions and distribution learning from study participant data. Our method learned an embedding vector for each variable and used pairwise cosine similarity to score the similarity between variables. This approach was built on 3 National Institutes of Health cohorts: the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women's Health Initiative. We also used gold standard labels to further refine the embeddings in a supervised manner. RESULTS: The method was evaluated using manually curated gold standard labels from the 3 National Institutes of Health cohorts. We evaluated both the intracohort and intercohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intracohort and intercohort comparisons using area under the curve and top-k accuracy metrics. Notably, SONAR was able to significantly improve harmonization of concepts that were difficult for existing semantic methods to harmonize. CONCLUSIONS: SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning. |
DOI | 10.2196/54133 |
Alternate Journal | JMIR Med Inform |
PubMed ID | 39844378 |
PubMed Central ID | PMC11778729 |
Grant List | R61 NS120246 / NS / NINDS NIH HHS / United States |
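
The scoring step described in the abstract (one embedding vector per variable, pairwise cosine similarity, top-k ranking) can be illustrated with a minimal sketch. This is not the SONAR implementation: the function names, the variable names, and the 3-dimensional toy embeddings below are all hypothetical, and the sketch assumes the embeddings have already been learned by some other means.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_matches(query, candidates, k=3):
    """Rank candidate variables by cosine similarity to the query embedding."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings for variables in a second cohort.
cohort_b = {
    "sbp_mmhg": [0.9, 0.1, 0.0],
    "bmi": [0.1, 0.9, 0.2],
    "smoker": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a blood-pressure variable from a first cohort.
query = [0.8, 0.2, 0.1]

matches = top_k_matches(query, cohort_b, k=2)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

Under this setup, a candidate counts as a top-k hit when the gold standard match for the query variable appears among the k highest-scoring candidates, which is one way the top-k accuracy metric mentioned in the abstract could be computed.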