Title | Automatic identification of variables in epidemiological datasets using logic regression. |
Publication Type | Journal Article |
Year of Publication | 2017 |
Authors | Lorenz, MW, Abdi, NAshtiani, Scheckenbach, F, Pflug, A, Bülbül, A, Catapano, AL, Agewall, S, Ezhov, M, Bots, ML, Kiechl, S, Orth, A |
Corporate/Institutional Authors | PROG-IMT Study Group |
Journal | BMC Med Inform Decis Mak |
Volume | 17 |
Issue | 1 |
Pagination | 40 |
Date Published | 2017 Apr 13 |
ISSN | 1472-6947 |
Abstract | <p><b>BACKGROUND: </b>For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it makes it possible to build software that, for each target variable, presents a shortlist of candidate source variables from which a user selects the matching one, with only a low risk that the correct source variable is missing.</p><p><b>METHODS: </b>For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated.</p><p><b>RESULTS: </b>In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.</p><p><b>CONCLUSIONS: </b>We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.</p> |
DOI | 10.1186/s12911-017-0429-1 |
Alternate Journal | BMC Med Inform Decis Mak |
PubMed ID | 28407816 |
PubMed Central ID | PMC5390441 |
Grant List | U01 HL080295 / HL / NHLBI NIH HHS / United States; R01 DE013094 / DE / NIDCR NIH HHS / United States; N01 HC015103 / HC / NHLBI NIH HHS / United States; N01HC55222 / HL / NHLBI NIH HHS / United States; N01HC85086 / HL / NHLBI NIH HHS / United States; R37 NS029993 / NS / NINDS NIH HHS / United States; N01HC85079 / HL / NHLBI NIH HHS / United States; N01 HC035129 / HC / NHLBI NIH HHS / United States |
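
The workflow summarized in the abstract (hand-written rules per target variable, a search for the best Boolean combination on a construction subset, then PPV and NPV on a validation subset) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it replaces logic regression with a brute-force search over single rules and pairwise AND/OR combinations, and every rule, variable name, and data value below is hypothetical and invented purely for illustration.

```python
from itertools import combinations

# Hypothetical hand-written rules for the target variable "age".
# Each rule inspects a summary of one source variable.
RULES = {
    "name_has_age": lambda v: "age" in v["name"].lower(),
    "range_plausible": lambda v: 0 <= v["min"] and v["max"] <= 120,
    "is_integer": lambda v: v["integer"],
}


def candidates(rules):
    """Yield (label, predicate) for single rules and pairwise AND/OR combinations."""
    for name, rule in rules.items():
        yield name, rule
    for (n1, r1), (n2, r2) in combinations(rules.items(), 2):
        yield f"({n1} AND {n2})", lambda v, a=r1, b=r2: a(v) and b(v)
        yield f"({n1} OR {n2})", lambda v, a=r1, b=r2: a(v) or b(v)


def confusion(predicate, variables, truth):
    """Count true/false positives and negatives of a predicate against labels."""
    tp = fp = tn = fn = 0
    for v, is_match in zip(variables, truth):
        predicted = bool(predicate(v))
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ppv_npv(tp, fp, tn, fn):
    """PPV = TP / (TP + FP); NPV = TN / (TN + FN); NaN if a denominator is zero."""
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv


def fit(variables, truth):
    """Pick the candidate combination with the fewest misclassifications."""
    def errors(candidate):
        _, fp, _, fn = confusion(candidate[1], variables, truth)
        return fp + fn
    return min(candidates(RULES), key=errors)


# Invented construction subset: source-variable summaries and whether each is "age".
construction = [
    {"name": "AGE", "min": 35, "max": 89, "integer": True},
    {"name": "age_at_baseline", "min": 40, "max": 85, "integer": False},
    {"name": "dosage", "min": 0, "max": 500, "integer": True},
    {"name": "sbp", "min": 90, "max": 210, "integer": True},
    {"name": "visit_no", "min": 1, "max": 5, "integer": True},
]
construction_truth = [True, True, False, False, False]

# Invented validation subset, never seen during the search.
validation = [
    {"name": "Alter", "min": 30, "max": 80, "integer": True},
    {"name": "age", "min": 45, "max": 92, "integer": True},
    {"name": "percentage_stenosis", "min": 0, "max": 100, "integer": False},
    {"name": "bmi", "min": 17, "max": 48, "integer": False},
]
validation_truth = [True, True, False, False]

best_label, best_rule = fit(construction, construction_truth)
val_ppv, val_npv = ppv_npv(*confusion(best_rule, validation, validation_truth))
print(best_label, val_ppv, val_npv)
```

In this toy setup the selected combination fits the construction data perfectly but performs worse on the validation data, because a differently named age variable is missed and an unrelated variable whose name contains "age" is accepted. That is the kind of gap a separate validation subset is meant to expose.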