TY - JOUR

T1 - Sharpening the toolbox of computational chemistry: a new approximation of critical F-values for multiple linear regression

AU - Kramer, C.

AU - Tautermann, C.

AU - Livingstone, D.

AU - Salt, D.

AU - Whitley, David

AU - Beck, B.

AU - Clark, Tim

PY - 2009

Y1 - 2009

AB - Multiple linear regression is a major tool in computational chemistry. Although it has been used for more than 30 years, it has only recently been noted within the cheminformatics community that the standard F-values used to assess the significance of the resulting models are inappropriate in situations where the variables included in a model are chosen from a large pool of descriptors, due to an effect known in the statistical literature as selection bias. We have used Monte Carlo simulations to estimate the critical F-values for many combinations of sample size (n), model size (p), and descriptor pool size (k), using stepwise regression, one of the methods most commonly used to derive linear models from large sets of molecular descriptors. The values of n, p, and k represent cases appropriate to contemporary cheminformatics data sets. A formula for general n, p, and k values has been developed from the numerical estimates that approximates the critical stepwise F-values at 90%, 95%, and 99% significance levels. This approximation reproduces both the original simulated values and an interpolation test set (within the range of the training values) with an R² value greater than 0.995. For an extrapolation test set of cases outside the range of the training set, the approximation produced an R² above 0.93.

U2 - 10.1021/ci800318q

DO - 10.1021/ci800318q

M3 - Article

VL - 49

SP - 28

EP - 34

JO - Journal of Chemical Information and Modeling

JF - Journal of Chemical Information and Modeling

SN - 1549-9596

IS - 1

ER -