# Mastering the Basics: An Introduction to Multivariable Science Modeling

## Basic Concepts

Many scientific models involve more than one explanatory variable. These models are often used to predict or optimize a real-world situation or to gain insight into the system under study. Multivariable techniques analyze several predictor variables simultaneously to explain or predict an outcome of interest. The basic concepts of multivariable science modeling include the mathematical frameworks, computational techniques, and statistical reasoning needed to perform these analyses. They extend univariate and bivariate analysis (techniques that study one or two variables at a time) and are usually applied through regression methods.

The mathematical foundations for multivariable science modeling are the concepts of calculus, particularly derivatives and integrals. Derivatives describe the rate of change of a function's output with respect to its input, while integrals accumulate infinitesimal quantities to recover a total. In addition, a basic understanding of vector calculus is helpful.
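These two ideas can be checked numerically. The short sketch below (using NumPy and the example function f(x) = x², which is my choice for illustration, not from the text) approximates a derivative with a central difference and an integral with the trapezoid rule, and both agree with the analytic answers:

```python
import numpy as np

def f(x):
    return x ** 2

# Derivative: rate of change of the output f(x) with respect to
# the input x, approximated by a central finite difference.
h = 1e-6
x0 = 2.0
derivative = (f(x0 + h) - f(x0 - h)) / (2 * h)  # analytic value: 2 * x0 = 4

# Integral: accumulate many small trapezoidal slices of f over [0, 1].
xs = np.linspace(0.0, 1.0, 100_001)
ys = f(xs)
integral = np.sum((ys[:-1] + ys[1:]) / 2 * np.diff(xs))  # analytic value: 1/3
```

As the step size h shrinks and the number of slices grows, both approximations converge to the exact calculus results.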

### Variable Selection

Virtually all statistical software packages provide variable selection procedures. They are popular because they can reduce the size of a model, make predictions more accurate, and simplify the interpretation of regression coefficients. But they are not foolproof, and the quality of an explanatory model is only as good as the data that go into it.

A common procedure for identifying useful predictors is forward or backward stepwise selection. At each step, candidate variables are evaluated by how much they explain the dependent variable, often via a partial correlation or partial F-test: forward selection adds the strongest remaining candidate, while backward elimination removes the weakest one currently in the model. This process can be subject to bias. For example, when predictors are highly correlated with one another, the procedure may arbitrarily retain one and drop another even though they carry similar information about the dependent variable. Bootstrap resampling with replacement, or a similar technique, can help by showing how stable the selected model is across resamples and by providing post-selection confidence intervals that are more honest than intervals computed as if the model had been chosen in advance.
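A minimal sketch of forward stepwise selection is shown below, using synthetic data I made up for illustration (the outcome depends on columns 0 and 2 only). Each step adds the candidate that most reduces the residual sum of squares, with a crude improvement threshold as the stopping rule; a real procedure would use a partial F-test or an information criterion instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends on columns 0 and 2; columns 1 and 3 are noise.
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def rss(cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Score every remaining candidate by the RSS after adding it.
    scores = {j: rss(selected + [j]) for j in remaining}
    best = min(scores, key=scores.get)
    # Stop when the best candidate barely improves the fit
    # (crude 5% threshold; a real rule would use an F-test or AIC/BIC).
    if selected and scores[best] > 0.95 * rss(selected):
        break
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # should recover the informative columns 0 and 2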

### Regression Models

Regression models estimate the degree to which predictor variables influence a single dependent variable. A common example is an agriculture scientist predicting total crop yield based on the expected rainfall, temperature, and amount of fertilizer used. The regression model results provide point estimates and confidence intervals that describe the strength of the association between variables. This information is useful in determining which factors are worthy of further study. Adjusting for confounding variables is a frequent reason to deploy multivariable regression modeling techniques.
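The crop-yield example can be sketched with ordinary least squares on synthetic data. All the numbers below (sample size, effect sizes, units) are assumptions of mine for illustration, not results from any real agricultural study:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical field data: yield (t/ha) modeled from rainfall (mm),
# mean growing-season temperature (deg C), and fertilizer applied (kg/ha).
n = 150
rainfall = rng.uniform(300, 900, n)
temperature = rng.uniform(12, 28, n)
fertilizer = rng.uniform(0, 200, n)
yield_t = (2.0 + 0.004 * rainfall + 0.05 * temperature
           + 0.01 * fertilizer + rng.normal(scale=0.3, size=n))

# Ordinary least squares: yield ~ intercept + rainfall + temperature + fertilizer.
X = np.column_stack([np.ones(n), rainfall, temperature, fertilizer])
beta, *_ = np.linalg.lstsq(X, yield_t, rcond=None)
intercept, b_rain, b_temp, b_fert = beta

# Each slope is a point estimate of one predictor's association with
# yield while holding the other predictors fixed -- which is exactly
# how multivariable regression adjusts for confounding variables.
```

Pairing each slope with a confidence interval (e.g., from the usual OLS standard errors or a bootstrap) then indicates which factors are worth further study.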

### Model Validation

Model validation is the process of confirming that a model achieves its intended purpose. It typically involves comparing model predictions with independent experimental data sets, matching predicted and observed values at specific times or conditions. This can be challenging when the simulated scenario extends beyond observed conditions (e.g., predicting responses to climate change) or when the model produces probabilistic forecasts that incorporate uncertainty in system processes. Students can gain greater proficiency in multivariable modeling through visualizations that allow them to compare and assess models. For example, stratified boxplots that relate a quantitative variable to a categorical one, or scatterplots that relate two quantitative variables, can help students grasp important concepts such as confounding.
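The core comparison, fitting on one data set and checking predictions against an independent one, can be sketched as follows. The data-generating function and noise level here are assumptions chosen so the expected error is known:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, rng):
    """Hypothetical system: outcome driven by two predictors plus unit noise."""
    x = rng.uniform(0, 10, size=(n, 2))
    y = 1.5 + 2.0 * x[:, 0] - 0.7 * x[:, 1] + rng.normal(scale=1.0, size=n)
    return x, y

x_train, y_train = simulate(300, rng)
x_valid, y_valid = simulate(100, rng)  # independent validation set

# Fit OLS on the training data only.
A = np.column_stack([np.ones(len(y_train)), x_train])
beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Compare predicted vs. observed values on the held-out data.
pred = np.column_stack([np.ones(len(y_valid)), x_valid]) @ beta
rmse = np.sqrt(np.mean((y_valid - pred) ** 2))
# An RMSE near the noise level (1.0 here) suggests the model generalizes;
# a much larger value would signal overfitting or a missing process.
```

A scatterplot of `pred` against `y_valid` gives the visual version of this check and is a natural starting point for the student exercises described above.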

A thorough model validation program is critical before a model goes into production. It should also be repeated on a routine basis to ensure that a model continues to meet the needs of its users. 