Effex app documentation – Modeling

Docs: Upload a dataset

Mon, 01 Jan 0001 00:00:00 +0000

Uploading the data

Before the dataset is uploaded, choose one of the two following options for headers:

This indicates whether the dataset that you will upload contains a header row. After specifying this, copy the dataset from an Excel file, click on the plus sign as presented below

and paste the data. On Windows, the data can be pasted by pressing CTRL + V. On Mac, this can be done by pressing Command ⌘ + V.

Steps to follow after pasting the data

By default, the data in each column of the dataset is assigned to be quantitative. If the user wishes to change some of these default assignments to categorical, then clicking on the dropdown menu

will allow the user to make this change.
The last column of the dataset is assumed to be a response column, and the remaining columns are assumed to be factor columns. To change this default assignment, toggle between the options provided below

It is possible to have more than one response column. However, all response columns must strictly have data of the quantitative type for modeling. If a column is assigned to be categorical, then this column will not be allowed to be set as a response column.
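The default assignments and the rule on response columns can be summarized in a short Python sketch. This is only an illustration and not the code used by the app: the function names, the tab-separated input format, the default column names and the tiny made-up dataset are all assumptions.

```python
import csv
import io


def parse_pasted(text, has_header):
    """Split tab-separated text (as copied from a spreadsheet) into named columns."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        raise ValueError("no data was pasted")
    if has_header:
        names, body = rows[0], rows[1:]
    else:
        # Hypothetical default names for a dataset without a header row.
        names = [f"Column {i + 1}" for i in range(len(rows[0]))]
        body = rows
    return {name: [row[i] for row in body] for i, name in enumerate(names)}


def default_assignment(columns):
    """Every column starts as quantitative; the last column is the response."""
    names = list(columns)
    types = {name: "quantitative" for name in names}
    roles = {name: "factor" for name in names}
    roles[names[-1]] = "response"
    return types, roles


def check_responses(types, roles):
    """Reject any response column that is not quantitative."""
    bad = [n for n in roles if roles[n] == "response" and types[n] != "quantitative"]
    if bad:
        raise ValueError(f"categorical columns cannot be responses: {bad}")


# Made-up values, for illustration only.
pasted = "A\tB\tY\n1\tlow\t3.5\n2\thigh\t4.1\n"
columns = parse_pasted(pasted, has_header=True)
types, roles = default_assignment(columns)
types["B"] = "categorical"      # the user changes a factor column to categorical
check_responses(types, roles)   # passes: the response Y is still quantitative
```

Setting `roles["B"] = "response"` after this would make `check_responses` raise an error, which mirrors the restriction described above.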

Here is a screenshot of a dataset uploaded successfully with the appropriate selections for the options described above.

In our example, we take a dataset from the paper of Derringer & Suich (1980). This dataset comes from an experiment where researchers study the effect of three input factors, namely SILICA, SILANE and SULFUR, on the quality of rubber tires. The quality of the rubber tires is measured using multiple response variables, of which we consider two in our dataset, namely ABRASION and ELONG (i.e., elongation at break).

After the dataset has been defined appropriately using the options above, click on Save dataset.

References:

Derringer, G., & Suich, R. (1980). Simultaneous Optimization of Several Response Variables. Journal of Quality Technology, 12(4), 214–219.

Docs: Launch a modeling calculation

On this page, the user can specify which response columns should be used for modeling by using the sliding toggle buttons (enclosed in a black box in the image below) presented in the row of each response column.

After a response has been selected, the following options emerge.

Model definition:

In the model definition dropdown menu, the following options appear

This option can be used to specify all model terms that need to be considered for modeling. A more detailed explanation of the three types of models is as follows:
1. Main effects (this includes only first-order effects)
2. Main and interaction effects (this includes first-order effects as well as two-factor interaction effects)
3. Main and second-order effects (this includes first-order effects, two-factor interaction effects and quadratic effects)
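To make the three model definitions concrete, the following Python sketch lists the candidate terms that each option implies for a given set of factors. The function name, the option labels and the way terms are written (`A*B` for an interaction, `A^2` for a quadratic effect) are assumptions made for illustration.

```python
from itertools import combinations

MODEL_TYPES = ("main", "main_and_interactions", "second_order")


def candidate_terms(factors, model_type):
    """Return the list of model terms implied by a model definition."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    terms = list(factors)  # first-order (main) effects are always included
    if model_type in ("main_and_interactions", "second_order"):
        # two-factor interaction effects
        terms += [f"{a}*{b}" for a, b in combinations(factors, 2)]
    if model_type == "second_order":
        # quadratic effects
        terms += [f"{f}^2" for f in factors]
    return terms


factors = ["SILICA", "SILANE", "SULFUR"]
for model_type in MODEL_TYPES:
    print(model_type, candidate_terms(factors, model_type))
```

With three factors, the three options give 3, 6 and 9 candidate terms respectively (not counting the intercept).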
Intercept:

The user can choose to include or exclude the intercept from modeling using the following drop down menu
Model heredity:

For model heredity, there are three choices: Strong, Weak and No heredity.

Strong heredity:
• If an interaction effect is included, then both the linear effects of the involved factors are also included in the model.
• If a quadratic effect is included, then the linear effect of the corresponding factor is also included in the model.

Weak heredity:
• If an interaction effect is included, then at least one of the linear effects of the involved factors is also included in the model.
• If a quadratic effect is included, then the linear effect of the corresponding factor is also included in the model.

No heredity:
Neither strong nor weak heredity is imposed on the model.

Note: Ockuly et al. (2017) observed that in real experiments, strong heredity occurs more frequently than weak heredity, which in turn occurs more frequently than no heredity.
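The heredity rules can be expressed as a small check on the terms of a model. The Python sketch below is an illustration only. The representation of a model is an assumption: a main effect is a factor name, an interaction is a pair of two different factor names, and a quadratic effect is a pair that repeats the same factor name.

```python
def heredity_type(model):
    """Classify a model as 'strong', 'weak' or 'none'.

    model is a set of terms: "A" is a main effect, ("A", "B") an
    interaction effect and ("A", "A") a quadratic effect.
    """
    mains = {term for term in model if isinstance(term, str)}
    strong = weak = True
    for term in model:
        if isinstance(term, str):
            continue
        parents = set(term)  # one parent for a quadratic effect, two for an interaction
        if not parents <= mains:
            strong = False   # some parent main effect is missing
        if not parents & mains:
            weak = False     # no parent main effect is present at all
    if strong:
        return "strong"
    return "weak" if weak else "none"


print(heredity_type({"A", "B", ("A", "B")}))  # both parents of A*B are present
print(heredity_type({"A", ("A", "B")}))       # only one parent of A*B is present
print(heredity_type({("A", "B")}))            # no parent of A*B is present
```

In this sketch, "weak" means that the model satisfies weak heredity but not strong heredity, and a model with main effects only is trivially classified as "strong".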
Transformations

The software allows the response variable to be transformed if the user wishes to do so. To transform a specific response before modeling, the following options are offered: Original, Sqrt, Log. The Original option leaves the response as is, the Sqrt option takes the square root of each value in the response column, and the Log option takes the base-2 logarithm of each value in the response column. To use the Sqrt transformation, all values in the original response column must be greater than or equal to 0, and for the Log transformation, all values must be strictly greater than 0. For a given response, all three options can be selected together, in which case three separate analyses are done, one for each transformation type.
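The three transformations and their validity conditions can be sketched in Python as follows. The function name and the lowercase option labels are assumptions made for this illustration.

```python
import math


def transform_response(values, kind):
    """Apply one of the transformations 'original', 'sqrt' or 'log' (base 2)."""
    if kind == "original":
        return list(values)
    if kind == "sqrt":
        if any(v < 0 for v in values):
            raise ValueError("sqrt requires all values to be >= 0")
        return [math.sqrt(v) for v in values]
    if kind == "log":
        if any(v <= 0 for v in values):
            raise ValueError("log requires all values to be > 0")
        return [math.log2(v) for v in values]
    raise ValueError(f"unknown transformation: {kind}")


response = [4.0, 16.0]
print(transform_response(response, "sqrt"))  # [2.0, 4.0]
print(transform_response(response, "log"))   # [2.0, 4.0]
```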

After all of the above options are correctly selected, click on the Launch modeling button to begin the calculations for model selection. The model selection is performed using the method proposed by Vazquez et al. (2021).

The user will be notified via the Notifications tab when the calculations are complete. When the modeling calculations are completed, the user can navigate to the My DoE items tab, locate their data set under the option Data sets, and select the specific dataset to review the modeling results as shown below. For documentation on modeling results, refer to this page.

References:

Ockuly, R. A., Weese, M. L., Smucker, B. J., Edwards, D. J., & Chang, L. (2017). Response surface experiments: A meta-analysis. Chemometrics and Intelligent Laboratory Systems, 164, 64-75.
Vazquez, A. R., Schoen, E. D., & Goos, P. (2021). A mixed integer optimization approach for model selection in screening experiments. Journal of Quality Technology, 53 (3), 243-266.

Docs: Modeling results

Modeling options and filters

In the Modeling results page, the results that are displayed are based on two specifications: The selected response variable and the transformation type selected.

At the top of this page, you will find two boxes that indicate this. An example is given below:

In the Response box, the name of the corresponding response variable is displayed and in the Transformation box, the specific transformation of the response variable used in the modeling is displayed.

If the modeling calculations for this dataset were completed for multiple response variables, then the dropdown menu for the Response box (see the image above) will allow the user to switch between modeling results for the different response variables.

If a response variable was modeled using more than one type of transformation, then the dropdown menu for the Transformation box (see the image above) will allow the user to switch between modeling results for the different transformations of the response variable specified in the Response box.

Other options for filtering

Forcing and removing effects

If the user has some prior knowledge that certain effects must be included in all models that they wish to consider, the user can specify this using the Force effects in the model dropdown option as shown below.

The dropdown list will display all effects that were originally considered for the analysis. This option allows the user to specify more than one effect to include in all models. Similarly, if the user has some prior knowledge that certain effects must be excluded in all models that they wish to consider, the user can specify this using the Force effects NOT in the model dropdown option.
Heredity

If the user wishes to review models that obey a certain type of heredity assumption, this can be specified using the checkboxes provided under the heading Heredity (as given below).

A description of the concept of heredity can be found here.

To include all models that satisfy the strong heredity assumption, tick the box with the label ‘Strong’. To include all models that satisfy the weak heredity assumption, tick the box with the label ‘Weak’. Finally, to include all models that satisfy neither strong nor weak heredity, tick the box with the label ‘No-heredity’.

Plots with modeling results

For the specified combination of response variable and type of transformation, three graphs/plots are provided:

Effects in the generated models rasterplot

Here is an example of such a plot.

This is called a raster plot, which can be viewed as a table with a certain number of rows and columns. Each row in this table corresponds to one specific model. For a given row, the number of highlighted cells indicates the number of effects that are present in that model, while the columns of the highlighted cells indicate which specific effects are present in that model.

For example, in the image given above, there are 26 rows, which means that there are 26 models presented in the raster plot. The last row at the bottom indicates that the model corresponding to this particular row, has only one effect which is the main effect of ‘SILANE’ corresponding to column number two. By hovering with the mouse pointer, you can review other rows (models) and columns (effects) too.

All models in the raster plot are ranked by the size of the models, with models that have a larger number of terms appearing at the top and the ones with a smaller number of terms appearing at the bottom.

On hovering over a particular cell, the ‘x’ value indicates the effect or column that cell corresponds to, the ‘y’ value indicates the model or row that the cell corresponds to, and the ‘z’ value indicates a value which matches the intensity of the color highlighted in the cell. The color code of the cells is determined by the following option

If the Effects option is selected, for a given row (model), a red colored cell indicates that the corresponding effect has a positive value for its coefficient in that model, while a blue colored cell indicates a negative value. The darker the color, the larger the magnitude of the coefficient (positive or negative).

If the P-values option is selected, for a given row (model), the colors of the cells are divided (according to the legend in the image to the right) by the following ranges for the p-values:
• Between 1 and 0.75
• Between 0.75 and 0.50
• Between 0.50 and 0.20
• Between 0.20 and 0.10
• Between 0.10 and 0.05
• Between 0.05 and 0.01
• Less than 0.01
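As an illustration of how a p-value maps to one of these seven color ranges, here is a minimal Python sketch. The treatment of values that fall exactly on a boundary (for example, a p-value of exactly 0.05) is an assumption, since the ranges above do not specify it. The labels and the function name are also hypothetical.

```python
from bisect import bisect_right

# Upper bounds of the first six ranges, in increasing order.
BOUNDS = [0.01, 0.05, 0.10, 0.20, 0.50, 0.75]
LABELS = [
    "less than 0.01",
    "0.01 to 0.05",
    "0.05 to 0.10",
    "0.10 to 0.20",
    "0.20 to 0.50",
    "0.50 to 0.75",
    "0.75 to 1",
]


def p_value_range(p):
    """Return the label of the range that contains the p-value p."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    # Assumption: a value exactly on a boundary goes to the higher range.
    return LABELS[bisect_right(BOUNDS, p)]


print(p_value_range(0.003))
print(p_value_range(0.03))
print(p_value_range(0.6))
```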

Note that for any mixed model analysis involving one or more random effects (i.e., variance components), all p-values that are reported are based on the corrected degrees of freedom calculated according to the method of Kenward and Roger (1997)¹.

All models that are displayed in the raster plot are characterized by metrics that quantify the statistical quality of each model. To see this, click on

This will open up a parallel coordinate plot which allows you to visualize all models that appear in the raster plot based on the total number of effects, root mean squared error (RMSE), R² adjusted, corrected AIC (Hurvich and Tsai 1989²) and PRESS. For mixed model analysis, the conditional R² adjusted is reported which also takes into account the variance in the response explained by the random effects (i.e variance components).

An example of such a plot is given below

This plot is interactive and therefore allows the user to specify vertical constraints on each of the five provided criteria. For example, to display all models with 5 and 6 terms, define a constraint as follows:

Hover close to the vertical axis corresponding to the number of effects. The mouse pointer will change to a plus sign. When this happens, click on the vertical line in a region just below the value 5, drag the mouse pointer to an area just above 6, and release. A pink line will appear confirming your specification. Additional constraints can be specified on the other vertical axes corresponding to other model ranking criteria. To remove a constraint, hover over the pink line that you would like to remove and, when the cursor turns into a sign that points up and down, click once and the pink line will be removed. Note that only one constraint can be specified per axis.

After all constraints are specified, click the OK button, upon which the user will observe that only the models that satisfy the specified constraints on the model ranking criteria will remain in the raster plot. The plots discussed below will also be updated due to this filtering.
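Conceptually, each pink line is a lower and an upper bound on one criterion, and a model is kept only if it satisfies every bound. The Python sketch below illustrates this filtering logic. The dictionary keys and all numerical values are made up for illustration and do not come from the app.

```python
def filter_models(models, constraints):
    """Keep the models whose criteria all lie within the given (low, high) bounds."""
    return [
        model
        for model in models
        if all(low <= model[name] <= high for name, (low, high) in constraints.items())
    ]


# Hypothetical models with made-up criterion values.
models = [
    {"id": 1, "n_effects": 4, "rmse": 6.1, "aicc": 120.4},
    {"id": 2, "n_effects": 5, "rmse": 5.2, "aicc": 118.9},
    {"id": 3, "n_effects": 6, "rmse": 5.0, "aicc": 121.7},
    {"id": 4, "n_effects": 7, "rmse": 4.9, "aicc": 125.3},
]

# One constraint per axis: models with 5 or 6 terms and an RMSE of at most 5.1.
constraints = {"n_effects": (4.5, 6.5), "rmse": (0.0, 5.1)}
print([model["id"] for model in filter_models(models, constraints)])
```

An empty set of constraints keeps every model, which corresponds to a plot with no pink lines.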

To practice using such a plot, refer here.
Box plots

Here is an example of such a plot.

The different effects are marked on the x-axis, and a boxplot is presented for each effect. For a given effect, its boxplot shows the distribution of values according to the selected color code option. If the color code is set to ‘Coefficient’, then the boxplot for a given effect shows the distribution of coefficient values across all models displayed in the raster plot that include that effect. The example plot above displays the coefficient sizes for each of the effects in the example dataset. Similarly, if the color code is set to ‘P-values’, then the boxplot gives the distribution of p-values across all models displayed in the raster plot that include that effect.

For a given effect, a wider distribution of values in the boxplot suggests that the coefficient values (or p-values) are highly variable across the models that are displayed in the raster plot.
Effect frequencies

Here is an example of such a plot.

The different effects are marked on the x-axis, and a bar is presented for each effect. For a given effect, the height of its bar gives the total number of times it appears across all models that are displayed in the raster plot. The taller the bar, the more often the effect appears across the models presented in the raster plot. This plot helps identify which effects appear consistently across the models.
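The counts behind such a plot are simple to compute. In the Python sketch below, each model is represented as a set of effect names. This representation and the example models are assumptions made for illustration.

```python
from collections import Counter


def effect_frequencies(models):
    """Count in how many models each effect appears."""
    return Counter(effect for model in models for effect in model)


# Hypothetical models, each given as the set of effects it contains.
models = [
    {"SILANE"},
    {"SILANE", "SILICA"},
    {"SILANE", "SILICA", "SILICA*SULFUR"},
]

for effect, count in effect_frequencies(models).most_common():
    print(effect, count)
```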

Additional options

Show Intercept

This option is presented as follows

where this option can be toggled on or off. When on, the raster plot, the box plot and the effect frequency plot will display additional columns to indicate which effects did not appear in all models that are presented in the raster plot.

Choosing a final model

There are two ways to select a final model: recommended and manual.

If the user prefers the software to recommend a single best model, click on the following button

When this button is clicked, the software will recommend the best model of all models that appear in the raster plot using the utopia method considering the following criteria: corrected AIC (Hurvich and Tsai 1989) and PRESS. More information on the utopia method can be found here.
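The exact procedure used by the software is described on the page linked above and is not reproduced here. As a rough illustration only, one common reading of a utopia-point approach is to rescale each criterion to the range 0 to 1, take the utopia point to be the (usually unattainable) combination of the best value of every criterion, and pick the model closest to that point. The Python sketch below follows this reading. The rescaling, the Euclidean distance, the dictionary keys and the numerical values are all assumptions.

```python
import math


def closest_to_utopia(models, criteria=("aicc", "press")):
    """Pick the model closest to the utopia point (lower is better for each criterion)."""
    lows = {c: min(model[c] for model in models) for c in criteria}
    highs = {c: max(model[c] for model in models) for c in criteria}

    def distance(model):
        total = 0.0
        for c in criteria:
            span = highs[c] - lows[c]
            # Rescale to [0, 1]; the utopia point sits at 0 for every criterion.
            scaled = 0.0 if span == 0 else (model[c] - lows[c]) / span
            total += scaled ** 2
        return math.sqrt(total)

    return min(models, key=distance)


# Hypothetical models with made-up criterion values.
models = [
    {"id": "A", "aicc": 10.0, "press": 100.0},
    {"id": "B", "aicc": 12.0, "press": 60.0},
    {"id": "C", "aicc": 20.0, "press": 50.0},
]
print(closest_to_utopia(models)["id"])
```

In this made-up example, model B is selected: it is not the best on either criterion alone, but it is the best compromise between the two.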

To select a final model manually, use the graphical filtering and the other options discussed above to filter out the single model of interest, and then click on the Get recommended model button.

On clicking the Get recommended model button, all model diagnostic results will appear for the selected model.

References:

1. Kenward, M. G., & Roger, J. H. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53(3), 983-997.
2. Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2), 297-307.

Docs: Model details (for mixed models)

On clicking the Get recommended model button, all model diagnostic results will appear for the selected model. This screen has six tabs:

Summary
Effect details
Q-Q plot
Diagnostic graphs
Actual vs Predicted
Variance components

1. Summary

In this summary tab, the model is described using several quantitative and qualitative measures, some of which describe the statistical quality of the selected model. These are:

Root mean squared error (RMSE): The RMSE is an estimate of the standard deviation of the residual error. More generally, it quantifies the prediction accuracy of the selected model. The lower the value of the RMSE, the better the prediction accuracy. It is calculated as follows: $$RMSE = \sqrt{\frac{\sum_{i=1}^{N}{(y_i - \widehat{y}_i)}^2}{N-p}}$$ where
$N$ – total number of observations
$i=1,\dots,N$ – index of the observation
$y_i$ – observed value for observation number $i$
$\widehat{y}_i$ – predicted value for observation number $i$ conditioned on the random effect (i.e., the predicted value based on the conditional model)
$p$ – number of effects in the selected model including the intercept.
Conditional $R^2$ and Marginal $R^2$: The conditional $R^2$ is the value for the coefficient of determination including the variance in the response explained by the variance components, while the marginal $R^2$ excludes the variance associated with the variance components (see Piepho 2023¹ for more details).
Akaike information criterion (AIC) and corrected AIC (See here)
Bayesian information criterion (BIC).
Predicted residual error sum of squares (PRESS).
Condition number: Ratio between the maximum and minimum eigenvalues of the information matrix of the model matrix.
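The RMSE formula given in this list can be written as a short Python function. This is a sketch of the formula only: the function and argument names are assumptions, and the predicted values are taken as given rather than computed from a fitted mixed model.

```python
import math


def rmse(actual, predicted, n_effects):
    """RMSE with N - p in the denominator, where p counts the effects including the intercept."""
    n = len(actual)
    if n != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if n <= n_effects:
        raise ValueError("need more observations than effects")
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(sse / (n - n_effects))


# Made-up values: the sum of squared errors is 8 and N - p = 4 - 2 = 2.
print(rmse([1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 3.0, 6.0], n_effects=2))  # 2.0
```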

2. Effect details

In this tab, the details of the statistical tests performed on each effect are displayed in a table. For each effect in the selected model, the following information is provided:

The effect’s coefficient value (2nd column). This quantifies the size of the effect’s contribution in changing the average value of the response.
The standard error for the effect’s coefficient (3rd column). Since the true coefficient values are unknown and are estimated from the data, there is some uncertainty around the estimated coefficient value. The standard error quantifies the uncertainty associated with the estimated coefficient value.
The 4th and 5th columns, ‘[0.025’ and ‘0.975]’, give the lower and upper bounds of the 95% confidence interval for the coefficient value, that is, the bounds at the 0.025 and 0.975 quantiles.
Degrees of Freedom (6th column): For linear mixed models, the denominator degrees of freedom used in the t-tests is calculated based on the method of Kenward and Roger (1997)² which takes into account the random components and provides a more accurate test for statistical significance for all the effects.
T-statistic (7th column) is a measure that is calculated to assess the statistical significance of an effect. For a given effect, if the absolute value of the T-statistic is much larger than 0, it is likely that the effect is statistically significant.
P-value (8th column) is a measure that is based on the T-statistic. It quantifies the probability of observing an estimated coefficient value at least as extreme as the one obtained, purely by chance, when the true coefficient value is zero. The lower the p-value, the stronger the evidence that the effect is statistically significant. For a given effect, if the p-value is lower than 0.05, we say that the effect is statistically significant.
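As a worked summary of how the columns of this table relate to one another, the standard relations for a two-sided t-test are given below. The notation is introduced here for illustration and is not taken from the app: $\widehat{\beta}$ is the estimated coefficient, $SE(\widehat{\beta})$ its standard error, $\nu$ the denominator degrees of freedom, $T_\nu$ a random variable following a t-distribution with $\nu$ degrees of freedom, and $t_{0.975,\nu}$ the 0.975 quantile of that distribution. Whether the software follows exactly these formulas is an assumption.

$$t = \frac{\widehat{\beta}}{SE(\widehat{\beta})}, \qquad \left[\widehat{\beta} - t_{0.975,\nu}\,SE(\widehat{\beta}),\ \widehat{\beta} + t_{0.975,\nu}\,SE(\widehat{\beta})\right], \qquad p = 2\,P\left(T_\nu \geq |t|\right)$$

Under these relations, the 95% confidence interval excludes zero exactly when the p-value is below 0.05.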

3. Q-Q plot

A Q-Q plot is used to check if the residuals for all observations are normally distributed. Here is an example of such a chart.

The x-axis represents the magnitude for the residuals, and the y-axis represents the z-score corresponding to each individual residual after sorting all residuals. If all points on the plot lie close to the straight black line, then this is an indication that the normality assumption for the residuals is satisfied.
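The coordinates of the points in such a plot can be sketched in Python as follows. The plotting positions $(i - 0.5)/n$ used to obtain the z-scores are one common convention and are an assumption here, as the convention used by the software is not stated. The function name and the residual values are made up.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (residual, z-score) pairs: sorted residuals on x, normal quantiles on y."""
    n = len(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (residual, standard_normal.inv_cdf((rank + 0.5) / n))
        for rank, residual in enumerate(sorted(residuals))
    ]


for residual, z in qq_points([0.3, -1.2, 0.8, -0.1, 0.4]):
    print(f"{residual:6.2f} {z:6.3f}")
```

If the residuals are approximately normal, these points fall close to a straight line.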

4. Diagnostic graphs

In this tab several model diagnostic plots are presented.

Marginal residuals vs Actual response

In this plot, for the marginal model which excludes the variance explained by the random effects, the residuals (y-axis) are plotted against the actual response values (x-axis). Here is an example of such a plot:

If all points in this plot are randomly scattered around the central line without any visible trend, this indicates that the residuals obtained from the selected marginal model satisfy the assumption of homoscedasticity and independence for all residuals across all the actual observed values for the response. However, if there is a trend, then this may be a cause for concern as this indicates that the residuals vary in a systematic manner with respect to the actual value and that the assumptions of independence and homoscedasticity may be violated.

Conditional standardized residuals vs Predicted response

In this plot, for the conditional model which includes the variance explained by the random effects, the predicted values (x-axis) are plotted against the standardized residuals (y-axis). Here is an example of such a plot:

If all points in this plot are randomly scattered around the central line without any visible trend, this indicates that the standardized residuals obtained from the selected conditional model satisfy the assumption of homoscedasticity and independence for all standardized residuals across all the predicted values for the response. However, if there is a trend, then this may be a cause for concern as this indicates that the standardized residuals vary in a systematic manner with respect to the predicted value and that the assumptions of independence and homoscedasticity may be violated.

Conditional standardized residuals vs Row number

In this plot, for the conditional model which includes the variance explained by the random effects, the standardized residuals (y-axis) are plotted against the row number of each data point (x-axis). Here is an example of such a plot:

If all points in this plot are randomly scattered around the central line without any visible trend, this indicates that the residuals obtained from the selected model satisfy the assumption of homoscedasticity and independence for all residuals across the run order. However, if there is a trend, then this may be a cause for concern as this indicates that the run order is important to consider in the model and that the assumptions of independence and homoscedasticity may be violated.

Cook’s distance vs Row number

In this plot, the Cook’s distances (y-axis) are plotted against the row number of each data point (x-axis). The Cook’s distance (Cook, 1977³) reflects how influential a data point is in determining the estimates for all the fixed effects. Here is an example of such a plot:

Ideally, all data points have similar values for the Cook’s distance. Data points with larger values for the Cook’s distance are more influential in determining the estimates for all the fixed effects. Therefore, deleting the data points with large values for the Cook’s distance will produce big differences in the estimates for all the fixed effects.
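To give a feel for what the Cook’s distance measures, the Python sketch below computes it for the simplest possible case: an ordinary least squares fit of a straight line with one factor, following the classical definition of Cook (1977). This is not the mixed model computation performed by the software, and the function name and data values are made up for illustration.

```python
def cooks_distances(x, y):
    """Cook's distances for an ordinary least squares fit of y = b0 + b1 * x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    if n != len(y) or n <= p:
        raise ValueError("need equally long x and y with more than two points")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    if s2 == 0:
        raise ValueError("perfect fit: the Cook's distance is undefined")
    # Leverage of each point: how far its x value lies from the centre of the data.
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    if max(leverages) >= 1:
        raise ValueError("a point with leverage 1 has an undefined Cook's distance")
    return [e ** 2 / (p * s2) * h / (1 - h) ** 2 for e, h in zip(residuals, leverages)]


# Made-up data: the last point lies far from the others and off the trend.
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
distances = cooks_distances(x, y)
print(distances.index(max(distances)))  # row index of the most influential point
```

In this made-up example, the last row has by far the largest Cook’s distance, because it combines a high leverage with a sizeable residual.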

MDFFITS vs Row number

In this plot, the MDFFITS values (y-axis) are plotted against the row number of each data point (x-axis). Similar to the Cook’s distance, the MDFFITS or multivariate DFFITS (Belsley, Kuh, and Welsch, 1980⁴) value also reflects how influential a data point is in determining the estimates for the marginal model. The difference between the two criteria is that the MDFFITS uses the variance-covariance matrix for the fixed effects obtained after deleting the specific data point, while the Cook’s distance uses the same variance-covariance matrix for all calculations. Therefore, this chart will be very similar to the one discussed above. Here is an example of such a plot:

Ideally, all data points have similar values for the MDFFITS. Data points with larger values for the MDFFITS are more influential in determining the estimates for all the fixed effects. Therefore, deleting the data points with large values for the MDFFITS will produce big differences in the estimates for all the fixed effects.

5. Actual vs Predicted

In this tab, a scatter plot is displayed where, for each observation, the actual value of the response variable (on the y-axis) is plotted against its predicted value (on the x-axis). Such a plot is a diagnostic tool to make sure that all predicted values are similar to the actual values in the dataset. Here is an example of such a plot.

If all points on the plot lie close to the straight line, then this is an indication that all predicted values obtained using the selected model are close to the true observed values in the dataset, and hence the model performs well in terms of predicting values for the response variable.

6. Variance components

(This section will soon be updated.)

If the selected model passes all visual diagnostic checks, click on

to continue to use this selected model to proceed to the optimization step to find the optimal settings of the input factors.

Note: If the selected model does not pass all visual diagnostic checks, click on the following button

to return to the Modeling results page to choose another model.

References:

1. Piepho, H.-P. (2023). An adjusted coefficient of determination (R2) for generalized linear mixed models in one go. Biometrical Journal, 65(7).
2. Kenward, M. G., & Roger, J. H. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53(3), 983-997.
3. Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15-18.
4. Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons.