## Missing Value Estimation in a Nested-Factorial Design with Three Factors

*Trends Journal of Sciences Research*, Volume 3, Issue 1, 2018, Pages 10–17. https://doi.org/10.31586/Statistics.0301.02

Received June 01, 2018; Revised July 20, 2018; Accepted July 24, 2018; Published July 25, 2018

### Abstract

When data are unbalanced, it is often necessary to estimate the missing values before applying the analysis of variance technique. Previous studies have shown that different designs require different missing value estimators. With the introduction of some relatively new statistical designs, it has become expedient to derive missing value estimators for such designs. In this study, least squares estimators of missing values in a three-factor nested-factorial design are derived. Properties of the estimators are also determined. A numerical example is given to show the application of the theoretical results obtained in this paper. Our empirical results establish the appropriateness of the missing value estimation method presented in this study.

### Introduction

Comparative experiments are often inevitable in many scientific studies. They serve as the means of generating data. Therefore, care is usually taken to ensure that such experiments are properly conducted. Before carrying out a comparative experiment, an experimenter may have to adopt a suitable experimental (statistical) design. Several statistical designs have been proposed for use under certain experimental conditions ^{ 1, 2}.

Data collected in the course of a well-designed experiment need to be analysed in order to answer the research questions under consideration. If quantitative data are classified according to three or more treatments or levels of at least two factors, an analysis of variance (ANOVA) technique may be applied. Different statistical designs require different analysis of variance techniques. For instance, one-way ANOVA is applicable to data collected using the completely randomised design.

No matter how carefully an experiment is planned and conducted, the resulting data may be unbalanced. ANOVA models were originally developed for balanced data. The problem of performing analysis of variance on unbalanced data can be handled by first estimating the missing values and using the estimates in place of the missing observations. The resulting data, comprising the actual observations and the estimates of the missing values, are then analysed. Following the novel works of ^{ 3, 4}, least squares estimators of missing values have been derived for a number of statistical designs, namely, Randomised Block Design ^{ 5}, General Incomplete Block Design ^{ 6}, Latin Square Design ^{ 7, 8}, Graeco-Latin Square Design ^{ 9}, F-Square Design ^{ 10}, Cross-Over Design ^{ 11} and Split-Plot Design ^{ 12}. The purpose of this paper is to derive the least squares estimators of missing values in a nested-factorial design. Statistical properties of the estimators are also investigated.

### Review of two-stage nested design and three-factor nested factorial design

Nested designs, among other statistical designs, are frequently used in agricultural, ecological, medical and industrial experimental processes ^{ 13}. They are generally classified according to the number of factors used in the experiment. For instance, in an experimental situation where two factors (say A and B) are considered such that each level of B is combined with only one level of A, we say that B is nested in A. The resulting design is called a two-stage nested design. The linear statistical model for a balanced two-stage nested design may be written as

$${Y}_{ijk}=\mu +{\alpha}_{i}+{\beta}_{j(i)}+{\epsilon}_{k(ij)},\quad i=1,\dots ,a;\; j=1,\dots ,b;\; k=1,\dots ,r, \qquad (1)$$

where ${Y}_{ijk}$ is the kth observation at the jth level of B nested in the ith level of A, $\mu $ is the grand mean, ${\alpha}_{i}$ is the effect of the ith level of factor A, ${\beta}_{j(i)}$ is the effect of the jth level of factor B nested within the ith level of factor A and ${\epsilon}_{k(ij)}$ is the random error term such that ${\epsilon}_{k(ij)}\sim N(0,{\sigma}_{e}^{2})$. The nature of this design makes it impossible to examine the main effect of factor B and the interaction between the two factors (Shai and Ageel, 2000). In a two-stage nested design, the hypotheses to be tested depend on whether the two factors are both fixed, both random, or a combination of fixed and random. In all three cases, the partitioning of the total variation into recognised sources of variation remains the same. Let $S{S}_{A}$, $S{S}_{B(A)}$ and $S{S}_{{E}_{1}}$ denote the sum of squares due to factor A, the sum of squares due to factor B within the levels of factor A and the sum of squares due to error respectively. The total sum of squares ($S{S}_{T}$) is partitioned as follows:

$$S{S}_{T}=S{S}_{A}+S{S}_{B(A)}+S{S}_{{E}_{1}}, \qquad (2)$$

where, writing ${X}_{ijk}$ for the observed value of ${Y}_{ijk}$ and using a dot to denote summation over the corresponding subscript,

$$S{S}_{T}=\sum _{i=1}^{a}\sum _{j=1}^{b}\sum _{k=1}^{r}{X}_{ijk}^{2}-\frac{1}{abr}{X}_{\mathrm{...}}^{2},\qquad S{S}_{A}=\frac{1}{br}\sum _{i=1}^{a}{X}_{i\mathrm{..}}^{2}-\frac{1}{abr}{X}_{\mathrm{...}}^{2},$$

$$S{S}_{B(A)}=\frac{1}{r}\sum _{i=1}^{a}\sum _{j=1}^{b}{X}_{ij.}^{2}-\frac{1}{br}\sum _{i=1}^{a}{X}_{i\mathrm{..}}^{2},\qquad S{S}_{{E}_{1}}=\sum _{i=1}^{a}\sum _{j=1}^{b}\sum _{k=1}^{r}{X}_{ijk}^{2}-\frac{1}{r}\sum _{i=1}^{a}\sum _{j=1}^{b}{X}_{ij.}^{2}$$

and ${X}_{\mathrm{...}}=\sum _{i=1}^{a}\sum _{j=1}^{b}\sum _{k=1}^{r}{X}_{ijk}$.

In Table 1, $\alpha $ is the level of significance, $M{S}_{A}=\frac{S{S}_{A}}{a-1}$, $M{S}_{B(A)}=\frac{S{S}_{B(A)}}{a(b-1)}$ and $M{S}_{E}=\frac{S{S}_{{E}_{1}}}{ab(r-1)}$.
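The partition of the total sum of squares in the two-stage nested design can be checked numerically. The following is a minimal sketch, not the authors' code: the data array `x` and the helper `nested_sums_of_squares` are hypothetical, and the sketch assumes $S{S}_{B(A)}=\frac{1}{r}\sum_{i}\sum_{j}{X}_{ij.}^{2}-\frac{1}{br}\sum_{i}{X}_{i..}^{2}$, the form for which the three components add up to $S{S}_{T}$.

```python
# Hypothetical balanced two-stage nested data: x[i][j][k] is the k-th
# observation at level j of B nested within level i of A (a=2, b=3, r=2).
x = [
    [[12.0, 14.0], [15.0, 13.0], [11.0, 10.0]],
    [[18.0, 17.0], [16.0, 19.0], [20.0, 21.0]],
]


def nested_sums_of_squares(x):
    """Return (SS_T, SS_A, SS_B(A), SS_E1) for balanced data x[i][j][k]."""
    a, b, r = len(x), len(x[0]), len(x[0][0])
    grand = sum(v for level in x for cell in level for v in cell)    # X_...
    raw = sum(v * v for level in x for cell in level for v in cell)  # sum of X_ijk^2
    a_totals = [sum(sum(cell) for cell in level) for level in x]     # X_i..
    cell_totals = [[sum(cell) for cell in level] for level in x]     # X_ij.
    correction = grand ** 2 / (a * b * r)
    a_term = sum(t * t for t in a_totals) / (b * r)
    cell_term = sum(t * t for row in cell_totals for t in row) / r
    ss_t = raw - correction
    ss_a = a_term - correction
    ss_b_in_a = cell_term - a_term
    ss_e1 = raw - cell_term
    return ss_t, ss_a, ss_b_in_a, ss_e1


ss_t, ss_a, ss_b_in_a, ss_e1 = nested_sums_of_squares(x)
# The three components should add up to the total sum of squares.
assert abs(ss_t - (ss_a + ss_b_in_a + ss_e1)) < 1e-9
print(ss_t, ss_a, ss_b_in_a, ss_e1)
```

The mean squares then follow by dividing each component by its degrees of freedom, $a-1$, $a(b-1)$ and $ab(r-1)$ respectively.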

A nested-factorial design is a statistical design that involves both crossed and nested factors. Suppose that in a three-factor nested-factorial design, factors A, B and C have a levels, b levels and c levels respectively. If the b levels of factor B are nested within the a levels of factor A, and the c levels of factor C are crossed with the levels of factors A and B, we may consider the linear model:

$${X}_{ijkl}=\mu +{\alpha}_{i}+{\beta}_{j(i)}+{\gamma}_{k}+{(\alpha \gamma )}_{ik}+{(\beta \gamma )}_{jk(i)}+{e}_{l(ijk)},\quad i=1,\dots ,a;\; j=1,\dots ,b;\; k=1,\dots ,c;\; l=1,\dots ,n. \qquad (3)$$

In (3), $\mu $ is the grand mean, ${\alpha}_{i}$ is the effect of the ith level of factor A, ${\beta}_{j(i)}$ is the effect of jth level of factor B nested within ith level of factor A, ${\gamma}_{k}$ is the effect attributable to kth level of factor C, ${(\alpha \gamma )}_{ik}$ is the effect of the interaction of ith level of factor A and kth level of factor C, ${(\beta \gamma )}_{jk(i)}$ represents the interaction effect of the kth level of factor C and jth level of factor B within the ith level of factor A and ${e}_{l(ijk)}$ is the error term.

The total sum of squares ($S{S}_{T}$) corresponding to (3) is partitioned as follows:

$$S{S}_{T}=S{S}_{A}+S{S}_{B(A)}+S{S}_{C}+S{S}_{AC}+S{S}_{BC(A)}+S{S}_{E}, \qquad (4)$$

where

and

where ${F}_{4}=\frac{M{S}_{A}}{M{S}_{E}}$, ${F}_{5}=\frac{M{S}_{B(A)}}{M{S}_{E}}$, ${F}_{6}=\frac{M{S}_{C}}{M{S}_{E}}$, ${F}_{7}=\frac{M{S}_{AC}}{M{S}_{E}}$, ${F}_{8}=\frac{M{S}_{BC(A)}}{M{S}_{E}}$, ${F}_{9}=\frac{M{S}_{A}}{M{S}_{BC(A)}}$, ${F}_{10}=\frac{M{S}_{C}}{M{S}_{BC(A)}}$, ${F}_{11}=\frac{M{S}_{AC}}{M{S}_{BC(A)}}$ and ${F}_{12}=\frac{M{S}_{BC(A)}}{M{S}_{E}}$.
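For reference, the components of the total sum of squares under model (3) can be sketched in the usual dot notation for a balanced layout. This is the standard textbook form, assuming $n$ observations per cell and writing, for example, ${X}_{ijk.}$ for a cell total and ${X}_{....}$ for the grand total; it is offered as a consistency check rather than as a transcription of the paper's own displays.

$$\begin{aligned}
S{S}_{T}&=\sum_{i}\sum_{j}\sum_{k}\sum_{l}{X}_{ijkl}^{2}-\frac{{X}_{....}^{2}}{abcn}, &
S{S}_{A}&=\frac{1}{bcn}\sum_{i}{X}_{i...}^{2}-\frac{{X}_{....}^{2}}{abcn},\\
S{S}_{B(A)}&=\frac{1}{cn}\sum_{i}\sum_{j}{X}_{ij..}^{2}-\frac{1}{bcn}\sum_{i}{X}_{i...}^{2}, &
S{S}_{C}&=\frac{1}{abn}\sum_{k}{X}_{..k.}^{2}-\frac{{X}_{....}^{2}}{abcn},\\
S{S}_{AC}&=\frac{1}{bn}\sum_{i}\sum_{k}{X}_{i.k.}^{2}-\frac{1}{bcn}\sum_{i}{X}_{i...}^{2}-\frac{1}{abn}\sum_{k}{X}_{..k.}^{2}+\frac{{X}_{....}^{2}}{abcn},\\
S{S}_{BC(A)}&=\frac{1}{n}\sum_{i}\sum_{j}\sum_{k}{X}_{ijk.}^{2}-\frac{1}{cn}\sum_{i}\sum_{j}{X}_{ij..}^{2}-\frac{1}{bn}\sum_{i}\sum_{k}{X}_{i.k.}^{2}+\frac{1}{bcn}\sum_{i}{X}_{i...}^{2},\\
S{S}_{E}&=\sum_{i}\sum_{j}\sum_{k}\sum_{l}{X}_{ijkl}^{2}-\frac{1}{n}\sum_{i}\sum_{j}\sum_{k}{X}_{ijk.}^{2}.
\end{aligned}$$

Adding the six components cancels every intermediate term and leaves $S{S}_{T}$. The error sum of squares depends on the data only through the individual observations and the cell totals, which is why a missing value affects $S{S}_{E}$ only through its own cell.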

### Main Results

In this section, we derive least squares estimators of missing values in a three-factor nested-factorial design under several conditions. Theorem 3.1 provides the estimators of $s$ missing values within the same cell in a nested-factorial design.

**Theorem 3.1** *Suppose there are $n$ observations for each combination of levels of factors A, B and C in a nested-factorial design. Assume that $s$ of the $n$ observations in a cell are missing. Let the least squares estimators of the missing values be ${M}_{1},{M}_{2},{M}_{3},\cdots ,{M}_{s}$. The estimators are all equal to the arithmetic mean of the $(n-s)$ observations remaining in the cell that contains the missing values.*

*Proof* From (11), we have

$$S{S}_{E}=\sum _{y=1}^{s}{M}_{y}^{2}-\frac{1}{n}{\left({X}_{ijk.}^{\prime}+\sum _{y=1}^{s}{M}_{y}\right)}^{2}+R, \qquad (12)$$

where $R$ is the sum of all the terms independent of ${M}_{1},{M}_{2},{M}_{3},\cdots ,{M}_{s}$ and ${X}_{ijk.}^{\prime}$ is the sum of the $(n-s)$ observations that are originally available in the cell. The partial derivatives of $S{S}_{E}$ with respect to ${M}_{1},{M}_{2},{M}_{3},\cdots ,{M}_{s}$ satisfy the equations $\frac{\partial S{S}_{E}}{\partial {M}_{y}}=2{M}_{y}-\frac{2\left({X}_{ijk.}^{\prime}+\sum _{t=1}^{s}{M}_{t}\right)}{n},\ y=1,2,3,\cdots ,s$. Equating to zero the partial derivative of $S{S}_{E}$ with respect to each of ${M}_{1},{M}_{2},{M}_{3},\cdots ,{M}_{s}$ and multiplying through by $n/2$ leads to the following system of linear equations:

$${C}_{s\times s}{M}_{s\times 1}={X}_{s\times 1}, \qquad (13)$$

where

$${C}_{s\times s}=\left(\begin{array}{ccccc}n-1& -1& -1& \cdots & -1\\ -1& n-1& -1& \cdots & -1\\ -1& -1& n-1& \cdots & -1\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1& -1& -1& \cdots & n-1\end{array}\right),\quad {M}_{s\times 1}=\left(\begin{array}{c}{M}_{1}\\ {M}_{2}\\ {M}_{3}\\ \vdots \\ {M}_{s}\end{array}\right),\quad {X}_{s\times 1}=\left(\begin{array}{c}{X}_{ijk.}^{\prime}\\ {X}_{ijk.}^{\prime}\\ {X}_{ijk.}^{\prime}\\ \vdots \\ {X}_{ijk.}^{\prime}\end{array}\right).$$
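The linear system (13) can also be solved numerically for particular values of $n$ and $s$. The sketch below is illustrative only: the function name `solve_missing_same_cell` and the numbers used are assumptions, and the routine is plain Gauss-Jordan elimination in exact rational arithmetic. It requires $n>s$, since the coefficient matrix is singular when $n=s$.

```python
from fractions import Fraction


def solve_missing_same_cell(n, s, cell_total):
    """Solve C M = X, where C is s x s with n-1 on the diagonal and -1 elsewhere.

    cell_total is the sum of the (n - s) observations left in the cell.
    Assumes n > s so that C is nonsingular.
    """
    # Augmented matrix [C | X], stored as exact fractions.
    rows = [
        [Fraction(n - 1 if r == c else -1) for c in range(s)] + [Fraction(cell_total)]
        for r in range(s)
    ]
    # Gauss-Jordan elimination: reduce the left block to the identity.
    for col in range(s):
        pivot = next(r for r in range(col, s) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(s):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    # The last column now holds M_1, ..., M_s.
    return [row[-1] for row in rows]


# Hypothetical cell: n = 5 observations planned, s = 3 missing,
# and the two surviving observations sum to 50.
estimates = solve_missing_same_cell(n=5, s=3, cell_total=50)
print(estimates)
```

In this hypothetical cell every estimate equals $50/(5-3)=25$, the mean of the two surviving observations, in line with the statement of the theorem.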

Next, we solve for ${M}_{s\times 1}$ in (13) using the principle of mathematical induction.

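Before obtaining the general solution of (13), we solve (13) for $s=1$, 2 and 3. If $s=1$, we have

$$(n-1){M}_{1}={X}_{ijk.}^{\prime},\quad \text{so that}\quad {M}_{1}=\frac{{X}_{ijk.}^{\prime}}{n-1}. \qquad (14)$$

For $s=2$,

$$\left(\begin{array}{cc}n-1& -1\\ -1& n-1\end{array}\right)\left(\begin{array}{c}{M}_{1}\\ {M}_{2}\end{array}\right)=\left(\begin{array}{c}{X}_{ijk.}^{\prime}\\ {X}_{ijk.}^{\prime}\end{array}\right).$$

Solving (13) for ${M}_{1}$ and ${M}_{2}$ leads to

$${M}_{1}={M}_{2}=\frac{{X}_{ijk.}^{\prime}}{n-2}.$$

With $s=3$, the following equation is satisfied:

$$\left(\begin{array}{ccc}n-1& -1& -1\\ -1& n-1& -1\\ -1& -1& n-1\end{array}\right)\left(\begin{array}{c}{M}_{1}\\ {M}_{2}\\ {M}_{3}\end{array}\right)=\left(\begin{array}{c}{X}_{ijk.}^{\prime}\\ {X}_{ijk.}^{\prime}\\ {X}_{ijk.}^{\prime}\end{array}\right),$$

whose solution is ${M}_{1}={M}_{2}={M}_{3}=\frac{{X}_{ijk.}^{\prime}}{n-3}$. Consequently, the solution of (13) is

$${M}_{y}=\frac{{X}_{ijk.}^{\prime}}{n-s},\quad y=1,2,3,\cdots ,s.$$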
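The general solution can also be verified directly, without induction. The following short derivation is a sketch that assumes $n>s$ and uses only the structure of the coefficient matrix in (13), whose $y$th row reads $n{M}_{y}-\sum_{t=1}^{s}{M}_{t}={X}_{ijk.}^{\prime}$:

$$\begin{aligned}
\sum_{y=1}^{s}\Bigl(n{M}_{y}-\sum_{t=1}^{s}{M}_{t}\Bigr)&=s{X}_{ijk.}^{\prime}
\;\Longrightarrow\;(n-s)\sum_{t=1}^{s}{M}_{t}=s{X}_{ijk.}^{\prime}
\;\Longrightarrow\;\sum_{t=1}^{s}{M}_{t}=\frac{s{X}_{ijk.}^{\prime}}{n-s},\\
n{M}_{y}&={X}_{ijk.}^{\prime}+\frac{s{X}_{ijk.}^{\prime}}{n-s}=\frac{n{X}_{ijk.}^{\prime}}{n-s}
\;\Longrightarrow\;{M}_{y}=\frac{{X}_{ijk.}^{\prime}}{n-s},\quad y=1,2,\cdots ,s.
\end{aligned}$$

Since ${X}_{ijk.}^{\prime}$ is the total of the $(n-s)$ surviving observations, each estimator is the arithmetic mean of those observations.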

It may happen that the missing values we wish to estimate belong to different cells.

**Theorem 3.2** *Let ${V}_{1},{V}_{2},\cdots ,{V}_{q}$ denote least squares estimators of missing observations in $q$ different cells in a nested-factorial design with three factors, such that in each of the cells only one value is missing. Let the number of observations originally available in each of the $q$ cells be $n-1$. Denote the totals of observations originally available in the cells by ${X}_{{i}^{(e)}{j}^{(e)}{k}^{(e)}{l}^{(e)}}^{\prime},e=1,2,3,\cdots ,q$. Then ${V}_{e}=\frac{{X}_{{i}^{(e)}{j}^{(e)}{k}^{(e)}{l}^{(e)}}^{\prime}}{n-1},e=1,2,3,\cdots ,q$.*

*Proof* Using (11), we obtain

$$S{S}_{E}=\sum _{e=1}^{q}{V}_{e}^{2}-\frac{1}{n}\sum _{e=1}^{q}{\left({V}_{e}+{X}_{{i}^{(e)}{j}^{(e)}{k}^{(e)}{l}^{(e)}}^{\prime}\right)}^{2}+{R}^{\prime},$$

where ${R}^{\prime}$ is the sum of all the terms independent of ${V}_{1},{V}_{2},{V}_{3},\cdots ,{V}_{q}$. The partial derivatives of $S{S}_{E}$ with respect to ${V}_{1},{V}_{2},{V}_{3},\cdots ,{V}_{q}$ satisfy the equations $\frac{\partial S{S}_{E}}{\partial {V}_{e}}=2{V}_{e}-\frac{2\left({V}_{e}+{X}_{{i}^{(e)}{j}^{(e)}{k}^{(e)}{l}^{(e)}}^{\prime}\right)}{n},\ e=1,2,3,\cdots ,q$. On equating $\frac{\partial S{S}_{E}}{\partial {V}_{e}}$ to zero and solving the resulting equation, we have

$${V}_{e}=\frac{{X}_{{i}^{(e)}{j}^{(e)}{k}^{(e)}{l}^{(e)}}^{\prime}}{n-1},\quad e=1,2,3,\cdots ,q.$$
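The claim that the cell mean minimises $S{S}_{E}$ when single values are missing from different cells can be checked on a small example. The sketch below uses hypothetical data (`data`, with `None` marking a missing observation) and hypothetical helpers `ss_error` and `fill_with_cell_means`; it assumes the error sum of squares has the form $\sum {X}_{ijkl}^{2}-\frac{1}{n}\sum {X}_{ijk.}^{2}$.

```python
import copy

# Hypothetical data x[i][j][k][l] with a=2, b=2, c=2 and n=3 observations
# per cell; None marks a missing observation (one in each of two cells).
data = [
    [[[10.0, 12.0, None], [9.0, 11.0, 10.0]],
     [[14.0, 13.0, 15.0], [8.0, None, 12.0]]],
    [[[20.0, 19.0, 21.0], [17.0, 18.0, 16.0]],
     [[22.0, 23.0, 24.0], [15.0, 14.0, 16.0]]],
]


def ss_error(x):
    """Error sum of squares: squared observations minus squared cell totals / n."""
    total = 0.0
    for level_a in x:
        for level_b in level_a:
            for cell in level_b:
                total += sum(v * v for v in cell) - sum(cell) ** 2 / len(cell)
    return total


def fill_with_cell_means(x):
    """Replace each missing value by the mean of the values left in its cell."""
    filled = copy.deepcopy(x)
    for level_a in filled:
        for level_b in level_a:
            for cell in level_b:
                observed = [v for v in cell if v is not None]
                mean = sum(observed) / len(observed)
                cell[:] = [mean if v is None else v for v in cell]
    return filled


filled = fill_with_cell_means(data)
base = ss_error(filled)

# Moving either estimate away from its cell mean should increase SS_E.
for (i, j, k, l) in [(0, 0, 0, 2), (0, 1, 1, 1)]:
    for delta in (-0.5, 0.5):
        perturbed = copy.deepcopy(filled)
        perturbed[i][j][k][l] += delta
        assert ss_error(perturbed) > base

print(filled[0][0][0][2], filled[0][1][1][1], base)
```

Because each estimate enters $S{S}_{E}$ only through its own cell, the $q$ minimisations decouple, which is why the two estimates can be computed independently of one another.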

Other cases of missing values in a nested-factorial design with three factors may be encountered. For instance, two or more of the $q$ missing values may belong to the same cell. Least squares estimators of such missing values can be derived using procedures similar to those in Theorems 3.1 and 3.2.

It has been argued by many authors that when a missing value is estimated, as is the case in this study, the treatment sum of squares is biased. The bias in the sum of squares due to factor C, which arises when a missing value in the design under consideration is estimated using (14), is given in Theorem 3.3.


In the case of one missing value in a three-factor nested-factorial design, the missing value is estimated using (14) and adjustment for bias in $S{S}_{C}$ is made by subtracting $B$ from $S{S}_{C}$ (Rangaswamy, 2010). In general, if there are two or more missing values, the estimates of the values are found using the appropriate formulae for the nested and nested-factorial designs. These estimates are then used in place of the corresponding missing values, and the analyses of variance for both the nested and nested-factorial designs are conducted. The corresponding $S{S}_{E}$ and $S{S}_{{E}_{1}}$ are computed. As a consequence, the corrected sum of squares due to factor C is (Das and Giri, 1986)

### Numerical Example

Numerical illustrations in this section are based on the assembly time data from Montgomery (2013). The data were collected in an experiment in which a three-factor nested-factorial design was applied. Of interest in the experiment are the three factors operators, layouts and fixtures, which have four levels, two levels and three levels respectively. Among the three factors considered in the experiment, operators are nested under levels of layouts. It should be noted that the four operators selected for Layout 1 are different from the four operators selected for Layout 2. Moreover, the operators are randomly selected, justifying the use of the mixed effects analysis of variance model. As shown in Table 3, fixtures and layouts are arranged factorially.

For easy reference to each observation in Table 3, the observations will be expressed in ${X}_{ijkl}$ notation, where $i=1,2,3,4$ indexes operators, $j=1,2$ indexes layouts, $k=1,2,3$ indexes fixtures and $l=1,2$ indexes observations within a cell. For instance, ${X}_{1111}=22$ refers to the first observation in the cell corresponding to Operator 1, Layout 1 and Fixture 1. The data in Table 3 have been analysed in Montgomery (2013). However, for reference purposes, we consider the ANOVA results in Table 4.

Though Table 3 contains balanced data, we create missing observations by deleting some of the observations, and then estimate those observations and perform the necessary analysis of variance. Using (14), the estimate of ${X}_{1111}=22$ is found to be ${M}_{1}=24$. Replacing ${X}_{1111}$ in Table 3 by its estimate and performing the requisite analysis of variance, the results in Table 5 are obtained.
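The arithmetic behind this estimate can be sketched as follows, assuming (as the index range $l=1,2$ suggests) that each cell holds $n=2$ observations. Deleting ${X}_{1111}$ then leaves a single observation in the cell, so the cell total ${X}_{111.}^{\prime}$ equals that observation. Its value of 24 is inferred here from the reported estimate rather than read from Table 3.

$${M}_{1}=\frac{{X}_{111.}^{\prime}}{n-1}=\frac{24}{2-1}=24.$$

With only two observations per cell, the estimator simply reproduces the surviving observation, so the filled-in cell contributes nothing to the error sum of squares.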

The values inside the brackets are obtained after the adjustment has been made for the bias in the sum of squares due to factor C. The bias $B$ is calculated using Theorem 3.3. In this regard, $B=1.633$.

To illustrate the estimation of two missing values in a three-factor nested-factorial design using Theorem 3.2, we assume that the values ${X}_{2111}$ and ${X}_{2121}$ are missing in Table 3. Their least squares estimates are 24 and 28 respectively. If we ignore the classification of the data according to factor C, we have a two-stage nested design, and the estimates of ${X}_{2111}$ and ${X}_{2121}$ are found to be 25.4 and 27.4 respectively. In line with (21), the corrected sum of squares due to factor C is calculated to be Corrected $S{S}_{C}=164.233$.

It can be deduced from Tables 4, 5 and 6 that the sum of squares due to factor C, the error sum of squares and the total sum of squares all vary depending on whether the analysis is based on the balanced data or on data containing one or more estimates of missing values. Interestingly, the analysis of variance results in the three tables lead to the same conclusion for each source of variation, namely layout (A), operator within layout (B(A)), fixture (C), A$\times $C and C$\times $B(A), indicating the appropriateness of the missing value estimation technique discussed in this paper.

### Conclusion

This study primarily deals with the non-iterative least squares estimation of missing values in a three-factor nested-factorial design. The theoretical results obtained in this paper are predicated on several cases of missing values. In particular, we have paid attention to the cases of one missing value and many missing values in the same cell or different cells.

In the case of one missing value, we have shown that the estimate of the missing value is equal to the arithmetic mean of the remaining values in the cell containing the missing value. Similar results are also obtained in the case of many missing values.

In the three-factor nested-factorial design, factor C is crossed with the other factors. The bias in the sum of squares due to factor C is derived for the case in which a missing observation is estimated using the proposed estimator. The bias is shown to be a positive quantity. For the case of several missing values, an expression for the corrected sum of squares due to factor C is given.

In order to show the application and suitability of the theoretical results, a numerical example based on the data from ^{ 2} is considered. Analysis of variance tables are obtained based on the original data, data with the estimate of one missing value and the data with the estimate of two missing values. Correction is also made for the upward bias in the sum of squares due to factor C. Interestingly, the analysis of variance results obtained in these cases lead to the same conclusion for each requisite source of variation.