Technical Memorandum on Soil Boring Investigation

Appendix D - Statistical Methodology

D.1 - Statistical Approach

The background concentrations were calculated using information presented in two U.S. Environmental Protection Agency documents:

"Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities," Interim Final Guidance, February 1989 (EPA, 1989).

"Statistical Analysis of Ground-water Monitoring Data at RCRA Facilities," Draft Addendum to Interim Final Guidance, July 1992 (EPA, 1992).

Although the Partial Facility Closure Plan indicated that the Student's t-test would be used for determining background metals concentrations, this test proved impractical for this application. The Student's t-test is suggested in the risk reduction rules, but it is suitable only for normally distributed populations. Preliminary application of the t-test consistently produced "cleanup" values below three of the ten actual background values. This result is clearly unacceptable, since it implies either that background metals concentrations require cleanup or that the background samples were collected in areas of waste management activities. The background sample locations were based on records of all SWMUs at CSSA (ES, 1993b) and were carefully reviewed in the field by Parsons ES and CSSA personnel prior to collection of any background samples. Therefore, after evaluation of the sample locations and the preliminary t-test results, it was concluded that the background sample locations were acceptable for calculating background metals levels, but that the Student's t-test was not appropriate for the number of samples collected.

The background concentrations were therefore calculated with a Tolerance Interval test, using a one-sided upper tolerance limit (UTL) to estimate the upper bound on a large fraction of the concentration distribution. Use of the Tolerance Interval test for this purpose was recently approved by the TNRCC in a similar study of background metals levels at a nearby U.S. Air Force facility (Kelly AFB, 1994). In addition, the UTL is referenced in the EPA documents listed above as an approved method for comparing background monitoring data to compliance wells (EPA, 1989 and 1992). For background soil data, the UTL predicts the upper range of background concentrations from a relatively small data set.

The UTL is designed for use on data that consist mainly of positive detections. Since background data sets typically contain many non-detects, several tests and procedures must be applied to those data. Non-detect data must be evaluated and manipulated in one of three ways, depending on the percentage of non-detects within the sample population. After screening for non-detects, the data must be tested for normality; the UTL assumes a normal or lognormal distribution. The specific procedures used to calculate background levels are described below. Figure 4.5 is a flowchart of the procedures followed to determine the UTL.

D.2 - Procedures for Non-Detects

If an analyte was present at a concentration less than the detection limit of the method, the analytical result was reported as not detected, with a detection limit rather than a concentration. All data with "U" or "UJ" qualifiers were considered non-detect. The procedure for addressing below-detection-limit values depended on the percentage of non-detect values in the data set. There were three possibilities:

  1. For less than 15% non-detect results, the non-detect values were replaced with a value equal to half the practical quantitation limit (PQL). The distribution of the data and the appropriate UTL (normal or lognormal, based on the distribution) were then determined.

  2. For between 15% and 50% non-detect results, Cohen's or Aitchison's adjustment was made to the sample mean and the standard deviation to continue with a parametric UTL.

  3. For greater than 50% non-detect results, a non-parametric UTL was used. A non-parametric UTL is not based on a normal or lognormal distribution; instead, the largest value observed in the data set was used as the tolerance limit. Detection limits (for non-detect values) were included in determining the largest observed value unless the analysis was performed on a laboratory dilution.
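As an illustration, the three-way screening above can be sketched in Python. The function and argument names (screen_non_detects, pql) are hypothetical, and non-detects are assumed to carry their reported detection limit as the value:

```python
def screen_non_detects(values, detected, pql):
    """Route a background data set to one of the three cases above.

    Hypothetical sketch: `values` holds concentrations for detects and
    reported detection limits for non-detects; `detected` is a parallel
    list of booleans; `pql` is the practical quantitation limit.
    Returns the (possibly substituted) data and the branch taken.
    """
    n = len(values)
    nd_fraction = detected.count(False) / n

    if nd_fraction < 0.15:
        # Case 1: replace each non-detect with half the PQL, then test
        # the distribution and compute a parametric UTL.
        data = [v if d else pql / 2.0 for v, d in zip(values, detected)]
        return data, "parametric (half-PQL substitution)"
    elif nd_fraction <= 0.50:
        # Case 2: keep the data; Cohen's or Aitchison's adjustment will
        # correct the sample mean and standard deviation downstream.
        return list(values), "parametric (Cohen/Aitchison adjustment)"
    else:
        # Case 3: non-parametric UTL -- the largest observed value
        # (including detection limits) serves as the tolerance limit.
        return list(values), "non-parametric (maximum observed value)"
```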

D.3 - Normality Tests

Probability plots, probability plot correlation coefficients, and the Shapiro-Wilk test were used to determine whether the data were normally or lognormally distributed. Probability plots were also used to visually screen the data for statistical outliers. Some outliers were eliminated from the statistical evaluation of background concentrations; elimination of outliers is clearly noted on the probability plots.

D.3.1   Probability Plots

Probability plots show the normal cumulative probability plotted against the sample concentrations. If the points plot approximately on a straight line, the underlying distribution is approximately normal (or lognormal). Probability plots are also useful for identifying irregularities within the data and indicating the presence of possible outlier values that do not follow the basic pattern of the data (EPA, 1992). Because computer software was used to evaluate the data, the y-axis of the plot was generated using normal quantile values instead of the normal cumulative probability. The y-coordinate was computed using the following formula:

yi = D^-1(i/(n+1)),

where D^-1 denotes the inverse of the cumulative normal (or lognormal) distribution, n represents the size of the data set, and i represents the rank position of the ith ordered concentration (EPA, 1992).
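As a sketch, the y-coordinates can be computed with Python's standard library (the function name is illustrative):

```python
from statistics import NormalDist

def plotting_quantiles(n):
    """Normal quantiles yi = D^-1(i/(n+1)) for i = 1..n, as used for
    the y-axis of the probability plots (EPA, 1992)."""
    inv = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return [inv(i / (n + 1)) for i in range(1, n + 1)]
```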

D.3.2   Shapiro-Wilk Test

The Shapiro-Wilk test is based on the premise that if a set of data is normally (or lognormally) distributed, the ordered values should be highly correlated with the corresponding quantiles taken from a normal (or lognormal) distribution (Shapiro and Wilk, 1965). The Shapiro-Wilk test can be used for data sets of 50 samples or fewer. The test statistic (W) tends to be large when a probability plot of the data indicates a nearly straight line (normal distribution); when the data show significant bends or curves, W is small. The test statistic W was calculated using the following formula:

W = [b/(S*SQRT(n-1))]^2

where b = SUM(j=1 to k) aj*(x(n-j+1) - x(j))

The value x(j) represents the jth smallest ordered value in the sample, the coefficient aj depends on the sample size n, S is the standard deviation, n is the number of samples, and k is the greatest integer less than or equal to n/2.

Normality (or lognormality) of the data was rejected if the Shapiro-Wilk statistic was lower than the critical value (Shapiro and Wilk, 1965). In cases where both the normal and lognormal distributions passed the test, the distribution was selected based on the value of the test statistic W: because a higher value of W indicates a closer fit to normality (whether for the original data or the log-transformed data), the distribution with the higher value of W was considered the closer fit to the data.

D.3.3   Probability Plot Correlation Coefficient

The probability plot correlation coefficient measures the linearity of data in a probability plot, thereby indicating how normal (or lognormal) the data are. The probability plot correlation coefficient was used in cases where it was not visually evident which probability plot was most linear.

These correlations involved only the points actually plotted (i.e., detected concentrations). Since the correlation coefficient is a measure of the linearity of the points on a plot, the correlation coefficient will be high when the plotted points fall along a straight line and low when there are significant bends and curves in the probability plot. The following formula (Filliben, 1975 and EPA, 1992) was used in calculating the probability plot correlation coefficients:

r = SUM[(X(i) - X)(M(i) - M)] / SQRT{SUM[(X(i) - X)^2] * SUM[(M(i) - M)^2]}

where X(i) represents the ith smallest ordered concentration value, M(i) is the median of the ith order statistic from a standard normal (or lognormal) distribution, and X and M represent the average values of the X(i) and M(i). The plot with the higher correlation coefficient (r) represented the more linear trend and indicated which method was used to adjust the data.
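The correlation coefficient can be sketched in Python, assuming Filliben's (1975) approximation for the order-statistic medians M(i); pass log-transformed data to test lognormality:

```python
from math import sqrt
from statistics import NormalDist

def ppcc(x):
    """Probability plot correlation coefficient r between the ordered
    data X(i) and normal order-statistic medians M(i), using Filliben's
    (1975) approximation for the medians (an assumption here)."""
    n = len(x)
    xs = sorted(x)
    inv = NormalDist().inv_cdf

    def u(i):
        # Filliben's approximation to the uniform order-statistic medians
        if i == 1:
            return 1.0 - 0.5 ** (1.0 / n)
        if i == n:
            return 0.5 ** (1.0 / n)
        return (i - 0.3175) / (n + 0.365)

    m = [inv(u(i)) for i in range(1, n + 1)]
    xbar = sum(xs) / n
    mbar = sum(m) / n
    num = sum((a - xbar) * (b - mbar) for a, b in zip(xs, m))
    den = sqrt(sum((a - xbar) ** 2 for a in xs)
               * sum((b - mbar) ** 2 for b in m))
    return num / den
```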

D.4 - Adjustment of Sample Mean and Standard Deviation

Both Cohen's adjustment and Aitchison's adjustment can be used to adjust the sample mean and sample standard deviation to account for data below the detection limit. To determine whether Cohen's or Aitchison's adjustment was more appropriate for a particular data set during this statistical evaluation, two separate probability plots were constructed.

D.4.1   Censored Probability Plot

A censored probability plot was constructed to test Cohen's underlying assumption that non-detects have been "censored" at their detection limit. To construct the censored probability plots, the combined data set of detects and non-detects was ordered, and normal quantiles were computed for the data set as in a regular probability plot. However, only the detected values and their associated normal quantiles were actually plotted. If the censored probability plot was more linear than the detects-only probability plot, then Cohen's assumption was considered acceptable, and Cohen's adjustment was made to estimate the sample mean and standard deviation.

D.4.2   Detects Only Probability Plot

To test the assumptions of the Aitchison method, a detects-only probability plot was constructed. The assumptions underlying Aitchison's adjustment are that non-detects represent zero concentrations and that detects and non-detects follow separate probability distributions. Only detected measurements were used to construct the detects-only probability plots; non-detects were completely ignored, and normal quantiles were computed only for the ordered detected values. The same number of points and concentration values were plotted on both the detects-only and censored probability plots; however, different normal quantiles were associated with each detected concentration. If the detects-only probability plot was more linear than the censored probability plot, then the underlying assumptions of Aitchison's method were considered reasonable.
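The difference between the two plots can be sketched as follows. The i/(n+1) plotting positions follow the formula in Section D.3.1, and all names are illustrative; the exact ranking convention for the censored plot is an assumption of this sketch:

```python
from statistics import NormalDist

def plotting_positions(values, detected):
    """Normal quantiles for the censored and detects-only probability
    plots described above. Both plots show only the detected values;
    they differ in which ranks are used to compute the quantiles."""
    inv = NormalDist().inv_cdf
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    # Censored plot: rank within the full (detects + non-detects) data set
    censored = [(values[i], inv((r + 1) / (n + 1)))
                for r, i in enumerate(order) if detected[i]]
    # Detects-only plot: rank within the detected values alone
    dets = sorted(v for v, d in zip(values, detected) if d)
    m = len(dets)
    detects_only = [(v, inv((j + 1) / (m + 1))) for j, v in enumerate(dets)]
    return censored, detects_only
```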

D.4.3   Cohen's Adjustment

To determine the adjusted sample mean and standard deviation, a number of calculations were made. First, the mean (Xd) and standard deviation (Sd) of the data above the detection limit were calculated. Then, two parameters, h and gamma, were calculated using the following equations:

h = (n-m)/n

gamma = Sd^2/(Xd-DL)^2

where n is the total number of samples, m is the number of detected values, and DL is the average detection limit. Based on the values of h and gamma, the value of the parameter k was determined from the tables in EPA (1989).

The adjusted sample mean (Xa) and standard deviation (Sa) were determined using the equations:

Xa = Xd - k(Xd-DL)

Sa = SQRT(Sd^2 + k(Xd-DL)^2)

The adjusted sample mean and standard deviation were then used to determine the upper tolerance limit.
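A minimal sketch of these equations, assuming the table value k has already been looked up from h and gamma (EPA, 1989) and is supplied by the caller:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_adjustment(detects, avg_dl, k):
    """Cohen's adjustment per the equations above. `detects` holds the
    values above the detection limit, `avg_dl` is the average detection
    limit DL, and `k` is the EPA (1989) table value (not reproduced
    here). Names are illustrative."""
    xd = mean(detects)                            # Xd, mean of detects
    sd = stdev(detects)                           # Sd, std. dev. of detects
    xa = xd - k * (xd - avg_dl)                   # adjusted mean
    sa = sqrt(sd ** 2 + k * (xd - avg_dl) ** 2)   # adjusted std. deviation
    return xa, sa
```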

D.4.4   Aitchison's Adjustment

Aitchison's adjustment was calculated on either the normal or lognormal data, depending on the distribution. The adjusted mean was calculated using the equation:

Xa = (1-d/n)Xd

where d is the number of non-detect values, n is the total number of samples, and Xd is the sample mean of the detected values. The adjusted standard deviation was calculated using the equation:

Sa^2 = [(n-d-1)/(n-1)]*Sd^2 + (d/n)*[(n-d)/(n-1)]*Xd^2

The adjusted mean and adjusted standard deviation were then used to determine the upper tolerance limit.
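The two Aitchison equations can be sketched as (function name hypothetical; d is derived from the total sample count, since non-detects are treated as zeros):

```python
from math import sqrt
from statistics import mean, stdev

def aitchisons_adjustment(detects, n):
    """Aitchison's adjustment per the equations above. `n` is the total
    sample count, so the number of non-detects is d = n - len(detects)."""
    d = n - len(detects)             # number of non-detect values
    xd = mean(detects)               # Xd, mean of detected values
    sd = stdev(detects)              # Sd, std. dev. of detected values
    xa = (1 - d / n) * xd            # adjusted mean
    sa2 = ((n - d - 1) / (n - 1)) * sd ** 2 \
        + (d / n) * ((n - d) / (n - 1)) * xd ** 2
    return xa, sqrt(sa2)
```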

D.4.5   Upper Tolerance Limit

A tolerance limit was constructed from the background soil data to establish a basis for determining whether there is statistically significant evidence of contamination in the B-20 soil samples. The Tolerance Interval test consists of defining an interval, based on the background soil data, that is expected to contain (with a stated level of confidence) a given percentage of the population. A tolerance interval is constructed to contain a specified proportion (P%) of the population with a specified confidence coefficient, Y. The proportion of the population included, P, is referred to as the coverage; the probability with which the tolerance interval includes the proportion P% of the population is referred to as the tolerance coefficient.

A coverage of 95% and a tolerance coefficient of 95% were used in the evaluation of all the background soil data. Therefore, there is a confidence level of 95% that the tolerance limit will contain 95% of the distribution of observations from background data.

The UTL was calculated using the following equation:

UTL = X + KS,

where X is the mean of the data, K is the one-sided normal tolerance factor (EPA, 1989), and S is the standard deviation. The tolerance intervals were constructed assuming that the data, or the log-transformed data, are normally distributed; the distribution indicated by the Shapiro-Wilk test was used for the calculation.
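A sketch of the calculation follows. K must come from the EPA (1989) tables and is supplied by the caller; the treatment of lognormal data (computing the UTL on log-transformed values and converting back to concentration units) is an assumption of this sketch, consistent with the log-transform approach described above:

```python
from math import exp, log
from statistics import mean, stdev

def upper_tolerance_limit(data, k, lognormal=False):
    """UTL = X + K*S per the equation above, with K the one-sided normal
    tolerance factor for 95% coverage / 95% confidence (EPA, 1989).
    Lognormal handling is an assumption: the UTL is computed on the
    log-transformed values and back-transformed."""
    vals = [log(v) for v in data] if lognormal else list(data)
    utl = mean(vals) + k * stdev(vals)
    return exp(utl) if lognormal else utl
```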