Assessment of reliability of microarray data and estimation of signal thresholds using mixture modeling
    Nucleic Acids Research, 2004, Issue 8
     Department of Biostatistics, Epidemiology, and Scientific Computing and 1 Department of Biological and Medical Research, King Faisal Specialist Hospital and Research Center, PO Box 3354, MBC-03, Riyadh, 11211, Saudi Arabia

    *To whom correspondence should be addressed. Tel: +966 1 464 7272, ext. 39211; Fax: +966 1 442 7854; Email: asyali@kfshrc.edu.sa

    ABSTRACT

    The DNA microarray is an important tool for the study of gene activities, but the resultant data, consisting of thousands of points, are error-prone. A serious limitation in microarray analysis is the unreliability of the data generated from low signal intensities. Such data may produce erroneous gene expression ratios and cause unnecessary validation or post-analysis follow-up tasks. In this study, we describe an approach based on normal mixture modeling for determining optimal signal intensity thresholds to identify reliable measurements of the microarray elements and subsequently eliminate false expression ratios. We used univariate and bivariate mixture modeling to segregate the microarray data into two classes, low signal intensity and reliable signal intensity populations, and applied Bayesian decision theory to find the optimal signal thresholds. The bivariate analysis approach was found to be more accurate than the univariate approach; both approaches were superior to a conventional method when validated against a reference set of biological data that consisted of true and false gene expression data. Elimination of unreliable signal intensities in microarray data should improve the quality of microarray data, including the reproducibility and reliability of gene expression ratios.

    INTRODUCTION

    DNA microarray technology provides a powerful and efficient means for measuring the relative abundance of expressed genes in a variety of applications (reviewed in 1–3). In any microarray experiment, only a small fraction of the genes are expressed under the investigated conditions. Thus, a large portion of the microarray data is comprised of low signal intensities that increase the variability, or impair the reproducibility, of the measured ratios between control and experimental samples. Other situations also give rise to low signal values, such as deposition of sub-optimal amounts of probe, poor probe quality, or incorrect segmentation of the spots. The aim of the approach described in this paper is to find optimal signal intensity thresholds that identify and eliminate the unreliable low signal intensities, so that only reliable measurements are acquired from microarray data and, consequently, accurate ratios of gene expression are obtained.

    Here, we describe a statistical approach based on normal mixture modeling (4) to segregate the microarray data into two classes, namely classes of reliable and unreliable data. Normal mixture models are used to model data that contain more than one normally distributed population, and they find use in many biological applications. We used an expectation maximization (EM) algorithm to estimate the weighted normal mixture components, i.e. class posterior probabilities (4,5). Subsequently, we applied Bayesian decision theory (6,7) to find the optimal signal thresholds that discriminate between the reliable and the unreliable (low) signal intensity populations, based on the estimated posterior probabilities. Since microarray data are often generated from two dye channels, we used both univariate and bivariate (8,9) mixture modeling to deal with the two-channel data separately or jointly, respectively. We applied the proposed techniques to data from three different microarray experiments and, to validate our methods, compiled a reference set of genes known to be expressed under the same experimental conditions. A statistically designed approach for estimating signal thresholds, such as the described method, should eliminate unreliable signal intensity observations in microarray data and yield more informative and accurate gene expression ratios.

    MATERIALS AND METHODS

    Microarray preparation

    We used complementary DNA (cDNA) microarrays, which contained 2000 distinct cDNA probes and a total of 4000 elements (10,11). The cDNA probes were generated via PCR amplification of the clone-containing glycerol bacterial culture stocks using M13 primers and spotted (100 μm diameter) at least in duplicate on poly-L-lysine glass slides using an OmniGrid robot (GeneMachines, San Carlos, CA). Total RNA samples (20 μg, 5 μg/μl) from a THP-1 monocytic cell line that was stimulated with 10 μg/ml endotoxin (Sigma, St Louis, MO) were reverse transcribed using SuperScript II (Gibco) and a primer provided by the Genisphere kit (Genisphere, Inc., Hatfield, PA). The cDNAs from the two samples (i.e. control and treatment) were combined and ethanol precipitated. The hybridization buffer supplied by the Genisphere kit was heated to 55°C for 10 min and mixed with two blocking solutions, oligo(dT) and LNA dT, to block non-specific hybridization of labeled cDNA to elements containing poly(dA) sequences and to block poly(A)-containing elements, including spotted poly(dA) sequences, respectively. The cDNA pellets were incubated in the hybridization buffer with the blockers as ‘hybridization mixes’ of 40% formamide, 4x SSC, 1% SDS and 2x Denhardt’s solution, and were then applied to the microarrays. The microarrays were covered with coverslips and incubated for 18 h at 55°C in a sealed, humidified chamber. After hybridization, the slides were washed briefly in a series of washes to remove unbound cDNA. Subsequently, the microarrays were subjected to dendrimer hybridization using Genisphere DNA dendrimer capture reagent labeled with Cy3 (control) or Cy5 (experiment). The hybridization mixes were incubated first at 75–80°C for 10 min and then at 50°C for 20 min, and were then added to pre-warmed microarray slides. The microarrays were covered with coverslips and incubated for 3 h in a dark humidified chamber at 45–50°C. After hybridization, the slides were washed briefly several times to remove unbound dendrimer molecules, and the microarrays were dried by centrifugation.

    Image acquisition and intensity extraction

    The cDNA microarrays were scanned at 10 μm resolution using a ScanArray 5000 scanner (GSI Lumonics, Moorpark, CA) that excites the Cy3 (green) and Cy5 (red) fluorophores at optimal wavelengths of 532 and 635 nm, respectively. The median intensities of the Cy3 and Cy5 fluorescent signals from each cDNA spot on the microarray image were extracted using the adaptive circle algorithm (QuantArray software). Spots with large local background were excluded: a spot was excluded if its median intensity was <1.4 times the median background intensity of the same spot. Negative values, if present, were also excluded from the data. The mean background intensities were subtracted from the spot intensities. The background-corrected intensities were then normalized to the total signal intensity to account for differences in input RNA concentration or labeling efficiency in the individual Cy3 and Cy5 reverse transcriptase reactions. The intensity ratios (Cy5/Cy3), which represent the relative expression profiles of the genes in the two starting RNA samples, were then calculated.
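    A minimal sketch of this per-channel preprocessing in Python (NumPy) follows; the function and variable names are ours, and the code assumes the per-spot median intensities and local background statistics are available as arrays:

```python
import numpy as np

def preprocess_channel(spot_med, bg_med, bg_mean, cutoff=1.4):
    """Filter, background-correct and normalize one channel, as described above.

    spot_med : median spot intensities (one per element)
    bg_med   : median local background of each spot
    bg_mean  : mean local background of each spot
    """
    spot_med = np.asarray(spot_med, dtype=float)
    # Exclude spots whose median intensity is <1.4x their median background.
    keep = spot_med >= cutoff * np.asarray(bg_med, dtype=float)
    # Subtract the mean background and exclude negative values, if present.
    corrected = spot_med - np.asarray(bg_mean, dtype=float)
    keep &= corrected > 0
    corrected = np.where(keep, corrected, np.nan)
    # Normalize to the total signal intensity of the channel.
    return corrected / np.nansum(corrected), keep

# Cy5/Cy3 ratios for the spots retained in both channels, e.g.:
# cy3, k3 = preprocess_channel(cy3_med, cy3_bg_med, cy3_bg_mean)
# cy5, k5 = preprocess_channel(cy5_med, cy5_bg_med, cy5_bg_mean)
# ratios = cy5[k3 & k5] / cy3[k3 & k5]
```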

    A reference list of expressed genes (53 or 54 genes, depending on the data set) was compiled for our two laboratory-generated data sets, based on the genes’ expression status in human monocytes, and included both endotoxin-induced genes and housekeeping genes. Their expression status was obtained from the literature and from our previous large-scale microarray expression data (10) and other data (K. S. A. Khabar, manuscript in preparation). Signal intensities in the two reference sets were identified as true positives if the microarray gene expression data agreed with prior knowledge about the expression status of the genes in induced human monocytes (10,12), while true negatives were those whose microarray gene expression data were not consistent with the expected expression or inducibility of these genes.

    Publicly available data set

    A publicly available data set from a recent study (13) was used in addition to our own data. The microarrays in that study comprised 40 000 elements, representing 36 000 different genes. We downloaded the raw Excel data file No. 17368, which corresponds to the profiling of asynchronous arm fibroblasts versus common reference fibroblasts, from the website: http://genome-www5.stanford.edu/cgi-bin/data/viewDetails.pl?fullID=17368GENEPIX17368.

    Mixture modeling

    Finite mixture modeling is a widely used technique for probability density function estimation (4,5,7–9) and has found significant applications in various biological problems (14–17). We assumed that a multivariate probability density f(x) can be expressed as a weighted sum of k densities,

    f(x) = α1h1(x; θ1) + α2h2(x; θ2) + ... + αkhk(x; θk)    (1)

    Here, hi(x; θi) is the ith (i = 1, 2, ..., k) probability density component with an m × 1 parameter vector θi, αi ≥ 0 is the ‘weight’ or ‘mixing coefficient’ of hi(x; θi), and {xj, j = 1, 2, ..., n} is the set of n d × 1 data points or observations. In the case of normal mixtures, the density component hi(x; θi) is a multivariate normal probability density, i.e. hi(x; θi) = N(x; μi, Σi) = (2π)–d/2 |Σi|–1/2 exp{–(x – μi)T Σi–1 (x – μi)/2}, where μi is the d × 1 mean vector, Σi is the d × d covariance matrix, and θi consists of the distinct elements of μi and Σi. The corresponding length of θi is m = d + (1/2)d(d + 1), because of the symmetry of the covariance matrix. In addition, α1 + α2 + ... + αk = 1, so that f(x) is a valid probability density. Therefore, the total number of parameters to be estimated in equation 1 is M = km + k – 1. In order to estimate these parameters with acceptable variance/accuracy, one must have M << n.

    Here, we assumed there were two probability density components, i.e. k = 2, embedded in the probability density of our data, corresponding to the classes of unreliable and reliable signal intensity observations. Once the mixture model parameters θi were estimated, we calculated the posterior probability of any data point x belonging to the ith class as P(x ∈ Class i | x) = αihi(x; θi)/f(x) and compared these probabilities to decide to which class the data point x belongs (7).
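    As an illustration, the k = 2 posterior computation can be written in a few lines of Python (SciPy); the names mirror the notation above, and any parameter values supplied would be placeholders rather than estimates from our data:

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, alpha, mu, sd):
    """f(x) = alpha*N(x; mu[0], sd[0]^2) + (1 - alpha)*N(x; mu[1], sd[1]^2)."""
    return alpha * norm.pdf(x, mu[0], sd[0]) + (1 - alpha) * norm.pdf(x, mu[1], sd[1])

def class_posteriors(x, alpha, mu, sd):
    """P(x in Class i | x) = alpha_i * h_i(x; theta_i) / f(x), for i = 1, 2."""
    w = np.array([alpha, 1 - alpha])
    num = np.stack([w[i] * norm.pdf(x, mu[i], sd[i]) for i in (0, 1)])
    return num / num.sum(axis=0)  # the two posteriors sum to 1 at every x
```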

    We used the EM algorithm to estimate the mixture parameters; EM is suitable for optimizing the likelihood function in situations where there is missing information, such as the class memberships of the data points (18–20). We started the algorithm with initial estimates of the mixture parameters θi obtained by the k-means algorithm (4,7,21) and iterated the expectation and maximization steps until the changes in all of the estimated parameters were smaller than a preset tolerance or until a maximum number of iterations was reached. We set the parameter tolerance and the maximum iteration count to 0.0001 and 300, respectively. We performed all computations using in-house software developed in the Matlab™ environment (The MathWorks Inc., Natick, MA).
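    The following is a minimal sketch of this fitting loop for one channel (k = 2), using SciPy’s kmeans2 for initialization and the stated tolerance and iteration cap; it illustrates the procedure and is not the authors’ in-house Matlab code:

```python
import numpy as np
from scipy.stats import norm
from scipy.cluster.vq import kmeans2

def em_two_normals(x, tol=1e-4, max_iter=300):
    """Fit a two-component univariate normal mixture to x by EM."""
    x = np.asarray(x, dtype=float)
    # k-means initialization of component means, SDs and weights.
    centers, labels = kmeans2(x[:, None], 2, minit='++', seed=0)
    mu = centers.ravel()
    sd = np.array([x[labels == i].std() for i in (0, 1)])
    alpha = np.array([np.mean(labels == 0), np.mean(labels == 1)])
    for _ in range(max_iter):
        old = np.concatenate([alpha, mu, sd])
        # E-step: posterior membership probabilities of each point.
        r = np.stack([alpha[i] * norm.pdf(x, mu[i], sd[i]) for i in (0, 1)])
        r /= r.sum(axis=0)
        # M-step: re-estimate weights, means and standard deviations.
        alpha = r.mean(axis=1)
        mu = (r * x).sum(axis=1) / r.sum(axis=1)
        sd = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / r.sum(axis=1))
        # Stop when every parameter changes by less than the tolerance.
        if np.all(np.abs(np.concatenate([alpha, mu, sd]) - old) < tol):
            break
    return alpha, mu, sd
```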

    Bayesian classification

    In a two-class classification problem, let x denote an observation and i = 1, 2 the class index. If the class prior probabilities, Pi = P(x ∈ Class i), and the class conditional densities, f(x | x ∈ Class i), are known, then according to Bayes’s theorem, the posterior probabilities, P(x ∈ Class i | x), are:

    P(x ∈ Class i | x) = Pi f(x | x ∈ Class i) / f(x)

    where f(x) is a normalizing factor that ensures the posterior probabilities add up to 1.0. The Bayesian decision rule simply assigns an observation to the class that has the higher posterior probability, i.e. it decides that x ∈ Class 1 if P(x ∈ Class 1 | x) is larger than P(x ∈ Class 2 | x), and vice versa. (This method is also known as maximum a posteriori classification.) Hence, for any given value of x, the probability of error, P(e | x), is minimized, because

    P(e | x) = min{P(x ∈ Class 1 | x), P(x ∈ Class 2 | x)}

    Detailed aspects of Bayesian classification techniques are described in Duda et al. (7). The optimal decision boundary is found by equating the posterior probabilities. In the case of multivariate normal densities, the optimal decision boundary is hyper-quadratic; its form and position are determined by the prior probabilities, the mean vectors and the covariance matrices of the distributions, i.e. the mixture components. With the class conditional densities and prior probabilities estimated through mixture modeling using the EM algorithm, this optimal classification technique can be applied to minimize the probability of error in microarray signal intensity classification.
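    For two normal classes, the maximum a posteriori rule amounts to checking the sign of the log of the posterior ratio, which is quadratic in x; a sketch, with parameters taken from a fitted mixture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_classify(x, alpha, mu1, cov1, mu2, cov2):
    """Assign each row of x to Class 1 or 2 by the maximum a posteriori rule.

    g(x) = ln[alpha N(x; mu1, cov1)] - ln[(1 - alpha) N(x; mu2, cov2)];
    the decision boundary g(x) = 0 is hyper-quadratic.
    """
    g = (np.log(alpha) + multivariate_normal.logpdf(x, mu1, cov1)
         - np.log(1 - alpha) - multivariate_normal.logpdf(x, mu2, cov2))
    return np.where(g > 0, 1, 2)
```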

    RESULTS

    We performed two independent microarray gene expression experiments using the same cell system in order to demonstrate the present approach; in addition, we used a publicly available data set (13). Background-subtracted and normalized data (fluorescence signal intensities) were first natural-log-transformed to obtain a better approximation to the normal distribution. Microarray data generally consist of signal intensities generated from two different channels, Cy3 (green) and Cy5 (red). Table 1 shows summary statistics for the two-channel data in the three data sets, including the mean, median, standard deviation (SD) and the correlation between the two channels. The table also includes the median and the SD of the raw (not log-transformed) background intensities; these values were needed to implement Fielden’s thresholding technique (22) for comparison with our approach. We then applied normal mixture modeling, fitted by maximum likelihood using the EM algorithm, to segregate the microarray data into two populations, unreliable and reliable components. We employed this approach with either univariate or bivariate analysis, in which the Cy3 (green) and Cy5 (red) channel data were analyzed independently or jointly, respectively. Subsequently, a biologically validated reference set of genes was used to confirm the methods.
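    The Table 1 quantities can be computed directly; a small sketch, assuming cy3_raw and cy5_raw hold the background-subtracted, normalized channel intensities:

```python
import numpy as np

def channel_summary(cy3_raw, cy5_raw):
    """Mean, median, SD and inter-channel correlation of log-transformed data."""
    cy3, cy5 = np.log(cy3_raw), np.log(cy5_raw)  # natural log transform
    return {
        'mean': (cy3.mean(), cy5.mean()),
        'median': (np.median(cy3), np.median(cy5)),
        'SD': (cy3.std(ddof=1), cy5.std(ddof=1)),
        'correlation': np.corrcoef(cy3, cy5)[0, 1],
    }
```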

    Table 1. Summary statistics for microarray gene expression data

    Univariate normal mixture modeling

    In the univariate approach, we processed the log-transformed data from each channel (Cy3 and Cy5) separately and assumed that their probability density functions consisted of two normal components; that is,

    fg(x) = αgN(x; μ1g, σ²1g) + (1 – αg)N(x; μ2g, σ²2g)    (2a)

    fr(x) = αrN(x; μ1r, σ²1r) + (1 – αr)N(x; μ2r, σ²2r)    (2b)

    where f(x) is the probability density, x is the log-transformed signal intensity, α is the weight, N(x; μi, σ²i), i = 1, 2, are the normal density components, and the subscripts g and r denote the Cy3 and Cy5 channels, respectively. The weighted normal density components of each channel, i.e. αN(x; μ1, σ²1) and (1 – α)N(x; μ2, σ²2), represent the posterior probabilities of the potentially unreliable and reliable intensity classes of data, respectively. Figure 1 shows the histogram of each channel from the three data sets, representing the probability density functions fg(x) and fr(x) of the entire population of signal intensities. In addition, the figure shows plots of the two probability density components that were estimated for each channel by normal mixture modeling. Thus, the normal mixture approach resulted in two normal density components, one with a lower mean (Component 1) and one with a higher mean (Component 2), corresponding to populations of data with low (potentially unreliable) and higher signal intensity, respectively. To verify that the normal mixture model fits the underlying distribution of the data well, we constructed quantile–quantile (qq) plots. Indeed, the qq plots showed that the normal mixture model for each channel fits the underlying distribution of the actual data well (Fig. 2). This was consistent across all three microarray data sets analyzed (Fig. 2A–C).

    Figure 1. Univariate normal mixture histograms and threshold plots. Histograms (60 bins) from three different data sets (A–C) represent the estimated normal density components for the Cy3 green channel (top) and the Cy5 red channel (bottom). The estimated weighted normal density components for each channel, i.e. Components 1 and 2, or αN(x; μ1, σ²1) and (1 – α)N(x; μ2, σ²2), are shown by dashed and dotted lines, respectively. The statistical parameters for each component are shown in Table 2. The sum of the weighted densities, fg(x) and fr(x) (solid lines in the top and bottom panels), closely mimics the density or histogram for the Cy3 and Cy5 channels, respectively. The vertical dashed line shows the optimal signal intensity threshold for each channel in the log domain, as explained in the text. The optimal thresholds can be converted to the raw-data domain by taking their exponentials.

    Figure 2. Analysis of the goodness of fit between the Cy3 and Cy5 data channels and their two-component normal mixture models, using qq plots. The qq plots from three different data sets (A–C) show quantiles of the actual Cy3 green channel (left) or Cy5 red channel (right) data against the mixture model of each. The dashed line in each plot corresponds to perfect agreement between the compared samples. If the ‘+’ marks, indicating the locations of the matching quantiles, lie roughly on the straight line, then the distributions of the samples have the same shape, except for a possible shift and rescaling.
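    A qq plot of this kind can be reproduced by comparing empirical quantiles of the channel data with quantiles of a large sample drawn from the fitted mixture; a sketch using matplotlib, with placeholder parameters:

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_against_mixture(x, alpha, mu, sd, n_sim=100000, seed=0):
    """qq plot of data x against its fitted two-component normal mixture."""
    rng = np.random.default_rng(seed)
    pick1 = rng.random(n_sim) < alpha  # component 1 is chosen with prob. alpha
    sim = np.where(pick1, rng.normal(mu[0], sd[0], n_sim),
                   rng.normal(mu[1], sd[1], n_sim))
    q = np.linspace(0.005, 0.995, 199)
    qm, qd = np.quantile(sim, q), np.quantile(x, q)
    plt.plot(qm, qd, '+', color='gray')    # matching quantile pairs
    plt.plot(qm, qm, '--', color='black')  # line of perfect agreement
    plt.xlabel('mixture model quantiles')
    plt.ylabel('data quantiles')
    plt.show()
```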

    An important feature observed here is that the majority of the signal intensity data (≈75%), regardless of the channel, belong to Component 1 (α ≈ 0.75; Table 2). This is consistent with the fact that most of the microarray data fall into low signal intensities, representing genes that are not expressed under any given experimental condition. Compared with Component 2, Component 1 has a smaller coefficient of variation (≈8%). The relatively larger variability (coefficient of variation ≈16%) of Component 2 (the reliable signal population) may reflect the heterogeneity of the expression levels of the different genes. In the larger data set (Table 2, third data set), with approximately 40 000 data points and more than 3000 differentially expressed genes, the variability of Component 2 (reliable signal population) was 12% (coefficient of variation), again larger than the variability of the low signal component.

    Table 2. The results of univariate and bivariate mixture modeling

    Bivariate normal mixture modeling

    We applied bivariate analysis in the normal mixture modeling by considering the Cy3 and Cy5 channels together. In general, microarray channel readings are highly correlated, as clearly seen in the scatter plots of the two-channel data points (Fig. 3). Indeed, the correlation coefficient was >0.9 in all three cases (Table 1). We exploited this correlation and modeled the probability density of the microarray data as follows:

    Figure 3. Bivariate normal mixture scatter plots, classification plots and validation plots. Cy3 (green channel) versus Cy5 (red channel) scatter plots for the three data sets (A–C) are shown. The gray ‘+’ marks denote the individual signal intensity data points, whereas the up and down triangles denote true positives and true negatives (as provided in the reference/validation sets), respectively. The large black dots and the ellipses indicate the centers (means) and the variances of the two bivariate normal mixture components, with Components 1 and 2 indicated by dashed and dotted lines, respectively (Table 2). The optimal thresholds for the Cy3 and Cy5 channels obtained using univariate analysis are shown by the vertical and horizontal solid lines, respectively. The dashed vertical and horizontal lines show the thresholds obtained using Fielden’s method for Cy3 and Cy5, respectively. In univariate analysis, the region of unreliable signal observations is the lower-left rectangular corner of the two-dimensional space, bounded by the thresholds. The thick ellipse shows the optimal decision boundary obtained using the bivariate mixture modeling approach; the region of unreliable observations is the interior of the decision ellipse.

    f(x) = αN(x; μ1, Σ1) + (1 – α)N(x; μ2, Σ2)    (3)

    Again, the weighted bivariate normal density components, i.e. αN(x; μ1, Σ1) and (1 – α)N(x; μ2, Σ2), correspond to the posterior probabilities of the low (unreliable) and the potentially reliable intensity classes of data, respectively. The bivariate mixture modeling yielded two bivariate normal components embedded in the two-dimensional probability density of the data, depicted as ellipses overlaid on the scatter plots (Fig. 3). The larger the variance of the bivariate density in a given direction, the longer the axis of the ellipse in the corresponding direction. The results of the bivariate mixture estimation, i.e. the mean vector μ, covariance matrix Σ, and weight α of the two bivariate density components for each data set, are shown in Table 2 (bivariate analysis). As in the univariate analysis, the bivariate Component 1, representing low signal intensities, had lower signal intensity means (mean vector μ) in both channels and included most of the signal intensity data (α ≈ 0.75), in contrast to Component 2 (Table 2). This is again consistent with the fact that not all genes are expressed under any given biological condition. Also, the variance of Component 1 (potentially unreliable) is smaller than the variance of Component 2 (the potentially reliable signal population), reflecting the heterogeneity of the levels of gene expression (Table 2, covariance matrix Σ). The scatter plots also convey the ratios of gene expression; in any typical experiment, a small proportion of genes have significant expression ratios (for example, Cy5 signal intensity/Cy3 signal intensity) that differ from unity (Cy3 = Cy5 readings). Thus, ratios based on signal intensities located in the ellipse of the potentially reliable Component 2 are more likely to represent true expression ratios.

    Optimal classification for signal intensity reliability

    Because there is an overlap region between the two normal components resulting from both the univariate and bivariate normal mixture analyses, one needs to find an optimal decision boundary to classify signal intensities as reliable or unreliable. In the univariate analysis, the optimal classification threshold for each channel was found by equating the class posterior probabilities. From equation 2a, the g-threshold (Cy3 signal intensity threshold) was the solution of:

    αgN(x; μ1g, σ²1g) = (1 – αg)N(x; μ2g, σ²2g)

    and similarly from equation 2b, the r-threshold (Cy5 signal intensity threshold) was the solution of:

    αrN(x; μ1r, σ²1r) = (1 – αr)N(x; μ2r, σ²2r)

    As seen in Figure 1, for each channel the optimal threshold lies exactly at the intersection of the two weighted normal density components. For example, the thresholds for data set 1 (Fig. 1A) were 1001 and 761 for the Cy3 and Cy5 channels, respectively. Each data set, as a result of its different hybridization conditions and level of global background, has its own unique thresholds; for example, the Cy3 and Cy5 thresholds for data set 2 were 963 and 789 (Fig. 1B). An observation x = (xg, xr)T is flagged as reliable if either of its channels is larger than the respective threshold, i.e. if xg > g-threshold or xr > r-threshold, then x is identified as belonging to the class of reliable signal intensities.
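    Numerically, the crossing point can be obtained with a root finder; a sketch that assumes, as is the case here, that the weighted densities cross between the two component means:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def univariate_threshold(alpha, mu, sd):
    """Solve alpha*N(x; mu[0], sd[0]^2) = (1 - alpha)*N(x; mu[1], sd[1]^2)."""
    def log_ratio(x):
        return (np.log(alpha) + norm.logpdf(x, mu[0], sd[0])
                - np.log(1 - alpha) - norm.logpdf(x, mu[1], sd[1]))
    t_log = brentq(log_ratio, mu[0], mu[1])  # threshold in the log domain
    return t_log, np.exp(t_log)              # also in the raw-intensity domain

# An observation (xg, xr) is flagged reliable if xg > g-threshold or xr > r-threshold.
```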

    In the bivariate analysis, an optimal classification decision boundary was generated instead of the fixed optimal signal thresholds obtained for each channel in the univariate analysis. This boundary was determined by equating the two weighted mixture components, or posterior probabilities, given in equation 3:

    αN(x; μ1, Σ1) = (1 – α)N(x; μ2, Σ2)

    The solution of this equation defines a quadratic curve and, consequently, the decision regions in the two-dimensional observation space (Fig. 3). On one side of this decision curve, observations belong to one class (e.g. the population of low or unreliable signal intensities), whereas the other side corresponds to the class of reliable signal intensities. In other words, the region for the class of unreliable data is the interior of the decision ellipse, denoted by the thick ellipse in Figure 3, while signal intensities belonging to the reliable class fall outside the decision ellipse.
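    Applied to the two-channel data, the same comparison of weighted component densities flags each observation directly; points where the Component 1 term dominates lie inside the decision ellipse and are classified as unreliable (a sketch, not the authors’ implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def reliable_mask(x, alpha, mu1, cov1, mu2, cov2):
    """x: (n, 2) array of log-transformed (Cy3, Cy5) pairs. True = reliable."""
    inside = (np.log(alpha) + multivariate_normal.logpdf(x, mu1, cov1)
              > np.log(1 - alpha) + multivariate_normal.logpdf(x, mu2, cov2))
    return ~inside  # outside the decision ellipse = reliable
```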

    Comparison of classification performance against the reference validation sets

    We constructed and used reference validation sets from our experimental data to assess and compare how well the different classification methods, including a common thresholding technique (22), perform. We used human monocytes (the monocytic leukemia cell line THP-1) induced by the endotoxin LPS; this is a common model in inflammation research (23–25). Thus, prior knowledge about the expression of many genes facilitates the discrimination between true and false positives in microarray data. For each data set used to test the proposed methods, a reference validation set of 50 genes was compiled. The validation sets contained both true signals and false signals, based on prior knowledge of the genes’ expression in the literature and on concordant results from large-scale expression analyses of both SAGE and microarray data (10,12). True signals are those whose microarray elements agree with prior knowledge of their expression, while false signals are those whose microarray elements (owing to various problems such as failure of amplification, low probe concentration, under-printing, etc.) are low in signal intensity and are not consistent with the expected expression or inducibility of these elements. An example is the interleukin-8 gene, which is highly and consistently inducible in endotoxin-stimulated monocytes (24), the same cellular model from which the microarray data were derived. In the arrays, the IL-8 probe was generated from two different sources of cDNA-containing cultures, which resulted in optimal and sub-optimal probe concentrations on the arrays. Another example is the eukaryotic elongation factor 2 gene, which is ubiquitously expressed and thus acts as a housekeeping gene; its elements were present as both true signals and false signals, owing to optimal and sub-optimal probe concentrations, respectively.

    To quantitatively assess how well our method classifies the signal intensities of the reference/validation gene set into the two components of reliable and unreliable signal intensities, we computed the kappa coefficient of agreement (26,27), along with the sensitivity, specificity and observed agreement rates (Table 3). Both the univariate and bivariate methods were able to classify almost all of the true positives (sensitivity 96–100%) and true negatives (specificity 93–100%, depending on the data set) in the reference gene set as reliable and unreliable signal intensities, respectively (Table 3). The accuracy of the method can also be evaluated statistically using the kappa (κ) coefficient of agreement, for which a value of 1 indicates perfect (100%) agreement. As the kappa values indicated, the best agreement with the reference classification information available in the validation sets was obtained using our classification technique based on bivariate modeling, with a 100% observed agreement rate, compared with 98 and 94% for univariate modeling on data sets 1 and 2, respectively (Table 3).
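    These agreement statistics reduce to counts from a 2 × 2 table of reference status versus predicted class; a sketch of their computation, with kappa as defined in ref. (26):

```python
def agreement_stats(tp, fn, tn, fp):
    """Sensitivity, specificity, observed agreement and Cohen's kappa
    from true/false positive and negative counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    p_o = (tp + tn) / n  # observed agreement rate
    # Chance agreement computed from the marginal totals of the 2x2 table.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    return sensitivity, specificity, p_o, kappa
```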

    Table 3. Classification results and validation with the reference set

    Using the validation sets, we were also able to compare our approach with Fielden’s method (22), a widely used method for excluding unreliable signal intensities. In this method, the threshold for each channel is obtained as:

    MedianBackG + 3 × SDBackG

    where MedianBackG and SDBackG refer to the median and SD of the signal intensity estimated from the spot backgrounds. Both the univariate and bivariate normal mixture modeling gave superior results compared with Fielden’s method. Unlike Fielden’s method, which behaved particularly poorly on data set 1 (agreement rate of only 57%), our method performed well, and equally so, in both data sets (Table 3).
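    For reference, Fielden’s per-channel cutoff depends only on the background statistics reported in Table 1; a one-function sketch:

```python
import numpy as np

def fielden_threshold(background, k=3):
    """Per-channel cutoff: median background + k * SD of background (22)."""
    bg = np.asarray(background, dtype=float)
    return np.median(bg) + k * bg.std(ddof=1)
```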

    Table 4 shows examples of the utility of our bivariate analysis-based classification method, using four expressed genes (of the 50 genes in the reference set) that are well known to be induced in human monocytes by endotoxin (LPS), in addition to data on one housekeeping gene.

    Table 4. Examples of validation of the algorithm using a reference list of genes known to be expressed in endotoxin-stimulated human monocytes

    DISCUSSION

    Microarray gene expression data usually contain a substantial number of false positives arising from various possible sources of error, but mostly from the large population of low signal intensities resulting from low expression, or lack of expression, of the majority of the genes being investigated. Although all cells harbor the same 50 000 genes, only a fraction are expressed at any time point or under any condition. In addition, low signal intensities that are near the background levels increase the variability, or impair the reproducibility, of the measured ratios between control and experimental samples. In this study, we developed an approach derived from normal mixture modeling (4,5), using univariate and bivariate analysis, that segregated the microarray data into two components. With the application of the optimal classification algorithm, two classes, namely reliable and unreliable (low) signal intensity data, were identified, and the proposed method was validated.

    We applied normal mixture modeling to microarray data because this procedure is well suited to clustering observations into groups for classification and also provides a convenient and flexible class of models for estimating distributions (4,15,28). Both univariate and bivariate normal mixture modeling were used. In the univariate analysis, data from the two microarray channels were considered separately, and optimal signal thresholds were established for each channel, whereas in the bivariate analysis, the two-dimensional nature of the data was exploited and the data were classified directly using posterior probabilities computed from the bivariate mixture components. The latter approach, which takes into account the correlation between the two channels, leads to more accurate classification of the data into reliable and unreliable components, as demonstrated with the validation reference sets.

    The validity of our approach to identifying reliable signal intensities and to the thresholding/classification procedures was supported by several statistical assessments and observations. We estimated the mixture components and their weights using the EM algorithm (18–21), an iterative method for optimizing the likelihood function in situations where the data are incomplete; in our case, the missing information was the class membership of each data point as reliable or unreliable. We used the k-means algorithm to initialize the EM algorithm to help ensure that it finds a ‘good’ local maximum (4); this is often sufficient in practical applications. Though there are other ways to fit or estimate mixture models, EM is relatively easy to implement, does not require any gradient computation, and is numerically stable.

    This statistically designed approach, leading to segregation of the data into two components, reproduced two important features of microarray data. First, the majority of the signal intensities in microarray data are not suitable for analysis, because the majority of the genes are either not expressed at all or expressed at very low basal levels. We found that our approach succeeded in segregating the microarray data into one large component (≈75%) and a smaller component (≈25%) containing reliable data with a larger mean value. This is consistent with the fact that most of the microarray data fall into low signal intensities, representing genes that are not expressed under any given experimental condition. Low signals can also arise from non-biological causes, such as low quality of the hybridization probes on oligonucleotide microarrays or sub-optimal amplification/printing of cDNA probes in cDNA microarray procedures. It should be noted that even among the reliable data, not all genes are differentially expressed, as assessed by the ratios of signal intensities: the majority of microarray elements have ratios close to 1.0 (a slope of 1.0 in scatter plots of the two channels). Thus, our approach yields a more accurate assessment of gene expression ratios by eliminating false positive ratios due to unreliable signal intensities. Since most of the error arises when working with ratios based on low (unreliable) signal intensities, the ratios are expected to be more accurate when based on the ‘reliable signals’ that our method succeeds in discriminating. Another feature that validates our procedure is that the statistically assessed variation of the reliable signal intensity class reflects the nature of microarray data and the true heterogeneity of the expression levels of the different genes.

    An important component of this study is the use of a validation reference set of expressed genes to assess the accuracy of the classification into reliable and unreliable data. We observed that both techniques, when applied to the validation sets, produced excellent results in terms of sensitivity, specificity and the kappa coefficient of agreement with the reference classification information available in the validation sets. In particular, the classification based on bivariate mixture modeling, which fully exploits the bivariate nature of the data (there is always excellent correlation between the Cy3 and Cy5 data channels), yielded superior results, with 100% sensitivity and specificity. We also compared the results of our classification method with those of a currently used simple thresholding method (22) and observed that our classification algorithm performed better, as assessed by the sensitivity, specificity and agreement statistics (kappa coefficient).

    Our method overcomes many of the limitations of existing methods. A major advantage of the approach proposed here is that the normal mixture model parameters that determine the optimal thresholds are estimated from the background-corrected signal intensities, and not from the background levels alone. Thus, the proposed approach takes into consideration both the distribution of the entire signal intensity data and the global background status. Moreover, our method deals with the spot intensities themselves, while several other methods (29,30) deal with gene expression ratios, which may already have been derived from unreliable signal intensities to start with. A clear advantage of our method is that it assigns reliability to the spots themselves, regardless of whether the genes are induced or repressed, unlike methods such as that of the Hughes et al. study (31), which assign confidence levels only to ratios that change significantly. In addition, such methods require repeated measurements, which may not always be practical.

    A few other studies have addressed the identification of reliable signal intensities, but with limitations. Much commercial microarray processing software utilizes a fixed signal threshold, which is arbitrary and depends neither on the nature of the distribution of the entire data set nor on the different background status of each array experiment. One commonly used method suggested the elimination of spot intensities whose medians were below a threshold defined as the median background intensity plus a constant (with a suggested default of 3) times the SD of the spot background (22). This method is subjective, in that it involves an arbitrary constant, and it does not take into account the distribution of the actual data; more importantly, it was not validated with a reference gene data set, as ours was. In another approach, Bilban et al. introduced a receiver operating characteristic-based analysis of a set of positive and negative controls to establish the signal thresholds (32). However, this method may be less accurate because of its reliance on the signals generated from those controls, and it needs to be repeated extensively to achieve meaningful significance. Tran et al. (33) proposed the use of the correlation between the mean and median spot intensities to eliminate inaccurate signals. Although this method is well suited to discriminating spots with good morphology, this may not be the case for low signals that nevertheless have good morphology; besides the aforementioned limitations, this approach relies on an arbitrarily defined threshold on the correlation.

    The method should prove useful in any microarray experiment with one or two channels and is applicable to cDNA microarrays, as demonstrated by the use of independent in-house and publicly available data sets, as well as to oligonucleotide-spotted microarray data (M. H. Asyali and K. S. A. Khabar, unpublished observations). The assessment of signal intensity reliability and the estimation of thresholds adapt to each data set, accommodating its particular hybridization conditions and level of global background; thus, each set of microarray data has its own unique mixture model and corresponding optimal classification thresholds. The method proved superior to a common thresholding technique and was validated with a reference list of true and false expressed gene signals. The use of the proposed classification approach should significantly improve the reliability and reproducibility of microarray data and eliminate erroneous ratios of gene expression.

    ALGORITHM AVAILABILITY

    The approach described in this paper is available as a stand-alone executable Windows™ program with an easy-to-use graphical user interface. It is available to academic users upon signing a standard academic use agreement; commercial users need to obtain a license. For further details, please see http://rc.kfshrc.edu.sa/bssc/staff/MusaAsyali/Downloads.asp.

    ACKNOWLEDGEMENTS

    The authors would like to thank Mr Mohammad Naemmuddin and Mr Mohammed Dhalla for their excellent technical assistance.

    REFERENCES

    Nguyen,D.V., Arpat,A.B., Wang,N. and Carroll,R.J. (2002) DNA microarray experiments: biological and technological aspects. Biometrics, 58, 701–717.

    Ramaswamy,S. and Golub,T.R. (2002) DNA microarrays in clinical oncology. J. Clin. Oncol., 20, 1932–1941.

    Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

    McLachlan,G.J. and Basford,K.E. (1988) Mixture Models, Inference and Applications to Clustering. Marcel Dekker, New York.

    McLachlan,G.J. (1982) The classification and mixture maximum likelihood approaches to cluster analysis. In Krishnaiah,P.R. and Kanal,L.N. (eds), Handbook of Statistics, Vol. 2. North-Holland, Amsterdam, pp. 199–208.

    Schwarz,G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461–464.

    Duda,R., Hart,P. and Stork,D. (2000) Pattern Classification, 2nd Edn. Wiley, New York.

    Symons,M. (1981) Clustering criteria and multivariate normal mixtures. Biometrics, 37, 35–43.

    Wolfe,J. (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behav. Res., 5, 329–350.

    Frevel,M.A., Bakheet,T., Silva,A.M., Hissong,J.G., Khabar,K.S. and Williams,B.R. (2003) p38 Mitogen-activated protein kinase-dependent and -independent signaling of mRNA stability of AU-rich element-containing transcripts. Mol. Cell. Biol., 23, 425–436.

    Bakheet,T., Williams,B.R. and Khabar,K.S. (2003) ARED 2.0: an update of AU-rich element mRNA database. Nucleic Acids Res., 31, 421–423.

    Suzuki,T., Hashimoto,S., Toyoda,N., Nagai,S., Yamazaki,N., Dong,H.Y., Sakai,J., Yamashita,T., Nukiwa,T. and Matsushima,K. (2000) Comprehensive gene expression profile of LPS-stimulated human monocytes by SAGE. Blood, 96, 2584–2591.

    Chang,H.Y., Sneddon,J.B., Alizadeh,A.A., Sood,R., West,R.B., Montgomery,K., Chi,J.T., Rijn Mv,M., Botstein,D. and Brown,P.O. (2004) Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS Biol., 2, E7.

    McManus,I.C. (1983) Bimodality of blood pressure levels. Stat. Med., 2, 253–258.

    McLachlan,G.J. and Gordon,R.D. (1989) Mixture models for partially unclassified data: a case study of renal venous renin in hypertension. Stat. Med., 8, 1291–1300.

    McLachlan,G.J., Bean,R.W. and Peel,D. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.

    Shoukri,M.M. and McLachlan,G.J. (1994) Parametric estimation in a genetic mixture model with application to nuclear family data. Biometrics, 50, 128–139.

    Dempster,A., Laird,N. and Rubin,D. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc., B39, 1–38.

    Redner,R. and Walker,H. (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev., 26, 195–202.

    Moon,T.K. (1996) The expectation-maximization algorithm. IEEE Signal Process. Mag., 13, 47–60.

    Martinez,W.L. and Martinez,A.R. (2001) Computational Statistics Handbook with MATLAB. CRC Press, Boca Raton.

    Fielden,M.R., Halgren,R.G., Dere,E. and Zacharewski,T.R. (2002) GP3: GenePix post-processing program for automated analysis of raw microarray data. Bioinformatics, 18, 771–773.

    Murayama,T., Ohara,Y., Obuchi,M., Khabar,K.S., Higashi,H., Mukaida,N. and Matsushima,K. (1997) Human cytomegalovirus induces interleukin-8 production by a human monocytic cell line, THP-1, through acting concurrently on AP-1- and NF-kappaB-binding sites of the interleukin-8 gene. J. Virol., 71, 5692–5695.

    Al-Humidan,A., Edwards,C.K., Al-Sofi,A., Dzimiri,M., Al-Sedairy,S.T. and Khabar,K.S. (1998) A carbocyclic nucleoside analogue is a TNF-alpha inhibitor with immunosuppressive action: role of prostaglandin E2 and protein kinase C and comparison with pentoxifylline. Cell. Immunol., 188, 12–18.

    Kastelic,T., Schnyder,J., Leutwiler,A., Traber,R., Streit,B., Niggli,H., MacKenzie,A. and Cheneval,D. (1996) Induction of rapid IL-1 beta mRNA degradation in THP-1 cells mediated through the AU-rich region in the 3'UTR by a radicicol analogue. Cytokine, 8, 751–761.

    Cohen,J. (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Measurement, 20, 37–46.

    Landis,J. and Koch,G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.

    Ridolfi,A. and Idier,J. (2002) Penalized Maximum Likelihood Estimation for Normal Mixture Distributions. Technical Report 200285, École Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Information Sciences, Lausanne, Switzerland.

    Brody,J.P., Williams,B.A., Wold,B.J. and Quake,S.R. (2002) Significance and statistical errors in the analysis of DNA microarray data. Proc. Natl Acad. Sci. USA, 99, 12975–12978.

    Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121.

    Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.

    Bilban,M., Buehler,L., Head,S., Desoye,G. and Quaranta,V. (2002) Defining signal thresholds in DNA microarrays: exemplary application for invasive cancer. BMC Genomics, 3, 19.

    Tran,P.H., Peiffer,D.A., Shin,Y., Meek,L.M., Brody,J.P. and Cho,K.W. (2002) Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res., 30, e54.