Data Mining -- An Optimization Problem

Major Steps in Data Mining Applications

Collection of Historical Data
Data are collected from a process in a system, including operating data records, in situ sensor reports, raw materials and composition, design and/or operating process parameters, and so on.

Separability Test
This test investigates whether data points from different populations or clusters can be separated in a hyperspace. If the data are separable, it is possible to build a mathematical model that describes the system. Otherwise, a good model cannot be built from the given data, and more data or further data processing is needed.
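
One simple way to probe separability for two populations is a perceptron run, which stops making mistakes only when a separating hyperplane exists. The sketch below assumes two classes and a linear boundary; it is an illustration, not the test MasterMiner uses, and the function name and epoch limit are hypothetical. Failing to converge within the limit suggests, but does not prove, that the data are not linearly separable.

```python
def perceptron_separable(points, labels, max_epochs=1000):
    """Try to find a separating hyperplane with the perceptron rule.

    points: list of equal-length lists of floats
    labels: list of +1 / -1 class labels
    Returns (found, weights); the last weight is the bias term.
    """
    dim = len(points[0])
    w = [0.0] * (dim + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            # zip stops after dim entries, so the bias is added separately
            activation = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * activation <= 0:  # misclassified or on the boundary
                for i in range(dim):
                    w[i] += y * x[i]
                w[-1] += y
                mistakes += 1
        if mistakes == 0:
            return True, w  # a full pass with no errors: separable
    return False, w  # inconclusive: no separator found within the limit
```

A `False` result is the signal described above that more data or further preprocessing may be needed before modeling.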

Data Preprocessing
Data pretreatment or conditioning is used to find the local views of a process, or to delete outliers (noise) from the original data set, using various methods from mathematical statistics and robust estimation theory, such as pattern recognition, the least median of squares (LMS) method, the least trimmed squares (LTS) method, the reweighted least squares (RLS) method, outlier diagnosis, single-case diagnosis, the hat matrix, genetic algorithms, and robust regression.
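
As an illustration of one of the methods listed above, the sketch below fits a straight line by least median of squares and then flags outliers against a robust scale estimate. It approximates the LMS fit by trying the line through every pair of points, which is practical only for small data sets. The function names, the single-factor setting, and the cutoff of 2.5 are assumptions made for this example, not details of MasterMiner.

```python
from itertools import combinations
from statistics import median


def lms_line(xs, ys):
    """Approximate LMS fit y = intercept + slope * x over all point pairs."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical pair defines no slope
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        # LMS criterion: the median of the squared residuals
        crit = median([(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)])
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    crit, intercept, slope = best
    return intercept, slope, crit


def flag_outliers(xs, ys, intercept, slope, crit, cutoff=2.5):
    """Return indices of points whose residual exceeds cutoff * robust scale."""
    # 1.4826 rescales a median-based spread to a standard deviation
    # under Gaussian noise (an assumption of this sketch)
    scale = 1.4826 * crit ** 0.5
    return [
        i
        for i, (x, y) in enumerate(zip(xs, ys))
        if abs(y - (intercept + slope * x)) > cutoff * scale
    ]
```

Because the criterion is a median rather than a sum, up to about half of the points can be grossly wrong without pulling the fitted line toward them, which is why the flagged points can then be deleted with some confidence.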

Factor Selection - not a textbook approach:
Analysis is performed in the M-dimensional factor (or feature) space. Any system model must be built on selected factors that represent the operating rules of the system, so selecting an appropriate set of factors is very important in data mining. An inappropriate selection may add unnecessary complexity and introduce noise into the data. There are two general approaches to factor selection. The first is to use first-principles methods: study the physical, chemical, or electrical properties of the system, then perform experiments and take measurements to identify the major factors. The second is to use pattern recognition and optimization techniques to find the relationships among the factors from historical data. The second approach requires no change to the system setup, no experiments, and no interruption to production.

Strictly controlled factors - many important factors are already under close control and vary very little in the production process, so they should not be treated as important factors in the analysis. Pay more attention to the factors that do vary.
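
A rough way to screen out strictly controlled factors is to drop any factor whose spread is tiny relative to its mean. The sketch below is only an illustration of that idea: the function name, the 1% threshold, and the factor names in the usage note are hypothetical.

```python
from statistics import mean, pstdev


def drop_controlled_factors(records, min_rel_spread=0.01):
    """Keep only the factors that vary noticeably in the recorded data.

    records: dict mapping factor name -> list of recorded values
    A factor is kept when its standard deviation is at least
    min_rel_spread times the magnitude of its mean.
    """
    kept = []
    for name, values in records.items():
        # guard against a zero mean so the comparison stays meaningful
        reference = max(abs(mean(values)), 1e-12)
        if pstdev(values) >= min_rel_spread * reference:
            kept.append(name)
    return kept
```

For example, a hypothetical `bath_temperature` recorded as 960.0, 960.1, 959.9, 960.0 would be dropped, while a `feed_rate` recorded as 1.0, 1.4, 0.7, 1.2 would be kept.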

Common vs. specific factors - common factors taken from textbooks or common-sense knowledge may not apply to a specific process in a specific case. Attention should be directed to finding the factors that are specific to the process under study. These specific factors are more important than the common factors.

Evolutionary factors - even with the same process in the same plant, the priority of the identified factors may change. While one factor may be the deciding factor in solving the bottleneck problem, it may be less important than some other factors in solving the product quality problem of the same plant.

Minimum set of effective factors - this is the smallest set of factors that can be used to represent the system under study. An interactive and iterative technique has been developed in MasterMiner to identify this factor set, and it has proved highly effective and efficient in many industrial applications.
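
The MasterMiner technique itself is not described here. As a stand-in, the sketch below uses plain greedy backward elimination: it removes one factor at a time for as long as a separability score does not drop. The score (leave-one-out nearest-neighbor accuracy), the function names, and the tolerance parameter are all assumptions chosen for this illustration.

```python
def loo_nn_accuracy(points, labels, factors):
    """Leave-one-out nearest-neighbor accuracy using only the given factors."""
    correct = 0
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((p[f] - q[f]) ** 2 for f in factors)
            if d < best_d:
                best_d, best_j = d, j
        correct += labels[best_j] == labels[i]
    return correct / len(points)


def minimal_factor_set(points, labels, tolerance=0.0):
    """Greedily drop factors while the score stays within tolerance of the start."""
    factors = list(range(len(points[0])))
    baseline = loo_nn_accuracy(points, labels, factors)
    removed_one = True
    while removed_one and len(factors) > 1:
        removed_one = False
        for f in factors:
            trial = [g for g in factors if g != f]
            if loo_nn_accuracy(points, labels, trial) >= baseline - tolerance:
                factors = trial  # this factor is not needed; restart the scan
                removed_one = True
                break
    return factors, loo_nn_accuracy(points, labels, factors)
```

A greedy search like this can miss the true minimum set, which is one reason an interactive step with expert review is useful in practice.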

Factor Multiplicity - a concept borrowed from molecular chemistry that describes the multiplicity in the phase change of a substance. In an optimization problem, Y = f(X1, X2, X3, ..., Xi, ..., Xn), the factors X1, X2, X3, ..., Xi, ..., Xn and the target Y are interchangeable. For instance, Xi = g(X1, X2, X3, ..., Y, ..., Xn) describes another optimization problem for the same system. This means that the factors are not fixed in a problem, and the challenge is to identify the best factors for one specific problem using an efficient and effective method in addition to expert knowledge and experience.
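
The interchange of target and factor is easiest to see with a linear model. The sketch below assumes a hypothetical two-factor model Y = a0 + a1*X1 + a2*X2 and rearranges it to give X1 as a function of Y and X2; the function names and coefficient values are illustrative only.

```python
def predict_y(x1, x2, coef):
    """Forward problem: Y = f(X1, X2) for an assumed linear model."""
    a0, a1, a2 = coef
    return a0 + a1 * x1 + a2 * x2


def solve_for_x1(y, x2, coef):
    """Interchanged problem: X1 = g(Y, X2) for the same model."""
    a0, a1, a2 = coef
    if a1 == 0:
        raise ValueError("X1 has no effect on Y, so it cannot be recovered")
    return (y - a0 - a2 * x2) / a1


coef = (1.0, 2.0, -0.5)          # hypothetical coefficients a0, a1, a2
y = predict_y(3.0, 4.0, coef)    # 1 + 6 - 2 = 5.0
x1 = solve_for_x1(y, 4.0, coef)  # recovers 3.0
```

For a nonlinear f the rearrangement may have several solutions or none, so the choice of which quantity to treat as the target is part of the problem setup.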

Pattern Recognition
Data patterns can be recognized in the M-dimensional hyperspace spanned by the M factors. Two types of patterns have so far been identified in theory and verified in practice. They are called the "two-sided" pattern and the "inclusive" pattern, as shown below.

[Figure: the "two-sided" and "inclusive" patterns]

Model Building
The mathematical models built by MasterMiner include (1) a set of inequalities that cover a sub-space in the feature space and (2) a set of two linear equations along the two principal directions, as shown in the picture below:

[Figure: the inequalities covering the sub-space and the two linear equations along the principal directions]
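
The simplest set of inequalities that covers a sub-space is one lower and one upper bound per factor. The sketch below derives such bounds from the samples of one cluster and tests whether a new operating point satisfies them. This axis-aligned form is an assumption made for illustration; the inequalities produced by MasterMiner may take a different form, and the function names are hypothetical.

```python
def fit_box(points):
    """Return per-factor (lower, upper) bounds that enclose all the points."""
    dims = range(len(points[0]))
    lower = [min(p[d] for p in points) for d in dims]
    upper = [max(p[d] for p in points) for d in dims]
    return lower, upper


def in_box(x, lower, upper):
    """True if x satisfies every inequality lower[d] <= x[d] <= upper[d]."""
    return all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))
```

With a cluster of good operating records as `points`, `in_box` then acts as a quick check of whether the current operating point lies inside the modeled sub-space.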

Once MasterMiner has calculated the inequalities for the optimal sub-space, a back-propagation neural network module can be used to estimate the model parameters accurately from the clean data of a single cluster in that sub-space. The resulting parameter estimates are superior to those calculated from the mixed data of two or more clusters. A genetic algorithm built into MasterMiner fine-tunes the neurons and weights of the neural network model.

Prediction and Control
With the mathematical model obtained for the process under study, the model can now be used for several purposes:

  • use the model equations as criteria for optimal control
  • extrapolate in the direction given by the model for maximum return (economic benefit)
  • diagnose a system's fault (state) using the current operating data and process parameters
  • propose optimal conditions for designing new materials or processes
  • display cross-section maps that show the optimal and failure zones for on-line control (below)

[Figure: cross-sectional maps of the optimal zone]

The above picture shows the cross-sectional maps of the 4-dimensional optimal zone (green) for the yield management problem in an aluminum production plant.
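
A cross-section map of this kind can be produced by holding all but two factors fixed and testing a grid of values for the remaining two against the zone inequalities. The sketch below draws such a map as text; the function name, the grid size, and the idea of passing the zone as a yes/no test function are assumptions made for this illustration.

```python
def cross_section_map(in_zone, base_point, axis_a, axis_b, range_a, range_b, steps=5):
    """Return text rows marking where a 2-D slice falls inside the zone.

    in_zone: function taking a list of factor values and returning True/False
    base_point: values for all factors; entries at axis_a and axis_b are overwritten
    range_a, range_b: (low, high) limits for the two varied factors
    steps: grid points per axis (must be at least 2)
    '#' marks a grid point inside the zone, '.' a point outside it.
    """
    rows = []
    for r in range(steps):
        # the top row corresponds to the largest value on axis_b
        b = range_b[1] - r * (range_b[1] - range_b[0]) / (steps - 1)
        row = ""
        for c in range(steps):
            a = range_a[0] + c * (range_a[1] - range_a[0]) / (steps - 1)
            point = list(base_point)
            point[axis_a] = a
            point[axis_b] = b
            row += "#" if in_zone(point) else "."
        rows.append(row)
    return rows
```

Repeating the call for different fixed values of the other factors gives a series of maps, one per slice, similar in spirit to the panels in the picture.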

Copyright © 1997 - 2000, ZAPTRON Systems, Inc.