OPTIMUM NUMBERS OF SINGLE NETWORK FOR COMBINATION IN MULTIPLE NEURAL NETWORKS MODELING APPROACH FOR MODELING NONLINEAR SYSTEM

This paper focuses on finding the optimum number of single networks in a multiple neural network combination, in order to improve the robustness of neural network models for nonlinear process modeling and control. To improve the generalization capability of single neural network based models, combining multiple neural networks is proposed. By studying the optimum number of networks that can be combined, the researcher can estimate the complexity of the proposed model and then obtain the number of networks needed for the combination. A simple averaging combination approach is implemented and applied to nonlinear process models. It is shown that the optimum number of networks for combination can be obtained, hence enhancing the performance of the proposed model.


INTRODUCTION
Artificial neural networks have been shown to be able to approximate any continuous non-linear function and have been used to build data-based empirical models for non-linear processes [1]. What, then, is a neural network? According to Haykin [2]: "A neural network is a massive parallel-distributed processor that has a natural capability for storing experiential knowledge and making it available for use. It resembles the brain in two respects: knowledge is acquired by the networks through a learning process. Interneuron connection strengths known as synaptic weights are used to store the knowledge." The main advantage of neural network based process models is that they are easy to build. This feature is particularly useful when modeling complicated processes for which detailed mechanistic models are difficult to develop. However, a critical shortcoming of neural networks is that they often lack robustness unless a proper network training and validation procedure is used. Robustness is one of the baselines for judging the performance of neural network models, and it is closely related to how the networks are trained. As Bishop [3] described: "The importance of neural networks in this context is that they offer very powerful and very general framework for representing non-linear mappings from several input variables to several output variables, where the form of the mapping is governed by a number of adjustable parameters." Many factors have contributed to the success of research on neural networks; two of the main ones are as follows. The first is that neural networks are very powerful modeling tools capable of modeling extremely complex functions [1,4,5]. In particular, neural networks are non-linear models, which makes them very useful for modeling nonlinear systems that cannot be successfully modeled by linear models.
The second main factor is that neural networks are easy to use and develop because they learn by example. The user gathers representative data and then invokes a training algorithm that automatically learns the structure of the data [6,7,8]. Because of these advantages, neural networks are now widely applied in industry and business, in areas such as signal processing, control, pattern recognition, medicine and speech processing.
To improve the robustness of neural networks, a number of techniques have been developed, such as regularization [9] and the early stopping method [10]. Ohbayashi [11] implemented the universal learning rule and second order derivatives to increase the robustness of neural network models. Robustness is enhanced by minimizing the change in the value of the criterion function caused by small changes around the nominal values of the system parameters. The lack of robustness in individual neural networks is basically due to overfitting of the models [12]. Researchers have therefore concentrated on how overfitting can be avoided, either by improving the learning algorithm or by combining neural networks.
Overfitting refers to the poor generalization of a network that results from fitting the noise in the data [13]. Furthermore, the trained network might not minimize the error on the training data set because it has uncontrolled excess dynamic capability or because the training data itself is corrupted with noise. The representation capability of a neural network is determined by its size (number of neurons). If a network is too large, it can find many solutions that fit the training data exactly but contain high-frequency dynamics not present in the underlying function. A second form of overfitting occurs when the data is corrupted with noise. Here the data itself contains high frequencies not present in the underlying function, so minimizing the error on the data set results in the network fitting the noise.
Neural networks are related to the basic principles of the brain [14] and try to mimic how the brain works. They have been developed since the 1940s, a period in which industrialization was growing rapidly. Neural networks are generally structured in layers, with the neurons in adjacent layers connected to one another. As mentioned by Willis et al. [15], more accurate representations of processes are required to ensure good process control performance, especially in advanced process control. Neural network models must therefore be robust, or stable, when they are applied to new (unseen) data. Even though neural network models are very powerful non-linear modeling tools, noise in the input data sometimes causes the model to overfit. Overfitting and underfitting are the main problems in developing neural network models.
In overfitting, the error on the training data set is driven to a very small value, but when the network is applied to unseen data the error is large and the generalization capability of the network is poor. Underfitting, in contrast, occurs when the neural network fails to capture the relationships within complex data [16].
Many techniques have therefore been introduced to improve the generalization capability of neural network models, such as regularization [12,13,17], Bayesian learning [8,18] and the use of parsimonious network structures [19]. Notable examples of the last approach are network pruning and sequential orthogonal training. A sequential orthogonal training technique gradually builds up a neural network model and avoids an unnecessarily large network structure [20]. Among the approaches for improving neural network generalization, the combination of multiple neural networks seems to be very effective. The combination of multiple neural networks is therefore proposed in this paper with the aim of enhancing the robustness of single neural networks.
The paper is organized as follows. Section 2 presents the concept of multiple neural networks and how they can be combined. Section 3 describes the modeling of a nonlinear process. Applications of the proposed technique to two case studies are given in Section 4. Finally, the last section concludes the paper.

MULTIPLE NEURAL NETWORK
The idea of multiple neural networks originated with Wolpert [21], who described stacked generalization, a technique for combining different representations to improve overall prediction performance. A multiple neural network can also be described as a network architecture consisting of several sub-models and a mechanism that combines the outputs of these sub-models [22]. There are several types of multiple neural networks, but the underlying ideas are basically similar and the main difference lies in how the sub-models are created, as shown in Fig. 1. Two major types of multiple neural networks are described here.
The first category is multiple model neural networks [23,24]. The individual networks are built from entirely different training data, which may use different inputs in different regions of operation. The idea of this approach is to capture different information by using different inputs; by combining this information a better prediction can be obtained [22,25]. The learning algorithm in each network can also be different and can be supervised or unsupervised. Another multi-model approach was introduced by Jacobs [26], using the 'mixture of local expert'. Jordan and Jacobs [27] then proposed the hierarchical mixture of neural networks, in which they discuss the supervised learning algorithm and how the divide-and-conquer method works. Multi-model applications are found in pattern recognition, where different models represent different image classifications [28,29]. A medical application is presented by Jerebko [30], where single neural network models for classifying polyps, each using different inputs, are combined and a better prediction rate is obtained. Multiple models have also been used in other medical fields, such as diagnosis and the detection of lung cancer [31,32], and in time series forecasting [24]. In the latter case, each model forecasts a different prediction horizon, which reduces the need for recursive prediction and hence the recursive error that accumulates in long-range prediction. That work also shows that the multi-network model performs better than single networks.
The second category creates multiple models from the same training data, re-sampled or partitioned using particular algorithms [33,34]. Three main algorithms are used to re-sample or partition the training data: bagging or bootstrap [34,35], adaboost [36,37] and randomization [38]. The motivation for creating these different inputs or partitions is to create effective network ensembles [39]. Bootstrap, or bagging, refers to replication of a training data set, where the bootstrap algorithm re-samples the original training data set. Some data samples may occur several times in a replicate, and others may not occur at all. The individual training sets are independent and the neural networks can be trained in parallel. In this paper the data were re-sampled using the bootstrap method to create the multiple neural networks.
Advances in computing capability have also promoted the development of multiple neural networks, whose applications are expected to grow rapidly and become an important component of future research. This is also due to the wide range of uses of neural networks, and combining networks is one of the methods that increases the performance of network models. The objective of this paper is therefore to study the optimum number of networks that can be combined. Single networks are added to the combination one at a time until the optimum number of networks is obtained, judged by the sum square error (SSE) of the combined prediction. The networks are combined using the simple averaging method described below:
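The bootstrap re-sampling step described above can be illustrated with a short sketch. This is only a minimal illustration in standard-library Python, not the authors' implementation; the function name `bootstrap_replicates`, the seed, and the sample data are hypothetical.

```python
import random


def bootstrap_replicates(data, n_replicates, seed=0):
    """Return n_replicates copies of data, each drawn with replacement.

    Every replicate has the same size as the original data set, so some
    samples may appear several times and others not at all.
    """
    rng = random.Random(seed)
    size = len(data)
    return [
        [data[rng.randrange(size)] for _ in range(size)]
        for _ in range(n_replicates)
    ]


# Hypothetical (input, output) training pairs, for illustration only.
training_pairs = [(0.1, 1.0), (0.2, 1.3), (0.3, 1.7), (0.4, 2.2), (0.5, 2.4)]
replicates = bootstrap_replicates(training_pairs, n_replicates=3)
for replicate in replicates:
    print(replicate)
```

Each replicate would then serve as the training set of one individual network, and because the replicates are drawn independently the networks could be trained in parallel.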

Simple Averaging
This is the most common method for combining several model outputs, with the combination weights fixed as shown below:

$$\hat{Y} = \sum_{i=1}^{n} w_i \hat{y}_i$$

where $\hat{y}_i$ is the prediction from the ith network, n is the number of networks to be combined, $\hat{Y}$ is the final prediction output, and $w_i = 1/n$ is the weight for combining the ith network. The main disadvantage of this approach is that all the networks contribute equally to the final prediction output even though some networks might give better predictions than others; consequently the poorer networks might degrade the combined model.
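As a minimal sketch of the equal-weight combination described above (weights of 1/n for each of the n networks), the following Python snippet averages the predictions of several networks sample by sample. The function name `simple_average` and the numbers are hypothetical and serve only as an illustration.

```python
def simple_average(predictions):
    """Combine per-network predictions with equal weights w_i = 1/n.

    predictions: list of n lists, one per network, each holding one
    predicted value per data sample.
    """
    n = len(predictions)
    return [sum(sample) / n for sample in zip(*predictions)]


# Hypothetical outputs of three networks on three samples.
network_outputs = [
    [1.0, 2.0, 3.0],
    [1.2, 1.8, 3.3],
    [0.8, 2.2, 2.7],
]
combined = simple_average(network_outputs)
print(combined)
```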

NONLINEAR PROCESS MODELING
In this case study, the individual networks were trained by the Levenberg-Marquardt optimization algorithm with regularization and "early stopping". The number of single networks aggregated was varied from 2 to 50. If the number of networks is too small, the optimum reduction of the SSE in the combination might not be achieved. To accommodate systems with lag elements, the training and testing data were already arranged as discrete-time input-output samples before bootstrap re-sampling was applied; re-sampling these samples therefore does not affect the input-output mapping used for prediction.
All weights and biases were randomly initialized in the range from -0.1 to 0.1. The individual networks are single hidden layer feedforward neural networks. Hidden neurons use the sigmoid activation function, whereas output layer neurons use the linear activation function. To cope with different magnitudes in the input and output data, all the data were scaled to zero mean and unit standard deviation. The data for neural network model building were divided into: 1) Training data (for network training); 2) Testing data (for cross-validation based network structure selection and early stopping); and 3) Unseen validation data (for evaluation of the final selected model). In networks with fixed structure, the network structure (number of hidden neurons) was determined through cross-validation. Networks with different numbers of hidden neurons were trained on the training data and tested on the testing data, and the network with the lowest SSE on the testing data was selected. In assessing the developed models, the SSE on the unseen validation data is used as the performance criterion. Figure 2 shows the model of the conic water tank level apparatus. There is an inlet stream to the tank and an outlet stream from the tank. Manipulating the inlet water flow rate regulates the water tank level.
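The procedure of adding networks one at a time and tracking the SSE of the averaged prediction can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, and the "network predictions" are stand-in noisy copies of a made-up target rather than outputs of trained networks.

```python
import random


def sse(y_true, y_pred):
    """Sum of squared errors between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def sse_by_ensemble_size(y_true, predictions):
    """SSE of the simple-averaged prediction as networks are added one by one.

    Returns a dict mapping n to the SSE obtained with the first n networks.
    """
    running_sum = [0.0] * len(y_true)
    results = {}
    for n, prediction in enumerate(predictions, start=1):
        running_sum = [s + p for s, p in zip(running_sum, prediction)]
        averaged = [s / n for s in running_sum]
        results[n] = sse(y_true, averaged)
    return results


# Stand-in data (assumption): 10 "networks" whose predictions are the
# target plus independent Gaussian noise.
rng = random.Random(1)
y_true = [0.5 * k for k in range(20)]
predictions = [[y + rng.gauss(0.0, 0.5) for y in y_true] for _ in range(10)]

curve = sse_by_ensemble_size(y_true, predictions)
best_n = min(curve, key=curve.get)
print(best_n, curve[best_n])
```

In practice the curve would be evaluated on data not used for training, and the number of networks at which the SSE stops improving would be taken as the optimum.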
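Scaling to zero mean and unit standard deviation can be sketched as below. The paper does not state which data the scaling statistics are computed from or whether the sample or population standard deviation is used; this sketch assumes the statistics come from the training data and uses the sample standard deviation. All names and values are hypothetical.

```python
import statistics


def fit_scaler(values):
    """Return the mean and sample standard deviation of the values."""
    return statistics.mean(values), statistics.stdev(values)


def scale(values, mean, std):
    """Scale values to zero mean and unit standard deviation."""
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    """Map scaled values back to their original units."""
    return [v * std + mean for v in values]


# Hypothetical tank levels in cm, for illustration only.
train_levels = [12.0, 15.5, 9.8, 20.1, 17.3]
level_mean, level_std = fit_scaler(train_levels)
scaled_levels = scale(train_levels, level_mean, level_std)
restored_levels = unscale(scaled_levels, level_mean, level_std)
print(scaled_levels)
```

The same mean and standard deviation would be reused to scale the testing and validation data and to convert predictions back to centimetres.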

Case Study: Water Tank Level Prediction
A material balance on the water in the tank gives:

$$\frac{dV}{dt} = Q_i - Q_o$$

where V is the volume of water in the tank (cm^3), $Q_i$ is the inlet water flow rate (cm^3/s) and $Q_o$ is the outlet water flow rate (cm^3/s). The outlet water flow rate, $Q_o$, is related to the tank level, h, by the following equation:

$$Q_o = k\sqrt{h}$$

where k is a constant for a fixed valve opening. The volume of water in the tank is related to the tank level by the following equation:

$$V = \pi h\left[r^2 + \frac{hr}{\tan\theta} + \frac{h^2}{3\tan^2\theta}\right]$$

where r is the tank bottom radius and θ is the angle between the tank boundary and the horizontal plane.
Combining the equations above, the following dynamic model for the tank level is obtained:

$$\frac{dh}{dt} = \frac{Q_i - k\sqrt{h}}{\pi\left[r^2 + \dfrac{2hr}{\tan\theta} + \dfrac{h^2}{\tan^2\theta}\right]}$$

Based on this model, a simulation file was developed to simulate the process. The parameters used in the simulation are r = 10 cm, k = 34.77 cm^2.5/s and θ = 60°. The sampling time is 10 seconds. The above equation indicates that the relationship between the inlet water flow rate and the water level in the tank is quite non-linear. The outlet valve characteristic shows that the static gain increases with tank level. The time constant of the process also increases with the tank level because the tank is conical. Thus, both the static and dynamic characteristics of the process vary with the operating condition. All the model building data were generated from the simulation program, and noise with the distribution N(0, 0.7 cm) was added to the simulated tank level. The data were divided into three sets: training data, testing data and validation data.
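A simulation of this kind can be sketched in a few lines of Python. The sketch rests on several assumptions that are not confirmed by the source: the tank widens upward, so the cross-sectional area at level h is π(r + h/tan θ)^2; the outlet flow is k√h; the level is integrated with a simple Euler scheme using ten sub-steps per 10 s sample; θ = 60 is in degrees; and 0.7 cm is taken as the noise standard deviation. The inlet flow sequence and the initial level are hypothetical.

```python
import math
import random

R_BOTTOM = 10.0               # tank bottom radius r in cm (from the text)
K_VALVE = 34.77               # valve constant k in cm^2.5/s (from the text)
THETA = math.radians(60.0)    # wall angle, assumed to be given in degrees
SAMPLE_TIME = 10.0            # sampling time in seconds (from the text)


def level_derivative(h, q_in):
    """dh/dt for the assumed conical tank model."""
    h = max(h, 0.0)
    radius = R_BOTTOM + h / math.tan(THETA)
    area = math.pi * radius ** 2
    return (q_in - K_VALVE * math.sqrt(h)) / area


def simulate_tank(h0, inlet_flows, substeps=10):
    """Return the tank level at each sampling instant (Euler integration)."""
    dt = SAMPLE_TIME / substeps
    h = h0
    levels = []
    for q_in in inlet_flows:
        for _ in range(substeps):
            h = max(h + dt * level_derivative(h, q_in), 0.0)
        levels.append(h)
    return levels


# Hypothetical constant inlet flow of 150 cm^3/s for 500 samples.
inlet_flows = [150.0] * 500
levels = simulate_tank(h0=10.0, inlet_flows=inlet_flows)

# Measurement noise, assuming 0.7 cm is the standard deviation.
rng = random.Random(0)
noisy_levels = [h + rng.gauss(0.0, 0.7) for h in levels]
print(levels[-1], (150.0 / K_VALVE) ** 2)
```

With a constant inlet flow the level settles where the outlet flow equals the inlet flow, that is, at h = (Q_i/k)^2, which gives a quick sanity check on the simulation. Data for model building would use a varying inlet flow sequence rather than a constant one.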

RESULTS AND DISCUSSION
In this case study, the input data to the neural network system are divided into three sets (training, testing and validation). All the data must be scaled to zero mean and unit standard deviation before analysis. This is important because different input data have different magnitudes or units. In the dynamic model for tank level prediction, $\hat{h}$ denotes the predicted tank level. The optimum number of hidden neurons has to be determined before the networks are used for training, testing and validation. The number of hidden neurons that gives the lowest SSE for the tank level model is then used in the subsequent multiple neural network analysis. Table 1 shows the SSE on the training and testing data for each trial, while Fig. 3 plots the results.
From Table 1 and Fig. 3, the minimum SSE, 3.2668, is obtained with 3 hidden nodes. Hence, 3 nodes are used for the tank level prediction analysis. With the number of nodes determined, the m-file for tank level prediction is run to obtain the sum square error (SSE) and the mean sum square error (mean SSE) of the multiple neural network system. These two quantities are calculated differently. For the SSE, the predicted output values, $y_{p,i}$, of the individual networks are combined to obtain the mean predicted output value, $\bar{y}_p$:

$$\bar{y}_p = \frac{1}{n}\sum_{i=1}^{n} y_{p,i}$$

where n is the number of networks. The combined output value $\bar{y}_p$ is then compared with the true output value, y, to obtain the SSE:

$$\mathrm{SSE} = \sum \left(y - \bar{y}_p\right)^2$$

For the mean SSE, the predicted output value, $y_p$, of each network is compared with the true output value, y. That is, an SSE is obtained for each network in the multiple network system, and the mean of these SSE values is reported as the mean SSE. The results of SSE and mean SSE for tank level prediction are tabulated in Tables 2 and 3, while Fig. 4 and Fig. 5 show plots of the results.

From Fig. 5, the mean SSE also decreases as the number of networks increases. The minimum mean SSE on the validation data is achieved at 35 networks, and the mean SSE on the validation data increases slightly when more than 35 networks are used. The overall results show that multiple neural networks perform well, with a lower SSE than a single network, and that the optimum number of networks is 35. For further analysis, the residual errors on the validation data for the combination of 35 networks are plotted in Fig. 6. The residual error is the difference between the true output value, y, and the predicted output value, $y_p$:

$$\text{Residual error} = y - y_p$$
The true and predicted output values are rescaled to their original units (level, cm) before the residual error is calculated. In Fig. 6, the residuals appear randomly scattered around zero, indicating that the model describes the data well. The errors range from -2.5 to 2.5 cm. A total of 199 validation data points are compared with the predicted output values. To show how well the multiple neural networks predict the output, the predicted output values can be plotted against the true output values, as in Fig. 7.
The plot shows that the predicted output values are close to the true output values. The calculated correlation coefficient value $R^2$ between the true and predicted outputs in Fig. 7 is 0.9893 for the multiple networks. The correlation coefficient is a normalized measure of the strength of the linear relationship between two variables, and a perfect fit gives a value of 1. The result for the tank level prediction case study is therefore close to a perfect fit, indicating that the multiple networks predict the tank level well.
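The difference between the SSE of the combined prediction and the mean SSE of the individual networks, together with the residual error, can be made concrete with a small sketch. The function names and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def sse(y_true, y_pred):
    """Sum of squared errors between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def combined_sse(y_true, predictions):
    """SSE of the averaged prediction of all networks."""
    n = len(predictions)
    averaged = [sum(sample) / n for sample in zip(*predictions)]
    return sse(y_true, averaged)


def mean_individual_sse(y_true, predictions):
    """Mean of the SSE values obtained by each network on its own."""
    return sum(sse(y_true, p) for p in predictions) / len(predictions)


# Hypothetical true outputs and predictions of two networks.
y_true = [1.0, 2.0, 3.0]
predictions = [
    [1.5, 2.5, 3.5],   # network 1 predicts 0.5 too high
    [0.5, 1.5, 2.5],   # network 2 predicts 0.5 too low
]

n_networks = len(predictions)
averaged = [sum(sample) / n_networks for sample in zip(*predictions)]
residuals = [t - p for t, p in zip(y_true, averaged)]

print(combined_sse(y_true, predictions))         # errors cancel in the average
print(mean_individual_sse(y_true, predictions))  # errors do not cancel
print(residuals)
```

Because squaring is convex, the SSE of the averaged prediction can never exceed the mean of the individual SSE values, which is why averaging helps when the individual networks err in different directions.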
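A correlation measure of this kind can be computed as in the sketch below. The source does not say exactly how its value was calculated, so this is an assumption: the sketch computes the Pearson correlation coefficient r between true and predicted values and also reports its square. The data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical true and predicted tank levels in cm.
true_levels = [10.0, 12.0, 15.0, 19.0, 24.0]
predicted_levels = [10.4, 11.7, 15.3, 18.6, 24.2]

r = pearson_r(true_levels, predicted_levels)
print(r, r ** 2)
```

A value close to 1 indicates a nearly linear relationship between predicted and true values, but it does not by itself rule out a constant offset or scale error, so it is best read alongside the residual plot.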

CONCLUSION
As mentioned earlier, an accurate neural network representation of the process is crucial because it contributes to improved model performance, especially when the model is applied to unseen data. Major problems in neural network modeling, such as overfitting and underfitting, can then be avoided and the generalization capability of the model can be enhanced. One method that has been used toward this goal is to study the optimum number of neural networks that should be combined.
By implementing this method, the researcher can estimate the complexity of the model and then obtain the best number of neural networks to combine. The overall results of this research suggest 35 networks, at which point the SSE values reach an almost steady condition. The combination of 35 networks has also been shown to be reliable, since the predicted output values are close to the true output values.