
WINE CLASSIFICATION USING NEURAL NETWORKS

An example of a multivariate data type classification problem using Neuroph framework

by Milica Stojković, Faculty of Organizational Sciences, University of Belgrade

An experiment for the Intelligent Systems course

Introduction
Neural networks can solve some really interesting problems once they are trained. They are very good at pattern recognition problems, and with enough elements (called neurons) they can classify any data with arbitrary accuracy. They are particularly well suited for complex decision boundary problems over many variables. Therefore we have chosen neural networks and Neuroph Studio as good candidates for solving the classification problem presented below.
Introduction to the problem
In this demo we will try to build a neural network that can classify wines from three wineries by thirteen attributes:
  1. Alcohol
  2. Malic Acid
  3. Ash
  4. Ash Alcalinity
  5. Magnesium
  6. Total Phenols
  7. Flavanoids
  8. Nonflavanoid Phenols
  9. Proanthocyanins
  10. Color Intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline
This is an example of a pattern recognition problem, where inputs are associated with different classes, and we would like to create a neural network that not only classifies the known wines properly, but can generalize to accurately classify wines that were not used to design the solution. The thirteen attributes will act as inputs to the neural network, and the respective target for each instance will be a 3-element row vector with a 1 in the position of the associated winery (#1, #2 or #3).
The data set consists of 178 instances, each described by the 13 attributes given above. The dataset can be found in the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Wine
Procedure of training a neural network
In order to train a neural network, there are six steps to follow:
  1. Normalize the data
  2. Create a Neuroph project
  3. Create a training set
  4. Create a neural network
  5. Train the network
  6. Test the network to make sure that it is trained properly
Step 1. Data Normalization
In order to train the neural network, this data set has to be normalized. Normalization means that all values in the data set are rescaled to the range from 0 to 1.
For that purpose we use the following min-max formula:

Xn = (X - Xmin) / (Xmax - Xmin)

Where:

X – the value to be normalized
Xn – the normalized value
Xmin – the minimum value of X
Xmax – the maximum value of X

The class attribute (winery A, B or C) is the output. Each class label is replaced with a 3-element vector that has a 1 in the position of the associated winery: A - 1 0 0, B - 0 1 0, C - 0 0 1.
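If you prefer to do this step programmatically rather than in a spreadsheet, a minimal sketch of both transformations in plain Java might look like this (the helper names are our own, not part of Neuroph):

public class WinePreprocessor {

    // Min-max normalization of one attribute column to the [0, 1] range:
    // Xn = (X - Xmin) / (Xmax - Xmin)
    static double[] normalizeColumn(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] normalized = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            normalized[i] = (column[i] - min) / (max - min);
        }
        return normalized;
    }

    // Replace a winery label (A, B or C) with its 3-element target vector.
    static double[] oneHot(char winery) {
        switch (winery) {
            case 'A': return new double[] {1, 0, 0};
            case 'B': return new double[] {0, 1, 0};
            case 'C': return new double[] {0, 0, 1};
            default:  throw new IllegalArgumentException("Unknown class: " + winery);
        }
    }
}

Each of the 13 attribute columns is normalized independently, and the single class column is expanded into the 3 output columns.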
Step 2. Creating a new Neuroph project

The next step is to create a new project. So, we should do the following:

Click File -> New Project, then choose Neuroph Project and click the 'Next' button.


Define the project name and location, then click 'Finish'. The new project is created and appears in the Projects window on the left side of Neuroph Studio.



Step 3. Creating a Training Set
Next, we need to create a new training set that will be used to teach the network. Click New -> Training Set, then name it and set the parameters.

There are two types of training used in neural networks, supervised and unsupervised, of which supervised is the more common. In supervised learning, the network user assembles a set of training data containing examples of inputs together with the corresponding outputs, and the network learns to infer the relationship between the two. For an unsupervised learning rule, the training set consists of input patterns only. Our normalized data set, created above, contains both input and output values, which is why we choose supervised learning. Then we set the number of inputs to 13, because our data set has 13 input attributes, and the number of outputs to 3, because there are three different classes (outcomes).

After clicking 'Next' we need to insert data into the training set table. We will load all data directly from a file: click 'Choose File' and select the file containing the normalized data set. Values in that file are separated by a comma (,).

Then we click 'Load' and all data is loaded into the table. We can see that this table has 16 columns: the first 13 represent inputs and the last 3 represent outputs from our data set.

After clicking 'Finish', the new training set appears in our project. The next step is to create a neural network that will learn to classify the wines.
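The same training set can also be built in code. A minimal sketch using the Neuroph API (assuming a recent Neuroph release where DataSet.createFromFile is available; the file names are ours):

import org.neuroph.core.data.DataSet;

public class LoadWineData {
    public static void main(String[] args) {
        // 13 input columns, 3 output columns, comma-separated:
        // the same parameters we set in the wizard.
        DataSet trainingSet = DataSet.createFromFile("wine_normalized.txt", 13, 3, ",");
        System.out.println("Loaded " + trainingSet.getRows().size() + " rows");
        trainingSet.save("wine.tset"); // optionally store it for reuse
    }
}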

Standard training techniques

Standard approaches to validation of neural networks are mostly based on empirical evaluation through simulation and/or experimental testing. There are several methods for supervised training of neural networks; the backpropagation algorithm is the most commonly used training method for artificial neural networks.

Training attempt 1
Step 4.1 Creating a neural network

To create a new neural network, right-click the project and choose New -> Neural Network. Then we define the name and type of the neural network. Neuroph supports common neural network architectures such as Adaline, Perceptron, Multi Layer Perceptron, etc. We are going to analyze several architectures, all of them based on the Multi Layer Perceptron. This is the most widely studied and used neural network classifier. It is capable of modeling complex functions, it is robust (good at ignoring irrelevant inputs and noise), and it can adapt its weights and/or topology in response to environment changes. Another reason to use this type of network is simply that it is very easy to use: it implements a black-box point of view and can be used with little knowledge about the relationship of the function to be modeled.

Then we click 'Next', and the following window shows up (initially without the values filled in).

Problems that require more than one hidden layer are rarely encountered. For many practical problems there is no reason to use more than one hidden layer: one layer can approximate any function that contains a continuous mapping from one finite space to another. Deciding the number of hidden layers is only a small part of the problem; we must also determine how many neurons will be in each of these hidden layers. Both the number of hidden layers and the number of neurons in each of them must be carefully considered.

Now let's see why we used these values and choices. The number of input and output neurons is obviously determined by the problem. As for the hidden layers, there are two decisions to be made: first, how many hidden layers the neural network should have, and second, how many neurons will be in each of these layers. In common use, most neural networks have only one hidden layer, and it is very rare for a neural network to have more than two. For this first attempt we will use one hidden layer. Now we have to determine how many neurons will be in it. Using too few neurons will result in underfitting; using too many can result in overfitting. Obviously some compromise must be reached between too many and too few neurons. There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers; some of them are summarized as follows.

  1. The number of hidden neurons should be in the range between the size of the input layer and the size of the output layer.
  2. The number of hidden neurons should be the size of the input layer plus the size of the output layer, plus one.
  3. The number of hidden neurons should be less than twice the input layer size.

For the first training attempt we use the second rule, which is why the number of hidden neurons is 13 + 3 + 1 = 17. We also checked the 'Use Bias Neurons' option and chose 'Sigmoid' as the transfer function (because our data set is normalized, with values between 0 and 1). The bias neuron is very important: a backpropagation neural network without a bias neuron in the hidden layer does not learn well. The bias weights control the shape, orientation and steepness of the sigmoid functions across the data-mapping space. A bias input always has the value 1; without a bias, if all inputs are 0, the only possible output would be zero.


As the learning rule we choose Backpropagation with Momentum. This algorithm shows a much higher rate of convergence than plain Backpropagation: the momentum term is added to speed up the learning process and to improve the efficiency of the algorithm. This is what our architecture looks like:
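Everything configured in the wizard so far can also be done directly in code. A minimal sketch using the Neuroph API (a recent release is assumed; the file name is ours):

import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.nnet.learning.MomentumBackpropagation;
import org.neuroph.util.TransferFunctionType;

public class CreateWineNetwork {
    public static void main(String[] args) {
        // 13 inputs, one hidden layer with 17 neurons, 3 outputs,
        // sigmoid transfer function; this constructor also adds bias neurons.
        MultiLayerPerceptron network =
                new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 13, 17, 3);

        // Use backpropagation with momentum as the learning rule.
        network.setLearningRule(new MomentumBackpropagation());

        network.save("wine_mlp.nnet"); // store the network for later use
    }
}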
Step 5.1 Train the neural network

After we have created the training set and the neural network, we can train the network. We select the training set, click 'Train', and then set up the learning parameters. Before we click 'Train', let us see what the following parameters represent.

The learning rate determines how much each weight is adjusted in every update step. The selection of the Learning Rate is of critical importance in finding the true global minimum of the error. If it is too small, training will make really slow progress; if it is too large, training will proceed much faster, but may produce oscillations between relatively poor solutions.

The momentum can be helpful in speeding up convergence and avoiding local minima. The idea is to stabilize the weight changes by making non-radical revisions, combining the gradient-descent term with a fraction of the previous weight change.

When the Total Net Error value drops below the maximum error, the training is complete. The smaller we make this threshold, the better the approximation we get.

In this first attempt the default values will be used: maximum error 0.01, learning rate 0.2 and momentum 0.7.

Then we click on the 'Next' button and the training process starts.
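For reference, the equivalent setup in code might look roughly like this (a sketch against the Neuroph API; file and class names are ours):

import org.neuroph.core.data.DataSet;
import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.nnet.learning.MomentumBackpropagation;
import org.neuroph.util.TransferFunctionType;

public class TrainWineNetwork {
    public static void main(String[] args) {
        DataSet trainingSet = DataSet.createFromFile("wine_normalized.txt", 13, 3, ",");
        MultiLayerPerceptron network =
                new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 13, 17, 3);

        // The same learning parameters we set in the dialog above.
        MomentumBackpropagation rule = new MomentumBackpropagation();
        rule.setMaxError(0.01);
        rule.setLearningRate(0.2);
        rule.setMomentum(0.7);
        network.setLearningRule(rule);

        network.learn(trainingSet); // blocks until total net error < 0.01
        System.out.println("Stopped after " + rule.getCurrentIteration() + " iterations");
    }
}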

After 13 iterations the Total Net Error dropped below the specified level of 0.01, which means that the training process was successful and we can now test the neural network. Notice how fast the training was: after only a few iterations the error value dropped drastically, close to 0.01!

Step 6.1 Test the neural network

After the network is trained, we click 'Test' in order to see the total error and all the individual errors.

As we can see, the Total Mean Square Error is 0.0041984822461348685, which is a really good result, since our goal was an error below 0.01. The individual errors are pretty good too: most of them are at a low level, below 0.08, and there are no extreme cases where the errors are considerably larger.
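Neuroph Studio computes this error for us; to reproduce a comparable figure in code, a sketch might look like this (the exact averaging convention Neuroph Studio uses may differ slightly; file names are ours):

import org.neuroph.core.NeuralNetwork;
import org.neuroph.core.data.DataSet;
import org.neuroph.core.data.DataSetRow;

public class TestWineNetwork {
    public static void main(String[] args) {
        NeuralNetwork network = NeuralNetwork.createFromFile("wine_mlp.nnet");
        DataSet testSet = DataSet.createFromFile("wine_normalized.txt", 13, 3, ",");

        // Accumulate squared errors over every row of the test set.
        double sumSquaredError = 0;
        for (DataSetRow row : testSet.getRows()) {
            network.setInput(row.getInput());
            network.calculate();
            double[] output = network.getOutput();
            double[] desired = row.getDesiredOutput();
            for (int i = 0; i < output.length; i++) {
                double e = desired[i] - output[i];
                sumSquaredError += e * e;
            }
        }
        System.out.println("Mean square error: "
                + sumSquaredError / testSet.getRows().size());
    }
}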


At the end of this attempt, it is useful to randomly choose five instances and compare the desired outputs with the outputs the network produces for them.
Num Alcohol Malic_Acid Ash Ash_Alcalinity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280/OD315_of_diluted_wines Proline Class_A Class_B Class_C
1. 0.737 0.737 0.674 0.485 0.489 0.679 0.646 0.509 0.413 0.454 0.528 0.476 0.608 1 0 0
2. 0 0.152 0.449 0.562 0.163 0.51 0.386 0.736 0.505 0.053 1 0.586 0.092 0 1 0
3. 0.721 0.229 0.706 0.335 0.489 0.697 0.517 0.491 0.401 0.428 0.528 0.608 0.782 1 0 0
4. 0.666 0.192 0.508 0.289 0.511 0.748 0.622 0.396 0.609 0.414 0.382 0.773 0.369 1 0 0
5. 0.5 0.409 0.717 0.536 0.283 0.193 0.034 0.755 0.107 0.283 0.236 0.381 0.23 0 0 1

The outputs the neural network produced for these inputs are, respectively:

Num Class_A Class_B Class_C
1. 0.9977 0.001 0.0052
2. 0.0003 1 0
3. 0.9985 0.0002 0.0207
4. 0.9731 0.0215 0.0018
5. 0.0087 0.023 0.9117
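
A "guess" here simply means taking the class with the largest output value. A small sketch of how one instance from the table above could be classified in code (the saved network file is from the earlier sketch):

import org.neuroph.core.NeuralNetwork;

public class ClassifyOneWine {
    public static void main(String[] args) {
        NeuralNetwork network = NeuralNetwork.createFromFile("wine_mlp.nnet");

        // Row 1 from the table above: 13 normalized attribute values.
        network.setInput(0.737, 0.737, 0.674, 0.485, 0.489, 0.679, 0.646,
                         0.509, 0.413, 0.454, 0.528, 0.476, 0.608);
        network.calculate();
        double[] output = network.getOutput(); // roughly [0.9977, 0.001, 0.0052]

        // The winning class is the output neuron with the largest value.
        int winner = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[winner]) winner = i;
        }
        System.out.println("Predicted winery: " + (char) ('A' + winner)); // A
    }
}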

The network guessed 5 of 5! As we can see, in the first attempt we succeeded in finding a pretty good solution, so we can conclude that this type of neural network architecture is a good choice. Now we want to experiment with the learning rate and momentum. Let's see that in an example.

Training attempt 2
Step 5.2 Train the neural network

In this attempt we will keep the same architecture, but we will increase the Learning Rate to 0.7 and lower the Momentum to 0.4. This time we will see how the network behaves when it is trained with a higher learning rate. We know that the larger the learning rate, the larger the weight changes in each epoch, and the quicker the network learns. We are going to test this assumption.

The results are:

After 11 iterations the network was trained! It reached a total error below 0.01, more precisely 0.00771, which is less than we had before. But let's see what the test results show.

Step 6.2 Test the neural network

Now we want to see testing results.

And the Total Mean Square Error is much better than in the previous test too: about 50% lower, just as we wanted. Our assumption was correct.

Training attempt 3
Step 5.3 Train the neural network

But let's see what happens if we decrease the Learning Rate to 0.1 while the Momentum remains the same.

As expected, the network needed more iterations to find the solution, and more iterations means more time. After 40 iterations the total error was 0.009882, which is bigger than in the earlier attempts.

Step 6.3 Test the neural network

The test results are not as good as before, but with the smaller Learning Rate that was expected.

Training attempt 4

Even though the previous architecture gave really good results, we will try to lower the number of neurons in the hidden layer in order to see what the performance of the network will be like. We are going to use 4 hidden neurons. All other parameters will be the same as before.

Step 4.4 Creating a neural network

This is what our architecture looks like:

Step 5.4 Train the neural network

The learning parameters are left at their default values in order to compare these results with the previous ones.

Step 6.4 Test the neural network

Now we are going to test the network.

The outcome was better than expected! With a lower number of hidden neurons than before, the network gave better results. With the same number of iterations, 13, the Total Mean Square Error is smaller: 0.00398! The individual errors are pretty good too! The final part of testing this network is testing it with several input values. We will select 5 random input values from our data set. Those are:

Num Alcohol Malic_Acid Ash Ash_Alcalinity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280/OD315_of_diluted_wines Proline Class_A Class_B Class_C
1. 0.532 0.204 0.396 0.33 0.402 0.697 0.561 0.283 0.511 0.321 0.325 0.762 0.433 1 0 0
2. 0.508 0.536 0.529 0.407 0.391 0.141 0.076 0.509 0.167 0.341 0.163 0.176 0.283 0 0 1
3. 0.321 0.621 0.449 0.407 0.457 0.138 0.093 0.302 0.230 0.591 0.138 0.267 0.412 0 0 1
4. 0.424 0.123 0.353 0.320 0.326 0.359 0.226 0.755 0.066 0.381 0.407 0.117 0.123 0 1 0
5. 0.111 0.328 0.567 0.485 0.283 0.662 0.517 0.358 0.448 0.168 0.260 0.777 0.248 0 1 0

The outputs the neural network produced for these inputs are, respectively:

Num Class_A Class_B Class_C
1. 0.9101 0.0879 0.0086
2. 0.0294 0.0147 0.9644
3. 0.0519 0.0078 0.9707
4. 0.0026 0.9092 0.1704
5. 0.0287 0.9805 0.0027

The network guessed all of them! So far we can conclude that this is the best solution for our dataset. Let's see what happens if we set the Learning Rate to 0.4.

Training attempt 5
Step 5.5 Train the neural network

As we noticed before, increasing the learning rate should improve the result. Now we are going to see whether that is the case with this example too.

Now let's look at the results.

Step 6.5 Test the neural network

We can see that the results are better, but not significantly better than in the previous attempt. It took 10 iterations to reach a total mean square error of 0.0036316…, which is only 0.00035 lower than in the previous attempt. We can say that attempt 5 gave the best results so far.
But let's experiment with an even smaller number of hidden neurons. We will try with 2 of them.

Training attempt 6
Step 4.6 Creating a neural network

We create a new neural network with 2 hidden neurons in a single layer.

The learning parameters will be the same as before in order to compare the results. Now we train the network and see the results.

Step 5.6 Train the neural network

The network was successfully trained; now we will test it.

Step 6.6 Test the neural network

The results are slightly worse than with the 4-hidden-neuron architecture. It took 24 iterations to train the network, but it succeeded in reaching the required error. The Total Mean Square Error is also somewhat inferior, but not that bad, and the individual errors are fine too.

But if we continue lowering the number of hidden neurons and use only one hidden neuron with the default learning parameters, the network will not be able to train successfully. And if the network is not successfully trained, testing is not possible. So, in order to achieve a good solution, it is recommended to use more than one and fewer than five hidden neurons for this problem. So far, 4 of them gave the best solution.

Training attempt 7
Step 4.7 Creating a neural network

In this attempt we will create a different type of neural network: we want to see what happens if we create a neural network with two hidden layers. First we create a new neural network; the type will be Multi Layer Perceptron, as in the previous attempts. Then we enter the values '3 2', separated by a space. This means that the first hidden layer will have 3 neurons and the second one 2 neurons.

The new neural network has been created, and the image below shows its structure.
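In code, the '3 2' entry corresponds to passing two hidden-layer sizes to the constructor. A sketch, under the same assumptions as the earlier snippets:

import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.util.TransferFunctionType;

public class TwoHiddenLayers {
    public static void main(String[] args) {
        // 13 inputs, a hidden layer of 3, a hidden layer of 2, 3 outputs.
        MultiLayerPerceptron network =
                new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 13, 3, 2, 3);
        System.out.println("Layer count: " + network.getLayersCount()); // 4
    }
}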

Step 5.7 Train the neural network

Now we are going to train the network and see whether it gives better results than the architecture with one hidden layer. The default values of the learning parameters will not be changed.

Step 6.7 Test the neural network

The two-layer network was successfully trained, so the testing could be done. We can see that it needed two iterations more than our best solution, attempt 5, and the Total Mean Square Error is worse than in that case. We can conclude that a two-hidden-layer architecture is able to find the solution, but it is not necessary for our problem, and we can get better results using just one layer.

Advanced training techniques

One of the major advantages of neural networks is their ability to generalize: a trained network can classify data from the same class as the learning data that it has never seen before. This is a really big benefit, because in real-world applications developers have only a small part of all possible patterns available when building a neural network.

To test this, we have to divide our data set into two sets: one part remains the training set, and the second one becomes the test data. The test set then provides a completely independent measure of network accuracy. It is important to emphasize that these two sets do not share any data. We will see how fast and how well the network learns if we shrink the training set, and we will examine the ability of the network to classify input patterns that are not in the training set.

To reach the best generalization, the data set should ideally be split into three parts: a validation, a training and a testing set; here we will use the simpler two-way split described above.

In the advanced training we are going to use the architecture from attempt number 4: 13 inputs, 3 outputs, 4 hidden neurons and all other parameters used before. First, we will take 40% of the data for training the network and 60% for testing. Then the ratio will be changed: 30% to train and 70% to test. The percentage of training data then descends further to 20% (with 80% for testing). At the end, we will try to train the network with only 10% of the data and test it on the remaining 90%. Let us see how that works.
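A simple way to produce such a random split in code is sketched below (a manual version in plain Java on top of the Neuroph DataSet class; newer Neuroph releases may also offer built-in sampling methods):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.neuroph.core.data.DataSet;
import org.neuroph.core.data.DataSetRow;

public class SplitWineData {

    // Randomly assign trainPercent of the rows to the training set
    // and the remainder to the test set.
    static DataSet[] split(DataSet full, int trainPercent) {
        List<DataSetRow> rows = new ArrayList<>(full.getRows());
        Collections.shuffle(rows);
        int trainCount = rows.size() * trainPercent / 100;

        DataSet train = new DataSet(13, 3);
        DataSet test = new DataSet(13, 3);
        for (int i = 0; i < rows.size(); i++) {
            (i < trainCount ? train : test).addRow(rows.get(i));
        }
        return new DataSet[] { train, test };
    }

    public static void main(String[] args) {
        DataSet full = DataSet.createFromFile("wine_normalized.txt", 13, 3, ",");
        DataSet[] parts = split(full, 40); // 40% train / 60% test, as in attempt 8
        System.out.println("Train: " + parts[0].getRows().size()
                + ", test: " + parts[1].getRows().size());
    }
}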

Training attempt 8
Step 3.8 Create a training set

In this first advanced attempt we randomly choose 40% of the instances for training and keep the remaining 60% for testing. The initial training set has 178 instances, which means we now have 71 instances in the training set and 107 in the test set. The Max Error remains 0.01, the Learning Rate 0.2 and the Momentum 0.7.

Step 5.8 Train the neural network

If we look at the graph above and compare it with the one from attempt 4, we see that the function approaches the wanted value only after more iterations: we do not have the fast decrease at the beginning like before. But still, even with 40% of the data the network succeeded in reaching the wanted error value. And what about the test results? Here we test its ability to make judgments about instances that are not in the training set.

Step 6.8 Test the neural network

The Total Mean Square Error is not below 0.01 as we wanted, but it is close enough. The individual errors are mostly good, but there are some cases where they are pretty high (0.835, 0.754…).

Training attempt 9
Step 3.9 Create a training set

Now we are going to use 30% of the data set, or 53 instances, for training, and the remaining 70%, that is 125 instances, to test the network.

Step 5.9 Train the neural network

This will be a brave attempt: we will try the default parameters, resisting the desire to increase the Learning Rate.

The network succeeded. The error needed some time to start falling, not immediately as we had become used to, but after 26 iterations the total error was below 0.01. Now we have to test it.

Step 6.9 Test the neural network

The Total Mean Square Error is expectedly higher, but 0.02272… is still a perfectly acceptable value. In order to verify the network, we are going to randomly take 5 instances again:

Num Alcohol Malic_Acid Ash Ash_Alcalinity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280/OD315_of_diluted_wines Proline Class_A Class_B Class_C
1. 0.537 0.150 0.396 0.253 0.304 0.490 0.485 0.283 0.303 0.206 0.569 0.520 0.529 1 0 0
2. 0.647 0.182 0.471 0.691 0.185 0.310 0.316 0.264 0.196 0.210 0.407 0.553 0.138 0 1 0
3. 0.834 0.202 0.583 0.237 0.457 0.690 0.643 0.396 0.401 0.492 0.523 0.602 0.784 1 0 0
4. 0.666 0.200 0.508 0.446 1 0.748 0.622 0.396 0.609 0.414 0.382 0.773 0.650 1 0 0
5. 0.463 0.409 0.717 0.588 0.457 0.193 0.034 0.755 0.107 0.283 0.236 0.381 0.23 0 0 1

The outputs the neural network produced for these inputs are, respectively:

Num Class_A Class_B Class_C
1. 0.739 0.2962 0.005
2. 0.0121 0.9596 0.0528
3. 0.9742 0.025 0.0033
4. 0.9709 0.0316 0.0032
5. 0.0092 0.0248 0.9749

As we can see, the network guessed correctly in all five instances.

Training attempt 10
Step 3.10 Create a training set

In attempt 10 we are going to use 20% of the data set for training and the remaining 80% to test the network. We will be persistent and still not increase the Max Error, the Learning Rate or the Momentum.

Step 5.10 Train the neural network

This time the network also succeeded in reaching the desired error level, but it needed 50 iterations, almost twice as many as in the previous attempt. Also, we can see that around iteration 26 it was close to the solution, but the network had trouble getting below 0.01.

Step 6.10 Test the neural network

As we can see, with the same parameters the results are almost twice as bad as before, even though we only shifted 10% of the data from the training set to the test set. The Total Mean Square Error is pretty high, well above the 0.01 we were aiming for.

Training attempt 11
Step 3.11 Creating a Training Set

The last attempt, but not the least. Now we are going to try to train the network using only 10% of the data, that is, 17 instances! Since this is a small number of instances, in order to make the network perform better we will change the Learning Rate to 0.4; the Momentum will stay the same.

Step 5.11 Train the neural network

We are going to train the network, and see what the results will be like.

Even with this small number of instances we succeeded in training our network. It took only 32 iterations to get the error below 0.01. Now the only thing left to do is to test the network.

Step 6.11 Test the neural network

Even though we used a higher value for the Learning Rate, the test results are not as good as we wanted. Furthermore, among the individual errors there are some that are really high: 0.8349, 0.7167… But knowing that we used only 17 instances to train the network, these results are not shocking at all.

Conclusion

During this experiment we created several different neural network architectures. Our goal was to find the architecture that gives the best results and to explore its capability to generalize.

What proved to be crucial to the success of the training is the selection of an appropriate number of hidden neurons when creating a new neural network. One hidden layer with more than one neuron in it was perfectly sufficient to find really good solutions. We used 17, 4 and 2 hidden neurons, and the best results were reached with 4 neurons in one hidden layer. Also, through the various tests we demonstrated the sensitivity of neural networks to high and low values of the learning parameters. In the end, we divided the original data set into two sets, training and testing, in order to examine the generalization ability of the architecture with four hidden neurons. We have seen that the network was able to give very good results for cases that had not been presented to it before.

The final results of our experiment are given in the two tables below. The first table (Table 1) contains the results obtained using standard training techniques, and the second table (Table 2) the results obtained using advanced training techniques. The best solution among the standard attempts, as discussed above, was attempt 5.

Table 1. Standard training techniques

Training attempt | Hidden neurons | Hidden layers | Training set | Max error | Learning rate | Momentum | Iterations | Total mean square error | 5 random inputs test (correct guesses) | Network trained
1 | 17 | 1 | full | 0.01 | 0.2 | 0.7 | 13 | 0.004198 | 5/5 | yes
2 | 17 | 1 | full | 0.01 | 0.7 | 0.4 | 11 | 0.00272 | / | yes
3 | 17 | 1 | full | 0.01 | 0.1 | 0.4 | 40 | 0.00550 | / | yes
4 | 4 | 1 | full | 0.01 | 0.2 | 0.7 | 13 | 0.00397 | 5/5 | yes
5 | 4 | 1 | full | 0.01 | 0.4 | 0.6 | 10 | 0.00362 | / | yes
6 | 2 | 1 | full | 0.01 | 0.2 | 0.7 | 24 | 0.00414 | / | yes
7 | 5 | 2 | full | 0.01 | 0.2 | 0.7 | 12 | 0.00450 | / | yes

Table 2. Advanced training techniques

Training attempt | Hidden neurons | Hidden layers | Training set | Testing set | Max error | Learning rate | Momentum | Iterations | Total mean square error | 5 random inputs test | Network trained
8 | 4 | 1 | 40% | 60% | 0.01 | 0.2 | 0.7 | 19 | 0.0177 | / | yes
9 | 4 | 1 | 30% | 70% | 0.01 | 0.2 | 0.7 | 26 | 0.0227 | 5/5 | yes
10 | 4 | 1 | 20% | 80% | 0.01 | 0.2 | 0.7 | 50 | 0.0428 | / | yes
11 | 4 | 1 | 10% | 90% | 0.01 | 0.4 | 0.7 | 32 | 0.0519 | / | yes

Download
See also:
Multi Layer Perceptron Tutorial
