Product classification – the ideal task for machine learning
When looking for machine learning use cases in your business, you don’t always have to think about huge, ground-breaking projects scheduled to run for years. To benefit from this technology, it is good to start with simpler work, for example using artificial intelligence to handle one part of a selected business process.
An example of this approach was software we developed to handle one stage of our Client’s business process, which until then had been done by hand by dedicated employees. Specifically, it was the stage of granting installment credit in which a product branch is assigned to the article being sold. For example, “fold-out sofa estella for 2 persons” belongs to the “furniture” branch, while “synology RT19 router WLA” belongs to “computer hardware”.
Since the branch of the article is required at later stages of the credit process, an assignment error might expose the company to losses or other unwanted consequences. The risk is very real: although the list of branches is predefined and fixed, the article names that have to be assigned to them come from systems running at cooperating stores, and those systems share no common format.
Hence, the software we were developing was tasked with automating this activity and, at the same time, minimizing the risk of error. To meet these requirements we used machine learning to build a mapping from the space of names into the space of branches.
The mechanism had to meet a few important criteria. The first was accuracy. To assess our function numerically, we used elementary statistics and defined the accuracy indicator as the ratio of successful branch assignments to the total number of attempts. We assumed that we would be satisfied with the function only when this accuracy exceeded 0.90, that is 90%.
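The accuracy indicator defined above can be sketched in a few lines of Python. This is only an illustration: the function name `accuracy` and the sample labels are made up for the example and are not taken from the project.

```python
def accuracy(predicted, actual):
    """Ratio of successful branch assignments to all attempts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one attempt is needed")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, for illustration only.
actual = ["furniture", "computer hardware", "furniture", "furniture"]
predicted = ["furniture", "computer hardware", "furniture", "household goods"]

score = accuracy(predicted, actual)  # 3 successful attempts out of 4
meets_target = score > 0.90          # the 0.90 target adopted in the text
```

With three correct assignments out of four, this toy sample would not meet the 0.90 target.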
Another important requirement was the ability to assess how certain a result is. Our function won’t be 100% accurate. Its results are bound to be wrong from time to time, and the automated decision process has no way of knowing when. One approach would be to ignore this and accept that the credit process will occasionally run on an incorrectly assigned branch, but it is much better to know when the assigned branch is uncertain and, in that case, redirect the process to manual handling.
We express this uncertainty as the probability that the result is accurate, a number in the range from 0 to 100%. Further on we also call it the level of trust.
Two ranges matter here. A probability of 10% or less means that the result is almost certainly incorrect, while a probability above 95% means that the result is almost certainly correct.
How to interpret values between 10% and 95% depends on the specific use case in which the function is applied and on the appetite for risk that the company is prepared to accept. In our case (low appetite for risk), a branch assigned with a low level of trust should also be routed to manual handling.
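A minimal sketch of such routing is shown below. The 95% boundary comes from the text, but using it as the cut-off for automatic processing is an assumption made for the example, as are the names `route` and `auto_threshold`; the real cut-off depends on the appetite for risk.

```python
# Above this probability the text calls a result "almost certainly correct".
ALMOST_CERTAINLY_RIGHT = 0.95

def route(probability, auto_threshold=ALMOST_CERTAINLY_RIGHT):
    """Decide how an assigned branch is handled, given the level of trust."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # Low appetite for risk: anything that is not clearly trusted,
    # including the uncertain middle range, goes to an employee.
    return "automatic" if probability > auto_threshold else "manual"

# Hypothetical trust levels, for illustration only.
decisions = [route(p) for p in (0.9994, 0.60, 0.05)]
```

A company with a higher appetite for risk could simply pass a lower `auto_threshold`.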
The accuracy of the result and the level of trust might occur in various combinations, as illustrated below:
Two significant areas should be distinguished here. The first is the desired area, which confirms the usefulness of our function; the second, on the contrary, shows that its results are useless.
The desired area consists of two combinations of accuracy and trust:
- Accurate result with high trust
- Inaccurate result with low trust
A branch assigned with high trust will be processed automatically, so it is good that it is accurate.
A branch assigned with low trust will be rejected and processed manually, which is also the desired outcome, since it was assigned incorrectly.
We would like a significant share of our function’s results to fall into these areas.
The undesired area also consists of two combinations of accuracy and trust:
- Accurate result with low trust
- Inaccurate result with high trust
A branch assigned with high trust will be processed automatically, so it would be bad if it were inaccurate.
A branch assigned with low trust will be rejected, which is a shame, since it was assigned correctly and we lose a good business opportunity.
We would like as few results as possible to fall into this area.
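To see how results spread over the desired and undesired areas, one could count them as in the sketch below. The 0.95 trust threshold, the function name `area` and the sample results are assumptions made for illustration.

```python
from collections import Counter

def area(is_accurate, probability, threshold=0.95):
    """Name the accuracy/trust combination a single result falls into."""
    high_trust = probability > threshold
    # Desired: accurate with high trust, or inaccurate with low trust.
    kind = "desired" if is_accurate == high_trust else "undesired"
    accuracy_label = "accurate" if is_accurate else "inaccurate"
    trust_label = "high trust" if high_trust else "low trust"
    return f"{kind}: {accuracy_label}, {trust_label}"

# Hypothetical (was the branch correct?, level of trust) pairs.
results = [(True, 0.99), (True, 0.97), (False, 0.30), (True, 0.50), (False, 0.98)]

summary = Counter(area(ok, p) for ok, p in results)
desired = sum(n for label, n in summary.items() if label.startswith("desired"))
desired_share = desired / len(results)
```

The larger `desired_share` is, the more useful the function is in practice.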
Our function should deal successfully with the names on which it was developed, but it should also produce acceptable results for names that deviate slightly from them. An example is a product from one of the current branches that so far has come from one system but in the future may arrive from other store systems under a different name. A sofa may be described as “leather sofa with sleep function”, but it may also have the shortened name “sofa alabama”. The “sleep function” fragment may even be abbreviated to, e.g., “slp”.
On the other hand, we don’t expect our function to indicate the correct branch for completely new products, for example a new kind of furniture that may be introduced in the future.
The need for generalization in this project was smaller than usual, since the variety of articles that can be sold in installments is limited compared with, for example, the set of possible e-mails, articles or social media posts.
One possible way to develop our function would be a typical algorithmic approach: recruit a good analyst who would look at our data, analyze it and work out how to determine the branch from it.
Such a person will certainly notice that the occurrence of the word “sofa” in the name of the article guarantees the furniture branch. The same goes for other words such as couch or chair.
Hence, we can build a list (base) of words whose presence in the name indicates the furniture branch.
The real-world case is not that simple. What should we do with the word “wheel”, for example? Does its occurrence mean that we are dealing with the car industry? Not necessarily. There are bicycle wheels as well, and these belong to the sport and tourism branch.
Therefore, we have to formulate variant logic, which makes both our keyword base and the algorithm more complicated.
Lastly, there are the seemingly trivial issues to handle, such as normalizing letter case, removing special characters and punctuation, and accounting for inflection, typographical mistakes and abbreviations.
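To give a feel for this rule-based approach, here is a minimal sketch. The keyword sets, the context words for “wheel” and the function names are hypothetical; it handles only letter case and punctuation, and leaves inflection, typos and abbreviations unsolved.

```python
import re

# Hypothetical keyword base built by an analyst.
FURNITURE_WORDS = {"sofa", "couch", "chair"}
# Variant logic for an ambiguous word: the branch depends on a context word.
WHEEL_CONTEXT = {"bicycle": "sport and tourism", "car": "car industry"}

def normalize(name):
    """Lower-case the name, replace punctuation with spaces, split into words."""
    return re.sub(r"[^\w\s]", " ", name.lower()).split()

def classify_by_keywords(name):
    """Return a branch if a rule matches, otherwise None."""
    words = normalize(name)
    if FURNITURE_WORDS.intersection(words):
        return "furniture"
    if "wheel" in words:
        for context_word, branch in WHEEL_CONTEXT.items():
            if context_word in words:
                return branch
    return None  # no rule matched: leave the decision to a person

branch = classify_by_keywords("Fold-out SOFA Estella, for 2 persons")
```

Even in this tiny form, every new ambiguous word needs its own rule, which is exactly what makes the keyword base hard to maintain.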
Would it be an easy task? Hard to say. Still, there is another way: a shortcut.
The branch assignments made so far by the company’s employees have been recorded, so we currently have at our disposal about 10,000 records of correct associations between article names and the appropriate branches.
Instead of developing the algorithm in the privacy of our own analytical mind, we can make use of machine learning.
Machine Learning – Multiclass Neural Network
We could approximate our function with several machine learning algorithms; however, we achieved the best results on the Microsoft Azure ML Studio platform with a neural network of the Multiclass Neural Network type.
It consists of a certain number of input neurons, to which we feed article names, internal neuron layers and a layer of output neurons. Each output neuron reports the probability that the item belongs to a particular branch.
Here the first problem appears. A neural network has a fixed, constant number of neurons at its input, while article names have different lengths. How, then, can we map something of variable length onto something of fixed length?
The concept of a bag of words comes to the rescue. To illustrate it, let’s take a few article names, split them into single words and remove the repetitions. Once this is done, we get a list of words such as: sofa, personal, fold-out, estella, Portland, coffee machine, delonghi, refrigerator… The length of this list is the number of dimensions of our name space.
An article name can include each word from the list a certain number of times: zero, one or more. Hence, we can assign a specific point in the name space to each article name.
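|Word – coordinate in name space|Value|
|sofa|1|
|personal|1|
|fold-out|1|
|estella|1|
|Portland|1|
|coffee machine|0|
|delonghi|0|
|refrigerator|0|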
For instance, the point (1,1,1,1,1,0,0,0) corresponds to the name “Personal fold-out sofa ESTELLA Portland”.
Thus, we could dedicate a specific input neuron to each word from the word space. We would then have as many neurons as there are positions in the word space.
The string “Personal fold-out sofa ESTELLA Portland”, used as input, causes the first through fifth neurons to receive the value “1”, while the rest receive “0”.
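The bag-of-words mapping can be sketched as follows. The vocabulary is the example word list from the text; the function name `to_point` and the whole-word matching rule are assumptions made for the illustration.

```python
import re

# The example word list from the text; each entry is one coordinate.
VOCABULARY = ["sofa", "personal", "fold-out", "estella", "portland",
              "coffee machine", "delonghi", "refrigerator"]

def to_point(name):
    """Count how many times each vocabulary entry occurs in the name."""
    normalized = " ".join(name.lower().split())
    point = []
    for word in VOCABULARY:
        # Match the entry only as a whole word (or whole phrase).
        pattern = rf"(?<!\S){re.escape(word)}(?!\S)"
        point.append(len(re.findall(pattern, normalized)))
    return tuple(point)

point = to_point("Personal fold-out sofa ESTELLA Portland")
# point == (1, 1, 1, 1, 1, 0, 0, 0)
```

Each element of the resulting tuple is the value that the corresponding input neuron would receive.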
Does this solve our problem?
Though it seems it does, it actually doesn’t.
In practice, names may contain a great many different words (tens of thousands), so our set of words, and with it the number of input neurons, would have to be just as large, which is still too many to handle given the capabilities of current technology. Besides, it would be a waste of resources, because there is a far cleverer way to approach the task.
What I have in mind is a technique called hashing. In the simplest terms, it is an algorithm that turns a character string of variable length into a number within a fixed range.
In this project we used a hashing function that produces a number ranging from 1 to 1024, that is 1024 possible values, as many as a 10-bit hash can take.
For the sake of illustration, let’s assume that the word “sofa” hashes to the number 345, the word “personal” to the number 719, and so on for every other word. This means that any given name can be translated into a numerical space.
The result is similar to the previous one, except that a name is no longer described against the set of all possible words (tens of thousands) but by the hash numbers of the words it contains (several).
The operation of splitting a given character string into words, building a dictionary of words and transforming them into hashes is called feature hashing.
With a hashing function that yields 1024 possible values (a 10-bit hash), we now need only 1024 input neurons. Each neuron is dedicated to one of the possible hash values.
The illustration below shows how the name „Personal fold-out sofa ESTELLA” will be processed by the input layer.
All 1024 neurons will receive the value 0 except those whose numbers correspond to the hashes of the words included in the name.
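The idea of feature hashing can be sketched as below. This is not the hashing used by Azure ML Studio: MD5 is used here only as a convenient, deterministic stand-in, the buckets are numbered from 0 to 1023 rather than from 1 to 1024, and the names `bucket` and `hash_features` are made up for the example.

```python
import hashlib

HASH_BITS = 10
N_BUCKETS = 2 ** HASH_BITS  # 1024 possible hash values

def bucket(word):
    """Map a word of any length to a number from 0 to N_BUCKETS - 1."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_BUCKETS

def hash_features(name):
    """Build the input vector: one slot per hash value, mostly zeros."""
    features = [0] * N_BUCKETS
    for word in name.split():
        features[bucket(word)] += 1
    return features

features = hash_features("Personal fold-out sofa ESTELLA")
active = [i for i, value in enumerate(features) if value > 0]
```

Two different words can land in the same bucket (a collision); with only a few words per name and 1024 buckets this should be rare, and in this sketch the colliding counts are simply added together.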
Once the name has been fed into the input layer, the neurons perform their calculations and the computed numerical values appear in the output layer. These numbers tell us the chance that, according to the network, the given name belongs to a particular branch.
We expect to receive small numbers from all neurons except one, whose value will be close to “1”.
For example, the first neuron from the top claims that, for the given input, the result should be the branch with code 1, “Consumer electronics articles”, with a probability of 0.1%. The second neuron says that the article name points to the “Household goods” branch, but the chance of that is only 0.0002. Neuron no. 3 is very confident of its result: it gives a 99.94% probability that we are dealing with an item from the furniture category. The remaining neurons propose incorrect branches, but with a very low level of certainty.
In principle, we always take the neuron with the highest level of trust; on this basis we determine which branch the network picks as its overall result and, consequently, the result of our function.
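Picking the overall result from the output layer can be sketched like this. The output values loosely follow the example above and are illustrative only; the function name `pick_branch` is an assumption.

```python
def pick_branch(outputs):
    """Return the branch with the highest output value and that value."""
    branch = max(outputs, key=outputs.get)
    return branch, outputs[branch]

# Illustrative output-layer values, one per branch.
outputs = {
    "Consumer electronics articles": 0.001,
    "Household goods": 0.0002,
    "Furniture": 0.9994,
}

branch, trust = pick_branch(outputs)
```

The second element of the returned pair is the level of trust that can later decide between automatic and manual handling.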
So much for theory. In practice, we used the Microsoft Azure Machine Learning Studio environment. In this tool, models are constructed in a graphical editor by adding and connecting nodes that represent specific computations.
The first step in the diagram loads the initial data. After that we perform a series of operations that put the data in order; these computations are hidden in the Edit Metadata and Preprocess Text nodes. The Split Data node sends one portion of the data (80% in this case) down the left branch, and the remaining portion down the right branch.
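Outside the graphical editor, the 80/20 split could look roughly like the sketch below. It assumes a random split with a fixed seed; the function name `split_data` and the sample records are hypothetical, and the sketch does not reproduce the internals of the Split Data node.

```python
import random

def split_data(records, train_fraction=0.8, seed=0):
    """Shuffle the records and cut them into a training and a test part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records standing in for (article name, branch) pairs.
records = [f"article name {i}" for i in range(10000)]
train, test = split_data(records)
```

The training part plays the role of the left branch of the diagram, and the test part the role of the right branch.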
We subject the data coming out of the Split Data node to feature hashing. The data from the left branch goes to the most important element of the model, the Train Model node. This node is also connected to the as yet untrained neural network.
Score Model starts once training of the network is complete. The node takes the data from the right branch (the data that took no part in training the network), checks which product branch the network predicts and compares the result with the branch that is known to be correct. Based on this information the node can assess whether the network got its answer right or wrong.
The Evaluate Model node gathers information about both correct and incorrect predictions. Using statistical computations, it summarizes them into indicators that show the strengths and weaknesses of the trained network.
Precision is the chance that a randomly chosen result produced by the network is correct. It is the ratio of the number of correct designations of a particular branch to the number of all designations of that branch. For example, if the network assigned 1000 names to the furniture branch but 10 of them were mistakes, precision would be 990/1000.
Recall is the chance that a randomly chosen item of input data will be classified properly. It is the ratio of the number of correct designations of a particular branch to the number of all occurrences of that branch in the initial data. Let’s assume that there were 1100 names from the furniture branch in the file, but the network properly identified only 990. Recall would then amount to 990/1100.
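Both indicators can be computed for a single branch as in the sketch below. The label lists are synthetic and constructed only to reproduce the numbers from the two examples above; the function name `precision_recall` is an assumption.

```python
def precision_recall(actual, predicted, branch):
    """Precision and recall of the predictions for one branch."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == branch and p == branch for a, p in pairs)
    designated = sum(p == branch for _, p in pairs)  # all designations of the branch
    occurrences = sum(a == branch for a, _ in pairs)  # all occurrences in the data
    precision = correct / designated if designated else 0.0
    recall = correct / occurrences if occurrences else 0.0
    return precision, recall

# Synthetic data: 990 correct furniture designations, 10 names wrongly
# designated as furniture, and 110 furniture names that were missed.
actual = ["furniture"] * 990 + ["household goods"] * 10 + ["furniture"] * 110
predicted = ["furniture"] * 990 + ["furniture"] * 10 + ["household goods"] * 110

precision, recall = precision_recall(actual, predicted, "furniture")
# precision is 990/1000 and recall is 990/1100
```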
As we can see, all available indicators show that our network is doing really well.
An interesting way to check how well the network works is to analyze the so-called confusion matrix.
It is a matrix in which rows relate to the known correct branches and columns relate to the branches predicted by the network. Each cell at an intersection stores the percentage of names from the row’s branch that were assigned to the column’s branch. Ideally, we would have zeros in every cell except those on the diagonal, where we would have “100%”.
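A row-normalized confusion matrix of this kind can be sketched as follows. The branch codes and labels are hypothetical toy data, and the function name `confusion_matrix` is chosen for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted, branches):
    """Rows: known correct branch; columns: predicted branch; cells: percent of the row."""
    counts = Counter(zip(actual, predicted))
    matrix = []
    for true_branch in branches:
        row_total = sum(counts[(true_branch, p)] for p in branches)
        row = [
            100.0 * counts[(true_branch, p)] / row_total if row_total else 0.0
            for p in branches
        ]
        matrix.append(row)
    return matrix

# Hypothetical toy labels for three branches.
actual = [1, 1, 2, 2, 2, 2, 3]
predicted = [1, 1, 2, 2, 2, 3, 3]

matrix = confusion_matrix(actual, predicted, branches=[1, 2, 3])
# The second row shows 75% of branch 2 assigned correctly and 25% sent to branch 3.
```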
In our case the upper half of confusion matrix looks like this:
It is fine, but not perfect. For example, in the second row we can see that 99.8% of the names that actually belong to branch “2” were correctly assigned to branch number 2, while the remaining 0.2% were wrongly assigned to branch “3”.
In the lower half of the matrix we can see that we have a problem with products belonging to branch “20”.
Most of them were assigned by the network to branch number 3. The reason is that we have only 14 such names in our initial file and, moreover, only 80% of them, that is 11, were used for training.
It is a perfect example of what we should know instinctively, namely that a neural network needs a lot of data in order to learn and function reliably.
By artificially multiplying the data we can improve the results if the need arises, but of course in that situation generalization (the network recognizing another article from the branch) is out of the question.
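One simple way to multiply the data artificially is to repeat the records of a rare branch, as sketched below. This is an assumption about what such multiplication could look like, not a description of what was done in the project; the function name `oversample` and the records are hypothetical.

```python
import random
from collections import defaultdict

def oversample(records, min_per_branch, seed=0):
    """Repeat randomly chosen records of rare branches up to a minimum count."""
    rng = random.Random(seed)
    by_branch = defaultdict(list)
    for name, branch in records:
        by_branch[branch].append((name, branch))
    result = list(records)
    for items in by_branch.values():
        missing = min_per_branch - len(items)
        if missing > 0:
            # Copies add no new names, so they cannot teach generalization.
            result.extend(rng.choices(items, k=missing))
    return result

# Hypothetical records: a well represented branch and a rare one.
records = [("sofa alabama", "furniture")] * 5 + [("item a", "20"), ("item b", "20")]
balanced = oversample(records, min_per_branch=5)
```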
Based on the trained network, a web service has been created through which the user can try the network out.
Let’s enter the name of the article, click the “Test request-response” button and check the results of the network’s operation: a list of all branches with the calculated probability that the given name belongs to each branch.
For example, if we enter the name “44 – 55 Philips TV set”, we may see something like the illustration below:
The middle of the screen was cut from the picture, leaving only the upper and lower fragments. First of all, it is clear that branch number 1 has been chosen, correctly. It can also be seen that the network is 99.99% sure of its choice.
The remaining branches received very small probabilities.
In this article, referring to a real example, I’ve tried to explain the theoretical basics of using machine learning for classification, the process of building a prototype in the Microsoft Azure Machine Learning Studio environment, and methods of verifying it.
Since this is a genuine example, it is worth mentioning the benefits that our Client has achieved. Automating a part of their business process, namely assigning the item to a branch, has accelerated the whole process of considering a credit application. Quicker feedback for people interested in credit has translated directly into much greater satisfaction with the service.
At Altkom Software & Consulting, we are fascinated by solutions that allow us to change the world around us. However, we realize that each change is an investment that has to pay off. The use of neural networks allowed us to come up with a very effective solution that was, as a matter of fact, also the least expensive compared with the case where we would use a standard engine of expert classification rules.
Author: Mariusz Surma, Senior Analyst at Altkom Software & Consulting