The steps of the trail recommendation algorithm
As the basis for this machine learning experiment, we used the trail catalog of eTrilhas, a nature tourism company for which we developed a digital product. The catalog has about 500 trails, each with a series of attributes such as time, distance, difficulty and available activities.
The objective of the recommendation system in this case is to suggest a trail based on another trail as input, in an “if you did this trail, you may like this one” model. The method used was clustering, an Unsupervised Learning approach in which we group trails into however many groups we define; here, 5 clusters, or groups, were chosen.
The first step consists of treating the database (a code sketch follows this list). This involves a few sub-steps, such as:
- Eliminate columns that are not relevant to the objective, such as fields with image URLs and managers' e-mail addresses.
- Convert relevant columns that are in textual form into numerical or boolean (true/false) variables. For example, if a field contains the textual information “there is hydration”, we can convert it to the value 1 and “there is no hydration” to the value 0. If the difficulty of the trail can be low, medium or high, we can convert it to 0, 1 and 2.
- Remove free-form textual content. In the scope of this algorithm we are not dealing with textual variables, although that is possible.
- Remove blank and null fields that affect the classification as a whole.
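A minimal sketch of this cleaning step in pandas. The column names (image_url, manager_email, hydration, difficulty) are hypothetical; the real eTrilhas catalog has its own schema:

```python
import pandas as pd

# Load the trail catalog (hypothetical file name)
df = pd.read_csv("trails.csv")

# Drop columns that are not relevant to the objective (image URLs, managers' e-mails)
df = df.drop(columns=["image_url", "manager_email"], errors="ignore")

# Convert textual fields to numeric / boolean values
df["hydration"] = df["hydration"].map({"there is hydration": 1, "there is no hydration": 0})
df["difficulty"] = df["difficulty"].map({"low": 0, "medium": 1, "high": 2})

# Keep only numeric columns and drop rows with blank or null values
df = df.select_dtypes(include="number").dropna().reset_index(drop=True)
```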
From there, it is already possible to carry out a correlation analysis, that is, to identify fields that are correlated with each other. Fields with a very high correlation tend to be less relevant to the algorithm because of their redundancy.
In the correlation matrix below, for example, the lighter cells indicate fields that are more correlated with each other. Beyond the diagonal of 1 values where each field intersects with itself (a field obviously has a perfect correlation with itself), we can draw some conclusions.
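Computing the matrix is a single pandas call; the heatmap below is a sketch of the kind of plot described (the original figure may have been generated differently):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation between every pair of numeric fields
corr = df.corr()

# Lighter cells = stronger correlation, as in the figure described above
sns.heatmap(corr, cmap="viridis", vmin=-1, vmax=1)
plt.show()
```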
Above we can see that difficulty and risk exposure are highly correlated, but not perfectly: there can still be a difficult trail without exposure to risk, and vice versa. Even so, 0.71 is a very high correlation index (the maximum would be 1).
Another strong correlation is between positive and negative elevation change, that is, how much a trail goes up or down (not necessarily its maximum altitude). The correlation makes sense because trails tend to start and end at the same base. In our jargon, “what goes up must come down”.
Of course, correlation by itself can lead to hasty conclusions. There is a site called “Spurious Correlations” that shows very picturesque comparisons which, from a purely mathematical point of view, still demonstrate correlation. In the example below, the number of films featuring Nicolas Cage versus drowning deaths in swimming pools.
The algorithm is gaining (or losing) members
The next step is to “normalize” the data so that the fields have comparable scales. For example, if we have an altitude field that goes from 0 to 4,500 and a difficulty field that goes from 0 to 2, they must be converted to the same scale so as not to cause distortion.
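One common way to do this is min-max scaling, sketched below with scikit-learn (the article does not say which scaler was actually used, so this is an assumption):

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every field to the 0-1 range so that altitude (0 to 4,500) and
# difficulty (0 to 2) carry comparable weight
scaler = MinMaxScaler()
X = scaler.fit_transform(df)
```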
Next, we need to reduce the number of columns, and here we use PCA (Principal Component Analysis), an algorithm for dimensionality reduction and for extracting the relevant information from multidimensional datasets. It identifies the main directions (principal components) along which the data show the greatest variability. In the example alongside, the roughly 60 columns have been reduced to two, which greatly facilitates training.
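A sketch of this reduction with scikit-learn, keeping two components as in the example:

```python
from sklearn.decomposition import PCA

# Project the ~60 normalized columns onto the 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Share of the original variability retained by the two components
print(pca.explained_variance_ratio_)
```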
The classification / clustering of data in machine learning
In this step, we will classify each trail into a specific group based on its characteristics. Think of a classroom in which we divide the class into the group that sits in the back, the group that sits in the front, the girls who talk, the nerds, and so on.
Note that here I am purposely using classifications that are absolutely stereotyped. Perhaps, with a large database about the class, we would realize that one group from the back of the class actually shares a strong trait of leadership and another a trait of transgression, and after running the algorithm we would in fact have these two groups separated rather than lumped together as the “fundão” (the back of the class).
Therefore, much of what we say about the value judgment of algorithms actually comes down to the characteristics of the data we input and the relevance we give them, and then we return to the question of how artificial intelligence is a cruel mirror.
Back to our example, we chose to group the trails into 5 clusters. See below the cluster to which 10 example trails were assigned:
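A sketch of this clustering step with scikit-learn's KMeans (the article does not name the specific clustering algorithm, so this is an assumption):

```python
from sklearn.cluster import KMeans

# Group the PCA-reduced trails into 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_2d)

# Inspect the cluster assigned to the first 10 trails
print(df["cluster"].head(10))
```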
Now, from the same data, we can look visually at how the trails are grouped, in a 2D view…
… or 3D:
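The 2D view can be reproduced with a simple scatter plot of the two principal components, colored by cluster (a 3D view would require keeping three components in the PCA step):

```python
import matplotlib.pyplot as plt

# 2D scatter of the PCA-reduced trails, colored by cluster
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df["cluster"], cmap="tab10")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()
```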
At this point, much of the work is complete. What we do now is select a specific trail, identify which cluster it belongs to, retrieve the other trails in that cluster and use Euclidean distance to find which of them have values closest to the selected trail. And voilà:
Note that the “distance” field indicates the difference between one point and another in feature space, not the physical length of the trail itself.
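A sketch of this final recommendation step, assuming the DataFrame and the PCA output from the previous snippets (with row positions aligned):

```python
import numpy as np

def recommend(trail_pos, k=5):
    """Return the k trails closest to the selected one, restricted to its cluster.
    The distance is Euclidean, in feature space, not the trail's physical length."""
    same_cluster = np.where(df["cluster"].values == df["cluster"].iloc[trail_pos])[0]
    same_cluster = same_cluster[same_cluster != trail_pos]

    # Euclidean distance between the selected trail and every candidate
    dists = np.linalg.norm(X_2d[same_cluster] - X_2d[trail_pos], axis=1)
    order = np.argsort(dists)[:k]
    return list(zip(same_cluster[order], dists[order]))

# Example: the 5 trails most similar to the trail at position 0
print(recommend(0))
```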
Thus, using Machine Learning, we developed an artificial intelligence algorithm for recommending trails in order to increase sustainability.