Dataset description

We use the Hitchwiki maps dataset [19]. It contains 21,562 rated points (on 12 August 15), which are not uniformly distributed across the world. For example, the whole of Africa contains only 167 points. The top-10 countries sorted by number of points are listed in Table 2.

Table 2 Top-10 countries from Hitchwiki dataset

These results reflect differences in how widespread hitchhiking culture is in different countries, as well as in the popularity of the Hitchwiki website itself. To make the analysis extensive, we run all the experiments described below on the top-6 countries: Germany, France, Poland, Netherlands, United Kingdom and United States. In this section, we show a graph for a single country due to space constraints and provide the graphs for the remaining countries in Additional file 1.

Since the dataset is Volunteered Geographic Information (VGI) crowdsourced by users, its credibility should be considered. A recent extensive survey on quality assessment of VGI [22] summarizes recent findings in the field. VGI data has been successfully applied in many areas, such as discovering Points of Interest (POI) to estimate urban land use for urban planners [23] or travel recommendations [24]. Even though there are 2525 points from Germany in the dataset, they have various levels of credibility. Some locations have only one vote, while others have been assessed 20–30 times and thus carry higher confidence. For example, among the 2525 spots from Germany, only 107 have 10 or more rating votes. We define the minimum number of votes per location as N min, and assess this parameter later.

Users can create new points and edit existing ones, assess their rating and waiting time, and add comments about them. Since there are no rules on how locations are assessed, users may vote on both the “hitchability” rating of a location (on a scale from 1 to 5, from “Senseless” to “Very good”) and its waiting time (5, 10, 15 etc. minutes). Since a point may be assessed several times, the averaged integer ratings become continuously distributed over the interval [1, 5], and the waiting times over the half-interval [5, ∞).
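The aggregation of votes and the N min threshold can be sketched as follows. This is a minimal illustration only: the function name `aggregate_points`, the parameter `n_min` and the example votes are hypothetical and are not taken from the Hitchwiki data or from the original implementation.

```python
from statistics import mean


def aggregate_points(votes, n_min=1):
    """Average the integer votes of each point and keep well-rated points.

    votes: dict mapping a point id to a list of integer ratings (1..5).
    Returns a dict mapping point id to (mean rating, number of votes),
    restricted to points with at least n_min votes.
    """
    return {
        point_id: (mean(ratings), len(ratings))
        for point_id, ratings in votes.items()
        if len(ratings) >= n_min
    }


# Hypothetical votes, invented for illustration.
example_votes = {"spot_a": [5, 4, 4], "spot_b": [2], "spot_c": [3, 4]}
filtered = aggregate_points(example_votes, n_min=2)
```

With `n_min=2` the single-vote point is dropped, and the remaining mean ratings are no longer integers, which is the effect described above.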

Some locations have many detailed comments, which usually describe aspects such as how good the spot is, in which directions most cars passing the spot go, and how long it takes to catch a car there. Consequently, text analysis of comments is another future direction that could improve the accuracy of the hitchhiking recommender system, and some initial results are presented below.

Users can vote on a spot's rating and/or record the waiting time they experienced. Due to user preferences, there are roughly twice as many ratings as waiting times in the dataset. Figure 1 a and b show that there is a moderate correlation between the two for different N min.

Fig. 1 a Correlation coefficient between waiting time and rating for all countries. b Correlation between waiting time and rating for different N min for Poland

Figure 1 a shows that there are not enough points with higher N min values for some countries, and generally the correlation between waiting time and rating is moderate.
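A correlation of this kind could be computed per N min threshold roughly as below. This is a sketch under stated assumptions: the tuple layout `(rating, waiting_time, n_votes)`, the minimum subset size of 3 and the example values are hypothetical, and the Pearson coefficient is used because the text does not name a different one for this figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (None if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant variable has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlation_by_n_min(points, thresholds, min_points=3):
    """points: list of (rating, waiting_time, n_votes) tuples.

    Returns {n_min: correlation or None when too few points remain}.
    """
    result = {}
    for n_min in thresholds:
        subset = [(r, w) for r, w, n in points if n >= n_min]
        if len(subset) < min_points:
            result[n_min] = None
            continue
        ratings, waits = zip(*subset)
        result[n_min] = pearson(ratings, waits)
    return result


# Hypothetical points, invented for illustration.
example_points = [(4.5, 10, 5), (3.0, 25, 2), (2.0, 40, 1), (4.0, 15, 3), (3.5, 20, 4)]
correlations = correlation_by_n_min(example_points, thresholds=[1, 3, 5])
```

Returning `None` for sparse subsets mirrors the observation that some countries do not have enough points at higher N min values.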

Figure 1 b shows many points with integer ratings for N min = 1, because the data only allow users to rate locations with integers from 1 to 5. The subsequent panels exhibit far fewer points along the integer lines, because averages of longer arrays of integers are less likely to be integers. Another observation is a significant decrease in the number of points with ratings lower than 2. This tendency is also illustrated in Fig. 2: locations with higher ratings are more likely to attract new people, while bad ratings, especially with an explicit description, may be a red flag for other users. Interestingly, Fig. 3 shows no significant relationship between waiting time and ratings. The reason may be the general uncertainty of hitchhiking: even at a good location a hitchhiker sometimes needs to wait long, so the average waiting time does not decrease significantly.

Fig. 2 Average rating and N min

Fig. 3 N min and waiting time and rating

Road type analysis

The coordinates of a hitchhiking location do not reveal anything specific about its relative placement. Obviously, hitchhiking points are located next to roads, but it is unknown where exactly: on a motorway or on a small village road. The question of assigning a point to a road therefore becomes crucial. A natural criterion is to classify roads by their type using a road hierarchy [25], so the road type of a point is determined by the type of its closest road.
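Assigning a point to the type of its closest road could look like the sketch below. It rests on a simplifying assumption that is not in the text: each road is represented by sampled coordinates `(lat, lon, road_type)` rather than by its full geometry, and distance is measured with the haversine formula. All names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_road_type(point, road_samples):
    """Return the type of the road sample nearest to point.

    point: (lat, lon); road_samples: list of (lat, lon, road_type).
    Assumption: roads are approximated by sampled coordinates.
    """
    lat, lon = point
    nearest = min(road_samples, key=lambda s: haversine_km(lat, lon, s[0], s[1]))
    return nearest[2]


# Hypothetical road samples, invented for illustration.
example_roads = [(52.00, 13.00, "motorway"), (52.10, 13.10, "tertiary")]
assigned_type = closest_road_type((52.01, 13.01), example_roads)
```

A production version would measure the distance to road segments rather than to sampled points and would use a spatial index instead of a linear scan.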

To implement our experiments, we use OpenStreetMap and its road hierarchy [26]. For example, the category motorway is the highest among the 6 main categories and unclassified is the lowest, while the link categories denote links between roads of different categories, i.e. ramps. The classification hierarchy of roads may vary between countries, so the same category may represent different kinds of roads in different countries. To keep the analysis feasible, we therefore study roads from one country at a time. Table 3 shows, for each country, the ratio between the number of dataset points assigned to each road type and the number of roads of that type in OpenStreetMap in that country. If a ratio is close to zero, then roads of this type are not popular among hitchhikers. The greater the ratio, the more popular the road type.
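One possible reading of this popularity ratio is the number of dataset points assigned to a road type divided by the number of roads of that type. The sketch below follows that reading; whether any further normalisation is applied is not stated in the text, and the function name and counts are hypothetical.

```python
def road_type_popularity(point_counts, road_counts):
    """Ratio of assigned hitchhiking points to roads, per road type.

    point_counts: {road_type: number of dataset points assigned to it}
    road_counts:  {road_type: number of roads of that type in the country}
    Road types with no roads are skipped to avoid division by zero.
    """
    return {
        road_type: point_counts.get(road_type, 0) / n_roads
        for road_type, n_roads in road_counts.items()
        if n_roads > 0
    }


# Hypothetical counts, invented for illustration.
example_point_counts = {"motorway_link": 30, "tertiary": 10}
example_road_counts = {"motorway_link": 100, "tertiary": 900, "unclassified": 2000}
popularity = road_type_popularity(example_point_counts, example_road_counts)
```

In this invented example the link roads get a much larger ratio than tertiary roads, and a road type with no assigned points gets a ratio of zero.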

Table 3 Road types distribution by country

Link roads are especially popular among hitchhikers, followed by motorway and trunk roads. In contrast, tertiary and unclassified roads are not popular, even though they are the most common road types in all countries. Note that these data are very country-specific: for example, Polish hitchhikers tend to use primary roads instead of motorways, while trunk roads are especially popular in Great Britain. Therefore, recommendations for the desired application should take the individual features of each country into account.

Having examined the popularity of road types, we investigate ratings and waiting times at different road types. Figure 4 a and b show the average rating and waiting time with respect to N min. We did not include secondary link and tertiary link roads, which are the closest roads to only 2 and 0 points respectively. Waiting time differs between road types. For example, on primary roads it is almost half of that on a motorway. In addition, motorway link roads have almost 50% lower waiting time than the motorway itself, which is important for long-distance travellers: they should look for a proper position on a ramp rather than on the motorway itself. Potentially, this gives us an opportunity to group points by road type: high-speed roads (motorway, trunk), low-speed roads (others), and slip roads as a possible third group. In terms of rating, trunk link roads and secondary roads on average get lower ratings than other road types. For trunk links, this may be related to higher waiting times, while secondary roads may seem inefficient due to the properties of their traffic. For example, many cars may stop there, but most of them do not go far.

Fig. 4 Average values by road type. a Rating and b waiting time

Feature analysis

Following the verbal descriptions of good hitchhiking locations given in the “Methodology” section, we crawled the features that are usually located next to roads: gas stations, bus stops, traffic lights, restaurants and parking lots. These features are extracted from OpenStreetMap.

To begin with, we measure the distance from the Hitchwiki dataset points to these features. For each hitchhiking location, we compute the distance to the closest feature of each type. The histograms of these distances are given in Fig. 5.

Fig. 5 Histogram of distance from dataset points to closest facilities. X-axis: distance to the closest facility in km, y-axis: number of hitchhiking locations with this distance. a gas station, b bus stop, c traffic light, d restaurant and e parking

Next, we investigate the relationships between features, waiting times and ratings of locations in Fig. 6 a and b. The panels depict the difference in ratings and waiting times between points located in the neighbourhood of a feature and points located far from it. For example, if the waiting time difference for the bus stop feature at distance 0.02 km is −9 min, it means that the average waiting time for points whose closest bus stop is less than 20 m away is 9 min less than for the rest of the points.
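The difference described above can be sketched as a comparison of two means. The function name, the `(distance_km, value)` layout and the example numbers are hypothetical; the example is constructed so that it reproduces the −9 min case mentioned in the text.

```python
from statistics import mean


def near_far_difference(points, threshold_km):
    """Mean value of points near a feature minus mean value of the rest.

    points: list of (distance_to_closest_feature_km, value), where value is
    a waiting time or a rating. Returns None if either group is empty.
    """
    near = [value for distance, value in points if distance < threshold_km]
    far = [value for distance, value in points if distance >= threshold_km]
    if not near or not far:
        return None
    return mean(near) - mean(far)


# Hypothetical (distance to bus stop in km, waiting time in min) pairs.
example_waits = [(0.01, 10), (0.015, 12), (0.5, 20), (1.2, 20)]
difference = near_far_difference(example_waits, threshold_km=0.02)
```

Evaluating the function over a range of thresholds would give one curve per feature, one point per distance.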

Fig. 6 Difference in waiting time/rating between points which are close to features neighbourhoods vs those which are not. a Rating difference and b waiting time difference

Since we have many attributes derived from different sources, multicollinearity, i.e. the case when two or more attributes are highly correlated, becomes an important question. For example, in many cases bus stops are located near traffic lights, so the distances to these facilities may be correlated. To measure it, we take the subset of points with N min = 3 and compute variance inflation factors (VIF) of the attributes for each country. VIF is a common measure of how much the variance of an estimated regression coefficient is inflated compared to the case when the predictor variables are not linearly related; multicollinearity is considered high when VIF is larger than 5.
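For illustration, VIF values can be obtained as the diagonal of the inverse correlation matrix of the attributes, which is equivalent to 1 / (1 − R²) from regressing each attribute on the others. The pure-Python sketch below uses that identity; it is not the authors' implementation, and the function names and example columns are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length numeric columns."""
    n = len(columns[0])
    centered = [[x - sum(col) / n for x in col] for col in columns]
    norms = [sqrt(sum(x * x for x in col)) for col in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear attributes: VIF is undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each attribute: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical attribute columns with correlation 0.8, invented for illustration.
example_vif = variance_inflation_factors([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])
```

For two attributes with correlation 0.8 this gives 1 / (1 − 0.64) ≈ 2.78 for both, below the threshold of 5 used in the text.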

Variance inflation factors are given in Table 4. A VIF larger than 5 appears only for the USA, the smallest dataset. In this case, the correlation comes from the fact that one of the road types appears only twice in the whole US dataset. After removing this column, the VIF becomes 3.94.

Table 4 Variance inflation factors for different countries

However, our main goal is a system that predicts location rating, and multicollinearity does not affect its predictive performance; it affects the coefficients and respective standard errors of attributes in a linear model. In addition, Random Forest can handle multicollinearity because it samples a random subset of attributes. These results are presented in the following section.

Another important statistic is the p-value of each attribute, which tests the null hypothesis that the coefficient is equal to zero (the attribute has no effect). Low p-values therefore favor the alternative hypothesis. We measure p-values on the same data for the 3 biggest countries and select the attributes with p-values less than 0.05. The results are given in Table 5, where only_hw denotes the feature of a point being near a highway without links nearby. For the other 3 countries, there is not enough data to obtain small p-values for the attributes.

Table 5 Attributes with low p-values

To sum up, the most important features are gas stations, bus stops and traffic lights, and a high proportion of points are located next to them. All of them generally improve waiting times and ratings. The greater the distance, the more random the data become, so all time and rating differences converge to zero. Even though some hitchhiking locations are located next to parking lots, those points usually have higher waiting times and lower ratings. This facility is therefore not recommended for hitchhiking, probably due to the low number of cars passing by. However, the use of these features in a hitchhiking recommender system should be adjusted to specific countries.

Classification and regression

Since the main interest lies in distinguishing efficient from non-efficient points for a hitchhiker recommender system, a natural idea is to classify hitchhiking points as good or bad. For the following experiments, data from all countries are used for different values of N min. The idea is to train a classification algorithm to distinguish points with high and low ratings based on the given list of attributes: type of road; distances to the closest gas station, bus stop and traffic light; and whether the point is near a highway link, near a highway without a link, or isolated from all facilities. Points with a rating above 4 form the “good” class, and points with a rating below 3 form the “bad” class.
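The two-class labelling can be sketched as follows. The thresholds come from the text; the dictionary keys `rating`, `n_votes` and `features`, the function names and the example points are hypothetical.

```python
def label_point(rating):
    """Return "good" for rating > 4, "bad" for rating < 3, otherwise None."""
    if rating > 4:
        return "good"
    if rating < 3:
        return "bad"
    return None  # ratings between 3 and 4 belong to neither class


def build_samples(points, n_min=1):
    """Build (features, label) pairs from points with at least n_min votes.

    points: list of dicts with hypothetical keys 'rating', 'n_votes' and
    'features' (a tuple of numeric attributes).
    """
    samples = []
    for point in points:
        label = label_point(point["rating"])
        if point["n_votes"] >= n_min and label is not None:
            samples.append((point["features"], label))
    return samples


# Hypothetical points, invented for illustration.
example_points = [
    {"rating": 4.6, "n_votes": 5, "features": (0.05, 0.30)},
    {"rating": 3.5, "n_votes": 4, "features": (0.40, 0.10)},
    {"rating": 2.0, "n_votes": 1, "features": (2.00, 1.50)},
    {"rating": 2.5, "n_votes": 3, "features": (1.20, 0.90)},
]
example_samples = build_samples(example_points, n_min=3)
```

Points with intermediate ratings are dropped rather than assigned to a class, which matches the description of two classes separated by a gap.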

The accuracy of the classification algorithms on the given list of features for the low- and high-rating classes is given in Fig. 7, in a setting where a random 66% of points are selected for training and the remaining 34% for testing. As mentioned above, the average rating increases with N min, so there are fewer points with low ratings, and the split into training and testing data is done after filtering. Therefore, the low rating class is oversampled in the experiments, and we calculate the error bars for each classifier based on a sample of 1000 experiments.
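The repeated random-split evaluation can be sketched as below. The 66%/34% split and the 1000 repetitions follow the text; the 1-nearest-neighbour classifier is only a stand-in chosen to keep the example self-contained, the oversampling step is omitted, and all names and data are hypothetical.

```python
import random
from statistics import mean, stdev


def nearest_neighbour_predict(train, features):
    """Stand-in classifier: label of the closest training sample (1-NN)."""
    nearest = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], features)),
    )
    return nearest[1]


def repeated_split_accuracy(samples, n_runs=1000, train_share=0.66, seed=0):
    """Mean and standard deviation of accuracy over random train/test splits.

    samples: list of (features, label) pairs, already filtered by N min.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_share)
        train, test = shuffled[:cut], shuffled[cut:]
        correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return mean(accuracies), stdev(accuracies)


# Hypothetical, well-separated samples, invented for illustration.
example_samples = (
    [((0.1 * i,), "good") for i in range(10)]
    + [((10 + 0.1 * i,), "bad") for i in range(10)]
)
mean_accuracy, accuracy_std = repeated_split_accuracy(example_samples, n_runs=50)
```

The standard deviation over runs is what an error bar would display; real data would of course not be perfectly separable as in this toy example.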

Fig. 7 Accuracy of classifiers. a Points in all countries and b points in Germany

In addition, we present the results of regression models that estimate location rating without division into two classes. In this case, there is no problem of imbalanced classes. The results are presented in Fig. 8.

Fig. 8 Performance of regression models

To conclude, the results show that the given set of attributes yields high accuracy for the classification of efficient and non-efficient hitchhiking points, and the accuracy increases when points have been rated more than once. Even though the classes are skewed for the reasons mentioned above, the average accuracy of the KNN and Random Forest classifiers reaches 75% for specific countries. Classification graphs for individual countries have a similar structure, but the results vary more because there are fewer points in each dataset and hence more uncertainty. The same feature set is also feasible for the regression problem, where the linear regression model achieves the best performance, with a median absolute error of 0.3 rating points.
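The median absolute error used as the regression metric is the median of the absolute differences between true and predicted ratings. A minimal sketch, with hypothetical values chosen so that the result is 0.3 rating points:

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of |true - predicted| over all points."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical true and predicted ratings, invented for illustration.
true_ratings = [4.0, 3.5, 2.0]
predicted_ratings = [4.3, 3.4, 2.5]
error = median_absolute_error(true_ratings, predicted_ratings)
```

Unlike the mean absolute error, this metric is robust to a few badly predicted points.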

Text analysis

The subsequent analysis concerns the descriptions and comments of locations in the dataset. Here we make a first step in text analysis, aiming to identify the relationships between keywords and the attributes they represent. For this analysis, we again need to use a subset from one country, because verbal descriptions of roads may differ between countries, so we discuss the results for the Netherlands. For example, in this country roads starting with A (like A5, A19 etc.) correspond to the largest motorways, while N-roads are general roads connecting towns. For each road type, we calculate how many points corresponding to this road type (i.e. points whose closest road is of this type in the hierarchy) include a particular keyword. For each keyword we thus obtain a feature vector of the frequency of its usage for each road type. For example, if the keyword “Gas station” is used in 50% of motorway points and 5% of slip road points, its feature vector is [0.5, 0.05], assuming there are only 2 road types. The correlation matrix is then computed to estimate the similarity between each pair of keywords, which reveals the relationships between them. The higher the correlation coefficient, the more similar the terms are.
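The keyword feature vectors and their pairwise correlation can be sketched as follows. Keyword matching by case-insensitive substring search is an assumption, since the text does not say how keywords are detected; the function names and the example comments are hypothetical.

```python
from math import sqrt


def keyword_vectors(points, keywords, road_types):
    """Share of points of each road type whose comment contains each keyword.

    points: list of (road_type, comment_text).
    Assumption: a keyword counts as present if it occurs as a
    case-insensitive substring of the comment.
    """
    vectors = {}
    for keyword in keywords:
        vector = []
        for road_type in road_types:
            comments = [text.lower() for rt, text in points if rt == road_type]
            hits = sum(keyword.lower() in comment for comment in comments)
            vector.append(hits / len(comments) if comments else 0.0)
        vectors[keyword] = vector
    return vectors


def pearson(xs, ys):
    """Pearson correlation of two equal-length vectors (None if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical comments, invented for illustration.
example_points = [
    ("motorway", "Big gas station right before the A5"),
    ("motorway", "Good shoulder, cars go fast"),
    ("slip", "Stand at the start of the ramp"),
]
vectors = keyword_vectors(example_points, ["gas station"], ["motorway", "slip"])
```

With only two road types every pair of non-constant vectors has a correlation of exactly +1 or −1, so a meaningful correlation matrix needs the full set of road types.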

Pearson correlation between keywords is given in Table 6.

Table 6 Correlation between different words according to types of roads they describe

This analysis confirms that synonyms (such as gas/petrol station, ramp/slip) have very high similarity, and also provides some valuable insights. For example, bus stops are highly associated with N-type roads, while gas stations are more common on motorways. However, as mentioned before, keyword usage is mostly specific to each country, with its own keywords (names of roads etc.). Another complication is that in some countries comments are not written in English. In a future system, the information extracted from comments may be included in the recommender.