User segmentation is one of the most important parts of our work. How do we do it? We and our data providers assign every user a unique cookie ID.

This ID may look like this:

42bcfae8-2ecc-438f-9e0b-841575de7479

These IDs serve as keys in various tables, but the primary value is the URL of the page where the cookie was recorded, plus search queries and sometimes extra information supplied by the provider: IP address, timestamp, information about the client, and so on. This data is quite heterogeneous, which makes the URL the most valuable signal for segmentation. When our analysts create a segment, they specify a list of addresses; if a cookie ID shows up on one of those pages, the user lands in the corresponding segment. Up to 90% of analysts' working time goes into compiling a suitable set of URLs: painstaking work with search engines, Yandex Wordstat, and other tools.

As a result, when we passed a thousand segments, we realized the process needed to be automated and simplified, while still letting us monitor the algorithm's quality and giving analysts a usable interface for working with the new tool. I will now explain how we solved these problems.

As you can gather from the introduction, we segment web pages, not people. Once that work is done, our analytic engine assigns users to the corresponding segments automatically.

A few words about how segments are represented in our DMP. The main feature of the segment set is its hierarchical, tree-like structure. We impose no restrictions on the depth of the hierarchy, because each further level gives a more precise portrait of the user's interests. Here are some example branches of the hierarchy:

If a user visited a web page explaining how to feed puppies or how to litter-train a kitten, they likely have a pet, and it makes sense to show them relevant ads: veterinary clinics or a new line of pet food. And if some time earlier that user was choosing premium-brand clothes in an online shop, we can advertise more expensive services, such as a cat psychologist or a dog groomer.

In essence, we had to create a service that takes a URL as input and returns a list of matching topics as output, given a convenient taxonomy of web page topics. We treat topic identification as a multi-class classification problem under the one-vs-all scheme: a separate classifier is trained for every taxonomy node. The classifiers traverse the tree recursively, starting from the root and descending only into the branches classified as relevant at the current level.
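To make the descent concrete, here is a minimal sketch of the traversal, assuming one binary classifier per node; the Node class, the threshold value, and the feature handling are illustrative stand-ins, not our production code:

THRESHOLD = 0.5  # illustrative; in production the value comes from configuration

class Node:
    def __init__(self, topic, clf, children=None):
        self.topic = topic           # taxonomy topic name
        self.clf = clf               # trained one-vs-all classifier for this node
        self.children = children or []

def classify(node, features, matched):
    # Probability that the page belongs to this node's topic
    # (column 1 of predict_proba is the positive class of a binary classifier).
    proba = node.clf.predict_proba([features])[0][1]
    if proba < THRESHOLD:
        return  # prune: do not descend into this branch
    matched.append(node.topic)
    for child in node.children:
        classify(child, features, matched)

In reality the feature vector is rebuilt per level (see the vectorization step below), but the pruning logic is the same.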

Classifier setup

The classifier frontend is a Flask application that holds a classifier object in memory. It takes care of data preparation, deserializes the trained sklearn.ensemble.RandomForestClassifier objects stored in MongoDB, calls their predict_proba() method, and processes the results according to the existing taxonomy. The taxonomy, the search requests, and the test samples are also stored in MongoDB.
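We won't show the exact storage schema here; the snippet below is a sketch of the general approach, assuming each classifier is pickled into a classifiers collection keyed by its topic (the database, collection, and field names are illustrative):

import pickle

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["dmp"]  # illustrative database name

def load_classifier(topic):
    # Each document is assumed to hold a pickled
    # sklearn.ensemble.RandomForestClassifier in a binary "model" field.
    doc = db["classifiers"].find_one({"topic": topic})
    return pickle.loads(doc["model"])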

The application listens for POST requests at the following endpoints:

• localhost/text/

• localhost/url/

• localhost/tokens/

import json

from flask import Flask, Response, request

import config                                          # holds server_port
from recursive_classifier import RecursiveClassifier   # module layout assumed
from utils import html_get, clean_html                 # page download / HTML stripping, assumed

classifier = RecursiveClassifier()
app = Flask(__name__)

@app.route("/text/", methods=["POST"])
def get_text_topics():
    data = json.loads(request.get_data().decode())
    text = data["text"]
    return Response(json.dumps(classifier.get_text_topics(text), indent=4),
                    mimetype="application/json")

@app.route("/url/", methods=["POST"])
def get_url_topics():
    data = json.loads(request.get_data().decode())
    url = data["url"]
    html = html_get(url)      # download the page body
    text = clean_html(html)   # strip markup, keep plain text
    return Response(json.dumps(classifier.get_text_topics(text, url), indent=4),
                    mimetype="application/json")

@app.route("/tokens/", methods=["POST"])
def get_tokens_topics():
    data = json.loads(request.get_data().decode())
    return Response(json.dumps(classifier.get_tokens_topics(data), indent=4),
                    mimetype="application/json")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=config.server_port)

For example, when a URL arrives, the application downloads its body, extracts the page text, and starts a recursive traversal of the taxonomy from the root towards the branches. Recursion descends only into those tree nodes whose predicted probability of matching the page at the current level exceeds the threshold specified in the configuration.
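A request to the service might look like this (the host and port are illustrative):

import json
import requests  # third-party HTTP client, assumed to be available

resp = requests.post("http://localhost:5000/url/",
                     data=json.dumps({"url": "http://example.com/"}))
print(resp.json())  # the list of matching taxonomy topics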

Data preparation includes tokenizing each text, computing frequency characteristics, and converting the features for every classifier according to the token weights chosen at the feature selection stage (more on that later). We use the bag-of-words model, which ignores the relative positions of words in the text.
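As a toy illustration of the bag-of-words step (our real tokenizer also does normalization, which is omitted here):

from collections import Counter

def tokenize(text):
    # Naive tokenizer for illustration: lowercase and split on whitespace.
    return text.lower().split()

bag = Counter(tokenize("how to feed puppies and what to feed them"))
# Counter({'to': 2, 'feed': 2, 'how': 1, ...}); word order is lost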

Classifier training

The backend handles training. When the taxonomy or a node's list of search requests changes, the texts of the new pages are downloaded and tokenized, and the training algorithm is launched for all topics on the same level as the changed classifier. All of a classifier's siblings are retrained along with it, because the whole level shares a single training set: the texts of the web pages from the top 50 Bing search results for the requests of all sibling nodes and their children. The positive examples for each topic are the pages matching its own requests and those of its children; all other pages on the level are negative examples. The result is stored in a pandas.DataFrame.
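Schematically, the level's training set might be assembled like this (the column names and topics are illustrative; downloading and tokenization are left out):

import pandas as pd

# One row per downloaded page: the topic whose requests surfaced it, plus its tokens.
rows = [
    {"topic": "pets/dogs", "tokens": ["feed", "puppies", "food"]},
    {"topic": "pets/cats", "tokens": ["kitten", "litter", "box"]},
]
dataset = pd.DataFrame(rows)
# For the one-vs-all classifier of topic T, the rows whose topic is T or a child
# of T are positive examples; all other rows on the level are negative.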

After that, the resulting token sets are randomly split into a training set (70%), a feature selection set (15%), and a testing set (15%), which are stored in MongoDB.
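A straightforward way to reproduce such a split (the shuffle and seed are an assumption):

import random

def split_dataset(rows, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    return (rows[:int(n * 0.70)],                # training set
            rows[int(n * 0.70):int(n * 0.85)],   # feature selection set
            rows[int(n * 0.85):])                # testing set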

Feature selection

The most informative tokens are selected during training with the help of the dg metric; here is how it works:

import math
import scipy  # scipy.average as in the original; equivalent to numpy's average

def dg(arr):
    # Coefficient of variation: standard deviation divided by the mean.
    avg = scipy.average(arr)
    summ = 0.0
    for s in arr:
        summ += (s - avg) ** 2
    summ /= len(arr)
    return math.sqrt(summ) / avg

And that’s how it’s called for token sets:

from collections import Counter, defaultdict

token_cnt = Counter()
topic_cnt = Counter()
topic_token_cnt = defaultdict(lambda: Counter())

# Document frequencies: per token overall and per (topic, token) pair.
for row in dataset.index:
    topic = dataset["topic"][row]
    topic_cnt[topic] += 1
    for token in set(dataset["tokens"][row]):
        token_cnt[token] += 1
        topic_token_cnt[topic][token] += 1

topics = list(topic_cnt.keys())
token_distr = {}

# For every token, its occurrence rate within each topic's documents.
for token in token_cnt:
    distr = []
    for topic in topics:
        distr.append(topic_token_cnt[topic][token] / topic_cnt[topic])
    token_distr[token] = distr

token_dg = {}

# Final weight: unevenness of the per-topic distribution, scaled by the log
# of the token's overall document frequency.
for token in token_distr:
    token_dg[token] = dg(token_distr[token]) * math.log(token_cnt[token])
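The highest-weighted tokens can then be kept as candidate features, for example (the cutoff is illustrative):

# Keep the 1000 tokens with the largest dg weights as feature candidates.
best_tokens = sorted(token_dg, key=token_dg.get, reverse=True)[:1000]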

So a word's importance is estimated over all the texts in the training set. Since we rely on word-frequency characteristics, it is interesting to check the data against Zipf's law. Here it is (the green line is a linear interpolation of the data):

Then, to vectorize a text being classified, the weight obtained above is multiplied by the word's frequency in the current text, and for every topic on the level the 5 words with the highest resulting values are selected (the same is done during training). These per-topic vectors are concatenated, giving one vector of length 5*m, where m is the number of nodes on the level. Now the data is ready for classification.
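A sketch of this vectorization, assuming per-topic token weights produced at the feature selection stage (the helper names and zero-padding choice are illustrative):

from collections import Counter

TOP_N = 5  # words kept per topic on the level

def topic_block(tokens, weights):
    # Score each word by its weight times its frequency in this text,
    # then keep the TOP_N highest scores, zero-padded to a fixed length.
    freq = Counter(tokens)
    scores = sorted((weights.get(t, 0.0) * n for t, n in freq.items()), reverse=True)
    scores = scores[:TOP_N]
    return scores + [0.0] * (TOP_N - len(scores))

def vectorize(tokens, level_weights):
    # level_weights: topic -> {token: dg weight}. The per-topic blocks are
    # concatenated into one vector of length TOP_N * m for m nodes on the level.
    vector = []
    for topic in sorted(level_weights):
        vector.extend(topic_block(tokens, level_weights[topic]))
    return vector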

Classifier quality evaluation

We would like a single value that evaluates the work of the classifier as a whole. It's easy to calculate precision, recall, and F-measure for every taxonomy node, but with hundreds of classes that becomes pointless. Since the classifier is hierarchical, the quality of the individual classifiers at the lower levels depends on the quality of the levels above; this is the main feature of our algorithm. Precision and recall are calculated with the usual formulas:

Precision = TP / (TP + FP), Recall = TP / (TP + FN),

where TP is the number of true positive results, FP the number of false positives, and FN the number of false negatives.

The F-measure is the harmonic mean of precision and recall; the parameter β sets how much each of them contributes to the result:

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

When β > 1 the metric is skewed towards recall; when 0 < β < 1, towards precision. We choose this parameter according to the percentage of positive examples in the testing set for all metrics on the level: the more vectors a classifier mistakenly passes further down the tree, the more chances there are of an error at the next level, and so on.
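For reference, a direct implementation of these formulas (the zero-division guards are ours):

def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of correct positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives found
    if precision + recall == 0:
        return precision, recall, 0.0
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return precision, recall, f_beta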

The next step is to calculate an average F-measure for every branch, that is, for all nodes attached to the same first-level parent. Since the F-measure of every node is calculated on the testing set, it's enough to take a simple average over all the branch's classifiers, without any additional weighting.

The overall metric for the whole classifier is calculated as a weighted average over branches, where a branch's weight is the percentage of the set's positive examples that fall into it. A simple average would not do, because the number of nodes and search requests can differ greatly from branch to branch. With this method we can boast an F-measure of about 0.8 for the whole classifier. Importantly, during testing we remove the words that occur in the search requests from the token list, to avoid a feedback loop between how the training data was collected and how the classifier is evaluated.
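Schematically (the branch metrics and weights below are made up for illustration):

# branch -> (branch F-measure, share of the set's positive examples in the branch)
branches = {
    "pets":   (0.85, 0.40),
    "autos":  (0.78, 0.35),
    "travel": (0.74, 0.25),
}

total_weight = sum(w for _, w in branches.values())
overall_f = sum(f * w for f, w in branches.values()) / total_weight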

To visualize test results we use Google OrgChart: it clearly shows the tree structure of the taxonomy, lets us put metric values into every node, and even set color indicators right in the leaves. Here is what one of the branches looks like:

The tester is implemented as a separate Flask application: on request it fetches precalculated metric values from MongoDB, calculates the missing ones, and draws the org chart. As a small bonus, it has its own simple interface that lets you enter a URL list or plain text into a text field and see the classification result.

DMP integration

Now that we have such a service, we need to put it to active use. Every day, a million of the most visited sites known to our DMP Facetz.DCA are selected and classified. The pages are marked with the IDs of the segments they were assigned to, and users who visited those pages during the last month are added to those segments. At the moment, classifying one page takes about 0.2–0.3 seconds (excluding host latency).

This approach automatically assigns thousands of URLs a day to segments, where analysts could manually handle fewer than a hundred pages. Analysts' work now comes down to picking suitable topics and search requests; the DMP does the rest, and it will even tell you how well the job was done.

Future plans

At first we just had to get a working classifier prototype, so we didn't worry about choosing optimal parameters for the estimators and other settings. Honestly, we didn't expect such a simple mathematical model to deliver such good quality. Of course, all the constants in the algorithms above can be changed flexibly; perhaps a separate article about optimal settings is still to come. Our next steps are:

• We will move our classifier to the non-blocking Tornado server so that it can be called asynchronously;

• In addition to the dg metric, we will consider different variations of tf-idf;

• We will try handling the words contained in page titles and meta tags separately;

• We will experiment with the numerous estimator settings, try replacing the random forest with an SVM, and increase the number of words chosen for classification.

Please feel free to comment if anything seemed odd to you, or if there is anything else you would like to hear about.