The other day I needed to find a website. The only thing I could remember was that Vishesh gave me the link in IRC a few days back. So I had to grep through thousands of lines of IRC log which, quite frankly, sucks. Nepomuk should handle this. So what do we have to do to achieve that? Three things of which I will present the second thing first:

Extract web links from text in Nepomuk.

Why that? Well, to properly handle web links they need to be represented as Nepomuk resources instead of just being plain text excerpts in some text literal. Only then can we relate them to things, search them by type, and order them by access count, time, or whatever.

Let’s go then. First we query for all resources that might mention a web link in their text content (we restrict ourselves to nie:plainTextContent since that covers all files, emails, and so on):

ComparisonTerm linkTerm(NIE::plainTextContent(), LiteralTerm(QLatin1String("http")));

We look for all resources that contain the string ‘http’ in their plain text content. We then force a variable name for the matched property to be able to access it in the results:

linkTerm.setVariableName(QLatin1String("text"));

We additionally exclude HTML and SVG files to avoid having too many useless links:

Term htmlExcludeTerm = !ComparisonTerm(NIE::mimeType(), LiteralTerm(QLatin1String("text/html")), ComparisonTerm::Equal);
Term svgExcludeTerm = !ComparisonTerm(NIE::mimeType(), LiteralTerm(QLatin1String("image/svg+xml")), ComparisonTerm::Equal);
Query query(linkTerm && htmlExcludeTerm && svgExcludeTerm);

Finally we request that Nepomuk return two additional properties. We will see later on why we need them:

query.addRequestProperty(Query::RequestProperty(NIE::lastModified()));
query.addRequestProperty(Query::RequestProperty(NAO::created()));

And now all we have to do is to run this query via QueryServiceClient and connect to its newEntries signal to handle each result. In that slot we iterate over all new results and see if there are really useful links in there. For that we need a little QRegExp magic which is fairly unrelated to Nepomuk but interesting nonetheless:
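The wiring itself is a few lines of standard Qt signal/slot code. A minimal sketch, assuming we are inside a QObject subclass and a slot named slotNewEntries (a hypothetical name) that receives the results:

```cpp
// Hedged sketch: assumes the Nepomuk query client from kdelibs 4.x;
// slotNewEntries is a hypothetical slot name on the surrounding QObject.
Nepomuk::Query::QueryServiceClient* client = new Nepomuk::Query::QueryServiceClient(this);
connect(client, SIGNAL(newEntries(QList<Nepomuk::Query::Result>)),
        this, SLOT(slotNewEntries(QList<Nepomuk::Query::Result>)));
client->query(query);
```

The service client keeps the query open, so newEntries will also fire for resources indexed after the initial run.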

QRegExp rx(QLatin1String("\\b(https?://[\\-a-z0-9+&@#/%?=~_\\|!:,.;]*[\\-a-z0-9+&@#/%=~_\\|])"));
rx.setCaseSensitivity(Qt::CaseInsensitive);

We will use this regular expression without further comment and get back to our result. First we create a list to remember our website resources (we only do this to show later on how Nepomuk can handle lists):

QList<Nepomuk::Resource> websites;

We then iterate over all matches of the regular expression in the text:

const QString text = result.additionalBinding(QLatin1String("text")).toString();
int i = -1;
while((i = rx.indexIn(text, i+1)) >= 0) {
    const KUrl url = rx.cap(1);
    Nepomuk::Resource website(url);
    website.addType(NFO::Website());
    websites << website;
}

Finally we actually relate the newly created website resources to the original resource using nie:links which is exactly the property we need:

result.resource().addProperty(NIE::links(), websites);

This could already be it. But there is one minor detail which we did not handle yet: the request properties we added to the query. The issue is rather simple: we create these website resources at a time that differs from the time we actually encountered them. Thus, to be able to sort websites according to the time we last used them, we need to change the creation date of the resources. For web links that were found in file contents this is the file's mtime (the best date we have). For anything else we use the creation time of the resource (the perfect fit here would be the creation time of the property which contains the link, but that is for another day):

QDateTime creationDate;
if(result[NIE::lastModified()].isLiteral())
    creationDate = result[NIE::lastModified()].literal().toDateTime();
else if(result[NAO::created()].isLiteral())
    creationDate = result[NAO::created()].literal().toDateTime();
if(creationDate.isValid()) {
    Q_FOREACH(Nepomuk::Resource website, websites)
        website.setProperty(NAO::created(), creationDate);
}

Well, that’s it for today. Next time: great, now we have all these web sites but what do we do with them?