Thursday 28 March 2019 // Software, Internet, search engines

Sarchy (URL : sarchy.tech) is an intriguing faceted search engine (with RSS saved searches) based on the open source YaCy search engine and developed by Agnel Vishal (Twitter @agnelvishal), a developer from Chennai in the Tamil Nadu region of India. Sarchy was spotted by one of the best French monitoring specialists, Christophe Deschamps (Twitter @crid ; blog Outils Froids), and relayed by Serge Courrier (Twitter @secou) of RSS Circus, another French monitoring specialist.

Sarchy is based on the (rather old) YaCy open source search engine

Sarchy is not really a newcomer. It is based on the open source search engine YaCy, which is already 8 years old. YaCy is a distributed peer-to-peer search engine written by a team of German developers. The source code is hosted on GitHub. According to its web site, « you don’t need to install external databases or a web server, everything is already included ».

To be honest, I tested a YaCy implementation some years ago and I wasn’t impressed at the time. And Sarchy’s performance, especially the breadth of its index (for instance, it indexes the lemonde.fr domain quite slowly and poorly), doesn’t make it competitive in any way with Google or Bing. Nevertheless, *this* implementation of YaCy is very interesting.

According to Agnel Vishal :

Sarchy is a fork of YaCy. YaCy does not use a PageRank algorithm, but Sarchy does. Also, Vishal says he uses social media statistics as a ranking parameter

Sarchy’s index is part of the YaCy P2P network, but at the same time, Sarchy makes YaCy’s index accessible as a web app [1]

the total number of web pages in YaCy’s index is around 1.7 billion. Sarchy launched a week ago and has 2.43 million web pages

he plans to increase the crawl speed by 30 times within 2 to 3 weeks

he got 3,000 USD in Google Cloud credits thanks to YC Startup School. He hopes to earn revenue from advertisements and donations before the cloud credits run out. Let’s hope he will be able to obtain that or other financing in the near future.

As Serge Courrier points out, one can integrate RSS feeds. Also, there is a desktop version of YaCy.

And, as argued by YaCy’s lead developer and the Free Software Foundation Europe (FSFE), which supported the YaCy project, this peer-to-peer search engine doesn’t monitor your searches and doesn’t do targeted advertising [2].

Relevancy still an issue

I have just tested Sarchy with my favorite, French law oriented, test query — and some others.

The content indexed (limited compared to competitors) is of good quality in my experience. But in the legal field, at the very least, relevancy on Sarchy remains an issue. Sarchy, contrary to Google, does not seem able to guess a query’s context, nor even to know the query words’ synonyms (in other words, Sarchy doesn’t do the machine learning version of natural language processing).

I reckon that, for the time being, relevancy is hampered by the lack of indexed content. In the legal field, I would suggest better, relevancy-oriented indexing of official, government and public institution web sites (they have good-quality, though free, content and Sarchy already indexes them or at least knows their domains).

Agnel Vishal answered my remark : as soon as one searches for a page/site, the crawler automatically starts crawling related pages. To me, that’s a very good idea : it keeps the crawler from indexing unnecessary pages. But at the same time, there is an associated spamdexing risk. In turn, YaCy’s Twitter account explained that, to protect against spam, YaCy does link reloading to verify that the presented link actually contains the searched words.
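To make the link reloading idea concrete, here is a minimal sketch in Python of the kind of check described above: once a page has been re-fetched, keep the result only if the page text still contains every searched word. This is not YaCy's actual code. The function names (`tokenize`, `page_matches_query`, `filter_verified`) and the simple word-matching rule are assumptions made for illustration, and the page texts are passed in as plain strings so that no network access is needed.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of word tokens it contains."""
    return set(re.findall(r"\w+", text.lower()))


def page_matches_query(page_text, query):
    """Return True if every query word occurs in the reloaded page text.

    An empty query never matches. Hypothetical rule: exact word match,
    no stemming and no synonyms.
    """
    query_words = tokenize(query)
    return bool(query_words) and query_words <= tokenize(page_text)


def filter_verified(results, query):
    """Keep the URLs whose freshly fetched text still contains the query words.

    `results` is a list of (url, page_text) pairs, where page_text is the
    content obtained when the link was reloaded.
    """
    return [url for url, text in results if page_matches_query(text, query)]
```

A check of this kind costs one extra fetch per result, which is consistent with the slower response times one observes on such an engine.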

Of course, link reloading, content checking and a distributed architecture mean that response time is somewhat slow (4-5 seconds on an enterprise Internet connection). But I didn’t find it that annoying.

According to Vishal, in order to get faster results, the whole database is not scanned the first time a given search is done. Trying the same query again 30 seconds later may return more web pages.
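Vishal's advice can be automated. The sketch below, in Python, runs the same query twice with a pause in between and merges the two result lists without duplicates. It assumes a caller-supplied `search_fn` (a hypothetical callable that takes a query string and returns a list of result URLs). The 30-second default comes from the advice above, and nothing here is part of Sarchy or YaCy itself.

```python
import time


def search_with_retry(search_fn, query, delay_seconds=30, sleep=time.sleep):
    """Run the same query twice and merge the results.

    search_fn     -- hypothetical callable: query string -> list of result URLs
    delay_seconds -- pause between the two attempts (30 s, as suggested above)
    sleep         -- injectable so the pause can be skipped in tests
    """
    first = search_fn(query)
    sleep(delay_seconds)
    second = search_fn(query)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(list(first) + list(second)))
```

Results from the first pass keep their position, and anything found only on the second pass is appended after them.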

Also, since relevancy is still somewhat limited (according to my tests), it would be very useful to explain clearly somewhere on the home page what Sarchy’s operators are. The simple use of quotes (" ") on Sarchy is a big bonus to relevancy.

Judging from YaCy’s presentation of its self-hosted engine, it can be used as an alternative to Google CSE.

Search operators and filters

As in Google, one can use site:justice.gouv.fr to get results from that domain. For example : https://sarchy.tech/yacysearch.html?query=site%3Ajustice.gouv.fr&Enter=&contentdom=all&strictContentDom=false&former=justice.gouv.fr+site%3Ajustice.gouv.fr&maximumRecords=10&startRecord=0&verify=ifexist&resource=global&nav=all&prefermaskfilter=&depth=0&constraint=&meanCount=0&timezoneOffset=-330
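The example URL above is an ordinary query string, so such a search can be built programmatically. Here is a minimal sketch in Python using only the standard library. The endpoint and parameter names are copied from the example URL. The helper names (`site_query`, `build_search_url`) are hypothetical, and it is an assumption that this reduced set of parameters is enough for the engine to answer.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the example URL above.
SEARCH_ENDPOINT = "https://sarchy.tech/yacysearch.html"


def site_query(domain, terms=""):
    """Combine optional search terms with a site: operator for one domain."""
    return f"{terms} site:{domain}".strip()


def build_search_url(query, max_records=10, start_record=0):
    """Build a search URL; parameter names are copied from the example URL.

    Assumption: the parameters omitted here (former, depth, meanCount, ...)
    are optional.
    """
    params = {
        "query": query,
        "contentdom": "all",
        "maximumRecords": max_records,
        "startRecord": start_record,
        "verify": "ifexist",
        "resource": "global",
        "nav": "all",
    }
    # urlencode percent-encodes the colon of the site: operator as %3A.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(build_search_url(site_query("justice.gouv.fr")))
```

Raising `start_record` in steps of `max_records` would presumably page through the results, as the parameter names suggest, but that behaviour has not been verified here.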

Good to know : YaCy search operators are detailed on its wiki.

One of the main advantages of Sarchy over YaCy’s own portal is its facets (left column in the results page) : domains, year, language ... These suggestions on how to refine your search are practical and relevant. Also, Sarchy works, while YaCy Search does not right now.

Vishal says a list of search operators will be added to Sarchy’s home page within 24 hours. It will cover location, date, distance between words, etc.

Ahrefs is working on general purpose search engine to compete with Google. Sounds crazy, right?

But lets talk about two huge problems with Google which they will never want to fix: (Dmitry Gerasimenko, @botsbreeder, 27 March 2019)

What’s funny is that less than two weeks after Sarchy was spotted by Christophe Deschamps, Ahrefs [3] CEO Dmitry Gerasimenko tweeted that he wants to build a new search engine with the collaboration of publishers and other online content makers ... [4] Although most SEOs who answered his thread are skeptical, with the growing success of DuckDuckGo and, in our French and German lands, Qwant, it could be the sign of something serious. The business model he proposes, at least, makes sense.

Emmanuel Barthe

French law librarian and researcher, monitoring/CI specialist

search engine enthusiast (ex-Google de facto evangelist, ca. 1997, still a Google specialist for law research)

More info about YaCy and Sarchy’s implementation