Self hosted search engine

Knowledge sharing is an important concern at work. For software development teams, a search engine can improve efficiency, especially for finding the history of things on a project. Many sources of information can be indexed for a software development team:

Software documentation (for example the documentation generated by Doxygen from source code)

Team wiki

Source Code (files but also commit logs)

Bug Tracker

Tasks tracker

Code reviews database

Test database and execution results

Business databases

…

Once everything is indexed, information can be found more easily, and it also allows building key performance indicators to monitor the project. All of this can be done very easily with open-source software, and I will show how in this page. I will use:

An Ubuntu 16.04 LTS server to run all the software

Docker.io for easy deployment

Elasticsearch as the search engine

Calaca, a nice user-interface front-end for Elasticsearch

Apache Nutch, a web crawler to feed Elasticsearch with our data

A word on versions

To work properly, the versions of the different software must match each other. Here I do not use the latest releases, but releases I know well, which are easy to install and maintain:

elasticsearch 1.5.2

nutch 1.11

Moving to elasticsearch 2.x with nutch 2.x is a significant step and requires more time to set up.

Elasticsearch

To install Elasticsearch, we will use Docker. Docker manages containers to run applications on a server. First, install Docker and add your user to the docker group, so you can use Docker without sudo:

```
sudo apt-get install docker.io
sudo addgroup myuser docker
```

You have to log out and log back in for the group membership to take effect. Then, pull the elasticsearch 1.5.2 container image from the official Docker Hub:

```
docker pull elasticsearch:1.5.2
```

The image will be downloaded; this can take some time depending on your bandwidth. The documentation of the image can be found here. After that, run the image and map its ports so it is accessible on your localhost:

```
docker run -p 9200:9200 -p 9300:9300 -d elasticsearch:1.5.2
```

After a few seconds, you can check that Elasticsearch is running by accessing http://localhost:9200:

```
{
  "status" : 200,
  "name" : "Devil Hunter Gabriel",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.5.2",
    "build_hash" : "62ff9868b4c8a0c45860bebb259e21980778ab1c",
    "build_timestamp" : "2015-04-27T09:21:06Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}
```

It is necessary to allow cross-origin requests (CORS) so that the front-end can access Elasticsearch properly. First, get a terminal in the container:

```
docker exec -it [container-id] bash
```

Then install vim inside the container to edit the Elasticsearch configuration file:

```
apt-get update
apt-get install vim
vim /usr/share/elasticsearch/config/elasticsearch.yml
```

Add these lines to /usr/share/elasticsearch/config/elasticsearch.yml to enable CORS:

```
http.cors.enabled : true
http.cors.allow-origin : "*"
http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type,Content-Length
```

Finally, exit the container and restart it:

```
exit
docker restart [container-id]
```
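To check the instance from a script rather than a browser, you can fetch http://localhost:9200 and parse the JSON it returns. This is a minimal sketch, shown here parsing the sample response above offline; the `parse_es_info` helper is hypothetical, not part of Elasticsearch:

```python
import json

def parse_es_info(body: str) -> dict:
    """Extract the fields we care about from the Elasticsearch root endpoint response."""
    info = json.loads(body)
    return {
        "status": info["status"],
        "version": info["version"]["number"],
        "lucene": info["version"]["lucene_version"],
    }

# Sample response body from http://localhost:9200 (elasticsearch 1.5.2)
SAMPLE = '''{"status": 200, "name": "Devil Hunter Gabriel",
"cluster_name": "elasticsearch",
"version": {"number": "1.5.2",
"build_hash": "62ff9868b4c8a0c45860bebb259e21980778ab1c",
"build_timestamp": "2015-04-27T09:21:06Z", "build_snapshot": false,
"lucene_version": "4.10.4"}, "tagline": "You Know, for Search"}'''

print(parse_es_info(SAMPLE))
```

In practice you would replace `SAMPLE` with the body returned by an HTTP GET to the running container.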

Some useful Docker commands

List containers and get their container IDs:

```
docker ps -a
```

Kill a running container using its container ID:

```
docker kill [container-id]
```

Restart a killed container using its container ID:

```
docker restart [container-id]
```

Delete a container using its container ID:

```
docker rm [container-id]
```

Get a shell inside a running container using its container ID:

```
docker exec -it [container-id] bash
```

Calaca

Download and unzip Calaca:

```
wget https://github.com/romansanchez/Calaca/archive/master.zip
unzip master.zip
```

Then configure Calaca for your Elasticsearch instance. This is done in the file calaca/_site/js/config.js:

```
var CALACA_CONFIGS = {
    url: "http://localhost:9200",
    index_name: "nutch",
    type: "doc",
    size: 30,
    search_delay: 500
}
```

Here index_name and type are the default values for Nutch, which we will use later. Copy the content of the calaca/_site directory to your Apache root directory (e.g. /var/www/html). You may also need to customize Calaca's index.html to support the document fields produced by Nutch (title, url and content). To do that, modify the article element in the section with class "results" (reconstructed here from the original snippet):

```
...
<article class='result' ng-repeat='result in results track by $id(result)'>
    <h2><a href="{{result.url}}">{{result.title}}</a></h2>
    <p>{{result.content}}</p>
</article>
...
```

At this step, you have a working self-hosted search engine at http://localhost/. It is now time to feed it with data.
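Under the hood, a front-end like Calaca simply POSTs a search body to the index's _search endpoint. This is a hedged sketch of such a request body; the `build_search_request` helper and the exact query shape are assumptions for illustration, not Calaca's actual code:

```python
import json

def build_search_request(term: str, size: int = 30) -> dict:
    """Build an Elasticsearch 1.x search body, similar in spirit to what a
    front-end like Calaca sends to http://localhost:9200/nutch/doc/_search."""
    return {
        "query": {"query_string": {"query": term}},
        "size": size,
        # Fields produced by Nutch's basic indexing plugins
        "fields": ["title", "url", "content"],
    }

print(json.dumps(build_search_request("wiki"), indent=2))
```

Sending this body to the endpoint above (e.g. with curl) returns the matching crawled pages.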

Website crawling: nutch

Apache Nutch can be used to crawl websites and feed the search engine for indexing, for example to index your project wiki. I will use Nutch 1, since Nutch 2 is more complicated and is not necessary here. There is a very good Nutch tutorial on the Apache wiki (https://wiki.apache.org/nutch/NutchTutorial), so here I will go straight to the solution and assume that you have successfully installed Nutch 1.

Once Nutch is installed, configure it to use our Elasticsearch server. Edit the file conf/nutch-site.xml and add the Elasticsearch indexer properties. Also activate indexer-elastic in the plugin.includes property:

```
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch sgripon.net</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin.
    By default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.</description>
  </property>
  <!-- elasticsearch index properties -->
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
    <description>The hostname to send documents to using TransportClient.
    Either host and port must be defined or cluster.</description>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
    <description>The port to connect to using TransportClient.</description>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
    <description>The cluster name to discover.
    Either host and port must be defined or cluster.</description>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
    <description>The name of the elasticsearch index. Will normally be
    autocreated if it doesn't exist.</description>
  </property>
  <property>
    <name>elastic.max.bulk.docs</name>
    <value>250</value>
    <description>The number of docs in the batch that will trigger a flush
    to elasticsearch.</description>
  </property>
  <property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
    <description>The total length of all indexed text in a batch that will
    trigger a flush to elasticsearch, by checking after every document for
    excess of this amount.</description>
  </property>
</configuration>
```

Then configure the seed list and URL regex filters as described in the official Nutch tutorial. Finally, launch the crawl:

```
bin/crawl -i urls/ TestCrawl/ 5
```

Here 5 stands for "5 passes", meaning that crawling will go 5 link levels deep from the seed URLs. Note that it is also possible to tell the crawl command which index to store documents in. This can be useful if you crawl several websites and want to store each one in its own index:

```
bin/crawl -i -D elastic.index=newindex urls/ TestCrawl/ 5
```

That's it, you now have your own self-hosted search engine! Just add the crawl command to a cron job to regularly refresh the pages in the index.
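Before launching a crawl, it can be handy to double-check that the elastic.* properties in nutch-site.xml say what you think they say. This is a small sketch that parses Hadoop-style property files; the `read_props` helper and the embedded snippet are illustrative, not part of Nutch:

```python
import xml.etree.ElementTree as ET

def read_props(xml_text: str) -> dict:
    """Parse a Hadoop/Nutch-style <configuration> file into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}

# A trimmed-down nutch-site.xml fragment with the indexer settings
SNIPPET = """<configuration>
  <property><name>elastic.host</name><value>localhost</value></property>
  <property><name>elastic.port</name><value>9300</value></property>
  <property><name>elastic.index</name><value>nutch</value></property>
</configuration>"""

props = read_props(SNIPPET)
print(props)
```

In practice you would read conf/nutch-site.xml from disk and verify that elastic.port matches the transport port (9300) mapped when starting the container, a mismatch being a common cause of silent indexing failures.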

Next Steps

Use kibana to build KPI on the data indexed into elasticsearch

Index databases (for example mysql)

Index data from REST web APIs (for example Redmine issues)

…