I have always been passionate about Open Source Intelligence (OSINT). For me it is a jigsaw puzzle: you retrieve pieces of information from multiple sources and place each one correctly to reveal the big picture.

I am currently working on an OSINT project that will be released soon. Since the project requires a lot of up-front research, I have been doing the groundwork to add as many features as I can.

Recently I came across an IEEE paper, Foraging Online Social Networks, published at the 2014 IEEE Joint Intelligence and Security Informatics Conference. It is fairly old given the rapid growth of OSINT techniques, but I found some content worth sharing.

DISCLAIMER: I am not republishing the contents of the paper. I am explaining what I have understood.

Description

The paper gives an impressive high-level view of the architecture of an ideal OSINT tool. However, it is restricted to online social websites like Facebook, Twitter and Instagram as sources of information.

According to the paper, the art of social media analytics can be achieved in three phases (sketched in code after the list):

Capturing (initial phase which includes retrieving raw data)

Understanding (processing the data)

Presenting (visually showing the gathered information)
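
To make the three phases concrete, here is a toy end-to-end sketch in Python. The function names and the canned records are my own illustration, not something the paper prescribes.

```python
# Toy sketch of the three-phase flow: Capturing -> Understanding -> Presenting.
# The canned records stand in for what a real crawler would fetch from an OSN.

def capture(query):
    # Phase 1 (Capturing): retrieve raw data for the query.
    return [
        {"source": "twitter", "username": query, "bio": "infosec, OSINT"},
        {"source": "facebook", "username": query, "bio": "loves hiking"},
    ]

def understand(raw_records):
    # Phase 2 (Understanding): process the raw data, e.g. group records
    # that appear to refer to the same subject.
    merged = {}
    for record in raw_records:
        merged.setdefault(record["username"], []).append(record)
    return merged

def present(profiles):
    # Phase 3 (Presenting): show the gathered information; a real tool
    # would render an aggregated profile page or charts.
    for username, records in profiles.items():
        print(username)
        for record in records:
            print(f"  [{record['source']}] {record['bio']}")

present(understand(capture("jdoe")))
```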

The paper also gives a brief introduction to four models proposed for OSINT investigations:

Pouchard, Dobson and Trien model — uses data from the Internet and the DNI Open Source Center

Crawley and Wagner model — utilizes entity guessing, regular expressions and machine learning

Baldini, Neri and Pettoni model — this approach is based on Natural Language Processing (NLP)

Colombini and Colella model — this model assesses whether different mass media devices belong to the same person

Focusing specifically on an OSINT tool that retrieves online data from any Online Social Network (OSN), the retrieval process should contain the following phases (the aggregation step is sketched after the list):

OSN search, which results in a profile overview

Profile selection and full profile overviews

Selection of relevant attributes, which results in the aggregated profile
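
To make the last step concrete, here is a minimal sketch of attribute selection and aggregation, assuming simple dict-based profiles; the paper does not prescribe a data model, so the attribute names here are hypothetical.

```python
# Minimal sketch of the aggregation step: keep only the attributes the
# analyst marked as relevant and merge them across networks. Collecting
# values in sets keeps conflicting claims visible instead of silently
# overwriting them.

def aggregate(profiles, relevant_attributes):
    aggregated = {}
    for profile in profiles:
        for attr in relevant_attributes:
            if attr in profile:
                aggregated.setdefault(attr, set()).add(profile[attr])
    return aggregated

twitter_profile = {"name": "J. Doe", "location": "Berlin", "handle": "@jdoe"}
facebook_profile = {"name": "Jane Doe", "location": "Berlin", "employer": "ACME"}

print(aggregate([twitter_profile, facebook_profile], ["name", "location"]))
# e.g. {'name': {'J. Doe', 'Jane Doe'}, 'location': {'Berlin'}}
```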

The paper concludes with an explanation of the architecture and design of a sample semi-automated application named Profiler. The tool is not open source, but it would not be hard to build a similar one.

Without further ado, let's jump into the architecture and design of the tool.

Architecture

Profiler is a single-page web application built with Django, with SQLite as the backend database, and it follows the Model View Controller (MVC) architectural pattern.

Building it as a web app provides cross-platform compatibility and multi-user support with little effort. Profiler relies on Python libraries such as BeautifulSoup and urllib2 for its internal tasks, uses OAuth 2.0 to authenticate against the social networks, and uses AJAX to keep the UI responsive.

[Figure: Profiler’s general architecture]
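
To illustrate the fetch-and-parse building blocks mentioned above, here is a minimal sketch. Note that urllib2 exists only on Python 2, so the sketch uses its Python 3 successor, urllib.request; the URL and User-Agent string are placeholders, not values from the paper.

```python
# Minimal fetch-and-parse sketch in the spirit of Profiler's stack.
# Requires: pip install beautifulsoup4

import urllib.request

from bs4 import BeautifulSoup

def fetch_page(url):
    # A polite crawler would also honor robots.txt and rate limits.
    request = urllib.request.Request(url, headers={"User-Agent": "profiler-sketch"})
    with urllib.request.urlopen(request) as response:
        return response.read()

def extract_links(html):
    # BeautifulSoup turns raw HTML into a navigable tree; pulling every
    # hyperlink gives later modules a pool of candidate profiles.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

print(extract_links(fetch_page("https://example.com/")))
```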

Processing Pipeline

To accomplish the first phase, Capturing, the tool makes use of a built-in crawler. The crawler is responsible for search, extraction, loading and transformation for each OSN.

The processing pipeline contains the following modules (a sketch of the whole chain follows the list):

search — queries the OSN with the initial data (name, username or email)

pre-filter — filters hyperlinks and user IDs from the results of the search module, removes duplicates and passes the unique ones on to a stack

profile crawler — parses each hyperlink or user ID and stores it

data extractor — extracts the data from the pages parsed by the previous module

relevance calculator — calculates a relevance ratio (0–1) for each profile, to make sure the final profile we look at belongs to the target we are interested in
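
Here is a hedged sketch of how the five modules could chain together. The search results are canned instead of coming from a real OSN, and the relevance heuristic (the fraction of known attributes that match) is my own assumption; the paper does not spell out a scoring formula.

```python
# Sketch of the five-module pipeline: search -> pre-filter ->
# profile crawler -> data extractor -> relevance calculator.

def search(osn, initial_data):
    # 1. search: query the OSN with a name, username or email.
    # Stubbed results; a real module would call the OSN's search API.
    return [
        {"id": "u1", "name": initial_data, "location": "Berlin"},
        {"id": "u2", "name": initial_data, "location": "Oslo"},
        {"id": "u1", "name": initial_data, "location": "Berlin"},  # duplicate
    ]

def pre_filter(results):
    # 2. pre-filter: drop duplicate IDs, keep unique candidates on a stack.
    stack, seen = [], set()
    for result in results:
        if result["id"] not in seen:
            seen.add(result["id"])
            stack.append(result)
    return stack

def profile_crawler(stack):
    # 3. profile crawler: visit and store each candidate profile; here
    # "crawling" just yields the stored records one by one.
    while stack:
        yield stack.pop()

def data_extractor(profile_page):
    # 4. data extractor: pull structured attributes out of the raw page.
    return {k: profile_page[k] for k in ("name", "location") if k in profile_page}

def relevance(extracted, known_facts):
    # 5. relevance calculator: ratio in [0, 1] of known facts that match.
    matches = sum(1 for k, v in known_facts.items() if extracted.get(k) == v)
    return matches / len(known_facts)

known = {"name": "jdoe", "location": "Berlin"}
for page in profile_crawler(pre_filter(search("twitter", "jdoe"))):
    data = data_extractor(page)
    print(data["name"], data["location"], relevance(data, known))
```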

[Figure: Web crawler-specific architecture]

Inference

There are lots of OSINT tools being developed on GitHub. While some, like DataSploit and SpiderFoot, are well structured (both in terms of programming and results parsing), many others are created to accomplish only a minimal OSINT task. Following the above architecture would make such tools better.

IMHO, applying the above structure to OSINT tools will:

modularize the tasks to be performed

ease the process of development

increase flexibility in terms of development

You can get the IEEE paper discussed above at: http://ieeexplore.ieee.org/document/6975600/

Call To Action

If you liked this article, click 👏 👏 👏 and share so that other people will see it here on Medium.

Want to get regular updates on the different security tools developed on GitHub?

Follow ‘Hack with GitHub’ on Twitter.