This section presents the application of the framework described in the previous section to the SIMMO use case.

Identification of internet data sources

As indicated in the “Framework for the Selection of Data Sources” section, the first step of the framework is the identification of potential sources. In the SIMMO case, potential data sources related to maritime surveillance were identified using search engines, literature reviews and consultations with subject matter experts. The search engines encompassed conventional search engines (such as Google) as well as meta search engines such as Dogpile, Mamma, and Webcrawler. Apart from the search engines, other data sources were also analyzed, including sources indicated in (Kazemi et al. 2013) and those suggested by maritime practitioners. These other methods were used mainly to identify potential deep Web sources.

As a result, 59 different data sources available on the Web were found. The identified sources belonged to both the shallow Web (22%) and the deep Web (78%), and they provided information in structured, semi-structured and unstructured form. The list of identified internet data sources is presented in Table 2. From the point of view of data access, we divided them into four categories:

1. Open data sources (O) – websites that are freely available to internet users.
2. Open data sources with registration (OR) – websites that provide information only to authorized users.
3. Data sources with partially paid access (PPA) – websites that provide a wider scope of information after the payment of a fee.
4. Commercial (paid) data sources (PA) – websites with only paid access to the data (fee or subscription required).

From all the identified sources, we selected only open data sources (categories O and OR) for further analysis. At this stage, we eliminated commercial data sources and websites with paid access (categories PPA and PA). These sources were eliminated because they provide only very general, marketing-oriented information about the data they hold, and access to the data is granted only after paying a fee or signing a contract. Moreover, our attempts to contact these data providers in order to obtain sample data failed (requests for data access were sent but received no response). Furthermore, the project did not foresee buying access to maritime data. Eventually, only sources with public content were selected for the project. Nevertheless, we believe this content is sufficient to meet the users’ requirements, and it provides the advantages of open data presented in the “Open Data” section.

Similarly, two other data sources (IALA, SafeSeaNet) were rejected because access to the data required a lengthy application procedure with no guarantee that access would be granted. Due to the project’s limited duration, there was not enough time to apply for the data. However, should access be obtained, these sources can still be assessed according to the framework and included in the system in the future.

As a result of this initial selection, 43 sources were taken into account as potential sources for the SIMMO system and assessed by the experts.

Assessment of internet data sources

In order to select sources of the highest quality and best suited to the users’ requirements, the identified data sources were assessed using the six quality criteria presented in the previous section. Definitions of these criteria were adjusted to the specifications of the SIMMO project (see Table 1).

Table 1 Quality measures used to assess Internet data sources

The process of data source assessment was conducted using the Delphi method. In fact, Delphi was used three times: for weight assignment, source assessment, and threshold specification. In all cases the same group of six experts was involved. The experts were drawn from both inside and outside the project. They were experts either in the maritime domain or in the design and development of information systems (including maritime systems), with experience in data retrieval from various data sources (including structured and unstructured internet sources). In selecting the experts, we followed guidelines on how to conduct a rigorous Delphi study provided by (Hsu and Sandford 2007; Kobus and Westner 2016; Schmidt 1997).

First, the Delphi method was used to define the importance of the selected quality attributes (by assigning them weights) and thus prioritize the selection criteria. Here, a variant of Delphi called “ranking-type Delphi” was used, which allows a group consensus to be developed about the relative importance of issues (Kobus and Westner 2016; Schmidt 1997). This process consisted of three rounds, after which consensus was reached.

Then, Delphi was used to assess the identified data sources according to the defined quality criteria. In this case the process consisted of two rounds. At the beginning, each expert received the gathered basic information about each source, together with some statistics. Based on this information, as well as their knowledge and experience, the experts were asked to initially assess each potential data source by assigning a mark to each quality criterion on a four-level rating scale (high, medium, low, N/A) and to provide a short justification. Here a questionnaire with a list of sources was used (similar to that presented in Table 2). Then, the results were summarized by a facilitator, and in the second round the experts were asked to review the summary and revise their assessments. After this round a consensus began to form, and based on the revised judgments the final mark for each criterion was selected (by majority rule). The results of the quality assessment for each source are presented in Table 2.

Table 2 List of assessed Internet data sources

Final selection of sources

After the assessment, the final selection of sources took place. First, all sources whose Accessibility measure was marked as N/A were removed (12 sources from the O and OR categories, see Table 2). This elimination resulted from the reasons indicated before: restricted access to the data and the providers’ explicit prohibition of using information from these sources. The sources with Accessibility assessed as Low were also eliminated (5 sources, see Table 2). These are the sources providing unstructured information (e.g. text in natural language). We excluded them because, when defining the requirements for the system, it was decided to include only sources with structured or semi-structured information. The reason for this was the limited time frame of the project and the fact that automatic retrieval of unstructured information would require a significant amount of work on developing methods for Natural Language Processing.

The sources with the Relevance measure graded as Low were also eliminated (12 sources, see Table 2), since it would be pointless to retrieve data that are not well-suited to the requirements defined for the SIMMO system. For example, the SIMMO system focuses only on collecting and analyzing data about merchant vessels, and therefore some categories of sources were excluded (e.g. those covering fishing vessels or oil platforms).

In the next step, each quality mark was converted into a numerical value: High = 3, Medium = 2, Low = 1, N/A = 0. Then, a final quality grade was calculated according to the formula:

$$ X_s = \sum_{i=1}^{n} \frac{x_i}{3}\, w_i \cdot 100\%, $$

where $s$ is the index of the analyzed source, $n = 6$ is the number of quality measures, $x_i$ is the grade assigned by the experts to quality measure $i$, and $w_i$ is the measure’s weight. The grade is normalized to the range 0–100% (hence each assigned grade is divided by 3, the maximum value).
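For concreteness, a short calculation of this grade is sketched below in Python. The marks and weights are hypothetical, since the actual weights resulted from the Delphi study.

```python
# A worked example of the final quality grade X_s. The marks and weights
# below are illustrative only; the real weights were set in the Delphi study.
GRADE = {"high": 3, "medium": 2, "low": 1, "n/a": 0}

def quality_grade(marks, weights):
    """Weighted sum of normalized marks, expressed as a percentage (0-100%)."""
    return sum(GRADE[m] / 3 * w for m, w in zip(marks, weights)) * 100

# Hypothetical assessment of one source on the six quality measures
marks = ["high", "high", "medium", "high", "low", "high"]
weights = [0.25, 0.20, 0.15, 0.15, 0.10, 0.15]  # illustrative weights, sum to 1
print(f"X_s = {quality_grade(marks, weights):.1f}%")  # X_s = 88.3%
```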

Based on the calculated quality grades, a ranking of sources was created. Then, the experts were asked to decide on the threshold for the final selection of sources. After two rounds of Delphi, the threshold was set at 85%. From the ranking list, only sources with a final grade above the defined threshold were selected for use in the SIMMO system (shown in bold in Table 2).

To sum up, the application of the proposed framework for data source selection in the SIMMO use case allowed us to identify, assess and finally choose open internet data sources of the highest quality, which were then used by the SIMMO system.

Model of cooperation with data owners

In the next step, a model of cooperation with external data providers was defined. By external data providers we mean the providers of the sources selected for the SIMMO system. For each selected source a separate cooperation model was designed and described in the documentation. In defining the model, the following aspects were taken into account:

Scope of available information – what kind of information is available in a source.

Scope of retrieved information – which information pieces will be retrieved from the source.

Type of source – whether retrieved content is published in the shallow or deep Web, and in what form data are available, e.g. an internal database or separate XLS, PDF or CSV files.

Update frequency – how often information in a source is updated; whether the whole content is updated or only new information appears.

Politeness policy – the robot exclusion rules defined by the website administrators, for example which parts of the Web servers cannot be accessed by crawlers, as well as requirements on the time delay between consecutive requests sent to the server (a minimal politeness check is sketched after this list).

Re-visit approach – how often the SIMMO system will retrieve information from a given source, i.e. the intervals between consecutive downloads from the source, taking into account the politeness policy, if defined.
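To illustrate how such a politeness policy can be honoured in practice, the sketch below uses Python’s standard urllib.robotparser; the source URL and crawler name are hypothetical.

```python
# A minimal sketch of the politeness check described above, using Python's
# standard urllib.robotparser; the source URL and user-agent are hypothetical.
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "SIMMO-crawler"  # hypothetical crawler name

rp = RobotFileParser("https://example.org/robots.txt")
rp.read()  # fetch and parse the robot exclusion rules

url = "https://example.org/ships/9074729"
if rp.can_fetch(USER_AGENT, url):
    delay = rp.crawl_delay(USER_AGENT) or 1.0  # default delay if none is declared
    time.sleep(delay)  # respect the required delay between consecutive requests
    # ...send the actual request here
```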

Retrieval of data from internet sources

Finally, data from the selected sources were to be retrieved, merged and stored in the system for further analysis. The data were acquired automatically by the developed Data Acquisition Modules (DAMs). A DAM connects to a data source in a defined manner, sends appropriate requests, collects the returned documents and extracts the required data.

Each source may have a different structure and may publish data in different ways. If a DAM is to successfully acquire the data from a given source, a specific set of technical requirements must be met. Four general categories of data sources were identified in terms of such requirements: (a) shallow Web sources, (b) deep Web sources, (c) sources publishing data in XLS/CSV files, and (d) sources publishing data in PDF files.

Below we describe these categories in detail and discuss how data is retrieved in the SIMMO system.

Shallow Web sources publish their data in the form of web pages (HTML documents), which can be directly fetched using GET requests defined according to HTTP. As a result, the source sends back an HTML document with the data embedded in it. Such documents usually contain data concerning a single entity (e.g. a single ship) or a list of links to web pages that contain data on single entities. The data itself may be extracted from the document using regular expressions or XPath expressions. In order to monitor new or updated data published in the source, it is crucial to maintain a list of known URLs of documents published in this source and to manage a queue in which these URLs are to be visited.

For each shallow Web source used in SIMMO, a separate DAM was prepared, responsible for the actual retrieval and processing of data from that source. These modules share some common operations, such as queuing mechanisms, retrieval of HTML documents under a given URL, and writing the data to the database. Still, operations such as extraction of the data from the HTML document have to be implemented separately for each source, as a consequence of the differing structures of the HTML documents.
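As a rough illustration of these shared operations, the sketch below combines a URL queue, document retrieval and XPath-based extraction. It assumes the requests and lxml libraries; the entry URL and XPath expressions are hypothetical.

```python
# A minimal sketch of a shallow-Web DAM, assuming the `requests` and `lxml`
# libraries; the entry URL and XPath expressions are illustrative only.
from collections import deque
from urllib.parse import urljoin
import requests
from lxml import html

queue = deque(["https://example.org/ships/index.html"])  # hypothetical entry page
visited = set()  # known URLs, so documents are not fetched twice

while queue:
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)
    page = html.fromstring(requests.get(url, timeout=30).content)
    # Extract entity data with an XPath expression (differs for every source)
    name = page.xpath("string(//span[@class='ship-name'])")
    if name:
        print(url, name)  # in SIMMO, the record would be written to the database
    # Queue links to further detail pages found in the document
    for href in page.xpath("//a[contains(@href, '/ships/')]/@href"):
        queue.append(urljoin(url, href))
```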

Deep Web and AJAX data sources also publish their data in the form of HTML documents, but these documents are not directly accessible through static URL links. Instead, they are dynamically generated in response to queries submitted through the query interface to an underlying database. In order to fetch the data published in sources belonging to this category, DAMs need to perform many additional operations compared to shallow Web sources, such as posting filled forms or executing JavaScript code embedded in HTML documents.

This functionality was implemented with the Selenium WebDriver toolkit and the Mozilla Firefox web browser. The toolkit allows the automation of actions within web browsers, making it possible to automatically submit instructions to one of the supported browsers. In our case, the developed DAM (written in Python) opens a Mozilla Firefox browser window inside the X virtual framebuffer (Xvfb). The process of data acquisition using this pipeline is presented in Fig. 3.

Fig. 3 Pipeline of data acquisition from AJAX and deep Web data sources
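A rough code-level illustration of this pipeline is given below. It assumes Selenium, pyvirtualdisplay, Firefox and geckodriver are available; the query URL, form field and XPath are hypothetical.

```python
# A rough illustration of the deep-Web pipeline from Fig. 3, assuming Selenium,
# pyvirtualdisplay, Firefox and geckodriver are installed. The URL, form field
# and XPath are hypothetical.
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By

display = Display(visible=False, size=(1280, 1024))  # X virtual framebuffer
display.start()
try:
    driver = webdriver.Firefox()  # Firefox window opens inside the framebuffer
    driver.get("https://example.org/vessel-search")  # hypothetical query interface
    field = driver.find_element(By.NAME, "imo")      # hypothetical form field
    field.send_keys("9074729")
    field.submit()  # the source generates the result page dynamically
    driver.implicitly_wait(10)  # wait for JavaScript to render the results
    cells = driver.find_elements(By.XPATH, "//table[@id='results']//td")
    record = [cell.text for cell in cells]
    print(record)
    driver.quit()
finally:
    display.stop()
```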

Data sources with CSV and XLS files form the third category of data sources. CSV (Comma-Separated Values) files are regular text files used for the storage of tabular data, where each line contains a single record, and fields in the record are separated by a selected separator (usually a comma, hence the name of the format) and, where necessary, quoted. The CSV file format can be easily processed in any programming language. Another format very similar to CSV in terms of the processing pipeline is the XLS(X) file type, a format for the representation of spreadsheets.

In the data sources used in the SIMMO use case, CSV and XLS(X) files with the required data are published on a regular basis, e.g. once a week, under a certain URL.

Sometimes these files are additionally archived, e.g. in a ZIP file. To fetch the data from these sources, Python scripts were developed; they are executed regularly by Cron and monitor a given source to detect when a URL to a previously unseen CSV/XLS file appears on a web page. Once the file is downloaded (and unpacked if necessary), it can be programmatically read and its content processed sequentially, row by row, to obtain data about specific entities.
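A minimal sketch of such a monitoring script is given below, assuming the requests library; the URL is hypothetical, and in practice the set of already-processed files would be persisted between Cron runs.

```python
# A minimal sketch of the CSV acquisition step, assuming the `requests`
# library; the URL is hypothetical and the seen-file state would normally
# be persisted between Cron runs.
import csv
import io
import requests

SOURCE_URL = "https://example.org/reports/latest.csv"  # hypothetical

seen_urls = set()

def fetch_new_rows(url):
    if url in seen_urls:
        return []  # file already processed during an earlier run
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    seen_urls.add(url)
    # Read the tabular content row by row, mapping column names to values
    return list(csv.DictReader(io.StringIO(response.text)))

for row in fetch_new_rows(SOURCE_URL):
    print(row)  # in SIMMO, each row would be stored in the database
```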

Data sources with PDF files are web portals in which data can be accessed by downloading and displaying PDF (Portable Document Format) files available under a certain URL. While PDF has some advantages, this format is very difficult to process automatically, because it was designed to be read by humans. Processing PDF documents becomes even more difficult when the aim is to automatically extract data from a table embedded in the document.

The processing pipeline for fetching and processing PDF files is presented in Fig. 4. First, a PDF file is downloaded from the source to the local disk. Next, the file is converted to XML using the pdftohtml program, included in the Ubuntu Linux operating system. This program, when executed with the -xml option, produces an XML document containing text suitable for further processing. In the obtained XML document, each piece of text is contained in a separate element together with its coordinates on the document page (i.e. the number of units in relation to the page’s top-left corner). Based on those coordinates, a set of manually crafted rules has to be developed to recreate the original structure of the table and extract the data.

Fig. 4 Retrieval of data from sources that publish data in the form of PDF files
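The sketch below illustrates this pipeline, assuming pdftohtml is installed; the simple row-grouping rule stands in for the manually crafted, source-specific rules mentioned above.

```python
# A sketch of the Fig. 4 pipeline, assuming the pdftohtml tool is installed;
# the grouping rule below is a simplified stand-in for the manually crafted
# rules described in the text.
import subprocess
import xml.etree.ElementTree as ET

def pdf_table_rows(pdf_path, row_tolerance=5):
    # Convert the PDF to XML; `pdftohtml -xml input.pdf out` writes out.xml
    subprocess.run(["pdftohtml", "-xml", pdf_path, "out"], check=True)
    tree = ET.parse("out.xml")
    rows = {}
    # Each <text> element carries its page coordinates (top/left attributes)
    for el in tree.iter("text"):
        top, left = int(el.get("top")), int(el.get("left"))
        # Group elements with (nearly) the same vertical position into one row
        rows.setdefault(top // row_tolerance, []).append((left, "".join(el.itertext())))
    # Sort cells left to right within each row to recreate the table structure
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]
```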

Data fusion

Data obtained using DAMs in the way described in the previous paragraphs are stored in an internal database and used in further analysis. Still, there are some challenges that must be dealt with before such data may be used. They result from the fact that data in different sources are not consistent with each other, for example:

The same entity (e.g. a vessel) in two different sources may be referred to using different names, e.g. different spellings of the name of the vessel, or using different attributes.

The same attribute of a given entity may have different, conflicting values in different sources.

The same attributes may be described using different units of measure (e.g. meters vs feet).

Such situations should be resolved automatically if the system is to be able to utilize the data retrieved from different sources. This process is called data fusion. In the case of the SIMMO system, this problem was resolved mainly by first assigning artificial, unique identifiers to each entity and then developing methods that automatically assign these identifiers to each data item related to a given entity. The proposed methods use various approaches, inter alia text similarity measures, heuristic methods, prioritization of data sources, analysis of agreement between different attributes, and lexicon building based on information provided by DBpedia. Still, this issue is well beyond the scope of this paper; the detailed results of our work on data fusion are described in another paper (Małyszko et al. 2016).
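For illustration only, the toy sketch below shows the simplest form of the identifier-assignment step, using a standard-library string similarity measure. The threshold and vessel names are hypothetical, and the actual SIMMO methods are considerably more elaborate.

```python
# A toy version of the identifier-assignment step, matching vessel names with
# a standard-library similarity measure. The threshold and names are
# illustrative; the actual methods are described in (Małyszko et al. 2016).
from difflib import SequenceMatcher
import uuid

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

registry = {}  # canonical name -> artificial unique identifier

def entity_id(name):
    for known, uid in registry.items():
        if similar(name, known):
            return uid  # treat as a variant spelling of a known entity
    uid = str(uuid.uuid4())  # assign a new artificial identifier
    registry[name] = uid
    return uid

# Different spellings from two sources resolve to the same identifier
print(entity_id("MAERSK ALABAMA") == entity_id("Maersk-Alabama"))  # True
```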