In this section, we present three evaluations: (1) a comparative study against previous methods for manga description (Section 6.1); (2) a localization evaluation (Section 6.2); and (3) a large-scale qualitative study (Section 6.3). For the comparative study and the localization evaluation, we used a single-thread implementation to ensure a fair comparison; for the large-scale study, we employed a parallel implementation.

Comparative study

A comparative study was performed to evaluate how well the proposed framework represents manga images compared with the previous methods introduced in Section 2. We compared our proposal with a baseline (BoF with large-window SIFT [54]), the state-of-the-art BoF-based method (FV [54]), and the state-of-the-art chamfer-based method (Compact OCM [62]). All experiments were conducted on a PC with a 2.8 GHz Intel Core i7 CPU and 32 GB of RAM, using C++ implementations.

Frame image dataset

For evaluation, we cropped frames from 10 representative manga titles in the Manga109 dataset. This yielded 8,889 cropped frames with an average size of 372×341 pixels, which we used for retrieval. Note that we used frames instead of pages for this comparison because the features of BoF, FV, and Compact OCM are only comparable within a frame. Although a frame is less complex than a page, retrieval is still not easy: frame sizes vary greatly, and the localization problem remains, as shown in Fig. 12c, where the target (the head of a boy) is small and the frame includes other objects and background.

Fig. 12 Targets for the comparative study. Left to right: query sketches by novice artists, skilled artists, ground-truth images, and ground-truth images with screentone removal. Top to bottom: the targets Boy-with-glasses, Chombo, and Tatoo

Target images

For the comparison, we chose three kinds of targets: Boy-with-glasses (Fig. 12c), Chombo (Fig. 12g), and Tatoo (Fig. 12k), taken from the manga titles Lovehina vol. 1, Mukoukizu no Chombo, and DollGun, respectively. Boy-with-glasses is an easy example; Chombo is much harder because the face may be turned to either the right or the left; and Tatoo is the most difficult because it is very small compared with the size of the frame. These target images are treated as ground truths.

Query images

To prepare query sketches, we invited 10 participants (seven novices and three skilled artists, e.g., members of an art club) and asked them to draw sketches. First, we showed them each target image for 20 s. Next, they were asked to draw a sketch of that target. Each participant drew all target objects; therefore, we collected 10×3=30 queries. Examples of the queries are shown in Fig. 12. Participants drew on a 21-inch pen display (WACOM DTZ-2100).

Results of comparative study

Using these queries, we evaluated each method with standard image-retrieval protocols: recall@k and mean average precision (mAP) [4]. The ground-truth target images were labelled manually. All frame images were used for the evaluation, and the images of the other targets were regarded as distractors. The statistics of the frame images are shown in Table 1.
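For concreteness, the following is a minimal sketch of how recall@k and average precision can be computed from a ranked result list; the function names and data layout are illustrative and not taken from our implementation.

```cpp
#include <cstddef>
#include <vector>

// Recall@k: fraction of ground-truth items that appear in the top-k results.
// ranked_hits[i] is true if the i-th ranked result is a ground-truth match.
double RecallAtK(const std::vector<bool>& ranked_hits, std::size_t k,
                 std::size_t num_ground_truth) {
  std::size_t found = 0;
  for (std::size_t i = 0; i < k && i < ranked_hits.size(); ++i) {
    if (ranked_hits[i]) ++found;
  }
  return static_cast<double>(found) / num_ground_truth;
}

// Average precision: mean of the precision values at each rank where a
// ground-truth match occurs.
double AveragePrecision(const std::vector<bool>& ranked_hits,
                        std::size_t num_ground_truth) {
  std::size_t found = 0;
  double sum_precision = 0.0;
  for (std::size_t i = 0; i < ranked_hits.size(); ++i) {
    if (ranked_hits[i]) {
      ++found;
      sum_precision += static_cast<double>(found) / (i + 1);
    }
  }
  return num_ground_truth ? sum_precision / num_ground_truth : 0.0;
}
```

mAP is then the mean of the average precision over all queries.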

Table 1 Image statistics for the comparative study. All images are cropped frames from the Manga109 dataset

Figure 13 shows the results for each target. Note that this task is challenging, so all scores tend to be small: the queries from users are not always similar to the target, and some novices even drew queries that were dissimilar to it, as shown in Fig. 12e. Nevertheless, the proposed method achieved the best scores in all cases; in particular, it clearly outperformed the other methods for Boy-with-glasses. BoF and FV received almost zero scores for the Tatoo case because they are not good at finding a relatively small instance within an image. From these experiments, we conclude that BoF-based methods do not meet the requirements of manga retrieval.

Fig. 13 Results of the comparative study. Values in the legend show Recall@100

Note that, for a fair comparison, we did not apply any approximation steps: Compact OCM measured the original chamfer distance without its approximation (a sparse projection), and we did not compress features using PQ in this experiment.

Effect of quantization

When we applied PQ to the features, the score decreased with the quantization level, as shown in Fig. 14: there is a clear trade-off between compression rate and accuracy. From this result, we adopted M=16 as a reasonable option; i.e., each feature is divided into 16 parts for PQ compression and encoded into a 16-byte code. Interestingly, there is no clear relation between accuracy and the number of cells (c²). The feature description becomes finer with more cell divisions, but this does not always yield higher recall, which indicates that some level of abstraction is required for the sketch retrieval task. We employed c=8 for all experiments; i.e., a selected area is divided into 8×8 cells for feature description.
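As an illustration of the compression step, the sketch below encodes one feature into an M-byte PQ code, assuming M=16 pre-trained subquantizers with 256 centroids each (so one byte per subvector); `PqEncode` and the codebook layout are hypothetical names, not our actual API.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Encode a D-dimensional feature into an M-byte PQ code.
// codebooks[m][j] holds the j-th centroid (D/M floats) of subquantizer m;
// with 256 centroids per subquantizer, each index fits in one byte.
std::vector<std::uint8_t> PqEncode(
    const std::vector<float>& feature,
    const std::vector<std::vector<std::vector<float>>>& codebooks) {
  const std::size_t M = codebooks.size();          // e.g., M = 16
  const std::size_t sub_dim = feature.size() / M;  // length of each subvector
  std::vector<std::uint8_t> code(M);
  for (std::size_t m = 0; m < M; ++m) {
    double best_dist = std::numeric_limits<double>::max();
    std::uint8_t best_idx = 0;
    for (std::size_t j = 0; j < codebooks[m].size(); ++j) {  // 256 centroids
      double d = 0.0;
      for (std::size_t t = 0; t < sub_dim; ++t) {
        const double diff = feature[m * sub_dim + t] - codebooks[m][j][t];
        d += diff * diff;
      }
      if (d < best_dist) {
        best_dist = d;
        best_idx = static_cast<std::uint8_t>(j);
      }
    }
    code[m] = best_idx;  // nearest centroid of the m-th subquantizer
  }
  return code;  // 16 bytes per feature for M = 16
}
```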

Fig. 14 The effect of feature compression by PQ. The Y-axis represents retrieval performance; the X-axis shows the number of cells. Each line corresponds to a compression level. As the features are compressed more aggressively by PQ (i.e., represented by a smaller number of subvectors M), the score decreases relative to the original uncompressed feature

Parameter settings and implementation details

We now describe the parameter settings and implementation details used for the evaluation. For BoF, SIFT features were densely extracted with a patch size of 64×64 pixels and a 2×2 spatial pyramid, and the dictionary contained 1,024 vectors, giving a final dimensionality of 4,096. For FV, a Gaussian mixture model with 256 Gaussians was used, again with a 2×2 spatial pyramid, and the SIFT dimensionality was reduced from 128 to 80 by PCA. For BoF and FV, we used the same parameter settings as Schneider and colleagues [54], with the vlfeat implementation [67]. For selective search, we employed the dlib implementation [30], applying slight Gaussian blurring beforehand. To train the code words for BoF, FV, and PQ, we randomly selected 500 images from the dataset; these images were excluded from testing. To eliminate small patches, we set the minimum side length of a patch to 100 pixels and discarded smaller patches.
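The patch-filtering step is simple; a minimal sketch follows, assuming the minimum-length rule applies to both sides of a patch (the `Rect` type and function name are illustrative).

```cpp
#include <vector>

struct Rect { int x, y, w, h; };  // proposal from selective search

// Discard selective-search proposals whose width or height is below
// min_len (100 pixels in our experiments).
std::vector<Rect> FilterSmallPatches(const std::vector<Rect>& proposals,
                                     int min_len = 100) {
  std::vector<Rect> kept;
  for (const Rect& r : proposals) {
    if (r.w >= min_len && r.h >= min_len) kept.push_back(r);
  }
  return kept;
}
```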

Localization evaluation

Next, we evaluated how well the proposed method can localize a target object in manga pages. The setup is similar to that used for object detection evaluation [17].

Images

As query sketches, we used the Boy-with-glasses sketches collected in Section 6.1. We prepared two datasets for the evaluation: (i) the Lovehina dataset, consisting of the 192 pages of Lovehina vol. 1, including the pages that contain the ground-truth windows; and (ii) the Manga109 dataset, i.e., all of the manga data, consisting of 109 titles with a total of 21,142 pages. Note that the Lovehina data is included in the Manga109 dataset, so (i) is a subset of (ii). The ground-truth areas (69 windows) were manually annotated in the Lovehina dataset.

In contrast to the previous comparative study, this is an object localization task: given a query image, find the target instance within an image. In our case, the target must be found across many manga pages (21,142 pages for Manga109).

Evaluation criteria

For evaluation, we employed the standard PASCAL overlap criterion [17]. A bounding box (retrieved result) produced by the method is judged to be true or false by measuring its overlap with the ground-truth windows. Denoting the predicted bounding box as $B_p$ and the ground-truth bounding box as $B_{gt}$, the overlap is measured by:

$$ r = \frac{\mathrm{area}(B_{p} \cap B_{gt})}{\mathrm{area}(B_{p} \cup B_{gt})}. $$ (5)

We judged a retrieved area to be correct if r > 0.5. If multiple bounding boxes overlapped the same ground truth, at most one of them was counted as correct.

For each query, we retrieved the top 100 areas from the dataset (Lovehina or Manga109) using the proposed method, and judged each retrieved area using (5). We then computed the standard mAP@100 from the resulting true/false sequence.
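The overlap test of (5) can be implemented directly; below is a minimal sketch (the `Box` type and function names are illustrative).

```cpp
#include <algorithm>

struct Box { double x1, y1, x2, y2; };  // corner coordinates, x1 < x2, y1 < y2

// PASCAL overlap criterion: intersection-over-union of the predicted
// box p and the ground-truth box g, as in Equation (5).
double Overlap(const Box& p, const Box& g) {
  const double ix = std::max(0.0, std::min(p.x2, g.x2) - std::max(p.x1, g.x1));
  const double iy = std::max(0.0, std::min(p.y2, g.y2) - std::max(p.y1, g.y1));
  const double inter = ix * iy;
  const double area_p = (p.x2 - p.x1) * (p.y2 - p.y1);
  const double area_g = (g.x2 - g.x1) * (g.y2 - g.y1);
  return inter / (area_p + area_g - inter);  // union = sum - intersection
}

// A retrieved area is judged correct if its overlap exceeds 0.5.
bool IsCorrect(const Box& p, const Box& g) { return Overlap(p, g) > 0.5; }
```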

Result of localization evaluation

The results are shown in Table 2. With our single-thread implementation, searching the Manga109 dataset (14M patches) took 331 ms. This is fast enough for interaction, and the computation can be further accelerated (to 70 ms) using the parallel implementation discussed in Section 6.3. We also show theoretical values of the memory consumption of the EOH features (#patches × 8M bits). The whole Manga109 dataset consumes only 204 MB. As the mAP values show, this task is difficult: there are hundreds of thousands to millions of candidate areas (138K for Lovehina, 14M for Manga109) but only 69 ground-truth areas. Examples of retrieved results are shown in Fig. 15. For the Lovehina data, the first result is a failure but the second is correct. For the Manga109 dataset, the first success appears at the 35th result; the first result shares characteristics with the query (wearing glasses) even though it is incorrect.
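As a rough sanity check on the memory figure quoted above (assuming M=16 subvectors at 8 bits each, i.e., 16 bytes per patch):

$$ 14 \times 10^{6} \ \text{patches} \times 8M \ \text{bits} = 14 \times 10^{6} \times 16 \ \text{bytes} \approx 224 \ \text{MB}, $$

which is of the same order as the reported 204 MB; the difference comes from the rounding of the patch count.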

Fig. 15 Examples of localization experiments for the Lovehina dataset and Manga109 dataset

Table 2 Results for localization evaluation (single-thread implementation)

Large-scale qualitative study

In this section, we present a qualitative study of retrieval from the Manga109 dataset. The whole system was implemented with a GUI, as shown in Fig. 2.

We employed a parallel implementation using the Intel Threading Building Blocks (TBB) library; the average computation time was 70 ms for the Manga109 dataset (21,142 images). The parallelization was straightforward: neighbors were computed for each manga title in parallel. In this implementation, we kept only the most similar feature per page (rather than all features per page) and then merged the results.
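The per-title parallelization can be expressed with a TBB parallel_for; the following is a minimal sketch of this design (one task per title, best feature per page, merge at the end). `SearchTitle`, `PageResult`, and the scoring convention are illustrative placeholders, not our actual API.

```cpp
#include <tbb/parallel_for.h>

#include <algorithm>
#include <cstddef>
#include <vector>

struct PageResult {
  int title_id;
  int page_id;
  float score;  // distance between the query and the page's best feature
};

// Hypothetical per-title search: returns the single best-matching feature
// per page of the given title. A real implementation would scan the PQ
// codes of every patch in the title; a stub is given here.
std::vector<PageResult> SearchTitle(int title_id) { return {}; }

// Search all titles in parallel, one independent task per title, then merge.
std::vector<PageResult> SearchAllTitles(std::size_t num_titles) {
  std::vector<std::vector<PageResult>> per_title(num_titles);
  tbb::parallel_for(std::size_t(0), num_titles, [&](std::size_t i) {
    per_title[i] = SearchTitle(static_cast<int>(i));  // no shared state
  });
  std::vector<PageResult> merged;
  for (const auto& v : per_title) {
    merged.insert(merged.end(), v.begin(), v.end());
  }
  std::sort(merged.begin(), merged.end(),
            [](const PageResult& a, const PageResult& b) {
              return a.score < b.score;  // smaller distance = better match
            });
  return merged;
}
```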

Qualitative study using a sketch dataset

We qualitatively evaluated the proposed method using a public sketch dataset: the representative sketches of [15] served as queries. Each of the 347 sketches has a category name, e.g., “panda.”

Figure 16a and b show successful examples: we could retrieve the corresponding objects from the Manga109 dataset. In particular, retrieval works well when the target consists of simple geometric shapes such as squares, as shown in Fig. 16b; this tendency matches that of previous sketch-based image retrieval systems [16, 62]. Figure 16c shows a failure case, although the retrieved glass is similar to the query.

Fig. 16 Results of the subjective study using representative sketches [15] as queries

As can be seen in Fig. 16c, text regions are sometimes retrieved and ranked near the top. Because users usually do not want such results, detecting and eliminating text areas would improve the results; this remains future work.

More results with relevance feedback

Figure 17 shows more results for character-face queries obtained with the proposed relevance feedback. The top retrieved results show the same characters as (or characters similar to) those in the query; in this case, all of the results were drawn by the same author. Interestingly, our edge histogram feature captured the drawing characteristics of individual authors. Figure 17b shows the results for “blush face”; in Japanese manga, such blushing faces are drawn with hatching. With relevance feedback, blushing characters were retrieved from various manga titles. These character-based retrievals are made possible by content-based search, which suggests that the proposed query interactions are beneficial for manga search.
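As a rough sketch of how such feedback can be realized, one common approach averages the features of the user-marked relevant results to form a refined query (a Rocchio-style update); note that this is an assumed, simplified formulation for illustration, not necessarily our exact update rule.

```cpp
#include <cstddef>
#include <vector>

// Form a refined query by averaging the feature vectors of the results
// the user marked as relevant (Rocchio-style relevance feedback).
std::vector<float> RefineQuery(
    const std::vector<std::vector<float>>& relevant_features) {
  if (relevant_features.empty()) return {};
  std::vector<float> query(relevant_features.front().size(), 0.0f);
  for (const auto& f : relevant_features) {
    for (std::size_t d = 0; d < query.size(); ++d) query[d] += f[d];
  }
  for (float& v : query) v /= static_cast<float>(relevant_features.size());
  return query;  // used as the query feature for the next search round
}
```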