Additional Information

This proposal explores research prospects and hypotheses for future collaborative intelligent systems that combine automated reasoning and human annotation, testing a crowd-enabled human computation and machine reasoning model for semantic analytics. This model is expected to allow the extraction of relevant facts about the relationships between disciplines, scholars, and publications, addressing the limitations of current tools for understanding research attributes and trends at different levels of granularity and relating them “semantically” through an integrated solution (Osborne et al., 2013). Building on Woolley et al.’s (2010) analysis of crowd behavior and collective intelligence, this work proposes a step forward by studying convergence indicators and input requirements with automatic tools and crowd-based human computation platforms (e.g., MTurk and Crowdcrafting) applied to a vast set of scientific publications. The design of a community self-organizing bibliographic information system (Correia et al., 2013) will be informed “from the ground up” to support this crowd-enabled scientific data analytics process. Currently, this system only supports user authentication and lets users edit and classify data using different annotations, comments, and categories.

A crowdsourcing scenario provides a reliable setting for investigating human collective intelligence, generated through networks of interactions among individuals, environments, and contents. A scenario will be presented in which scholars can draw on machine annotations to focus on key parts of publications and then provide their own annotations and create classification categories. They can then use those annotations to reflect on their interpretation, or read other people’s annotations and discover new aspects, interpretations, and knowledge. Problems are formalized in the crowdsourced setting, with efficient algorithms that achieve good results with high probability. The question of crowdsourcing scientific data analysis and clustering will be answered in two steps: 1) reduce the problem to a number of independent Human Intelligence Tasks (HITs) of reasonable size and assign them to a large pool of participants, and 2) develop a model of the annotation process to aggregate the human data automatically, yielding a partition of the dataset into categories. The outputs of this human-machine analytical approach will be tested on a number of real data sets and compared against existing methods.
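The two steps above can be sketched as follows. This is a minimal illustration, not the proposed aggregation model: it assumes pairwise “same category?” judgments as the HIT format and a simple majority rule for merging items, with all function names and data invented for this sketch.

```python
from collections import defaultdict

def make_hits(items, hit_size):
    """Step 1: split the item pool into independent HITs of bounded size."""
    return [items[i:i + hit_size] for i in range(0, len(items), hit_size)]

def aggregate_partition(items, votes, threshold=0.5):
    """Step 2: aggregate pairwise 'same category?' judgments into a partition.
    votes maps an item pair to a list of 0/1 crowd answers; pairs whose
    majority says 'same' are merged via union-find."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for (a, b), answers in votes.items():
        if answers and sum(answers) / len(answers) > threshold:
            union(a, b)

    clusters = defaultdict(list)
    for x in items:
        clusters[find(x)].append(x)
    return list(clusters.values())
```

A real deployment would replace the majority rule with a learned model of annotator behavior, but the split-then-aggregate shape stays the same.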

For example, consider the following evaluation scenario. A paper classified as "medical informatics" could be characterized by subarea (e.g., cognitive aging), aims and purpose, setting and context, key concepts and definitions, participant characteristics, research boundaries and limitations, method, results and findings, socio-technical aspects of a given technology (e.g., a Wiki to support knowledge exchange in public health), related work, scientometric data (e.g., the authors’ affiliations and countries), and annotations as a meta-cognitive activity. In addition, all these data could be correlated and filtered to present final results for specific research purposes (e.g., identifying what kinds of features were introduced in health care technologies by Canadian researchers between 2009 and 2016).
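The correlation-and-filtering step could operate on annotated metadata records like the following. The records, titles, and field names here are invented purely to illustrate the query from the example above:

```python
# Hypothetical paper records; field names are illustrative, not from the proposal.
papers = [
    {"title": "Wiki for public health", "country": "Canada", "year": 2012,
     "area": "medical informatics", "features": ["knowledge exchange"]},
    {"title": "Cognitive aging study", "country": "Canada", "year": 2007,
     "area": "medical informatics", "features": ["memory support"]},
    {"title": "EHR usability review", "country": "Germany", "year": 2014,
     "area": "medical informatics", "features": ["audit trails"]},
]

def features_by(papers, country, start, end):
    """Filter annotated metadata for a specific research purpose, e.g.
    features introduced by researchers from one country in a year range."""
    out = set()
    for p in papers:
        if p["country"] == country and start <= p["year"] <= end:
            out.update(p["features"])
    return sorted(out)
```

For instance, `features_by(papers, "Canada", 2009, 2016)` answers the sample query about Canadian health care technologies in that period.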

A controlled experiment applying data mining (to discover previously unknown properties) and machine learning classifiers (trained to recognize certain patterns in the data based on a thesaurus) to a large number of publications represents the research setting for testing computational intelligence. In this sense, metadata structures (e.g., Dublin Core), hybrid tools for data alignment and generation from text (e.g., Apolo), automatic and crowd-based taxonomy creation systems (e.g., Cascade), and open classification models will support the identification of complex system dynamics and emergent vocabularies, resulting in a knowledge base of scientific facts extracted from the literature.
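A minimal stand-in for a thesaurus-driven classifier is sketched below. The thesaurus entries are invented for illustration, and in the proposed setting a trained classifier, rather than simple term counting, would do this work:

```python
from collections import Counter

# Illustrative thesaurus mapping categories to indicative terms (assumed data).
THESAURUS = {
    "medical informatics": {"health", "clinical", "patient", "wiki"},
    "scientometrics": {"citation", "bibliometric", "impact", "affiliation"},
}

def classify(text, thesaurus=THESAURUS):
    """Score each category by how many of its thesaurus terms occur in the
    text; return the best-scoring category, or None if nothing matches."""
    tokens = Counter(text.lower().split())
    scores = {cat: sum(tokens[t] for t in terms)
              for cat, terms in thesaurus.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Labels produced this way (or by crowd workers) could then accumulate into the knowledge base of scientific facts described above.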

The hybrid methodology will rest mainly on evidence-based research for the systematic analysis of data, so that common concepts and ideas are extracted and then axially coded to produce higher-level themes and concepts that frame the theoretical understanding of the researched phenomenon. Case studies based on crowdsourcing and human computation will also support the global workflow: measuring the relevance of personalized search and analysis, gathering training data for machine learning classifiers, and designing an intelligent system that incorporates crowdsourcing into high-quality research.

Semantics will be framed for crowd workers by means of sentences, scenarios, and descriptions discussing scientific facts, together with performance measures for the crowdsourcing process to analyze the semantic correctness, naturalness, and bias of the collected data sets. Pattern recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation will be among the evaluation methods. Experiments will be run with a large crowd of MTurk workers, and the workflow is based on selecting papers and classification dimensions for analysis, splitting large tasks into batches, asking crowds and using automated mechanisms to classify a query, applying quality control, aggregating contributions, and collecting results and metrics while assembling methods for crowd-wisdom consensus.
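The quality-control and aggregation steps of this workflow can be illustrated with gold (known-answer) items and an accuracy-weighted majority vote, one common consensus scheme. Worker identifiers, items, and labels below are hypothetical:

```python
from collections import defaultdict

def worker_accuracy(answers, gold):
    """Quality control: estimate each worker's accuracy on gold items.
    Workers who answered no gold items get a neutral 0.5 prior."""
    acc = {}
    for worker, labels in answers.items():
        graded = [labels[q] == a for q, a in gold.items() if q in labels]
        acc[worker] = sum(graded) / len(graded) if graded else 0.5
    return acc

def weighted_consensus(answers, gold):
    """Aggregate crowd labels per non-gold item by accuracy-weighted vote."""
    acc = worker_accuracy(answers, gold)
    tally = defaultdict(lambda: defaultdict(float))
    for worker, labels in answers.items():
        for item, label in labels.items():
            if item not in gold:
                tally[item][label] += acc[worker]
    return {item: max(votes, key=votes.get) for item, votes in tally.items()}
```

Weighting by estimated accuracy lets reliable workers outvote unreliable ones even when raw head counts tie; richer models (e.g., confusion-matrix-based) follow the same pattern.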

While it remains unclear whether manual data gathering and evaluation can scale to a large set of publications and scholars, it is assumed that the prerequisites for crowdsourcing and machine learning are present in academic settings and that scientists perceive this approach as useful for supporting research.