2.4. Specialized EDM and LA Applications

In the previous section, we discussed general-purpose tools for EDM modeling and analysis. However, specific types of data and specific analysis goals often require specialized algorithms that are not available in these general-purpose tools; in such cases, researchers and practitioners typically turn to tools designed for those situations. In this last group of surveyed tools, we cover the functionality of some of the most popular tools that accomplish these goals.

2.4.1. BKT

Bayesian knowledge tracing (BKT; Corbett & Anderson, 1995) is a popular method for latent knowledge estimation, in which a student's knowledge is measured during online learning. This is distinct from the type of educational measurement common in tests in that, during online learning, the knowledge changes while it is being measured.

BKT is a Hidden Markov Model (and, simultaneously, a simple Bayesian network; Reye, 2004) that predicts whether a student has or has not mastered a particular skill within an intelligent tutoring system or similar program. BKT models are typically fit using one of two algorithms: brute-force grid search or expectation–maximization; the two perform comparably in terms of predictive performance. Some of the publicly available tools for BKT include BKT-BF, available at http://www.columbia.edu/~rsb2162/BKT-BruteForce.zip; BNT-SM, available at http://www.cs.cmu.edu/~listen/BNT-SM/ (which also requires Matlab to run); and hmmsclbl, available at http://yudelson.info/hmmsclbl.html.
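
To make the model concrete, the following is a minimal Python sketch of the standard BKT update for a single skill; the function and parameter names (and the example parameter values) are illustrative and not taken from any of the tools listed above.

```python
def bkt_update(p_mastery, correct, p_learn=0.1, p_guess=0.2, p_slip=0.1):
    """Update P(skill mastered) after observing one response on that skill."""
    if correct:
        # P(mastered | correct response): mastered and did not slip,
        # versus not mastered but guessed correctly
        evidence = p_mastery * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_mastery) * p_guess)
    else:
        # P(mastered | incorrect response): mastered but slipped,
        # versus not mastered and did not guess
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - p_guess))
    # Allow for the chance that the skill was learned at this opportunity
    return posterior + (1 - posterior) * p_learn

# Trace the mastery estimate over a short response sequence (1 = correct)
p = 0.3  # P(L0), the initial probability of mastery
for response in [0, 1, 1, 1]:
    p = bkt_update(p, response)
    print(round(p, 3))
```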

2.4.2. Text Mining

Text mining is a rapidly growing area of data mining, and there are a significant number of programs, apps, and APIs available for tagging, processing, and identifying textual data. Text analysis tools can process parts of speech, sentence structure, and semantic word meaning, and some tools are also able to identify representational relationships between different words and sentences. The tools presented below are not an exhaustive list of all the programs available but represent a selection of tools that cut across the numerous facets of textual processing and analysis.

The Linguistic Inquiry and Word Count (LIWC) tool (Tausczik & Pennebaker, 2010) is a graphical, easy-to-use computerized text analysis tool that measures the latent characteristics of a text through analysis of the vocabulary used. LIWC provides more than 80 metrics covering different psychological categories of vocabulary (e.g., cognitive, affective, functional, and analytical words) and has been extensively used and validated in a large number of studies.

WMatrix (http://ucrel.lancs.ac.uk/wmatrix/) is an online graphical tool that can be used for word frequency analysis and visualization of text corpora. Although it can be used to conduct the complete analysis process, it is primarily useful in the feature engineering phase for the extraction of linguistic features, including word n-grams, important multiword phrases (i.e., phrases that occur more frequently than expected), part-of-speech (POS) tags, and (in particular) word semantic categories. It also visualizes text corpora in the form of word clouds and provides an interface for comparing several text corpora simultaneously.

Another popular tool for text analysis is Coh-Metrix (Graesser, McNamara, & Kulikowich, 2011; Graesser, McNamara, Louwerse, & Cai, 2004), which provides more than 100 measures of text divided into 11 categories. Compared to WMatrix, Coh-Metrix offers a more contextual understanding and analysis of text features and relationships in the data. Whereas WMatrix tags words and multiword units semantically, Coh-Metrix provides multiple measures of deep text cohesion, such as narrativity or referential cohesion. With this deeper level of analysis comes a need for larger data sets: using Coh-Metrix effectively tends to require a larger corpus of text than semantic taggers do.

In the feature engineering process, it is often important to model the semantic similarity between different texts, which requires understanding the relationships between different words. Word2vec is a two-layer neural network that processes input text and outputs a set of feature vectors for the words in the data corpus. In the resulting vector space, similar words are grouped together (e.g., "Norway" is closer to "Sweden" than to "India"), which can be exploited for different feature engineering tasks (e.g., calculating the semantic similarity between text documents).
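
As a brief illustration, the sketch below trains a word2vec model with the gensim Python library (an assumption on our part; the paragraph above describes the general technique rather than any specific package). A toy corpus is used here, so the resulting similarities are not meaningful; a real model would be trained on a much larger corpus.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: each sentence is a list of tokens
sentences = [
    ["norway", "and", "sweden", "are", "nordic", "countries"],
    ["sweden", "borders", "norway"],
    ["india", "is", "in", "south", "asia"],
]

# Train a small word2vec model (parameters chosen only for illustration)
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=1)

# Each word now has a learned feature vector ...
print(model.wv["norway"][:5])
# ... and words can be compared by similarity in that vector space
print(model.wv.similarity("norway", "sweden"))
print(model.wv.similarity("norway", "india"))
```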

Topic modeling tools discover a set of latent "topics" by observing the words that often occur together (e.g., in documents where "coach" is mentioned, the word "player" has a higher chance of occurring than "apple," suggesting a latent relationship between the words "coach" and "player"). The relevance of each topic in every document is evaluated, which enables discovery of the most relevant topics for each document and "soft" clustering of documents. The most popular and widely used method for topic modeling is Latent Dirichlet Allocation (LDA; Blei, 2012), which is implemented in the MALLET tool kit (McCallum, 2002) as well as in the topicmodels (Grun & Hornik, 2014), lda (Chang, 2010), and stm (Reich, Tingley, Leder-Luis, Roberts, & Stewart, 2014) R packages.
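
To illustrate what such a model produces, here is a small LDA sketch using scikit-learn rather than any of the packages named above (an assumption made purely for brevity); it fits two topics to four toy documents and prints the top words per topic and each document's topic relevance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the coach praised the player after the game",
    "the player scored late in the game",
    "apple and banana prices rose at the market",
    "the market sold fresh apple juice",
]

# Word-count matrix (documents x vocabulary)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words for each latent topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}:", top_words)

# Per-document topic proportions ("soft" clustering of documents)
print(lda.transform(X).round(2))
```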

Another technique often used to extract topics from document corpora is latent semantic analysis (LSA; Landauer, Foltz, & Laham, 1998). While LDA and similar probabilistic methods use word co-occurrence to estimate which words constitute a topic, LSA uses the linear algebra technique of matrix decomposition to find sets of words that represent different topics. It can also be used to measure the semantic similarity of two documents or parts of documents by comparing their vectors in the topic space. LSA has been implemented in several programming languages, with a Java-based text mining library (TML for LSA; tml-java.sourceforge.net) and the lsa R package (Wild, 2015) being some of the most popular LSA implementations.
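
The sketch below shows the core LSA idea, again using scikit-learn for brevity (not one of the implementations named above): documents are represented as a term-document matrix, reduced to a low-rank "topic space" by matrix decomposition, and then compared by cosine similarity in that space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the coach trained the player before the game",
    "the player and the coach reviewed the game",
    "apple pie recipes use fresh apples",
]

# Term-document matrix weighted by TF-IDF
tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD projects documents into a 2-dimensional topic space
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# Pairwise semantic similarity of the documents in the topic space
print(cosine_similarity(doc_vectors).round(2))
```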

Given that text mining systems typically involve analysis of natural language text, NLP tool kits represent an important part of the text mining toolset. Those tools are typically used in the preprocessing stage of the analysis, for example, to (a) split paragraphs into individual sentences, utterances, or words; (b) extract syntactic dependencies between words; (c) assign POS tags (word grammatical categories) to each word; (d) reduce derived words to their root forms (i.e., stemming and lemmatization); (e) extract named entities from the text (i.e., names of people, places, institutions, monetary amounts, and dates); and (f) resolve coreferences (i.e., map pronouns to their target nouns). There are several NLP tool kits available that provide programmable APIs for popular programming languages (e.g., Java and Python). One popular example is the Apache OpenNLP tool kit (Morton, Kottmann, Baldridge, & Bierner, 2005), a Java-based NLP tool kit that supports most of the common NLP tasks listed above. Similarly, Python NLTK (Bird, 2006) is an NLP library for the Python programming language with very similar capabilities. Finally, Stanford CoreNLP (Manning et al., 2014) is an NLP tool kit that, aside from providing a Java API, also provides a stand-alone command line interface and a set of "wrappers" for other programming languages (e.g., C#, Python, R, Ruby, Scala, and JavaScript).
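
For instance, several of the preprocessing steps above can be performed in a few lines with Python NLTK; the sketch below shows sentence splitting, tokenization, POS tagging, and stemming (the resource names passed to nltk.download may vary slightly between NLTK versions).

```python
import nltk

# Download the tokenizer and POS-tagger models (names may differ by version)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The students posted three questions. They were answered quickly."

sentences = nltk.sent_tokenize(text)        # (a) sentence splitting
tokens = nltk.word_tokenize(sentences[0])   # (a) word tokenization
pos_tags = nltk.pos_tag(tokens)             # (c) POS tagging

stemmer = nltk.stem.PorterStemmer()         # (d) stemming
stems = [stemmer.stem(token) for token in tokens]

print(pos_tags)
print(stems)
```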

LightSIDE (ankara.lti.cs.cmu.edu/side; and its predecessor TagHelper) is a free, open-source suite of tools for supporting text mining, built on top of the WEKA tool kit. LightSIDE can create a set of standard features often used for educational text, including variables for individual words, bigrams (common pairs of neighboring words), POS bigrams, line length, the presence of nonstop words, punctuation, and stemmed words. It also provides a streamlined interface for error analysis that can help researchers iteratively improve their text mining solutions.

One of the primary reasons why understanding natural language is such a challenging problem is that each statement depends heavily on the particular context and background knowledge of the listener or reader. The approach taken by ConceptNet (Liu & Singh, 2004) is to develop a very large graph of "commonsense" knowledge (e.g., "piano is a musical instrument"), which can then be utilized for understanding and processing natural text. By utilizing such an extensive knowledge base, ConceptNet can be used to categorize textual documents, extract topical information from corpora, perform sentiment analysis (i.e., detect emotions in text), and summarize text, among other uses.
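
As a rough illustration of how the knowledge graph can be queried programmatically, the sketch below calls the public ConceptNet web API (api.conceptnet.io), which exposes a more recent version of the knowledge base than the one cited above; the endpoint and response fields shown here are assumptions based on that public API and should be checked against the current documentation.

```python
import requests

# Look up the concept "piano" and list a few of its commonsense relations
response = requests.get("http://api.conceptnet.io/c/en/piano").json()

for edge in response.get("edges", [])[:5]:
    # Each edge connects two concepts via a relation,
    # e.g., piano --IsA--> musical instrument
    print(edge["start"]["label"], "--", edge["rel"]["label"], "-->", edge["end"]["label"])
```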

AlchemyAPI provides a set of tools for semantic data extraction using NLP tools and machine learning algorithms. The tool offers an API for processing content formatted as standard text documents or standard web resources (i.e., accessible through a URL). AlchemyAPI extracts concepts from documents; for each extracted concept, the tool provides a relevance value that indicates the importance of the extracted term for a given document. Supported response formats include XML, JSON, and RDF. Although it is a commercial platform, distributed by IBM, AlchemyAPI provides free access for up to 1,000 calls per day.

TAGME is a text annotation tool specifically designed for semantic annotation of short, unstructured or semistructured text segments, such as the text obtained from search engine snippets, tweets, or news feeds (Ferragina & Scaiella, 2010). The text annotation process identifies sequences of terms and annotates them with pertinent links to Wikipedia pages; that is, TAGME assigns (if possible) a Wikipedia concept to each of the term sequences in the analyzed text. An experimental evaluation of TAGME (Ferragina & Scaiella, 2010) showed better performance on short text segments and comparable precision/recall on longer texts, compared to other solutions. The tool provides an API for on-the-fly text processing and integration with other applications.

Apache Stanbol is an open-source software tool for semantic text analysis (stanbol.apache.org/docs/trunk/scenarios.html). It is primarily designed to bring semantic technologies into existing content management systems and to support text mining and feature extraction. Similar to TAGME, it links keywords extracted from text to Wikipedia concepts. Apache Stanbol is easy to set up and run for a small set of instances; however, the tool also allows for incorporating a domain-specific ontology into the annotation process, which is highly beneficial when working with locally defined concepts specific to a given educational context. Finally, Apache Stanbol supports text annotation in multiple languages and has been integrated with several content management systems.

2.4.3. SNA

Social network analysis (SNA) seeks to understand the connections and relationships that form between individuals and/or communities, most commonly expressed as node-and-edge diagrams. SNA is commonly employed to analyze collaborative social networks, such as those seen in social media or in student interactions within MOOCs or online courses.

Gephi (https://gephi.org) is a popular and widely used interactive tool for the analysis and visualization of different types of social networks. Gephi is extensively used in LA research, and it supports directed and undirected social networks specified in a wide range of input data formats. Often used as a tool for exploratory analysis, it provides a set of graphical tools for easy visualization of social networks, including the ability to color nodes and edges based on their attributes or the properties of their network position (e.g., clustering coefficient, degree centrality, and betweenness centrality). The tool also offers a Java API for manipulation of social network graphs, calculation of multiple measures (e.g., density, average path length, and betweenness centrality), and execution of algorithms commonly used in SNA (e.g., graph clustering and giant connected component extraction). It is licensed under the GPL and available on Microsoft Windows, Linux, and Mac OS X platforms.

EgoNet (http://egonet.sf.net) is a free SNA tool that focuses on the analysis of egocentric networks, which are, generally speaking, social networks constructed from the perspective of the individual network actors, typically using survey instruments. Through EgoNet, a researcher specifies a set of network members and distributes to all of them a small survey regarding their relationships with other members of the network. As members provide information about the network from their perspective (hence "ego" in the name), EgoNet visualizes the overall network and provides a set of analysis tools to better understand its structure, with options to ask individual members further questions.

Network Overview, Discovery, and Exploration for Excel (NodeXL, http://nodexl.codeplex.com) is an extension for Microsoft Excel that makes it easy to visualize network data from a wide variety of input data formats. Similarly to Gephi, it provides a set of tools for filtering and visualizing the data, as well as for calculating basic network properties (e.g., radius, diameter, and density) and node properties (e.g., degree centrality, betweenness centrality, and eigenvector centrality), and for applying other network analysis methods (e.g., cluster analysis for community mining). Currently, there are two versions: NodeXL Basic (a free version) and NodeXL Pro. Beyond basic support for SNA, NodeXL Pro adds functionality for automated loading of data from several social media platforms (e.g., Twitter, YouTube, and Flickr) and for text and sentiment analysis of social media streams.

Pajek (http://mrvar.fdv.uni-lj.si/pajek) is a free desktop tool for complex analysis of a wide variety of large networks (thousands to hundreds of thousands of nodes), including networks of social interactions. Pajek is extensively used in academia for SNA, including LA research, for tasks such as network partitioning, community detection, large network visualization, and information flow analysis. At present, Pajek is available only for Windows OS. There is also Pajek-XXL, a version of Pajek specially designed for working efficiently with extremely large networks (millions of nodes or more).

NetMiner (http://www.netminer.com) is a commercial graphical tool for the analysis and visualization of networks. Similarly to Gephi and NodeXL, it supports importing network data in various formats, network visualization, and calculation of common graph-based and node-based statistics. NetMiner is also suitable for advanced analyses of networks and has a built-in data mining module supporting various data mining tasks (e.g., classification, clustering, recommendation, and reduction). It also has an integrated Python scripting engine for more complex and custom types of analyses. Besides the graphical user interface, it supports a scripting interface, which makes it suitable for embedding as a module in other software systems. Finally, it supports 3-D visualization of networks and video recording of network explorations (e.g., for inclusion in presentations). NetMiner is currently available only on Microsoft Windows OS.

Cytoscape (http://www.cytoscape.org) is another open-source platform, originally developed for the visualization of molecular interaction networks, which has since become a fully featured suite for the analysis of various types of networks, including social networks. Cytoscape consists of a core distribution with basic network analysis and visualization capabilities, which can be extended with a large number of user-contributed modules (called apps). Cytoscape is developed on the Java platform and can be used on multiple operating systems.

SoNIA (https://web.stanford.edu/group/sonia) is an open-source platform for the analysis of longitudinal network data. In the case of longitudinal network data, besides information about relationships (i.e., edges) between network members (i.e., nodes), there is also information about the time at which those relationships occurred, or at least the order in which they developed. SoNIA supports visualization of network changes over time, with the ability to apply different network layout algorithms to some (or all) time frames to better visualize changes in network structure. The result is a smooth animation of structural changes over time, which can be exported to the QuickTime video format. SoNIA is developed at Stanford University using the Java programming language and thus can be used on all major operating systems.

Social Networks Visualizer (SocNetV, http://socnetv.sourceforge.net) is an open-source tool for the analysis and visualization of social networks. It supports loading data from various network formats, calculation of typical graph and node properties, and flexible visualization of networked data (e.g., filtering, coloring, and resizing of nodes based on their properties). One interesting and unique feature of SocNetV is its embedded web crawler, which can be used to automatically extract the link structure of a collection of HTML documents (starting from one seed page and then following HTML links to other pages). It is licensed under the GPL license and available on Microsoft Windows, Linux, and Mac OS X platforms.

NetworkX (http://networkx.github.io) is an open-source software library for the Python programming language for the creation, manipulation, and analysis of the structure and dynamics of complex networks. It is heavily used in academia and provides a rich set of advanced functionalities for working with networked data, including graph reduction using block modeling techniques, graph clustering, community detection, link prediction (finding missing links, e.g., a missing Facebook connection between two friends), network triad analysis, and others.
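
As a brief example, the following sketch builds a small, made-up interaction network and computes a few of the measures mentioned in this subsection.

```python
import networkx as nx

# Nodes are students; an edge means two students replied to each other in a forum
G = nx.Graph()
G.add_edges_from([
    ("ana", "bo"), ("ana", "cai"), ("bo", "cai"),
    ("cai", "dev"), ("dev", "eli"),
])

print(nx.density(G))                 # overall network density
print(nx.degree_centrality(G))       # per-node degree centrality
print(nx.betweenness_centrality(G))  # nodes in brokerage positions

# Simple link prediction: score a non-existing edge by shared neighbors
for u, v, score in nx.jaccard_coefficient(G, [("ana", "dev")]):
    print(u, v, round(score, 2))
```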