The paperless office, much like the year of the Linux desktop, always seems to remain just out of reach. In large part, this is because there is little that one person can do to prevent other people from printing and sending hard-copy documents. But there are tools to help convert these unwanted papers into digital form that can be integrated, to one degree or another, with the filesystems and databases that keep track of everything else. One such tool is Paperwork, a graphical tool to scan, extract text from, and automatically index paper documents.

Scanning, itself, is close to being treated as a solved problem on Linux desktops. If the goal is merely to scan images of paper documents, there are loads of existing applications to do just that. Where Paperwork makes its claims of superiority, though, is in its support for recognizing text on the page, extracting it into a searchable form, then intelligently assigning labels that reflect semantic meaning.

Making this work requires generating some metadata from each scanned page, and storing it in a systematic fashion inside the directory structure alongside the scan. Paperwork uses the Tesseract optical character recognition (OCR) library to detect and extract page text. It then stores the extracted text in a foo.words file. Next, the scanned text is indexed and added to Paperwork's database (which allows the user to search across the document collection). Finally, a Bayesian text classifier is run on the text to suggest relevant labels. The user can also manually attach keywords to any scanned document; these keywords and the labels are also stored in files within the directory, making them, too, accessible to external programs.
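Because the text, keywords, and labels live in plain files next to the scans, external scripts can consume them without going through Paperwork at all. As a rough illustration, here is a minimal Python sketch that gathers the extracted text and labels from one document directory; the file names used (*.words for page text, a labels file) are assumptions for illustration, not Paperwork's documented on-disk format:

```python
import pathlib
import tempfile

def read_document(doc_dir):
    """Gather the plain-text artifacts stored beside the scans in one
    document directory.  The names used here (*.words for page text,
    a 'labels' file) are illustrative assumptions, not Paperwork's
    documented layout."""
    doc = pathlib.Path(doc_dir)
    text = "\n".join(p.read_text() for p in sorted(doc.glob("*.words")))
    labels_file = doc / "labels"
    labels = labels_file.read_text().split() if labels_file.exists() else []
    return {"text": text, "labels": labels}

# Build a throwaway document directory to demonstrate the idea.
doc_dir = pathlib.Path(tempfile.mkdtemp())
(doc_dir / "paper.1.words").write_text("limited warranty terms")
(doc_dir / "labels").write_text("legal phone")
doc = read_document(doc_dir)
```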

Paperwork is written in Python 2 and is available for installation from the Python Package Index. This should pull in most of the dependencies, although there is a dependency checker (called paperwork-chkdeps) included in the package. The latest release is version 0.3.1, from February 26. Once installed, the application is launched with paperwork. Any SANE-supported scanner should be usable, but Paperwork can also import image files and PDFs, so having a scanner on hand is not required.

Paperwork's workflow is largely automated. First, the user creates a new document in the internal collection, then scans pages to add to that document (or, alternatively, imports pages from external files). Once each page has been added to the document, Paperwork tries to automatically determine the orientation of the text, then runs the Tesseract OCR engine on the page in the background. Whenever the user hovers the cursor over the page, the interface pops up the words that Tesseract has detected; there is also a menu entry to show all of the detected text. The user can continue to scan and add new pages until all have been added, then save the result.

After completing the OCR pass over all the pages in a new document, Paperwork attempts to automatically assign labels to categorize the text. In practice, the classifier engine that takes this step needs a corpus of documents with existing labels in order to do its job. Thus, the user will have to manually assign every label to the first document, and will likely have to delete labels assigned overzealously to the first several documents. In the long run, the usefulness of the labels hinges on having a set of somewhat related documents. A weird outlier will either end up unlabeled or mislabeled. One regrettable missing feature is an ability to re-run the classifier on one or more documents; there is a feature to re-run the OCR step, so perhaps there is hope for the future.
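The labeling step can be illustrated with a toy multinomial naive-Bayes classifier, which also makes clear why a corpus of already-labeled documents is needed before the suggestions become useful. This is a from-scratch sketch of the general technique, not Paperwork's actual implementation:

```python
import math
from collections import Counter, defaultdict

class LabelSuggester:
    """Toy multinomial naive-Bayes text classifier, in the spirit of
    Paperwork's label suggestions (illustrative only)."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> nr. of documents

    def train(self, text, labels):
        for label in labels:
            self.doc_counts[label] += 1
            self.word_counts[label].update(text.lower().split())

    def suggest(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            n = sum(counts.values())
            vocab = len(counts) + 1
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                # Laplace smoothing so unseen words do not zero out a label
                score += math.log((counts[w] + 1) / (n + vocab))
            scores[label] = score
        return max(scores, key=scores.get)

suggester = LabelSuggester()
suggester.train("phone warranty legal terms", ["legal"])
suggester.train("electricity bill meter reading", ["utility"])
guess = suggester.suggest("warranty terms")
```

With only two training documents the suggestions are fragile, which mirrors the experience described above: the classifier only becomes trustworthy once a reasonable corpus of related, correctly labeled documents has accumulated.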

The search feature lets the user type words into the search box, and presents a live-updated list of matching documents below. Click on any of the matching documents in the list, and Paperwork will open it in a viewer window highlighting the location of each search term. Multi-word search terms are not supported, nor are logical operators.
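A single-word search like this can be served from a simple inverted index mapping each word to the documents containing it. The sketch below is not Paperwork's code, but it shows the general idea, and also how little it would take to extend such an index to multi-word AND queries: just intersect the per-word document sets.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Look up each query word and intersect the resulting sets,
    giving an AND semantics over the whole query."""
    words = query.lower().split()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

docs = {"warranty": "limited warranty terms",
        "manual": "user manual safety terms"}
idx = build_index(docs)
hits = search(idx, "terms warranty")
```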

The simplicity of the search feature might sound like a big limitation—after all, with a lot of documents, some way to narrow down the search would be nice. But the quality of the OCR stage is likely to be more important. Tesseract has come a long way in recent years; I tested it shortly after its initial open-source release ten years ago, when the scan results from other OCR engines were likely to be riddled with elementary spelling mistakes. Tesseract has always been better than the competition, but in those early years it tended to perform poorly on small font sizes, and it tended to erroneously flag images and shadows as text blocks.

For the sake of comparison, I ran Paperwork on some of the smallest text I could find: the legalese in the warranty booklet for a cell phone. Tesseract did admirably; it tagged several shadows, but with a single exception it did not "recognize" letters in them as it might have in the past. I found only two instances where it split a word into two segments, and none where it combined two words into one. Moreover, I do not think there were any spelling mistakes, which is a noteworthy achievement. It also seems to be geared toward detecting black-on-white text, to the point where it occasionally misses lighter print or text within images.

That said, free-software OCR is capable of doing more. Tesseract outputs text in the hOCR format, an XML format that stores each detected word along with the coordinates of its bounding box in the source image. Several other projects exist that can build on this word-level information to detect lines, paragraphs, and even larger page structure (think columns of text that wrap around images). The OCRopus engine does this internally, while GNOME's OCRFeeder can do page-layout detection on top of Tesseract output. While Paperwork does not need to detect page layout in order to run simple searches, at least reconstructing sentences would allow users to search for multi-word phrases.
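To illustrate what hOCR makes available, here is a short Python sketch that pulls each detected word and its bounding box out of a simplified, hand-written hOCR fragment. Grouping such boxes by their line coordinates is the first step toward reconstructing sentences and, eventually, page layout:

```python
import xml.etree.ElementTree as ET

# A hand-written, simplified hOCR fragment for demonstration purposes.
HOCR = """<div class="ocr_page">
  <span class="ocr_line" title="bbox 10 10 200 30">
    <span class="ocrx_word" title="bbox 10 10 60 30">limited</span>
    <span class="ocrx_word" title="bbox 70 10 150 30">warranty</span>
  </span>
</div>"""

def words_with_boxes(hocr):
    """Extract each ocrx_word element and its bounding box."""
    root = ET.fromstring(hocr)
    out = []
    for span in root.iter("span"):
        if span.get("class") == "ocrx_word":
            # The title property looks like "bbox x0 y0 x1 y1[; ...]"
            bbox = span.get("title").split(";")[0].split()[1:]
            out.append((span.text, tuple(int(v) for v in bbox)))
    return out

words = words_with_boxes(HOCR)
```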

Paperwork bills itself as a "scan and forget" tool, so it may be asking too much to expect it to detect page layout. As a mostly automated "document ingest" application, it certainly succeeds where other scanning programs for Linux desktops fall flat. What remains to be seen is whether scanning and forgetting works well over the long haul. The project is a small operation at present, without posted plans for where it intends to go in the future. The automatic labeling of contents will surely improve over time as Paperwork gets more familiar with a corpus of documents, but the bare-bones search functionality is likely to become a pain as one accumulates more and more scanned documents.


In late 2013, the CyanogenMod project introduced the WhisperPush service, a secure-messaging transport that provided end-to-end encryption and identity verification. In February 2016, however, CyanogenMod shut down the service and advised users to migrate to the Signal service provided by Open Whisper Systems. Although Signal is praised by experts for its solid security underpinnings, the demise of WhisperPush still comes at a cost for end users.

In 2013, Open Whisper Systems was maintaining two distinct Android apps: TextSecure, which provided encrypted instant messaging, and RedPhone, which provided encrypted voice calls. WhisperPush was built on top of an independent implementation of the TextSecure protocol, and enabled CyanogenMod users to exchange messages with one another as well as with any TextSecure user.

By the time work began in earnest for CyanogenMod 12.1, though, Open Whisper Systems had evolved the TextSecure protocol considerably, merging TextSecure's functionality with that of RedPhone to develop Signal. CyanogenMod briefly mounted an effort to drop the WhisperPush app and merge support for the service into its general-purpose Messaging app but, by January 2016, the project decided that the maintenance effort was not worth the support costs.

For CyanogenMod, most of those support costs seem to have been in running the server end of the WhisperPush service, rather than in developing the app. The service required running a message-passing server to relay encrypted exchanges between users, as well as a separate verification framework. WhisperPush, like TextSecure, used SMS text messages to verify user accounts "out of band"—that is, by sending the verification codes over the normal phone system, rather than through the WhisperPush service.
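The out-of-band flow itself is easy to sketch: generate a short random code, deliver it over SMS, and compare what the user types back. The toy Python below shows that shape only; a real service would add an SMS gateway, rate limiting, and code expiry, none of which is modeled here:

```python
import hmac
import secrets

class Verifier:
    """Toy sketch of out-of-band account verification: receiving the
    code texted to a phone number proves control of that number.
    Illustrative only; not WhisperPush's or TextSecure's code."""
    def __init__(self):
        self.pending = {}   # phone number -> outstanding code

    def start(self, phone_number):
        code = f"{secrets.randbelow(10**6):06d}"
        self.pending[phone_number] = code
        return code  # in reality, this would go out via an SMS gateway

    def confirm(self, phone_number, submitted):
        expected = self.pending.get(phone_number, "")
        # constant-time compare avoids leaking code digits via timing
        return hmac.compare_digest(expected, submitted)

v = Verifier()
code = v.start("+15551234567")  # hypothetical number
ok = v.confirm("+15551234567", code)
bad = v.confirm("+15551234567", "x")
```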

Running that service incurs a separate set of costs, including maintaining phone numbers in each region of the world where WhisperPush was available. Add to those costs the rapidly growing popularity of Signal among CyanogenMod users, and running the independent service no longer made sense for Cyanogen (the company).

The blog post announcing the shutdown encourages users to move to Signal, noting along the way that Open Whisper Systems helped develop WhisperPush in the beginning and has been a friendly partner along the way. The shutdown was finalized on February 1. Those who praise Signal are quick to point out that it, like WhisperPush, is open-source software on both the client and server side. Furthermore, Signal is now available for iOS (and is in beta testing as a Chrome-extension–based "desktop app"), making it useful to a wider assortment of users.

But there are also downsides to Signal, starting with the fact that it relies on Google's proprietary Google Play Services system to relay various event notifications (although, it should be noted, not to relay encrypted messages themselves). This introduces a possible privacy risk. As a single hub through which all Signal apps send some traffic, Google Play Services might be used by an attacker or law-enforcement agent as a place to collect metadata about Signal users. As many pundits have noted in the wake of the NSA surveillance scandal, metadata can be used to collect quite a bit of information about users even when message contents remain encrypted.

Additionally, the reliance on Google Play Services means that the Signal network has a dependency on a third-party service not under Open Whisper Systems's control. If the Google service ceases to be available (again, perhaps at the behest of law enforcement), that would interfere with all Signal users.

But the most fundamental issue may be that the shutdown of WhisperPush returns the "TextSecure ecosystem" to its previous state of being a monolithic service. Even if Signal users far outnumbered WhisperPush users, the fact that the two services were federated made both more resilient to trouble. If the Open Whisper Systems servers were taken down or compromised, the alternative might still be viable.

Certainly, motivated developers can develop their own interoperable implementations of the Signal protocol; the free-software community often takes up such causes and often with great success. There is one such effort at present, named LibreSignal, but Open Whisper Systems officially regards self-hosted Signal servers as unsupported. Interoperability with WhisperPush persisted for as long as it did because of the good working relationship that already existed between the projects.

Open Whisper Systems's resistance to service federation is unfortunate, but perhaps the project could be persuaded to relax that stance if a viable service produces reliable code and demonstrates the importance of providing users with a choice. For the time being, however, the decommissioning of WhisperPush leaves security-minded mobile users with one fewer avenue for safeguarding their private communication.


Recently, I took advantage of an unforeseen hardware problem to experiment with replacing one of the longest-running services on my home network: the MythTV service that records local broadcast television. The replacement I explored was Tvheadend (TVH), which provides a recording back-end and electronic program guide (EPG) but no dedicated playback front-end. For a number of reasons, TVH is easier to work with, although—for the time being—it does not boast quite as many features as MythTV.

To provide some background, the tuner hardware I use with MythTV is a pair of networked HDHomeRun devices. Each one is connected only to an antenna and they pick up only a handful of channels; there is no cable or satellite service in the mix. This puts my recording needs staunchly on the low end of the spectrum compared to other MythTV users, but such modest requirements do not make MythTV more cooperative to configure, use, or troubleshoot. When one of the HDHomeRun tuners started exhibiting flaky performance, I bought a replacement and decided to put the new device to the test with TVH—at least temporarily—just to gauge the difference.

MythTV has been around since 2002; it predates the digital-television transition and, thus, has a fair amount of technical debt to manage—stemming from the need to support analog tuners and encoders and due to the age of the codebase itself. TVH is much newer; it is focused specifically on supporting DVB and IP television (IPTV) services. Here in the United States, DVB is generally only found on free-to-air (FTA) satellite systems (DVB-S), while broadcast signals are sent in ATSC. Luckily, the HDHomeRun product line is the exception to TVH's DVB-centric focus (probably because the devices themselves are plug-and-play and the video formats are virtually identical MPEG-2 streams), so the devices let one record ATSC broadcasts.

Head on

I tested TVH version 4.0.8 running on an Ubuntu 14.04 machine—the same one running the MythTV backend. That older distribution choice is an artifact of using MythTV; MythTV updates have a nasty habit of breaking in subtle ways—to the point where users are usually advised to only run the service on long-term-support (LTS) releases. For deployment on a fresh system, TVH supports plenty of more recent distributions, as well as FreeBSD and Android.

The TVH installation process is simple enough; the installer prompts the user to create a username/password combination for use inside the application itself. Other users can be created after the fact, to separate configuration from recording privileges. The program runs as a single process (as the hts user) that exposes a web interface on http://localhost:9981. There are also access control list (ACL) and key-based options available to further restrict access, although they are not enabled by default.

System configuration is done entirely in the web interface. Notably, TVH uses a somewhat different nomenclature to describe the relationship between the available signals, program streams, and "channels" that one needs to enumerate. A "network" refers to a broadcast medium (e.g., over-the-air antenna, satellite provider, IPTV source, etc.); a "mux" refers to a tunable frequency that might carry one or more "services" (which constitute what most people call "channels" in the vernacular). But TVH does automatically discover most of the available tuning hardware and, when told what "network" to associate with a tuner, it will automatically scan to pick up whatever "muxes" and "services" it can find. The intent of this complex model is to allow TVH to combine multiple signal sources (terrestrial, satellite, and IPTV) into a single channel map, glossing over a number of potential differences in how the components are labeled. Consulting the user guide is helpful, though, particularly for making sense out of all of the configuration minutiae.
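The nomenclature maps naturally onto a small data model. The sketch below is purely illustrative (the callsigns and frequency are invented), but it shows how services from any kind of network end up flattened into one user-facing channel map:

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str          # what viewers would call a "channel"

@dataclass
class Mux:
    frequency_hz: int  # one tunable frequency...
    services: list = field(default_factory=list)  # ...carrying several services

@dataclass
class Network:
    medium: str        # "ATSC antenna", "DVB-S dish", "IPTV", ...
    muxes: list = field(default_factory=list)

def channel_map(networks):
    """Flatten every service from every signal source into one list,
    regardless of which kind of network it came from."""
    return [s.name for net in networks
                   for mux in net.muxes
                   for s in mux.services]

# A hypothetical over-the-air network with one mux carrying two services:
ota = Network("ATSC antenna",
              [Mux(605_000_000, [Service("KXYZ-HD"), Service("KXYZ-2")])])
channels = channel_map([ota])
```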

After configuring program reception, the next most critical setup task is getting EPG information. In my test, this was TVH's weakest point by far. Although the program supports an array of EPG sources, none of them worked for over-the-air broadcast in the US. Worse, the relevant chapter in the manual is filled with empty sections and the forums are filled with frustrated users asking the same questions and getting no solutions. Eventually, I tracked down an external blog post that pointed toward a solution; sadly, the solution is an ugly workaround for what is (apparently) TVH's broken support for the XMLTV guide-data grabber. It seems, however, that users in DVB regions of the globe have a much easier time with this step.
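The workaround boils down to running an XMLTV grabber externally and delivering its output to TVH directly over the external-grabber socket, bypassing the broken built-in invocation. The Python sketch below shows the shape of that; the socket path is an assumption based on TVH's default configuration directory, and the demonstration only validates a guide document rather than actually connecting:

```python
import socket
import xml.etree.ElementTree as ET

# Assumed default location of TVH's external XMLTV grabber socket;
# adjust for your installation.
XMLTV_SOCK = "/home/hts/.hts/tvheadend/epggrab/xmltv.sock"

def validate_guide(xml_bytes):
    """Sanity-check that the payload is a well-formed XMLTV <tv>
    document before pushing it anywhere."""
    root = ET.fromstring(xml_bytes)
    return root.tag == "tv"

def push_to_tvheadend(xml_bytes, path=XMLTV_SOCK):
    """Write a complete XMLTV document (e.g. the output of a tv_grab_*
    grabber run from cron) straight to TVH's grabber socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall(xml_bytes)

# Demonstrate the validation step with a tiny hand-written guide:
guide = b"<tv><channel id='local.kxyz'><display-name>KXYZ</display-name></channel></tv>"
ok = validate_guide(guide)
```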

With EPG data in place, one can schedule single or recurring recordings from within the web interface. The UI for doing so is not sophisticated; programs are presented in a long, spreadsheet-style list (XMLTV grabs two weeks' worth of EPG data at a time) that one can search or filter on a number of fields (episode titles, service, content tags, etc.). The giant list of programs is less user-friendly than the classic "TV Guide" grid, but at least the search and filter functions are easy to use—which cannot be said for MythTV, where searching and filtering features are split into several sections placed rather far apart from each other in the configuration-screen hierarchy.

As for recordings themselves, TVH does little to nothing with them after they are saved to disk; the intent is that users will access their content through another front-end like Kodi. On the plus side, this makes managing content easier in some respects: normal filesystem tools are all that one needs to track and delete files, or one can simply rely on the tools built into the front-end. In contrast, MythTV relies on a background process that schedules deletions using a lengthy list of overlapping criteria. TVH also allows complete freedom in where individual recordings are stored and how the files are named; MythTV provides neither feature.
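Because TVH leaves recordings as ordinary files, a retention policy can be a few lines of scripting rather than a background daemon. This hypothetical Python sketch deletes recordings older than a cutoff; the .ts suffix and flat directory layout are assumptions for illustration:

```python
import os
import pathlib
import tempfile
import time

def prune_recordings(directory, max_age_days, suffix=".ts"):
    """Delete recordings older than max_age_days and report what was
    removed.  With TVH, plain filesystem tools like this are all the
    retention machinery one needs."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in pathlib.Path(directory).glob(f"*{suffix}"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)

# Demonstration against a throwaway directory:
rec_dir = pathlib.Path(tempfile.mkdtemp())
old, new = rec_dir / "old_show.ts", rec_dir / "new_show.ts"
old.touch()
new.touch()
os.utime(old, (time.time() - 40 * 86400,) * 2)  # backdate one file 40 days
removed = prune_recordings(rec_dir, 30)
```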

The late night wars

Setup and configuration is the hardest part of using MythTV. For starters, MythTV's back-end process (which handles EPG and scheduling recordings) can only be configured through a GUI application, making it a poor fit for headless systems. More importantly, though, that configuration tool uses a mix of keyboard-only commands that are hard to discover and onscreen elements that can only be accessed with a mouse. Many of the settings must be altered by stepping sequentially through a lengthy series of screens (at times, ten or more).

And the interface itself is not the only awkward factor: tweaking MythTV's behavior is difficult. To give one brief example, the decision of which tuner should be used to record a particular program (e.g., to cope with one antenna receiving a signal stronger than another) involves at least six settings: the "priority" of the tuner/channel mapping, a "preferred input" setting attached to the recording rule, the "priority" setting of the recording rule, separate "priority" points assigned for HDTV and wide-screen broadcasts, and the order in which the tuners were initially configured during setup. How exactly they interact is one of the project's enduring mysteries.

On that type of issue, TVH is far ahead of the competition. Its documentation is clearly missing in places, but setup is certainly simpler and better organized than MythTV's. That said, it is also possible that TVH's design seems friendlier to those who have experience with Unix systems, because it relies on several system services. The EPG grabber, for example, adds crontab entries to schedule updates, whereas MythTV runs a separate background daemon to fetch and update guide data. Which approach is "better" may be an opinion question, but crontab is far better documented.

The other vital factor to keep in mind if one is debating a move from MythTV to TVH is that TVH is still limited primarily to DVB recording hardware. In contrast, a great many MythTV users record programs from cable or satellite set-top boxes using external encoders like the Hauppauge HD-PVR. As near as I can tell, TVH does not support this hardware use case at all, though there are some potential workarounds if one is willing to write external scripts. And, to be fair, MythTV's support for these devices relies on its share of external workaround scripts, too—they simply ship with the code.

Stay tuned

As I alluded to in September, I have been trying to streamline and simplify the various services running throughout the house, and reports were that TVH would be less convoluted to use than MythTV. Based on a few days of testing, this appears to be the case when it comes to managing hardware and scheduling recordings, although TVH still falls short on EPG support for the region I live in.

According to the project, the upcoming 4.2 release will improve support for the North American ATSC broadcast format—among other things, adding the ability to extract embedded EPG information directly from the broadcast stream. So it is a release that I plan to watch for. Those users living in DVB regions of the world will probably have an easier time getting TVH up and running; for those users, the program offers substantial improvements over MythTV.
