Illustration by Jiin Choi

A Sign from Below

Running text analysis on hundreds of homeless people’s signs can synthesize their collective message.

New Yorkers will recognize two general forms of supplication employed by the local homeless population. One strategy is to verbalize your plight, either at the corner of an intersection or on a subway car between stops, which can draw both pocket change and ire from surrounding travelers. The other is to hold up a cardboard sign upon which your message is scrawled.

Cardboard signs are advantageous for the same reasons that text usually outperforms the spoken word: it’s faster, more scalable, and lets an audience opt out. Opting out, unfortunately, is what most of us do when passing a homeless person and their sign. It’s an act of convenience but also of realism: there are over 60,000 homeless people in New York City, and engaging with every one you encounter seems like a tall order.

So, assuming you don’t stop to read most homeless people’s signs, what do you think the average one says?

Step 1 of many in figuring out what New York’s homeless write on their signs.

In truth, gathering and transcribing this data deserves a blog post of its own, but here’s the drive-by recap of how it was done. The search term “New York City homeless sign” and others like it were entered into Google, Twitter, and Instagram, yielding several hundred usable photos from news articles, social media posts, and personal photography blogs.

These signs were difficult to transcribe. The photos were frequently of poor quality, out of focus, or taken at an unhelpful angle. And even when an entire sign was clearly visible, other hurdles awaited. Homeless people use particular (and often incorrect) spelling, punctuation, line breaks, and handwriting. Sometimes their message is not a single string of words but various thoughts sprinkled across the cardboard with few clues as to sequence.

Homeless signs can feature scattered, illegible thoughts, making them difficult to transcribe.

Several polishing protocols were used to homogenize the text and make it better suited for parsing and analysis, such as correcting slight typos and replacing symbols with their full words (“4” to “for,” “+” to “and”). Ultimately the text mining package in R would ignore capitalization, line breaks, and punctuation anyway, so many inconsistencies were destined for standardization.
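To make the cleanup step concrete, here is a minimal sketch of that kind of normalization. The original work was done in R; this uses Python purely for illustration, and the function name `normalize_sign` and the substitution table are assumptions. Only the “4” to “for” and “+” to “and” replacements come from the description above, and typo correction is not attempted.

```python
import re

# Hypothetical substitution table; the post names only these two replacements.
SYMBOL_WORDS = {"4": "for", "+": "and"}

def normalize_sign(text: str) -> list[str]:
    """Lowercase a transcribed sign, expand symbols, and drop punctuation."""
    # Treat line breaks as spaces and ignore capitalization.
    text = text.replace("\n", " ").lower()
    # Use straight apostrophes so contractions like "i'll" survive the cleanup.
    text = text.replace("\u2019", "'")
    # Pad "+" so it becomes its own token even when written as "food+water".
    text = text.replace("+", " + ")
    tokens = []
    for raw in text.split():
        # Expand a symbol only when it stands alone, so "$40" stays a number.
        word = SYMBOL_WORDS.get(raw, raw)
        # Keep letters, digits, and apostrophes; strip all other characters.
        word = re.sub(r"[^a-z0-9']", "", word)
        if word:
            tokens.append(word)
    return tokens

# Made-up sign text, not taken from the dataset.
print(normalize_sign("NEED $ 4 FOOD\n+ WATER. God Bless!"))
```

Expanding a symbol only when it stands alone is a deliberate choice in this sketch: a blanket replacement of every “4” would mangle dollar amounts and dates.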

In the end, 244 signs were aggregated and prepped for analysis (the dataset is here, complete with URLs). For the record, 244 signs is a microscopic corpus for text mining. Usually massive tomes like War and Peace or millions of accumulated tweets are fed into an analysis of this kind. Still, real patterns emerged.

Sifting through hundreds of these signs reveals the various methods of persuasion that homeless people adopt. Some try to be clever (I’LL BET YOU $1 YOU’LL READ THIS SIGN), others topical (I HAD AN AFFAIR WITH TIGER WOODS NOW LOOK WHERE I’M AT). Frequently they portray themselves as more sympathetic characters, identifying as pregnant mothers, military veterans, or robbery victims.

Often they’re a step ahead of your critical questions. Why are they without home? HUSBAND DIED NO INSURANCE LOST EVERYTHING HOMELESS CAN YOU HELP GOD BLESS HAPPY THANKSGIVING. Can’t they go somewhere else? NEED A BUS TICKET TRYING TO MAKE $35 TO GET TO ATLANTIC CITY, NJ I HAVE A PLACE TO LIVE AND A JOB OPPORTUNITY OUT THERE.

Enough anecdotal evidence, though. This is a data blog, so we need numbers. Here are the top 25 terms:

As the words’ frequency on signs demonstrates, a homeless person’s priorities are to ask for assistance (“help”) and identify themselves (“homeless”), all the while remaining courteous (“please”).

For text mining exercises, common words like “and” or “the” are often omitted since they crowd out more meaningful terms on a popularity list. Interestingly, such words were not deleted from this sample, yet they still did not top our list, which suggests two things. First, sign writers tend to skip such predictable words, probably to save space and time. Second, words like “help” and “homeless” are remarkably prevalent: our sign-bearers use them about as often (4%) as normal text relies on ubiquitous terms like “the.”
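The relative frequencies discussed here are just each word’s count divided by the total number of words in the corpus. The following is a rough sketch of that arithmetic in Python (the original analysis used R); the function name `relative_frequencies` and the toy input are assumptions, not the post’s actual script or data.

```python
from collections import Counter

def relative_frequencies(signs: list[list[str]], top: int = 25) -> list[tuple[str, float]]:
    """Return the `top` most common words with their share of all words."""
    # Pool the words from every sign into one tally.
    counts = Counter(word for sign in signs for word in sign)
    total = sum(counts.values())
    if total == 0:
        return []
    # Divide each count by the corpus size to get a relative frequency.
    return [(word, n / total) for word, n in counts.most_common(top)]

# Toy input for illustration only; each inner list is one tokenized sign.
toy_signs = [
    ["please", "help", "homeless"],
    ["homeless", "please", "help", "god", "bless"],
]
for word, share in relative_frequencies(toy_signs):
    print(f"{word}: {share:.1%}")
```

Because no stop words are removed in this sketch, words like “the” would compete on equal footing, which mirrors the choice described above.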

You likely recognize the above y-axis as individual words, but they might also be referred to as unigrams, or the simplest form of an “n-gram,” which is technical speak for a sequence of n words. We can also sort for bigram frequency, or the occurrence of two-word phrases:

[Bigram counts can’t simply be divided by the word count to get relative frequencies, so the x-axis was left as absolute frequency.]

Predictable pairings of words from the first chart start to emerge. “Please” and “help” are each used frequently on their own, but they are also frequently used together. The same goes for “God” and “bless.”
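Counting bigrams amounts to pairing each word with the one that follows it and tallying the pairs. Below is a minimal Python sketch of that idea (the original analysis used R); the function name `bigram_counts` and the sample token lists are hypothetical. It reports absolute counts, in keeping with the chart note above.

```python
from collections import Counter

def bigram_counts(signs: list[list[str]]) -> Counter:
    """Tally adjacent word pairs, one sign at a time."""
    counts = Counter()
    for tokens in signs:
        # Pair each word with its successor; pairs never span two signs.
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy input for illustration only; each inner list is one tokenized sign.
toy_signs = [
    ["please", "help", "god", "bless"],
    ["please", "help", "thank", "you", "god", "bless"],
]
for pair, n in bigram_counts(toy_signs).most_common(3):
    print(" ".join(pair), n)
```

Counting within each sign separately avoids inventing pairs that join the last word of one sign to the first word of the next.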

It should be noted that creating bigrams, trigrams, and so on is the basis for simulating new text based on the old. Trying to fake a natural-sounding sentence by randomly selecting words would have comically poor results. But if you know which sequences are likely, you can begin to piece together phrases and sentences as if they were train cars.

As a rudimentary example, if you started with the word “please,” your bigram frequency numbers could help you predict the next word, “help,” from which you could link your way to “thank,” “you,” “God,” and “bless.” This is known as Markov text generation, and it’s the mechanism behind online (and sometimes nefarious) bots that can mimic human writing.
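The chaining described above can be sketched in a few lines. This is an illustrative Python version, not the script used for this post (which was written in R); `build_chain`, `generate`, and the `order` and `max_words` parameters are all assumed names. With `order=1` each step looks at one preceding word, which is the bigram case.

```python
import random
from collections import defaultdict

def build_chain(signs, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    chain = defaultdict(list)
    for tokens in signs:
        for i in range(len(tokens) - order):
            state = tuple(tokens[i:i + order])
            # Repeated followers stay in the list, so common ones get picked more often.
            chain[state].append(tokens[i + order])
    return chain

def generate(chain, start, max_words=8, rng=None):
    """Walk the chain from `start` (a tuple of words) until a dead end or the cap."""
    rng = rng or random.Random()
    words = list(start)
    order = len(start)
    while len(words) < max_words:
        followers = chain.get(tuple(words[-order:]))
        if not followers:
            break  # no sign ever continued past this state
        words.append(rng.choice(followers))
    return " ".join(words)

# Toy input for illustration only; a single tokenized sign.
toy_signs = [["please", "help", "thank", "you", "god", "bless"]]
chain = build_chain(toy_signs, order=1)
print(generate(chain, ("please",)))
```

Passing `order=2` and a two-word `start` would condition each step on the previous two words instead, which is the trigram variant of the same walk.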

The longer the n-gram, the more human the simulated writing will sound, since your building blocks are larger pieces of real text. Here’s a screenshot of using trigrams from our sample of homeless signs to generate short sentences. Some of the results are completely nonsensical, but others you could imagine being taken from an actual sign:

Alas, there’s something a little messed up about using these signs to create fake text when homelessness is all too real a problem in New York City and other urban areas, so let’s end our analysis here. If this post has piqued your interest in text mining, you might want to check out this R walkthrough on the subject or this project by Andrey Kotov, both of which informed this post. You can also view the script and data used for the analysis, as always, in the repository linked in the postscript.

And, finally, if you’d like to make it harder for data analysts to find homeless people and their signs, consider donating to the Bowery Mission, which provides shelter, food, and clothing for New York’s homeless.