How do you find 10,000 needles in 50 haystacks?

That, in effect, is what journalists and developers with USA TODAY and The Arizona Republic set out to do two years ago: Identify among the roughly 100,000 bills introduced in the 50 states each year what's been copied from drafts pushed by special interests.

Here’s how we did it.

Using data provided by LegiScan, which tracks every proposed law introduced in the U.S., we pulled in digital copies of nearly 1 million pieces of legislation introduced between 2010 and Oct. 15, 2018. The data included a limited number of bills from 2008 and 2009.

We then asked a dozen reporters covering state legislatures for USA TODAY Network newsrooms across the nation to build a list of model bills by searching special-interest groups' websites, scouring news coverage and interviewing lobbyists and lawmakers. We identified more than 2,100 models, a list that is far from complete because many groups don't make their models public.

We then used a computer algorithm designed to recognize similar words and phrases and compared each model in our database to the bills that lawmakers had introduced.

These comparisons were powered by the equivalent of more than 150 computers, called virtual machines, that ran nonstop for months.

How did we compare bills with model legislation?

Even with that computing power, we couldn’t compare every model in its entirety against every bill. To cut computing time, we first narrowed the comparisons using keywords such as “guns” and “abortion.” Some bills have 30 to 40 keywords associated with them.

The system only compared a model with a bill if they had at least one keyword in common.
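As an illustration only, a keyword filter of this kind could be sketched in Python as follows. The function names and the dictionary layout (an ID mapped to a set of keywords) are assumptions made for the example, not the system's actual code.

```python
def shares_keyword(model_keywords, bill_keywords):
    """Return True if the two keyword collections have at least one keyword in common."""
    return not set(model_keywords).isdisjoint(bill_keywords)


def candidate_pairs(models, bills):
    """Yield (model_id, bill_id) pairs that share at least one keyword.

    `models` and `bills` are hypothetical dicts mapping an ID to a set of keywords.
    """
    # Index bills by keyword so each model only touches bills filed
    # under one of its own keywords, instead of scanning every bill.
    bills_by_keyword = {}
    for bill_id, keywords in bills.items():
        for keyword in keywords:
            bills_by_keyword.setdefault(keyword, set()).add(bill_id)

    for model_id, keywords in models.items():
        matched = set()
        for keyword in keywords:
            matched |= bills_by_keyword.get(keyword, set())
        # Sorting only makes the output order predictable.
        for bill_id in sorted(matched):
            yield model_id, bill_id
```

Only the pairs this filter yields would go on to the more expensive text comparison.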

If there was a keyword match, the system compared the documents looking for strings of six or more words that appeared in both. For this search, the system used “stemmed” words, meaning they had been converted to their root. (For example, walk, walks, walked, and walking all become walk.)
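A rough Python sketch of that check is below. It is a simplified stand-in, not the system's code: the `toy_stem` function strips only a few suffixes, where a real system would more likely use an established stemmer, and all names are hypothetical. Because any shared string of six or more words contains a shared run of exactly six, the sketch only needs to compare six-word windows.

```python
import re


def toy_stem(word):
    """Crude stemmer for illustration only: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_words(text):
    """Lowercase the text, split it into words and stem each one."""
    return [toy_stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())]


def word_ngrams(words, n=6):
    """Return the set of every run of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shares_six_word_string(model_text, bill_text, n=6):
    """True if the two texts share at least one run of n stemmed words."""
    model_runs = word_ngrams(stemmed_words(model_text), n)
    bill_runs = word_ngrams(stemmed_words(bill_text), n)
    return bool(model_runs & bill_runs)
```

With this toy stemmer, "walks," "walked" and "walking" all reduce to "walk," so small changes in word endings do not hide a match.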

If a bill and a model shared at least one keyword and one six-word string, the system assigned a score reflecting how similar the two documents were.

How our scoring system worked

Our scoring system is based on three factors: the longest string of common text between a model and a bill; the number of common strings of five or more words; and the number of common strings of 10 or more words.

Based on those factors, bills received scores on a 100-point scale. The closer to 100, the more likely a bill was copied from model legislation.
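The three factors can be computed with a short Python sketch like the one below. It rests on assumptions: a "common string" is counted here as a maximal run of identical consecutive words, so a passage repeated in either document can be counted more than once, and the function names are hypothetical. How the factors are combined into the 100-point score is not described here, so the sketch stops at the factors themselves.

```python
def common_run_lengths(model_words, bill_words):
    """Return the length of every maximal run of words shared by both lists."""
    runs = []
    width = len(bill_words)
    # previous[j] = length of the shared run ending at the prior model word
    # and at bill word j (1-indexed); 0 means the words did not match.
    previous = [0] * (width + 1)
    for i in range(1, len(model_words) + 1):
        current = [0] * (width + 1)
        for j in range(1, width + 1):
            if model_words[i - 1] == bill_words[j - 1]:
                current[j] = previous[j - 1] + 1
        # A run that ended in the prior row is maximal if this row did not extend it.
        for j in range(1, width + 1):
            if previous[j] and (j == width or current[j + 1] == 0):
                runs.append(previous[j])
        previous = current
    # Runs still open at the last model word cannot be extended further.
    runs.extend(length for length in previous[1:] if length)
    return runs


def scoring_factors(model_words, bill_words):
    """Compute the three factors named above from two lists of words."""
    runs = common_run_lengths(model_words, bill_words)
    return {
        "longest_run": max(runs, default=0),
        "runs_of_5_plus": sum(1 for length in runs if length >= 5),
        "runs_of_10_plus": sum(1 for length in runs if length >= 10),
    }
```

This version compares every word of one document with every word of the other, so it is far too slow for a million bills; it is meant only to show what is being measured.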

For its analysis, USA TODAY/Arizona Republic used only bills that scored 80 or higher. At that level, substantial amounts of text have been duplicated.

Another estimated 10,000 bills below the 80-point threshold were likely copied from model legislation but matched less of the model's text. Out of caution, USA TODAY/Arizona Republic cited in its investigation only bills with substantial portions copied from a model. In addition, if legislators copied an idea but not the precise language, a bill would not be flagged.

Joe Walsh, a former data scientist at the University of Chicago, used what’s known as the Smith-Waterman algorithm to create the Legislative Influence Detector, which also finds similarities between model legislation and bills. His system has been used by reporters around the country to find model bills.

Walsh reviewed USA TODAY/Arizona Republic’s investigation and findings and applauded its scoring system for showing when a bill has been substantially copied from model legislation.

“It’s really clear, the numbers are nice and round, and it’s easy to show and explain,” Walsh said. “I wish that we were able to do some of this stuff. I am glad someone is.”

Can I examine the results?

USA TODAY/Arizona Republic continues to search legislation and compare it with known model bills from around the country, furthering its investigation of outside influences on state lawmakers.

Initially, the system is being rolled out to USA TODAY Network journalists for use in reporting on state legislatures.

How were bills categorized?

Special-interest groups, both liberal and conservative, have for years crafted and lobbied for model bills. Generally, the organizations that craft the bills have a clear mission or ideological bent. The American Legislative Exchange Council, the best-known and one of the most prolific model-bill factories, supports conservative ideas and efforts. The State Innovation Exchange, once known as ALICE, is in effect ALEC's liberal counterpart. We classified bills based on the mission or ideological orientation of the organization that created each model. In some cases, groups with a conservative bent also push bills that benefit industry. In those cases, we labeled each bill according to its dominant characteristic.

How can I help?

If you know of a model bill, particularly one that you think has not been made public, we want to hear from you. Please complete this form and include the text of the model bill. We will try to include it in our system.