Jordan Scrapes the Idaho Board of Medicine for Newly Licensed Doctors

Demo code here

I have a physical therapist friend that started me on this idea. He mentioned one time at lunch that most of his referrals come from doctors and so he tries to build relationships with them so they will trust him enough to send the referrals to him. A lot of the doctors, however, already have good relationships with other physical therapists so it’s hard for him to get in the door.

That’s where this idea came to me. Surely doctors’ medical licenses are a matter of public record. With a little googling, I stumbled upon the Idaho Board of Medicine. Here, you can view every doctor’s license and when it was issued. And, most importantly, you can search by license date. That means I can find the new doctors that are just ripe and waiting to be relationshipped by an ambitious PT.

Another interesting note is that there were a LOT of out-of-state doctors. There were far too many for it to just be doctors who went to school out of state and moved to Idaho when they finished. So I called the Bureau and they said that these were doctors doing telemedicine and that it was a field that was exploding. They must need a PT to refer to, so I decided to keep them as well.

It’s also worth noting, as a shortcoming of this script, that it doesn’t always get a valid phone number. The Idaho Board of Medicine, at least, only lists a business phone number, and it’s not required, so it’s often blank. I would expect this to vary from state to state.

Now to the actual code. I’m going to be honest, this was on the less enjoyable side of sites to scrape. It was all built with HTML tables and cells without unique identifiers, so I had to find the table, put all the cells into an array and then just get the index of the cell I was looking for. It was a pain and didn’t make for the cleanest code.

    export const config = {
        mongoUser: "<mongoUser>",
        mongoPass: "<mongoPass>",
        mongoUrl: "<mongoUrl>",
        mongoDB: "<mongo-db>",
        mongoCollection: "<mongo-collection>",
        discordWebhook: "<url>"
    };

I use this code in a production environment right now, so I have a config file set up with the important credentials. I’ve included a src/sample-config.ts file that has the above fields. You just need to replace them with your own credentials. I use a mongodb database and discord to handle notifications. You could use any database you wanted but you’d need to modify the code accordingly.

I use Puppeteer for this scrape. Puppeteer is entirely promise based, which allows us to use async/await. Because it’s awesome.

    (async () => {
        // awesome code here
    })();

Check this for more explanation on async/await.

I also run a lot of my Puppeteer scrapers on a Digital Ocean (which I love) ubuntu box, so I have a block at the top that makes Puppeteer work in ubuntu. I have scripts set up in package.json that pass in arguments for running with a visible browser (withHead) and/or on ubuntu.

"scripts": { "getNewContacts": "tsc && node ./dist/getNewContacts.js", "getNewContacts:withHead": "tsc && node ./dist/getNewContacts.js withHead", "getNewContacts:ubuntu": "tsc && node ./dist/getNewContacts.js ubuntu" },

I start this script off by initializing my mongo database and then pulling all of the current doctors’ contacts that I have. I really just get the array of all the current ones and then just add new ones into it in this script. I also handle the ubuntu and headless arguments here to change our puppeteer config accordingly.

    const dbUrl = `mongodb://${config.mongoUser}:${config.mongoPass}@${config.mongoUrl}/${config.mongoDB}`;
    const db = await dbHelper.initializeMongo(dbUrl);

    let contacts = await dbHelper.getAllFromMongo(db, config.mongoCollection, {}, {});
    const originalCount = contacts.length;

    let ubuntu = false;
    let headless = true;
    if (process.argv[2] === 'ubuntu' || process.argv[3] === 'ubuntu') {
        ubuntu = true;
    }
    if (process.argv[2] === 'withHead' || process.argv[3] === 'withHead') {
        headless = false;
    }
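The argument handling above could also be pulled into a small helper, which makes it easier to test and doesn't care which position a flag appears in. This is only a sketch: `parseRunFlags` and `RunFlags` are names I made up for illustration and are not in the original script.

```typescript
// Hypothetical helper; not part of the original script.
interface RunFlags {
    ubuntu: boolean;
    headless: boolean;
}

// Accepts the command line arguments that follow the script name.
function parseRunFlags(args: string[]): RunFlags {
    return {
        // 'ubuntu' turns on the sandbox-related launch arguments
        ubuntu: args.indexOf('ubuntu') !== -1,
        // headless is the default; 'withHead' shows the browser window
        headless: args.indexOf('withHead') === -1
    };
}

const exampleFlags = parseRunFlags(['withHead']);
// exampleFlags.ubuntu is false and exampleFlags.headless is false
```

In the script itself this would presumably be called with `process.argv.slice(2)`.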

Next, I do the main part of the work and send off the contacts and our config parameters to the bulk of the code, which is over in the getListOfLicenses function.

    contacts = await getListOfLicenses(contacts, ubuntu, headless);
    const newCount = contacts.length;

Now, the good stuff! The function accepts the arguments and then I set up my puppeteer browser with the arguments I set previously.

    export async function getListOfLicenses(foundDetails: any[] = [], ubuntu = false, headless = true, date?: string) {
        let browser: Browser;
        try {
            if (ubuntu) {
                browser = await puppeteer.launch({
                    headless: true,
                    args: [`--window-size=${1800},${1200}`, '--no-sandbox', '--disable-setuid-sandbox']
                });
            } else {
                browser = await puppeteer.launch({
                    headless: headless,
                    args: [`--window-size=${1800},${1200}`]
                });
            }

The Idaho Board of Medicine has a form as a springboard into the rest of the data. With forms I normally try to go directly to the url that the form is POSTing or navigating to. With this one it was kind of messy. It appears to use a hashed asp.net viewstate as an added layer of security. I guess it kind of worked because I was unable to POST to it directly.

Viewstate is huge.

Another kind of funny thing is that they have a datepicker on their form that actually doesn’t work. The datepicker sets a date with slashes, like 1/28/2019, and I guess slashes are invalid characters. When I mentioned this to my PT friend, he first responded with…yeah, this site doesn’t work at all. It makes me wonder how many people are actually taking advantage of this knowledge, because either the Board has received the bug reports and just isn’t doing anything or it hasn’t received enough reports to even know about it.

Datepicker actually breaks the form.

This is a script that runs automatically every day, so we just check whatever we missed in the time since the last scrape. So I take the current day and subtract one. However, I have to account for the first of the month as well. So I check that subtracting one from the current day gives a positive number. If it doesn’t, I just use the 1st. This does mean that we would miss the last day of the month. But we don’t care, because then we’ll just get two days’ worth of data instead of one and our code won’t break.

    const url = 'https://isecure.bom.idaho.gov/BOMPublic/LPRBrowser.aspx';
    const page = await setUpNewPage(browser);
    await page.setViewport({ height: 1200, width: 1900 });
    await page.goto(url);

    if (!date) {
        const d = new Date();
        const currentDay = d.getDate();
        let desiredDay = 1;
        if (currentDay - 1 <= 0) {
            desiredDay = currentDay;
        } else {
            desiredDay = currentDay - 1;
        }
        const month = d.getMonth() + 1;
        const year = d.getFullYear();
        date = `${month}-${desiredDay}-${year}`;
    }

    await page.type('#ctl00_CPH1_txtsrcOriginalLicenseDate', date);
    await page.click('#ctl00_CPH1_btnGoFind');
    await page.waitForSelector('#ctl00_CPH1_PnlGrid');
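If you would rather not skip the last day of the month, one alternative to clamping at the 1st is to let the Date constructor do the rollover. This is a sketch of a different technique than the one used in the script; `previousDaySearchDate` is a hypothetical name, and it assumes the form keeps accepting the same dash-separated month-day-year format used above.

```typescript
// Hypothetical alternative to the day clamping in the script; not the original code.
function previousDaySearchDate(now: Date): string {
    // A day of 0 (or less) rolls back into the previous month, and the
    // previous year when needed, so the 1st becomes the last day of the prior month.
    const yesterday = new Date(now.getFullYear(), now.getMonth(), now.getDate() - 1);
    const month = yesterday.getMonth() + 1; // getMonth() is zero-based
    return `${month}-${yesterday.getDate()}-${yesterday.getFullYear()}`;
}

// February 1, 2019 (months are zero-based in the Date constructor)
const exampleDate = previousDaySearchDate(new Date(2019, 1, 1));
// exampleDate is '1-31-2019'
```

The result could then be passed as the optional date argument of getListOfLicenses.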

The date is our only search parameter. After typing it in, we click the search button and wait for the results grid to appear with await page.waitForSelector('#ctl00_CPH1_PnlGrid');. Next I get the total number of pages so I know how much pagination I will need to do.

    let currentPage = 1;
    let totalPages = 1;
    const totalPagesHTML = await getPropertyBySelector(page, '#ctl00_CPH1_lblPage strong', 'innerHTML');
    totalPages = parseInt(totalPagesHTML.trim().split(' ')[3]);
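Splitting on spaces and grabbing index 3 works as long as the label never changes shape. Here is a slightly more defensive sketch. I'm assuming the label text looks something like "Page 1 of 12" (which is what the index of 3 suggests, but I haven't confirmed it), and `parseTotalPages` is a hypothetical helper name.

```typescript
// Hypothetical helper; the "Page X of Y" label format is an assumption.
function parseTotalPages(labelHtml: string): number {
    // Look for the number that follows the word "of"
    const match = /of\s+(\d+)/i.exec(labelHtml);
    if (!match) {
        // Fall back to a single page when the label can't be parsed
        return 1;
    }
    return parseInt(match[1], 10);
}

const examplePages = parseTotalPages(' Page 1 of 12 ');
// examplePages is 12
```

Falling back to 1 means a changed label would only scrape the first page instead of producing NaN in the loop condition.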

Then, I start the loop. I have a separate function here that goes into more specifics for handling each row. This is mainly due to the fact that it’s really messy going through their HTML table structure.

    while (currentPage <= totalPages) {
        console.log('Searching page ****** ', currentPage);

        let rows = await page.$$('.GridItemStyle');
        if (rows) {
            await handleRows(rows, browser, foundDetails);
        }

        let aRows = await page.$$('.GridAItemStyle');
        if (aRows) {
            await handleRows(aRows, browser, foundDetails);
        }

        currentPage++;
        await page.click('#ctl00_CPH1_btnNext');
        await delay(750);
    }

And now for the handleRows function. I am checking here to make sure it’s a ‘New License’ and that I haven’t already found this license. I check this by using what I thought would be a unique identifier, the license number. Because I don’t have any unique css classes or ids, I have to just get the cells of each row into an array and luckily, because it’s consistent, I just make a legend of what each cell is.

    export async function handleRows(rows: ElementHandle[], browser: Browser, foundDetails: any[] = []) {
        try {
            for (let i = 0; i < rows.length; i++) {
                const cells = await rows[i].$$('td');

                /**
                 * Result cells legend
                 * 0 - Details image/link
                 * 1 - Name (Last, First)
                 * 2 - License number
                 * 3 - Expiration
                 * 4 - Current
                 * 5 - Status
                 * 6 - Status date
                 * 7 - Actions
                 * 8 - Posting date
                 * 9 - City, State, Zip
                 * 10 - Profession
                 * 11 - License Type
                 */
                const licenseStatus = await getPropertyByHandle(cells[5], 'innerHTML');
                const licenseNumber = await getPropertyBySelector(cells[2], 'a', 'innerHTML');

                if (licenseStatus.trim() === 'New License' && !foundDetails.find(details => details.number === licenseNumber.trim())) {
                    const detailsUrl = await getPropertyBySelector(cells[0], 'a', 'href');
                    foundDetails.push(await getDetails(browser, detailsUrl));
                }
            }
            return Promise.resolve();
        } catch (e) {
            return Promise.reject(`Error in rows, ${e}`);
        }
    }
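One way to make the index-based lookups a little less magic would be to turn the legend into an enum and pull the "do I want this row?" check into its own function. This is just a sketch: `ResultCell`, `KnownLicense` and `shouldFetchDetails` are hypothetical names, and the indices simply mirror the legend in handleRows.

```typescript
// Hypothetical names; the indices mirror the legend comment in handleRows.
enum ResultCell {
    DetailsLink = 0,
    Name = 1,
    LicenseNumber = 2,
    Expiration = 3,
    Current = 4,
    Status = 5,
    StatusDate = 6,
    Actions = 7,
    PostingDate = 8,
    CityStateZip = 9,
    Profession = 10,
    LicenseType = 11
}

interface KnownLicense {
    number: string;
}

// True only for a 'New License' row whose number hasn't been stored yet.
function shouldFetchDetails(status: string, licenseNumber: string, found: KnownLicense[]): boolean {
    const isNew = status.trim() === 'New License';
    const alreadyFound = found.some(details => details.number === licenseNumber.trim());
    return isNew && !alreadyFound;
}

// cells[ResultCell.Status] would read the same cell as cells[5]
const statusIndex: number = ResultCell.Status;
```

If the Board ever adds or reorders a column, only the enum would need to change.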

Every row looks something like this. Not sure if it was worth blocking the names out since this is public information but I thought it wouldn’t hurt.

Once I determine that this is a license I want because it’s a ‘New License’ and it’s not one I have already, I go into the next function, which handles the details page.

The details page with expert censoring.

The nice thing about the details page is that it actually does have valid css selectors, which made it a lot cleaner to get what I wanted. The only items that were a bit tricky were the city, state, and zip. I had to do some split and replace here because there were no clear differentiators in the HTML.

    export async function getDetails(browser: Browser, url: string) {
        try {
            let details: any = {};
            const page = await setUpNewPage(browser);
            await page.setViewport({ height: 1200, width: 1900 });
            await page.goto(url);

            details.name = (await getPropertyBySelector(page, '#ctl00_CPH1_txtLicenseeName', 'value')).trim();
            details.businessPhone = (await getPropertyBySelector(page, '#ctl00_CPH1_txtShopPhoneNo', 'value')).trim();
            details.streetAddress1 = (await getPropertyBySelector(page, '#ctl00_CPH1_txtAddress1', 'value')).trim();
            details.streetAddress2 = (await getPropertyBySelector(page, '#ctl00_CPH1_txtAddress2', 'value')).trim();
            details.cityStateZip = (await getPropertyBySelector(page, '#ctl00_CPH1_txtCityStateZip', 'value')).trim();

            // The last two space-separated pieces are the state and zip; everything before them is the city
            const cityStateZipSplit = details.cityStateZip.split(' ');
            details.city = cityStateZipSplit.slice(0, cityStateZipSplit.length - 2).join().replace(/,/g, ' ');
            details.state = cityStateZipSplit[cityStateZipSplit.length - 2];
            details.zip = cityStateZipSplit[cityStateZipSplit.length - 1];

            details.board = (await getPropertyBySelector(page, '#ctl00_CPH1_txtBureauName', 'value')).trim();
            details.licenseType = (await getPropertyBySelector(page, '#ctl00_CPH1_txtLicenseTypeDescription', 'value')).trim();
            details.number = (await getPropertyBySelector(page, '#ctl00_CPH1_txtLicenseNumber', 'value')).trim();
            details.dateOfIssue = (await getPropertyBySelector(page, '#ctl00_CPH1_txtLicenseIssueDate', 'value')).trim();
            details.status = (await getPropertyBySelector(page, '#ctl00_CPH1_txtLicenseStatus', 'value')).trim();
            details.region = getRegion(details.city);

            // Wait for the details tab to close before moving on to the next row
            await page.close();

            // details.potentialPhoneDetails = [];
            // details.potentialPhoneDetails = await getPhoneFromFindPerson(details, config.wpFindPersonAPIKey, details.potentialPhoneDetails);
            // details.potentialPhoneDetails = await getPhoneFromReverseAddress(details, config.wpReverseAddressAPIKey, details.potentialPhoneDetails);

            console.log('details', details);
            return Promise.resolve(details);
        } catch (e) {
            return Promise.reject(`Error on details page, ${e}`);
        }
    }
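The city, state and zip splitting is the part I'd most want to test in isolation. Here is a sketch of it as a standalone function. It assumes the field looks like "IDAHO FALLS, ID 83401" (city, then a two-letter state, then the zip), which is my reading of the split logic rather than something I've verified against every record. `splitCityStateZip` and `CityStateZip` are hypothetical names. Unlike the code above, it strips commas instead of swapping them for spaces, so the city doesn't end up with trailing whitespace.

```typescript
// Hypothetical helper; the "CITY, ST ZIP" input format is an assumption.
interface CityStateZip {
    city: string;
    state: string;
    zip: string;
}

function splitCityStateZip(raw: string): CityStateZip {
    const parts = raw.trim().split(/\s+/);
    if (parts.length < 3) {
        // Not enough pieces to tell city, state and zip apart
        return { city: raw.trim(), state: '', zip: '' };
    }
    // The last two pieces are state and zip; everything before them is the city
    const zip = parts[parts.length - 1];
    const state = parts[parts.length - 2];
    const city = parts.slice(0, parts.length - 2).join(' ').replace(/,/g, '').trim();
    return { city: city, state: state, zip: zip };
}

const exampleAddress = splitCityStateZip('IDAHO FALLS, ID 83401');
// exampleAddress is { city: 'IDAHO FALLS', state: 'ID', zip: '83401' }
```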

I also tried to add a region flag where it made sense. I essentially just split Idaho into three different regions and picked (by hand) some of the bigger cities around those regions. If the city was in those regions, I would mark it as that region.

    function getRegion(city: string) {
        const treasureValleyCities = ['BOISE', 'KUNA', 'STAR', 'EAGLE', 'NAMPA', 'CALDWELL', 'MERIDIAN'];
        const easternIdahoCities = ['POCATELLO', 'IDAHO FALLS', 'SHELLEY', 'BLACKFOOT', 'DRIGGS'];
        const northernIdahoCities = ['KELLOGG', 'SANDPOINT', 'COEUR D ALENE', 'MOSCOW', 'OROFINO', 'HAYDEN', 'HAYDEN LAKE', 'CLARKSTON', 'LEWISTON'];

        if (treasureValleyCities.some(acceptableCity => city.toLowerCase().indexOf(acceptableCity.toLowerCase()) >= 0)) {
            return 'Treasure Valley';
        } else if (easternIdahoCities.some(acceptableCity => city.toLowerCase().indexOf(acceptableCity.toLowerCase()) >= 0)) {
            return 'Eastern Idaho';
        } else if (northernIdahoCities.some(acceptableCity => city.toLowerCase().indexOf(acceptableCity.toLowerCase()) >= 0)) {
            return 'Northern Idaho';
        } else {
            return null;
        }
    }

And BAM. I’m done. I run it daily and then I have a database of newly licensed doctors in the state of Idaho.

There are some things here that probably aren’t the most efficient. I get ALL of the licensed doctors on init and then just add to that. Eventually I’ll be pulling a gigantic list. But I do need to be able to confirm that I don’t have a license already. So really, the only two options I can think of are to either pull the whole list and check whether the licensed doctor is already there OR make a database call per license I find and check that way. I’m really not sure which is best.
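If I stick with pulling the whole list, one cheap improvement would be to index the license numbers once up front so that each check is a keyed lookup instead of a scan through the array. This is only a sketch of that idea: `Contact`, `SeenNumbers`, `indexByLicenseNumber` and `isKnownLicense` are hypothetical names, and it still loads every contact on init, so it helps the lookup cost but not the size of the initial pull.

```typescript
// Hypothetical helpers; not part of the original script.
interface Contact {
    number: string;
}

type SeenNumbers = { [licenseNumber: string]: boolean };

// Build the lookup table once, right after loading the contacts.
function indexByLicenseNumber(contacts: Contact[]): SeenNumbers {
    const seen: SeenNumbers = {};
    for (const contact of contacts) {
        seen[contact.number] = true;
    }
    return seen;
}

// A keyed lookup instead of scanning the array for every row.
function isKnownLicense(seen: SeenNumbers, licenseNumber: string): boolean {
    return seen[licenseNumber.trim()] === true;
}

const exampleSeen = indexByLicenseNumber([{ number: 'M-1234' }, { number: 'M-5678' }]);
// isKnownLicense(exampleSeen, 'M-1234') is true
// isKnownLicense(exampleSeen, 'M-0000') is false
```

Any newly found license number would need to be added to the table as well, so the same doctor isn't fetched twice in one run.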

In any case, that’s it. It’s a fun script.