TL;DR: This post details my method for using Node scripts to search through every field of every MongoDB document across multiple collections in a database and update fields that meet certain criteria (in this example: removing newline characters, i.e. \r and \n, from strings). Plenty of example code is provided, including a link to an example repo.

So here’s a scenario I was recently presented with on a client project: I’d inherited an old database which had acquired all kinds of interesting baggage over the years. One of the challenges it presented was that a lot of the fields had mysteriously acquired numerous newline characters somewhere down the line: \r’s and \n’s.

Where they came from, nobody could say. In the context of the app this didn’t cause any problems at the outset, but recently these newline characters had started causing issues for the client, and so they finally had to be scrubbed from all the data.

And so the question became, how exactly to go about this?

1: The Naive Approach

Querying the data itself to find fields within the documents that contain these newline characters isn’t a huge deal; the MongoDB shell command goes something like this:

```javascript
db.collection.find({ fieldName: /\r\n/ })
```

Here fieldName represents a given field in a document, and /\r\n/ is a regular expression matching those newline characters. It was quick work to search through the documents and get a sense of where these newline characters were present and how many of them there were. There were lots.
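Since the same regular expression syntax works in plain JavaScript, it’s easy to sanity-check the pattern before running anything against the database. A quick sketch (the field values here are made up for illustration):

```javascript
// The newline pattern we'll be hunting for in document fields
const newlinePattern = /\r\n/

// test() returns true only when the string contains the pattern
console.log(newlinePattern.test('a perfectly clean value'))  // -> false
console.log(newlinePattern.test('a messy\r\nvalue'))         // -> true
```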

Taking this “naive approach,” all you’d have to do to remove these characters from a document field using the MongoDB shell would be the following:

```javascript
db.collection.find({ fieldName: /\r\n/ })
  .forEach(function(document) {
    // a global (g) regex, so every occurrence gets replaced, not just the first
    document.fieldName = document.fieldName.replace(/\r\n/g, '');
    db.collection.save(document);
  });
```

To summarize the above shell command: we 1) perform a search for documents with newline characters in their fieldName field, 2) loop over those documents and perform a string replace on those fields, swapping out all of the newline characters with an empty string, and 3) write them to the DB again.
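One gotcha worth flagging: String.prototype.replace with a plain string pattern (or a regex without the g flag) only replaces the first occurrence, so a global regex like /\r\n/g is what actually scrubs every newline from a field. A quick illustration in plain JavaScript:

```javascript
const messy = 'line one\r\nline two\r\nline three'

// A string pattern only replaces the first occurrence:
console.log(messy.replace('\r\n', ''))   // -> 'line oneline two\r\nline three'

// A global regex replaces every occurrence:
console.log(messy.replace(/\r\n/g, '')) // -> 'line oneline twoline three'
```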

Where the complication arises, though, is that we don’t know exactly where these characters are. They might be present in some fields in some documents but absent in others. And given that the documents in question have a dozen or more fields, it starts to get pretty arduous to copy and paste this shell command over and over for each field. Multiply all of that manual copy-pasting by multiple database collections, and all of a sudden this turns into a time-consuming, brittle, boring task. Tons of opportunities for human error, and besides, we have computers to do all this manual labor for us!

2: The Scripted Approach

Let’s write a script to achieve everything that we need to do with this data. To summarize the objectives, we need to:

Query every document in multiple collections

Search through every field in those documents for newline characters

Replace those newline characters with empty strings if and when we find them

As we code we’re going to write our functions as methods on an object, and then call those methods later to keep things tidy, modular and testable. We’re also going to use Promises to make the flow control easier, especially considering that connecting to MongoDB is asynchronous (if you’re unfamiliar with Promises, I’d highly recommend this Fun Fun Function episode on the topic).
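The core Promise pattern we’ll lean on is wrapping a callback-style API so it can be chained with .then. Here’s a minimal sketch of that shape with a stand-in callback function (delayedGreeting and greet are hypothetical names, just for illustration; getConnection below uses the same structure):

```javascript
// A stand-in for any callback-style async API, like MongoClient.connect
function delayedGreeting(name, callback) {
  setImmediate(() => callback(null, `hello, ${name}`))
}

// Wrap the callback API in a Promise: reject on error, resolve on success
function greet(name) {
  return new Promise((resolve, reject) => {
    delayedGreeting(name, (err, greeting) => {
      if (err) {
        reject(err)
        return
      }
      resolve(greeting)
    })
  })
}

greet('world').then(greeting => console.log(greeting)) // -> 'hello, world'
```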

So, first things first, let’s connect to the database:

```javascript
const MongoUpdater = {

  // ...

  getConnection: function(url, dbName) {
    return new Promise((resolve, reject) => {
      MongoClient.connect(url, (err, connection) => {
        if (err) {
          reject(err)
          return
        }
        resolve(connection.db(dbName))
      })
    })
  },

  // ...

}
```

To connect, all you have to do is:

```javascript
MongoUpdater.getConnection('mongodb://localhost:27017', 'test')
  .then(db => console.log(db)) // -> MongoDB database
```

Next, we need to get the collection we’re going to perform the searches and updates on. Writing a separate method for this might seem a little silly at this point, but remember we’re dealing with multiple collections, so breaking this out as its own function will pay off shortly:

```javascript
const MongoUpdater = {

  // ...

  getCollection: function(collectionName, db) {
    return db.collection(collectionName)
  },

  // ...

}
```

Expanding our previous example, here are the function calls you’d perform to get a collection:

```javascript
MongoUpdater.getConnection('mongodb://localhost:27017', 'test')
  .then(db => {
    let collection = MongoUpdater.getCollection('test_collection', db)
    // -> MongoDB collection
  })
```

Now that we’ve got our collection, we need to query all of the documents in that collection, get them into array form, and start working our magic:

```javascript
MongoUpdater.getConnection('mongodb://localhost:27017', 'test')
  .then(db => {
    let collection = MongoUpdater.getCollection('test_collection', db)
    collection.find({}).toArray((err, docs) => {
      // magic goes here!
    })
  })
```

Using reduce to update our documents

Now that we have our documents, we have to examine each field in a given document to see if it contains newline characters, while leaving any other fields unchanged. All of this requires looping through the properties of the document object, which is a bit of a pain in JavaScript if you ask me. Given the task at hand, the aspiring functional programmer in me just screams to use Array.prototype.reduce. If you haven’t encountered reduce in your programming travels, a) you should learn it, seriously, and b) there’s a Fun Fun Function episode for that.
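In case reduce is new to you, the core idea is folding a list down into a single value, and that value can itself be a new object. A minimal sketch of building an object from a list of keys, which is exactly the shape we’re about to use:

```javascript
const keys = ['alpha', 'beta', 'gamma']

// reduce walks the array, threading an accumulator (here, an object)
// through each step and returning the final result
const lengths = keys.reduce((acc, key) => {
  acc[key] = key.length
  return acc
}, {})

console.log(lengths) // -> { alpha: 5, beta: 4, gamma: 5 }
```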

```javascript
const MongoUpdater = {

  // ...

  processDocFields: function(doc, regex) {
    // first, get all the keys of the document
    let docKeys = Object.keys(doc)
    // next, build a new object based on those keys
    let processedDocument = docKeys.reduce((current, next) => {
      // we only want to process fields containing strings,
      // and we definitely don't want to touch the _id field
      if (typeof doc[next] !== 'string' || next === '_id') {
        // if the field isn't a string (or is _id), copy it
        // into the new object we're building, unchanged
        current[next] = doc[next]
        return current
      }
      // otherwise, replace anything in the field that matches
      // the regex pattern with an empty string, and stick it
      // back in the object
      current[next] = doc[next].replace(regex, '')
      return current
    }, {})
    // the function spits out our processed document
    return processedDocument
  },

  // ...

}
```
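Here’s roughly how that plays out on a sample document (the field names and values are made up for illustration; the function is repeated as a standalone copy so the snippet runs on its own):

```javascript
// Standalone copy of processDocFields so this snippet is self-contained
function processDocFields(doc, regex) {
  return Object.keys(doc).reduce((current, next) => {
    if (typeof doc[next] !== 'string' || next === '_id') {
      current[next] = doc[next]
      return current
    }
    current[next] = doc[next].replace(regex, '')
    return current
  }, {})
}

// A made-up document with newline characters scattered through it
const doc = {
  _id: 'abc123',
  title: 'Hello\r\nWorld',
  notes: 'line one\r\nline two',
  count: 42
}

console.log(processDocFields(doc, /\r\n/g))
// -> { _id: 'abc123', title: 'HelloWorld', notes: 'line oneline two', count: 42 }
```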

And there you have it! After this it’s just a matter of writing your fresh, clean document back to the DB, and you’ve got a stew going:

```javascript
MongoUpdater.getConnection('mongodb://localhost:27017', 'test')
  .then(db => {
    let collection = MongoUpdater.getCollection('test_collection', db)
    collection.find({}).toArray((err, docs) => {
      // First, throw the docs array into a loop
      docs.forEach(doc => {
        // Within the loop, process the document object
        let processedDoc = MongoUpdater.processDocFields(doc, regex)
        // Now that it's processed, we save it back into the database
        collection.update(
          // We locate the existing document by its _id field
          { _id: ObjectID(processedDoc._id) },
          // then pass in the new, cleaned-up document as the update
          processedDoc,
          (err, status) => {
            // -> document has been updated in the DB!
          }
        )
      })
    })
  })
```

And from here, it’s just a matter of tossing in a few loops to perform the same operations on multiple collections. Here’s what that looks like, with some parts abstracted out for brevity:

```javascript
let collections = ['test_collection1', 'test_collection2', 'test_collection3']
let dbUrl = 'mongodb://localhost:27017'
let dbName = 'testDB'
let regex = /\r\n/g

MongoUpdater.getConnection(dbUrl, dbName)
  .then(db => {
    collections.forEach(coll => {
      let collection = MongoUpdater.getCollection(coll, db)
      collection.find().toArray((err, docs) => {
        docs.forEach(doc => {
          let newDoc = MongoUpdater.processDocFields(doc, regex)
          // save wraps the update call shown earlier, abstracted out for brevity
          MongoUpdater.save(newDoc, collection)
        })
      })
    })
  })
```

Conclusion

With some crafty Node scripts, you can automate all kinds of update operations on your MongoDB databases, which not only saves you a lot of frustrating shell command copying and pasting (and the accompanying human error), but also makes it easy to perform similar updates in the future if you encounter the same problem. On top of that, you can adapt the script for other similar scenarios. Need to get users’ first and last names into a consistent sentence case? Change abbreviations of a particular word into the full word, or vice versa? Spell-check and correct a frequently repeated word? You could do almost anything, really.
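To make that concrete, here’s one hedged sketch of how the same shape generalizes: instead of hard-coding a regex replacement, processDocFields could accept any string transform. The transform parameter and the toSentenceCase helper below are hypothetical, not part of the repo:

```javascript
// Hypothetical generalization: accept any string transform instead of a regex
function processDocFields(doc, transform) {
  return Object.keys(doc).reduce((current, next) => {
    if (typeof doc[next] !== 'string' || next === '_id') {
      current[next] = doc[next]
      return current
    }
    current[next] = transform(doc[next])
    return current
  }, {})
}

// One possible transform: normalize a name to sentence case
function toSentenceCase(str) {
  return str.charAt(0).toUpperCase() + str.slice(1).toLowerCase()
}

console.log(processDocFields({ first: 'aLiCe', last: 'SMITH' }, toSentenceCase))
// -> { first: 'Alice', last: 'Smith' }
```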

All of the above is available in this GitHub repo, complete with scripts for setting up a test MongoDB database so you can run the scripts on your own and see for yourself how it works, experiment with changes, or adapt the code to your own needs.

Thanks for reading! Happy coding.