



We’re talking databases, indexes, search engines, and why they’re basically microwaves in this episode while Joe wears a polo, Allen’s quick brown fox jumps over whatever, and Michael gives out fake URLs.

Sponsors

Datadog.com/codingblocks – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.

Survey Says

Anonymous Vote Sign in with Wordpress Now that you had some time to digest the news, how do you feel about Microsoft's acquisition of GitHub? Very excited! Looking forward to the awesome things Microsoft will add to GitHub.

I'm concerned but not enough to do anything about it. Yet.

I don't care. At all. Should I?

OMG the sky is falling. Why? How could we let this happen?

Already packed up my code and moved to GitLab. vote

News

Thank you to everyone that left us a review: iTunes – vr6apparatus, The All Ighty Ollar, BonerGs, MatiasWasAlreadyTaken, galusssss Stitcher – Vothis, JavaRuns100BillionDevices, SoulSurvivor, Twentyone

The Cynical Developer (James Studdart) has Sean Martz on the show to talk about transitioning from a sys admin to a developer. (Cynical Developer Episode 79)

In the Orlando area on June 21? Come see Joe’s talk on Building Search Driven Apps with Elasticsearch at the Orlando Backend Developer’s Meetup. Sign up for Elasticsearch and Go for the Backend at Meetup Source code and Slides over on GitHub

Joe invents a new way to listen to podcasts … by topic. Search and listen by topic at https://qit.cloud/ Source code available on GitHub



Search Driven Apps

What’s the problem? Why do we need search driven apps?

Search is a core tenant of modern computing Google, Shopping, Cortana, Spotlight, Contacts Can’t use the phone, listen to music, buy underwear or watch TV without searching

In some cases, the number of items we’re searching are small, can get away with a simple LIKE % search

But people aren’t very good at searching, and Google has spoiled us Auto-Complete helps with typos What if there’s too much data? Or the user doesn’t know what the thing is called? And what if there’s a lot of data?



Some interesting numbers:

Google servers 40k searches … per second (Ardor Seo)

(Ardor Seo) Amazon sells over 500M products (ScrapeHero)

Splunk indexes 100’s of TB per day

5B videos watched on YouTube every day (MerchDope)

Q: How are these services doing it?

A: Lots and lots of joins, and LIKE clauses?

Relational Databases …

Don’t scale horizontally very well

Are designed for ACID instead of BASE ACID From Wikipedia: Atomicity – requires that each transaction be “all or nothing”: Consistency – ensures that any transaction will bring the database from one valid state to another. Isolation – ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially Durability – ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors BASE – Basic Availability, Soft State, Eventual Consistency

In regards to the CAP Theorem, Relational DB have an emphasis on Consistency

CAP Theorem From Wikipedia: Consistency – every read receives the most recent write or an error Availability – every request receives a (non-error) response – without guarantee that it contains the most recent write Partition Tolerance – the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

“Shredding” is inefficient (normalizing data for storage, and re-constructing it for retrieval over and over again)

Writing dynamic ANSI SQL is the pits

DB vs Search Engine…why not just do it all in SQL?

Why not just turn on full text search in SQL Server? Full-Text Search (Microsoft Developer Network)

Elasticsearch Search Reference (Elastic Docs)

Ability to search across all indexes, quickly and easily… Fast aggregations (facets) Ranged aggregations Paging Scale Migrations / Versions Simpler query requests



Ever written a SQL query that started out simple

And ended up a monster?

Start with a LIKE

Add some dynamic JOINs

In clauses lead to temp tables

Paging leads to support for multi-page actions …

CTE, Pivoting, OpenQuery

Writing dynamic SQL stinks.

How Inverted Index Searches solve the problem

So … why don’t we use a NoSQL implementation, denormalize the heck out of it, and then index the heck out of it, then map reduce the heck out of it – Well… you just described a basic search engine!

NoSQL DB: Horizontal Scaling

DeNormalize our data

Index everything!

Map/Reduce to search quickly

Declarative language

A little bit about indexes …

Forward index, and …

Inverted index (Wikipedia)

Let’s consider them both at the same time using a book analogy. A table of contents is a _forward index_. It tells you where to go in the book for a “document” (aka a chapter). The index you’ve seen in books is an _inverted index_. It tells you where to go in the book for all uses of the relevant words. Great answer on Stack Overflow.



But What’s a Reverse Index?

Let’s consider a row in a database the “document” we want.

A typical DB index would be a forward index. In other words, when we want row with ID = 123, we can seek right to it. Or put another way, a typical DB index acts like a table of contents for the “document” (i.e. row) we want

Busy databases/indexes can suffer from index block contention, especially those indexes for monotonically increasing sequences. Like numerical based primary keys. Consider that the index for a table is stored across several “blocks”. Let’s hypothetically say each block contains 100 keys. Let’s also consider our table has a integer primary key that sequentially increases. So we have keys like 123, 124, 125, etc. So, 0-99 are in the first block, 100-199 are in the second block, 200-299 are in the third block, etc. If 100 write requests happen at the same time, it’s possible they could all queue up to write to the same block. Best case, two blocks given that scenario This is what a reverse index aims to solve.

A reverse index simply reverses the key So for key = 123, it would go in the block as 321 (i.e. in our “300” block) and key = 124 would go in another block (in our “400” block) as 421, etc.

This way reads/writes are distributed.

In the case of 100 concurrent write requests, they would be spread across 10 different indexes given this scenario.

Back to Inverted Indexes

Take the sentence “The Quick Brown Fox Jumps Over the Lazy Dog”

Store the entire document

Throw away the “stop” words (like “the”)

Break the sentence down into tokens for a hash table

Each token contains an array of pointers to records

When the user searches for “lazy fox” we can break that sentence into tokens, and return the records that are pointed to by our indexes

Advanced rules apply: synonyms, antonyms, stemming, pluralization, scoring, etc

Example of synonym: “PWA” and “Progressive Web Apps” or “.Net” and “Dot Net” or “JS” and “JavaScript”

Because of the storage mechanism, counting is easy too – we don’t even need to fetch the document, just count pointers

Inverted Index Search Engines

Slow to Write, Fast to Read (Sort of…there’s near real time on many now)

NoSQL before it was called NoSQL

Scale Super Well

Wicked Fast Reads

Support Complex Filters

Declarative syntax

Popular Search Engines

Resources We Like

Tip of the Week

Pattern matching with is (Microsoft Docs)

Use git cherry-pick -n to pick multiple commits into a single new commit. (git-cherry-pick Documentation)

Looking for a Python fiddle? Try https://pyfiddle.io/.

Google’s Python Class (Google for Education)

Execute T-SQL from Visual Studio Code using the mssql extension Extension available in the Visual Studio Marketplace Source code available on GitHub

Use git bisect to find the commit that broke your code. (git-bisect Documentation)

Build, deploy, and manage your next web project with all of the features of an enterprise on Netlify.