While experimenting with building a simple Twitter clone in Postgres, I found that I needed a way to parse hashtags and mentions from tweets like:

#example tweet - #testing with @postgresql

Imagine that we have a table called tweets defined with the following structure:

CREATE TABLE tweets ( id uuid PRIMARY KEY NOT NULL DEFAULT uuid_generate_v4(), post text NOT NULL, hashtags text[] NOT NULL DEFAULT '{}', mentions text[] NOT NULL DEFAULT '{}' );

Wouldn't it be nice if the hashtag and mention tokens were automatically parsed when we INSERT posts into the tweets table like the following:

INSERT INTO tweets (post) VALUES ('#example tweet - #testing with @postgresql'); SELECT * FROM tweets; id | post | hashtags | mentions ---------------------------------------+--------------------------------------------+-------------------+-------------- e133820e-7329-4852-b40b-6e9b7e2fa69d | #example tweet - #testing with @postgresql | {example,testing} | {postgresql} (1 row)

It turns out that this is pretty easy to achieve with Postgres! First we need to define a function to parse tokens from content and return an array of text.

CREATE FUNCTION parse_tokens(content text, prefix text) RETURNS text[] AS $$ DECLARE regex text; matches text; subquery text; captures text; tokens text[]; BEGIN regex := prefix || '(\S+)'; matches := 'regexp_matches($1, $2, $3) as captures'; subquery := '(SELECT ' || matches || ' ORDER BY captures) as matches'; captures := 'array_agg(matches.captures[1])'; EXECUTE 'SELECT ' || captures || ' FROM ' || subquery INTO tokens USING LOWER(content), regex, 'g'; IF tokens IS NULL THEN tokens = '{}'; END IF; RETURN tokens; END; $$ LANGUAGE plpgsql STABLE;

Let's test it out by parsing hashtags from a tweet:

SELECT parse_tokens('#example tweet - #testing with @postgresql', '#'); tokens ------------------- {example,testing} (1 row)

Parsing mentions from a tweet is just as simple:

SELECT parse_tokens('#example tweet - #testing with @postgresql', '@'); tokens -------------- {postgresql} (1 row)

Now that our parse_tokens function is working, we need to define some triggers to parse hashtags and mentions when a tweet record is inserted or updated.

CREATE TRIGGER parse_hashtags BEFORE INSERT OR UPDATE ON tweets FOR EACH ROW EXECUTE PROCEDURE parse_hashtags_from_post(); CREATE FUNCTION parse_hashtags_from_post() RETURNS trigger AS $$ BEGIN NEW.hashtags = parse_tokens(NEW.post, '#'); RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER parse_mentions BEFORE INSERT OR UPDATE ON tweets FOR EACH ROW EXECUTE PROCEDURE parse_mentions_from_post(); CREATE FUNCTION parse_mentions_from_post() RETURNS trigger AS $$ BEGIN NEW.mentions = parse_tokens(NEW.post, '@'); RETURN NEW; END; $$ LANGUAGE plpgsql;