.

.

Store application logs in timescaleDB/postgres.(23 November 2018)

Mar 9 15:08:05 LB1 local0.info haproxy[21843]: 10.0.0.1:1028 [09/Mar/2012:15:08:05.179] FT BK/SRV 0/0/1/8/9 304 12 - - ”GET / HTTP/1.1”

the environment the application is running in(production/test/dev/staging etc)

the IP address of the server that the app is running in

the name of the service/app

the service/app version number(this can be the last commit hash that was deployed)

etc

time | application_name |application_version | environment_name | log_event | trace_id | file_path | host_ip | data ------------------------------+------------------+--------------------+------------------+--------------------+----------+-----------------------------+------------+------------------------------------------------------------------------ 2018-11-18 14:07:18.936522+00 | ETL_app | v0.5.0 | production | video_process_job3 | caf72697 | /usr/src/app/code/etl.py | 172.23.0.2 | {"job_id": "658d31dd85fd", "jobType": "batch"} 2018-11-18 14:14:58.223893+00 | ETL_app | v0.6.0 | canary | video_process_job3 | 17603a0 | /usr/src/app/code/etl.py | 172.23.0.3 | {"error": "Traceback (most recent call last):

File \"/usr/src/app/code/etl.py\", line 129, in job3

ourJobs[5]

IndexError: list index out of range

", "job_id": "c1a164c2-86c5-43e6-a0e5-3ca2377abe95", "jobType": "batch"} 2018-11-18 14:09:09.581655+00 | WebApp | v0.5.0 | production | login | 45a7dc73 | /usr/src/app/code/web.py | 172.23.0.2 | {"user": "Shawn Corey Carter", "email": "someemail@email.com", "status_code": 504, "response_time": 95} 2018-11-18 14:09:09.580918+00 | WebApp | v0.5.0 | production | login | 0f776af0 | /usr/src/app/code/web.py | 172.23.0.2 | {"user": "Shawn Corey Carter", "email": "someemail@email.com", "status_code": 504, "response_time": 6} 2018-11-18 14:09:10.552307+00 | WorkerApp | v1.6.8 | staging | process_work | 6d07fb95 | /usr/src/app/code/worker.py | 172.23.0.2 | {"worker_id": "97b44537", "datacenter": "us-west"} 2018-11-18 14:09:10.551532+00 | WorkerApp | v0.5.6 | production | process_work | 8b2daa49 | /usr/src/app/code/woker.py | 172.23.0.2 | {"worker_id": "a6035461, "datacenter": "us-west"}

you query using SQL which is relatively easy to learn

your fellow developers probably already know SQL

you do not have to learn a new query language from splunk/sumologic/elasctic/influxdb et al

access to all the domain knowledge in your SQL database ecosystem(stackoverflow/google/docs/blogposts etc)

access to all the tools/extensions in your SQL database ecosystem

your average SQL database is relatively* easy to operate

#!/usr/bin/env bash create_extension() { psql -U "${POSTGRES_USER}" "${POSTGRES_DB}" -c "CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;" } create_table() { psql -U "${POSTGRES_USER}" "${POSTGRES_DB}" -c "CREATE TABLE logs ( time TIMESTAMPTZ NOT NULL, application_name TEXT NOT NULL, application_version TEXT NOT NULL, environment_name TEXT NOT NULL, log_event TEXT NOT NULL, trace_id TEXT NOT NULL, file_path TEXT NOT NULL, host_ip TEXT NOT NULL, data JSONB NULL, PRIMARY KEY(time, trace_id) );" } create_table_indices() { # read: # 1. https://www.postgresql.org/docs/11/indexes.html # 2. https://blog.timescale.com/use-composite-indexes-to-speed-up-time-series-queries-sql-8ca2df6b3aaa # 3. https://docs.timescale.com/v1.0/using-timescaledb/schema-management#indexing-best-practices psql -U "${POSTGRES_USER}" "${POSTGRES_DB}" -c "CREATE INDEX idxgin ON logs USING GIN (data);" psql -U "${POSTGRES_USER}" "${POSTGRES_DB}" -c "CREATE INDEX ON logs (log_event, trace_id, time DESC) WHERE log_event IS NOT NULL AND trace_id IS NOT NULL;" } create_hypertable() { psql -U "${POSTGRES_USER}" "${POSTGRES_DB}" -c "SELECT create_hypertable('logs', 'time');" } # 1. create timescaledb extension # 2. create database table # 3. create indices # 4. create hypertable create_extension create_table create_table_indices create_hypertable

from which server are the majority of errors emanating?

what was the path/trace of execution for the request with trace_id XYZ?

XYZ? in which module/file are the errors been raised from?

is this an issue that only affects our canary servers or does it also affect production?

is the perfomance of the current deployed application version similar to version 2.4.7 ?

etc

import os import json import uuid import errno import random import asyncio import datetime import get_ip def makeFifo(fifo_directory="/tmp/namedPipes", fifo_file="komusNamedPipe"): fifo_file = os.path.join(fifo_directory, fifo_file) try: os.mkdir(fifo_directory, mode=0o777) except OSError as e: if e.errno == 17: pass else: raise e try: os.mkfifo(fifo_file, mode=0o777) except OSError as e: if e.errno != errno.EEXIST: raise e return fifo_file def log_structure(log_event, trace_id, application_name, environment_name, file_path, data): now = datetime.datetime.now(datetime.timezone.utc) return { "log_event": log_event, "trace_id": trace_id, "application_name": application_name, "application_version": "v2.1.0", "environment_name": environment_name, "file_path": file_path, "data": data, "time": str(now), "host_ip": get_ip(), } async def emmit_logs( log_event, trace_id, application_name, environment_name, file_path, log_event_data ): try: pipe = os.open(makeFifo(), os.O_WRONLY | os.O_NONBLOCK | os.O_ASYNC) log = log_structure( log_event=log_event, trace_id=trace_id, application_name=application_name, environment_name=environment_name, file_path=file_path, data=log_event_data, ) # we use newline to demarcate where one log event ends. write_data = json.dumps(log) + "

" write_data = write_data.encode() os.write(pipe, write_data) os.close(pipe) except OSError as e: if e.errno == 6: pass else: pass finally: await asyncio.sleep(1) async def web_app(app_name): while True: await emmit_logs( log_event="login", trace_id=str(uuid.uuid4()), application_name=app_name, environment_name=random.choice(["production", "canary", "staging"]), file_path=os.path.realpath(__file__), log_event_data={ "user": "Shawn Corey Carter", "age": 48, "email": "someemail@email.com", "status_code": random.choice([200, 202, 307, 400, 404, 500, 504]), "response_time": random.randint(1, 110), }, ) loop = asyncio.get_event_loop() loop.run_until_complete(web_app(app_name="web_app")) loop.close()

import os import json import random import asyncio import asyncpg import datetime loop = asyncio.get_event_loop() class Buffer: def __init__(self, loop, interval=6): self.loop = loop self.interval = interval self.lock = asyncio.Semaphore(value=1, loop=self.loop) self.buf = [] def send_logs_every(self): jitter = random.randint(1, 9) * 0.1 return self.interval + jitter bufferedLogs = Buffer(loop=loop) async def send_log_to_remote_storage(logs): # todo: to be implemented later pass async def schedule_log_sending(): # todo: to be implemented later pass class PIPE: fifo_file = None def __enter__(self): self.fifo_file = open("/tmp/namedPipes/komusNamedPipe", mode="r") os.set_blocking(self.fifo_file.fileno(), False) return self.fifo_file def __exit__(self, type, value, traceback): if self.fifo_file: self.fifo_file.close() async def __aenter__(self): self.fifo_file = open("/tmp/namedPipes/komusNamedPipe", mode="r") os.set_blocking(self.fifo_file.fileno(), False) return await asyncio.sleep(-1, result=self.fifo_file) async def __aexit__(self, exc_type, exc, tb): if self.fifo_file: await asyncio.sleep(-1, result=self.fifo_file.close()) async def collect_logs(): async with PIPE() as pipe: while True: try: data = pipe.readline() if len(data) == 0: await asyncio.sleep(1) continue log = json.loads(data) if log: # buffer log events in memory async with bufferedLogs.lock: bufferedLogs.buf.append(log) except OSError as e: if e.errno == 6: pass else: pass tasks = asyncio.gather(collect_logs(), schedule_log_sending(), loop=loop) loop.run_until_complete(tasks) loop.close()

async def send_log_to_remote_storage(logs): IN_DOCKER = os.environ.get("IN_DOCKER") try: host = "localhost" # replace with your postgres server IP address if IN_DOCKER: host = "timescale_db" conn = await asyncpg.connect( host=host, port=5432, user="myuser", password="hey_NSA", database="mydb", timeout=6.0, command_timeout=8.0, ) all_logs = [] for i in logs: time = datetime.datetime.strptime(i["time"], "%Y-%m-%d %H:%M:%S.%f%z") application_name = i["application_name"] application_version = i["application_version"] environment_name = i["environment_name"] log_event = i["log_event"] trace_id = i["trace_id"] file_path = i["file_path"] host_ip = i["host_ip"] data = i.get("data") if data: data = json.dumps(data) all_logs.append( ( time, application_name, environment_name, application_version, log_event, trace_id, file_path, host_ip, data, ) ) # batch insert await conn.executemany( """ INSERT INTO logs(time, application_name, application_version, environment_name, log_event, trace_id, file_path, host_ip, data) VALUES($1, $2, $3, $4, $5, $6, $7, $8, $9) """, all_logs, timeout=8.0, ) print("sent") await conn.close() except Exception: pass async def schedule_log_sending(): while True: async with bufferedLogs.lock: if len(bufferedLogs.buf) > 0: await send_log_to_remote_storage(logs=bufferedLogs.buf) bufferedLogs.buf = [] await asyncio.sleep(bufferedLogs.send_logs_every())

SELECT log_event, data -> 'error' AS error FROM logs WHERE data -> 'error' IS NOT NULL ORDER BY time DESC LIMIT 5;

log_event | error -------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------- video_process_job3 | "Traceback (most recent call last):

File \"/usr/src/app/code/log_emmiter.py\", line 133, in job3

ourJobs[5]

IndexError: list index out of range

" video_process_job3 | "Traceback (most recent call last):

File \"/usr/src/app/code/log_emmiter.py\", line 133, in job3

ourJobs[5]

IndexError: list index out of range

" video_process_job3 | "Traceback (most recent call last):

File \"/usr/src/app/code/log_emmiter.py\", line 133, in job3

ourJobs[5]

IndexError: list index out of range

" video_process_job3 | "Traceback (most recent call last):

File \"/usr/src/app/code/log_emmiter.py\", line 133, in job3

ourJobs[5]

IndexError: list index out of range

" video_process_job3 | "Traceback (most recent call last):

File \"/usr/src/app/code/log_emmiter.py\", line 133, in job3

ourJobs[5]

IndexError: list index out of range

"

SELECT log_event, trace_id, file_path, data -> 'error' AS error FROM logs WHERE logs.environment_name = 'production' AND logs.trace_id = ( SELECT trace_id FROM logs WHERE data -> 'error' IS NOT NULL ORDER BY time DESC LIMIT 1);

log_event | trace_id | file_path | error -------------------+----------+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------- video_process_job1 | f552320e | /usr/src/app/code/log_emmiter.py | video_process_job2 | f552320e | /usr/src/app/code/log_emmiter.py | video_process_job3 | f552320e | /usr/src/app/code/log_emmiter.py | video_process_job3 | f552320e | /usr/src/app/code/log_emmiter.py | "Traceback (most recent call last):

File \"/usr/src/app/code/log_emmiter.py\", line 133, in job3

ourJobs[5]

IndexError: list index out of range

"

SELECT log_event, trace_id, host_ip, data -> 'status_code' AS status_code FROM logs WHERE logs.environment_name = 'production' AND logs.application_name = 'web_app' AND data -> 'status_code' IN ('500', '504');

log_event | trace_id | host_ip | status_code ----------+----------+------------+------------- login | aa9951dc | 172.24.0.4 | 504 login | 63898b0d | 172.24.0.4 | 500 login | 6154aa3e | 172.24.0.4 | 504 login | 053a9820 | 172.24.0.4 | 504 login | 29e6644c | 172.24.0.4 | 504

SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY data -> 'response_time') AS average, percentile_disc(0.75) WITHIN GROUP (ORDER BY data -> 'response_time') AS seven_five_percentile, percentile_disc(0.99) WITHIN GROUP (ORDER BY data -> 'response_time') AS nine_nine_percentile FROM logs WHERE logs.environment_name = 'canary' AND logs.application_name = 'web_app' AND data -> 'status_code' IS NOT NULL AND data -> 'response_time' IS NOT NULL;

average | seven_five_percentile | nine_nine_percentile --------+-----------------------+---------------------- 43 | 86 | 89

SELECT application_version, COUNT(data -> 'error') AS error_count FROM logs WHERE data -> 'error' IS NOT NULL GROUP BY application_version ORDER BY error_count DESC;

application_version | error_count --------------------+------------- v3.5.0 | 5 v3.4.0 | 1 v2.3.0 | 1 v1.7.0 | 1

If the title alone has already turned you off, I understand. It's almost unheard of, these days, to store application logs in a structured datastore. Whenever people talk about logs and their logging pipeline, it usually involves ELK(Elasticsearch) , MongoDB or such other NoSQL databases. And no such talk is complete without a tale of how people have to stay up at night feeding their Elasticsearch monstrous JVM with the blood of unblemished year-old sheep(or whatever JVM's feed on). And it does make a lot of sense to store logs in an unstructured database, because after all, logs are unstructured by nature. Are they? In this blogpost, I'll try & make a case for storing logs in an SQL database. I'll then go ahead and implement a proto-type logging pipeline using PostgreSQL/timescaleDB.I hold a differing opinion, I think if you look at logs hard enough; a structure starts to emerge. In this section, I'll lay out my opinion about logs and logging in general. This are my opinions and I hold no brief for anybody. I'm not going to try and convince you to adopt my worldview, but atleast to consider it.logs are actually structured. If you look at an example HTTP log of the popular tcp/http load balancer,You can see a lot of structure in it, there's;etc And all this things will be available in every log event; that's a lot of structure.You should annotate your logs with more structured information to make them much more useful. Take the haproxy log above, it would be much more useful if we added metadata like;logs should be primarily used to help debug issues in production.logs should only be persisted for a short period of time(~7days). This is a corollary of opinion 3 above. If we buy into the argument that logs are for debugging purposes, then we can also argue that if an issue occurs in production, you ought to debug it asap. Which means, you only need to store logs for a max of 7days. What you do with the logs after 7days(7 is just a rough estimate) is upto you; you can send them to /dev/null, or AWS s3 or some other cold storage but they should not be lying around in your primary log store.you should not lose logs, but it should not be a big deal if you do.you should err on the side of logging more meta/data than you think you'll need.logs can be time-series data. Take a deep breathe; time-series data at this point in time in our industry is a loaded term that means different things to different people. The definition of what time-series data is that I like asscociating myself with is; taken from the blogpost; What the heck is time-series data (go read it) ie, I'm making the case that, the following log events are time-series data:you can(and probably should) fashion your logs as time-series events. When we consider all that, we see that an SQL database is not that bad of an idea as a medium for storing logs. And on top of that you gain a lot of other added advantages:PostgreSQL is a popular SQL datatabse. TimescaleDB, despite the name, is actually a postgres extension that enhances postgres for time-series data.it is possible to use PostgreSQL to store time-series data without having to install the timescaleDB extension. In this blogpost, however, we are going to use postgres + timescaleDB extension for storing logs. You can go through the TimescaleDB documentation to learn how to set it up. But assuming you already have PostgreSQL installed, the following bash script should get you going;That creates the timescaleDB exension in postgres, creates a table, indices and a timescaledb hypertable In our application, we want our log events to have a few mandatory items;etc This will allow us to be able to ask more meaningful questions when debugging our app/service, we can ask questions like;That's why we created a database table with those fields in the bash script above. We also added a JSONB field so that we can be able to store any other fields that are not mandatory in all log events eg errors, http status code etc Our app will write logs with the said structure to a linux named pipe , then there will be another agent/daemon running continuously that reads from the same pipe and sends the logs to timescaleDB. So, the application code to write logs to a named pipe would look like the following:Now we need to write code for the agent that will read code from the named pipe and send it to timescaleDB/postgres. This will be in two parts, in part one we read from the named pipe and buffer the logs in memory, the second part takes whatever is buffered in memory and sends it to timescaleDB. In the following code, we read from the named pipe and buffer log events in memory:In the following code, we now take log events from the in memory buffer every X seconds and send them to timescaleDB/postgres:To badly misquote The O'Jays ? Let's run some queries: 1.2.3.All those instances were login events, maybe we should go and have a look at our login code. But why are all of them emanating from the same server with IP adress? Maybe the bug is a combination of the login code and that particular server and not necessarily a fault of the login code alone. 4.5.Maybe we should roll-back the deployment of version v3.5.0 as we figure out why it has introduced so many regressions.All this is fine and dandy, but does the solution scale? This is a hard question to answer conclusively. What I can encourage you to do, is to run your own benchmarks. You could also look at literature of people who have ran some benchmarks. But ultimately, it comes down to your own individual application and set of circumstances. Run your benchmarks and adopt the solution that seems to fit your app. And maybe a NoSQL database is exactly what fits your needs. Who knows? Explore and find out. If you are a small shop and are already using an SQL database for your other business processes, it might turn out that there's no need to introduce a new database for logs. All the code in this blogpost can be found at: https://github.com/komuw/komu.engineer/tree/master/blogs/timeScaleDB