Josh Beemster

Just a quick point of clarification: for the client-side pipelines, we're actually leveraging Kubernetes in the GCP pipeline, and we're looking to leverage ECS in the Amazon pipeline, so we only use Nomad internally, for our own orchestration and scheduling fabric. The reason we've chosen Nomad for that task is really just its deep level of integration with the rest of the HashiCorp stack. It seemed natural to say, well, we're using Terraform, we're using Consul, we're using Vault, we're using Packer, we should use Nomad as well, because it's got all of those nice, neat native integrations. I can talk about why we've chosen ECS as well, possibly after, but on the larger point about where our evolution has happened, there's a very long history there.

Where we started a few years ago, we had far fewer clients and a much simpler tech stack, and we could get by with a lot of manual work. So the decisions we made at that point in time were very much: we can take a human-driven approach to deploying infrastructure. We'd take some measures to be safe, we'd put some of it in CloudFormation first, we'd have some checklists, and we'd go through them and just get things running as quickly as possible. So it all started with Ansible playbooks that would run some CloudFormation that would spin up the pipelines.

When we first started writing that automation, we made several big mistakes. Mistake number one was not making the infrastructure granular. We'd be deploying a VPC, an Elastic Beanstalk stack, maybe some Amazon S3 buckets, maybe a Redshift cluster as well, and we'd put all of that in one giant CloudFormation template. At the time that made perfect sense: we had one version of what we were deploying, and we'd go and deploy it. Then we'd run into all these interesting issues where you'd have clients say, "Well, I don't want this part of it, I only want this part of it," and you'd go, okay, so now I've got a whole new version of my stack. So you fork that stack: okay, you get this version, you get this version, and then you've got all of these permutations. What we quickly realized is that what you need in infrastructure automation is not that big-bang stack; you need very high granularity in all the components. In the same way that, as you mentioned, Snowplow is very composable, built from lots of microservices, we needed to approach infrastructure in much the same way, or make it even more composable.

So where we are now in that journey is that we went from big-bang Ansible playbooks with very large CloudFormation templates to a bespoke tooling system, which was still based on Ansible and CloudFormation but had a lot of that granularity starting to appear. That worked for quite a long time, but it still wasn't very flexible. Part of that was probably our use of CloudFormation, which we found a little bit awkward to work with and a little bit awkward to make very flexible. But the key journey there was really about going from low granularity to high granularity. And then we ran into a further issue, which was about state.
Up until we started using Terraform, all of our deployment tools had been completely stateless. We'd essentially leveraged the fact that we were using CloudFormation, so we could just query the outputs of CloudFormation templates, or we'd write API calls to go and check whether certain components had been deployed and what configuration they had at the time. It was all just-in-time resolution, and that was really flexible; it removed any need for us to keep or worry about state anywhere. But it also made us very lazy, in that we weren't caring about state, so we weren't necessarily making the right decisions. As well, every time we wanted to extend the system, we had to go and fetch all of this information all over again. It also meant it was very hard to build a view of what had been deployed; it made it very difficult to write a tool that could just build a report and say, hey, this is everything that's deployed, this is the current state of the entire system, because it was all stateless. Doing that was very expensive: long API calls and checks that were just not very useful.

And that bespoke system, being what it was, was very hard to then turn into a nice API. It was also nearly impossible to train the rest of the team on, which I quickly discovered as I started hiring more SMEs and trying to train them on a tool that no one could actually use easily. So that's when the next part of our journey began, which was saying, well, hey, let's throw it all out and start again. We actually started that at the beginning of last year, and we settled on Terraform because it was flexible enough to support multiple clouds, which we now do. We needed something with a common instruction language across GCP and AWS, and possibly Azure or any other cloud that might appear in the future. We also wanted something that a new engineer joining could deal with: they wouldn't have to learn our configuration language, they'd just have to learn what we'd built on top of it, which was a massive difference in terms of how well we could support this. It also had all of the heavy lifting done: it had state, and it had integrations with Consul and Vault, which we're leveraging quite heavily for deployments, centralized metadata management, and centralized secret management that we can then feed into all of the configuration we've built with Terraform. And, tying in with our adoption of Nomad as I mentioned previously, we've now been able to slap an API on the front of it all, which we call our deployment service, very aptly named, that can then go and use this whole ecosystem to manage all of this infrastructure. So we've come from a human-driven, choose-your-own-adventure style of infrastructure management based on CloudFormation and Ansible to this world of API-driven, ChatOps-driven infrastructure management, which is kind of where we've gotten to now.
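To make that "just-in-time resolution" point concrete, here is a minimal sketch of what reading deployed state back out of CloudFormation outputs can look like with boto3. It illustrates the general technique rather than Snowplow's actual tooling; the stack name and output key are hypothetical placeholders.

```python
# Sketch of stateless, just-in-time resolution: read a deployed stack's outputs
# straight from CloudFormation instead of keeping any state of our own.
# The stack name and output key below are hypothetical placeholders.
import boto3
from botocore.exceptions import ClientError

cloudformation = boto3.client("cloudformation", region_name="eu-west-1")

def get_stack_outputs(stack_name):
    """Return a stack's outputs as a dict, or None if the stack isn't deployed."""
    try:
        stack = cloudformation.describe_stacks(StackName=stack_name)["Stacks"][0]
    except ClientError:
        return None  # stack does not exist, i.e. the component isn't deployed yet
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}

outputs = get_stack_outputs("client-collector-stack")
if outputs is None:
    print("Collector stack not deployed yet")
else:
    print("Collector endpoint:", outputs.get("CollectorEndpoint"))
```

Flexible, but as described above, every extension of the system has to repeat lookups like this, and no single place already knows what is deployed.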
I guess the other quick thing I'd love to touch on, which you mentioned, is Kubernetes being kind of the flavor of the month for how everyone runs and manages containers. On Google we did roll with Kubernetes, and we did that because Google has a fully managed Kubernetes offering, which was very attractive to us because it meant we didn't have to home-roll our own Kubernetes, and that's a big thing. What we look for in our implementations is minimum overhead in every way, shape and form: we can't have too much overhead, we can't do too many custom things, and when we can use a cloud tool, we use the cloud tool, because that is very important for scaling out our operation. For GCP, Kubernetes seemed like a very good fit. For AWS, though, we're looking in a slightly different direction.

There are a few reasons for that. One is that Amazon's managed Kubernetes isn't quite the same breed as Google's managed Kubernetes: you're still responsible for setting up the underlying worker nodes. It's a bit like when Elastic Container Service first arrived and you still had to manage all of those EC2 servers yourself. It's not a fully managed service in that sense; you're still responsible for setting up your auto-scaling groups, setting up the servers, and hooking them up to the cluster. So there's that reason. But as well, what I've found personally, and take this with a grain of salt, is that there is a lot going on in Kubernetes, and for what we're trying to do with it, it's not that it's overkill, but it does more than what we need it to do. And by virtue of that fact, it costs more than maybe we want it to cost: we don't necessarily need all of these advanced extra scheduling and management systems just to run a couple of pods. All we need to do is run a simple Docker container that scales up and down. There's no extra service discovery, there's no internal load balancing that needs to happen; that's all already done for us. So we're looking at ECS as just a very simple container management fabric, as opposed to Kubernetes, which, to its credit, is much more powerful, but is just much more than what we tend to need for the Snowplow stack.
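As a loose illustration of that "simple container management fabric" idea, the minimal ECS workflow is roughly a task definition plus a service whose desired count gets dialed up and down. This is a generic boto3 sketch, not Snowplow's deployment code; the cluster, image, and names are hypothetical.

```python
# Minimal ECS usage: describe the container once, run it as a service,
# and scale by changing the desired count. All names and images are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# 1. Task definition: which image to run, with how much memory and which ports.
ecs.register_task_definition(
    family="collector",
    containerDefinitions=[{
        "name": "collector",
        "image": "example/collector:latest",  # hypothetical image
        "memory": 512,
        "portMappings": [{"containerPort": 8080}],
        "essential": True,
    }],
)

# 2. Service: ECS keeps `desiredCount` healthy copies of the task running
#    on the cluster's EC2 container instances.
ecs.create_service(
    cluster="pipeline-cluster",
    serviceName="collector",
    taskDefinition="collector",
    desiredCount=2,
)

# 3. Scaling up or down is just an update to the desired count.
ecs.update_service(cluster="pipeline-cluster", service="collector", desiredCount=4)
```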