Day 0, 1 & 2. Decision making in cloud deployment

August 15, 2023

Season 1, Episode 7

In this episode of Cloud Unplugged, Jon Shanks and Jay Keshur dive deep into the intricacies of platform engineering, particularly in the context of a fictitious retail company named PTP.

In This Episode, You Will Learn:

  • The importance of Day Zero in platform engineering and its role in requirement gathering.
  • The challenges faced by developers in ensuring uptime and reliability.
  • The significance of multi-availability zones in improving reliability.
  • The complexities of handling global traffic and the need for strategic load balancing.
  • The role of centralized platform teams in improving developer experience.

Themes Covered in the Podcast:

  1. Day Zero in Platform Engineering: Understanding the need for a centralized approach and the importance of requirement gathering.
  2. Uptime and Reliability: Delving into the challenges faced by developers and the importance of multi-availability zones.
  3. Handling Global Traffic: Discussing the intricacies of load balancing and the example of Amazon’s single-region traffic handling.
  4. Centralized Platform Teams: The role of these teams in improving the developer experience and ensuring consistent delivery.

Quick Takeaways:

  1. Day Zero: The initial phase of platform engineering focused on requirement gathering.
  2. Uptime: The proportion of time a system is operational and available, commonly used as a measure of reliability.
  3. Multi-Availability Zones: Multiple, isolated locations within a region that provide redundancy and failover capabilities, ensuring system availability.
  4. Load Balancing: The distribution of incoming network traffic across multiple servers to ensure no single server is overwhelmed.
  5. Centralized Platform Teams: Teams focused on providing a centralized platform to improve developer experience.
  6. Reliability: The ability of a system to recover from failures and continue to function.
  7. Global Traffic: Web traffic that comes from different regions across the world.
  8. Developer Experience: The ease with which developers can create, test, and deploy applications.
  9. Platform Engineering: The process of creating and managing platforms for software development.
  10. PTP: A fictitious retail company discussed in the episode, facing challenges in platform engineering.
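
The load balancing and multi-availability-zone takeaways above can be sketched together in a few lines. This is a minimal illustration, not how any particular cloud load balancer works: the server names, zone names, and the `distribute` helper are all hypothetical.

```python
from itertools import cycle

# Hypothetical inventory: server name -> availability zone.
SERVERS = {
    "web-1": "az-a",
    "web-2": "az-b",
    "web-3": "az-c",
}


def healthy_servers(servers, failed_zones):
    """Return the servers whose zone has not failed, in a stable order."""
    return sorted(name for name, zone in servers.items() if zone not in failed_zones)


def distribute(requests, servers, failed_zones=frozenset()):
    """Assign each request to a healthy server in round-robin order."""
    pool = healthy_servers(servers, failed_zones)
    if not pool:
        raise RuntimeError("no healthy servers in any zone")
    rotation = cycle(pool)
    return {request: next(rotation) for request in requests}


if __name__ == "__main__":
    requests = [f"req-{i}" for i in range(6)]
    # All zones healthy: traffic is spread over all three servers.
    print(distribute(requests, SERVERS))
    # One zone lost: the remaining two servers absorb the traffic.
    print(distribute(requests, SERVERS, failed_zones={"az-b"}))
```

With one zone marked as failed, requests keep being served by the servers in the remaining zones, which is the redundancy the takeaway describes.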

Follow for more:
Jon Shanks: LinkedIn
Jay Keshur: LinkedIn
Jon & Jay’s startup: Appvia


0:05[Music] clearly I’ve just got it wrong

0:11um let me um let me kind of go find four people let’s let me go through a few things so let’s say you’re gonna tell me

0:17my job spec no I’m not gonna tell you I’m gonna start defining the conditions so let’s say somebody

0:24hello Welcome to Cloud unplugged this is season two episode 10 I’m Jon Shanks

0:29and I’m Jay Keshur and today we’re going to be talking about platforms and platform engineering but mostly for

0:36developers developer platforms we spoke on the previous episode about what developer experience is

0:42we spoke about many things facetiously and genuinely around like what was important

0:49to developers and reducing the friction around all of the things they have to worry about

0:55um and giving them all the visibility and all the information they need quickly and easily that’s actionable I

1:01think was kind of what we’re concluding because if you can action on it you can improve something um to improve their experience but in

1:10this one when we’re talking about platforms we can talk about the stages of platforms as in the build out it’s like

1:17design I guess then build and then operate

1:22yeah which is like commonly referred to as day zero day one day two

1:28um kind of uh problems of building platforms I guess yeah yeah

1:34um well not problems as in just the process yeah exactly sorry um States states of building yeah the

1:41phases yeah I guess yeah yeah um which is the view that it all happens in essentially two three days two days

1:48because there isn’t the days not a day just like zero time zero day yeah done you go back in time you’ve

1:56planned it yeah in like zero days it’s all been subconscious the entire time and then you just go straight to day one

2:02two yeah planning plan in action um yeah so when we’re talking about like

2:07design and implementation phase of a platform and Gathering the requirements

2:13um about what you’re going to build and then day one will be obviously then the building implementation phase

2:20um and then obviously testing what you’re building and everything else that it actually works the way you’d expect and then day two would be actually

2:27consumption um of the thing because it’s prime time and then people should be prime time

2:33it’s a prime time platform Amazon Prime that you’re actually building so we’re talking about building Amazon Prime yep

2:40prime time I mean we can talk about building supply but um yeah good Dev experience on that I don’t know

2:47um but yeah Prime Time platforms ptps as we like to call them and then um

2:52I guess the biggest challenges that we see are uh more on the day two let’s

2:58be honest I think there’s quite a lot of tooling and loads of reusable configuration and bits and Bobs out

3:03there where you can tend to be able to get something and create something reasonably easily whether you

3:09can measure that against the design because you haven’t designed anything in Day Zero

3:15um not sure if people are thinking about Day Zero as much or whether they’re just going straight to day one and two but what do you think well it feels like

3:23um kind of kind of like you said um planning is just an activity that happens a little bit disconnected

3:31from it’s a way you know when you’re talking about the success of something you’re not always talking about the

3:37success of the planning um you’re talking about the success of the implementation or the architecture which

3:42is part of the design and architecture phase so um maybe there is a link um but generally I guess the problems

3:50that we see we see in the industry is that the time and effort hasn’t gone

3:56into and the knowledge just isn’t there about um what it takes to operate a a platform

4:04that is responsible for developer experience and hosting services and Cloud right

4:10um so um like like you kind of summarized planning there’s lots of information out

4:18there about platforms that other people have built that might um fact you might factor into the into

4:24your design um day one there’s again loads of Open Source projects or things even in the

4:31cloud um that exist that will help you execute on parts of your design quite quickly

4:40um but day two is where most of the issues and the kind

4:46of ambiguity lies because no one really knows how to operate something and how

4:53the operations of that technology meets the rest of the business and the business’s needs for stability security

5:00you know efficiency Etc okay so say

5:05um we’re like we’re a company whatever we want to call like this nucleus picture it’s

5:11gonna be like a fictitious company so um I’m not sat here with a company hoodie yeah so we are a fictitious

5:19company in this in this situation um whatever call it like I don’t know PTP because it’s an

5:27acronym um there are eight development teams made up of say

5:35five to six engineers or developers software developers um

5:41they are writing obviously different functionality maybe different some

5:47slight different products some pieces of the product overall um you’re there as a central platform

5:55team uh just being on my own say you have a team of

6:03um how many teams I’ve already forgotten eight people it was eight teams eight

6:08teams of five six people yeah okay so say you have four engineers in your team

6:14okay with you which is kind of pretty large to be fair yeah I mean so I guess one thing that you’re already sort of

6:21building out is that there’s you’ve now implied that there is a kind of platform

6:27or yeah of course yeah because we’re talking about platform self-service all

6:32right so a platform team that is building a platform for uh developers yeah and the ratio or the numbers that

6:39you’ve just described um do you want to do the math on it should I do the math so uh 40 to 48

6:47people yeah yeah and um then the number of

6:53um devops or platform Engineers you said is four then including

6:59yourself in there so I guess it’s three and then one for you yeah I think it’s four in total of that

7:05platform team including like you leading that Team all right yeah um but that already is not actually realistic right
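
The sizing discussed here is simple arithmetic: eight teams of five to six developers gives 40 to 48 developers, supported by a platform team of four. A small sketch of that calculation follows; the constant and function names are illustrative only.

```python
TEAMS = 8
ENGINEERS_PER_TEAM = (5, 6)   # low and high estimates from the conversation
PLATFORM_ENGINEERS = 4        # includes the person leading the platform team


def developer_range(teams, per_team):
    """Return the (low, high) total number of developers across all teams."""
    low, high = per_team
    return teams * low, teams * high


def developers_per_platform_engineer(developers, platform_engineers):
    """Return how many developers each platform engineer supports."""
    return developers / platform_engineers


if __name__ == "__main__":
    low, high = developer_range(TEAMS, ENGINEERS_PER_TEAM)
    print(f"developers: {low} to {high}")
    print(f"ratio (low): {developers_per_platform_engineer(low, PLATFORM_ENGINEERS)}")
    print(f"ratio (high): {developers_per_platform_engineer(high, PLATFORM_ENGINEERS)}")
```

Under these numbers each platform engineer supports roughly ten to twelve developers.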

7:12so it’s not realistic no because that’s how I started this I said it was a fictitious

7:17but I’m saying that this the the team that I’m building already I’d be like uh

7:23so the industry actually puts more time and sorry more value in this function

7:29which means that it is uh heavily invested in and the way that this

7:34industry likes to invest in this space weirdly is by putting people in sorry

7:40that’s right or wrong but do I need to hire someone else yeah I think so I really feel like um this isn’t working

7:46out already I really more money I really thought you were the man but clearly I

7:51just got it wrong um let me um let me kind of go find four

7:57people let’s let me go through a few things so let’s say you’re gonna tell me my job spec no I’m not gonna tell you

8:02just say I’m gonna start defining the conditions so let’s say some of these projects have already gone into Cloud maybe there’s a

8:09bunch of kooky things going on like on my on the platform that we’ve built no they’re saying like there was like there

8:15was non-centralized function going on for a while nice so embedded devops

8:21um maybe some of the devops left um like some of them have been doing Python scripts and kind of bits of glue

8:27and string and like maybe Secrets kind of all over the place in things and it’s a bit messy and someone’s like okay we

8:34really need to do something about it there’s bottlenecks in delivery um so your boss was like basically goes

8:42to hire you in because they’re like we need to do something about this we need to get someone in to kind of take a look at all this and you’re like actually we

8:48need to just probably centralize some of this stuff um because it hasn’t worked very well

8:53and there’s too much inconsistency it’s all been done differently and the risk is really high and you don’t really have any oversight we need a bit more of a

8:59delivery framework in the business to get consistent so you pitch platform

9:05engineering like platform engineering means we have Central engineering going on to improve the developer experience

9:12in one place overall and you do pitch and they’re like yeah sold great idea

9:17and then you then hire three more people kind of to help you on defining it

9:23that’s where we’re at now that’s where we’re at oh yeah so thanks for that that’s all right still stuck with three people in this you didn’t get a choice

9:29in it but I just thought to give to give you some expectation and context of the fact that people are already doing

9:35things Okay and like apps might be live in some of this as well live so much

9:40maybe not yeah cool all right um so there’s this Central team um now there isn’t a platform that

9:48exists there’s no design that’s done um so I guess we’re starting from you

9:54know day Zero um and I guess if you’re looking for consistency across all of the teams

10:01security and security people are worried about the security that’s why your boss has hired you in because they’re like I know you were just in one cloud provider

10:07many Cloud providers you’re just doing one cloud provider that’s easy yeah absolutely okay super easy right yeah um

10:12so so next week so um day Zero is actually when you start delivering value

10:18if you’re just in I’m joking uh of course um using cloud is difficult there’s like

10:23you know let’s say AWS now has 200 odd services so you’re not expecting or more

10:29than 200 odd services so you’re not expecting all of those teams to be able to be experts in or you know that cloud

10:36provider completely so you have to start start thinking about standardizing on technology

10:42and making choices not necessarily on behalf of the teams but feeding in the

10:48requirements that allow you to centralize on that choice um to what the requirements should be looking for so this is day Zero so

10:55you’ve got all these different teams I’ve given you a bit of a rundown of audit of the fact that like

11:00things are all kind of manually automated so that like the basically yeah as in like it it probably they

11:07spend more time fixing up the automation than they probably do the automation working yeah so kind of devaluing the

11:14construct of even automation at this point like it probably would have been faster to have done it manually in the end

11:20um and so and they’ve kind of got themselves in a bit of a hole where like somebody’s already done a load of stuff

11:25so they just keep going with the things that were already there and just keep hacking away like maybe if I add this project in and maybe if I add this other

11:32environment in and like even you know that kind of stuff that kind of goes on and a bit of bash bit of python going on

11:38and some CI jobs running and you know maybe we’re kind of throwing something else like Argo CD in there and then

11:43it’s like now it’s like it’s a bit yeah it’s like kubernetes there someone’s got a couple of clusters Argo CD no

11:50real oversight anymore because it’s obviously per cluster it’s kind of separate to the CI job so it’s kind of a bit disconnected so now you’re kind of

11:56like basically you’ve got to go and now work out what’s going on where do you begin as in like what would

12:04you look for in your day Zero set requirements I guess part of that is

12:09going and speaking to the teams right to understand what it is that they’re doing why they’re doing those things what would you be asking what the

12:16business is Hey Jake it’s really nice to meet you I hear you’re new here and you’re going to help fix all these

12:21problems I am new um uh what the f have you been doing but

12:26it would so there used to be this yeah the other guy that used to work here called Jake and um I don’t know just did all this

12:33stuff before his story

12:38um so uh I’d obviously be asking you know what are the requirements because they have Downstream users right

12:46um or they might be kind of regulatory requirements or whatever of those applications so they have a bunch of applications which might factor into the

12:54design of the platform that you’re trying to build um I mean we’ve already talked about so

12:59you’re saying that you’re going to lead on like languages so first of all you all think it’s more about the language is that what you’re saying no no so

13:04applications is quite generic what do you mean um so uh they these these teams are

13:12building some functionality um they’re either building functionality in separate kind of business services or

13:20applications um and those applications have users in some sort of Industry let’s say for for

13:27argument’s sake it’s financial industry right um in the finance industry there is a

13:33bunch of Regulation that means that this isn’t Finance okay cool yeah

13:43um so there’s no regulation retail I mean there is regulation even in retail but yeah but not not in not in the

13:49financial it’s not a financial institution regulation okay so

13:54um maybe uh some of those requirements are um business driven so if my let’s say

14:02let’s say it’s an e-commerce website that’s probably the easiest it’s a shop it’s an online shop I mean it’s not I

14:08mean we can go with it I’m amazing all right fine so it’s just some unknown

14:13functions it’s basically it’s a it’s a it’s a retail company and part of that retail

14:20has Logistics Services it’s got payment services it’s got different products elements of there from like selling

14:26technology products it can sell clothes products so it all comes under one right it does but also there’s Logistics

14:31back-end stuff as well you’ve got a factor okay yeah so so I’m right fine another half stock they have stores as

14:37well it’s not all e-commerce now they’ve got shops it’s like they’ve had lots of different functionality lots of different things yeah cool

14:43um I mean I shop at PTP it’s right you know like business ages

14:49so so um this this the shop you know one of the requirements might be that um or one

14:56of the things that the business cares about might be uptime right of yeah that is a big problem actually yeah yeah

15:01it goes down quite a lot yeah there we go so we already know there’s some value

15:07that could be driven from a central platform because there’s a real problem that yeah it’s facing so uptime I don’t

15:13know one of the things that you might um uh try to design for is greater uptime

15:20um is what about the cost of um the services uh that like the logistics and

15:26do you have like a really high delivery fee or is there like a really high

15:31um operational cost for running these teams um that is presented to the user so

15:39sometimes things go wrong where the wrong stock gets sent to people

15:44sometimes that’s occurred a few times yeah um so that’s a fire yeah I mean this is

15:50just the day in the life of PTP oh man um but mostly the biggest issues are probably net new services so like new

15:59ideas I was trying to be competitive they just take forever yeah to get out the door

16:04um and then basically new features enhancements so on the existing Services also take a really long time

16:12um and then the back end stuff is quite problematic and also the uptime is quite like things go down takes a long time to

16:18recover so the mean time to recovery is really quite long-winded okay and people don’t really know where to look for

16:24things or whether logs are properly and it could take the major say performance of the sites performance actually isn’t

16:30too bad overall but just just reliability reliability so it kind of

16:36just goes can go down and then takes ages to fix something and does it go down for like security reasons or is it

16:42other reasons well just sometimes if we’re like doing any upgrades things will go down if we’re patching to a new

16:48version of the software sometimes I’ve had issues there and then resolving those issues can take time and the

16:53rollbacks can take time if we need to roll back well sounds like you have a really hard problem to fix glad you

17:00hired me to fix it because yeah I just yeah really know it yeah so so how’s

17:07this platform teams what is this day two what’s feeding into this day Zero requirement Gathering so now we know the

17:15fact that the reliability needs to be improved um the time to Market

17:22um can be improved the developer experience overall basically could could do with

17:28um a helping hand uh performance not too bad [Music] um but you might so in the reliability

17:35thing as an example Cloud obviously and we’ve talked about this on previous episodes cloud has

17:41um lots of different ways to improve reliability whether it’s using you know

17:47PaaS services that come with um higher availability and reliability

17:53or you know architecting your application so that it can be multi-region or multi-az or you know any

18:01any and or all of these things just means that you’re not just

18:07um reliable but resilient to failures potentially of the cloud providers right

18:14um so in this design then you’re saying that we should be multi-az or multi-region at least multi-az at least

18:21least is just more than one multi-az just means more than one correct yeah so is

18:27it more than one to two are we saying two um ideally

18:38this starts to factor in yeah where you’re saying you need for for

18:43reliability to increase your reliability in your uptime Distributing applications across

18:50multiple azs is going to improve the reliability if you’ve architected

18:57your app well um and that you’re you know moving state to the right place and all that kind of

19:02thing then yeah for sure um uh having multiple availability zones

19:09um that your application lives in will mean that it has a greater chance of

19:15higher up time um it will also mean that if you’re if one of those availability zones goes

19:21down or is unavailable for whatever reason then your application isn’t affected yeah so so they’re gonna you’re you’re

19:26now saying some design principles that are going to shift to the developer teams to to capitalize on the platform

19:34you’re going to be building because your platform You’re Building you’re proposing is going to distribute applications across multiple azs yeah so

19:40therefore their application needs to be able to be in different azs needs to

19:45scale across multiple AZs and take the traffic which could go into either

19:52one of these azs and then deliver on the outcome appropriately so that if that AZ

19:57went down there’s still the two other azs but they aren’t necessarily burdened by knowing what it takes to build the

20:05application in those azs right they don’t need to change their app they they

20:10um will have to change their app to scale um yeah but they might not have to know

20:16that they are in each AZ um because you know you’re I’m trying to

20:21provide a uh a platform that takes care of some of that complexity so that load

20:27cognitive load for them in in all of this design isn’t isn’t on them right so

20:32what they all they need to validate is the fact can there be more than one of my applications alive and can it receive

20:41traffic can each one receive traffic equally and can it serve the request and the result is the right result you

20:48would have expected yes um not maybe somebody else’s order but exactly or whatever else yeah
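
The check described here (more than one copy of the application alive, each able to take traffic and return the right result) can be sketched as follows. This is a toy model under stated assumptions: `OrderStore`, `Replica`, and the zone names are hypothetical, and the shared store stands in for "moving state to the right place" outside the replicas.

```python
class OrderStore:
    """Stand-in for a shared data store that lives outside the replicas."""

    def __init__(self):
        self._orders = {}

    def put(self, order_id, order):
        self._orders[order_id] = order

    def get(self, order_id):
        return self._orders.get(order_id)


class Replica:
    """One copy of the application, running in a single availability zone."""

    def __init__(self, zone, store):
        self.zone = zone
        self.store = store
        self.up = True

    def handle(self, action, order_id, order=None):
        if not self.up:
            raise ConnectionError(f"replica in {self.zone} is down")
        if action == "create":
            self.store.put(order_id, order)
            return order
        if action == "read":
            return self.store.get(order_id)
        raise ValueError(f"unknown action: {action}")


def read_from_all(replicas, order_id):
    """Ask every live replica for the same order and collect the answers."""
    return [r.handle("read", order_id) for r in replicas if r.up]


if __name__ == "__main__":
    store = OrderStore()
    replicas = [Replica(zone, store) for zone in ("az-a", "az-b", "az-c")]
    replicas[0].handle("create", "order-1", {"item": "shoes"})
    print(read_from_all(replicas, "order-1"))  # same answer from every replica
    replicas[0].up = False                     # simulate losing one zone
    print(read_from_all(replicas, "order-1"))  # the other two still answer
```

Because no replica keeps the order in its own memory, any replica in any zone returns the same order, and losing one zone does not lose the data.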

20:54um so all of these would obviously factor in um if you wanted to go even bigger on

20:59that you might introduce another region with more availability zones if you wanted to improve and what would that

21:05mean to the app um so again uh you are trying to figure

21:10out whether um you know you now have to think about latency and things like that um where

21:16the users are and where um they’re coming in from um and whether it makes sense for the

21:23load of the application to be um uh to go

21:29to be distributed via in the Geo that is near so let’s say if you had a bunch of

21:34users in us then you might have a region that’s in the US if you had it in LA if

21:40most of your regions most of your users were in London then you know it probably makes sense to do do it there however if

21:47it’s Global and latency was such an important factor then you’d probably

21:52have both and load balance the requests coming in so that they are specific to

22:00the regions that they’re in funnily enough in in this example like Amazon is

22:05a global Service all of that is just um all of the traffic is actually handled

22:12in one region did you know that um all the traffic is handled in one

22:17region yeah so the whole of the Amazon website uh oh right I see you mean right

22:23so you’re not talking about as in there’s only one not so much services so Amazon

22:31um has a e-commerce site you know all of the traffic for that is handled in in one region in one region that’s pretty

22:36cool isn’t it yeah um which is unbelievable skills I’m I’m assuming
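
The multi-region idea discussed just before this tangent (send each user to the region closest to them, and load balance across regions when the audience is global) can be sketched as a lookup on measured latency. The locations, region names, and latency figures below are made up for illustration and do not correspond to any real provider's regions.

```python
# Hypothetical round-trip latencies in milliseconds from user locations to regions.
LATENCY_MS = {
    "london": {"eu-west": 12, "us-west": 140},
    "los-angeles": {"eu-west": 145, "us-west": 9},
    "new-york": {"eu-west": 75, "us-west": 70},
}


def nearest_region(user_location, latency_ms, available_regions):
    """Pick the available region with the lowest latency for this user."""
    candidates = {
        region: ms
        for region, ms in latency_ms[user_location].items()
        if region in available_regions
    }
    if not candidates:
        raise LookupError(f"no available region for {user_location}")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    regions = {"eu-west", "us-west"}
    for location in LATENCY_MS:
        print(location, "->", nearest_region(location, LATENCY_MS, regions))
```

If a region becomes unavailable, removing it from `available_regions` sends its users to the next closest region, at the cost of higher latency.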

22:42um anyway tangent uh so those things would be an issue um the type of choice the technology

22:49choices uh that this platform can handle uh so you know different languages am I

22:56trying to simplify um for only one language or am I uh being quite

23:03flexible so that you can host yeah there’s loads of languages so many languages yeah there’s quite a lot there’s quite a lot of all right so some

23:09flexibility and you might try to say loads but there’s like some people have kind of gone off and done separate things and you know so and

23:17there’s a whole other strategy about how the look and feel of everything kind of comes together better because it’s

23:22sounds like it really made a bit of a mess with this PTP business oh yeah PTP manages this business really yeah I know

23:30it’s pretty poor but anyway this is what’s happened so um but they’ve got back end orders

23:35obviously front-end stuff I mean you can like use State and data as well yeah the

23:41state because obviously orders have to go through credit card details have to go through there used to be payments done obviously then tracking your order

23:48um you know knowing where it’s going to be delivery dates all those kind of things yeah um so there’s like Logistics part and

23:54then obviously there’s the payment part and then there’s obviously then the front end part which is the catalog of the things that you can buy in the

23:59pricing and the you know so it sounds like you’ve already got um a bit

24:04of a kind of microservice architecture with all of these different teams um which means that there are lots of

24:11different kind of components that you might want to be able to scale um individually and deploy individually

24:18because those teams might make changes to those to their applications and then the way that they interact

24:26um could be made simple right so I’m guessing let me just take a guess to

24:31what the current state is that every change you make um in the in this in this organization

24:39with these different functions things like networking anytime there’s a change you know has to go through one team and

24:45all of this is centralized um or um you know working by Network so so

24:52let’s say one a payment one of your payment applications relies on your

24:57Logistics application for some reason and if you’ve made um you know a change so that your app is

25:04available on a different port then you have to go to some Central team to allow communication if security is an issue to

25:11allow communication between your things it depends because some people share

25:17some team stuff and other other teams don’t so it’s a bit of a mixed bag so

25:22some teams that work some other things historically ended up sharing infrastructure of another team just

25:29because it had the relationships before because it used to be in that team so they kind of just started sharing it and some of them was quite a few kubernetes

25:35clusters but then there’s some um using Lambda and some other things going on across different accounts yeah

25:43um but it’s all kind of working somehow in the end so there’s obviously like some front-end stuff that’s kind of

25:48Lambda there’s some things that like obviously trigger some events on when somebody puts an order through or

25:54whatever else there’ll be like an event that kind of happens for the back end so that’s kind of the Lambda stuff

25:59um but then the other things are the you know where the catalog’s hosted and all those other things are all shared in

26:05like a kubernetes clusters so you’ve also got different experience and different things different architectures

26:12for different architectures and and those architectural um choices that you’ve made

26:19um are you happy with them do you think there’s there’s a lot of I mean a team made them yeah you know it’s because

26:24their own separate teams and so they decided that was like the right thing to do that somebody was like an event-driven

26:30architecture makes sense for this use case so they went off and used that um and then other teams were already

26:35doing containers and so they just went off and started to use kubernetes because they heard about it and it’s like managed services in the clouds like

26:42might as well use that um as well and are they in one account many accounts they’re in many accounts

26:50yeah many accounts are cool and not like loads and loads of accounts but enough there’s like there is more than one is

26:56what I’m saying and are those accounts just like their you know the stages of these environments or is there more than

27:03that other stages plus more yeah so there’s some stages that have accounts

27:09um I think some teams have gone a bit crazy with the number of stages some have like five accounts some only have

27:15two accounts yeah um so it’s been a bit of a mixed bag on how the team decided

27:21to split it up it’s quite a lot of autonomy and cost that it was because it wasn’t Central it was like it was basically

27:28a decentralized devops yeah so there’s different no consistency a lot of I mean

27:34we tried but people were speaking to each other you know but out of reusing and helping each other out as well

27:39there’s terraform oh cool now like everyone loves a terraform yeah so there’s terraform there and there’s like

27:45lots of different um lots of different accounts so it’s just a devops decided that the patterns were slightly different

27:52so people were like you know some of the data that was the card data people were a bit risk-averse they were like let’s split it all out in separate accounts

27:58yeah some people were then using like the production data to test and other accounts they had to be treated like

28:03production and all these other things for the credit card data so that was treated differently in like more secure

28:09by splitting everything out but then other things weren’t like that so they didn’t do that um so it’s just like yeah evolution of

28:15like the requirements from other places so it sounds like and we’ve talked about

28:20um the kind of Landing Zone type approach in previous episodes but you know I’m going to apply that as a kind

28:27of at least a baseline platform um to rely upon and then but that

28:32doesn’t really um uh that doesn’t really solve the developer experience which is what we’re

28:39talking about here yeah so we need to we need to so now you have segregated accounts for teams and some sort of

28:46Central Services that can be reused so it’s less of a decentralized

28:52um mess that seems to have been created and there’s some consistency to how you’re delivering in cloud and then the

28:59next thing I guess to to solve for is um how uh all of those technology

29:05choices that have been made um for right reason wrong reason whatever can either be standardized the

29:12operational overhead um can be sort of maintained um the ongoing operability for all of

29:21these things are day two operations can be um kind of prioritized and solved for as

29:26well so day two operations so at the moment I don’t think we’ve because I do

29:32know that one of the Clusters went offline because you upgraded it no it was like basically I think one of the

29:39cloud vendors we just didn’t upgrade it for so long it was on such an old version

29:44um that I think got terminated because we’re one to upgrade but no one knew how to upgrade them

29:50um so we couldn’t really work out how to do it so I don’t think the team did it but then we did manage to obviously sort it

29:57out yeah um and then kind of move over so that was fine we’ve got support from the cloud

30:02vendor on that at the time um but that was expensive yeah it was very expensive
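
The outage described here came from a cluster left on an old version for too long. One day-two safeguard is a routine check that flags clusters running below the oldest version the vendor still supports. The sketch below assumes versions are plain "major.minor" strings; the cluster names and version numbers are invented, and the supported floor would have to come from the vendor's own published policy.

```python
def parse_minor(version):
    """Turn a 'major.minor[.patch]' string into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def clusters_needing_upgrade(clusters, oldest_supported):
    """Return the names of clusters running below the oldest supported version."""
    floor = parse_minor(oldest_supported)
    return sorted(
        name for name, version in clusters.items() if parse_minor(version) < floor
    )


if __name__ == "__main__":
    # Hypothetical inventory: cluster name -> running version.
    clusters = {"catalog": "1.27", "payments": "1.21", "logistics": "1.24.3"}
    print(clusters_needing_upgrade(clusters, oldest_supported="1.24"))
```

Running a check like this on a schedule turns "no one knew it was that old" into a visible, actionable list.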

30:07um but then we’re now just looking just to start again so the aim is like we need to start a bit yeah we know it’s a

30:13mess in many different places so there’s no point trying to work out some consistency in all of this yeah because

30:19right it’s too that would take longer than maybe just rethinking what we need to do so you’re saying put a

30:26landing Zone in first which is what the cloud best practice is which is exactly to support multi-team and multi-account

30:32or multi-project whatever you want to call it depending on the cloud provider yeah so that each team is going to get their

30:38own account but then the platform team then is going to do what do we all

30:44have a Kubernetes cluster each as a team or good question yeah I guess it kind of depends on

30:50um you know uh the the architecture or the things that are that’s why yeah they’re important so it’s quite a lot of

30:57support yes there’s a lot of important assets right so um let’s say well you you actually just sort of spoke about it

31:04earlier so you have card information yes PCI compliance and that seems to have uh

31:12data that you probably want to segregate often have a bit more risk aversion

31:18around yeah whereas other things might not necessarily have the same that’s the highest risk profile and probably the

31:25user data as well there we go so yeah so already you’ve you’ve kind of got alignment of workloads depending on the

31:33profile of the applications or you can Define everything to the highest risk if

31:39you’re really risk-averse you know you put everything in completely segregated environments where you’re not reusing

31:45any of the underlying infrastructure and everything is locked down

31:50um and and the um the kind of information flow between teams is now centralized as well so yeah

31:57they’re a really good developer experiences really mobile Okay cool so yeah I’m just wondering like when this

32:03develops because it’s going to arrive so at the moment we need to obviously suffer but we’re just looking to ship

32:09some features faster so we really want to know when this platform’s good but like so it’s I mean from all of these

32:16things me building a platform from scratch now this is there’s just so many requirements this it feels like it’s

32:22going to take me quite some time probably not like day two it feels like day 200 that I would have finished

32:29um addressing some of these requirements by we don’t need to necessarily I’m just telling you what the situation is yeah and obviously we can’t ship new features

32:38until at the moment but we also have new teams that are going to start so obviously we know we’ve we know that the

32:43current e-commerce we’ve got other ideas yeah business so we need to do something because there isn’t really a thing to

32:48align to so something that you as much as improving what we have so we need a migration strategy probably to the new

32:55thing this is like so much what there’s lots of things happening that’s not as much of a priority as actually the new

33:01project it’s not new net new net new stuff that needs to go quick and we’re just trapped because we’re

33:08you know don’t really know what to use because there isn’t a one thing to you so standardized so all right so things

33:15that we decisions that we have made there is now a cloud Landing Zone um that’s good uh there is

33:23um kubernetes because that’s already used as well it’s already done is it or we can use it now well it doesn’t exist

33:29yet but okay technology choice so we’re still in the tech okay so we’re still in the planning and now we’re kind of in

33:35Day Zero and day one of phases of this uh of this rollout so yeah

33:41um so the the cloud stuff has been planned has been executed on

33:47um now the um the developer experience is now being planned for and executed on

33:52so as part of that I’m going to come up with a strategy for using kubernetes in the organization so that I can we can

34:00standardize on the skills in the org and standardize on how applications are

34:06being shipped um through the different environments um and standardize how they’re being

34:13monitored you know the observability around all of them um

34:18because obviously I’ve got a few different teams um that are outsourced actually

34:24outsourced teams oh wow yeah more and more requirements just keep yeah just because like make sure you don’t have

34:30the capacity we don’t have a capacity at the moment with the current team there’s PTP it’s just got infinite there’s like

34:36so much it’s just there’s a lot going on with eight teams before ready but we’ve

34:43got two new teams starting they’ve started well yeah I mean I mean they’re about to start so they’re doing their

34:49own scoping right for the new stuff obviously the other eight existing teams will need to migrate at some point to

34:55this new stuff we need a plan for that yeah so I’m just wondering like what’s the you’ve mentioned kubernetes you’ve got the landing zones

35:02um are we just having a giant cluster is it like what are you providing like a big platform that we’re all sharing or

35:07like what’s the kind of depends so you know uh if you’re focusing on I mean

35:13there’s credit card details and all these other stuff so I wouldn’t really feel great about us sharing things to be fair you don’t feel great about shares

35:19well that security risk it would be high so I wouldn’t want promote that to be we’ve already got that risk now in some

35:25projects there you go right so we’ve already we’ve already spoken about this though right we already said that the different security profiles of the

35:32applications would be in different accounts different clusters Etc but not absolutely everything because

35:39um you don’t want the cognitive overhead of managing it all unless

35:46I do know though that when we didn’t upgrade the previous cluster because it was shared yeah everything yeah and yeah

35:54which we don’t want either because that was high risk so so you probably want to segregate off

35:59um some of the some of the applications so there’s not as much risk i.e. if a

36:05cluster goes down then it doesn’t take down the entire business

36:10well were you saying the whole org because you said three AZs are you saying if the whole three AZs go down no

36:16no if someone just doesn’t upgrade a cluster so yeah I mean a cluster can we just make that easy yeah so that’s not a

36:22problem that’s a good idea so how what what things are there in the industry that make cluster upgrades easier in

36:29fact um are you going to we are the team so I

36:34mean that’s kind of why I mean let me ask the team how do you do clusters [Music]

36:40so you you guys are already using terraform right so you know there’s functionality in in there that allows

36:47you to upgrade a cluster and Amazon makes that sort of easy but there’s loads of operational things now uh or

36:54questions that need to be asked right so am I responsible for the cluster probably

37:00um but you have a bunch of apps in that cluster do I decide when to upgrade and

37:06potentially affect your apps who knows well I don’t know because there will be 10 teams in total so I have to protect

37:13how many clusters are you proposing for for all those teams in the end well take the requirements in hand but let’s for

37:20argument’s sake and I’m not saying that this is the most efficient but for argument’s sake let’s say each of those teams has individual clusters well multiple

37:28clusters one for each stage of their uh environment pipeline or whatever

37:34um or stage of development so let’s say Dev, UAT, prod so now there’s thirty

37:40clusters okay 10 teams 30 clusters that’s a lot
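The count above is simple multiplication, but enumerating it makes the operational footprint concrete. A minimal sketch, assuming one fully segregated cluster per team per stage; the team names are hypothetical (the episode only says there will be 10 teams) and the stages are the three named in the conversation:

```python
from itertools import product

# Hypothetical team names; the episode only gives the number of teams (10).
teams = [f"team-{n:02d}" for n in range(1, 11)]

# Stages named in the conversation.
stages = ["dev", "uat", "prod"]

# One segregated cluster per team per stage, with no shared infrastructure.
clusters = [f"{team}-{stage}" for team, stage in product(teams, stages)]

print(len(clusters))  # 30
print(clusters[:3])   # ['team-01-dev', 'team-01-uat', 'team-01-prod']
```

Every one of those 30 entries is something that has to be upgraded, monitored, and paid for, which is the overhead the speakers go on to discuss.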

37:45of clusters and each team how much is this gonna cost it’s so cheap it’s like

37:50okay [Music] um but but now you have a uh obviously

38:00the the choice that has been made is that um all of the accounts and clusters are

38:06segregated um no shared underlying infrastructure so

38:12um a bit less potential well less of a footprint for security issues

38:19to happen in um but then the cognitive load and the operational overhead is high because you

38:25now have to manage thirty clusters also each time you do an upgrade you have to use sorry who are you talking about when

38:32you say you now have to manage who’s the you in the platform the platform team okay but

38:39not the developer if they are responsible for the cluster maybe they I in this weird scenario where I’m I don’t

38:47know if I’m the platform team or I’m just like yeah cool so this is my decision that I cool

38:54all right I hope so I’m just trying to still figure out this this weird uh I

38:59mean fictitious scenario so problem so in so I have 30 clusters

39:07um and I’m gonna take responsibility of owning these clusters however I think

39:13because it’s going to be I mean I only have three other people working in this and those three people cannot feasibly

39:20go into each of these teams and ask them when they can upgrade a cluster

39:26um the only the only thing that I can think about is is you know using um some solutions out there that make

39:34that easier um so it moves the operations into the team and the team

39:39can then decide so basically creating um uh a commodity but they won’t know the

39:47team won’t they don’t need to know Terraform so they you know they could use products out there that make

39:54um upgrades of clusters as simple as clicking a button that just says do you

39:59want to you know this upgrade is available do you want to hit okay so you’re saying

40:07[Music] and the team won’t need to know

40:14terraform that’s consuming this cluster yep and the Clusters are going to use something

40:21that takes that basically upgrades themselves almost or something is doing

40:27this upgrades yes um and the team is going to be responsible

40:33for when that happens because it’s their application they’re responsible for

40:38their application I’m responsible for the Clusters however I don’t want to be responsible for upgrading their clusters

40:44so I want to pass that down you can’t because you’ve got I’ve only got three people I can’t feasibly be responsible

40:49for that right so I have to find something um a solution out there that takes care of or moves this responsibility down or

40:57build it myself and I can’t build it because I’ve only got three people so okay so so you’re buying rather than

41:02building buying a solution um that that does this

41:08um what are the other problems I guess um or well just the speed of access so

41:13if I’m getting this cluster that my team needs like how long is that going to

41:19take [Music]

41:24um that I’ve bought rather than built um you can actually self-serve clusters Okay so we’ve got so we’ve got a way for

41:31the team that’s going to get these two clusters yeah which is obviously removing friction on access and then how do I

41:37get environments in the cluster for my apps that I need to deploy so an environment is

41:42essentially a namespace okay or one one concept of an environment is a namespace a different environment you could

41:49construct it as a cluster depending on you know how much infrastructure you want to reuse or not

41:54um but all of that is also self-serve so you can either get a namespace yourself

42:00or you can get a cluster yourself right but all of that is so I’m as a platform

42:07team I’m just you’re trusting the developer to create their own environments with

42:15knowing that the guard rails are in the right places I am only allowing the

42:20developers to do the things that I have um that I have trust in this tool can I

42:26delete the environments to kind of depends do I want you to delete the environments I think it should be

42:32allowed so yeah yeah apart from potentially prod just in case you mess up okay and that would be a guard rail

42:37is what you’re saying that would be a guard rail exactly however if you wanted to delete Dev if you wanted to delete

42:43uat go ahead right or they’re just not I just can’t destroy prod you can’t destroy prod Yeah
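The delete guard rail described here can be sketched as a single rule: developers may self-serve deletion of any environment except a protected one. This is only an illustration of the rule's logic, not how any particular product implements it; the function name `can_delete` and the `PROTECTED_STAGES` set are hypothetical:

```python
# Assumption: environments are identified by stage name and only prod is protected.
PROTECTED_STAGES = {"prod"}


def can_delete(stage: str) -> bool:
    """Return True if a developer may self-serve deletion of this stage."""
    return stage.lower() not in PROTECTED_STAGES


for stage in ["dev", "uat", "prod"]:
    outcome = "deletable" if can_delete(stage) else "blocked by guard rail"
    print(f"{stage}: {outcome}")
```

Keeping the protected stages in one set means the platform team changes the rule in a single place rather than in every team's tooling.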

42:50okay or there’s um you know an additional gate that you need if you are destroying prod for

42:56example if you wanted to do blue green deployments in prod and you were kind of moving off of a

43:03cluster moving your workloads onto another cluster you probably do want to delete the one that is just sat there

43:09doing nothing right yeah um so yeah those those types of guard rails and rules can can probably can

43:16happen in this wonderful um will this cluster optimize itself for costs so like because the other thing

43:22we’ve had is obviously lots of idle like resources yeah it’s costing a lot of

43:28money we’re not really using the infrastructure fully thankfully that is um one of the benefits of uh kubernetes

43:35so there’s lots of different add-ons in this world um Auto scalers and such and then um

43:42ways to manage your application as well so that that scales to demand so if your

43:49application is getting to demand and your infrastructure is scaling to demand then

43:55um as long as the right information is being fed into the Clusters and the

44:00management of those clusters then yes it’s going to be so in this platform I
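"Scaling to demand" can be made concrete with the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below assumes that rule; the function name, the bounds, and the numbers are illustrative and not from the episode, and the real autoscaler also applies tolerances and stabilization windows that are left out here:

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Scale the replica count in proportion to how far the metric is from target."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, desired))


# Busy: 4 replicas at 90% CPU against a 60% target -> scale up.
print(desired_replicas(4, 90, 60))  # 6
# Idle: 4 replicas at 15% CPU against a 60% target -> scale down.
print(desired_replicas(4, 15, 60))  # 1
```

The second call is the idle-resource case raised in the conversation: when demand drops, the replica count drops with it, and node autoscaling can then remove the unused machines.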

44:05can get the access get the infrastructure without knowing much about

44:10how you’re having the skills but they’ve known how you’ve done it but like it happens and that happens yeah exactly

44:16and then I can get access to environments and then this infrastructure when I

44:21start deploying apps will scale up and down accordingly to kind of save money so if I like and you can also get Cloud

44:28resources so let’s say your application needed sqs because you’re talking about it just knows I need one yeah well it

44:35doesn’t know you need one all right you need to tell it that you need one right okay you don’t

44:42necessarily need to know terraform so I’m going to standardize um I’m going to standardize the way that

44:49you’re asking for cloud services like a kubernetes cluster or um you know a

44:55database or a message queue or anything like that so in this experience so before because we’re talking about Dev experience yeah and we’re talking about

45:01all the touch points of the dev so I could now see I uh is that there it’s

45:09yeah um platform team so one of my responsibilities is um making sure that

45:15I’ve standardized on a CI okay so I’ve got a CI that my team can use so that’s kind of there yeah

45:21um I’ve also integrated automatically you know the way that um you can deploy

45:27from CI into those clusters right okay so you’re

45:33so what do I do as the dev team in this platform in your in this you write code you containerize that code so write

45:41a Dockerfile and that’s it um and you have to write either some

45:49kubernetes templates or a Helm chart or something and pop it into the right folder structure that I’m going to give

45:55you and then it will just appear in your cluster okay so when I’m engineering

46:01I’ll just be testing for Dev experience I’m testing locally

46:06then I push to CI and then and then these deployment files how do I

46:12know they’re going to work these deployment files so it gets deployed and tested all within the CI pipeline so you

46:20have um you know linting that happens on the kubernetes manifests inside the pipeline

46:26you also have an automatic deployment into an ephemeral cluster that goes away

46:31to make sure that the app can come up and there’s tests within that so um you’re being really efficient about

46:38when you’re deploying to long-lived infrastructure right
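The pipeline described above (lint the manifests, deploy to a short-lived cluster, test, then promote) can be sketched as an ordered list of stages that stops at the first failure. This is a rough illustration only: the stage names paraphrase the conversation, and the steps are placeholder functions rather than calls to any real CI system or cluster:

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stop at the first failure, return the names that passed."""
    passed = []
    for name, step in stages:
        if not step():
            print(f"{name}: failed, stopping here")
            break
        print(f"{name}: ok")
        passed.append(name)
    return passed


# Placeholder steps; a real pipeline would invoke a linter, a cluster API, and tests.
stages: List[Stage] = [
    ("lint kubernetes manifests", lambda: True),
    ("deploy to ephemeral cluster", lambda: True),
    ("run tests in ephemeral cluster", lambda: True),
    ("tear down ephemeral cluster", lambda: True),
    ("deploy to long-lived cluster", lambda: True),
]

passed = run_pipeline(stages)
```

Because the long-lived deployment is the last stage, a broken manifest or failing test never reaches shared infrastructure, which is the efficiency the speaker is pointing at.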

46:43cool cool platform isn’t it so I don’t have to learn so I have to learn kubernetes so that could take me a bit

46:48of time um as a Dev but some of the dev teams know it already yeah there’s quite a lot of good kubernetes

46:55um like learning product you know products out there or Solutions out there that help you learn so how long

47:00would you say these two new teams if they had the code ready how long

47:07would it take for them to get do they have an access to Dev and things like that and also what about solutions for like troubleshooting like if Dev didn’t

47:13work or they um are I’m assuming they’re on boarded into the organization so they

47:18have like entity and yeah they’ll be they’ll be in their Central identity yeah if they have a central identity then they can have access pretty much

47:25straight away okay so then so long as they’re in the IDP yeah then they’ll get access and that’s going

47:31to be the onboarding process and then from that point on they can basically is there training probably yeah yeah yeah

47:37so just to standardize for a developer experience you want everyone to have the

47:42same um you know um Baseline knowledge of experience Etc

47:48so we’re going to provide you some deep you know some training some best practices and some of that material

47:54already exists online but there’s three of us and now that all of this stuff has been taken care of for the most part I

48:01could spend the time working with developers to train them on the things specific to the company and you’re

48:07saying that the PTP platform super easy to engineer now yeah so also this platform will be highly available yep

48:14it’s going to upgrade itself secure something’s going to do the upgrade itself I just need to inform when there

48:20is an upgrade and will my app go down if it’s upgrading itself uh depends how you like if you’ve if

48:27your application is meeting the requirements and the standards that we put in and also it has to because we put

48:34the right checks and balances and guard rails in those clusters so that you couldn’t deploy unless you were multi-az

48:41or um you know have multiple replicas or multiple

48:47um uh instances of your app running um then you’ll have no downtime okay so

48:53you’re going to put policies in place that would force me to have enough

49:00I guess enough scale in my deployment that match your requirements exactly um
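The kind of policy being described, refusing a deployment unless it runs multiple replicas and spreads across availability zones, can be sketched as a check over a Deployment manifest. In practice this would usually be enforced by an admission controller or policy engine in the cluster; the function below only shows the rule's logic. The minimum of 2 replicas is an assumed threshold, and requiring both conditions is a stricter reading than the conversation states; `topologySpreadConstraints` and the `topology.kubernetes.io/zone` key are standard Kubernetes fields:

```python
MIN_REPLICAS = 2  # Assumed threshold; the episode only says "multiple replicas".
ZONE_KEY = "topology.kubernetes.io/zone"


def violations(deployment: dict) -> list:
    """Return the list of guard-rail violations for a Deployment manifest."""
    problems = []
    spec = deployment.get("spec", {})
    # Kubernetes defaults a Deployment to 1 replica when the field is omitted.
    if spec.get("replicas", 1) < MIN_REPLICAS:
        problems.append(f"needs at least {MIN_REPLICAS} replicas")
    pod_spec = spec.get("template", {}).get("spec", {})
    constraints = pod_spec.get("topologySpreadConstraints", [])
    if not any(c.get("topologyKey") == ZONE_KEY for c in constraints):
        problems.append("must spread pods across availability zones")
    return problems


compliant = {
    "spec": {
        "replicas": 3,
        "template": {"spec": {"topologySpreadConstraints": [
            {"maxSkew": 1, "topologyKey": ZONE_KEY,
             "whenUnsatisfiable": "DoNotSchedule"},
        ]}},
    }
}
single_instance = {"spec": {"replicas": 1, "template": {"spec": {}}}}

print(violations(compliant))        # []
print(violations(single_instance))  # two violations
```

A deployment that passes this check keeps at least one replica serving while nodes are drained during a cluster upgrade, which is why the upgrade should not cause downtime.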

49:05and then the upgrades and shouldn’t impact and then the things on there that maybe are

49:11your things will be impact to those things whatever those things are how I’m going to get my logs and other stuff off this Envision we manage that don’t worry

49:18okay so there won’t be any downtime for like my logs are not going to disappear and all of a sudden and uh we’ll just

49:23ship it to Cloud you know they have uh services that um so I have access to the how do I get

49:29access to the cloud um we’ll we’ll manage that all okay through the landing Zone all right so you’re going to give me access to the

49:36cloud and the platform does that or like I need to speak to the team or like uh just as part of your onboarding you get

49:42it automatically cool so I’ll know as part of the onboarding and the training I’m gonna know where my logs are going to be where

49:47my monitoring is going to be monitoring access templates templates that you can reuse

49:53it will even give you the ability to scan your code so that any

50:00vulnerabilities or anything that you find um you you can take care of early rather

50:05than taking care of when they’re in production yeah cool so basically these two projects

50:10should be able to go live quickly and then the day two challenges you’re saying Are Gonna it’s gonna be

50:17like a self-healing self-upgrading kind of self-scaling thing that takes

50:24the day two burdens away yeah and then you’re gonna for my application day two like troubleshooting managing operating

50:31knowing about performance you can provide me all the tooling for that so that I know what to do in my app

50:37so cool pretty good eh that’s pretty good awesome I mean so that’s basically what

50:44we need they do with four people in you know in the team because I didn’t build

50:49it couldn’t build it yeah cool that’s

50:56awesome so I guess now um I mean

51:02PTP took so long we actually went on

51:09um but no that’s good that was good to see um like could be a day in the life of a

51:15company yeah to take them from kind of craziness to something yeah standardized

51:21exactly for the new things and then we can align to it and it couldn’t it shouldn’t really take that long to do

51:26really if you’ve you know well I suppose if you don’t if you’re not having to engineer it all from scratch then yes

51:31exactly as long as the things are out there for you to use I’m ready to go and let the people know

51:37how to train people up on those things and yeah sounds good cool all right so the day day zero one

51:43and two process of you have to go through to before you like work out how you’re going to solve day two how are

51:49you going to solve day one are you going to think about Day Zero are you factor in developer experience and some of that

51:54day one wasn’t even building it was also thinking about X like

52:00research to see whether there’s something out there already yeah true exactly um absolutely everything right day one even

52:08though it’s kind of focused on building doesn’t mean you have to yeah cool um

52:14there you go interesting all right well I think I grilled you enough of that yeah

52:20[Laughter] cool thanks everyone for listening and

52:26we’ll be back with another episode soon thanks again bye bye


52:42thank you