August 15, 2023
Season 1, Episode 7
In this episode of Cloud Unplugged, Jon Shanks and Jacob Keshur dive deep into the intricacies of platform engineering, particularly in the context of a retail company named PTP.
Follow for more:
Jon Shanks: LinkedIn
Jay Keshur: LinkedIn
Jon & Jay’s startup: Appvia
0:05[Music] clearly I’ve just got it wrong
0:11um let me um let me kind of go find four people let’s let me go through a few things so let’s say you’re gonna tell me
0:17my job spec no I’m not gonna tell you I’m gonna start defining the conditions so let’s say somebody
0:24hello welcome to Cloud Unplugged this is season two episode 10 I’m Jon Shanks
0:29and I’m Jay Keshur and today we’re going to be talking about platforms and platform engineering but mostly for
0:36developers developer platforms we spoke on the previous episode about what developer experience is
0:42we spoke about many things facetiously and genuinely around like what was important
0:49to developers and reducing the friction around all of the things they have to worry about
0:55um and giving them all the visibility and all the information they need quickly and easily that’s actionable I
1:01think was kind of what we were concluding because if you can act on it you can improve something um to improve their experience but in
1:10this one when we’re talking about platforms we can talk about the stages of platforms as in the build out it’s like
1:17design I guess then build and then operate
1:22yeah which is like commonly referred to as day zero day one day two
1:28um kind of uh problems of building platforms I guess yeah yeah
1:34um well not problems as in just the process yeah exactly sorry um States states of building yeah the
1:41phases yeah I guess yeah yeah um which is the view that it all happens in essentially two three days two days
1:48because there isn’t the days not a day just like zero time zero day yeah done you go back in time you’ve
1:56planned it yeah in like zero days it’s all been subconscious the entire time and then you just go straight to day one
2:02two yeah planning plan in action um yeah so when we’re talking about like
2:07design and implementation phase of a platform and Gathering the requirements
2:13um about what you’re going to build and then day one will be obviously then the building implementation phase
2:20um and then obviously testing what you’re building and everything else that it actually works the way you’d expect and then day two would be actually
2:27consumption um of the thing because it’s prime time and then people should be prime time
2:33it’s a prime time platform Amazon Prime that you’re actually building so we’re talking about building Amazon Prime yep
2:40prime time I mean we can talk about building supply but um yeah good Dev experience on that I don’t know
2:47um but yeah prime time platforms PTPs as we like to call them and then um
2:52I guess the biggest challenges that we see are uh more on day two let’s
2:58be honest I think there’s quite a lot of tooling and loads of reusable configuration and bits and bobs out
3:03there where you can tend to be able to get something and create something reasonably easily whether you
3:09can measure that against the design because you haven’t designed anything in Day Zero
3:15um not sure if people are thinking about Day Zero as much or whether they’re just going straight to day one and two but what do you think well it feels like
3:23um kind of kind of like you said um planning is just an activity that happens a little bit disconnected
3:31from it’s a way you know when you’re talking about the success of something you’re not always talking about the
3:37success of the planning um you’re talking about the success of the implementation or the architecture which
3:42is part of the design and architecture phase so um maybe there is a link um but generally I guess the problems
3:50that we see in the industry is that the time and effort hasn’t gone
3:56into and the knowledge just isn’t there about um what it takes to operate a platform
4:04that is responsible for developer experience and hosting services and Cloud right
4:10um so um like like you kind of summarized planning there’s lots of information out
4:18there about platforms that other people have built that might um fact you might factor into the into
4:24your design um day one there’s again loads of Open Source projects or things even in the
4:31cloud um that exist that will help you execute on parts of your design quite quickly
4:40um but day two is where most of the issues and the kind
4:46of ambiguity lies because no one really knows how to operate something and how
4:53the operations of that technology meets the rest of the business and the businesses needs for stability security
5:00you know efficiency Etc okay so say
5:05um where like where a company whatever we want to call like this nucleus picture it’s
5:11gonna be like a fictitious company so um I’m not sat here with a company hoodie yeah so we are a fictitious
5:19company in this in this situation um whatever call it like I don’t know PTP because it says in
5:27acronym um there is a eight development teams made up of say
5:35five to six Engineers of developers software developers um
5:41they are writing obviously different functionality maybe different some
5:47slight different products some pieces of the product overall um you’re there as a central platform
5:55team uh just being on my own say you have a team of
6:03um how many teams have already forgotten eight people it was eight teams eight
6:08teams or five six people yeah okay so say you have four engineers in your team
6:14okay with you which is kind of pretty large to be fair yeah I mean so I guess one thing that you’re already sort of
6:21building out is that there’s you’ve now implied that there is a kind of platform
6:27or yeah of course yeah because we’re talking about platform self-service all
6:32right so a platform team that is building a platform for uh developers yeah and the ratio or the numbers that
6:39you’ve just described um do you want to do the math on it should I do the math so uh 40 to 48
6:47people yeah yeah and um then the number of
6:53um developed devops or platform Engineers you said is for the including
6:59yourself in there so I guess it’s three and then one for you yeah I think it’s four in total of that
7:05platform team including like you leading that Team all right yeah um but that already is not actually realistic right
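The headcount math in this exchange can be sketched in a few lines. This is only an illustration using the figures from the conversation (eight teams of five to six developers, a platform team of four including its lead); the function and constant names are made up for the sketch.

```python
# Figures quoted in the conversation for the fictitious company PTP.
DEV_TEAMS = 8
DEVS_PER_TEAM = (5, 6)      # smallest and largest team size
PLATFORM_ENGINEERS = 4      # includes the platform team lead


def developer_headcount(teams, per_team_range):
    """Return the (low, high) total number of developers."""
    low, high = per_team_range
    return teams * low, teams * high


def devs_per_platform_engineer(headcount, platform_engineers):
    """Return how many developers each platform engineer supports."""
    return headcount / platform_engineers


low, high = developer_headcount(DEV_TEAMS, DEVS_PER_TEAM)
print(f"developers: {low} to {high}")
print(
    "developers per platform engineer: "
    f"{devs_per_platform_engineer(low, PLATFORM_ENGINEERS):.0f} to "
    f"{devs_per_platform_engineer(high, PLATFORM_ENGINEERS):.0f}"
)
```

With these numbers each platform engineer supports roughly ten to twelve developers, which is the ratio being discussed as generous.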
7:12so it’s not realistic no because that’s how I started this I said it was a fictitious
7:17but I’m saying that this the the team that I’m building already I’d be like uh
7:23so the industry actually puts more time and sorry more value in this function
7:29which means that it is uh heavily invested in and the way that this
7:34industry likes to invest in this space weirdly is by putting people in sorry
7:40that’s right or wrong but do I need to hire someone else yeah I think so I really feel like um this isn’t working
7:46out already I really more money I really thought you were the man but clearly I
7:51just got it wrong um let me um let me kind of go find four
7:57people let’s let me go through a few things so let’s say you’re gonna tell me my job spec no I’m not gonna tell you
8:02just say I’m gonna start defining the conditions so let’s say some of these projects have already gone into Cloud maybe there’s a
8:09bunch of kooky things going on like on my on the platform that we’ve built no they’re saying like there was like there
8:15was non-centralized function going on for a while nice so embedded devops
8:21um maybe some of the devops left um like some of them have been doing Python scripts and kind of bits of glue
8:27and string and like maybe Secrets kind of all over the place in things and it’s a bit messy and someone’s like okay we
8:34really need to do something about it there’s bottlenecks and a delivery um so your boss was like basically goes
8:42to hire you in because they’re like we need to do something about this we need to get someone in to kind of take a look at all this and you’re like actually we
8:48need to just probably centralize some of this stuff um because it hasn’t worked very well
8:53and there’s too much inconsistency it’s all been done differently and the risk is really high and you don’t really have any oversight we need a bit more of a
8:59delivery framework in the business to get consistent so promote you pitch platform
9:05engineering like platform engineering means we have Central engineering going on to improve the developer experience
9:12in one place overall and you do pitch and they’re like yeah sold great idea
9:17and then you then hire three more people kind of to help you on defining it
9:23that’s where we’re at now that’s where we’re at oh yeah so thanks for that that’s all right still stuck with three people in this you didn’t get a choice
9:29in it but I just thought to give to give you some expectation and context of the fact that people are already doing
9:35things Okay and like apps might be live in some of this as well live so much
9:40maybe not yeah cool all right um so there’s this Central team um now there isn’t a platform that
9:48exists there’s no design that’s done um so I guess we’re starting from you
9:54know day Zero um and I guess if you’re looking for consistency across all of the teams
10:01security and security people are worried about the security that’s why your boss is hired you in because they’re like I know you were just in one cloud provider
10:07many Cloud providers you’re just doing one cloud provider that’s easy yeah absolutely okay super easy right yeah um
10:12so so next week so um day Zero is actually when you start delivering value
10:18if you’re just in I’m joking uh of course um using cloud is difficult there’s like
10:23you know let’s say AWS now has 200 odd services so you’re not expecting or more
10:29than 200 odd services so you’re not expecting all of those teams to be able to be experts in or you know that cloud
10:36provider completely so you have to start start thinking about standardizing on technology
10:42and making choices not necessarily on behalf of the teams but feeding in the
10:48requirements that allow you to centralize on that choice um to what the requirements should be looking for so this is day Zero so
10:55you’ve got all these different teams I’ve given you a bit of a rundown of audit of the fact that like
11:00things are all kind of manually automated so that like the basically yeah as in like it it probably they
11:07spend more time fixing up the automation than they probably do the automation working yeah so kind of devaluing the
11:14construct of even automation at this point like it probably would have been faster to have done it manually in the end
11:20um and so and they’ve kind of got themselves in a bit of a hole where like somebody’s already done a load of stuff
11:25so they just keep going with the things that were already there and just keep packing away like maybe if I had this project in and maybe if I had this other
11:32environment in and like even you know that kind of stuff that kind of goes on and a bit of bash bit of python going on
11:38and some CI jobs running and you know maybe we’re kind of throwing something else like Argo CD in there and then
11:43it’s like now it’s like it’s a bit yeah it’s like kubernetes there someone’s got a couple of clusters Argo CD no
11:50real oversight anymore because it’s obviously per cluster it’s kind of separate to the CI job so it’s kind of a bit disconnected so now you’re kind of
11:56like basically you’ve got to go and now work out what’s going on where do you begin as in like what would
12:04you look for in your day Zero set requirements I guess part of that is
12:09going and speaking to the team’s right to understand what it is that they’re doing why they’re doing those things what would you be asking what the
12:16business is Hey Jake it’s really nice to meet you I hear you’re new here and you’re going to help fix all these
12:21problems I am new um uh what the f have you been doing but
12:26it would so there used to be this yeah the other guy that used to work here called Jake and um I don’t know just did all this
12:33stuff before his story
12:38um so uh I’d obviously be asking you know what other requirements because they have Downstream users right
12:46um or they might be kind of regulatory requirements or whatever of those applications so they have a bunch of applications which might factor into the
12:54design of the platform that you’re trying to build um I mean we’ve already talked about so
12:59you’re saying that you’re going to lead on like languages so first of all you all think it’s more about the language is that what you’re saying no no so
13:04applications is quite generic what do you mean um so uh they these these teams are
13:12building some functionality um they’re either building functionality in separate kind of business services or
13:20applications um and those applications have users in some sort of Industry let’s say for for
13:27argument’s sake it’s financial industry right um in the finance industry there is a
13:33bunch of Regulation that means that this isn’t Finance okay cool yeah
13:43um so there’s no regulation retail I mean there is regulation even in retail but yeah but not not in not in the
13:49financial it’s not a financial institution regulation okay so
13:54um maybe uh some of those requirements are um business driven so if my let’s say
14:02let’s say it’s an e-commerce website that’s probably the easiest it’s a shop it’s an online shop I mean it’s not I
14:08mean we can go with it I’m amazing all right fine so it’s just some unknown
14:13functions it’s basically it’s a it’s a it’s a retail company and part of that retail
14:20has Logistics Services it’s got payment services it’s got different products elements of there from like selling
14:26technology products it can sell clothes products so it all comes under one right it does but also there’s Logistics
14:31back-end stuff as well you’ve got a factor okay yeah so so I’m right fine another half stock they have stores as
14:37well it’s not all e-commerce now they’ve got shops it’s like they’ve had lots of different functionality lots of different things yeah cool
14:43um I mean I shop at PTP it’s right you know like business ages
14:49so so um this this the shop you know one of the requirements might be that um or one
14:56of the things that the business cares about might be up time right of yeah that is a big problem actually yeah yeah
15:01it goes down quite a lot yeah there we go so we already know there’s some value
15:07that could be driven from a central platform because there’s a real problem that yeah it’s facing so uptime I don’t
15:13know one of the things that you might um uh try to design for is greater uptime
15:20um is what about the cost of um the services uh that like the logistics and
15:26do you have like a really high delivery fee or is there like a really high
15:31um operational cost for running these teams um that is presented to the user so
15:39sometimes things go wrong where the wrong stock gets sent to people
15:44sometimes that’s occurred a few times yeah um so that’s a fire yeah I mean this is
15:50just the day in the life of PTP oh man um but mostly the biggest issues are probably net new services so like new
15:59ideas I was trying to be competitive they just take forever yeah to get out the door
16:04um and then basically new features enhancements so on the existing Services also take a really long time
16:12um and then the back end stuff is quite problematic and also the uptime is quite like things go down takes a long time to
16:18recover so the mean time to recovery is really quite long-winded okay and people don’t really know where to look for
16:24things or whether logs are properly and it could take the major say performance of the sites performance actually isn’t
16:30too bad overall but just just reliability reliability so it kind of
16:36just can go down and then it takes ages to fix something and does it go down for like security reasons or is it
16:42other reasons well just sometimes if we’re like doing any upgrades things will go down if we’re patching to a new
16:48version of the software sometimes I’ve had issues there and then resolving those issues can take time and the
16:53rollbacks can take time if we need to roll back well sounds like you have a really hard problem to fix glad you
17:00hired me to fix it because yeah I just yeah really know it yeah so so how’s
17:07this platform teams what is this day two what’s feeding into this day Zero requirement Gathering so now we know the
17:15fact that the reliability needs to be improved um the time to Market
17:22um can be improved the developer experience overall basically could could do with
17:28um a helping hand uh performance not too bad um but you might so in the reliability
17:35thing as an example Cloud obviously and we’ve talked about this on previous episodes cloud has
17:41um lots of different ways to improve reliability whether it’s using you know
17:47PaaS services that come with um higher availability and reliability
17:53or you know architecting your application so that it can be multi-region or multi-az or you know any
18:01any and or all of these things just means that you’re not just
18:07um reliable but resilient to failures potentially of the cloud providers right
18:14um so in this design then you’re saying that we should be multi-az or multi-region at least multi-az at least
18:21least is just more than one multi-az just means more than one correct yeah so is
18:27it more than one to two are we saying two um ideally
18:38this starts to factor in yeah where you’re saying you need for for
18:43reliability to increase your reliability in your uptime Distributing applications across
18:50multiple azs is going to improve the reliability if you’ve architected
18:57your app well um and that you’re you know moving state to the right place and all that kind of
19:02thing then yeah for sure um uh architecting uh having multiple availability zones
19:09um that your application lives in will mean that it has a greater chance of
19:15higher up time um it will also mean that if you’re if one of those availability zones goes
19:21down or is unavailable for whatever reason then your application isn’t affected yeah so so they’re gonna you’re you’re
19:26now saying some design principles that are going to shift to the developer teams to to capitalize on the platform
19:34you’re going to be building because your platform You’re Building you’re proposing is going to distribute applications across multiple azs yeah so
19:40therefore their application needs to be able to be in different azs needs to
19:45scale across multiple azs and take the traffic which could go into either
19:52one of these azs and then deliver on the outcome appropriately so that if that AZ
19:57went down there’s still the two other azs but they aren’t necessarily burdened by knowing what it takes to build the
20:05application in those azs right they don’t need to change their app they they
20:10um will have to change their app to scale um yeah but they might not have to know
20:16that they are in each AZ um because you know you’re I’m trying to
20:21provide a uh a platform that takes care of some of that complexity so that load
20:27cognitive load for them in in all of this design isn’t isn’t on them right so
20:32what they all they need to validate is the fact can there be more than one of my applications alive and can it receive
20:41traffic it can each one receive traffic equally and you can serve the request and the request is the right result you
20:48would have expected yes um not maybe somebody else’s order but exactly or whatever else yeah
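The reliability argument for spreading an application across availability zones can be made concrete with a rough calculation. This is a simplified sketch, not a vendor formula: it assumes zones fail independently, that the application can serve traffic from any single healthy zone, and it uses a made-up per-zone availability of 99%.

```python
def combined_availability(per_zone_availability: float, zones: int) -> float:
    """Chance that at least one zone is up, assuming independent zone failures."""
    if zones < 1:
        raise ValueError("need at least one zone")
    # The app is down only if every zone is down at the same time.
    return 1 - (1 - per_zone_availability) ** zones


# Hypothetical per-zone availability, chosen only for illustration.
PER_ZONE = 0.99

for zone_count in (1, 2, 3):
    print(zone_count, f"{combined_availability(PER_ZONE, zone_count):.6f}")
```

The gain only holds if the application really can run more than one copy and each copy returns the correct result, which is the validation the developer teams are being asked to do.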
20:54um so all of these would obviously factor in um if you wanted to go even bigger on
20:59that you might introduce another region with more availability zones if you wanted to improve and what would that
21:05mean to the app um so again uh you are trying to figure
21:10out whether um you know you now have to think about latency and things like that um where
21:16the users are and where um they’re coming in from um and whether it makes sense for the
21:23load of the application to be um uh to go
21:29to be distributed via in the Geo that is near so let’s say if you had a bunch of
21:34users in us then you might have a region that’s in the US if you had it in LA if
21:40most of your regions most of your users were in London then you know it probably makes sense to do do it there however if
21:47it’s Global and latency was such an important factor then you’d probably
21:52have both and load balance the requests coming in so that they are specific to
22:00the regions that they’re in funnily enough in in this example like Amazon is
22:05a global Service all of that is just um all of the traffic is actually handled
22:12in one region did you know that um all the traffic is handled in one
22:17region yeah so the whole of the Amazon website uh oh right I see you mean right
22:23so you’re not talking about as in there’s only one not so much services so Amazon
22:31um has a e-commerce site you know all of the traffic for that is handled in in one region in one region that’s pretty
22:36cool isn’t it yeah um which is unbelievable skills I’m I’m assuming
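The multi-region idea from a moment ago, sending each user to the region nearest to them, can be sketched as a lookup over measured latencies. Everything here is hypothetical: the latency figures are invented, the region names are just labels for a London region and a US west coast region, and a real setup would normally rely on the cloud provider's DNS or load-balancing features rather than application code.

```python
# Invented round-trip latencies in milliseconds from user location to region.
REGION_LATENCY_MS = {
    "london": {"eu-west-2": 10, "us-west-2": 140},
    "los_angeles": {"eu-west-2": 140, "us-west-2": 15},
}


def pick_region(user_location, healthy_regions):
    """Return the lowest-latency healthy region for a user location."""
    latencies = REGION_LATENCY_MS[user_location]
    candidates = {
        region: ms for region, ms in latencies.items() if region in healthy_regions
    }
    if not candidates:
        raise LookupError("no healthy region available")
    return min(candidates, key=candidates.get)


all_regions = {"eu-west-2", "us-west-2"}
print(pick_region("london", all_regions))
print(pick_region("los_angeles", all_regions))
# If the nearest region is unhealthy, traffic falls back to the other one.
print(pick_region("london", {"us-west-2"}))
```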
22:42um anyway tangent uh so those things would be an issue um the type of choice the technology
22:49choices uh that this platform can handle uh so you know different languages am I
22:56trying to simplify um for only one language or am I uh being quite
23:03flexible so that you can host yeah there’s loads of languages so many languages yeah there’s quite a lot there’s quite a lot of all right so some
23:09flexibility and you might try to say loads but there’s like some people have kind of gone off and done separate things and you know so and
23:17there’s a whole other strategy about how the look and feel of everything kind of comes together better because it’s
23:22sounds like it really made a bit of a mess with this PTP business oh yeah PTP manages this business really yeah I know
23:30it’s pretty poor but anyway this is what’s happened so um but they’ve got back end orders
23:35obviously front-end stuff I mean you can like use State and data as well yeah the
23:41state because obviously orders have to go through credit card details have to go through there used to be payments done obviously then tracking your order
23:48um you know knowing where it’s going to be delivery dates all those kind of things yeah um so there’s like Logistics part and
23:54then obviously there’s the payment part and then there’s obviously then the front end part which is the catalog of the things that you can buy in the
23:59pricing and the you know so it sounds like users you’ve already got um a bit
24:04of a kind of micro service architecture with all of these different teams um which means that there are lots of
24:11different kind of components that you might want to be able to scale um individually and deploy individually
24:18because those teams might make changes to those to their applications and then the way that they interact
24:26um could be made simple right so I’m guessing let me just take a guess to
24:31what the current state is that every change you make um in the in this in this organization
24:39with these different functions things like networking anytime there’s a change you know has to go through one team and
24:45all of this is centralized um or um you know working by Network so so
24:52let’s say one a payment one of your payment applications relies on your
24:57Logistics application for some reason and if you’ve made um you know a change so that your app is
25:04available on a different port then you have to go to some Central team to allow communication if security is an issue to
25:11allow communication between your things it depends because some people share
25:17some team stuff and other other teams don’t so it’s a bit of a mixed bag so
25:22some teams that work some other things historically ended up sharing infrastructure of another team just
25:29because it had the relationships before because it used to be in that team so they kind of just started sharing it and some of them was quite a few kubernetes
25:35clusters but then there’s some um using Lambda and some other things going on across different accounts yeah
25:43um but it’s all kind of working somehow in the end so there’s obviously like some front-end stuff that’s kind of
25:48Lambda there’s some things that like obviously trigger some events on when somebody puts an order through or
25:54whatever else there’ll be like an event that kind of happens for the back end so that’s kind of the Lambda stuff
25:59um but then the other things are the you know where the catalog’s hosted and all those other things are all shared in
26:05like a kubernetes clusters so you’ve also got different experience and different things different architectures
26:12for different architectures and and those architectural um choices that you’ve made
26:19um are you happy with them do you think there’s there’s a lot of I mean a team made them yeah you know it’s because
26:24they’re their own separate teams and so they decided that was like the right thing to do that somebody was like an event-driven
26:30architecture makes sense for this use case so they went off and used that um and then other teams were already
26:35doing containers and so they just went off and started to use kubernetes because they heard about it and it’s like managed services in the clouds like
26:42might as well use that um as well and are they in one account many accounts they’re in many accounts
26:50yeah many accounts are cool and not like loads and loads of accounts but enough there’s like there is more than one is
26:56what I’m saying and are those accounts just like their you know the stages of these environments or is there more than
27:03that other stages plus more yeah so there’s some stages that have accounts
27:09um I think some teams have gone a bit crazy with the number of stages some have like five accounts some only have
27:15two accounts yeah um so it’s been a bit of a mixed bag on how the team decided
27:21to split it up it’s quite a lot of autonomy and cost that it was because it was Central it was like it was basically
27:28a decentralized devops yeah so there’s different no consistency a lot of I mean
27:34we tried but people were speaking to each other you know but out of reusing and helping each other out as well
27:39there’s terraform oh cool now like everyone loves a terraform yeah so there’s terraform there and there’s like
27:45lots of different um lots of different accounts so it’s just a devops decided that the patterns were slightly different
27:52so people were like you know some of the data that was the card data people were a bit risk-averse they were like let’s split it all out in separate accounts
27:58yeah some people were then using like the production data to test and other accounts they had to be treated like
28:03production and all these other things for the credit card data so that was treated differently in like more secure
28:09by splitting everything out but then other things weren’t like that so they didn’t do that um so it’s just like yeah evolution of
28:15like the requirements from other places so it sounds like and we’ve talked about
28:20um the kind of Landing Zone type approach in previous episodes but you know I’m going to apply that as a kind
28:27of at least a baseline platform um to rely upon and then but that
28:32doesn’t really um uh that doesn’t really solve the developer experience which is what we’re
28:39talking about here yeah so we need to we need to so now you have segregated accounts for teams and some sort of
28:46Central Services that can be reused so it’s less of a decentralized
28:52um mess that seems to have been created and there’s some consistency to how you’re delivering in cloud and then the
28:59next thing I guess to to solve for is um how uh all of those technology
29:05choices that have been made um for right reason wrong reason whatever can either be standardized the
29:12operational overhead um can be sort of maintained um the ongoing operability for all of
29:21these things are day two operations can be um kind of prioritized and solved for as
29:26well so day two operations so at the moment I don’t think we’ve because I do
29:32know that one of the Clusters went offline because you upgraded it no it was like basically I think one of the
29:39cloud vendors we just didn’t upgrade it for so long it was on such an old version
29:44um that I think it got terminated because we were meant to upgrade but no one knew how to upgrade them
29:50um so we couldn’t really work out how to do it so I don’t think the team did it but then we did manage to obviously sort it
29:57out yeah um and then kind of move over so that was fine we’ve got support from the cloud
30:02vendor on that at the time um but that was expensive yeah it was very expensive
30:07um but then we’re now just looking just to start again so the aim is like we need to start a bit yeah we know it’s a
30:13mess in many different places so there’s no point trying to work out some consistency in all of this yeah because
30:19right it’s too that would take longer than maybe just rethinking what we need to do so you’re saying put a
30:26landing zone in first which is what the cloud best practice is which is exactly to support multi-team and multi-account
30:32or multi-project whatever you want to call it depending on the cloud provider yeah so that each team is going to get their
30:38own account but then the platform team then is going to do what is we we do all
30:44have a kubernetes each as a team or good question yeah I guess it kind of depends on
30:50um you know uh the the architecture or the things that are that’s why yeah they’re important so it’s quite a lot of
30:57support yes there’s a lot of important assets right so um let’s say well you you actually just sort of spoke about it
31:04earlier so you have card information yes PCI compliance and that seems to have uh
31:12data that you probably want to segregate often have a bit more risk aversion
31:18around yeah whereas other things might not necessarily have the same that’s the highest risk profile and probably the
31:25user data as well there we go so yeah so already you’ve you’ve kind of got alignment of workloads depending on the
31:33profile of the applications or you can Define everything to the highest risk if
31:39you’re really risk-averse you know you put everything in completely segregated environments where you’re not reusing
31:45any of the underlying infrastructure and everything is locked down
31:50um and and the um the kind of information flow between teams is now centralized as well so yeah
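The trade-off just described, aligning workloads to their risk profile versus treating everything as the highest risk, can be sketched as a small placement rule. The workload names, risk levels, and the "dedicated" versus "shared-cluster" labels are assumptions made for this sketch, loosely based on the PTP example (card data and user data being the highest risk).

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical classification of PTP's workloads.
WORKLOADS = {
    "catalog": Risk.LOW,
    "logistics": Risk.MEDIUM,
    "payments": Risk.HIGH,    # card data, in scope for PCI compliance
    "user-data": Risk.HIGH,
}


def placement(workloads, treat_all_as_highest=False):
    """Map each workload to a dedicated environment or the shared cluster."""
    highest = max(workloads.values())
    plan = {}
    for name, risk in workloads.items():
        effective = highest if treat_all_as_highest else risk
        if effective == Risk.HIGH:
            plan[name] = f"dedicated-{name}"
        else:
            plan[name] = "shared-cluster"
    return plan


print(placement(WORKLOADS))
print(placement(WORKLOADS, treat_all_as_highest=True))
```

Treating everything as highest risk is simpler to reason about but gives every team its own locked-down environment, which is the extra support burden the hosts go on to joke about.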
31:57they’re a really good developer experiences really mobile Okay cool so yeah I’m just wondering like when this
32:03develops because it’s going to arrive so at the moment we need to obviously suffer but we’re just looking to ship
32:09some features faster so we really want to know when this platform’s good but like so it’s I mean from all of these
32:16things me building a platform from scratch now this is there’s just so many requirements this it feels like it’s
32:22going to take me quite some time probably not like day two it feels like day 200 that I would have finished
32:29um addressing some of these requirements by we don’t need to necessarily I’m just telling you what the situation is yeah and obviously we can’t ship new features
32:38until at the moment but we also have new teams that are going to start so obviously we know we’ve we know that the
32:43current e-commerce we’ve got other ideas yeah business so we need to do something because there isn’t really a thing to
32:48align to so something that you as much as improving what we have so we need a migration strategy probably to the new
32:55thing this is like so much what there’s lots of things happening that’s not as much of a priority as actually the new
33:01project it’s not new net new net new stuff that needs to go quick and we’re just trapped because we’re
33:08you know don’t really know what to use because there isn’t one thing to use so standardized so all right so things
33:15that we decisions that we have made there is now a cloud Landing Zone um that’s good uh there is
33:23um kubernetes because that’s already used as well it’s already done is it or we can use it now well it doesn’t exist
33:29yet but okay technology choice so we’re still in the tech okay so we’re still in the planning and now we’re kind of in
33:35Day Zero and day one of phases of this uh of this rollout so yeah
33:41um so the the cloud stuff has been planned has been executed on
33:47um now the um the developer experience is now being planned for and executed on
33:52so as part of that I’m going to come up with a strategy for using kubernetes in the organization so that I can we can
34:00standardize on the skills in the org and standardize on how applications are
34:06being shipped um through the different environments um and standardize how they’re being
34:13monitored you know the observability around all of them um
34:18because obviously I’ve got a few different teams um that are outsourced actually
34:24outsourced teams oh wow yeah more and more requirements just keep yeah just because like make sure you don’t have
34:30the capacity we don’t have a capacity at the moment with the current team there’s PTP it’s just got infinite there’s like
34:36so much it’s just there’s a lot going on with eight teams already but we’ve
34:43got two new teams starting they’ve started well yeah I mean I mean they’re about to start so they’re doing their
34:49own scoping right for the new stuff obviously the other eight existing teams will need to migrate at some point to
34:55this new stuff we need a plan for that yeah so I’m just wondering like what’s the you’ve mentioned kubernetes you’ve got the landing zones
35:02um are we just having a giant cluster is it like what are you providing like a big platform that we’re all sharing or
35:07like what’s the kind of depends so you know uh if you’re focusing on I mean
35:13there’s credit card details and all these other stuff so I wouldn’t really feel great about us sharing things to be fair you don’t feel great about sharing
35:19well that security risk it would be high so I wouldn’t want to promote that to be we’ve already got that risk now in some
35:25projects there you go right so we’ve already we’ve already spoken about this though right we already said that the different security profiles of the
35:32applications would be in different accounts different clusters Etc but not absolutely everything because
35:39um you don’t want the cognitive overhead of managing it all unless
35:46I do know though that when we didn’t upgrade the previous cluster because it was shared yeah everything yeah and yeah
35:54which we don’t want either because that was high risk so so you probably want to segregate off
35:59um some of the some of the applications so there’s not as much risk I if a
36:05cluster goes down then it doesn’t take down the entire business
36:10well if it’s were you saying the whole old because you said three AZs are you saying if the whole three AZs go down no
36:16no if someone just doesn’t upgrade a cluster so yeah I mean a cluster can we just make that easy yeah so that’s not a
36:22problem that’s a good idea so how what what things are there in the industry that make cluster upgrades easier in
36:29fact um are you going to we are the team so I
36:34mean that’s kind of why I mean let me ask the team how do you do clusters [Music]
36:40so you you guys are already using terraform right so you know there’s functionality in in there that allows
36:47you to upgrade a cluster and Amazon makes that sort of easy but there’s loads of operational things now uh or
36:54questions that need to be asked right so am I responsible for the cluster probably
37:00um but you have a bunch of apps in that cluster do I decide when to upgrade and
37:06potentially affect your apps who knows well I don’t know because there will be 10 teams in total so I have to protect
37:13how many clusters are you proposing for for all those teams in the end well take the requirements in hand but let’s for
37:20argument’s sake and I’m not saying that this is the most efficient but for argument’s sake let’s say each of those teams has individual clusters well multiple
37:28clusters one for each stage of their uh environment pipeline or whatever
37:34um or stage of development so let’s say Dev UAT and prod so now there’s thirty
37:40clusters okay 10 teams 30 clusters that’s a lot
37:45of clusters and each team how much is this gonna cost it’s so cheap it’s like
37:50okay [Music] um but but now you have a uh obviously
38:00the the choice that has been made is that um all of the accounts and clusters are
38:06segregated um no underlying infrastructure so
38:12um a bit less potential well less of a footprint for security issues
38:19to happen in um but then the cognitive load and the operational overhead is high because you
38:25now have to manage thirty clusters also each time you do an upgrade you have to use sorry who are you talking about when
38:32you say you now have to manage who’s the you in the platform the platform team okay but
38:39not the developer if they are responsible for the cluster maybe they I in this weird scenario where I’m I don’t
38:47know if I’m the platform team or I’m just like yeah cool so this is my decision that I cool
38:54all right I hope so I’m just trying to still figure out this this weird uh I
38:59mean fictitious scenario so problem so in so I have 30 clusters
39:07um and I’m gonna take responsibility of owning these clusters however I think
39:13because it’s going to be I mean I only have three other people working in this and those three people cannot feasibly
39:20go into each of these teams and ask them when they can upgrade a cluster
39:26um the only the only thing that I can think about is is you know using um some solutions out there that make
39:34that easier um so it moves the operations into the team and the team
39:39can then decide so basically creating um uh commodity but they won’t know
39:47terraform they don’t need to know terraform so they you know they could use products out there that make
39:54um upgrades of clusters as simple as clicking a button that just says do you
39:59want to you know this upgrade is available do you want to hit okay so you’re saying
40:07[Music] and the team won’t need to know
40:14terraform that’s consuming this cluster yep and the Clusters are going to use something
40:21that takes that basically upgrades themselves almost or something is doing
40:27these upgrades yes um and the team is going to be responsible
40:33for when that happens because it’s their application they’re responsible for
40:38their application I’m responsible for the Clusters however I don’t want to be responsible for upgrading their clusters
40:44so I want to pass that down you can’t because you’ve got I’ve only got three people I can’t feasibly be responsible
40:49for that right so I have to find something um a solution out there that takes care of or moves this responsibility down or
40:57build it myself and I can’t build it because I’ve only got three people so okay so so you’re buying rather than
41:02building buying a solution um that that does this
41:08um what are the other problems I guess um or well just the speed of access so
41:13if I’m getting this cluster that my team needs like how long is that going to
41:19take [Music]
41:24um that I’ve bought rather than built um you can actually self-serve clusters Okay so we’ve got so we’ve got a way for
41:31the team that’s going to get these two clusters yeah which is obviously removing friction on access and then how do I
41:37get environments in the cluster for my apps that I need to deploy so an environment is
41:42essentially a namespace okay or one one concept of an environment is a namespace a different environment you could
41:49construct it as a cluster depending on you know how much infrastructure you want to reuse or not
41:54um but all of that is also self-serve so you can either get a namespace yourself
42:00or you can get a cluster yourself right but all of that is so I’m as a platform
42:07team I’m just you’re trusting the developer to create their own environments with
42:15knowing that the guard rails are in the right places I am only allowing the
42:20developers to do the things that I have um that I have trust in this tool can I
42:26delete the environments to kind of depends do I want you to delete the environments I think it should be
42:32allowed so yeah yeah apart from potentially prod just in case you mess up okay and that would be a guard rail
42:37is what you’re saying that would be a guard rail exactly however if you wanted to delete Dev if you wanted to delete
42:43uat go ahead right or they’re just not I just can’t destroy prod you can’t destroy prod Yeah
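The guard rail described here (developers can delete Dev and UAT themselves, but not prod) could be sketched roughly as below. This is only an illustration: the function name and the set of protected stages are assumptions, not anything named in the conversation.

```python
# Hypothetical guard rail: stage names follow the Dev / UAT / prod
# pipeline from the discussion; the function name is an assumption.
PROTECTED_STAGES = frozenset({"prod"})


def can_delete_environment(stage: str) -> bool:
    """Return True if a developer may self-serve deletion of this stage."""
    return stage.strip().lower() not in PROTECTED_STAGES


for stage in ("dev", "uat", "prod"):
    verdict = "allowed" if can_delete_environment(stage) else "blocked"
    print(f"delete {stage}: {verdict}")
```

Keeping the protected stages in one place means the platform team can change the rule without developers having to change how they ask for a deletion.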
42:50okay or there’s um you know an additional gate that you need if you are destroying prod for
42:56example if you wanted to do blue green deployments in prod and you were kind of moving off of a
43:03cluster moving your workloads onto another cluster you probably do want to delete the one that is just sat there
43:09doing nothing right yeah um so yeah those those types of guard rails and rules can can probably can
43:16happen in this wonderful um will this cluster optimize itself for costs so like because the other thing
43:22we’ve had is obviously lots of idle like resources yeah it’s costing a lot of
43:28money we’re not really using the infrastructure fully thankfully that is um one of the benefits of uh kubernetes
43:35so there’s lots of different add-ons in this world um Auto scalers and such and then um
43:42ways to manage your application as well so that that scales to demand so if your
43:49application is scaling to demand and your infrastructure is scaling to demand then
43:55um as long as the right information is being fed into the Clusters and the
44:00management of those clusters then yes it’s going to be so in this platform I
44:05can get the access get the infrastructure without knowing much about
44:10how you’re having the skills but they’ve known how you’ve done it but like it happens and that happens yeah exactly
44:16and then I can get access to environments and then this infrastructure when I
44:21start deploying apps will scale up and down accordingly to kind of save money so if I like and you can also get Cloud
44:28resources so let’s say your application needed SQS because you’re talking about it just knows I need one yeah well it
44:35doesn’t know you need one all right you need to tell it that you need one right okay you don’t
44:42necessarily need to know terraform so I’m going to standardize um I’m going to standardize the way that
44:49you’re asking for cloud services like a kubernetes cluster or um you know a
44:55database or a message queue or anything like that so in this experience so before because we’re talking about Dev experience yeah and we’re talking about
45:01all the touch points of the dev so like now CI uh is that there it’s
45:09yeah um platform team so one of my responsibilities is um making sure that
45:15I’ve standardized on a CI okay so I’ve got a CI that my team can use so that’s kind of there yeah
45:21um I’ve also integrated automatically you know the way that um you can deploy
45:27from CI into those clusters right okay so you’re
45:33so what do I do as the dev team in this platform in your in this you write code you containerize that code so write
45:41a Dockerfile and that’s it um and you have to write either some
45:49kubernetes templates or a Helm chart or something and pop it into the right folder structure that I’m going to give
45:55you and then it will just appear in your cluster okay so when I’m engineering
46:01I’ll just be testing for Dev experience I’m testing locally
46:06then I push to CI and then and then these deployment files how do I
46:12know they’re going to work these deployment files so it gets deployed and tested all within the CI pipeline so you
46:20have um you know linting that happens on the kubernetes manifests inside the pipeline
46:26you also have an automatic deployment into an ephemeral cluster that goes away
46:31to make sure that the app can come up and there’s tests within that so um you’re being really efficient about
46:38when you’re deploying to long-lived infrastructure right
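The ordering described above (lint the manifests, deploy to a throwaway cluster, run tests there, and only then touch long-lived infrastructure) might look something like this minimal sketch. The stage names and the stubbed checks are assumptions made for illustration; a real pipeline would call out to actual linting and deployment tooling.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; each check is a stand-in for a real step.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed, plus the name of the
    stage that failed (or None if every stage passed).
    """
    completed: List[str] = []
    for name, check in stages:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stages; the lambdas simulate pass/fail results.
stages: List[Stage] = [
    ("lint manifests", lambda: True),
    ("deploy to ephemeral cluster", lambda: True),
    ("run tests in ephemeral cluster", lambda: False),
    ("deploy to long-lived cluster", lambda: True),
]

passed, failed_at = run_pipeline(stages)
print("passed:", passed)
print("failed at:", failed_at)
```

Because the pipeline stops at the first failing stage, a broken app never reaches the long-lived cluster in this sketch.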
46:43cool cool platform isn’t it so I don’t have to learn so I have to learn kubernetes so that could take me a bit
46:48of time um as a Dev but some of the dev teams know it already yeah there’s quite a lot of good kubernetes
46:55um like learning product you know products out there or Solutions out there that help you learn so how long
47:00would you say these two new teams if they had the code ready how long
47:07would it take for them to get do they have an access to Dev and things like that and also what about solutions for like troubleshooting like if Dev didn’t
47:13work or they um are I’m assuming they’re onboarded into the organization so they
47:18have like an identity and yeah they’ll be they’ll be in their Central identity yeah if they have a central identity then they can have access pretty much
47:25straight away okay so then so long as they’re in the IDP yeah then they’ll get access and that’s going
47:31to be the onboarding process and then from that point on they can basically is there training probably yeah yeah yeah
47:37so just to standardize for a developer experience you want everyone to have the
47:42same um you know um Baseline knowledge of experience Etc
47:48so we’re going to provide you some deep you know some training some best practices and some of that material
47:54already exists online but there’s three of us and now that all of this stuff has been taken care of for the most part I
48:01could spend the time working with developers to train them on the things specific to the company and you’re
48:07saying that the PTP platform super easy to engineer now yeah so also this platform will be highly available yep
48:14it’s going to upgrade itself secure something’s going to do the upgrade itself I just need to inform when there
48:20is an upgrade and will my app go down if it’s upgrading itself uh depends how you like if you’ve if
48:27your application is meeting the requirements and the standards that we put in and also it has to because we put
48:34the right checks and balances and guard rails in those clusters so that you couldn’t deploy unless you were multi-az
48:41or um you know have multiple replicas or multiple
48:47um uh instances of your app running um then you’ll have no downtime okay so
48:53you’re going to put policies in place that would force me to have enough
49:00I guess enough scale in my deployment that match your requirements exactly um
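A policy of this kind, refusing a deployment unless it has multiple replicas and is spread across availability zones, could be approximated by a check like the one below. It is a sketch under assumptions: the manifest is taken to be an already-parsed Kubernetes Deployment as a plain dictionary, the minimum of two replicas is an invented threshold, and looking for a topology spread constraint on the zone label is just one possible way to interpret "multi-AZ".

```python
from typing import Any, Dict, List

# Well-known Kubernetes node label for the availability zone.
ZONE_KEY = "topology.kubernetes.io/zone"


def policy_violations(deployment: Dict[str, Any], min_replicas: int = 2) -> List[str]:
    """Return human-readable reasons the Deployment would be rejected."""
    spec = deployment.get("spec", {})
    problems: List[str] = []

    # Kubernetes defaults to a single replica when the field is omitted.
    replicas = spec.get("replicas", 1)
    if replicas < min_replicas:
        problems.append(f"replicas is {replicas}, need at least {min_replicas}")

    pod_spec = spec.get("template", {}).get("spec", {})
    constraints = pod_spec.get("topologySpreadConstraints", [])
    if not any(c.get("topologyKey") == ZONE_KEY for c in constraints):
        problems.append("no topology spread constraint across zones")

    return problems


# Hypothetical manifests for illustration.
single = {"spec": {"replicas": 1, "template": {"spec": {}}}}
spread = {
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "topologySpreadConstraints": [
                    {"maxSkew": 1, "topologyKey": ZONE_KEY,
                     "whenUnsatisfiable": "DoNotSchedule"}
                ]
            }
        },
    }
}

print(policy_violations(single))
print(policy_violations(spread))
```

In a real cluster this sort of rule would normally be enforced by an admission policy rather than a script, but the logic being enforced is the same.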
49:05and then the upgrades shouldn’t impact and then the things on there that maybe are
49:11your things will be impact to those things whatever those things are how I’m going to get my logs and other stuff off this Envision we manage that don’t worry
49:18okay so there won’t be any downtime for like my logs are not going to disappear and all of a sudden and uh we’ll just
49:23ship it to Cloud you know they have uh services that um so I have access to the how do I get
49:29access to the cloud um we’ll we’ll manage that all okay through the landing Zone all right so you’re going to give me access to the
49:36cloud and the platform does that or like I need to speak to the team or like uh just as part of your onboarding you get
49:42it automatically cool so I’ll know as part of the onboarding and the training I’m gonna know where my logs are going to be where
49:47my monitoring is going to be monitoring access templates templates that you can reuse
49:53it will even give you the ability to scan your code so that any
50:00vulnerabilities or anything that you find um you you can take care of early rather
50:05than taking care of when they’re in production yeah cool so basically these two projects
50:10should be able to go live quickly and then the day two challenges you’re saying Are Gonna it’s gonna be
50:17like a self-healing self-upgrading kind of self-scaling thing that takes
50:24the day two burdens away yeah and then you’re gonna for my application day two like troubleshooting managing operating
50:31knowing about performance you can provide me all the tooling for that so that I know what to do in my app
50:37so cool pretty good eh that’s pretty good awesome I mean so that’s basically what
50:44we need they do with four people in you know in the team because I didn’t build
50:49it couldn’t build it yeah cool that’s
50:56awesome so I guess now um I mean
51:02PTP took so long we actually went on
51:09um but no that’s good that was good to see um like could be a day in the life of a
51:15company yeah to take them from kind of craziness to something yeah standardized
51:21exactly for the new things and then we can align to it and it couldn’t it shouldn’t really take that long to do
51:26really if you’ve you know well I suppose if you don’t if you’re not having to engineer it all from scratch then yes
51:31exactly as long as the things are out there for you to use I’m ready to go and let the people know
51:37how to train people up on those things and yeah sounds good cool all right so the day day zero one
51:43and two process of you have to go through to before you like work out how you’re going to solve day two how are
51:49you going to solve day one are you going to think about Day Zero are you factor in developer experience and some of that
51:54day one wasn’t even building it was also thinking about X like
52:00research to see whether there’s something out there already yeah true exactly um absolutely everything right day one even
52:08though it’s kind of focused on building doesn’t mean you have to yeah cool um
52:14there you go interesting all right well I think I grilled you enough on that yeah
52:20[Laughter] cool thanks everyone for listening and
52:26we’ll be back with another episode soon thanks again bye bye
52:35[Music]
52:42thank you