no subject

Whoops, well, still available. Left this open and forgot about it, so -

As to those specific things - I'm a software engineer but I don't work on product ('Implement this feature for customers!'), I work on reliability and 'system' engineering, where 'system' is 'the backend stack, databases, appservers, etc' rather than 'write the kernel! fix the network card!'.

So, in my role, there's a /ton/ of New And Interesting stuff. Things I deal with include - okay, for example, we're integrating this new product that wants to use all of our backends, but the existing queries have a deadline of 600ms and the new ones have a deadline of 70ms. How do we separate and track these traffic flows so the new traffic always gets priority? Which database callouts can we still afford to do? How do we guarantee these queries get served close enough to the user that the round-trip time is short enough, when we don't really know what that round-trip time is?

In response to this, I'm probably going to spend the next month writing a module that determines how much time a query has left, calculates how much network time this means we can afford, detects a list of clusters within that time range, and restricts the load balancer to only talk to them. We had a long chat with the team who designs and runs the load balancers, about making this a core feature, and they went 'Um. That's super specialised, guys, what are you even doing? Sorry, we can't help, but tell us how it goes?'
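To make that concrete, here's roughly the shape of the logic in Python. This is only a sketch, not the real module: the `Cluster` type, the function names, the per-cluster round-trip numbers and the 40ms of backend work are all made up for illustration, and the real thing would feed its result into the load balancer rather than just returning a list.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cluster:
    name: str
    rtt_ms: float  # assumed: a measured round-trip time to this cluster


def remaining_ms(deadline: float, now: Optional[float] = None) -> float:
    """Milliseconds left before an absolute deadline (monotonic-clock seconds)."""
    if now is None:
        now = time.monotonic()
    return max(0.0, (deadline - now) * 1000.0)


def network_budget_ms(time_left_ms: float, backend_work_ms: float) -> float:
    """Whatever is left after the backend's own work is what the network may use."""
    return max(0.0, time_left_ms - backend_work_ms)


def eligible_clusters(clusters: List[Cluster], budget_ms: float) -> List[Cluster]:
    """Keep only clusters whose round trip fits the budget, nearest first."""
    within = [c for c in clusters if c.rtt_ms <= budget_ms]
    return sorted(within, key=lambda c: c.rtt_ms)


if __name__ == "__main__":
    # Hypothetical numbers: a 70ms query that still needs ~40ms of backend work.
    clusters = [Cluster("far", 120.0), Cluster("mid", 25.0), Cluster("near", 5.0)]
    start = time.monotonic()
    deadline = start + 0.070
    budget = network_budget_ms(remaining_ms(deadline, now=start), backend_work_ms=40.0)
    print([c.name for c in eligible_clusters(clusters, budget)])
```

With those made-up numbers the network budget comes out at about 30ms, so only the "near" and "mid" clusters stay in the list the load balancer would be allowed to use. The hard parts in practice are the inputs, i.e. knowing the round-trip times and the backend cost at all.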

Which, to answer the other point - over the course of figuring this out, I and other SREs ('site reliability engineers') have had long discussions with a) traffic team b) load-balancer team c) product for the new queries d) development team for the frontend binary e) development team for the auction binary f) development team for each of the databases, etc. So SRE is very communications-y and you end up, as a junior SRE, talking to sometimes /very/ senior developers in other orgs and being respected - we're rare, and we're very specialised in these issues, so if you asked for a consult you'd better listen XD

Within my team, it's very social and loyal and supportive. We go on call, 12hrs a day for a week every ~7 weeks or so, dealing with things breaking, sometimes in pretty serious ways. It's stressful. But you can absolutely always call on a teammate if it's 9pm on a Saturday and you don't understand why the database is gone, and they'll be there, no questions, won't make you feel guilty for asking or feel dumb for not knowing. We have a 'postmortem culture', where failures are dissected in the context of the circumstances that created them, not the people.

So. On both of those points - I love my job.
