Please don't move to us-west.

TF makes API calls to the underlying cloud.

I'm not sure if Zoom has any critical infra in AWS though (assuming the services that are down are not the base layers of AWS, like the 2020 outage).

It is more complicated, and it does require a different sort of person.

This explains it.

Anyone else notice similar issues in us-west-2 a few hours before this issue in us-east-2?

And you know, deleting a key for the state lock (one that explicitly tells you when and by whom it was created) ain't that hard or that big of a deal.

Same here - we were finally able to log in to the console, but we're in us-east-2 and are having a ton of issues.

Not everything is worth the money. Zone downtime still falls under an AWS SLA, so you know roughly how much downtime to expect, and for a lot of businesses that downtime is acceptable.

We expect to recover the vast majority of EC2 instances within the next hour.

They are fully isolated partitions of the AWS global infrastructure.

Crackle happens to be the name of a video-on-demand company.

At least in my experience, AWS downtime also only accounts for a minor share of total downtime; the major source is crashes and bugs in the application you're actually trying to host.

I think you may have slightly misread.

But https://news.ycombinator.com/item?id=32267154, https://www.youtube.com/watch?v=RuJNUXT2a9U.

This seems to be the first major us-east-2 outage, indeed, vs us-east-1 and other regions.

So anyone who is in us-west-2 is there intentionally, which makes me assume there is a smaller footprint there (but I have no idea).

Depends what it's stuck doing, but you might Ctrl-C it and later manually unlock the state file (by carefully coordinating with colleagues and deleting the DynamoDB lock object if you're using the S3 backend) when the outage is over. (See the sketch below.)

Our EC2 instances can't connect to RDS, and we just got 500 errors on the dashboard. The connection proxy/sites were giving 504s. All of our production services are multi-AZ as well.

Not all systems require high availability.

It's not debt if you don't have to pay for it -- and if the ongoing costs of whatever it is are relatively insignificant.

Lots of things intermittently unreachable in us-east-2 for us, across multiple AWS accounts.

For a long time the default when you first logged into the console was us-east-1, so a lot of companies set up there (that's where all of Reddit was run for a long time, and Netflix too).

Designing for those scenarios increases complexity and cost, changes your architecture style, and most of the time brings you into microservices territory, where most companies lack experience and are just following "best practices" in a field where engineers are expensive and few.

I suspect, behind the scenes, AWS fails to absorb the massive influx of requests and network traffic as AZs shift around.

If it really got stuck and you have to kill it, then sure, you might have to mess with it a bit.

Running a business is about managing costs and risks.

An Availability Zone has its own infrastructure and is physically separated from the other zones.

Just my hunch, given that it happened during the middle of the week in the middle of the day and came back relatively quickly.

Insert clip of O'Brien explaining to the Cardassians why there are backups for backups. In case anyone is unaware of the reference, that's from Star Trek: Deep Space Nine.

Datacenter power has all kinds of interesting failure modes.
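To make the state-lock comments above concrete, here is a minimal sketch of inspecting and (as a last resort) removing the DynamoDB lock item, assuming the common S3 + DynamoDB Terraform backend. The table name and LockID are hypothetical placeholders; the real values come from your backend config and the "Error acquiring the state lock" message, and `terraform force-unlock <LOCK_ID>` is normally the first thing to try.

    import boto3

    # Hypothetical names: use the "dynamodb_table" from your backend config and
    # the LockID that Terraform prints when it fails to acquire the lock.
    LOCK_TABLE = "terraform-locks"
    LOCK_ID = "my-tf-state-bucket/envs/prod/terraform.tfstate"

    dynamodb = boto3.client("dynamodb", region_name="us-east-2")

    # Look before you leap: the lock item records who took the lock and when,
    # so you can coordinate with colleagues before touching it.
    resp = dynamodb.get_item(TableName=LOCK_TABLE, Key={"LockID": {"S": LOCK_ID}})
    print(resp.get("Item", "no lock held"))

    # Only once you are sure the run is dead and nobody else is applying:
    # dynamodb.delete_item(TableName=LOCK_TABLE, Key={"LockID": {"S": LOCK_ID}})

The point of reading the item first is exactly what the commenter describes: the lock tells you when it was taken and by whom, so deleting it is a coordinated, low-drama step rather than a leap of faith.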
The only reason we knew to shut them down at all was because AWS told us the exact AZ in their status update.

Some people care about it, but not enough to justify the added downsides - multi-data-center is expensive (you pay per data center) and it's complex (data sharding/duplication/sync).

Availability zones are not guaranteed to have the same name across accounts (i.e. us-east-2a in one account might be us-east-2d in another).

I had to manually re-deploy to get pod distribution even again.

Those never fail, right?

In our case, it's probably better to just take a backup, shut down everything, do an offline upgrade, and rebuild & restore if things unexpectedly go wrong.

Pedantic clarification for the unfamiliar: the breakfast cereal mascots are named Snap, Crackle, and Pop.

us-west-2 has had outages as well, but they are less common, even rare.

This should be at most 100 km. I presume they are trying to express an extra cardinal dimension perpendicular to the plane.

Interestingly, we saw a bunch of other services degrade (Zoom, Zendesk, Datadog) before AWS services themselves degraded.

Both RDS and ElastiCache run on EC2.

Which is normally not needed.

If your application and infra can magically utilize multiple zones with a couple of lines, then I would say you are miles ahead of just about every other web company.

And if I had read the page the link points to more carefully, that's exactly the reason.

It's premature when it's premature.

Our primary Kafka instance (R=3) survived, but this auxiliary one failed and caused downtime.

This has really started to change my thoughts on how to approach, e.g.

So, if you're looking to fundraise and your business has tight margins, don't be too hasty to move to managed services.

On top of that, I was looking at the documentation for KMS keys yesterday, and a KMS key can be multi-Region, but if you don't create it as multi-Region from the start, you can't update the multi-Region attribute.

[10:25 AM PDT] We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power.

If those hang, you'll have to wait for them to time out.

Multi-region failover would have made us more fault-tolerant, but our infrastructure wasn't set up for that yet (besides an RDS failover in a separate region).

There is a distance of several kilometers, although all are within 100 km (60 miles) of each other.

AWS works with multiple availability zones (AZs) per region; some products deploy in several of them at the same time by default, while others leave it up to you.

That's why I'm so surprised.

AWS availability zones (so, us-west-2b rather than us-west-2) are not the same between accounts.

AWS availability zones are randomly shuffled for each AWS account: your us-east-2a won't (necessarily) be the same as another user's (or even another account's in the same organization). I wonder if this is done because people have a tendency to always create resources in "a" (or some other AZ) and this helps spread things around. (See the sketch below for mapping AZ names to the stable zone IDs.)

Multi-AZ is a requirement for production-level loads if you cannot sustain prolonged downtime.

However, the actual underlying stack deployment succeeded.

Spend more time on features instead.

Looks to be a larger issue in the US East; also seeing Cloudflare and Datadog with issues.

The fact that so many popular sites/services are experiencing issues due to a single-AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry.

But as an AWS customer, you still feel the impact.
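A minimal sketch of the name-to-ID mapping discussed above: AZ names (us-east-2a/b/c) are shuffled per account, but zone IDs (use2-az1, use2-az2, ...) are stable and are what AWS status updates reference, so this is one way to find out which of your account's names corresponds to the affected USE2-AZ1. Assumes boto3 credentials with EC2 describe permissions.

    import boto3

    # Zone *names* are per-account aliases; zone *IDs* identify the physical AZ.
    ec2 = boto3.client("ec2", region_name="us-east-2")
    zones = ec2.describe_availability_zones()["AvailabilityZones"]

    for az in zones:
        # e.g. "us-east-2a -> use2-az3" (the mapping differs between accounts)
        print(f'{az["ZoneName"]} -> {az["ZoneId"]} ({az["State"]})')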
I look forward to the postmortem.

I don't work on cloud stuff, so I'm genuinely unsure if this is a joke.

Clarification: 1/3 of sites will go down (those using the AZ that went offline), but my point is the same.

Could not write to the DB at all.

Funny, I failed away from the zone and RDS still doesn't work; connections fail.

ASGs, RDS, load balancers, etc.

Our best was a bird landing on a transformer up on a pole.

They call their data centers availability zones.

This outage occurred in us-east-2.

If you're using Aurora then it is handled automatically and is even less expensive.

Haha, literally had this same thought.

And like you mentioned, the oldest.

Everything else is stateless and can be moved quickly.

99.5% availability allows up to about 3 and a half hours of downtime a month.

Edit: although, one of our vendors that uses AWS has said that they think ELB registration is impacted (but I don't recall if that's regional?).

[10:25 AM PDT] We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power.

There's all sorts of cascading/downstream "weirdness" that can result in AWS's own services through the loss of an AZ.

I did notice it being a little slow, but I'm also on 4G at the moment (it got the blame).

And the reason that works is because HN is mostly hosted on its own stuff, without weird dependencies on anything beyond "the servers being up" and "TCP mostly working."

- Is it worth it to have offices across the globe, fully staffed and trained to be able to take on any problem, in case there's a big electrical outage/pandemic/etc. in another part of the world?

Thanks, this comment made it very clear to me that I never want to touch a Terraform system.

If you are not chaos-monkeying regularly, you won't know about those cases until they happen.

I saw one of our RDS instances do a failover to the secondary zone because the primary was in the zone that had the power outage. (See the sketch below.)

I've seen outages caused by a cat climbing into a substation, rats building a nest in a generator, firefighting in another part of the building causing flooding in the high-voltage switching room, etc.

I understand that us-east is AWS's oldest and biggest facility, but Amazon seems to have more money than Croesus; why aren't they fixing/rebuilding/replacing us-east with something more modern?

Even if they did care, the business is often too incompetent to understand that they could easily prevent these things.

Companies don't want to pay for in-house architecture/etc., and developers are generally ultra-hostile towards ops people.

Ctrl-C is usually enough to cause it to abort without any ill effects to the state file.

Given that the outage affected a portion of one DC in one AZ, we can make some assumptions, but the truth is we just don't know.

us-east-2 is our default region for most stuff, and so far that's been good.

AWS makes it pretty easy to operate in multiple AZs within a region (each AZ is considered a separate datacenter, but in real life each AZ is multiple datacenters that are really close to each other).

To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second Saturday in October, call on Monday" to their customers 1-2 months in advance, prepare to have the critical information outside our system, and have agents tell people just that.
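For the RDS failover observations above, here is a sketch of how you might confirm which AZ a Multi-AZ instance's primary currently lives in and, if needed, force it over to the standby. The instance identifier is a hypothetical placeholder, and a forced failover still means roughly a minute or two of unavailability, as other commenters note.

    import boto3

    rds = boto3.client("rds", region_name="us-east-2")

    # "prod-db" is a placeholder for your DB instance identifier.
    db = rds.describe_db_instances(DBInstanceIdentifier="prod-db")["DBInstances"][0]
    print("MultiAZ:", db["MultiAZ"])
    print("Primary AZ:", db["AvailabilityZone"])
    print("Standby AZ:", db.get("SecondaryAvailabilityZone"))

    # If the primary sits in the affected AZ and hasn't failed over on its own,
    # a reboot with ForceFailover promotes the standby (brief downtime applies):
    # rds.reboot_db_instance(DBInstanceIdentifier="prod-db", ForceFailover=True)

Note that even after a successful failover, clients that cache the old DNS answer or hold stale connections can keep failing, which may be part of what people above were seeing.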
A single DC in the use2-az1 availability zone.

Going by your example, if your website requires one application server, then to tolerate a single-AZ failure you need to double the number of application servers. (See the sketch below for checking how an Auto Scaling group is actually spread across AZs.)

It's not just a single instance, either; there's generally a lot more infrastructure (DB servers, app servers, logging and monitoring backends, message queues, auth servers, etc.).

(And checkbox-easy is sweeping edge cases and failure modes under the rug.)

Not everything is greenfield, and re-architecting existing applications in an attempt to shoehorn them into a different deployment model seems a bit much.

This is true, at the AWS account level.

Whether TF can update the state and release its locks would depend on where those were hosted.

RDS failovers are not free and have a small window of downtime (60-120s as claimed by AWS [1]).

At least that's what was recently told to me by my manager to explain why my employer prefers to hire people to self-manage the AWS infra.

Possibly related: https://news.ycombinator.com/item?id=32267154

Now I guess we have to move to us-west-2.

Architect here.

https://docs.google.com/spreadsheets/d/1Gcq_h760CgINKjuwj7Wu (from https://awsmaniac.com/aws-outages/)

Always check HN before trying to diagnose weird issues that shouldn't be connected. ;)

Local Zones serve as edge locations hosting applications that need low latency to end users or on-premises installations.

You really do have to look at the use cases and requirements before making sweeping statements.

At which time it's too late.

You decide on what kind of failure you're willing to tolerate and then architect based on those requirements (loss of multiple AZs, loss of a region, etc.).

Interestingly enough, to them a controlled downtime of 2-4 hours with an almost guaranteed success is preferable to a more complex, probably working zero-downtime effort that might leave the system in a messed-up - or not messed-up - state.

Living a bit more dangerously at the moment, as HN is still running temporarily on AWS.

Both the EC2 instance health and our HTTP health checks.

And I thought us-east-2 was the way to escape us-east-1's problems.

The loss of power is affecting part of a single data center within the affected Availability Zone.

- Is it worth it to have 3 suppliers for every part that our business depends on, with each of them contracted to be able to supply 2x more, in case another supplier has issues?

us-east-2a in one account might be us-east-2d in another.

> us-east-1 is the region you're thinking of that has issues.

People working in IT naturally think keeping IT systems up 100% of the time is most important.

This.

People are just unaware, and probably making bad calls in the name of being "portable".

It's more expensive to have more things, and it's more expensive to have more complicated things that are also complex.

Don't all the multi-AZ deployments imply at least one standby replica in a different AZ?

I'd consider using it, but the biggest roadblock for me is that I work in a regulated industry in Australia, and AWS hasn't finished their Melbourne region yet (next year, maybe?).

We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk.

That would be a valid reason not to check this checkbox, if your business can survive a bit of downtime here and there.
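Related to the capacity-doubling point above: surviving a single-AZ failure only helps if your instances are actually spread across zones and the surviving zones can absorb the load. A quick sketch for checking the per-AZ distribution of an Auto Scaling group with boto3; the group name is a hypothetical placeholder.

    import boto3
    from collections import Counter

    autoscaling = boto3.client("autoscaling", region_name="us-east-2")

    # "web-asg" is a placeholder for your Auto Scaling group name.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=["web-asg"]
    )["AutoScalingGroups"][0]

    # Count running instances per AZ and compare against the configured AZs.
    per_az = Counter(i["AvailabilityZone"] for i in group["Instances"])
    print("Configured AZs:", group["AvailabilityZones"])
    for az, count in sorted(per_az.items()):
        print(f"{az}: {count} instance(s)")

If one AZ holds most of the capacity (as in the "had to manually re-deploy to get pod distribution even again" comment), losing that AZ hurts far more than the N+1 math suggests.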
To an investor, having a DevOps role on staff is acceptable.

So you need to create a new KMS key and update everything to use the new multi-Region key. (See the sketch below.)

That SaaS has been running Aurora for years and has never experienced anything similar.

Rather the opposite - us-west-2 is big but not the biggest region, nor the smallest, nor the oldest or newest, and it's not partitioned off like the China or GovCloud regions.

Not sure why we're on that list.

It would be one thing if this was a regional failure, but a single-AZ failure should not have any noticeable effect.

Second: EKS (Kubernetes).

Also, I think a lot, but not all, of the services I use work okay with multiple regions.

But both of them have Multi-AZ options.

There are six 9s in there.

What's more likely is that their companies have other priorities.

It gives me a bad gut feeling when you imply that multiple instances of a service are more complex than a single instance which cannot be duplicated easily.

Of course it's more costly; you need to ensure state between locations, so by virtue of that there's more infra to pay for.

> each AZ is considered a separate datacenter but in real life each AZ is multiple datacenters that are really close to each other

I was pointing out that your statement was over-general and that there are many instances where making the informed decision to ignore HA is a completely reasonable thing to do.

Sad face.

Are you saying that while AWS maintains multiple AZs, they can't maintain reliability on the failover systems between them?

Having _everything_ in a single AZ of AWS is, indeed, a problem.

I can get into my hosts, but my LB isn't routing.

> Looks like Snap, Crackle and Pop are down as well.

Most of that expense is just the cost of a hot failover, but there is some additional cost around inter-AZ data transfer.

AWS has said they have resolved the issue for almost 2 hours now, and we are still having issues with us-east-2a.

(I'd link to the threads about this from a few weeks ago, but am on my phone ATM.)

I believe it's on AWS after its two servers broke at the same time the other day.

There is a shortage of good cloud engineers, but even if there were more of them, the business doesn't give a crap about brief outages like this.

> you are miles ahead of just about every other web company

It was because apparently Netlify and Auth0 use AWS and went down, which took down our static sites and our authentication.

Most startups that go that route don't survive long enough to make use of this optimization.
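To illustrate the multi-Region KMS point above: the MultiRegion flag can only be set at key creation, so the fix really is a new key plus replicas, and then re-pointing (and re-encrypting) whatever used the old key. A sketch with boto3; the description and target region are placeholders.

    import boto3

    kms = boto3.client("kms", region_name="us-east-2")

    # MultiRegion must be chosen at creation time; it cannot be toggled on an
    # existing single-Region key.
    key = kms.create_key(
        Description="replacement multi-Region key (placeholder)",
        MultiRegion=True,
    )
    primary_key_id = key["KeyMetadata"]["KeyId"]  # multi-Region key IDs start with "mrk-"

    # Create a replica in another region; primary and replica share key material
    # and key ID, so ciphertext from one can be decrypted with the other.
    kms.replicate_key(KeyId=primary_key_id, ReplicaRegion="us-west-2")

    # Anything encrypted under the old single-Region key still has to be
    # re-encrypted (or re-wrapped) under the new key; there is no in-place conversion.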