Cryptopolitan
2026-05-10 05:03:05

Coinbase’s pivot to AI-led operations is not going so well

Coinbase (Nasdaq: COIN) has once again shown crypto traders how slow cloud hardware can spoil even a fast exchange, and the company's AI-powered operations pivot might have been its worst move yet. On Friday, the company said a cooling failure inside Amazon Web Services (Nasdaq: AMZN) helped trigger a multi-hour outage that hit trading, exchange access, and balance updates across its platform.

The problem began at roughly 23:50 UTC on May 7, when internal monitors detected a widespread breakout of quote failures across the company's systems. Engineers opened several Sev1 incidents, and customers were already affected across spot trading, Coinbase Prime, International, derivatives, and the Retail, Advanced, and Institutional exchanges.

Coinbase CEO Brian Armstrong wrote on X that his company "experienced an outage" and that such an occurrence was "never acceptable." The cause, he said, was "a room overheating in an AWS data center due to multiple chillers failing." According to Armstrong, Coinbase designs its services to stay online even if one AWS availability zone fails. Most services are structured this way; the exception is the exchange, which runs on different infrastructure because of its low-latency demands.

Coinbase blames failed AWS chillers as quote systems start breaking before midnight UTC

As Cryptopolitan reported earlier, Coinbase plans to cut 700 workers, roughly 14% of its total workforce, with the intention of replacing manual processes with AI. Rob Witoff, who heads Coinbase's Platform organization, laid out the technical details.
According to Witoff, the outage lasted for hours and affected "trading, exchange access, and balance updates." The first warning came at 23:50 UTC, when quote failures began surfacing in internal systems, and a Sev1 response followed immediately. The cause, he said, was a "thermal event" in a small percentage of racks in one AWS us-east-1 facility.

This is where the exchange's infrastructure design came into play. Witoff said Coinbase runs its exchange infrastructure in a single availability zone because the industry prizes speed, and the firm keeps a distributed backup copy of that infrastructure for exactly these scenarios. But the failure did not stay within the boundaries of one component, which prolonged the recovery.

Two components failed. First, hardware underneath the matching engine malfunctioned, so recovery and failover operations had to be performed before anything else. Second, the distributed Kafka cluster, which shares data throughout the organization's systems, went down; recovery meant rebuilding Kafka partitions, amounting to tebibytes of data, on new broker hardware.

Engineers rebuild quorum and bring Coinbase markets back through cancel-only and auction modes

The matching engine was responsible for the largest trading stall. It processes orders and maintains order books, runs as a distributed cluster, and requires a quorum before it can elect a leader and trade safely. Because not all nodes remained healthy under the data-center constraints during the outage, quorum could not be reached, which blocked trading on the Retail, Advanced, and Institutional exchanges.
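The quorum requirement Witoff describes can be sketched in a few lines: a cluster can only elect a leader and resume matching while a strict majority of its nodes report healthy. This is a generic illustration of majority quorum, with hypothetical node names and cluster sizes, not Coinbase's code:

```python
# Minimal sketch of majority quorum in a distributed cluster (illustrative
# only; node names and cluster size are hypothetical, not Coinbase's design).

def has_quorum(healthy_nodes: int, cluster_size: int) -> bool:
    """A strict majority of nodes must be healthy to elect a leader."""
    return healthy_nodes > cluster_size // 2

def can_trade(node_health: dict[str, bool]) -> bool:
    """Trading can only proceed while the cluster holds quorum."""
    healthy = sum(node_health.values())
    return has_quorum(healthy, len(node_health))

# Five-node cluster: losing two nodes still leaves a 3-of-5 majority...
cluster = {"n1": True, "n2": True, "n3": True, "n4": False, "n5": False}
print(can_trade(cluster))  # True

# ...but losing a third drops the cluster below majority, and trading halts.
cluster["n3"] = False
print(can_trade(cluster))  # False
```

This is why a "small percentage of racks" overheating can stall an entire exchange: once enough nodes are unhealthy, the survivors cannot safely agree on a leader, so the matching engine stops rather than risk inconsistent order books.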
Witoff said on-call support and engineering teams had to execute the company's disaster-recovery procedures, establish quorum, and assess system health under difficult infrastructure conditions. The team had to develop, test, deploy, and validate a fix while managing the broader outage. Kafka would have required extensive manual recovery because its partitioned architecture handles thousands of terabytes a day. Balance streams were delayed while Kafka lagged behind, but Witoff said those balance issues disappeared once replication caught up. According to Coinbase, no data was lost.

When the matching engine came back into service, markets were not re-enabled all at once. Coinbase first switched all products to cancel-only mode, checked product statuses, moved all markets to auction mode, and finally re-enabled trading on Coinbase Exchange. Witoff also stressed that customers should never be locked out of their accounts, even temporarily. Coinbase said it would publish a detailed explanation of the incident within several weeks.

Josh Ellithorpe, however, rebutted the rumors after reading Witoff's post on X. As he put it, "no one vibe coded something that failed. A 'non-engineer' didn't push production code and take out the trading engine. It wasn't intentional. It wasn't because Coinbase failed to design a failover system. Things happen at scale, don't let the armchair quarterbacks tell you tall tales."
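The staged reopening described above (cancel-only, then auction, then full trading) behaves like a simple one-way state machine: each market may only advance to the next recovery stage, never skip ahead. The sketch below is a hypothetical illustration of that sequencing, not Coinbase's actual code:

```python
# Hypothetical sketch of the staged market reopening sequence described in
# the post-incident notes: HALTED -> CANCEL_ONLY -> AUCTION -> TRADING.
from enum import Enum

class MarketState(Enum):
    HALTED = "halted"
    CANCEL_ONLY = "cancel_only"  # existing orders may be cancelled, none placed
    AUCTION = "auction"          # orders are collected, then matched at one price
    TRADING = "trading"          # continuous matching resumes

# Only forward transitions along the recovery path are allowed.
ALLOWED = {
    MarketState.HALTED: {MarketState.CANCEL_ONLY},
    MarketState.CANCEL_ONLY: {MarketState.AUCTION},
    MarketState.AUCTION: {MarketState.TRADING},
    MarketState.TRADING: set(),
}

def advance(current: MarketState, target: MarketState) -> MarketState:
    """Move a market to the next recovery stage, rejecting skipped steps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = MarketState.HALTED
for step in (MarketState.CANCEL_ONLY, MarketState.AUCTION, MarketState.TRADING):
    state = advance(state, step)
print(state)  # MarketState.TRADING
```

Gating each stage this way lets an exchange verify product health before exposing order books to continuous matching, rather than reopening everything at once.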
