Lessons From the 2003 Blackout That Left More Than 50 Million People Without Power
On August 14, 2003, a cascading failure of the power grid plunged more than 50 million people into darkness in the northeastern US and Canada. It was the most significant power outage ever in North America, with an economic impact north of ten billion dollars. Calamities like this don’t happen in a vacuum, and many human factors, political aspects, and organizational issues contributed to the blackout. But this is an engineering channel, and a bilateral task force of energy experts from the US and Canada produced this in-depth, 240-page report on the technical causes of the event, which I’ll try to summarize here. Even though this is an older story, and many of the tough lessons have already been learned, it’s still a great case study for exploring a few of the more complicated and nuanced aspects of operating the electric grid, essentially one of the world’s largest machines.
Nearly every aspect of modern society depends on a reliable supply of electricity, and maintaining that reliability is an enormous technical challenge. I have a whole series of videos on the basics of the power grid if you want to keep learning after this, but I’ll summarize a few things here. And just a note before we get too much further: when I say “the grid” in this article, I’m really talking about the Eastern Interconnection, which serves the eastern two-thirds of the continental US plus most of eastern Canada.
There are two big considerations to keep in mind concerning the management of the power grid. One: supply and demand must be kept in balance in real time. Bulk storage of electricity is nearly non-existent, so generation has to be ramped up or down to follow changes in electricity demand. Two: in general, you can’t control the flow of electric current on the grid. It flows freely along all available paths, governed by relatively simple physical laws. When a power provider agrees to send electricity to a power buyer, it simply increases its generation while the buyer decreases their own production or increases their usage. This changes the flow of power along all the transmission lines that connect the two. Each change in generation and demand affects the entire system, sometimes in unanticipated ways.
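To make that last point concrete, here’s a minimal sketch of the idea, using the common “DC” power-flow approximation and numbers I made up for illustration: a scheduled transfer between two parties divides itself among the parallel paths that connect them in inverse proportion to each path’s reactance, whether or not anyone intended it to.

```python
# Minimal sketch: how a power transfer splits across parallel paths.
# Uses the "DC" power-flow approximation, where flow on a path is
# proportional to the angle difference across it divided by its
# reactance. All numbers below are invented for illustration.

def split_transfer(transfer_mw, reactances):
    """Split a transfer among parallel paths in inverse proportion
    to each path's reactance (lower reactance carries more flow)."""
    admittances = [1.0 / x for x in reactances]
    total = sum(admittances)
    return [transfer_mw * b / total for b in admittances]

# A seller schedules 1,000 MW to a buyer connected by two parallel
# corridors: a short, strong path (x = 0.02 pu) and a longer path
# that loops through a neighboring system (x = 0.06 pu).
flows = split_transfer(1000, [0.02, 0.06])
print(flows)  # -> [750.0, 250.0]: a quarter of the schedule flows
              # through the neighbor, whether they agreed to it or not.
```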
Finally, we should summarize how the grid is managed. Each individual grid is an interconnected network of power generators, transmission operators, retail energy providers, and consumers. All these separate entities need guidance and coordination to keep things running smoothly. Things have changed somewhat since 2003, but at the time, the North American Electric Reliability Council (or NERC) oversaw ten regional reliability councils that operated the grid to keep generation and demand in balance, monitored flows over transmission lines to keep them from overloading, prepared for emergencies, and made long-term plans to ensure that bulk power infrastructure would keep up with growth and changes across North America. Below the regional councils, reliability coordinators performed the day-to-day grid management and oversaw each control area within their boundaries.
August 14th was a warm summer day that started out fairly ordinarily in the northeastern US. However, even before any major outages began, conditions on the electric grid, especially in northern Ohio and eastern Michigan, were slowly degrading. Temperatures weren’t unusual, but they were high, leading to increased electrical demand from air conditioning. In addition, several generators in the area weren’t available due to forced outages. Again, not unusual. The Midwest Independent System Operator (or MISO), the area’s reliability coordinator, took all this into account in its forecasts and determined that the system was in the green and could be operated safely. But three relatively innocuous events set the stage for what would follow that afternoon.
The first was a series of transmission line outages outside of MISO’s area. Reliability coordinators receive lots of real-time data about the voltages, frequencies, and phase angles at key locations on the grid. There’s a lot that raw data can tell you, but there’s also a lot it can’t. Measurements have errors and uncertainties, and they aren’t always perfectly synchronized with each other. So, grid managers often use a tool called a state estimator to process all the real-time measurements from instruments across the grid and convert them into the most likely state of the electrical network at a single point in time, with all the voltages, current flows, and phase angles at each connection point. That state estimate is then used to feed displays and make important decisions about the grid.
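Under the hood, a state estimator is essentially a weighted least-squares fit: it finds the set of bus voltages and angles that best explains a redundant, noisy set of meter readings. Here’s a toy version of that idea, a simplified linear stand-in for the real nonlinear AC problem with invented numbers, which also hints at why a missing input, like an unreported line outage, can keep the solver from finding a state that makes sense.

```python
# Toy state estimator: weighted least squares on a 3-bus "DC" model.
# Unknowns: phase angles at buses 2 and 3 (bus 1 is the reference).
# Measurements: line flows and one injection, each with a weight.
# Susceptances, readings, and weights are all invented.
import numpy as np

# Measurement model z = H x + noise, where x = [theta2, theta3].
H = np.array([
    [-10.0,   0.0],   # flow 1->2 = 10*(theta1 - theta2), theta1 = 0
    [  0.0, -20.0],   # flow 1->3 = 20*(theta1 - theta3)
    [ 15.0, -15.0],   # flow 2->3 = 15*(theta2 - theta3)
    [-15.0,  35.0],   # net injection at bus 3
])
z = np.array([1.02, 1.98, 0.03, -2.05])   # noisy meter readings (pu)
w = np.array([100.0, 100.0, 50.0, 25.0])  # weights = 1 / variance

# Weighted least squares: minimize sum of w_i * (z_i - H_i x)^2
W = np.diag(w)
x_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ z)
residuals = z - H @ x_hat

print("estimated angles:", x_hat)
print("residuals:", residuals)  # large residuals flag bad or missing data,
                                # e.g. a meter still reporting flow on a
                                # line the model doesn't know is open
```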
But, on August 14th, MISO’s state estimator was having some problems. More specifically, it couldn’t converge on a solution. The state estimator was saying, “Sorry. All the data that you’re feeding me just isn’t making sense. I can’t find a state that matches all the inputs.” And the reason it was saying this is that twice that day, a transmission line outside MISO’s area had tripped offline, and the state estimator didn’t have an automatic link to that information. Instead it had to be entered manually, and it took a bunch of phone calls and troubleshooting to realize this in both cases. So, starting around noon, MISO’s state estimator was effectively offline.
Here’s why that matters: the state estimator feeds into another tool called a Real-Time Contingency Analysis, or RTCA, that takes the estimated state and runs a variety of “what ifs.” What would happen if this generator tripped? What would happen if this transmission line went offline? What would happen if the load increased over here? Contingency analysis is critical because you have to stay ahead of the game when operating the grid. NERC guidelines require each control area to manage its network to avoid cascading outages. That means the system has to be able to withstand even the most severe single contingency, for example, the loss of a single transmission line or generating unit. Things on the grid are always changing, and you don’t always know what the most severe contingency would be. So the main way to ensure that you’re operating within the guidelines at any point in time is to run simulations of those contingencies to make sure the grid would survive. And MISO’s RTCA tool, which was usually run after every major change in grid conditions (sometimes several times per day), was offline on August 14th until around 2 minutes before the start of the cascade. That meant they couldn’t see their vulnerability to outages, and they couldn’t issue warnings to their control area operators, including FirstEnergy, the operator of a control area in northern Ohio that includes Toledo, Akron, and Cleveland.
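Conceptually, an N-1 contingency screen is a loop: take the current state, remove one element, re-solve the power flow, and check whether anything ends up loaded beyond its rating. Here’s a toy version on a made-up three-bus network, again using the simplified DC power-flow approximation rather than anything resembling MISO’s actual RTCA software.

```python
# Toy N-1 contingency screen on a 3-bus DC power-flow model.
# For each single line outage, re-solve the flows and flag any line
# loaded past its rating. All network data is invented.
import numpy as np

buses = [0, 1, 2]                              # bus 0 is the slack/reference
injections = np.array([0.0, 300.0, -300.0])    # MW in/out at buses 1 and 2
lines = [                                      # (from, to, susceptance, rating MW)
    (0, 1, 10.0, 250.0),
    (0, 2, 10.0, 250.0),
    (1, 2, 10.0, 250.0),
]

def dc_flows(active_lines):
    """Solve a DC power flow; return MW flow on each active line."""
    n = len(buses)
    B = np.zeros((n, n))
    for f, t, b, _ in active_lines:
        B[f, f] += b; B[t, t] += b
        B[f, t] -= b; B[t, f] -= b
    theta = np.zeros(n)
    theta[1:] = np.linalg.solve(B[1:, 1:], injections[1:])  # angles at non-slack buses
    return {(f, t): b * (theta[f] - theta[t]) for f, t, b, _ in active_lines}

# Base case, then each single-line contingency
for out in [None] + lines:
    active = [ln for ln in lines if ln is not out]
    flows = dc_flows(active)
    overloads = [(ln[0], ln[1], round(flows[(ln[0], ln[1])]))
                 for ln in active
                 if abs(flows[(ln[0], ln[1])]) > ln[3]]
    label = "base case" if out is None else f"loss of line {out[0]}-{out[1]}"
    print(f"{label}: overloads = {overloads}")
```

In this little example the base case looks fine, but every single-line outage overloads something else, which is exactly the kind of warning the operators in Ohio never got that afternoon.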
That afternoon, FirstEnergy was struggling to maintain adequate voltage within its area. All those air conditioners use induction motors that spin a magnetic field using coils of wire inside. Inductive loads do a funny thing to the power on the grid. Some of the electricity used to create the magnetic field isn’t actually consumed, but just stored momentarily and then returned to the grid each time the current switches direction (that’s 120 times per second in the US). This causes the current to lag behind the voltage, reducing its ability to perform work. It also reduces the efficiency of all the conductors and equipment powering the grid, because more current has to be supplied than the useful work alone would require. This concept is deep in the weeds of electrical engineering, but we normally simplify things by dividing bulk power into two parts: real power (measured in watts) and reactive power (measured in var). On hot summer days, grid operators need more reactive power to balance the increased inductive loads on the system caused by millions of air conditioners running simultaneously.
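Here’s a quick back-of-the-envelope example of the split between real and reactive power for an inductive load. The voltage, current, and power factor are just numbers I picked to illustrate the relationship, not anything from the report.

```python
# Back-of-the-envelope look at real vs. reactive power for an
# inductive load. Numbers are illustrative only.
import math

V = 240.0   # RMS volts at the outlet
I = 15.0    # RMS amps drawn by an air-conditioner compressor
pf = 0.85   # power factor: cosine of the angle the current lags the voltage

S = V * I                        # apparent power, volt-amperes
P = S * pf                       # real power (watts) doing useful work
Q = S * math.sin(math.acos(pf))  # reactive power (var) sloshing back and forth

print(f"apparent {S:.0f} VA, real {P:.0f} W, reactive {Q:.0f} var")

# The same useful power at unity power factor would need less current:
I_unity = P / V
print(f"current needed at pf = 1.0: {I_unity:.1f} A vs {I:.1f} A actual")
# The extra current heats conductors and ties up capacity, which is
# why operators need a local source of reactive power on hot days.
```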
Real power can travel long distances on transmission lines, but it’s not economical to import reactive power from far away because transmission lines have their own inductance that consumes the reactive power as it travels along them. With only a few running generators within the Cleveland area, FirstEnergy was importing a lot of real power from other areas to the south, but voltages were still getting low on their part of the grid because there wasn’t enough reactive power to go around. Capacitor banks are often used to help bring current and voltage back into sync, providing reactive power. However, at least four of FirstEnergy’s capacitor banks were out of service on the 14th. Another option is to over-excite the generators at nearby power plants so that they create more reactive power, and that’s just what FirstEnergy did.
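For a sense of scale, here’s a rough sketch of what a shunt capacitor bank contributes, with made-up ratings. One detail worth noticing: a capacitor’s reactive output falls with the square of the voltage, so this kind of support gets weaker exactly when voltages start to sag and you need it most.

```python
# Sketch of how a shunt capacitor bank supplies reactive power locally.
# A bank's var output scales with the square of the bus voltage.
# Ratings and voltages below are invented for illustration.
import math

f = 60.0          # system frequency, Hz
V_nom = 138e3     # nominal bus voltage, volts (138 kV system)
Q_rated = 50e6    # bank rated at 50 Mvar at nominal voltage

# Equivalent capacitance implied by the rating: Q = V^2 * (2*pi*f*C)
C = Q_rated / (2 * math.pi * f * V_nom**2)
print(f"equivalent capacitance: {C*1e6:.1f} microfarads")

# Output if the bus voltage sags 5% below nominal
V_low = 0.95 * V_nom
Q_low = V_low**2 * 2 * math.pi * f * C
print(f"output at 95% voltage: {Q_low/1e6:.1f} Mvar (vs 50 Mvar rated)")
```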
At the Eastlake coal-fired plant on Lake Erie, operators pushed the number 5 unit to its limit, trying to get as much reactive power as they could. Unfortunately, they pushed it a little too hard. At around 1:30 in the afternoon, its internal protection circuit tripped and the unit was kicked offline – the second key event preceding the blackout. Without this critical generator, the Cleveland area would have to import even more power from the rest of the grid, putting strain on transmission lines and giving operators less flexibility to keep voltage within reasonable levels.
Finally, at around 2:15, FirstEnergy’s control room started experiencing a series of computer failures. The first thing to go was the alarm system designed to notify operators when equipment had problems. This probably doesn’t need to be said, but alarms are important in grid operations. People in the control room don’t just sit and watch the voltage and current levels as they move up and down over the course of a day. Their entire workflow is based on alarms that show up as on-screen or printed notifications so they can respond. All the data was coming in, but the system designed to get an operator’s attention was stuck in an infinite loop. The FirstEnergy operators were essentially driving down a long country highway with their fuel gauge stuck on “full,” not realizing they were nearly out of gas. With MISO’s state estimator out of service, Eastlake 5 offline, and FirstEnergy’s control room computers failing, the grid in northern Ohio was operating right at the edge of the reliability standards, leaving it vulnerable to further contingencies. And the afternoon was just getting started.
Transmission lines heat up as they carry more current due to resistive losses, and that is exacerbated on still, hot days when there’s no wind to cool them off. As they heat up, they expand in length and sag lower to the ground between each tower. At around 3:00, as the temperatures rose and the power demands of Cleveland did too, the Harding-Chamberlin transmission line (a key asset for importing power to the area) sagged into a tree limb, creating a short-circuit. The relays monitoring current on the line recognized the fault immediately and tripped it offline. Operators in the FirstEnergy control room had no idea it happened. They started getting phone calls from customers and power plants saying voltages were low, but they discounted the information because it couldn’t be corroborated on their end. By this time their IT staff knew about the computer issues, but they hadn’t communicated them to the operators, who had no clue their alarm system was down.
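The geometry here is unforgiving: a tiny amount of thermal elongation in the conductor turns into a lot of extra sag. Here’s a rough illustration using the standard parabolic sag approximation, with a span length, starting sag, temperature rise, and expansion coefficient that are typical-ish values I chose for the example (and ignoring conductor tension and elastic effects entirely).

```python
# Rough sketch of why a hot conductor sags so much more: a small
# thermal elongation produces a large change in sag.
# Parabolic approximation: conductor length L ~ S + 8*D^2/(3*S),
# so sag D = sqrt(3*S*(L - S)/8), where S is the span between towers.
# Numbers and the expansion coefficient are typical, not from the report.
import math

span = 300.0    # meters between towers
sag_cool = 8.0  # meters of sag at a mild conductor temperature
alpha = 2.0e-5  # thermal expansion per deg C, roughly right for aluminum

# Conductor length at the cool temperature
length_cool = span + (8 * sag_cool**2) / (3 * span)

# Resistive heating plus a hot, windless afternoon raises the
# conductor temperature by, say, 40 deg C
length_hot = length_cool * (1 + alpha * 40)
sag_hot = math.sqrt(3 * span * (length_hot - span) / 8)

print(f"conductor grows {100*(length_hot/length_cool - 1):.3f}% in length")
print(f"sag increases from {sag_cool:.1f} m to {sag_hot:.1f} m")
```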
With the loss of Harding-Chamberlin, the remaining transmission lines into the Cleveland area took up the slack. The current on one line, Hanna-Juniper, jumped from around 70% to 88% of its rated capacity, and it was heating up. About half an hour after the first fault, the Hanna-Juniper line sagged into a tree, short-circuited, and tripped offline as well. The FirstEnergy IT staff were troubleshooting the computer issues but still hadn’t notified the control room operators. And MISO, the reliability coordinator, still hampered by its state estimator problems, was slow to recognize that these outages had happened, let alone what they meant.
FirstEnergy operators were now getting phone call after phone call, asking about the situation while being figuratively in the dark. Call transcripts from that day tell a scary story.
“[The meter on the main transformer] is bouncing around pretty good. I’ve got it relay tripped up here…so I know something ain’t right,” said one operator at a nearby nuclear power plant.
A little later he called back: “I’m still getting a lot of voltage spikes and swings on the generator… I don’t know how much longer we’re going to survive.”
A minute later, he called again: “It’s not looking good… We ain’t going to be here much longer and you’re going to have a bigger problem.”
An operator in the FirstEnergy control room replied: “Nothing seems to be updating on the computers. I think we’ve got something seriously sick.”
With two key transmission lines out of service, a major portion of the electricity powering the Cleveland area had to find a new path into the city. Some of it was pushed onto the less efficient 138 kV system, but much of it was being carried by the Star-South Canton line which was now carrying more than its rated capacity. At 3:40, a short ten minutes after losing Hanna-Juniper, the Star-South Canton line tripped offline when it too sagged into a tree and short-circuited. It was actually the third time that day the line had tripped, but it was equipped with circuit breakers called reclosers that would energize the line automatically if the fault had cleared. But, the third time was the charm, and Star-South Canton tripped and locked out. Of course, FirstEnergy didn’t know about the first two trips because they didn’t see an alarm, and they didn’t know about this one either. They had started sending crews out to substations to get boots on the ground and try to get a handle on the situation, but at that point, it was too late.
With Star-South Canton offline, flows in the lower capacity 138 kV lines into Cleveland increased significantly. It didn’t take long before they too started tripping offline one after another. Over the next half hour, sixteen 138 kV transmission lines faulted, all from sagging low enough to contact something below the line. At this point, voltages had dropped low enough that some of the load in northern Ohio had been disconnected, but not all of it. The last remaining 345 kV line into Cleveland from the south came from the Sammis Power Plant. The sudden changes in current flow through the system now had this line operating at 120% of its rated capacity. Seeing such an abnormal and sudden rise in current, the relays on the Star-Sammis line assumed that a fault had occurred and tripped the last remaining major link to the Cleveland area offline at 4:05 PM, only an hour after the first incident. After that, the rest of the system unraveled.
With no remaining connections to the Cleveland area from the south, bulk power coursing through the grid tried to find a new path into this urban center.
First, overloads progressed northward into Michigan, tripping lines and further separating areas of the grid. Then the area was cut off to the east. With no way to reach Cleveland, Toledo, or Detroit from the south, west, or north, a massive power surge flowed east into Pennsylvania, New York, and then Ontario in a counter-clockwise path around Lake Erie, creating a major reversal of power flow in the grid. All along the way, relays meant to protect equipment from damage saw these unusual changes in power flow as faults and tripped transmission lines and generators offline.
Relays are sophisticated instruments that monitor the grid for faults and trigger circuit breakers when one is detected. Most relaying systems are built with levels of redundancy so that lines will still be isolated during a fault, even if one or more relays malfunction. One type of redundancy is remote backup, where separate relays have overlapping zones of protection. If a fault in a relay’s closest zone of protection (called Zone 1) isn’t cleared, the next relay back will see the fault in its Zone 2 and activate its breakers after a short delay. Many relays also have a Zone 3 that reaches even farther along the line.
When you have a limited set of information, it can be pretty hard to know whether a piece of equipment is experiencing a fault and should be disconnected from the grid to avoid further damage, or is just experiencing an unusual set of circumstances that protection engineers may not have anticipated. That’s especially true when the fault is far away from where you’re taking measurements. The vast majority of lines that went offline in the cascade were tripped by Zone 3 relays. That means the Zone 1 and Zone 2 relays, for the most part, saw the changes in current and voltage on the lines and didn’t trip, because nothing fell outside of what was considered normal. However, the Zone 3 relays, being less able to discriminate between faults and unusual but non-damaging conditions, shut the lines down. Once the dominoes started falling in the Ohio area, it took only about 3 minutes for a massive swath of transmission lines, generators, and transformers to trip offline. Everything happened so fast that operators had no opportunity to implement interventions that could have mitigated the cascade.
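To see how a heavily loaded, low-voltage line can look like a distant fault, here’s a toy distance-relay calculation. The line impedance, zone reach settings, and operating points are all invented for illustration, and a real relay also checks the angle of the impedance it sees (its mho or quadrilateral characteristic), not just the magnitude, so treat this as a cartoon of the “load encroachment” problem rather than actual relay logic.

```python
# Toy distance-relay logic showing load encroachment: under heavy load
# and depressed voltage, the apparent impedance V^2/S shrinks until it
# falls inside a far-reaching Zone 3 setting, even with no fault.
# Settings and operating points are invented for illustration.

Z_line = complex(8.0, 80.0)   # protected line impedance, ohms (mostly reactive)

# Typical reach philosophy: Zone 1 under-reaches the line, Zone 2
# covers it with margin, Zone 3 reaches well beyond as remote backup.
zones = {
    "Zone 1 (instantaneous)": 0.8 * abs(Z_line),
    "Zone 2 (short delay)":   1.2 * abs(Z_line),
    "Zone 3 (longer delay)":  2.5 * abs(Z_line),
}

def apparent_impedance(v_kv_ll, p_mw, q_mvar):
    """Impedance the relay 'sees' from bus voltage and line flow: Z = V^2 / S*."""
    s = complex(p_mw, q_mvar) * 1e6
    v_ll = v_kv_ll * 1e3
    return v_ll**2 / s.conjugate()

cases = {
    "normal load":             (345.0, 400.0, 100.0),
    "overloaded, sagging volts": (310.0, 800.0, 350.0),
}
for label, (v, p, q) in cases.items():
    z = apparent_impedance(v, p, q)
    picked_up = [name for name, reach in zones.items() if abs(z) < reach]
    print(f"{label}: |Z| = {abs(z):.0f} ohms, zones picked up: {picked_up}")
```

In the second case, the magnitude check lands inside the Zone 3 reach but not Zones 1 or 2, which mirrors the pattern the investigators found: the far-reaching backup zones did most of the tripping.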
Eventually enough lines tripped that the outage area became an electrical island separated from the rest of the Eastern Interconnection. But since generation wasn’t balanced with demand within the island, the frequency was completely unstable, and the whole area quickly collapsed. In addition to all of the transmission lines, at least 265 power plants with more than 508 generating units shut down. When it was all over, much of the northeastern United States and the Canadian province of Ontario were completely in the dark. Since there were very few actual faults during the cascade, re-energizing happened relatively quickly in most places. Large portions of the affected area had power back on before the end of the day. Only a few places in New York and Toronto took more than a day to have power restored, but the impacts were still tremendous. More than 50 million people were affected. Water systems lost pressure, forcing boil-water notices. Cell service was interrupted. Traffic lights were down. It’s estimated that the blackout contributed to nearly 100 deaths.
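To put a rough number on how quickly an unbalanced island falls apart, here’s a back-of-the-envelope estimate using the aggregate swing equation. The inertia constant, system size, and generation deficit are assumptions I picked for illustration; the islands that actually formed that afternoon had their own, messier numbers.

```python
# Rough estimate of how fast frequency falls in an islanded region that
# suddenly has less generation than load, using the aggregate swing
# equation: df/dt = f0 * (deficit / S_system) / (2 * H).
# All numbers are illustrative assumptions, not from the report.

f0 = 60.0        # nominal frequency, Hz
H = 4.0          # assumed aggregate inertia constant of remaining units, seconds
S_system = 60e9  # assumed online generating capacity in the island, VA (60 GW)
deficit = 6e9    # assumed generation shortfall after separation, W (6 GW)

rocof = f0 * (deficit / S_system) / (2 * H)  # rate of change of frequency, Hz/s
print(f"frequency falls at about {rocof:.2f} Hz per second")

# Underfrequency load-shedding relays typically begin acting around
# 59.3 Hz; at this rate that threshold arrives in about a second, and
# if shedding can't catch up, generator protection trips units and the
# island collapses.
print(f"time to reach 59.3 Hz: {(60.0 - 59.3) / rocof:.1f} s")
```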
Three trees and a computer bug brought a major part of North America to a complete halt. If that’s not a good example of the complexity of the power grid, I don’t know what is. If you had asked anyone working in the power industry on August 13 whether the entire northeast US and Canada would suffer a catastrophic loss of service the next day, they would have said no way. People understood the fragility of the grid, and there were even experts sounding alarms about the impacts of deregulation and the vulnerability of transmission networks, but this was not some big storm. It wasn’t even a peak summer day. It was just a series of minor contingencies that all lined up just right to create a catastrophe.
Today’s power grid is quite different from the grid of 2003. The bilateral report made 46 recommendations about how to improve operations and infrastructure to prevent a similar tragedy in the future, many of which have been implemented over the nearly 20 years since. But that doesn’t mean there aren’t challenges and fragilities in our power infrastructure today. Current trends include more extreme weather, changes in the energy portfolio as we move toward more variable sources of generation like wind and solar, growing electrical demand, and increasing communication between loads, generators, and grid controllers. Just a year ago, Texas saw a major outage related to extreme weather and the strong nexus between natural gas and electricity. I have a post on that event if you want to take a look after this. I think the 2003 blackout highlights the intricacy and interconnectedness of this critical resource we depend on, and I hope it helps you appreciate the engineering behind it. Thank you for reading, and let me know what you think.
Thank you, Madge, for an expertly written piece on national grids and their interconnectivity, and therefore their fragility, which goes beyond even the comprehension of qualified and highly experienced electrical engineers.
I have tried to ‘understand what electricity is in reality’ over my lifetime of 77 years, but I must admit it is still a mystery, and of course it is, at a minimum, a phenomenon of what we think we know of physics.
Although I have had to present the complexities of IT application systems and networking to major technical and commercial audiences since the 1980s, I found electricity anything but simple.
What you have achieved here is a lesson to us all that complex things can be explained to the layman, and should be more often, so that people can understand and appreciate what we have put together and call ‘modern society’, as it beats sleeping on the steps of universities, town halls, and pavements outside tall apartment blocks.
I have one question: what on ‘earth’ do power stations, especially nuclear power stations, do with all the electricity they are producing when they have nowhere to send it, especially as they must need to over-produce to meet presumed higher demand at all times…?
Raccoon Mountain in Tennessee houses a TVA pumped-storage reservoir. When system demand is low, motor/generator/turbines pump water up the mountain into the lake. When demand is high, water is allowed to flow out of the lake through the motor/generator/turbines, where it generates electricity to feed back into the grid. I think there are other pumped-storage arrangements, but I don’t know for sure.
There are a number of pumped-storage facilities across the country. The ones I am familiar with were built in conjunction with nuclear power plants to utilize the approximately one-penny-per-kilowatt-hour nuclear energy in off-peak times.
I did not intend to make an anonymous comment to Madge. My name is Andrew Charnley, a Brit retired out to Trinidad and Tobago since 2006, and I am extremely grateful for the air conditioning I use 24/7 in the tropics and for T-Tec (who supply our electricity from 100% LNG). We get our electricity at £0.43 pence per kWh compared to the UK’s £0.24, for now...
Good article simplifying a very complicated subject.
The public has no idea how complex the electrical infrastructure is.
The main takeaway from this article should be to have a plan for when the electrical grid goes down.
At a minimum, that means enough generating capacity to run all of your refrigerators and freezers at the same time, plus Number 12 gauge drop cords (Number 10 if you’re going to run several appliances on one).
No power means no cell phone or landline, and this is where I suggest you find smart articles on ham radio.
Stay away from radio systems that require cell towers; they won’t be functioning, and neither will those radios beyond a mile or two.
Ahh, this brings back memories. Thanks for the article. Understand, folks, that the ‘grid’ is always undergoing changes and upgrades and having entirely new transmission lines and substations installed. All of this is done under FERC (Federal Energy Regulatory Commission) rules, but operators have to interface with other grid operators, the NRC for nuke plants, and a host of other entities. Then you have some local municipal plants and operators that generally follow FERC rules but can have local standards since they don’t transmit across state and regional boundaries. Whew! I’m surprised we can turn a light bulb on some days.
I was working in the industry consulting for nuclear power plants after being licensed by the NRC as a control room operator for years. Rotating shift work, a happy home life, and aging don’t mix, so I contracted out.
Part of the issue with relays here is that there are old-time relays that are essentially a coil operating contact ‘fingers’ to execute certain functions like over/under voltage, and the same for current, frequency, and a host of other measured variables used to control the grid. Newer relay schemes utilize predictive functions to protect the grid from failure, Zone 3s and such, so they reach far and wide across the grid to determine stability. When they encounter an older, sometimes slower relay, they essentially move past it, as it may not have functioned yet. So the newer schemes are great at looking into future grid conditions but sometimes choke on real-time conditions. Think of how your ABS brakes sometimes pulse on ice when you really want to stop, or how your new six-speed electronic transmission hesitates because it’s trying to determine whether you’re on or off the gas to shift accordingly. The grid engineers are constantly running checks on their systems for those potential flaws.
Now add in nuke plants, which hesitate to use a lot of predictive relaying since it could affect plant safety. A lot of in-plant relay schemes are 1940s-60s vintage, depending on the part of the electrical system they’re in, whereas most of the substations the plants power are predictive. That old crap is pretty reliable, and you’d have to run multiple analyses to prove a change won’t affect a nuke plant’s access to a UHS (ultimate heat sink).
So the cascading failure ran from the upper Midwest across New York, hit some older relays in western Mass and Vermont, and... stopped. With grid reliability projects and better-tested schemes, I don’t think it should happen again, but who knows. The right combination comes along, all the holes in the cheese line up, lightning strikes, and out go the lights.
One other vulnerability is the dreaded EMP or large solar flare, which could absolutely wreak havoc. Not all of these predictive relay systems are ‘hardened’ against them yet, but that work is underway.
Last item, for the first poster’s question: nuke plants are ‘base-loaded’ plants running at 100% most of the time. Other plants are load-following and can even be operated remotely by some grid operators. Peaking units (typically gas units) are used to pick up short-term drops in voltage. All of these work together through a schedule of who intends to be available or who has to come down for an outage. Grid operators try to coordinate it all through some very lawyerly language in power station procedures, because everyone wants maximum profit for their investment. Who knew?
I hope this helps.
There is zero justification for an outage to extend beyond the substation that experienced it first.
If a circuit breaker tripped in the affected branch, that branch would be isolated and would report the trip to network control for investigation. There is no reason for a tripped circuit breaker in a home or apartment to affect anything further. Likewise for every part of an electrical grid.