Bugtraq mailing list archives

RE: DOE Releases Interim Report on Blackouts/Power Outages, Focus on Cyber Security


From: "Russ" <Russ.Cooper () rc on ca>
Date: Fri, 21 Nov 2003 17:54:41 -0500

Well, they did specifically discount both current (at the time) Internet worms/activity, and terrorist activity, as 
having any part in the blackout. As for the RTU failures, FE told investigators they believed the cause was that the 
RTUs "started queuing and overloading the terminals buffers". Given that the EMS Alarm program had already crashed at 
this stage, it's feasible to see a real-time reporting terminal not knowing what to do (other than to page an FE IT 
person) when its host can't accept its input. Since they refer to the RTUs' connectivity both as "dial-ups" and as 
"data links", it's hard to say what they were. Nothing else in the EMS system failed until 14:54, when both primary and 
backup EMS servers were down, so it's unlikely that any "network" connectivity between RTUs and EMS was interrupted due 
to the problems with the EMS systems until that time. Ergo we're left with "comms" problems between RTU and EMS that 
led some FE personnel to describe them as "network" problems. It may all have simply been the fact that the RTUs had 
stalled, sending no "comms".

Interesting that a page went to FE IT folks when the RTUs stopped, but nothing went to them when the EMS Alarm program 
"stalled".

I think the refresh rate of the EMS consoles isn't actually a factor. The alarm function "stalled", or "froze", and did 
not produce any alarms. That EMS consoles were being refreshed after 59 seconds didn't alter the fact that operators 
weren't seeing new alarms. The lack of alarms, coupled with the arrogance of the staff who insisted reports by others 
were mistaken, led to critical failures in line load which ultimately left them unable to recover.
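
To make the gap concrete (a stalled RTU paged IT, a stalled alarm process paged nobody), here is a minimal sketch of a 
heartbeat check that treats every component the same way. This is purely illustrative and says nothing about how FE's 
EMS was actually built: the component names, the `Heartbeat` type, the `stale_components` function and the 60-second 
threshold are all assumptions of mine.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Heartbeat:
    """Hypothetical record of when a component last reported in."""
    last_seen: float    # timestamp of the last heartbeat, in seconds
    max_silence: float  # assumed longest acceptable gap, in seconds


def stale_components(beats: Dict[str, Heartbeat], now: float) -> List[str]:
    """Return, in sorted order, the components that have been silent too long."""
    return sorted(
        name
        for name, hb in beats.items()
        if now - hb.last_seen > hb.max_silence
    )


if __name__ == "__main__":
    # Hypothetical state: the RTU link reported 20s ago, the alarm process 120s ago.
    beats = {
        "rtu_link": Heartbeat(last_seen=100.0, max_silence=60.0),
        "ems_alarm": Heartbeat(last_seen=0.0, max_silence=60.0),
    }
    for name in stale_components(beats, now=120.0):
        print(f"page IT: no heartbeat from {name}")
```

The point of the sketch is only that a console which keeps refreshing proves nothing about the process feeding it; the 
check has to be on the alarm process itself.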

During the same period of time MISO's State Estimating system, which was receiving telemetry from much of FE's network, 
experienced a normal mis-match in load calculations. A manual process is used to correct this, and was done within ~30 
minutes of its first occurrence near the FE problem time-frame. An operator at MISO, however, left the estimating 
system in manual mode and went to lunch. It was put back into automatic mode 93 minutes later, at which time it again 
had a mis-match solution...so it had to be manually corrected again. It wasn't back in automatic mode until 16:04. 
Hard to say whether it would have made a big difference if it had been running in automatic mode during this whole 
time. Probably yes, but given FE's adamancy that they had good data, they may have spent an equal amount of time 
arguing over who knew what.

If either of these events had occurred independently, it's likely the blackout could have been avoided.

If FE's operators had not been so sure of themselves, it's likely the blackout could have been avoided.

Finally, FE's IT staff took 54 minutes to complete their first attempt at recovering the alarm process, this after both 
primary and backup servers had failed (14 minutes after both had failed). They were obviously relying on the failure 
not transferring from the primary to the backup. 34 minutes after the first warm reboot, and 4 minutes before the EMS 
crashed again, they discussed with FE operators the possibility of doing a complete cold boot, because only then were 
they informed that the alarm function still wasn't running. FE operators dissuaded the IT staff from doing so, 
fearing they'd have less data than they already had (arrogance again; they had already demonstrated their inability to 
perform adequately with the "less" data).

Unfortunately, nobody tells us how long it would have actually taken to do a cold boot, and FE's IT staff say they 
didn't find out that was the only way to recover the alarm system until after the blackout (meaning the warm boot was a 
useless effort in the first place).

And during all this time there were those damn trees!!!

MISO failed to adequately warn, and FE failed to adequately control its security space (physically and electronically). 
And it all happened on a hot August afternoon.

Cheers,
Russ - NTBugtraq Editor

