FIN #: I0616-1 SYNOPSIS: Ecache Memory Parity Error DATE: Sep/11/00 KEYWORDS: Ecache Memory Parity Error --------------------------------------------------------------------- - Sun Proprietary/Confidential: Internal Use Only - --------------------------------------------------------------------- FIELD INFORMATION NOTICE (For Authorized Distribution by SunService) SYNOPSIS: Solaris kernel patches provide improved handling and reduction of CPU, Ecache, and main memory errors in UltraSPARC systems. TOP FIN/FCO REPORT: Yes PRODUCT_REFERENCE: Solaris 2.5.1, 2.6, 7, and 8 PRODUCT CATEGORY: Software / Solaris PRODUCTS AFFECTED: Mkt_ID Platform Model Description Serial Number ------ -------- ----- ----------- ------------- Systems Affected ---------------- - E10000-HPC ALL Ultra Enterprise 10000 HPC - - E10000 ALL Ultra Enterprise 10000 - - E6500-HPC ALL Ultra Enterprise 6500 HPC - - E6500 ALL Ultra Enterprise 6500 - - E5500-HPC ALL Ultra Enterprise 5500 HPC - - E5500 ALL Ultra Enterprise 5500 - - E4500-HPC ALL Ultra Enterprise 4500 HPC - - E4500 ALL Ultra Enterprise 4500 - - E3500-HPC ALL Ultra Enterprise 3500 HPC - - E3500 ALL Ultra Enterprise 3500 - - E450-HPC ALL Ultra Enterprise 450 HPC - - A25 ALL Enterprise 450 - - A33 ALL Enterprise 420R - - A26 ALL Enterprise 250 - - A34 ALL Enterprise 220R - - N14 ALL Netra T-1405 - - N15 ALL Netra T-1400 - - N06 ALL Netra T1 AC - - N04 ALL Netra T-1125 - - N03 ALL Netra T-1120 - - A27 ALL Ultra 80 - - A23 ALL Ultra 60 - - A20 ALL Ultra 450 - - A16 ALL Ultra 30 - - A14 ALL Ultra 2 - - E6000 ALL Ultra Enterprise 6000 - - E5000 ALL Ultra Enterprise 5000 - - E4000 ALL Ultra Enterprise 4000 - - E3000 ALL Ultra Enterprise 3000 - - A12 ALL Ultra 1E - - A11 ALL Ultra 1 - - A22 ALL Ultra 10 - - A21 ALL Ultra 5 - X-Options Affected ------------------ X2248A - - 480Mhz UltraSPARC II Module 8MB Cache - X2244A - - 400Mhz UltraSPARC II Module 4MB Cache - X1994A - - 400Mhz UltraSPARC II Module 2MB Cache - X2240A - - 300MHz UltraSPARC II Module 
2MB Cache - X2230A - - 250MHz UltraSPARC II Module 1MB Cache - X1995A - - 450Mhz UltraSPARC II Module 4MB Cache - X1997A - - 440Mhz UltraSPARC II Module 4MB Cache - X2580A - - 400MHz UltraSPARC II Module 8MB cache - X2570A - - 400MHz UltraSPARC II Module 4MB cache - X1993A - - 400Mhz UltraSPARC II Module 2MB Cache - X1992A - - 360Mhz UltraSPARC II Module 4MB Cache - X2560A - - 336MHz UltraSPARC II Module 4MB Cache - X1991A - - 300Mhz UltraSPARC II Module 1MB Cache - X2550A - - 250MHz UltraSPARC II Module 4MB Cache - X1990A - - 250Mhz UltraSPARC II Module 1MB Cache - X2530A - - 250MHz UltraSPARC II Module 1MB Cache - X1188A - - 200MHz UltraSPARC I Module 1MB Cache - X2510A - - 167MHz UltraSPARC I Module 1MB Cache - X1187A - - 167MHz UltraSPARC I Module .5MB Cache - X2500A - - 167MHz UltraSPARC I Module .5MB Cache - PART NUMBERS AFFECTED: Part Number Description Model ----------- ----------- ----- 501-5729-0X 480 MHz UltraSPARC II Module 8MB Cache - 501-5344-0X 450 MHz UltraSPARC II Module 4MB Cache - 501-5539-0X 450 MHz UltraSPARC II Module 4MB Cache - 501-5682-0X 440 MHz UltraSPARC II Module 4MB Cache - 501-5235-0X 400 MHz UltraSPARC II Module 8MB Cache - 501-5661-0X 400 MHz UltraSPARC II Module 8MB Cache - 501-5762-0X 400 MHz UltraSPARC II Module 8MB Cache - 501-4995-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5239-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5420-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5425-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5446-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5500-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5585-0X 400 MHz UltraSPARC II Module 4MB Cache - 501-5237-0X 400 MHz UltraSPARC II Module 2MB Cache - 501-5445-0X 400 MHz UltraSPARC II Module 2MB Cache - 501-5541-0X 400 MHz UltraSPARC II Module 2MB Cache - 501-5545-0X 400 MHz UltraSPARC II Module 2MB Cache - 501-4781-0X 360 MHz UltraSPARC II Module 4MB Cache - 501-5129-0X 360 MHz UltraSPARC II Module 4MB Cache - 501-5552-0X 360 MHz UltraSPARC 
II Module 4MB Cache - 501-4363-0X 336 MHz UltraSPARC II Module 4MB Cache - 501-4196-0X 300 MHz UltraSPARC II Module 2MB Cache - 501-4849-0X 300 MHz UltraSPARC II Module 2MB Cache - 501-4249-0X 250 MHz UltraSPARC II Module 4MB Cache - 501-4836-0X 250 MHz UltraSPARC II Module 4MB Cache - 501-4178-0X 250 MHz UltraSPARC II Module 1MB Cache - 501-4278-0X 250 MHz UltraSPARC II Module 1MB Cache - 501-4857-0X 250 MHz UltraSPARC II Module 1MB Cache - 501-3041-0X 200 MHz UltraSPARC I Module 1MB Cache - 501-4791-0X 200 MHz UltraSPARC I Module 1MB Cache - 501-2959-0X 167 MHz UltraSPARC I Module 1MB Cache - 501-2702-03 167 MHz UltraSPARC I Module .5MB Cache - 501-2941-0X 167 MHz UltraSPARC I Module .5MB Cache - 501-2942-0X 167 MHz UltraSPARC I Module .5MB Cache - 501-5149-0X 440 MHz UltraSPARC IIi Module 2MB Cache - 501-5740-0X 400 MHz UltraSPARC IIi Module 2MB Cache - 501-5741-0X 400 MHz UltraSPARC IIi Module 2MB Cache - 501-5148-0X 360 MHz UltraSPARC IIi Module 256KB Cache - 501-5222-0X 360 MHz UltraSPARC IIi Module 2MB Cache - 501-5090-0X 333 MHz UltraSPARC IIi Module 2MB Cache - 501-5568-0X 333 MHz UltraSPARC IIi Module 2MB Cache - 501-4379-0X 300 MHz UltraSPARC IIi Module 512KB Cache - 501-5040-0X 300 MHz UltraSPARC IIi Module 512KB Cache - 501-4477-0X 270 MHz UltraSPARC IIi Module 256KB Cache - 501-5039-0X 270 MHz UltraSPARC IIi Module 256KB Cache - (SCSI Devices) Type Vendor Model Serial Number(Min) Serial Number(Max) Firmware ---- ------ ------- ------------------ ------------------ -------- N/A REFERENCES: FIN: I0570-3 FIN: I0593-1 Sun Alert: SA 24669 - Possible WAIT_MBOX_DONE System Panics With Recent Kernel Update Patches DOC: 806-5118-13 Best Practices Guide Addressing: E-cache Parity Errors PatchId: 103640-34 Kernel Patch (Solaris 2.5.1) PatchId: 105181-23 Kernel Patch (Solaris 2.6) PatchId: 106541-13 Kernel Patch (Solaris 7) PatchId: 108528-04 Kernel Patch (Solaris 8) PatchId: 110151-01 SunMC 2.1 FCS Patch (Solaris 2.6) PatchId: 110152-01 SunMC 2.1 L10N Patch 
(Solaris 2.6)
PatchId: 110094-01 SunMC 2.1.1 FCS Patch (Solaris 2.6)
PatchId: 103346-26 Exx00 flashprom update
URL: http://bestpractices.central/
URL: http://cte-www.uk/cgi-bin/afsr/afsr.pl
URL: http://cte-www.eng/cgi-bin/afsr/afsr.pl

PROBLEM DESCRIPTION:

Solaris Kernel patches are available (see "Features Table" below for
availability details) that provide improved handling and reduction of
CPU, Ecache, and main memory errors in systems using UltraSPARC-I, -II,
-IIi, and -IIe processors. All customers on Solaris 2.5.1, 2.6, 7 and 8
are encouraged to consider upgrading to these kernel patches as they
become available.

Table Of Contents
*****************

Kernel Patch Features Overview
    Cache Scrubber
    Improved Error Handling
    Improved Error Messages
    Performance Considerations
Kernel Patch Features Details
    Features Table
    Details on the Cache Scrubber
    Errors and Events
    Details on Improved Error Handling
    Details on Improved Error Messages
        Messages that identify the type and source of an error
        Messages that supply a cache line or memory dump
        Messages from the kernel error recovery code
        Messages that indicate the disposition of an error
    Error Messages Examples
        EDP Event - Ecache Data Parity Event
        WP Event - Writeback Data Parity Error
        CP Event - Copyout Data Parity Error
        UE Event - Uncorrectable Memory Error
        BERR Event - Bus Error
        CE Event - Correctable Memory Error
Starfire Specific
    Arbstop
    Recordstop
    DTag Considerations

Kernel Patch Features Overview
******************************

With the patches listed below, one or more of the following features
become available in the Solaris operating system (see "Features Table"
below to determine the features delivered with each patch):

1. Cache Scrubber
   ==============

To reduce the likelihood of Ecache Data, Writeback and CopyOut Parity
errors, a "Cache Scrubber" has been implemented in the Solaris Kernel
that periodically flushes modified cache lines out to main memory and
invalidates cache lines that have not been modified.
By reducing the likelihood that an otherwise nonfatal error in the
Ecache will result in a system failure, this procedure improves the
system's reliability.

2. Improved Error Handling
   =======================

Each error reported by the CPU is now evaluated to determine whether it
is fatal to the operating system, only fatal to a user process, or of no
immediate consequence. Fatal errors in the kernel result in a system
panic, as they did before. Fatal errors within user space will now cause
the machine to reboot instead of panic, allowing file systems to be
fully synched and also preventing the creation of unnecessary kernel
core files. Events that do not affect the integrity of either the kernel
or user processes are logged, but otherwise ignored.

Because UltraSPARC-IIi and UltraSPARC-IIe use simplified error reporting
logic (as compared to UltraSPARC-II), the error handling behavior for
UltraSPARC-IIi and UltraSPARC-IIe based systems has not been changed.
Those systems will still panic on most CPU, Ecache, or uncorrectable
memory errors.

3. Improved Error Messages
   =======================

The CPU, Ecache, and memory error messages have been improved to be more
accurate and complete. Text descriptions have been rewritten to
emphasize the important parameters associated with each event. Also, the
logic for reporting hardware errors has changed to ensure that error
events are reported accurately, completely, and in the order they
occurred. These new error messages will make it easier to determine the
CPU that has encountered an error.

There are related patches to SunMC so that it will recognize the
improved error messages; without them, the management console will
under-report the occurrence of corrected main memory errors. See
"Corrective Action" item 3, below, for a list of the related patches.

Performance Considerations
==========================

The above changes can slightly degrade system performance.
The primary cause of this is the Improved Error Handling, which required
inserting membars in the kernel to properly isolate user-encountered
errors from kernel-encountered ones. (A membar is an UltraSPARC
instruction that stalls the CPU pipeline until all outstanding memory
operations have completed, and any errors that may result from them have
been reported. Any errors reported after the execution of a membar
completes can only result from instructions that follow the membar in
the instruction stream.) In addition, the Cache Scrubber consumes 0.4%
of CPU cycles in scanning the Ecache.

Measurements using industry standard benchmarks have shown a decrease in
TPC-C performance of about 2% and in one kenbus configuration a decrease
in performance of about 5%. Performance degradation of most of the other
benchmarks in the performance suite was indistinguishable from
measurement noise. We do not expect most customers to notice significant
performance degradation.

Kernel Patch Features Details
*****************************

Features Table
==============

The following list gives details about the features delivered with each
of the patches:

Solaris 2.5.1 with patch 103640-34 will introduce:
  - Cache Scrubber

Solaris 2.6 with patch 105181-23 will introduce:
  - Cache Scrubber
  - Improved Error Messages
  - Improved Error Handling [1]

Solaris 7 with patch 106541-13 (est. Nov/10/2000) will introduce:
  - Cache Scrubber
  - Improved Error Messages
  - Improved Error Handling [1]

Solaris 8 with patch 108528-04 (est. Oct/27/2000) will introduce:
  - Cache Scrubber               Only for UltraSPARC-I, -II, -IIi
  - Improved Error Messages      Only for UltraSPARC-I, -II, -IIi, -IIe
  - Improved Error Handling [1]  Only for UltraSPARC-I, -II

Solaris 8 Update 3 (est.
Dec/2000) will introduce:
  - Cache Scrubber               Only for UltraSPARC-I, -II, -IIi, -IIe
  - Improved Error Messages      Only for UltraSPARC-I, -II, -IIi, -IIe
  - Improved Error Handling [1]  Only for UltraSPARC-I, -II

NOTE [1]: Due to hardware limitations there is no improved error
          handling for UltraSPARC-IIi and UltraSPARC-IIe based systems.

Details on the Cache Scrubber
=============================

The cache scrubber reduces the likelihood of EDP, WP, and CP events by
shortening the data lifetime in the Ecache, and by eliminating parity
errors where possible. (See "Errors and Events" below for an explanation
of the EDP, WP, and CP event types.)

The cache scrubber is enabled by default. It scans the entire Ecache of
every CPU in the system once every ten seconds.

On an idle CPU, it scrubs all clean lines (lines identical to the system
memory they came from), and dirty lines (lines holding newer data than
the system memory they came from) that have good parity. This reduces
the lifetime of data in the Ecache on an idle CPU, reducing the
likelihood that a parity error will affect critical system or user data.

On a busy CPU, it only scrubs clean lines with bad parity (which might
otherwise lead to EDP or CP events). Clean lines with good parity and
dirty lines are left in the Ecache so as to not adversely impact system
performance.

To avoid causing WP events, the cache scrubber never scrubs dirty lines
with bad parity. These bad lines could get overwritten by the program
using them before they are accessed or flushed, thereby preventing a bad
parity event from occurring at all. (This is sometimes referred to as
the natural scrubbing behavior of a busy system.)

Errors and Events
=================

UltraSPARC processors can detect errors that are reported in the
following types of events (as detailed in the UltraSPARC-I/II User's
Manual, P/N 802-7220-02):

  ETP   A parity error was detected by the CPU when reading from the
        Ecache Tag SRAM.
        This is a fatal error because system coherency has been lost.
        The system will reset (POR) and Starfire domains will arbstop
        (UPA Fatal error). No Solaris error message will be generated.

  EDP   A parity error was detected by the CPU when reading from the
        Ecache Data SRAM on a cache hit.

  LDP   A parity error was detected by the CPU while reading main memory
        through its Ultra Data Buffer (UDB) chip on an Ecache miss. Note
        that the Ecache itself is not involved. This can occur when the
        CPU is reading non-cacheable data (for example, a frame buffer
        or I/O device), or when filling a line of cache from main
        memory.

  WP    A parity error was detected by one of the UDB chips while data
        was being written back from the Ecache into main memory. The UDB
        chips convert the data with bad parity into data with bad ECC,
        so that a subsequent access to the same physical address will
        result in a UE. (See UE below.) (The conversion of a parity
        error to a latent UE does not occur on either UltraSPARC-IIi or
        -IIe, which is one of the reasons why improved error handling is
        not available on those processors.)

  CP    A parity error was detected during a copyout transaction; that
        is, a data transfer from one CPU's Ecache to another CPU. This
        error is detected by the UDB chips of the providing CPU,
        resulting in the CP event. The providing CPU's UDB chips convert
        the data with bad parity to data with bad ECC, so that the UDBs
        of the receiving CPU will report a UE event. (See UE below.)

  UE    An uncorrectable memory error has occurred. This event refers to
        an error in the main system memory, reported by the system data
        bus on a read access. The underlying source of this error could
        be main memory, another CPU module (see CP above), or another
        UPA device (for example, the I/O controller). The UDB chips
        detect this error.

  CE    A correctable error was detected when reading from main memory,
        or when reading from another CPU's UDB chips.
        The data read has been corrected and valid data is given to the
        CPU and the CPU's Ecache. This error is detected by the UDB
        chips.

  BERR  A bus error has occurred during an attempt to read from a memory
        address. Either there is no device at that address, or the
        device at that address has returned a bus error. Therefore, bus
        errors are caused by a programming error or by a corrupted or
        defective device.

  TO    A bus timeout was encountered during an attempt to read from a
        memory address. Too much time has elapsed waiting for a device
        at that address to respond.

Details on Improved Error Handling
==================================

Any of the above mentioned errors can occur in kernel instruction space,
kernel data space, user instruction space, user data space, or when the
kernel reads or writes user data (as in copyin). Depending on these
different states, the operating system will react differently so as to
maximize system availability.

On EDP, LDP, CP, UE, BERR, and TO events, the system will panic if the
affected data is in kernel space or if the error occurs while the CPU is
at a trap level greater than zero. Otherwise, the process that caused
the error will be killed immediately (sent SIGKILL) and the system will
be rebooted (as if a privileged user had entered "init 6"). [2]

On WP events, an error is reported, and the memory scrubber is notified
to scan all of system memory for the latent UE the hardware has written
to memory (see below for the behavior of the memory scrubber on
encountering UE events). If some CPU later attempts to read this
location (other than on behalf of the memory scrubber), a UE event will
occur. Hence, when a UE event is encountered, it is recommended that the
log be checked for an earlier WP event that may have in fact caused the
UE event.

If the memory scrubber detects a UE event, the system will neither panic
nor reboot but trigger a recovery mechanism instead.
If the page containing the corrupted data is not in use, it will be
retired and the error will be cleared. If it is in use, it will be
marked for retirement and clearing if and when it is no longer in use.

NOTE [2]: An active SC2.X cluster node will panic with a "Failfast
          timeout" (usually with "Device closed while Armed") when
          rebooted. It is therefore useful to check the system messages
          for EDP, LDP, CP, UE, BERR, and TO events when encountering
          "Failfast timeout" panics.

Details on Improved Error Messages
==================================

For each error that is detected, the kernel generates an individual
report. This is a major change; previously, some errors would hide other
errors, and some errors were combined into a single message.

The report typically consists of several error messages. Each message
[3] contains an AFT ("Asynchronous Fault Trap") tag that eases
filtering, and an errID code that associates all of the messages emitted
for the same event. The errID is a 64-bit code that corresponds to a
specific set of error bits in the Asynchronous Fault Status Register
(AFSR) at a specific instant in time; the value has no intrinsic
meaning.

Each message may be longer than one physical line; long messages are
folded using embedded newlines. Each folded line begins with four space
characters.

NOTE [3]: Because of the introduction of improved error messages, any
          tool using the affected error messages may have to be
          modified. Neither the format nor the content of kernel error
          messages is a committed interface, and both may change
          without notice. Users (both internal and external) who rely
          on the exact format and/or content do so at their own risk.
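
The way the AFT tag and the errID tie the messages of one event together
can be sketched in a few lines of Python. This is only an illustration:
the regular expressions are assumptions derived from the sample messages
in this notice, the function name is hypothetical, and (per NOTE [3])
the message format is not a committed interface. Folded continuation
lines carry neither a tag nor an errID and are skipped here.

```python
import re
from collections import defaultdict

# Patterns assumed from the samples in this notice, for example
# "[AFT1] ... errID 0x0000ad88.6cd9989f". Not a committed format.
AFT_TAG = re.compile(r"\[AFT(?P<level>[0-3])\]")
ERR_ID = re.compile(r"errID (?P<errid>0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})")


def group_by_errid(lines):
    """Return {errID: [(aft_level, line), ...]} for AFT-tagged lines."""
    groups = defaultdict(list)
    for line in lines:
        tag = AFT_TAG.search(line)
        eid = ERR_ID.search(line)
        if tag and eid:
            level = int(tag.group("level"))
            groups[eid.group("errid")].append((level, line.strip()))
    return dict(groups)


# Lines shortened from the Category 1-3 examples in this notice.
SAMPLE = [
    "WARNING: [AFT1] EDP event on CPU1 Instruction access at TL=0, "
    "errID 0x0000ad88.6cd9989f",
    "    AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.0f0c8080",
    "[AFT2] errID 0x0000ad88.6cd9989f PA 0x00000000.0f0c8080",
    "[AFT3] errID 0x00000058.0d0dc830 Above Error detected by "
    "protected Kernel code",
]

for errid, messages in group_by_errid(SAMPLE).items():
    print(errid, [level for level, _ in messages])
```

Keying on the tag and the errID alone, rather than on the descriptive
text, keeps such a tool less sensitive to wording changes between patch
revisions.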
The error messages can be grouped into four categories:

Category 1: Messages that identify the type and source of an error
------------------------------------------------------------------

Example:

WARNING: [AFT1] EDP event on CPU1 Instruction access at TL=0, errID 0x0000ad88.6cd9989f
    AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.0f0c8080
    AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 FAULT_PC 0x780b481c
    UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00

Either the [AFT0] tag (for correctable errors) or the [AFT1] tag (for
uncorrectable errors) is present in the message. An "errID" field
appears at the end of the first line of the message. Messages from this
category are displayed on the console and collected in the log file. [4]

To aid diagnosis of an Ecache-related error, especially if multiple
components are involved, a heuristic algorithm has been included that
automates analysis of the P_SYND bytes. Every component reporting a
failure has its AFSR decoded and a score ranging from 5 to 95 is
assigned ("Score 95" in the above example). The Score indicates the
likelihood that this component was the original source of the bad
parity. The higher the value, the higher the likelihood that this
component was the original source.

NOTE [4]: This is the default behavior. The /etc/system setting
          report_ce_console is no longer referenced and should therefore
          be removed.
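
The relationship between a raw AFSR value and the flags printed in angle
brackets can be illustrated with a short Python sketch. The bit masks
below are assumptions inferred only from the CPU sample messages in this
notice (for example, 0x80408000 is printed as <PRIV,EDP> with PSYND
0x8000); they are not the complete register layout, the function names
are hypothetical, and the sketch does not reproduce the kernel's Score
heuristic.

```python
# Masks for the low 32 bits of the CPU AFSR, inferred from the sample
# messages in this notice. Flags that never appear in those samples
# (such as LDP, TO and ETP) are deliberately left out.
AFSR_FLAGS = [
    (0x80000000, "PRIV"),
    (0x04000000, "BERR"),
    (0x01000000, "CP"),
    (0x00800000, "WP"),
    (0x00400000, "EDP"),
    (0x00200000, "UE"),
    (0x00100000, "CE"),
]
# AFSR.PSYND matches the low 16 bits of the AFSR in every sample.
PSYND_MASK = 0x0000FFFF


def parse_afsr(text):
    """Convert the printed form '0x00000000.80408000' to an integer."""
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_afsr(text):
    """Return (list of flag names, PSYND value) for a printed AFSR."""
    value = parse_afsr(text)
    flags = [name for mask, name in AFSR_FLAGS if value & mask]
    return flags, value & PSYND_MASK


flags, psynd = decode_afsr("0x00000000.80408000")
print(",".join(flags), hex(psynd))
```

A decoder like this is mainly useful for older messages that print the
AFSR without the bracketed flag names.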
Category 2: Messages that supply a cache line or memory dump ------------------------------------------------------------ Example: [AFT2] errID 0x0000ad88.6cd9989f PA 0x00000000.0f0c8080 E$tag 0x00000000.0bc001e1 E$State: Modified E$parity 0x05 [AFT2] E$Data (0x00): 0xffffffff.beefface *Bad* PSYND=0x8000 [AFT2] E$Data (0x08): 0x00000000.00000000 [AFT2] E$Data (0x10): 0x6d656d6d.6f727920 [AFT2] E$Data (0x18): 0x6572726f.7220696e [AFT2] E$Data (0x20): 0x6a656374.6f720000 [AFT2] E$Data (0x28): 0x6d656d74.65737420 [AFT2] E$Data (0x30): 0x6d757465.780059f8 [AFT2] E$Data (0x38): 0x00000300.00c11000 [AFT2] Event PA displayed in AFAR was derived from E$Tag Messages from this category are targeted for Sun Microsystems support staff to be used in backline diagnosis and for statistics. The [AFT2] tag is always present in these messages. The "errID" field appears at the beginning of the first line of the message. Messages from this category are by default only collected in the log file. Category 3: Messages from the kernel error recovery code -------------------------------------------------------- Example: [AFT3] errID 0x00000058.0d0dc830 Above Error detected by protected Kernel code that will try to clear error from system Messages from this category supply analysis information from the kernel error recovery code, thereby indicating the actions the kernel took to contain the error. The [AFT3] tag is always present in these messages. An "errID" field appears at the beginning of the first line of the message. Messages from this category are by default only collected in the log file. Category 4: Messages that indicate the disposition of an error -------------------------------------------------------------- Example: panic[CPU1]/thread=30000670800: [AFT1] errID 0x00000392.89cbfefc EDP Error(s) See previous message(s) for details Messages from this category state the final handling (like panic or reboot) of a previously encountered error. 
Either the [AFT0] tag (for correctable errors) or the [AFT1] tag (for uncorrectable errors) is present in the message. The "errID" field appears at the beginning of the first line of the message. Messages from this category are displayed on the console and collected in the log file. Error Messages Examples ======================= The following compares previous messages with the new, improved error messages. Note that this is not an exhaustive list, but a sampling of possible messages for each event type. This also just shows what appears on the console; the log-only messages are not shown. Lines are shown exactly as they appear on the console. If you print this file, you will need to either use software that wraps long lines, or print in landscape mode. EDP Event - Ecache Data Parity Event ------------------------------------ * Solaris 8 Message - Kernel Data: panic[CPU1]/thread=3000225bcc0: CPU1 Ecache SRAM Data Parity Error: AFSR 0x00000000.80408000 AFAR 0x00000000.0bd83bd0 * Improved Message - Kernel Data: WARNING: [AFT1] EDP event on CPU1 Data access at TL=0, errID 0x00000093.6323e6f8 AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.06901980 AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 Fault_PC 0x78128a84 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 panic[cpu1]/thread=30000ae5000: [AFT1] errID 0x00000093.6323e6f8 EDP Error(s) See previous message(s) for details * Solaris 8 Message - User Data: panic[CPU3]/thread=30001f4fa00: CPU3 Ecache SRAM Data Parity Error: AFSR 0x00000000.00400080 AFAR 0x00000000.01820000 * Improved Message - User Data (Reboot): Aug 16 16:47:20 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] EDP event on CPU3 Data access at TL=0, errID 0x00000057.d35eff81 Aug 16 16:47:20 thishost AFSR 0x00000000.00400080<EDP> AFAR 0x00000000.05e24418 Aug 16 16:47:20 thishost AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 Fault_PC 0x11ce8 Aug 16 16:47:20 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 Aug 16 16:47:20 thishost unix: 
NOTICE: Scheduling clearing of error on page 0x00000000.05e24000 Aug 16 16:47:20 thishost unix: WARNING: [AFT1] initiating reboot due to above error in pid 309 (mtst) Aug 16 16:47:23 thishost unix: NOTICE: Previously reported error on page 0x00000000.05e24000 cleared INIT: New run level: 6 The system is coming down. Please wait. System services are now being stopped. Print services stopped. Aug 16 16:47:27 thishost syslogd: going down on signal 15 The system is down. syncing file systems... done rebooting... Resetting ... * Solaris 8 Message - Kernel Data at TL=1: panic[CPU3]/thread=30001cfabe0: Async data error at tl1: AFAR 0x00000000.0ab8f760 AFSR 0x00000000.80400080 * Improved Message - Kernel Data at TL=1 (Panic): WARNING: [AFT1] EDP event on CPU3 Data access at TL>0, errID 0x00000111.53a7b8dd AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.01f47dc0 AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 Fault_PC 0x1002fe20 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 panic[cpu3]/thread=30000a4e040: [AFT1] errID 0x00000111.53a7b8dd EDP Error(s) See previous message(s) for details * Solaris 8 Message - Kernel Instruction at TL=1: panic[CPU3]/thread=3000226a140: Async instruction error at tl1: AFAR 0x00000000.0dd55f70 AFSR 0x00000000.80408000 * Improved Message - Kernel Instruction at TL=1 (Panic): WARNING: [AFT1] EDP event on CPU3 Instruction access at TL>0, errID 0x00000043.24bfd349 AFSR 0x00000000.80400800<PRIV,EDP> AFAR 0x00000000.0605c790 AFSR.PSYND 0x0800(Score 95) AFSR.ETS 0x00 Fault_PC 0x1002fe20 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 panic[cpu3]/thread=30000ad05c0: [AFT1] errID 0x00000043.24bfd349 EDP Error(s) See previous message(s) for details WP Event - Writeback Data Parity Error -------------------------------------- * Solaris 8 Message: panic[CPU1]/thread=30001b26640: CPU1 Ecache Writeback Data Parity Error: AFSR 0x00000000.00800080 AFAR 0x00000000.0d5010f0 * Improved Message: Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: 
[AFT1] WP event on CPU1, errID 0x0000002b.3c7cd6d9 Aug 16 16:50:56 thishost AFSR 0x00000000.00800080<WP> AFAR 0x000001c8.01802800 Aug 16 16:50:56 thishost AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 Fault_PC 0x11d7c Aug 16 16:50:56 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, errID 0x0000002b.45daae92 Aug 16 16:50:56 thishost AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.03824418 Aug 16 16:50:56 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023414 Aug 16 16:50:56 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03 Aug 16 16:50:56 thishost UDBL Syndrome 0x3 Memory Module 190x Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] errID 0x0000002b.45daae92 Syndrome 0x3 indicates that this may not be a memory module problem Aug 16 16:50:56 thishost unix: NOTICE: Scheduling clearing of error on page 0x00000000.03824000 Aug 16 16:50:58 thishost unix: NOTICE: Previously reported error on page 0x00000000.03824000 cleared NOTE: The last message (reporting clearing of the error) may appear much later, or may never appear, as the page may never drop out of use. Also, the message reporting scheduling of clearing may occur more than once, as the memory scrubber may encounter the particular UE more than once before it can be cleared. 
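
The recommendation to check the log for an earlier WP event whenever a
UE event is reported can be sketched as a simple scan. This is a minimal
illustration in Python, assuming the message wording shown in the
samples above; the patterns and the function name are hypothetical, and
an earlier WP event is only a candidate cause, not proof of one.

```python
import re

# Patterns assumed from the improved WP and UE sample messages above.
ERRID = r"(0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})"
WP_RE = re.compile(r"\[AFT1\] WP event on (CPU\d+), errID " + ERRID)
UE_RE = re.compile(
    r"\[AFT1\] Uncorrectable Memory Error on (CPU\d+) .*errID " + ERRID
)


def earlier_wp_for_ue(lines):
    """For each UE report, list the WP events logged before it.

    Returns [(ue_errid, ue_cpu, [(wp_errid, wp_cpu), ...]), ...].
    """
    seen_wp = []
    report = []
    for line in lines:
        wp = WP_RE.search(line)
        if wp:
            seen_wp.append((wp.group(2), wp.group(1)))
            continue
        ue = UE_RE.search(line)
        if ue:
            report.append((ue.group(2), ue.group(1), list(seen_wp)))
    return report


# First lines of the two reports in the WP example above.
LOG = [
    "Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] WP "
    "event on CPU1, errID 0x0000002b.3c7cd6d9",
    "Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] "
    "Uncorrectable Memory Error on CPU3 Data access at TL=0, "
    "errID 0x0000002b.45daae92",
]

for ue_id, ue_cpu, candidates in earlier_wp_for_ue(LOG):
    print(ue_id, ue_cpu, candidates)
```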
CP Event - Copyout Data Parity Error ------------------------------------ * Solaris 8 Message: panic[CPU3]/thread=2a100105d40: CPU3 UE Error: Ecache Copyout on CPU1: AFSR 0x00000000.01000080 AFAR 0x00000000.06c53090 * Improved Message - Kernel (Panic): WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, errID 0x0000003a.30aafcba AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.00347dc0 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78067b54 UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00 UDBH Syndrome 0x3 Memory Module 190x WARNING: [AFT1] errID 0x0000003a.30aafcba Syndrome 0x3 indicates that this may not be a memory module problem WARNING: [AFT1] CP event on CPU1 (caused Data access error on CPU3), errID 0x0000003a.30aafcba AFSR 0x00000000.01008000<CP> AFAR 0x00000000.00347dc0 AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 panic[cpu3]/thread=2a100157d40: [AFT1] errID 0x0000003a.30aafcba UE Error(s) See previous message(s) for details * Improved Message - User (Reboot): Aug 16 17:06:44 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, errID 0x0000002b.963a3d3c Aug 16 17:06:44 thishost AFSR 0x00000000.00200000<UE> AFAR 0x00000000.00224418 Aug 16 17:06:44 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x12380 Aug 16 17:06:44 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03 Aug 16 17:06:44 thishost UDBL Syndrome 0x3 Memory Module 190x Aug 16 17:06:44 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] errID 0x0000002b.963a3d3c Syndrome 0x3 indicates that this may not be a memory module problem Aug 16 17:06:44 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] CP event on CPU1 (caused Data access error on CPU3), errID 0x0000002b.963a3d3c Aug 16 17:06:44 thishost AFSR 0x00000000.01000080<CP> AFAR 0x00000000.00224418 Aug 16 17:06:44 thishost AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 Aug 16 17:06:44 thishost UDBH 
0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Aug 16 17:06:44 thishost unix: NOTICE: Scheduling clearing of error on page 0x00000000.00224000
Aug 16 17:06:44 thishost unix: WARNING: [AFT1] initiating reboot due to above error in pid 304 (mtst)
Aug 16 17:06:46 thishost unix: NOTICE: Previously reported error on page 0x00000000.00224000 cleared
INIT: New run level: 6
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
Aug 16 17:06:50 thishost syslogd: going down on signal 15
The system is down.
syncing file systems... done
rebooting...
Resetting ...

NOTE: Due to a coding error, early versions of some of the patches
      produce the string "CP Error" instead of "CP event"; programs that
      parse the messages must be prepared to deal with both.

UE Event - Uncorrectable Memory Error
-------------------------------------

* Solaris 8 Message - CPU Reference to Memory:

panic[CPU1]/thread=2a10007dd40: UE Error: AFSR 0x00000000.80200000 AFAR 0x00000000.089cd740 Id 0 Inst 0 MemMod U0501 U0401

* Improved Message - CPU Reference to Memory - Kernel (Panic):

WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access at TL=0, errID 0x0000004f.818d9280
    AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0685c7a0
    AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x7815c7a0
    UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
    UDBH Syndrome 0x3 Memory Module 190x
WARNING: [AFT1] errID 0x0000004f.818d9280 Syndrome 0x3 indicates that this may not be a memory module problem
panic[cpu1]/thread=30000ad6320: [AFT1] errID 0x0000004f.818d9280 UE Error(s) See previous message(s) for details

* Improved Message - CPU Reference to Memory - User (Reboot):

Aug 16 17:03:04 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access at TL=0, errID 0x00000032.593d8229
Aug 16 17:03:04 thishost AFSR 0x00000000.00200000<UE> AFAR 0x00000000.04921bf0
Aug 16 17:03:04 thishost AFSR.PSYND 0x0000(Score 05)
AFSR.ETS 0x00 Fault_PC 0x11bf0 Aug 16 17:03:04 thishost UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00 Aug 16 17:03:04 thishost UDBH Syndrome 0x3 Memory Module 190x Aug 16 17:03:04 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] errID 0x00000032.593d8229 Syndrome 0x3 indicates that this may not be a memory module problem Aug 16 17:03:04 thishost unix: NOTICE: Scheduling clearing of error on page 0x00000000.04920000 Aug 16 17:03:07 thishost unix: NOTICE: Previously reported error on page 0x00000000.04920000 cleared Aug 16 17:03:07 thishost unix: WARNING: [AFT1] initiating reboot due to above error in pid 304 (mtst) INIT: New run level: 6 The system is coming down. Please wait. System services are now being stopped. Print services stopped. Aug 16 17:03:13 thishost syslogd: going down on signal 15 The system is down. syncing file systems... done rebooting... Resetting ... * Solaris 8 Message - SBus I/O Reference to Memory: panic[CPU1]/thread=2a10007dd40: SBus0 UE Primary Error DMA read: AFSR 0x40001be0.00000000 AFAR 0x00000000.02818000 MemMod U0501 U0401 Id 31 * Improved Message - SBus I/O Reference to Memory: WARNING: SBus0 UE Primary Error DMA read: AFSR 0x40001be0.00000000 AFAR 0x00000000.0d25c000 MemMod U0501 U0401 Id 31 panic[cpu0]/thread=2a10007dd40: Fatal Sbus0 UE Error BERR Event - Bus Error ---------------------- * Solaris 8 Message: panic[CPU1]/thread=30000d2c300: CPU1 Privileged Bus Error: AFSR 0x00000000.84000000 AFAR 0x00000000.03422000 * Improved Message - Kernel (Panic): WARNING: [AFT1] Bus Error on System Bus in privileged mode from CPU1 Data access at TL=0, errID 0x0000002c.52b3d2c8 AFSR 0x00000000.84000000<PRIV,BERR> AFAR 0x00000000.05224410 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x780671a4 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00 panic[cpu1]/thread=30000b06080: [AFT1] errID 0x0000002c.52b3d2c8 BERR Error(s) See previous message(s) for details CE Event - Correctable Memory Error ----------------------------------- 
* Solaris 8 Message:

May 8 14:35:30 thishost SUNW,UltraSPARC-II: CPU1 CE Error: AFSR 0x00000000.00100000 AFAR 0x00000000.8abb5a00 UDBH Syndrome 0x85 MemMod U0904
May 8 14:35:30 thishost SUNW,UltraSPARC-II: ECC Data Bit 63 was corrected
May 8 14:35:30 thishost unix: Softerror: Intermittent ECC Memory Error, U0904

* Improved Message:

Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] Corrected Memory Error on CPU1, errID 0x00000036.629edc25
Aug 16 16:34:48 thishost AFSR 0x00000000.00100000<CE> AFAR 0x00000000.00347dc0
Aug 16 16:34:48 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1002fe20
Aug 16 16:34:48 thishost UDBH Syndrome 0x85 Memory Module 1904
Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] errID 0x00000036.629edc25 Corrected Memory Error on 1904 is Intermittent
Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] errID 0x00000036.629edc25 ECC Data Bit 63 was in error and corrected

Starfire Specific
*****************

Arbstop
=======

STag Parity Errors on an E10000 almost always result in a "UPA Fatal
Error" Arbstop dump. Although these can also be caused by poor VCore
voltage power pucks on a System Board, error trends have shown that
these errors are generally an "ETP Event", caused by the CPU identified
in the Arbstop dump file.

Recordstop
==========

Recordstop dump files will be generated anytime data is transferred
through the crossbar of the Starfire centerplane. This means that a
recordstop is likely to occur during WP, CP, and LDP events.

As always, the "psi" reported error is an extremely strong indication of
the source of the "UE ECC Error" as reported in the wfail output of
redx. The reporting XDB can be associated with one or two CPUs, but
which CPU actually sourced the data cannot be determined from the
recordstop itself, unless only one of the two possible CPUs is present
at the time of the error. In these cases, a syndrome of 03 is always
present in the XDB Error report.
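When cross-checking a recordstop against the Solaris logs, the improved
[AFTn] messages shown in the examples above can be reduced to a few
fields per event (errID, AFSR flags, Score). The following is a minimal
Python sketch, not a supported tool. The function name parse_aft and the
regular expressions are illustrative assumptions derived only from the
sample messages in this document; they may not match every message
variant that the Kernel Patches emit.

```python
import re

# Patterns inferred from the sample messages in this FIN (assumption:
# other message variants may need additional patterns).
HEX64 = r"0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8}"
AFT_RE = re.compile(r"\[AFT(\d)\].*?errID (" + HEX64 + r")")
AFSR_RE = re.compile(r"AFSR (" + HEX64 + r")<([A-Z,]+)>")
SCORE_RE = re.compile(r"\(Score (\d+)\)")


def parse_aft(lines):
    """Group [AFTn] log lines into one record per errID.

    Detail lines (AFSR, Score) carry no errID, so they are attached to
    the most recently seen [AFTn] header. Lines seen before any header
    are ignored.
    """
    events = {}
    current = None
    for line in lines:
        header = AFT_RE.search(line)
        if header:
            err_id = header.group(2)
            current = events.setdefault(err_id, {
                "errid": err_id,
                "level": int(header.group(1)),  # the n in [AFTn]
                "afsr": None,
                "flags": [],
                "score": None,
            })
        if current is None:
            continue
        afsr = AFSR_RE.search(line)
        if afsr:
            current["afsr"] = afsr.group(1)
            current["flags"] = afsr.group(2).split(",")
        score = SCORE_RE.search(line)
        if score:
            current["score"] = int(score.group(1))
    return list(events.values())


# Sample input copied from the CE and BERR examples in this document.
SAMPLE = [
    "Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] Corrected Memory "
    "Error on CPU1, errID 0x00000036.629edc25",
    "Aug 16 16:34:48 thishost AFSR 0x00000000.00100000<CE> "
    "AFAR 0x00000000.00347dc0",
    "Aug 16 16:34:48 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 "
    "Fault_PC 0x1002fe20",
    "WARNING: [AFT1] Bus Error on System Bus in privileged mode from CPU1 "
    "Data access at TL=0, errID 0x0000002c.52b3d2c8",
    "AFSR 0x00000000.84000000<PRIV,BERR> AFAR 0x00000000.05224410",
    "AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x780671a4",
]

if __name__ == "__main__":
    for event in parse_aft(SAMPLE):
        print(event["errid"], event["level"], event["flags"], event["score"])
```

Keying on errID keeps the header line and any later [AFTn] lines for the
same event in one record. Unrelated log lines interleaved between a
header and its detail lines are not handled by this sketch.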
Use the recordstop dump to complement and confirm information provided
by Solaris in the message and console logs. Expect Solaris to report a
relatively high "score" against one of the CPUs attached to the
reporting XDB within the AFT messages previously described in this
document.

Note of Caution: Conversely, an XDB could report an "ldat" error with a
syndrome of 03, which includes the same data pattern and xmux_par
values. In these cases, the XDB that reports the "ldat" error is the XDB
for the "victim" CPU in a copyback (CP) event. In essence, an "ldat"
error reported by an XDB will actually prove that the CPUs it services
are victims of another CPU's Cache Parity Error, and therefore can be
used to exonerate the attached CPUs.

These XDB-reported "ldat" errors are extremely rare, but can occur due
to other variables in a Starfire platform. These errors may or may not
be reported with a complementary "psi" error, but the XDBs will continue
to report a "UE ECC" error in the wfail output, along with a syndrome
value of 0x03. For these events, the "ldat" error exonerates the
attached CPUs, and might be traced back through the Centerplane X-Bar to
the System Board where the data originated. However, it is likely that
the error will not be traceable back to a CPU on the sourcing System
Board, unless a corresponding "psi"-side error is reported by an XDB
from that System Board.

For all XDB-reported "ldat" errors, expect Solaris to report a low
"score" against one of the CPUs attached to the reporting XDB within the
AFT messages previously described in this document.

DTag Considerations
===================

A rumor has been circulating that this patch increases the rate of DTag
parity errors on E10000 systems. That rumor is false. The development
team observed two customer systems (out of 30) using the USER-level
scrubber that experienced an increased rate of DTag parity errors.
It was determined that the combination of the USER-level scrubber plus a
certain customer-dependent application mix (which we have yet to
characterize) tickles marginal E10000 boards into producing DTag parity
errors. The KERNEL-level scrubber that is contained in the patch uses a
completely different algorithm, and does not have this tickling effect.
There have been no reports of increased DTag parity errors with the
KERNEL-level scrubber.

If a customer experiences a DTag parity error, with or without the USER-
or KERNEL-level scrubber, standard replacement policies apply.

IMPLEMENTATION: (T) (R) (Proactive vs Reactive)

     ---
    |   |   MANDATORY (Fully Pro-Active)
     ---

     ---
    | X |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
     ---

     ---
    |   |   REACTIVE (As Required)
     ---

CORRECTIVE ACTION:

The following recommendations are provided as a guide for authorized
Enterprise Services Field Representatives and Enterprise Customers on
UltraSPARC based platforms running Solaris versions 2.5.1, 2.6, 7,
and 8:

1. If this system is running the user level cache scrubber, remove it.

   To determine whether a system is running the user level cache
   scrubber, enter the command:

       /usr/lib/cachescrubber -V

   If the response is "Command not found," then the user level scrubber
   is not installed on this system. If the response is a message
   containing the current version of the user level scrubber, then the
   user level scrubber is installed on this system and must be removed.

   To remove the user level cache scrubber, follow the removal procedure
   as described in the README file for the user level cache scrubber.
   The removal procedure varies for different versions of the scrubber.

2. Apply the appropriate Kernel Patch for the version of Solaris per
   the chart below.

   PatchId     Solaris Release   Availability (estimated)
   ---------   ---------------   ------------
   103640-34   Solaris 2.5.1     Now
   105181-23   Solaris 2.6       Now
   106541-13   Solaris 7         Nov/10/2000
   108528-04   Solaris 8         Nov/15/2000

3.
If the system is running SunMC, apply the appropriate SunMC patches,
   per the table below. This is necessary to maintain SunMC's ability to
   report corrected memory errors. If the system is running SyMon, it
   will be necessary to upgrade to SunMC and then apply the appropriate
   patch.

                              Solaris Release
                              ---------------
                      2.6         7 [6]       8 [6]
                      ---------   ---------   ---------
   SunMC 2.1 FCS      110151-01   110213-01   110216-01
   SunMC 2.1 L10N     110152-01   110214-01   110217-01
   SunMC 2.1.1 FCS    110094-01   110215-01   110218-01

   (SunMC patches are not needed for 2.5.1 as the 2.5.1 patch does not
   contain improved error messages.)

   NOTE [6]: The patches for Solaris 7 and Solaris 8 are not yet
   available.

4. To ensure proper preservation of system error messages across a
   panic or reboot:

   - E3x00, E4x00, E5x00, E6x00 systems must apply OBP patch 103346-25
     (or higher).

   - E10000 systems must activate netcon logging, as described in
     FIN I0593-1.

5. If the system has operations personnel who have been trained to
   respond to the older system error and panic messages, these personnel
   must be notified of, and become familiar with, the changed error
   messages that are described in this document (see "Details on
   Improved Error Messages" and "Error Messages Examples" sections,
   above). See the comment on "Customer White Paper," below.

6. If the system employs custom software tools that extract system
   messages from kernel core dumps or log files (like
   /var/adm/messages), these tools will have to be modified to recognize
   the new messages. See the comment on "Customer White Paper," below.

7. For FRU replacement guidelines, refer to the Best Practices Guide:

   http://bestpractices.central.sun.com/BestPrac_Sept11_2000.ps

8. For those systems where the appropriate Kernel Patches listed above
   have not yet been applied, FIN I0570-3 will remain the reference
   document for troubleshooting Ecache errors.

9. A mailing list has been set up to address the KJP.
Any bugs filed against the cache scrubber or error recovery mechanisms
   should include this mailing list on the interest list of the bug. In
   addition, any unexplained system behavior changes should be directed
   to this mailing list as well.

   [email protected]

COMMENTS:

User Level Cache Scrubber
-------------------------

The user level cache scrubber was an early process-level implementation
of the cache scrubber, deployed by a small number of customers as an
interim measure. Its functionality is superseded by that of the kernel
level cache scrubber that is provided in the Kernel Patch.

Running the user level scrubber on a system that has the Kernel Patch
applied may degrade performance and will defeat some of the
functionality of the kernel cache scrubber. For this reason, the user
level scrubber should be removed (uninstalled) prior to applying the
Kernel Patch.

Systems that are currently running the user level cache scrubber and are
not applying the Kernel Patch (for example, on platforms where the
Kernel Patch is not yet available) should continue to run the user level
scrubber. The user level scrubber should be removed only in preparation
for installing the Kernel Patch.

AFSR Decoder Tool
-----------------

The SPG-CTE AFSR decoder is available at the following URLs:

    http://cte-www.uk/cgi-bin/afsr/afsr.pl
    http://cte-www.eng/cgi-bin/afsr/afsr.pl

Output equivalent to that of the AFSR decoder is now immediately
available in the [AFTn] messages. However, the tool remains useful while
troubleshooting I/O related problems (DVMA transaction) and it has been
updated to reflect the Score parameter. (See "Details on Improved Error
Messages" Category 1, above, for details on Score.)

Customer White Paper
--------------------

A customer white paper is being written that describes the improved
error handling capabilities of the Solaris Operating System.
The document will assist customer operations personnel and monitoring
tools developers who need to become familiar with the new error
messages.

--------------------------------------------------------------------------

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to
     contact all affected customers to recommend implementation of
     the FIN.

ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN (to their
     respective accounts), at the convenience of the customer.

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as
     the need arises.

--------------------------------------------------------------------------

All released FINs and FCOs can be accessed using your favorite network
browser as follows:

SunWeb Access:
--------------

* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN
  and FCO Homepage collections.

SunSolve Online Access:
-----------------------

* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO
  index.

Supporting Documents:
---------------------

* Supporting documents for FIN/FCOs can be found on Edist. Edist can be
  accessed internally at the following URL: http://edist.corp/.

* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO. This will display supporting
  directories/files for FINs or FCOs.

Internet Access:
----------------

* Access the top level URL of https://infoserver.Sun.COM

--------------------------------------------------------------------------

General:
--------

* Send questions or comments to [email protected]

--------------------------------------------------------------------------