# Nanometer Soft Errors, what lies beneath?

### Abstract

The increased performance of integrated circuits has been the key innovation behind the success of the semiconductor business over the past decade. Technology has advanced through steady growth in performance at lower cost. Many challenges, such as chip reliability, production capacity, yield and quality, have been successfully overcome.

With rapid nanometer evolution, the semiconductor industry constantly faces crucial new challenges. As the physical dimensions of integrated circuits continue to shrink and the number of functions correspondingly continues to grow, signal integrity is becoming a major issue.

One of the major growing concerns is the transient faults that occur in VLSI circuits due to external radiation and that affect the logic states of sensitive nodes. These faults are called soft errors.

In this paper I will discuss the cause, the effects and the methodologies to reduce soft errors in nanometer process technologies.

### Introduction

As process technology advances into the deep nanometer range, the reliability of CMOS VLSI/ULSI systems is becoming a major concern. One of the main causes of reduced reliability is charged-particle strikes due to cosmic radiation, which create soft errors, also referred to as Single Event Upsets (SEUs). Soft errors in a semiconductor device are glitches that are completely random, usually not catastrophic, and normally do not destroy the device, but they may cause a logical error that leads to wrong functionality. They are caused by external elements outside the designer's control. Many systems can tolerate some level of soft errors; the exceptions are systems that must meet a high reliability standard (avionics, medical equipment, the space industry, etc.). In consumer electronics (video, audio, etc.) these errors may or may not be noticeable or even important to the user. In past processes, this problem was limited to radiation-hostile environments such as space. With deep nanometer designs, however, low-energy particles at ground level can cause soft errors, making CMOS circuits sensitive to atmospheric neutrons as well as to alpha particles emitted by unstable isotopes found in the materials of a nanometer-process integrated circuit.

### What are Soft Errors?

A soft error is a fault in an integrated circuit in which a signal or datum is wrong. If detected, a soft error may be corrected by rewriting correct data in place of the erroneous data. Errors in general may be caused by a defect, usually understood to be either a mistake in design or construction or a broken component; a soft error, by contrast, does not imply any such mistake or breakage. Highly reliable systems use error correction methodologies to correct soft errors on the fly. However, in many systems it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, the system may crash before the correction can occur, in which case the recovery procedure must include a reboot.

### The phenomenon

When a particle hits the silicon layer of an integrated circuit, it leaves a cylindrical track of electron-hole pairs. When the resulting ionization track traverses or comes close to the depletion region of a digital gate, the electric field rapidly collects carriers, creating a current/voltage glitch at the gate's node. The farther from the junction the hit occurs, the smaller the charge collected and the less likely the event is to cause a current/voltage glitch.

The *collected charge* depends on a complex combination of factors, including the gate's size, the biasing of the various circuit nodes, the substrate structure, the device doping level, and characteristics of the particle hit such as energy, trajectory, and charge. The minimum amount of charge necessary to disturb a memory element is called the *critical charge*. It depends on the node capacitance, the operating voltage, and, for static memory cells such as SRAM cells and flip-flops, the strength of the feedback transistors.

Whether a circuit experiences a soft error because of a particle hit depends on the energy of the incoming particle, the geometry of the impact, and the design of the logic circuit. For simple isolated junctions (such as DRAM cells in storage mode), a particle hit induces a soft error if the collected charge is greater than the critical charge. In SRAM and in logic with active feedback, a soft error occurs only if the collected charge exceeds the critical charge by a factor related to the compensation current from the feedback. In general, a higher critical charge means fewer soft errors. Unfortunately, a higher critical charge also means a slower logic gate and higher power dissipation. Although desirable for many reasons, reductions in chip feature size and supply voltage decrease the critical charge; thus, the importance of soft errors increases as chip technology advances.

While the effects of alpha particles can be reduced by using non-contaminated packaging materials, the effect of neutrons remains a major concern. Even in advanced manufacturing processes using highly purified chip and packaging materials, shielding from high- and low-energy neutrons requires 50 feet of concrete.

As semiconductor technology moves towards deep nanometer geometries, additional soft-error types have emerged. These include electrical noise from crosstalk, disturbances in the power supply, and electromagnetic interference generated by other electronic circuits operating nearby. Although they differ in nature from the radiation effects described above, their net result is the same: an unpredictable and spontaneous alteration of the information stored in a digital circuit, that is, a soft error.



Figure 1: The phenomenon of Soft Error

### Causes and types of Soft Errors

Soft errors are caused by a charged particle striking a semiconductor memory or a memory-type element. Specifically, the charge (electron-hole pairs) generated by the interaction of an energetic charged particle with the semiconductor atoms corrupts the information stored in the memory cell. These charged particles can come directly from radioactive materials and cosmic rays, or indirectly as a result of high-energy particle interaction with the semiconductor itself. High-energy cosmic rays and solar particles react with the upper atmosphere, generating high-energy protons and neutrons that shower to the ground. Neutrons are particularly troublesome because they can penetrate most man-made construction (a neutron can easily pass through five feet of concrete). This effect varies with both latitude and altitude. In London, the effect is 1.2 times worse than at the equator. In Denver, with its high altitude, the effect is three times worse than at sea level, and in an airplane it can be 100 to 800 times worse than on the ground. This type of soft error is defined as a critical charge SER.

Another common source of these errors is alpha particles, which are emitted by the trace amounts of radioactive isotopes present in the packaging materials of integrated circuits. This type of soft error is called package decay. Bump materials used in the flip-chip packaging technique have also been identified as containing significant alpha particle sources.

The radiation mechanisms that cause soft errors have been studied for the last few decades. In the late 1970s, alpha particles emitted by uranium and thorium impurities in packaging materials were the dominant cause of soft errors in DRAM devices. During the same era, studies showed that high-energy neutrons from cosmic radiation could induce soft errors via the secondary ions produced by the neutron reactions with silicon nuclei. In the mid 1990s, high-energy cosmic radiation became the dominant source of soft errors in DRAM devices. Studies have also identified a third mechanism induced by the interactions of low-energy cosmic neutrons with the boron-10 isotope, which can be found in the borophosphosilicate glass (BPSG) used in integrated circuits to form insulator layers. The phenomenon was defined as a cosmic ray SER.

Although the phenomenon was first noticed in DRAMs, SRAM memories and SRAM-based programmable logic devices are subject to the same effects. Unlike capacitor-based DRAMs, SRAMs are constructed of cross-coupled devices, which have far less capacitance in each cell. The lower the capacitance of a cell, the greater the chance of an upset. As both the voltage and cell size are reduced with each new process generation, the SRAM cell capacitance continues to decrease, making the cell vulnerable to more types of (lower-energy) particles. In 2000, Sun's UltraSPARC II workstations were crashing at an alarming rate. The initial inability to locate the source of the problem created significant customer dissatisfaction for Sun. The root cause was finally traced to IBM-supplied SRAMs that were experiencing high upset rates due to charged particles causing soft errors in the memory system. Ultimately, Sun not only switched memory vendors but also designed new error checking and correcting logic and implemented it across the entire cache architecture.

#### Firm errors

Although the physical phenomenon is often referred to as a soft error or as the soft error rate (SER), strictly speaking, this term only applies to memory elements used for data storage. An error in a memory element is considered soft because it corrupts the data. This same type of radiation induced error in an FPGA is a "firm" error, because it is not just a transient data error. When a firm error occurs, the data is not corrupted; it is the device's functionality that is affected. There are no soft errors in an SRAM FPGA configuration memory; they are firm errors, and they can have serious system consequences.

### Soft Errors Effects

Soft errors are a critical issue in high-safety and high-performance systems. Among these are aerospace and military applications, avionics and transportation, medical equipment and high-end networking systems. Since soft errors occur in memory elements such as SRAM, DRAM, latches, and registers, they directly affect the information stored in digital circuits and can cause the failure of an entire piece of equipment. Storage circuits are nearly indispensable in today's electronics, which may therefore be significantly impacted by soft errors. Soft errors originating from current/voltage glitches in combinational logic can have the same effect when these glitches are latched by downstream memory elements. SRAM-based FPGAs store their configuration in a large number of SRAM cells, which gives a high chance of configuration corruption due to soft errors.

### Soft Errors Analysis

### Single-Event Effects (SEE)

The natural space environment contains several kinds of energetic subatomic particles, such as neutrons, protons and heavy ions, that can collide with electronic devices and cause different types of damage. Single-Event Effects (SEE) are disturbances in an active electronic device caused by a single energetic particle, and they can take many forms. They normally appear as transient pulses in logic or as bit flips in memory cells or registers. As semiconductor process geometries decrease, transistor threshold voltage also decreases. These lower thresholds reduce the ionizing charge per node required to cause errors, thereby increasing the device's susceptibility to SEE. Single-event phenomena can be classified into three effects, in order of permanency, as plotted in Figure 2:

1. Single-Event Upset (SEU)
2. Single-Event Latchup (SEL)
3. Single-Event Burnout (SEB)



Figure 2: Single-Event Effects (SEE) classification

SEU is defined by NASA as "radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs". An SEU reverses the stored digital information in a storage or sequential circuit. SEUs are transient, non-destructive soft errors, which means that a reset or rewrite of the device restores normal behavior. SEUs manifest themselves as either SBUs (Single-Bit Upsets) or MBUs (Multiple-Bit Upsets). An SBU is the flipping of one bit due to the passage of a single energetic particle, whereas in an MBU a single ion hits two or more bits, causing simultaneous errors. The SER of MBUs is much lower (hundreds or thousands of times lower) than that of SBUs.

Another soft error is the SET (Single-Event Transient), which occurs when a cosmic particle strikes a sensitive node within a combinational logic circuit. A voltage disturbance is produced at that node, which may propagate through the logic.

SEL is a condition that causes loss of device functionality due to a single-event-induced high-current state. These errors are hard errors and can cause permanent device damage. SEL results in a high operating current, above device specification. If power is not removed quickly, catastrophic failure may occur due to excessive heating or to metallization or bond wire failure.

SEB is a condition that can permanently destroy a device due to a high-current state in a power transistor. SEBs include burnout of power MOSFETs (Metal Oxide Semiconductor Field Effect Transistors), gate rupture, frozen bits, and noise in CCDs (Charge-Coupled Devices).

### Soft Errors - Single-Event Upsets (SEU)

SEUs are soft errors, i.e., transient faults or bit flips, caused by an energetic particle. They are temporary and non-recurring, since a reset of the device restores normal behavior. In other words, observing a soft error does not imply that the system is any less reliable than before. SEUs are induced predominantly by external radiation; intrinsic noise and interference can also cause SEUs, but design engineers can accommodate these. The three main sources of soft errors are alpha particles, cosmic rays and thermal neutrons. Thermal neutrons are primarily an SEU issue only if BPSG (borophosphosilicate glass) dielectric layers are present; eliminating the use of the B-10 isotope effectively addresses the problem.

### Soft Error Rate (SER)

The rate at which SEUs occur is given as the SER and is measured in FITs (Failures in Time), where one FIT is one failure in one billion device-operation hours. A measurement of 1,000 FITs corresponds to an MTTF (Mean Time To Failure) of approximately 114 years<sup>2</sup>. The potential impact on typical memory applications illustrates the importance of considering soft errors. A cell phone with one 4 Mbit, low-power memory with an SER of 1,000 FITs per megabit will likely have a soft error every 28 years. But a high-end router with 10 Gbits of SRAM and an SER of 600 FITs per megabit can experience an error every 170 hours. For a router farm that uses 100 Gbits of memory, a networking error that interrupts proper operation could occur every 17 hours. Finally, consider a person on an airplane over the Atlantic at 35,000 feet working on a laptop with 256 Mbytes (2 Gbits) of memory. At this altitude, the SER of 600 FITs per megabit becomes 100,000 FITs per megabit, resulting in a potential error every five hours. The FIT rate of soft errors is more than 10 times the typical FIT rate for a hard reliability failure. Soft errors are therefore less of a concern for cell phones than for systems that use a large amount of memory.
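The figures in this section follow directly from the definition of a FIT. The Python sketch below reproduces them as a sanity check. The function names are illustrative, and two simplifications are assumed that the text does not state explicitly: 1 Gbit is taken as 1,000 Mbit, and every megabit contributes the same FIT rate.

```python
HOURS_PER_BILLION = 1e9      # one FIT = one failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365


def system_fit(memory_mbit, fit_per_mbit):
    """Total FIT rate of a memory, assuming a uniform per-megabit SER."""
    return memory_mbit * fit_per_mbit


def mttf_hours(total_fit):
    """Mean time to failure in hours for a given total FIT rate."""
    return HOURS_PER_BILLION / total_fit


# (label, memory size in Mbit, SER in FIT per Mbit)
examples = [
    ("single 1,000-FIT device", 1, 1_000),
    ("cell phone, 4 Mbit", 4, 1_000),
    ("router, 10 Gbit", 10_000, 600),
    ("router farm, 100 Gbit", 100_000, 600),
    ("laptop at 35,000 ft, 2 Gbit", 2_000, 100_000),
]

for label, mbit, fit in examples:
    hours = mttf_hours(system_fit(mbit, fit))
    print(f"{label}: {hours:,.1f} h ({hours / HOURS_PER_YEAR:,.2f} years)")
```

Under these assumptions the computed intervals agree with the rounded values quoted above: roughly 114 years, 28 years, 170 hours, 17 hours and 5 hours.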

### Soft Errors from Alpha Particles

Alpha particle-induced soft errors are transient errors in the operation of dynamic random access memory (DRAM) devices caused by alpha particles emitted by traces of radioactive elements, such as uranium and thorium, present in the packaging material of the device, for example ceramic packages and lead-based connectors. These alpha particles penetrate the die and generate a high density of holes and electrons in its substrate, as displayed in Figure 1, which creates an imbalance in the device's electrical potential distribution and corrupts stored data. The alpha particles emitted by the device package can have energies of 2 to 9 MeV (mega-electronvolts). It takes about 3.6 eV to generate an electron-hole pair in the substrate, so an alpha particle can generate approximately one million electron-hole pairs within 2 to 3 microns of its track.

The potential well of a memory cell that contains a '0' is filled with electrons (inversion mode), while that of a memory cell that contains a '1' is devoid of electrons (depletion mode). When an alpha particle hits the substrate and generates holes and electrons, the holes are pulled toward the substrate supply while the electrons are pulled toward the potential well. An empty well can fill up with enough electrons to have its stored information reversed from '1' to '0'. Cells whose wells are already filled with electrons are not affected by alpha particles.

The amount of charge needed to corrupt stored information and cause a soft error is referred to as the critical charge, or Qcrit. Qcrit becomes smaller as device sizes and operating voltages are reduced, making soft errors a bigger problem for smaller devices. Qcrit is also a function of the stored charge in the memory cell. Alpha particles normally cause SBUs because they have lower energies, but they can cause MBUs in devices with low supply voltage.

Soft error rates due to alpha particles may be minimized by: 1) reducing the number of alpha particles emitted by the package; 2) coating the chip surface with a film, such as polyimide resin, that blocks alpha particle irradiation; and 3) designing the memory device to be less sensitive to alpha-induced soft errors.
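The electron-hole pair count quoted in this section can be checked with a short calculation. The Python sketch below is a rough estimate only. It assumes the particle deposits all of its energy in silicon, and it uses the common first-order approximation Qcrit ≈ C × V for the critical charge, which ignores feedback current in SRAM cells and latches. The 2 fF node capacitance and 1.0 V supply are hypothetical values chosen for illustration, not figures from this paper.

```python
ELECTRON_CHARGE_C = 1.602e-19   # elementary charge in coulombs
EV_PER_PAIR = 3.6               # energy per electron-hole pair in silicon (from the text)


def pairs_generated(energy_mev):
    """Electron-hole pairs created if all the particle energy is deposited in silicon."""
    return energy_mev * 1e6 / EV_PER_PAIR


def generated_charge_fc(energy_mev):
    """Generated charge of one polarity, in femtocoulombs."""
    return pairs_generated(energy_mev) * ELECTRON_CHARGE_C * 1e15


def critical_charge_fc(node_cap_ff, vdd):
    """First-order estimate Qcrit ~ C * V (fF * V = fC); an assumption, not a device model."""
    return node_cap_ff * vdd


for energy in (2.0, 5.0, 9.0):  # the 2 to 9 MeV range given in the text
    print(f"{energy} MeV: {pairs_generated(energy):,.0f} pairs, "
          f"{generated_charge_fc(energy):.1f} fC generated")

# Hypothetical storage node: 2 fF at 1.0 V
print(f"Qcrit estimate: {critical_charge_fc(2.0, 1.0):.1f} fC")
```

Only a fraction of the generated charge is actually collected at a given node, depending on the geometry of the hit, so the comparison indicates an order of magnitude rather than a prediction of upset.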

### Soft Errors from Cosmic Rays

Heavy ions in cosmic rays cause direct-ionization SEEs: if an ion traversing a device deposits sufficient charge, an event such as a memory bit flip or transient may occur. Cosmic rays may be galactic or solar in origin. Protons, usually trapped in the earth's radiation belts or originating from solar flares, may cause direct-ionization SEEs in very sensitive devices. More typically, however, a proton causes a nuclear reaction near a sensitive device area and thus creates an indirect ionization effect that can cause an SEE.

High-energy neutrons have energies of 10 to 800 MeV; in contrast, protons have energies greater than 30 MeV. High-energy neutrons have no charge and therefore do not interact coulombically with the semiconductor material, so their interaction with silicon differs from that of an alpha particle. A high-energy neutron produces ionized particles through collisions and impact ionization with silicon nuclei. Such a collision can generate alpha particles and other heavier ions, thus producing electron-hole pairs, but with higher energies than a typical alpha particle from mold components. The schematics in Figure 3 show how galactic cosmic rays deposit energy in an electronic device.

Shielding is ineffective against galactic cosmic rays because of their high energies. Neutrons are particularly troublesome, since they can penetrate most man-made construction. A neutron, for instance, can pass through five feet of concrete. The flux rate depends on geographic position and increases at higher altitudes because the atmosphere provides less shielding. For example, the effect in London is 1.2 times worse than at the equator. In Denver, with its high altitude, the effect is three times worse than at sea level in San Francisco. In an airplane, the effect can be 100 to 800 times worse than on the ground.



Figure 3: The effect of cosmic rays on electronic device

### The Effects of scaling on SEU

The SER problem first gained widespread attention as a memory-data issue in the late 1970s, when DRAMs began to show signs of random failures. As process technologies continue to shrink, the critical charge required to cause an upset is decreasing faster than the charge-collection area in the memory cell. Therefore, at smaller geometries such as 65 nm and below, soft errors are more of a concern, and designers must take steps to control SER levels. The effects of scaling on SEUs are illustrated in Figure 4. Shrinking dimensions, increasing operating frequency and reduced critical charge for upset all increase the SEU rate as technology scales.





### SER in DRAM

Historically, DRAM devices had poor SER because of their small stored charge relative to their large collection cross-section for funneling charge created by alpha particles or cosmic rays. In current technology, however, the SER of DRAM is lower than that of SRAM, i.e., DRAMs are much more immune to soft errors than SRAMs. For example, the SER of 1T DRAM is more than 10 times better than that of 6T SRAM. This continuous reduction is attributed to the shrinking junction volumes (lowering the collected charge), the relatively high node capacitance (achieved with an external three-dimensional cell capacitor), and the relatively gradual voltage scaling. DRAM devices have generally improved in SER with each new process generation because the collection cross-section has shrunk faster than the critical charge.

There is some concern that as DRAM density increases further and the components on DRAM chips get smaller, while operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI (Silicon on Insulator) may make individual cells less susceptible and so counteract, or even reverse, this trend. DRAM failure rates at the system level, however, have remained unchanged because system memory size has increased as fast as the single-bit SER has fallen. Today's DRAM devices typically have an SER on the order of a few hundred to a few thousand failures in a billion device hours (FITs) when operated at full speed.

### SER in SRAM

Six-transistor SRAM (6T SRAM) devices traditionally had superior SER immunity due to high signal levels from high operating voltages and a more stable cell made up of two large cross-coupled inverters, each strongly driving the other to keep the bit in its programmed state. However, SRAM devices tend to have worsening SER with each new process generation, because the critical charge required to cause an error shrinks faster than the collection cross-section, with a degradation factor of 5 to 10 times per process generation. The explanation for this trend is that the reductions in cell collection efficiency from the shrinking cell depletion volume in each successive SRAM generation have been swamped by large reductions in operating voltage and node capacitance. Thus SRAM single-bit SER increased with each successive generation, particularly in products using BPSG. Most recently, as feature sizes have been reduced into the deep nanometer regime, the SRAM single-bit SER has been saturating, owing to saturation in voltage scaling (further reduction in operating voltage is limited by transistor threshold voltages), reductions in junction collection efficiency, and increased charge sharing with neighboring nodes due to short-channel effects. The exponential growth in the amount of embedded SRAM in electronics has nevertheless led the SRAM system SER to increase with each generation. Flash memory is much more immune to soft errors than SRAMs and DRAMs.

### SER in Logic Components

In general, soft errors within logic circuits are viewed as less of a threat to circuit operation. Since sequential logic elements are less densely packed, they are statistically less likely to be affected by particle collisions than larger memory areas. Thus, SER work has traditionally focused on random access memory such as SRAM and DRAM, but recent literature investigates the effects of soft errors on logic components such as a processor core, which are becoming increasingly important. The core of a modern electronic system consists of a microprocessor or digital signal processor with a large embedded memory (usually SRAM) interconnected by sequential logic. Such systems usually also incorporate a large external memory (typically DRAM). The logic elements include latches and flip-flops used to hold system event signals and to buffer data before it goes in or out of the chip. Combinational elements that perform logical operations based on multiple inputs can also contribute to the chip SER (if the radiation-induced transient is latched in a flip-flop or latch), but they were not considered seriously. Flip-flops and latches are similar to SRAM cells in that they use cross-coupled inverters; however, they have historically been much more robust against soft errors because they are constructed with more and larger transistors, which can more easily compensate for spurious charge collected during radiation events.

The SER contribution of combinational logic in state-of-the-art processes is still considerably smaller than the contributions of unprotected SRAMs and of sequential elements such as latches and flip-flops. For core logic, asynchronous soft errors are much more common than synchronous ones. The impact of operating frequency on the chip-level SER is therefore negligible. Further, tolerating faults in core logic is significantly costly, since it requires more logic and redundancy and makes the logic more complex, whereas ECC (Error Correction Coding) is commonly used to tolerate soft errors in memory. Detecting soft errors in a microprocessor and protecting it against them is difficult; available solutions often incur significant penalties in area and performance and are still not always 100 percent effective. Even when a solution delivers the anticipated error detection facilities, error correction remains hugely complex. For example, mitigation of soft errors in logic involves the use of multiple identical logic paths feeding into a majority voting circuit. This method uses three times the chip area and reduces the maximum operating frequency, since extra gate delays are introduced. More importantly, because this type of intervention is so costly, it requires specialized simulation tools and characterization methodologies to identify logic sensitivity and the critical logic paths that dominate the product failure rate, so that correction is added only to these key components. The proportion of memory on an SoC die crossed the 50% level in 2002 and rises to 90% of the SoC die area in 2010.

Current research suggests that the average rate of failure for complex chips may be in excess of four errors per year, which can be translated into approximately 29,000 FITs per complex chip.

### Handling Soft Errors

The soft error rate can be minimized by judicious device design: choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit value, which reduces the range of particle energies that can upset the logic value of the node. Radiation hardening is often accomplished by increasing the size of the transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive, the technique is often applied selectively to the nodes predicted to have the highest probability of causing soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.

#### Soft errors Correction

Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use forward error correction, incorporating redundant data into each word to create an error correcting code. Alternatively, rollback error correction can be used, detecting the soft error with an error-detecting code such as parity, and rewriting correct data from another source. This technique is often used for write-through cache memories.
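To make forward error correction concrete, the Python sketch below implements the classic Hamming(7,4) code, which adds three check bits to four data bits and corrects any single flipped bit. It is an illustration of the principle only: production memories typically use wider codes, and the function names here are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (1-based positions): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error was found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[3 - 1] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome is the 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                        # simulate a soft error in one cell
recovered, where = hamming74_correct(stored)
print(recovered == word, where)
```

A single upset anywhere in the seven stored bits, including the check bits, is located by the syndrome and corrected on read. Two simultaneous upsets in the same word would be miscorrected by this code.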

Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault-tolerant design. These often include the use of redundant circuitry or redundant computation, and typically come at the cost of increased circuit area, decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel, and their outputs are fed into majority voting logic, which returns the value that occurred in at least two of the three cases. In this way, the failure of one circuit due to a soft error is masked, assuming the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so it is usually applied only selectively. Another common concept for correcting soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares successive evaluations for consistency. This approach, however, often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
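The majority voting at the heart of triple modular redundancy can be modelled in a few lines. The Python sketch below is a behavioral illustration, not a hardware description. The `upset_masks` parameter is a hypothetical way of injecting a simulated soft error by flipping output bits of individual copies, and `circuit` is an arbitrary stand-in for the replicated logic.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(func, x, upset_masks=(0, 0, 0)):
    """Evaluate three copies of func and vote on the result.

    Each mask is XORed into one copy's output to simulate a soft error.
    """
    outputs = [func(x) ^ mask for mask in upset_masks]
    return majority_vote(*outputs)


def circuit(x):
    """Stand-in for the protected logic: an arbitrary 8-bit function."""
    return (3 * x + 1) & 0xFF


golden = circuit(5)
print(tmr(circuit, 5) == golden)                     # no upset
print(tmr(circuit, 5, (0b0100, 0, 0)) == golden)     # one copy upset: masked
print(tmr(circuit, 5, (0b0100, 0b0100, 0)) == golden)  # two copies upset on the same bit
```

The vote masks any error confined to a single copy. If two copies are upset on the same bit, the voter outvotes the correct copy, which is why TMR relies on upsets being rare and localized.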

Traditionally, DRAM has received the most attention in the quest to reduce or work around soft errors, because DRAM has made up the majority share of susceptible device surface area in desktop and server computer systems. Hard figures for DRAM susceptibility are hard to come by and vary considerably across designs, fabrication processes, and manufacturers. With 1980s technology, 256-kilobit DRAMs could have clusters of five or six bits flip from a single alpha particle. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.

The design of error detection and correction circuits is helped by the fact that soft errors usually are localized to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. So, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words, rather than a multi-bit upset in a single correction word. So, an error correcting code needs only to cope with a single bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when multiple bits in a single correction word are in error. When it comes to the microprocessors world the situation is more critical since die size is a crucial factor. The detection and protection of areas of a microprocessor from the effects of soft errors is not trivial; available solutions often incur significant penalties in area and performance and are still not always 100 percent effective in resolving soft errors. Even when a solution does deliver the anticipated error detection facilities, error correction remains enormously complex. There are a variety of approaches that offer a route to resolving microprocessor logic errors. Implementing a dual processor configuration provides a route to detecting soft errors in the core logic; a functional error in one processor results in each processor having different outputs. However this approach requires at least twice as much logic as a single processor solution and the additional logic on the critical path creates a performance penalty of between 10-20 percent. In addition the chip's area is larger which causes immediate effect on the profitability. 
Other solutions all have a significant impact on performance, as they all require additional logic. These include time redundancy at the end of each stage of the processor pipeline, the REESE approach (Redundant Execution Using Spare Elements), reverse instruction generation and comparison, and two-rail coding. Code-checking schemes for verifying logic circuits carry a relatively minor performance overhead and do not require major design changes, but designing the logic that generates the check code presents a significant challenge. The schemes include parity, weight-based codes and modulo weight-based codes, all of which operate by generating a code from the input to the logic circuit and deliver detection rates of 95 to 99 percent. As the industry moves toward smaller geometries, it will undoubtedly face major difficulties in this arena.
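
As a minimal sketch of the parity variant of code checking, the example below predicts the output parity of a logic block from its inputs alone and compares it with the parity of the actual output. The bitwise XOR unit is an assumption chosen because its parity is trivial to predict (the parity of `a ^ b` equals the parity of `a` XORed with the parity of `b`); the function names and the `fault_mask` parameter, which models transient flips on output bits, are hypothetical.

```python
def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def xor_unit(a: int, b: int, fault_mask: int = 0) -> int:
    """Bitwise XOR datapath; fault_mask flips output bits to model a soft error."""
    return (a ^ b) ^ fault_mask


def checked_xor(a: int, b: int, fault_mask: int = 0):
    """Run the datapath and a parity checker side by side.

    The check code is generated from the inputs only, so it is not
    disturbed by a fault that strikes the datapath output.
    """
    result = xor_unit(a, b, fault_mask)
    predicted = parity(a) ^ parity(b)
    error_detected = parity(result) != predicted
    return result, error_detected
```

With these definitions, `checked_xor(0b1011, 0b0110, fault_mask=0b0100)` flags an error, while a double-bit fault such as `fault_mask=0b0110` leaves the output parity unchanged and goes unnoticed. This is one reason a single parity bit cannot reach a 100 percent detection rate.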

### Conclusion

As the semiconductor industry steps further into the deep nanometer range, soft errors are becoming a significant issue. The industry agrees that the soft error rate (SER) is set to rise with ever smaller process geometries. Possible solutions need to be acceptable to manufacturers and foundries in terms of their cost impact on both area and performance. The most effective approach to soft-error detection in processors and other combinational logic is to manage soft errors at the manufacturing, design and software stages. Most importantly, addressing error detection at the design stage gives the system designer the opportunity to weigh the implications of dedicating more resources to error handling against the corresponding impact on performance. Many chip design houses have successfully implemented error detection mechanisms for memories, but significant research is still needed to analyze and correct soft errors in future nanometer products.

### References

[1] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," in *Proc. ACM International Conference on Dependable Systems and Networks*, June 2002, pp. 389–398.

[2] P. Hazucha and C. Svensson, "Impact of CMOS technology scaling on the atmospheric neutron soft error rate," *IEEE Transactions on Nuclear Science*, vol. 47, no. 6, pp. 2586–2594, Dec. 2000.

[3] S. Hareland, J. Maiz, M. Alavi, K. Mistry, S. Walsta, and C. Dai, "Impact of CMOS process scaling and SOI on soft error rates of logical processes," in *Symposium on VLSI Technology Digest of Technical Papers*. IEEE, 2001, pp. 73–74.

[4] M. Nicolaidis, "Time redundancy based soft-error tolerance to rescue nanometer technologies," in *Proc. International VLSI Test Symposium*, 1999.

[5] L. Anghel and M. Nicolaidis, "Cost reduction and evaluation of a temporary faults detecting technique," in *Proc. Design Automation and Test Europe*, 2000.

[6] J. Lo, "A novel area-time efficient static CMOS totally self-checking comparator," *IEEE Journal of Solid-State Circuits*, vol. 28, pp. 165–168, Feb. 1993.

[7] C. Metra, M. Favalli, and B. Ricco, "Self-checking detection and diagnosis of transient, delay, and crosstalk faults affecting bus lines," *IEEE Transactions on Computers*, vol. 49, pp. 560–574, June 2000.

[8] K. Mohanram and N. A. Touba, "Partial error masking to reduce soft error failure rate in logic circuits," in *Proc. International Symposium on Defect and Fault Tolerance in VLSI Systems*, 2003, pp. 433–440.

[9] H. Cha and J. Patel, "Latch design for transient pulse tolerance," in *Proc. ACM International Conf. Computer Design (ICCD)*, Oct. 1994, pp. 385–388.

[10] K. Hass, J. Gambles, B. Walker, and M. Zampaglione, "Mitigating single event upsets from combinational logic," in *Proc. 7th NASA Symposium on VLSI Design*. NASA, 1998.

[11] M. Baze and S. Buchner, "Attenuation of single event induced pulses in CMOS combinational logic," *IEEE Transactions on Nuclear Science*, vol. 44, pp. 2217–2223, Dec. 1997.

[12] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," in *International Symposium on Microarchitecture*, Dec. 2003.

[13] S. Krishnamohan and N. Mahapatra, "An efficient error masking technique for improving the soft-error robustness of static CMOS circuits," in *Proc. IEEE International System on Chip Conference*, Sept. 2004.

[14] T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla, and S. Borkar, "Selective node engineering for chip-level soft error rate improvement," in *Symposium on VLSI Circuits Digest of Technical Papers*, June 2002, pp. 204–205.

[15] K. Bernstein, "High speed CMOS logic responses to radiation-induced upsets," in *The Designing Robust Circuits and Systems with Unreliable Components Workshop*, 2002.

[16] J. Rabaey, A. Chandrakasan, and B. Nikolic, *Digital integrated circuits*, 1st ed. Prentice Hall, 1996.

[17] J. Grad and J. E. Stine, "A standard cell library for student projects," in *International Conference on Microelectronic Systems Education*, 2003, pp. 98–99.

[18] K. Hass and J. Gambles, "Single event transients in deep submicron CMOS," in *Proc. Midwest Symposium on Circuits and Systems*, 1999.

[19] P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," *IEEE Trans. on Nuclear Science*, vol. 50, pp. 583-602, 2003.

[20] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," *ICDSN*, pp. 389-98, 2002.

[21] M. Nicolaidis and Y. Zorian, "On-line testing for VLSI – a compendium of approaches," *JETTA*, vol. 12, pp. 7-20, 1998.

[22] M. Nicolaidis, "Time redundancy based soft-error tolerance to rescue nanometer technologies," *VTS*, pp. 86-94, 1999.

[23] K. Mohanram and N. A. Touba, "Cost-effective approach for reducing soft error failure rate in logic circuits," *ITC*, pp. 893-901, 2003.

[24] M. Oman, G. Papasso, D. Rossi, C. Metra, "A model for transient fault propagation in combinatorial logic," *IOLTS*, pp. 111-15, 2003.

[25] Y. Cao, T. Sato, M. Orshansky, D. Sylvester, C. Hu, "New paradigm of predictive MOSFET and interconnect modeling for early circuit simulation," *CICC*, pp. 201-204, 2000.

[26] C. Zhao, X. Bai, S. Dey, "A scalable soft spot analysis methodology for compound noise effects in nano-meter circuits," *DAC*, pp. 894-899, 2004.

[27] F. N. Najm and I. N. Hajj, "The complexity of fault detection in MOS VLSI circuits," *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 9, pp. 995-1001, 1990.

[28] Y. S. Dhillon, A. U. Diril, A. Chatterjee, A. D. Singh, "Sizing CMOS circuits for increased transient error tolerance," *IOLTS*, pp. 11-16, 2004.

[29] Y. S. Dhillon, A. U. Diril, A. Chatterjee, H.H. S. Lee, "Algorithm for achieving minimum energy consumption in cmos circuits using multiple supply and threshold voltages at the module level," *ICCAD*, pp. 693-700, 2003.