Semiconductor Engineering sat down to discuss reliability, resilience, machine learning and advanced packaging with Rahul Goyal, vice president in the technology and manufacturing group at Intel; Rob Aitken, R&D fellow at Arm; John Lee, vice president and general manager of the semiconductor business unit at ANSYS; and Lluis Paris, director of IP portfolio marketing at TSMC. What follows are excerpts of that conversation.



(L-R): Rahul Goyal, Rob Aitken, John Lee, Lluis Paris

SE: How do we improve reliability at 7/5/3nm, especially when some of these chips are expected to last 15 or more years in markets such as automotive and industrial?

Goyal: We have to create time capsules. Errors will happen 15 years from now, because chips designed today won’t be in the market for another three years. So we will be called on to support customers within 48 hours to find the root cause. We need to keep our models and design environments in storage for that long. The only way we know to make that work is to put the machines and software in a vault. We don’t know of any other way to replicate, 15 years from now, the environment we have today. That is what we think it will take to support that. The things we know about, we are taking care of already.

Aitken: We know that there will be failures. The other challenge we face is that we will have to be able to tolerate them. The systems themselves have to be designed to be fault-tolerant, whether that’s error correction in memory or some form of software tolerance. We also know from a security standpoint that hackers 15 years from now will be significantly more capable than hackers are now. The car I bought last year has Internet technology built in. It’s connected to everything. And 10 years from now, any kid with a cell phone will be able to hack it. I’m hoping that some firmware updates will make it more tolerant or resistant to that. The ability to update over time is critical to keep it reliable.
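Aitken’s mention of error correction in memory can be illustrated with a classic Hamming(7,4) code, which corrects any single flipped bit. This is a toy sketch with hypothetical function names; real ECC memory uses wider SECDED codes implemented in hardware.

```python
def hamming_encode(d):
    """Encode 4 data bits as a 7-bit Hamming(7,4) codeword.

    Bit positions (1-based): 1=p1, 2=p2, 3=d1, 4=p3, 5=d2, 6=d3, 7=d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_correct(c):
    """Locate and flip a single corrupted bit, then return the data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based error position; 0 = no error
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Flipping any one of the seven codeword bits produces a nonzero syndrome that points directly at the corrupted position, so the decoder recovers the original data bits.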

SE: One of the new buzzwords here is resilience. But the problem is now you need cross-layer and cross-system resilience. A lot of components are disaggregated today. Will that work?

Lee: The easiest place to start is to look at aerospace. In the aerospace industry, resiliency has to be there. If you’re going to the moon or across an ocean, the resiliency in the hardware needs to be baked in starting at the electronic system level. There has to be a holistic solution looking at failure modes. The software is another part of the stack that needs to be modeled. Mission-critical software that may be in an ADAS or airplane system typically needs to be generated in a way that is functionally safe, and then it has to be verified throughout the work flow. Those are the components. It starts at the multi-physics level and goes all the way through the whole stack, so fundamentally it’s a much more difficult task.

Aitken: Another key piece of that is standardization. If the interfaces are standard and the standard is well enough defined, it simplifies the problem a little bit. It doesn’t eliminate the problem in the system, though. The challenges are still there. But at least if there’s some kind of standardization on the interfaces, then it allows the disaggregation to work a little bit better.

Goyal: There are no shortcuts. We need to optimize all of these things together and start from the design phase to make sure that we are tackling everything. The world is getting more complex as more things are connected.

SE: There is a cost to all of this, both in terms of time and money. The problem is that this industry is set up to reduce that. Do we need different methodologies?

Paris: The costs are real. We believe that the value is real. Security will be a requirement. My kids will be driving cars and I want them to be safe. I see that as a value and I will pay for that. As long as people are willing to pay for value, then solutions will exist. What those will be we will have to see.

Lee: It’s an interesting challenge. A Tesla can update itself overnight and then you have a completely different drive system. We want that kind of drive system versus something that’s more heavily regulated, where the car that you buy is the car you have for the next 15 years. But we’re also going to have to err on the side of being more focused on process and verification, and making sure that things don’t get out there before their time. If you’re driving a car, you need to be fully aware of what the limitations are.

Aitken: Part of that is modeling what the costs are. In safety-critical microprocessor systems, you need a dual-core lockstep methodology, where the two cores run and if they differ from one another, you go back to a checkpoint and restart. We ran an experiment in triple-core lockstep to see if three cores are better than two, but it turned out that under a realistic failure mode, a triple-core lockstep wasn’t that much better than a dual-core lockstep. It’s one thing to say the costs are important. But it’s also important to model the actual problems you’re looking at and determine whether the extra cost you’re paying is actually solving the problem you’re trying to solve.
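The dual-core lockstep scheme Aitken describes can be sketched in software as: run the same step on two redundant “cores,” compare their outputs, commit and checkpoint on agreement, and roll back and retry on divergence. This is a hypothetical simulation (names and fault model are assumptions); real lockstep is a hardware mechanism with cycle-level comparison.

```python
import copy
import random

def lockstep_run(step_fn, state, n_steps, fault_rate=0.0):
    """Simulate dual-core lockstep: execute step_fn on two redundant
    cores; on divergence, roll back to the last checkpoint and retry."""
    checkpoint = copy.deepcopy(state)
    for _ in range(n_steps):
        while True:
            a = step_fn(copy.deepcopy(state))  # core A
            b = step_fn(copy.deepcopy(state))  # core B
            if random.random() < fault_rate:
                a = None  # inject a transient fault into core A's result
            if a == b:
                # Outputs agree: commit the step and take a new checkpoint
                state = a
                checkpoint = copy.deepcopy(state)
                break
            # Outputs disagree: restore the checkpoint and retry the step
            state = copy.deepcopy(checkpoint)
    return state

```

With transient faults injected, the run still produces the correct final state because every disagreement triggers a rollback rather than a silent commit.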

Goyal: And even then, as the complexity increases, it becomes an optimization problem for the integration. Things that can come together simultaneously can reduce the cost collectively. But that includes not just the things we can do today. It has to go across the domains we are comfortable using today, and those we are not comfortable with, in order for the costs to come down.

SE: There certainly is a lot of debate about just how far along ADAS is. But on our side, we need to tighten up signoff and verification, improve test, and make it all more reliable. Will machine learning help here? And if so, how?

Lee: Machine learning certainly will help, and in some circumstances, it has already helped. If you look at reliability, where human experts have had to weigh certain cases for electromigration, that was an obvious area where machine learning would apply. If you look at how to model non-Gaussian distributions at ultra-low voltage, we also applied machine learning there. Two years ago it wasn’t very clear to us where we could apply machine learning. Since then, its application has accelerated.

Aitken: There’s a key to machine learning in terms of the data set you use to train it and run inference on. If you’re looking at a non-Gaussian distribution and you have samples that are well enough distributed across it, then you can use machine learning to fit a fancy curve to that data. But if you have an inferior data set with poorly distributed points, then all of the machine learning in the world won’t help. You won’t be able to extrapolate outside of the data set and find problems that might be there. It’s really important to seed machine learning with the right data. Then you can achieve the full benefit.
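Aitken’s point about data coverage can be demonstrated with a minimal sketch (assumed, illustrative setup): a fitted model that interpolates well inside the sampled range can be badly wrong when asked to extrapolate into a tail region the training data never covered.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Ground truth" behavior, e.g. an exponentially decaying tail
f = lambda x: np.exp(-x)

# Training samples only cover x in [0, 2]; the rare-event region of
# interest sits well outside the data, at x = 4
x_train = rng.uniform(0.0, 2.0, 50)
y_train = f(x_train)

# Fit a cubic polynomial -- a perfectly good interpolator on [0, 2]
model = np.poly1d(np.polyfit(x_train, y_train, 3))

inside_err = abs(model(1.0) - f(1.0))   # small: the data covers this point
outside_err = abs(model(4.0) - f(4.0))  # large: pure extrapolation
```

Inside the sampled interval the fit tracks the true curve closely, but at x = 4 the polynomial diverges from the true tail, which is exactly the failure mode when verification models are trained on data that misses the operating corner you care about.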

Lee: You need to target areas that have extremely high value, because applying machine learning is an extremely difficult problem. But always be aware that the machine learning model can drop by the side of the road. It’s only as good as the data you trained it with. It’s not like an explicit set of algorithmic code you can look at. In some ways, it’s a black box. The behavior can be extremely unpredictable at the wrong time.

Aitken: For safety-critical systems in the future, it will be really important to have more explainable machine learning so that if a car veers off the road, the machine learning can explain why it came up with that choice.

SE: We’re heading toward more heterogeneous computing, and we’re seeing much more interest in advanced packaging. How will that affect reliability?

Paris: The reason we are looking at 3D integration is that, from 30,000 feet, Moore’s Law is slowing down. In the past, it was always more practical to integrate. By the time you were ready, you would have a solution that was cheap enough at the next node. That’s not the case anymore. The cost is not linear. We tested MCM years ago and by the time it was ready, it was too expensive. That’s not the case anymore, either. There are some things that don’t make sense to do at 5nm. For a foundry, that’s a reality. If you try to fight physics, you lose every time. We need to support advanced packaging. If customers need that system solution, we have to do it. It’s complex for the supply chain, and to make it work you need good partners.

Goyal: Like everything else, it’s a progression. In some cases you’re optimizing analog, in some cases you’re optimizing everything together. Now you have to optimize the chip and the package. You have to have both of these work, and just one is not going to be enough for systems at the security, performance and reliability levels that we are all trying to achieve. The problems are getting more complex, whether it’s about advances between chip and package, or chip, package and board, or the interconnects or components or memory. And the supply chain runs through all of that.

Lee: Customers are saying that going to 2.5D or 3D introduces a lot more flexibility in their design schedule. If you’re doing everything on one piece of silicon, it’s extremely hard to be flexible. But if you need to do system-level optimization using something like InFO, you have a lot more options. The challenge on the EDA or software side is that the techniques we use to process 10 billion transistors are very different than the techniques you want to use to model the InFO package layer. The physics are the same but the techniques are different, so you can’t take an IC extractor that works on-chip and ask it to extract the package layers. You have to be very careful about which algorithms to use, and ideally you’re not going to have to create something that models all the dies in a package at once.
