Clean data is essential to good results in AI and machine learning, but data can become biased and less accurate at multiple stages in its lifetime, from the moment it is generated all the way through to when it is processed, and it can happen in ways that are not always obvious and are often difficult to discern.

Blatant data corruption produces erroneous results that are relatively easy to identify. Biased data, in contrast, can cause more subtle changes. Most people are aware of the potential for bias in language used to create algorithms in AI/ML/DL, but bias can happen in ways that do not involve humans in the loop. In fact, algorithms used for inferencing or training may be working perfectly well and not recognize bias because those aberrations are so small that they escape detection.

But bias also can be cumulative, and in some cases exponential. As such, it can cause issues much further down the line, making it difficult to trace back to where the problem originated.
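To make the difference between cumulative and exponential bias concrete, the minimal sketch below compares a fixed offset added at every processing stage with a small gain error applied at every stage. The function names, the 0.5% per-stage error, and the 20 stages are illustrative assumptions, not figures from any company quoted in this article.

```python
def additive_bias(per_stage_offset, stages):
    """Total bias when every stage adds the same fixed offset (cumulative)."""
    return per_stage_offset * stages


def multiplicative_bias(per_stage_gain_error, stages):
    """Total relative bias when every stage scales the value by (1 + error)."""
    return (1.0 + per_stage_gain_error) ** stages - 1.0


if __name__ == "__main__":
    error = 0.005   # hypothetical 0.5% error introduced at each stage
    stages = 20     # hypothetical number of processing stages
    print("additive total:      ", additive_bias(error, stages))
    print("multiplicative total:", multiplicative_bias(error, stages))
```

The compounding case always ends up larger than the additive one for a positive error, and the gap widens as stages are added, which is one reason a small early bias can be hard to trace back from a late symptom.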

“Biasing can start wherever that data is generated,” said Kevin Robinson, director of customer success at yieldHUB. “Every measurement has accuracy levels and tolerances, but they shift over time. Sensors have variability, and that variability changes over time. So you have to figure out where it is in the calibration cycle and keep trying to correct for variability. That’s part of the picture. But every piece of data has some variability in it, too. So if you add in all of that data randomly, you could multiply the variability.”
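As a rough illustration of why position in the calibration cycle matters, the sketch below simulates a hypothetical sensor whose offset drifts linearly after calibration, then subtracts that drift from each reading. The linear drift model, the noise level, and all names are assumptions made for this example; real sensors may drift in far less regular ways.

```python
import random
import statistics


def simulate_readings(true_value, n, drift_per_step, noise_sd, seed=0):
    """Return n readings from a hypothetical sensor with linear drift plus Gaussian noise."""
    rng = random.Random(seed)
    return [true_value + drift_per_step * i + rng.gauss(0.0, noise_sd)
            for i in range(n)]


def correct_for_drift(readings, drift_per_step):
    """Subtract the assumed drift, using each reading's position since calibration."""
    return [r - drift_per_step * i for i, r in enumerate(readings)]


# Hypothetical values: true signal of 100.0, 500 readings since the last calibration.
raw = simulate_readings(true_value=100.0, n=500, drift_per_step=0.01, noise_sd=0.5)
corrected = correct_for_drift(raw, drift_per_step=0.01)

print("mean error, raw:      ", statistics.fmean(raw) - 100.0)
print("mean error, corrected:", statistics.fmean(corrected) - 100.0)
```

Averaging the raw readings does not remove the bias, because drift is not random noise. Only a correction that knows where each reading sits in the calibration cycle brings the mean back toward the true value.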

This has significant implications for AI systems in cars and robotics, as well as in manufacturing. While German carmakers have told their suppliers that electronic components need to be reliable for 18 years, those requirements typically focus on the ability of parts to stand up to years of wear and tear on the road from heat, cold and vibration. Minor shifts in sensors over time, or in the interconnects, memories, or any other part of that system, can affect the data they are collecting. And decisions made based upon that data need to recognize those shifts.

That’s true regardless of the end application, but it gets much more complicated as more sources of data are added in complex systems. The data itself has to be weighted in importance, and that prioritization can create yet another set of issues. This is particularly troublesome for sparse data that has been cleaned through some sort of pre-processing. While sparse data runs faster, the impact of any type of bias on sparse data is greater.

“There is an inherent statistical bias when collecting data from the different manufacturing and testing stages,” said Noam Brousard, vice president of product at proteanTecs. “Since we’re trying to gain insights from what is often circumstantial data, a lot of effort needs to be put into data cleansing. And even then, there is an intrinsic inaccuracy. For example, you could infer the quality of a chip from a host of ancillary production and testing information such as results of adjacent chips on wafer, which machine the tests are run on, what software version it is running on and which geographical site is running the tests. This could produce impressive results, but again, there will be an inherent inaccuracy.”

It gets even more complicated than that. “There are a lot of non-essential factors that can creep in, and your machine learning will have to know how to filter them out,” Brousard said. “We took a different approach to mitigate this. By extracting vital fundamental measurements from within the chip, we can analyze information directly related to its power, performance and quality. To this end, we add Agents (IPs) into the hosting design, which pick up the cleanest information directly from the underlying electronics to feed our algorithms with distilled data. Normally, when basing an analysis on multiple data sources, there is a question of how those sources are mixed and if one of them may actually add bias. In our case, since the basis of our analysis is true data, we find that additional data sources just enhance our accuracy and time-to-insight.”

Still, as with most things involving AI and machine learning, all of these aberrations are difficult to track and identify if you aren’t looking for them. They can result in poor performance, unexpected or erroneous results, or in some cases total chip failures that may take months or years to show up.

“The phrase we use is silent data corruption,” said Dennis Ciplickas, vice president of characterization solutions at PDF Solutions. “You thought you had a working chip and it produced a wrong result. That’s evil. When that happens and you don’t know that you’re producing bad data, you’ve got huge egg on your face.”

Challenges in data collection

The proliferation of sensors everywhere means there is much more data being generated than in the past. The challenge now is to make better use of that data, which is the raw material for the explosion in AI.

This cuts across many industries and vertical markets, and the potential for problems is as varied as the applications of AI itself. In semiconductor manufacturing, data is being utilized to improve a variety of processes. Consider phased profilometry, for example, which profiles the surface of a die or package by illuminating it with a split beam of light, so that the surface being examined can be compared to a reference mirror. This technique is critical in finding anomalies that can affect yield and reliability, and it produces a huge amount of data that needs to be correlated.

“This only works with competent, useful data,” said John Hoffman, computer vision engineering manager at CyberOptics. “It’s the algorithm team’s job to look at the image and understand when there is image corruption. The challenge is that not every surface is diffuse. Some are shiny. The specular region sometimes can break the underlying physics assumptions when you analyze the data.”

In this case, data is being streamed from multiple cameras. In other manufacturing steps inside a chip fab, those sensors may collect everything from temperature readings to vibration and noise. But not all of this data is consistent, and detecting shifts in the data over time is difficult as it is combined and processed at multiple stages, adding levels of data complexity never seen before in manufacturing.

“People have access to more data than in the past, and they can aggregate it from more sources,” said Doug Elder, vice president and general manager of OptimalPlus. “You get a bunch of data from testers and burn-in that was never brought together before and you can tweak that upstream. Testers are a big data source, and so is every step in manufacturing. You can run analytics against that data.”

Elder said that recently there has been a big improvement in the quality of that data. The key is collecting data at precisely the right place, and that can vary by manufacturing process, by factory, and even over time. It involves understanding how and where that data is going to be used, and applying domain expertise to pull out what’s relevant and check that bias hasn’t crept into the process.

“The challenge is in constantly monitoring how data gets pulled off of these processes,” Elder said. “It’s not just about the temperature of the solder. It’s about all of the parametrics.”

Recognizing data shifts

Data partitioning, whether that happens on-device, in an on-premises data center, or in the cloud, makes it harder to track subtle shifts in data. The further away from the source, the more work it takes to reconstruct the picture of where data may have become biased.

One solution is to set up a feedback loop for that data, so that when changes are detected they are immediately fed back into the manufacturing process using a closed loop system. This sounds straightforward, but it’s harder than it looks because the manufacturing processes and the equipment are in a constant state of change.
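The detection half of such a loop can be sketched with a cumulative sum (CUSUM) check, a common statistical way to flag a small, sustained shift. The article does not say which detection method any vendor uses, so the function name, target, slack, and threshold below are all hypothetical.

```python
def cusum_drift_index(values, target, slack, threshold):
    """Return the index where a two-sided CUSUM first exceeds threshold, else None.

    target:    the expected reading when the process is in control
    slack:     deviation from target that is tolerated without accumulating
    threshold: accumulated deviation that counts as a detected shift
    """
    high = low = 0.0
    for i, x in enumerate(values):
        high = max(0.0, high + (x - target - slack))  # accumulates upward shifts
        low = max(0.0, low + (target - x - slack))    # accumulates downward shifts
        if high > threshold or low > threshold:
            return i
    return None


# Hypothetical stream: 50 in-control readings, then a small sustained upward shift.
stream = [10.0] * 50 + [10.3] * 50
print("shift flagged at index:",
      cusum_drift_index(stream, target=10.0, slack=0.1, threshold=0.9))
```

In a closed-loop setup, the flagged index would trigger a correction upstream. Because the process and equipment keep changing, the target and slack would themselves need to be revisited over time.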

“The data is only as good as the camera or any other sensor you are using,” said Elder. “With something like 5G technology, some stuff has to be pushed to the system level rather than being detected earlier in the process. But if you are collecting data later on, you can draw conclusions and drive that back into the process. The technology being manufactured already has gone through the manufacturing process and you can see how it performs in the system. Then, if you have a returned antenna or module, you can draw conclusions about why it failed and you can figure out what you can do to move that data collection further upstream.”

This is the leading edge of data analytics today. The challenge is being able to fuse together different kinds of data and make sense of all of it.

“In more complex systems, you need a vision of which is the key data for a product line and learn the different relationships between them,” said yieldHUB’s Robinson. “You need to correlate changes against each other to identify changes in bias. That’s where machine learning opens the door to a different way of looking at large amounts of data. It’s a really good application of machine learning. We’ve seen customers making sensors for physics experiments where they’re trying to get noise out of sensors, which is the starting point. Noise and bias in the sensor can negate accuracy. But what they are finding is that with better understanding of this, they may not need to use an entire sensor array. Maybe there is an area with less noise. Once you have got all of the data from that array in a database, you can figure out which part of the sensor array actually worked best. And if you can reprogram that device, you can use different parts of that sensor array for different things.”
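Once readings from every element of an array are stored together, ranking the elements by noise takes only a simple query. The sketch below is one way to do that, using the population standard deviation as a stand-in for noise; the element names and readings are made up for illustration.

```python
import statistics


def quietest_elements(readings_by_element, keep):
    """Return the names of the `keep` elements with the lowest reading spread."""
    noise = {name: statistics.pstdev(values)
             for name, values in readings_by_element.items()}
    return sorted(noise, key=noise.get)[:keep]


# Hypothetical readings of the same stable signal from three array elements.
array_data = {
    "element_a": [1.0, 1.0, 1.0, 1.0],
    "element_b": [1.0, 2.0, 1.0, 2.0],
    "element_c": [0.0, 4.0, 0.0, 4.0],
}
print(quietest_elements(array_data, keep=2))
```

Standard deviation only captures spread, not a constant offset, so a ranking like this would normally be paired with a comparison against a known reference.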

Along with this, device manufacturers and OEMs need to be constantly monitoring data to make sure that devices are working within acceptable parameters. Apple’s decision to limit the performance of its older model iPhones to prevent unexpected shutdowns is an example of how to utilize data to maintain functionality.

But lithium-ion batteries follow a well-documented degradation curve. Aging in sensors is less studied, in part because the sensors are newer and in part because there are not sufficient volumes of many of these sensors to do broad-based market studies. And sensors may produce different results at the time they are first manufactured versus several years later, and under different environmental conditions.

“We’re just starting to get enough historical data to review the behavior of sensors,” said Will Stone, director of printed electronics integrations and operations at Brewer Science. “Typically sensors need to be calibrated and then recalibrated, and that includes all sensors. The volume of data that you get needs to be correlated, and you need to determine whether that is correlating correctly. Whether that correlates the same way that it did when the sensor was manufactured is an ongoing challenge.”
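One simple way to ask whether a sensor still correlates the way it did when it was manufactured is to compare its Pearson correlation against a trusted reference at both points in time. The sketch below assumes such a reference exists and uses invented readings; it is not a description of how any company quoted here performs the check.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_shift(reference_then, sensor_then, reference_now, sensor_now):
    """Change in sensor-to-reference correlation since the earlier measurement."""
    return pearson(reference_now, sensor_now) - pearson(reference_then, sensor_then)


# Hypothetical data: the sensor tracked the reference when new, less so later.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
sensor_when_new = [2.0, 4.0, 6.0, 8.0, 10.0]
sensor_years_later = [2.0, 4.0, 6.0, 8.0, 3.0]

print("shift in correlation:",
      correlation_shift(reference, sensor_when_new, reference, sensor_years_later))
```

A clearly negative shift suggests the sensor no longer follows the reference as it once did and may be due for recalibration. How large a shift matters would depend on the application.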

Some of that can be bounded up front, even before a device is released into the market. For example, many components do not run at their maximum performance in order to account for changes over time. In effect, this is like building margin into the device’s performance over its lifetime, rather than allowing it to degrade to a lower level.

“There are certainly biases on the data set,” said Lazaar Louis, senior director and head of marketing and business development for Tensilica products at Cadence. “Our customers take into account the number of hours of use, the conditions under which they are used, and the expected years of life. So they set the maximum frequency and put bounds around it. The IP may be capable of doing more, but they need to put it into a box to meet the requirements. So they are very aware of the conditions and may choose to run it at a lower voltage. What they’re doing is designing it to operate in the best way, and we verify it at various design points. Those are the tradeoffs they are making.”

Sometimes, that includes adding in-circuit monitoring when devices are in use in the field to make sure these devices remain within projected operating parameters. And that has broad implications for how this kind of data can be used if there is enough granularity about how devices degrade over time.

“There’s the kind of mundane, data-center stats, where if you look at the total power needed to run all of those processors, you can’t possibly deliver that much power to the building,” said PDF’s Ciplickas. “So they can only run at 40%, or some number—pick your tens of percentage points. The challenge is how do you load-balance to make the data center last as long as possible with the highest reliability at the least cost? You see this with cars. Each one feels a little different. If you look at the processors that way, what’s the best one to use for any one task and how do you balance your load based on that?”

Conclusion

The explosion of data from sensors everywhere opens up enormous possibilities for improving efficiency, but it also adds significant new risks of misunderstanding the accuracy of data and therefore its value. In addition, that accuracy can change over time, and those shifts may be undetectable unless users are aware of what they are supposed to be looking for.

This kind of bias can be very subtle, with no human in the loop, and it can produce results that are skewed enough to cause problems either immediately or over long periods of time. The challenge is to know when it is being skewed and to account for those changes. Just because data is available doesn’t mean that data is good, and even if data is good from the start it doesn’t mean it will remain good.
