Breaking down the unexpected challenges that come with standardizing cross-organizational data at scale

Manufacturing: A Data-Driven Industry

Industry 4.0 for manufacturers is about combining data from many different sources in the factory with new tools like Machine Learning and Artificial Intelligence to create new classes of business value. It is about finding the hidden links between complex systems that lead to huge ROIs by increasing throughput, improving yield, or reducing downtime.

Electronics manufacturers generate incredible amounts of data, some of which has been collected and used in their operations for a long time. With Industry 4.0, there is an increasingly validated and evidence-based belief that combining data from existing silos with new sources and modern machine learning tools will yield massive ROIs. Yet, despite their efforts, many organizations have discovered that there are unexpected practical challenges and pitfalls to making this process work at scale. Primarily, this is because the process is often exploratory, and you don’t know if any specific effort will pan out. You just know that hidden somewhere in the data about your factory are industry-moving insights, and you want to find them before your competitors do. That leads to the question: how quickly can you move?

You know that, at the very least, you need to:

  1. Collect data about every part of your operation
  2. Apply modern AI/ML tools to find insights
  3. Put the insights into production to achieve ROI
  4. Repeat for the next ROI … and the next.

This sounds like a clear process but it’s missing a couple of stages between steps one and two:

  • Combine parts of that data together in one place
  • Clean and structure the relevant data so that it can be fed into data science tools.

In many cases, the limiting factor for how quickly you can solve problems with data is how quickly you can put together a clean, unified dataset, making these two steps the key bottleneck to tackle. Once you have clean data, it’s straightforward to get value from it.

This is where the hidden cost of data silos comes from.

The first thing we often hear from new partners is “we have the data, what we need is data science to solve our problem”. When we dig in, though, we tend to find that while they have indeed collected a lot of data, it’s stored in many different systems for different purposes. These are data silos. It sounds like it should be easy to just “put all the data together” and then do the data science, but it’s not. It’s a classic case of a problem that’s easy to solve when it’s small but increasingly tricky as it gets larger.

The Science of Standardization

To understand why, let’s look at a familiar analogy that SMT experts will recognize: the pick-and-place (PnP) machine. This workhorse of the SMT industry is a modern marvel, capable of placing tens of thousands of components per hour and assembling circuit boards in seconds. In almost all ways, it far surpasses the old way of hand soldering circuit boards. There is, however, one challenge that a modern PnP struggles with: one-off designs with loose components. Say you have a PnP machine that can place 50,000 components per hour and a customer wants you to make a single board with 10 resistors on it. They give you a printout diagram and a little ESD baggie of the resistors. The customer wants to know how long it would take to build this one board with your state-of-the-art PnP machine. As everyone knows, the answer is not 10 resistors / (50k components/hour) = less than a second. The PnP machine has two prerequisites before it can begin its work:

  1. The assembly drawing in a machine-readable format that can be used to program the machine
  2. All the components, carefully organized onto reels, fed into known locations on the PnP machine in a standard way.

For a single board with loose components, it’s probably faster to just solder it by hand than to spend an hour programming the PnP machine and then spend one second placing components on the board. In fact, because your customer provided the components in a little baggie, you couldn’t even use them: it would cost far more to load them onto a reel than to just buy a reel of the same part and use that instead. The key takeaway is that the prerequisites that enable a PnP machine to work so quickly are:

  • a standardized problem description in a machine-readable format it can understand
  • standardized components on carefully configured reels or trays to minimize the component-to-component variability during pickup.

Without those two conditions, a PnP machine can’t work and you’re better off hand soldering the board, because a human can understand a printed-out assembly diagram and our hands can manipulate tweezers to place bulk components from little baggies. Nowadays, it’s not hard to find any component you want preloaded onto a reel for a PnP machine. The industry has had many decades to build careful standardization around this process because it’s critical for reliable, high-speed manufacture of all circuit boards. For Industry 4.0, the equivalent of the PnP machine is a machine learning (ML) algorithm such as a deep neural network.

Just like the PnP machine, it has the same two prerequisites to function:

  1. a standardized, machine-readable problem description that it can understand
  2. standardized and sanitized data inputs that have very specific regular formats.

The Industry 4.0 equivalent of hand soldering is an expert with Excel and a few SQL databases. Unfortunately, the industry hasn’t had decades to standardize all its data into “data reels” that can be fed into ML tools. In many ways, we are still trying to feed in loose baggies of data and finding we’re not satisfied with the results. Sometimes the baggies are very large, like an MES database, and other times they’re very small, like a log file from a single machine. It’s not the size of the baggie that’s important; it’s the fact that the data is still in a baggie and not on a reel.

Just like the PnP machine, ML tools struggle to deal with tiny variations from one data entry to another, variations that humans dismiss as inconsequential. The key limiting factor is that a PnP machine has no ability to understand how to build a circuit board; it just knows where you told it to move its arm to pick up parts and where it’s supposed to drop them. Similarly, advanced ML tools don’t understand your problem; they just know how to combine lots of latent features from the standardized data you feed them into an answer. This is why data variability is so problematic for ML. It can’t tell the difference between the tiny variations it should ignore (because they’re just noise) and the ones you’re asking it to pull together into a giant ROI. In both cases, if the instructions are not precise or there is too much variability in the inputs, the process breaks down and you’re better off working manually, allowing the human mind to power through the ambiguity.
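As a small illustration of the kind of variation described above, the sketch below shows how the same machine can appear under several spellings across silos, and how a normalization step puts those values "on a reel". The identifiers and the canonical `PREFIX-NN` format are hypothetical, chosen only for this example; real factory naming schemes will need their own rules.

```python
import re

# Hypothetical machine identifiers as they might appear in different silos.
# A person reads these as two machines; an exact-match join sees five.
raw_ids = ["PnP-01", "pnp01", " PNP_01 ", "PnP-02", "pnp-2"]


def normalize_machine_id(raw: str) -> str:
    """Map a loosely formatted ID onto one canonical form, e.g. 'PNP-01'.

    The accepted pattern (letters, optional separator, digits) is an
    assumption made for this sketch, not a standard.
    """
    match = re.fullmatch(r"\s*([A-Za-z]+)[-_ ]?0*(\d+)\s*", raw)
    if match is None:
        # Refuse to guess: unknown formats should be reviewed, not silently kept.
        raise ValueError(f"unrecognized machine id: {raw!r}")
    prefix, number = match.groups()
    return f"{prefix.upper()}-{int(number):02d}"


distinct_before = len(set(raw_ids))
canonical = [normalize_machine_id(r) for r in raw_ids]
distinct_after = len(set(canonical))

print(f"distinct IDs before normalization: {distinct_before}")
print(f"distinct IDs after normalization:  {distinct_after}")
```

Raising an error on unrecognized formats is a deliberate choice in this sketch: at scale, a value that slips through unnormalized is harder to find later than one that is flagged up front.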

Because of scale, the question for the SMT industry now is how to standardize the data sources into data reels that are ready to be fed into the workhorse machine learning tools of Industry 4.0. Solving a problem on a small scale turns out to be very different from solving the same problem on a massive scale. On a small scale, you can deal with edge cases manually. For example, if you have 100 rows in your spreadsheet and 1% have issues, it’s not a problem to just fix that one row. However, a 1% issue rate on a spreadsheet of 10 million rows results in 100,000 problems. Fixing those by hand is not just inefficient; it is likely cost-prohibitive.
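The arithmetic in the spreadsheet example can be written out as a quick back-of-the-envelope sketch. The two-minutes-per-fix figure is purely an assumption for illustration; the row counts and the 1% issue rate come from the example above.

```python
def rows_needing_manual_fix(total_rows: int, issue_percent: int) -> int:
    """Number of rows with issues, using integer math to avoid rounding noise."""
    return total_rows * issue_percent // 100


MINUTES_PER_FIX = 2  # assumed effort per manual fix, for illustration only

small = rows_needing_manual_fix(100, 1)
large = rows_needing_manual_fix(10_000_000, 1)

for label, count in [("100 rows", small), ("10 million rows", large)]:
    hours = count * MINUTES_PER_FIX / 60
    print(f"{label}: {count:,} rows to fix, about {hours:,.1f} hours by hand")
```

The issue rate is identical in both cases; only the absolute count changes, and the manual effort scales with the count.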

Exploratory Costs

In this light, the real hidden cost of data silos comes into focus. It’s not that you can’t get data out of them or that it’s impossible to combine data from multiple silos to solve a problem; it’s that it’s a nontrivial amount of work to do so, every single time. So, you naturally stop doing it unless you’re very sure it’s going to be worth your while. As a result, many problems never get solved, even if the ROI could be large, because the cost of finding out if you’re right is just too high.

DataOps, or the process of combining and cleaning data, has been a key impediment to implementing Industry 4.0 projects, even for semiconductor people who know and value clean data.  One difficulty is that machines report data only to a database built by the same vendor, requiring engineers to spend hours or even days manually downloading information from silos to correlate it with other data in order to glean the needed information. This costly and time-consuming process may deter companies from fully exploring data insights to better optimize factories.

Additional DataOps challenges semiconductor manufacturers face include:

  • Tools used to solve problems on a small scale are ineffective on a large scale
  • A belief that data in silos is already clean and is being utilized at max value
  • A belief that current methods of data collection are the most cost-effective

Maximizing the value of a company’s own data can mean the difference between winning new customers, or even retaining current ones, and losing them. Semiconductor manufacturers collect and record significant amounts of data, but it’s often collected for a different reason than the one they need it for later in the process. An omniscient manufacturer would never guess wrong on the end-need for data, but because it’s impossible to predict all potential needs and uses for data, the current process is incomplete and may introduce significant flaws in data collection and sorting.

Let’s use semiconductor wafer tracking as an example. A semiconductor company stores detailed data in a wafer-tracking manufacturing execution system (MES). The data was captured to control process parameters during wafer processing, so it was associated with individual wafers as they progressed through the line.  The data was very clean and sliced up for the intended purpose.  Now, the company wants to use that same data to track the health of the underlying machine, but they soon find out they have a huge problem. They associated the data with a wafer rather than a machine, so it becomes very complicated to determine what happened to the actual machine over time.  Just getting started requires finding every wafer a machine processed, gathering the data for that machine from each wafer data silo, and stitching it together to get a machine-focused picture instead of the collected wafer-focused picture.  After doing all this, they still only have a partial picture because they lack data from when the machine was on but not producing wafers.
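The stitching step in the wafer-tracking example can be sketched in a few lines. Everything here is hypothetical: the record layout, the field names (`machine`, `start`, `end`, `chamber_temp_c`), and the 30-minute gap threshold are assumptions for illustration, not the schema of any particular MES.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical wafer-keyed records: each wafer lists the steps it went through.
wafer_records = {
    "W003": [{"machine": "ETCH-1", "start": "2020-01-06T13:00",
              "end": "2020-01-06T13:20", "chamber_temp_c": 64.8}],
    "W001": [{"machine": "ETCH-1", "start": "2020-01-06T08:00",
              "end": "2020-01-06T08:20", "chamber_temp_c": 61.2},
             {"machine": "DEP-1", "start": "2020-01-06T09:00",
              "end": "2020-01-06T09:30", "chamber_temp_c": 402.0}],
    "W002": [{"machine": "ETCH-1", "start": "2020-01-06T08:25",
              "end": "2020-01-06T08:45", "chamber_temp_c": 61.9}],
}


def machine_timeline(records):
    """Re-key wafer-focused data by machine and order it by start time."""
    by_machine = defaultdict(list)
    for wafer_id, steps in records.items():
        for step in steps:
            by_machine[step["machine"]].append({
                "wafer": wafer_id,
                "start": datetime.fromisoformat(step["start"]),
                "end": datetime.fromisoformat(step["end"]),
                "chamber_temp_c": step["chamber_temp_c"],
            })
    for events in by_machine.values():
        events.sort(key=lambda e: e["start"])
    return dict(by_machine)


def unobserved_gaps(events, min_gap=timedelta(minutes=30)):
    """Intervals where the machine processed no wafer, so no data exists."""
    return [(prev["end"], nxt["start"])
            for prev, nxt in zip(events, events[1:])
            if nxt["start"] - prev["end"] >= min_gap]


timeline = machine_timeline(wafer_records)
gaps = unobserved_gaps(timeline["ETCH-1"])
for gap_start, gap_end in gaps:
    print(f"ETCH-1: no data from {gap_start} to {gap_end}")
```

Even in this toy case, the machine-focused view has to be rebuilt by walking every wafer, and the gaps it reveals are exactly the periods the wafer-keyed data cannot describe.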

Tools of Technology

Software alone cannot fix this problem, nor can hardware. By contrast, the right Industrial Internet of Things (IIoT) platform does fix the problem: it scavenges the data from machines on the floor, passes it to a data broker, and unifies it with existing data for real-time analytics, converting raw factory information into business value. In an increasingly competitive global marketplace, the ability to satisfy customers’ demands for faster, more powerful metrics is a dominant competitive edge. Data is increasingly being used to solve process issues, which is critical because many new product introduction lines are asked to ramp up production rapidly and need to solve problems faster.

Semiconductor companies are leading the way in transforming factory floors to Industry 4.0. They have been generating and analyzing data with the newest technologies for years.  Now, Artificial Intelligence (AI) can help companies gain insight into effectively improving fab operations and drive lower maintenance costs and higher yield. The concepts applied to improve speed and yield in fab operations are the same as those used to enhance data standardization and output. In both situations, correct tools are key.  With automation in place, and with the right tools to properly tee up the raw material (data), it’s an open lane to increasing return.

Bottom line: reducing the cost of unifying data across silos pays dividends throughout the organization because it reduces all data problem-solving barriers. That realization is a key step toward unlocking the factory of the future.




Tim Burke is co-founder and Chief Technology Officer of Arch Systems where he works to accelerate Industry 4.0 by standardizing connectivity and data gathering across the factory. He has broad expertise in industrial communication protocols as well as the challenges of working with the diverse set of machines found in electronics and semiconductor factories. Tim’s published work on the device physics of organic photovoltaics has been cited by thousands of researchers.




Wondering what to do with all that data? View Tim’s interview with IConnect007 at IPC APEX 2020 for more insights.

Arch Systems’ unique technology retrofits new and legacy machines to extract industrial machine data and propel AI and Industry 4.0 transformations. The ArchFX Broker streamlines Industry 4.0 implementation and drives greater ROIs by running one application across all machines. Because it integrates with SCADA, MES, and ERP, it is Machine Learning and AI-ready. This results in an answer to one of the most common challenges in manufacturing data analysis today: providing a well-marked path to identify, optimize, and automate processes.

