From the course: Data-Centric AI: Best Practices, Responsible AI, and More
Introduction to data-centric AI
- In this chapter, we'll be covering three main concepts. First, we'll discuss what data-centric AI is and talk about its components. This sets the stage for the rest of the course, because we have moved from a model-centric approach to a data-centric approach. In the second video, we will talk about the role of data in different modules of the ML and AI workflow. Finally, we will talk about why data quality is important. What issues arise if data is not kept at the center of focus while building machine learning systems? We will also cover how this impacts the robustness and reliability of those systems.
Let's start with the core of data-centric AI, which is data. It is important that we recognize why data is critical. Only then will we be able to understand why our modeling workflow needs to be data-centric. So let me walk you through a simple example.
Let's imagine a situation where we want to make banana bread. To make our perfect banana bread, let's say we go and get all the required ingredients and get a great chef to make it for us. They'll come up with their own recipe, with their own proportions of the ingredients and their own method of mixing and baking. After the first chef makes the banana bread, we realize that it is too dry and did not turn out how we thought it would. So what do we do now? We feel that maybe the recipe wasn't right, so we go to a different chef, give them the same set of ingredients we got initially, and ask them to make the banana bread. The second chef comes up with their own recipe, using different proportions and different mixing and baking methods, and this time the bread turns out to be too hard. Hmm, there still seems to be a problem. Okay, so we assume that the previous two chefs did not have the right recipe for our perfect banana bread, and now we go to a third chef.
We now give them the same batch of ingredients and ask them to make the banana bread. The third chef, too, tries their own recipe, very different from the first two, with different proportions and a different mixing and baking method, and yet the bread turns out to be too soggy. At this point, we decide to pause and analyze what is really going on. Is it that we simply cannot nail the perfect banana bread recipe, or is something else going wrong? Well, if we had first looked at the ingredients before trying entirely different recipes, we would probably have realized that the bananas we were working with were bad. We can try every method there is, but we won't be able to make a perfect-tasting banana bread until we start with a good set of ingredients.
Now, if we look at data science and machine learning problems through the same lens, we will realize that, just as with the bad bananas, if we are working with bad data, no recipe, model, or algorithm is going to get us the good results we are looking for. Some of you might be thinking, sure, we have heard about garbage in, garbage out. This is trivial. Then what's new with the data-centric AI approach that we're talking about? Let me pause you there and say that trivial things are often the most difficult to implement in a process, precisely because they are so easily overlooked for being trivial.
Now that you have seen, through an example, why and how data is a critical component of any modeling lifecycle, let's look at the formal definition of data-centric AI. Data-centric AI places data at the foundation of the artificial intelligence and machine learning process. It emphasizes that the quality, quantity, and reliability of data are paramount in building successful AI applications.
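To make "look at the ingredients first" a little more concrete, here is a minimal sketch in Python of what a first look at a dataset might involve before trying any model. It is only an illustration and not something taken from the course: the toy rows, the labels, and the function name inspect_ingredients are all made up, and the three checks shown (missing values, exact duplicates, conflicting labels) are just a few assumed examples of data-quality problems.

```python
from collections import Counter

# Hypothetical toy dataset: each row is (feature_1, feature_2, label).
# None marks a missing value. All values are invented for illustration.
rows = [
    (5.1, 3.5, "ripe"),
    (4.9, None, "ripe"),      # missing feature
    (5.1, 3.5, "ripe"),       # exact duplicate of the first row
    (6.2, 2.9, "overripe"),
    (6.2, 2.9, "ripe"),       # same features as the row above, different label
    (5.9, 3.0, None),         # missing label
]


def inspect_ingredients(rows):
    """Return simple data-quality counts for (features..., label) rows."""
    # Rows where any field, feature or label, is missing.
    rows_with_missing = sum(1 for row in rows if any(v is None for v in row))

    # Extra copies of rows that appear more than once.
    duplicate_rows = sum(count - 1 for count in Counter(rows).values())

    # Feature combinations that appear with more than one distinct label.
    labels_by_features = {}
    for *features, label in rows:
        if label is None:
            continue
        labels_by_features.setdefault(tuple(features), set()).add(label)
    conflicting_groups = sum(
        1 for labels in labels_by_features.values() if len(labels) > 1
    )

    return {
        "rows": len(rows),
        "rows_with_missing": rows_with_missing,
        "duplicate_rows": duplicate_rows,
        "conflicting_feature_groups": conflicting_groups,
    }


report = inspect_ingredients(rows)
print(report)
```

If a report like this shows missing values, duplicates, or conflicting labels, that is a hint that the ingredients need attention before we go looking for another recipe.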