Julien Le Dem

Yes, there are a couple of use cases where that becomes very handy. One is, of course, when something goes wrong. A lot of the data processing you see in companies, a lot of those frameworks and environments, are designed with the best-case scenario in mind: people know what happens when the job is successful, you produce the data, and you trigger downstream processing. However, when something goes wrong, it becomes hard to debug. Or if you need to reprocess something, it becomes hard to figure out. So Marquez captures very precise metadata about when the job ran, which version of the code ran, and which version of the dataset was written, especially if you use a storage layer like Iceberg or Delta Lake, where you have a precise definition of each version of the dataset.

So when your job fails, or it's taking too long, or the job is successful but the data looks wrong, you can start looking at what changed. For your particular job: did the version of the code change since the last time it ran? Or did the shape of the input dataset change? You can use things like Great Expectations, which is an open source framework for defining declarative properties of your datasets, to verify that those properties are still valid or didn't change significantly. And you can look at that not only for your job, but for all the upstream jobs, because you understand the dependencies.

Often it's a simple thing: why is my job not running? It's not running because your input is not showing up, and your input is not showing up because the job that produces it is not running. So you can walk that graph upstream until you find the source of your problem. It may be that some input data is wrong, or it may be that a bug got introduced, and you can figure out what's going on. So first, you have a lot of information to understand what's happening. And second, since you have a precise model and you know, for each run, which version of each dataset it ran on, if you need to restate a partition in a dataset you can improve your triggering: you know exactly which jobs need to be rerun.

I think the state of the industry is often that people have to do a lot of manual work when they need to restate something and rerun all the downstream jobs. The first capability that's required is having the visibility and understanding of all the dependencies, so you know what to rerun. In the future, you could even imagine using that very precise model to automatically trigger all the things that need to be rerun. Or if something is too expensive to rerun and not worth it, you could flag it so people know the data is dirty and should not be used, or something like that. So there are a lot of aspects like this that are important.

And in a world where a lot more machine learning jobs are running on data, this kind of information is very useful: that particular training job ran on this version of the training set, using those hyperparameters, and produced that version of the model, which was then used in the experiment with that experiment ID. Tying everything together matters because people need to be able to reproduce the same model.
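To make the upstream root-cause walk and the "what needs to be rerun" computation described above concrete, here is a minimal sketch in Python. It uses a toy in-memory dependency graph rather than the actual Marquez API, and all job names and statuses are made up for illustration.

```python
from collections import deque

# Toy lineage graph: each job lists the upstream jobs it depends on.
# In Marquez this graph would come from the captured run and dataset metadata;
# the names here are hypothetical.
upstream = {
    "train_model":    ["build_features"],
    "build_features": ["ingest_clicks", "ingest_orders"],
    "ingest_clicks":  [],
    "ingest_orders":  [],
}

# Latest run status for each job, as a metadata store might report it.
status = {
    "train_model":    "WAITING",    # never started: its input is missing
    "build_features": "WAITING",
    "ingest_clicks":  "FAILED",     # the real source of the problem
    "ingest_orders":  "COMPLETED",
}

def root_causes(job):
    """Walk the graph upstream from a stuck job and return the failed jobs
    that have no failing jobs above them, i.e. the likely root causes."""
    seen, causes, queue = set(), [], deque([job])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        parents = upstream.get(current, [])
        failing_parents = [p for p in parents if status[p] != "COMPLETED"]
        if status[current] == "FAILED" and not failing_parents:
            causes.append(current)
        queue.extend(parents)
    return causes

def needs_rerun(changed_job):
    """Given a job whose output was restated, return every job downstream of it
    (directly or transitively) that therefore needs to be rerun."""
    downstream = {j: [] for j in upstream}
    for job, parents in upstream.items():
        for p in parents:
            downstream[p].append(job)
    to_rerun, queue = set(), deque(downstream[changed_job])
    while queue:
        job = queue.popleft()
        if job not in to_rerun:
            to_rerun.add(job)
            queue.extend(downstream[job])
    return to_rerun

print(root_causes("train_model"))    # ['ingest_clicks']
print(needs_rerun("ingest_clicks"))  # {'build_features', 'train_model'}
```

The same traversal in the downstream direction is what would let a system automatically trigger, or deliberately skip and flag, the jobs affected by a restated partition.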
So capturing this information, or, when the model is drifting over time, having the proper metrics and being able to get back to that version of the training set, or to understand what has changed, whether in the data or in the parameters, is really important. So those are some of the specific things we have in mind when we look at this very precise model of jobs, datasets, and what ran.
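As a rough illustration of the kind of record that ties a training run back to its inputs, here is a small Python sketch. The field names and values are invented for this example; they are not the Marquez or OpenLineage schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class TrainingRunRecord:
    """Illustrative record linking one training run to what is needed to reproduce it."""
    experiment_id: str
    job_name: str
    code_version: str            # e.g. the git commit of the training code
    training_set: str            # dataset name
    training_set_version: str    # e.g. an Iceberg/Delta snapshot or version id
    hyperparameters: dict
    model_version: str           # version of the model artifact produced
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Hypothetical values for a single run.
record = TrainingRunRecord(
    experiment_id="exp-042",
    job_name="train_churn_model",
    code_version="9f1c2ab",
    training_set="features.churn_training",
    training_set_version="snapshot-2735",
    hyperparameters={"learning_rate": 0.05, "max_depth": 6},
    model_version="churn-model:17",
)

# Emit the record as JSON so it can be stored alongside other run metadata.
print(json.dumps(asdict(record), indent=2))
```

With records like this kept per run, reproducing a model or explaining drift becomes a lookup of the exact training set version and parameters, rather than guesswork.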