Flavio Villanustre

I'd love to. Let's set up something very simple as an example. You have a number of data sets coming from the outside, and you need to load those data sets into HPCC. So the first operation that happens is something known as spray. Spray is a simple process; the name comes from the concept of spray-painting the data across the cluster. It runs on a Windows box or a Linux box, and it will take the data set; let's say your data set is a million records long. It can be in any format: CSV, fixed-length, delimited, whatever. It will look at your total data set and at the size of the Thor cluster where the data will be stored initially for processing. Let's say you have a million records in your data set and ten nodes on your Thor, just to make round, small numbers. It will partition the data set into ten partitions, because you have ten nodes, and it will then copy and transfer each one of those partitions to the corresponding Thor node. This is parallelized if it can be; for example, if your data is fixed-length, it will automatically use pointers and parallelize the copy. If the data is in, say, an XML format, or in a delimited format where it's very hard to find the partition points, it will need to do a pass over the data, find the partition points, and eventually do the parallel copy to the Thor system. So now you end up with ten partitions of the data, in no particular order other than the natural order the data had before: the first 100,000 records go to the first node, the second 100,000 records go to the second node, and so on until you get to the end of the data set. This puts a similar number of records on each one of the nodes, which tends to be a good thing for most processes.
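To make that concrete, a spray is typically driven from a command-line tool such as dfuplus (or through the ECL Watch web interface). A rough sketch follows; the server address, landing-zone path, and logical file name are all hypothetical:

  dfuplus action=spray server=http://esp.example.com:8010 \
          srcip=10.0.0.5 srcfile=/landingzone/phonebook.csv \
          dstname=~demo::phonebook dstcluster=mythor format=csv

The dstname becomes the logical name the cluster uses from then on to address all ten physical partitions as a single data set.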

Once the data is sprayed, or while the data is being sprayed, depending on the size of the data,

or even before, you will most likely want to write a workunit to work on the data. I'm trying to present this example as if it were the first time you see the data; otherwise, all of this is automated, so you don't need to do anything manually. All of this is scheduled and automated, and a workunit that you already had will run on the new data set and append it, or do whatever needs to be done. But let's imagine this is completely new. So now you write your workunit. Let's say your data set was a phone book, and you want to first of all dedupe it and build some rolled-up views on the phone book, and eventually you want to allow users to run some queries on a web interface to look up people in the phone book. And, just for the sake of argument, let's say you're also trying to join that phone book with your customer contact information. So you will write the workunit: it will have that join to merge those two data sets, some deduplication, and perhaps some sorting. After you have that, you will want to build some keys; you don't need to, but you will want to. There is, again, a key-build process that also runs on Thor and will be part of your workunit. So essentially you write your workunit in ECL and submit it; the ECL will be compiled and will run on your data. Hopefully it will be syntactically correct when you submit it, and it will run, giving you the results you were expecting on the data.
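To give a flavor of what such a workunit might look like, here is a minimal ECL sketch; the record layouts, logical file names, and the join condition are all hypothetical, made up for illustration:

  // Hypothetical layouts for the sprayed phone book and the customer contacts
  PhoneRec := RECORD
    STRING50 name;
    STRING20 phone;
  END;
  ContactRec := RECORD
    STRING50 name;
    STRING60 email;
  END;

  phonebook := DATASET('~demo::phonebook', PhoneRec, CSV);
  contacts  := DATASET('~demo::contacts', ContactRec, THOR);

  // Dedupe: sort on the match fields first, then drop adjacent duplicates
  deduped := DEDUP(SORT(phonebook, name, phone), name, phone);

  // Join the phone book with the customer contact information
  MergedRec := RECORD
    STRING50 name;
    STRING20 phone;
    STRING60 email;
  END;
  merged := JOIN(deduped, contacts, LEFT.name = RIGHT.name,
                 TRANSFORM(MergedRec, SELF.email := RIGHT.email, SELF := LEFT));

  // Key build: a payload index on name, built on Thor for Roxie to use later
  nameKey := INDEX(merged, {name}, {phone, email}, '~demo::key::phonebook::name');
  BUILD(nameKey);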

I mentioned this before, but ECL is a statically typed language as well, which means it is a lot harder to run into errors that only appear at runtime. Between the fact that it has no side effects and that it is statically typed, most typing errors, type errors, and errors in function operations are a lot less frequent. It is not like Python, where the code may seem okay and the

run may be fine, but then at some point one run gives you a weird error because a variable that was supposed to hold a piece of text holds a number, or vice versa. So you run the workunit, and it will give you the results. As a result of this workunit you will potentially get some statistics and some metrics on the data, and it will give you a number of keys. Those keys will also be partitioned on Thor: if there are ten nodes, the keys will be partitioned in ten pieces across those nodes. You will be able to query those keys from Thor as well; you can write a few attributes that do the querying there. But at some point you will want to write those queries for Roxie to use, and you will want to put the data in Roxie, because you don't have one user querying the data; you will have a million users going to query that data, and perhaps 10,000 of them will be doing it concurrently. So for that process, you write another piece of ECL, another sort of workunit, but we call this one a query, and you submit it to Roxie instead of Thor; there is a slightly different way to submit it, where you select Roxie as the target and submit. The difference between this query and the workunit you ran on Thor is that the query is parameterized. Similar to a parameterized stored procedure in your database, you define some variables that are supposed to come from the front end, from the input from the user, and then you use the values in those variables to run whatever filters or aggregations you need, which will run in Roxie and will leverage the keys you built on Thor. As I said before, the keys are not mandatory: Roxie can perfectly well work without keys, and it even has a way to work with in-memory distributed data sets, so even if you don't have a key, you don't pay a significant price in the lookup by doing a sequential lookup on the data, a full table scan of your database. So you submit that query to Roxie. Roxie will realize that it doesn't have the data, that the data is in Thor; this is also your choice, but most likely you will just tell Roxie to load the data from Thor. It will know where to load the data from, because it knows what the keys are and what the names of those keys are, and it will automatically load them. It is also your choice whether to tell Roxie to start allowing users to query the front-end interface while it's loading the data, or to wait until the data is loaded before it allows the queries to happen. The moment you submit the query to Roxie, it is automatically exposed on the front end. There is a component called ESP, and that component exposes a web services interface: it gives you a RESTful interface, a SOAP interface, JSON for the payload if you're going through the RESTful interface, even an old ODBC interface if you want, so you can even have SQL on the front end. The moment you submit the query, it automatically generates all of these web service interfaces.
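A minimal sketch of such a parameterized Roxie query, assuming the hypothetical key built above; STORED is the ECL mechanism that binds a definition to a front-end input:

  // The value of 'name' arrives from the web service input at query time
  STRING50 searchName := '' : STORED('name');

  // Reference the payload key built earlier on Thor (hypothetical name)
  nameKey := INDEX({STRING50 name}, {STRING20 phone, STRING60 email},
                   '~demo::key::phonebook::name');

  // KEYED tells the engine to use the index's keyed field for the lookup
  OUTPUT(CHOOSEN(nameKey(KEYED(name = searchName)), 100));

Publishing this to a Roxie target is what makes it callable, and versioned, from the front end.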
So if you want to go with a web browser on the front end, or if you have an application that can use, say, a RESTful interface over HTTP or HTTPS, it will automatically have access to that Roxie query you submitted. Of course, a single Roxie might have not one query but 1,000 different queries at the same time, all of them exposing an interface, and it can have several versions of the same query as well. The queries are all exposed versioned from the front end, so you know which one the user is accessing, and if you are deploying a new version of a query, or modifying an existing one, you don't break your users if you don't want to; you give them the ability to migrate to the new version as they want. And that's it; that's pretty much the process. Now, while you need to have automation, all of this can be fully automated in ECL. You may want to have data updates, and I told you data is immutable, so every time you think you're mutating data, updating data, you're really creating a new data set, which is good because it gives you full provenance; you can go back to your earlier version. Of course, at some point you need to delete data or you will run out of space, and that can also be automated.
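As an illustration of that front end, a published query can be reached over plain HTTP through the ESP's WsECL service with something like the following; the host, target, and query names are hypothetical, and 8002 is the usual WsECL port:

  curl -s "http://esp.example.com:8002/WsEcl/submit/query/roxie/phonebook_lookup/json?name=Smith"

The same query is simultaneously reachable via SOAP and via a generated web form, with no extra work in the query itself.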

And if you have updates on your data, we have concepts like superfiles, where you can apply updates, which are essentially new overlays on top of the existing data, and the existing workunit can just work on that, happily, as if it were a single data set. So a lot of these complexities that would otherwise be exposed to the user, the developer, are all abstracted out by the system. If developers don't want to see the underlying complexity, they don't need to; if they do, they have the ability to go there. I mentioned before that ECL will optimize things. So if you tell it, do this join, but before doing the join do a sort, well, the system may know that the sort isn't needed. And if you know that your data is already sorted, you might say, well, let's not do this sort, or, I want to do this join on each one of the partitions locally instead of a global join; there is the same thing for the sort, a local sort operation in ECL. Of course, if you tell it to do that, and you know better than the system, ECL will follow your orders. If not, it will take the safe approach to your operation, even if it means a little more overhead.
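Two quick sketches of those ideas, with hypothetical file names throughout. First, a superfile that overlays an update onto the existing data, using the standard library's superfile transaction functions; second, the LOCAL flag telling a join to run on each partition independently:

  IMPORT STD;

  // Treat the base data plus an update as one logical data set
  superName := '~demo::phonebook::all';
  SEQUENTIAL(
    STD.File.CreateSuperFile(superName),
    STD.File.StartSuperFileTransaction(),
    STD.File.AddSuperFile(superName, '~demo::phonebook::base'),
    STD.File.AddSuperFile(superName, '~demo::phonebook::update1'),
    STD.File.FinishSuperFileTransaction()
  );

  // Existing workunits read the superfile exactly as a single data set
  PhoneRec := RECORD
    STRING50 name;
    STRING20 phone;
  END;
  allRecords := DATASET(superName, PhoneRec, THOR);

  // Telling the system you know better: join each partition locally
  // instead of performing a global (re-distributing) join
  localJoin := JOIN(allRecords, allRecords, LEFT.name = RIGHT.name, LOCAL);
  OUTPUT(CHOOSEN(localJoin, 100));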