Julien Le Dem

So yeah, I can give a little bit of the history. I think Thrift, Protocol Buffers, and Avro preceded Parquet, and Parquet tried to be complementary to them. One of the things they define is the IDL, how you define your type system. And Avro is definitely better at all the parts about whole pipelines, the kind of code where you need to understand the schema and do transformations; it makes it easier to deal with schema evolution, to be more self-describing, to pass the schema along with the data. So Parquet tries not to redefine the IDL, but just to define a columnar format that can be complementary to those things, right? So you have a seamless replacement: you can keep using the same IDL that you're using with Avro, for example, to describe your type system, and use this columnar representation on disk when it's convenient, when it's the right use case. So maybe you were using Avro before. You can still use Avro as your model, but keep the Avro file format, which is row-oriented, when that's useful, and swap to the Parquet columnar representation when it's better, say for SQL analysis. So that's one part. And as for the history of Parquet versus ORC: I think back in the day there was this need for a columnar representation on disk for Hadoop, right? The use case when I was at Twitter was trying to make Hadoop more like Vertica. There was this need, and there was a little bit of overlap among the people working on those columnar formats, and you only start talking about it when it's ready, right? You publicize it and you say, hey, look, it's open source, we're trying to build this, we think there's a need for it. So it's a little unfortunate that there were parallel efforts; back then I connected with the Impala team, which was trying to do something similar as well.
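To make the row-oriented versus column-oriented distinction concrete, here is a toy sketch in plain Python. It uses no real Avro or Parquet API, just ordinary lists and dictionaries, to show why a SQL-style query that touches one field prefers the columnar layout:

```python
# Toy illustration (not real Avro/Parquet code): the same records laid out
# row-oriented, as an Avro data file stores them, versus column-oriented,
# as Parquet stores them on disk.
records = [
    {"user": "ada", "clicks": 3},
    {"user": "bob", "clicks": 7},
    {"user": "eve", "clicks": 1},
]

# Row-oriented: one whole record after another; good when a pipeline
# step needs to read or transform entire records.
row_layout = [(r["user"], r["clicks"]) for r in records]

# Column-oriented: all values of one field stored contiguously; an
# analytical query that only needs "clicks" scans just that column.
column_layout = {
    "user": [r["user"] for r in records],
    "clicks": [r["clicks"] for r in records],
}

print(row_layout)                    # [('ada', 3), ('bob', 7), ('eve', 1)]
print(sum(column_layout["clicks"]))  # 11
```

The same logical schema (user, clicks) describes both layouts, which is the complementarity being described: keep the model, swap the physical representation.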
And later on, we connected with other teams and kind of grew the Parquet community, but there were these parallel efforts. So their representations of nested data structures are different: Parquet uses the Dremel model, and ORC uses a different model, but they end up with very similar characteristics, because they're trying to solve the same problem. I think Parquet has been better at integrating into the ecosystem. From the beginning, I was really aware that I didn't want to build another proprietary file format, the same problem as when you import your data into a database and then you can use it only in that database. I really wanted it to become a standard for the ecosystem. So from the beginning, from the community-building point of view, I spent a lot of work making sure people's opinions were integrated into the design. The Apache Drill team had some needs for new types, and we integrated their needs. The Impala team was coming with a C++ native-code execution engine, so the Parquet format is very language agnostic, and we merged our designs early on to create Parquet. So it's been very open, making sure people could come and get what they need. A team at Netflix did the work of integrating with Presto, and they had some special needs because they were using Amazon and S3 at the time, so we did the work to make sure it would work well for their use case as well. And just by being open, at some point you reach a critical mass and more people start using it, because they see all these teams and projects using it, and it didn't make sense for people to invent their own format instead of reusing the same one. So I think that was part of the success of Parquet: being very open and very inclusive with the community early on. And, you know, Spark SQL started using Parquet, and we didn't even have to help them, right? They just decided to do it.
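The Dremel model mentioned above shreds nested records into flat columns, attaching two small integers to each value: a repetition level (does this value start a new record or continue a list?) and a definition level (how much of the optional/repeated path is actually present?). A simplified hand-rolled sketch, collapsing the schema to a single repeated optional field rather than the general per-path computation real Parquet does:

```python
# Simplified sketch of Dremel-style record shredding, the nested-data
# model Parquet uses. Real implementations derive the level bounds from
# the schema; here one "repeated optional" string field is handled by hand.
def shred(records, field):
    """Flatten a nested field into (value, repetition, definition) triples.

    repetition level: 0 = value starts a new record, 1 = continues its list.
    definition level: 0 = list missing or empty, 1 = entry present but null,
                      2 = value actually present.
    """
    triples = []
    for rec in records:
        items = rec.get(field) or []
        if not items:
            # The record still contributes one marker so it can be rebuilt.
            triples.append((None, 0, 0))
            continue
        for i, item in enumerate(items):
            rep = 0 if i == 0 else 1
            d = 1 if item is None else 2
            triples.append((item, rep, d))
    return triples

rows = [
    {"hobbies": ["chess", None]},
    {"hobbies": []},
    {"hobbies": ["go"]},
]
print(shred(rows, "hobbies"))
# [('chess', 0, 2), (None, 1, 1), (None, 0, 0), ('go', 0, 2)]
```

The point of the encoding is that the values of one nested field end up in a single flat column, plus enough level information to reconstruct the original record boundaries without storing the structure per row.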
They did it, and once they were done, they talked about it. So the effort you put in early on to be inclusive paid off pretty well, and now Parquet is pretty much supported everywhere. Technically, you know, the characteristics of Parquet are going to be very similar to ORC's. But what makes it more valuable, I think (and again, being the Parquet guy, I'm biased) is something that was important to me early on: making sure we were building something standard, that we keep the flexibility of Hadoop. The beauty of the ecosystem is that there are all those tools you can use, and you're not siloed into one tool because of the storage format you picked. And so the last part is talking about Arrow. It's kind of the next step. We talked about serialization formats, and about Avro and Parquet as a storage layer on top of Hadoop and HDFS. Arrow is thinking about the same problem, but in main memory, because the access patterns and the characteristics, the latency of accessing main memory compared to accessing disk, are different. When you are storing data in memory, there are similar benefits to using a columnar representation. But the trade-offs are different, right? The latency of accessing memory versus disk is different: in memory you want to optimize more for the throughput of the CPU, while in Parquet you want to optimize more for the speed of getting the data off of disk. So there are different trade-offs that warrant a different format, and that's where Arrow comes in, for in-memory processing. And as technology has evolved: we used to have very little main memory and more disk, and now there's more and more main memory, and more tiers are showing up. We used to have spinning disks; now you have SSDs with flash memory, and you also have non-volatile memory, which is flash but in the DIMM slots.
And so you have different characteristics: the latency of accessing the data and the throughput of reading the data are different, so you have different trade-offs, and also different costs of storage, how much main memory versus how much non-volatile memory versus how much SSD versus how much spinning disk you have. So those different trade-offs will apply, right: you have more of a range of where you store the data and how fast you can access and process it.

And so all those things are very interesting. That's where you have things on a spectrum: Arrow is more on the memory end, and Parquet is more on the disk end, optimizing the layout for query processing. And in the future there's going to be interesting evolution in which one is more efficient where. That's where abstracting away where the data is stored, and making it more managed, like in a database, is going to be interesting, in simplifying that problem for end users. Ideally, Arrow is something end users don't need to see or be aware of. I mean, they can be aware of it, but they don't need to write it in their code; reading and writing Arrow is more something the database layer manages.