John Maiden

Yeah, so the data is extremely noisy in many different ways. I've worked with many large, noisy data sets, and real estate stands out in particular, because a lot of real estate data is collected at the county level. That's just how things are done in the US, which means that you have varying levels of information and formatting. Zoning codes vary; everyone's is different. So even if I say I've got a national database of real estate data, you know there's going to be tons of inconsistency. Some of this just involves having great analysts who really know their data. A good portion of our team has strong real estate knowledge and can quickly look at the data and say, this makes sense or this doesn't, this is relevant for what our customers care about or it's not. So having subject matter experts is always a critical part. If you're going to build a knowledge graph, the graph part's cool, but you also need the knowledge. That means going through the data and determining what is relevant and what's not. And again, because our data is coming in from multiple sources, sometimes you have to apply some very strong business logic to it. Missing data components are very hard to deal with, so finding ways to either fill them in or at least provide enough information that you can complete the graph is useful. Now, most of the time we have a very complete, comprehensive data set. But not everything is going to be perfect, especially when we're trying to combine the data. On the messiness side, just because our providers, or we ourselves, put the data together into a national database doesn't mean it's always connected. The biggest ones that drive us crazy are addresses and names.
To build a powerful knowledge graph, and anyone who's worked with real estate data probably knows this, you have to deal with addresses, and addresses are very complicated. You can have multiple addresses that all mean the same thing but are written differently. We used to be on Sixth Avenue: you can write it as Sixth Avenue spelled out, with the number as 6th Avenue, or as Avenue of the Americas. Those are all valid addresses. But if you've got multiple data sets that each use a different spelling, those are going to lead you to different points. You can't connect the data that way, especially if someone put in the wrong state or transposed the zip code. So address standardization is a big effort. To get to the knowledge graph, putting the data together is not as important as being able to clean it up well, and our big effort has been on address standardization. Making sure that the addresses we get from all the different sources match together is a big lift in itself. It's a service that we use internally to clean our data and also provide to our customers to let them connect their data. Entity resolution is another big one. Buildings can have multiple addresses. What most people see on the building is probably the mailing address, but with a range of addresses for different buildings, you also have to put in time and effort, from both a data science and an engineering perspective, to resolve all the different data sets. Our current office has a street address and an avenue address, which means that depending on which data set you're using, you would have one data set pointing to our street address and another pointing to our avenue address. You've got to resolve both of those so they actually meet in the middle. And the last one that's tricky is names. Everyone thinks that names should be easy to do.
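To make the Sixth Avenue point concrete, here is a minimal sketch of rule-based address standardization. It is not the service described in the talk; the lookup tables, the function name `standardize_street`, and the house number 1234 are all hypothetical, and a production standardizer would need far more rules (for example, "St." can mean "Saint" as well as "Street").

```python
import re

# Hypothetical lookup tables; a real standardizer would need far larger ones.
ORDINALS = {"first": "1st", "second": "2nd", "third": "3rd", "fourth": "4th",
            "fifth": "5th", "sixth": "6th", "seventh": "7th"}
SUFFIXES = {"ave": "avenue", "av": "avenue", "st": "street", "blvd": "boulevard"}
# Known alternate street names, keyed by their normalized form.
ALIASES = {"avenue of the americas": "6th avenue"}


def standardize_street(raw: str) -> str:
    """Return a canonical lowercase form of a street address line."""
    # Lowercase and replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    # Map spelled-out ordinals and abbreviated suffixes to one form each.
    tokens = [ORDINALS.get(t, t) for t in text.split()]
    tokens = [SUFFIXES.get(t, t) for t in tokens]
    # Split a leading house number from the street name, if present.
    number, street = "", tokens
    if tokens and tokens[0].isdigit():
        number, street = tokens[0], tokens[1:]
    street_name = " ".join(street)
    # Collapse known alternate names onto a single canonical name.
    street_name = ALIASES.get(street_name, street_name)
    return f"{number} {street_name}".strip()


# The house number is made up for illustration.
variants = ["1234 Sixth Ave.", "1234 6th Avenue", "1234 Avenue of the Americas"]
canonical = {standardize_street(v) for v in variants}
```

With these tables, all three spellings collapse to the same key, so records from different data sets can be joined on it. Choosing the numeric form as the canonical one is arbitrary; what matters is that every source is mapped to the same form.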
Every data set has a different way of formatting names, which matters especially when we're trying to connect them across disparate data sets. One data set might have "Maiden, John," another might have "John Maiden," and a third will have "John Q." Are these all the same John? Is there a different John out there? If there was a typo and someone had John Q. Maiden versus John W. Maiden, how do you know that these are the same person and the middle initial's not important, or know enough to say they are distinct people and you have to keep them separate? So name resolution is very important. The other problem, and this ties back to putting everything into a graph, is that certain names are very common. There are definitely many John Does in New York City, maybe not John Does per se, but there are a lot of people with very common names. If you look at the data, you could say this one guy owns 20% of New York City real estate. That's not really the case. Because if you put everything into a graph and start looking at all of the different networks, you could say, well, obviously all of these different connections go to this one person, this one name: John Doe owns all these properties. But if I look at it from a graph perspective, I really see disparate networks. So there is a back and forth, and building a knowledge graph is very much an iterative process. There's cleaning the data as well as you can, there's putting everything into a knowledge graph format, then there's looking at how the data actually connects together and doing further cleaning and iteration, because you're never going to get anything perfect on the first try. You need clean data to build the graph, but once you build the graph, you can still see how noisy your data is, and you can do further rounds of iteration. Being able to do that is very important.
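The common-name problem can be illustrated with a small sketch. It assumes, and this is an assumption rather than anything stated in the talk, that two ownership records belong to the same person when they share some linking attribute such as a mailing address or an associated LLC. The records, attribute values, and the function name `split_into_networks` are hypothetical. Records that all carry the name "john doe" are grouped into connected components with a simple union-find.

```python
from collections import defaultdict

# Hypothetical ownership records that all carry the name "john doe":
# (record_id, linking attributes such as mailing address or LLC).
RECORDS = [
    ("r1", ["addr:10 elm street"]),
    ("r2", ["addr:10 elm street", "llc:elm holdings"]),
    ("r3", ["llc:elm holdings"]),
    ("r4", ["addr:77 pine road"]),
    ("r5", ["addr:77 pine road"]),
]


def split_into_networks(records):
    """Group same-name records into connected components.

    Two records are linked when they share any attribute; links are
    transitive, so r1 and r3 end up together through r2.
    """
    parent = {rec_id: rec_id for rec_id, _ in records}

    def find(x):
        # Follow parent pointers to the component root (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # attribute -> first record that carried it
    for rec_id, links in records:
        for link in links:
            if link in first_seen:
                # Merge this record's component with the earlier one.
                parent[find(rec_id)] = find(first_seen[link])
            else:
                first_seen[link] = rec_id

    components = defaultdict(list)
    for rec_id, _ in records:
        components[find(rec_id)].append(rec_id)
    return sorted(components.values())


networks = split_into_networks(RECORDS)
```

Grouping by name alone would treat all five records as one owner. Here `networks` contains two separate components, which is the kind of disparate-network signal that can feed the next round of cleaning.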
So a knowledge graph is not just about getting the data in one place. It's also recognizing all the effort and time that goes into getting it there in the first place and making those connections. And for us, address and name standardization is very important to making it as powerful and as useful as we can.