Date: From: Brian Hurt <bhurt@s...> Subject: Re: [Caml-list] Long-term storage of values

On Thu, 28 Feb 2008, Dario Teixeira wrote:
> Hi,
>
> Suppose I have a value of type Story.t, fairly complex in its definition.
> I wish to store this value in a DB (like Postgresql) for posterity.
> At the moment, I am storing in the DB the marshalled representation
> of the data; whenever I need to use it again in the Ocaml programme
> I simply fetch it from the DB and unmarshal it.

The following is just my opinion, not that of my employer.

You're making two mistakes.

Mistake #1: treating a database as a dumb object store. This is a really popular idea right now: Hibernate does this, as does Ruby on Rails, and a number of other ORM packages effectively take this approach. On the other hand, dynamically typed languages are also really popular.

A database is an incredibly powerful tool, used correctly. Used correctly, databases let you handle huge amounts of data shared between multiple different clients with great flexibility and good performance. Used incorrectly, they tend to be bloated, slow pigs.

There are a lot of things databases aren't good at: multidimensional data, for example, or recursive ("tree-structured") data. Databases have some significant limitations. Every row in a relation (aka table) has to be exactly the same type: no superclasses, no variant types. Worse yet, SQL, at least in its classic form without recursive queries or procedural extensions, isn't even Turing-complete. It's the world's oldest, most popular DSL.

So "used correctly" is tricky to define, because relational databases are a paradigm, not unlike functional programming or object oriented programming. But the trick is that you're designing, and coding, to the database, and you can't hide that or ignore it. Some things are easy: databases are really good at filtering, joining, and some simple mapping and aggregation. The first few "levels" of data handling should be done in SQL in the database; you should never be sucking whole tables down.
If you do try to hide the essential nature of the database, you'll run right into the meatgrinder of its limitations. Used correctly, you get the advantages and avoid the disadvantages.

So, mistake number one: either use the database, and structure your data (at that layer) to take advantage of it, or don't use a database.

Mistake number two: file formats (and this includes marshalled data structures) are wire protocols, and need to be designed to be as abstract as possible, revealing as little about the internal structure of the program as possible (preferably nothing at all).

This is an idea that gets reinvented time after time, and it always ends in tears and recriminations: have some magic protocol that allows programs to communicate directly. Just have program X call a function or pass an object to program Y directly, and have the protocol handle all the mucking about with serializing/deserializing data, converting function calls into request/response messages, etc. Sun RPC, COM, CORBA, OLE, XML-RPC, and SOAP are the implementations that spring to mind.

Object serialization hits the exact same problem: it doesn't matter whether programs X and Y are communicating via TCP/IP sockets, files, or quantum-tachyon entanglement. Sooner or later (and generally sooner), it'll happen: program X will ask for some function, or pass some type of data, that program Y doesn't have any knowledge of. It may be because X is a newer version of the program/protocol, and the function/data type has been added. It may be because X is an older version, and the function/data type has since been removed. In any case, the first time this happens is when the tears and recriminations start.

Versioning simply makes it more painfully obvious that you're shackled to the past. You want to get rid of that pesky function? You can't, because older versions of the protocol require it to be there. Don't need a piece of data anymore? Tough, older versions of the protocol still require it.
The best thing versioning gives you is the ability to error out early, and make a more sensible error message ("Sorry, but protocol support >= 2.14 is required!"), but it doesn't solve the problem.

The best solution I've found is to be aware that, when you're communicating with the outside world, you're implementing a *protocol*. And that protocol should be, as I said, as abstract as possible and reveal as little about the structure of the program as possible, so that I can change the program enormously, even reimplement it from scratch in a different language, without great difficulty. Consider SMTP, HTTP, and YAML as examples of protocols or generic file formats done right.

Note that you can do protocol design, and then implement it in CORBA or XML. A sure sign that you've done this is the existence of a "translation layer": comments like "OK, now we translate the XML data structure into our internal data structure" and such like. The successful projects I've seen that used these technologies did this (or got lucky and grew into this).

So that's mistake number two: you're communicating between different versions of the program with an ill-defined (at best) and non-generic protocol/file format.

Fix these two problems, and I'm willing to bet most of the rest of the problems go away too.

Brian
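
To make the "translation layer" idea above concrete, here is a minimal OCaml sketch. Everything in it is an assumption made for illustration: the fields of the story record (the real Story.t is not shown in this thread), the line-based text format, and the names encode, decode, and format_version. It also assumes field values contain no newlines. It is one possible shape for such a layer, not a recommendation of this particular format.

```ocaml
(* Hypothetical internal representation; free to change between releases. *)
type story = { title : string; author : string; paragraphs : string list }

(* External format, invented for this sketch. One "key: value" per line:
     format-version: 1
     title: <text>
     author: <text>
     paragraph: <text>      (zero or more)
   It says nothing about how [story] is laid out in memory. *)
let format_version = 1

(* Translation layer, outbound: internal value -> external text. *)
let encode (s : story) : string =
  String.concat "\n"
    (Printf.sprintf "format-version: %d" format_version
     :: ("title: " ^ s.title)
     :: ("author: " ^ s.author)
     :: List.map (fun p -> "paragraph: " ^ p) s.paragraphs)

(* Split "key: value" at the first ':'; None if the line is malformed. *)
let split_field line =
  match String.index_opt line ':' with
  | Some i when i + 1 < String.length line && line.[i + 1] = ' ' ->
      Some (String.sub line 0 i,
            String.sub line (i + 2) (String.length line - i - 2))
  | _ -> None

(* Translation layer, inbound: external text -> internal value.
   Malformed lines and unknown keys are ignored; an unsupported
   version is rejected early with a readable message. *)
let decode (text : string) : (story, string) result =
  let fields = List.filter_map split_field (String.split_on_char '\n' text) in
  match fields with
  | ("format-version", v) :: rest ->
      (match int_of_string_opt v with
       | Some n when n = format_version ->
           (match List.assoc_opt "title" rest, List.assoc_opt "author" rest with
            | Some title, Some author ->
                let paragraphs =
                  List.filter_map
                    (fun (k, p) -> if k = "paragraph" then Some p else None)
                    rest
                in
                Ok { title; author; paragraphs }
            | _ -> Error "missing title or author")
       | Some n -> Error (Printf.sprintf "unsupported format-version %d" n)
       | None -> Error "malformed format-version")
  | _ -> Error "missing format-version header"
```

Because only encode and decode know about both sides, the story record can be restructured without touching stored data, and the stored text could be read by a program written in another language. Contrast this with storing the output of Marshal, where the stored bytes are tied to the exact type definition.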