$\begingroup$

A CSV file typically begins with a header line that names the columns of the n data rows that follow. The header is a text string whose fields are delimited by a separator character. A CSV header (C1) might look like this:

Date;Time;ZIP-Code;Address;Temperature

Assume we have a second CSV file (C2) with this structure:

Date;ZIP-Code;Address;Temperature

C1 and C2 are similar, but they are not the same: C2 lacks "Time".

Assume a third header, C3:

Time;Date;Address;ZIP-Code;Temperature

Here the same items as in C1 are present, but the order is different.

What I am after is a metric that gives me the similarity between two such sets, taking into account the relative position of the items within the two sets. In other words, if the two sets have the same items and the same cardinality and differ only in the order of the items, the similarity value should be higher than when the cardinality differs, or when the cardinality is the same but the items themselves differ.

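I came up with this preliminary mental measure. I could plot the items of one header (rows) against the items of the other header (columns) in a matrix, marking a 1 where the items match: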

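| | Date | Time | ZIP-Code | Address | Temperature |
|---|---|---|---|---|---|
| Date | 1 | | | | |
| Time | | 1 | | | |
| ZIP-Code | | | 1 | | |
| Address | | | | 1 | |
| Temperature | | | | | 1 |

| | Date | Time | ZIP-Code | Address | Temperature |
|---|---|---|---|---|---|
| Time | 0 | 1 | 0 | 0 | 0 |
| ZIP-Code | 0 | 0 | 1 | 0 | 0 |
| Address | 0 | 0 | 0 | 1 | 0 |
| Temperature | 0 | 0 | 0 | 0 | 1 |

| | Date | Time | ZIP-Code | Address | Temperature |
|---|---|---|---|---|---|
| Temperature | 0 | 0 | 0 | 0 | 1 |
| Address | 0 | 0 | 0 | 1 | 0 |
| ZIP-Code | 0 | 0 | 1 | 0 | 0 |
| Time | 0 | 1 | 0 | 0 | 0 |
| Date | 1 | 0 | 0 | 0 | 0 |

| | Date | Time | ZIP-Code | Address | Temperature |
|---|---|---|---|---|---|
| Head1 | 0 | 0 | 0 | 0 | 0 |
| Head2 | 0 | 0 | 0 | 0 | 0 |
| Head3 | 0 | 0 | 0 | 0 | 0 |
| Head4 | 0 | 0 | 0 | 0 | 0 |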

My sense of similarity would be: the more structure or pattern there is in the matrix, the more similar the two CSV headers are.
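To make the matrix idea concrete, here is a minimal sketch of how such a match matrix could be built. The function name `match_matrix` is hypothetical, and the sketch assumes a `;` separator and exact, case-sensitive string equality between header items.

```python
def match_matrix(row_header, col_header, sep=";"):
    """Return a 0/1 matrix: entry [i][j] is 1 when row item i equals column item j.

    Assumption: items are compared by exact string equality.
    """
    rows = row_header.split(sep)
    cols = col_header.split(sep)
    return [[1 if r == c else 0 for c in cols] for r in rows]


C1 = "Date;Time;ZIP-Code;Address;Temperature"
C3 = "Time;Date;Address;ZIP-Code;Temperature"

# Rows follow the order of C3, columns follow the order of C1.
for row in match_matrix(C3, C1):
    print(row)
```

Comparing C1 with itself gives a pure diagonal, while comparing C3 with C1 gives the same number of ones moved off the diagonal, which is the kind of "pattern" I have in mind.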

I wonder whether an existing measure, such as the Dice distance, cosine similarity, or the Jaccard index, would be of help.
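For reference, a minimal sketch of two of these measures applied to the headers above, treating each header as a plain set of items. The helper names `jaccard` and `dice` are my own, and returning 1.0 for two empty headers is an arbitrary convention assumed here.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| on the sets of header items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # assumed convention for empty input


def dice(a, b):
    """Dice coefficient: 2 |A ∩ B| / (|A| + |B|) on the sets of header items."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0  # assumed convention for empty input


c1 = "Date;Time;ZIP-Code;Address;Temperature".split(";")
c2 = "Date;ZIP-Code;Address;Temperature".split(";")
c3 = "Time;Date;Address;ZIP-Code;Temperature".split(";")

print(jaccard(c1, c2), dice(c1, c2))  # C2 lacks "Time"
print(jaccard(c1, c3), dice(c1, c3))  # same items, different order
```

Because both measures operate on sets, they score C1 against C3 as identical (1.0), so on their own they do not capture the ordering aspect I am asking about.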