Saturday, October 22, 2016

Computing Similarity Matrix in ECL

In the previous introductory articles on ECL, I have created two toy sets.  The members_file_raw data set and the likes_file data set.  The members_file_raw data set contains 12 persons with their ID, name and date of birth.  The record type and data set are:

Record_Member_Raw := RECORD
    UNSIGNED8 Id;
    STRING15 LastName;
    STRING15 FirstName;
    STRING20 Birthdate;
END;

members_file_raw := DATASET([
    {1,'Picard','Jean-Luc','July 13, 2305'},
    {2,'Riker','William','2335'},
    {3,'La Forge','Geordi','February 16, 2335'},
    {4,'Yar','Tasha','2337'},
    {5,'Worf','','2340'},
    {6,'Crusher','Beverly','October 13, 2324'},
    {7,'Troi','Deanna','March 29, 2336'},
    {8,'Data','','February 2, 2338'},
    {9,'Crusher','Wesley','July 29, 2349'},
    {10,'Pulaski','Katherine','2309'},
    {11,'O\'Brien','Miles','September 2328'},
    {12,'Guinan','','1293'}], Record_Member_Raw);

The likes_file data set contains who each person likes based on their ID.  The record type and data set are:

Record_Like := RECORD
    UNSIGNED8 SourceId;
    UNSIGNED8 TargetId;
END;

likes_file := DATASET([
        {1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
        {2,3},{2,4},{2,6},{2,7},{2,8},
        {3,8},{3,9},{3,11},
        {4,5},
        {5,7},
        {6,7},{6,9},
        {7,4},{7,5},{7,6},
        {8,3},{8,9},
        {9,3},{9,6},{9,8},
        {12,1}], Record_Like);

Since we have a set of preferences or likes for each person, I was wondering how I could calculate the similarity between two persons based on their likes using, for example, the Jaccard coefficient.  (The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of two sets.)  I took a crack at this.  While my solution may not be the best or efficient way to go about doing this in ECL, it seems to work.

First, I define the similarity matrix as:

Record_Similarity_Matrix := RECORD
    Id1 := members_file_raw.Id;
    Id2 := members_file_raw.Id;
    STRING30 FullName1;
    STRING30 FullName2;
    SET OF UNSIGNED8 LikeIds1;
    SET OF UNSIGNED8 LikeIds2;
    DECIMAL4_3 Score := 0;
END;

The idea is to collect all the likes for each pair of person (Id1 and Id2) into the sets LikeIds1 and LikeIds2.  The Score field will hold the computed Jaccard coefficient for sets LikeIds1 and LikeIds2.

Then, we initialize the matrix data set:

Record_Similarity_Matrix InitSimilarityMatrix(Record_Member_Raw L, Record_Member_Raw R) := TRANSFORM
    SELF.Id1    := L.Id;
    SELF.Id2    := R.Id;
    SELF.LikeIds1   := [];
    SELF.LikeIds2   := [];
    SELF.FullName1  := TRIM(L.LastName) + IF (L.FirstName != '', ', ' + L.FirstName, '');
    SELF.FullName2  := TRIM(R.LastName) + IF (R.FirstName != '', ', ' + R.FirstName, '');
END;

blank_similarity_matrix_file := JOIN(members_file_raw, members_file_raw, LEFT.Id >= RIGHT.Id, InitSimilarityMatrix(LEFT, RIGHT), ALL);

The blank_similarity_matrix_file recordset contains 12 * 12 = 144 records.  The JOIN condition is limited to LEFT.Id >= RIGHT.Id to save computation time since the Jaccard coefficient J(A, B) is equal to J(B, A).

The next step is to gather up the likes for Id1.

Record_Similarity_Matrix AddLikeIds1(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
    SELF.LikeIds1 := L.LikeIds1 + [R.TargetId];
    SELF := L;
END;

temp_similarity_matrix_file := DENORMALIZE(blank_similarity_matrix_file, likes_file, LEFT.Id1 = RIGHT.SourceId, AddLikeIds1(LEFT, RIGHT));

For each record (LEFT) in blank_similarity_matrix_file, the DENORMALIZE action finds all the records (RIGHT) from likes_file such that LEFT.Id1 = RIGHT.SourceId.  For example, if Id1 is 6, the matching records from likes_file with SourceId = 6 are {6,7} and {6,9}.  The transform function AddLikeIds is called for each matching record from likes_file.  The matching record is passed in as the RIGHT parameter to the function while the returned result from the previous call is passed in as the LEFT parameter.

We do a similar step to collect the likes for Id2.

Record_Similarity_Matrix AddIds2(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
    SELF.LikeIds2 := L.LikeIds2 + [R.TargetId];
    SELF := L;
END;

similarity_matrix_file := DENORMALIZE(temp_similarity_matrix_file, likes_file, LEFT.Id2 = RIGHT.SourceId, AddIds2(LEFT, RIGHT));

Now that LikeIds1 and LikeIds2 are populated, we can proceed to computing the Jaccard coefficients:

Record_Similarity_Matrix ComputeSimilarity(Record_Similarity_Matrix L, Record_Similarity_Matrix R) := TRANSFORM
    IdSet1 := DATASET(R.LikeIds1, {UNSIGNED8 Id});
    IdSet2 := DATASET(R.LikeIds2, {UNSIGNED8 Id});
    CombinedIds   := IdSet1 + IdSet2;
    UnionIds      := DEDUP(CombinedIds, LEFT.Id = RIGHT.Id, ALL);
    IntersectIds  := JOIN(IdSet1, IdSet2, LEFT.Id = RIGHT.Id);
    SELF.Score    := COUNT(IntersectIds) / COUNT(UnionIds);
    SELF          := R;
END;

similarity_file := ITERATE(similarity_matrix_file, ComputeSimilarity(LEFT, RIGHT), LOCAL);

The ITERATION action traverses the records in the similarity_matrix_file data set and invokes the ComputeSimilarity function on each record.  I don't know if there is an easier way to compute set union and intersection so I use DEDUP and JOIN, respectively.  They might be an overkill.

To see the result, we simply use the action:

OUTPUT(similarity_file(Id1 > Id2));

Here is a snippet of the output:

Id1  Id2  FullName1         FullName2         LikeIds1      LikeIds2                              Score
11   1    O'Brien, Miles    Picard, Jean-Luc                '2','3','4','5','6','7','8','9','10'  0   
12   1    Guinan            Picard, Jean-Luc  '1'           '2','3','4','5','6','7','8','9','10'  0
3    2    La Forge, Geordi  Riker, William    '8','9','11'  '3','4','6','7','8'                   0.143
...

As I said earlier, this may not be the most efficient way to compute similarity matrix but it works!

No comments:

Post a Comment