Record_Member_Raw := RECORD
UNSIGNED8 Id;
STRING15 LastName;
STRING15 FirstName;
STRING20 Birthdate;
END;
members_file_raw := DATASET([
{1,'Picard','Jean-Luc','July 13, 2305'},
{2,'Riker','William','2335'},
{3,'La Forge','Geordi','February 16, 2335'},
{4,'Yar','Tasha','2337'},
{5,'Worf','','2340'},
{6,'Crusher','Beverly','October 13, 2324'},
{7,'Troi','Deanna','March 29, 2336'},
{8,'Data','','February 2, 2338'},
{9,'Crusher','Wesley','July 29, 2349'},
{10,'Pulaski','Katherine','2309'},
{11,'O\'Brien','Miles','September 2328'},
{12,'Guinan','','1293'}], Record_Member_Raw);
The likes_file data set contains who each person likes based on their ID. The record type and data set are:
Record_Like := RECORD
UNSIGNED8 SourceId;
UNSIGNED8 TargetId;
END;
likes_file := DATASET([
{1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
{2,3},{2,4},{2,6},{2,7},{2,8},
{3,8},{3,9},{3,11},
{4,5},
{5,7},
{6,7},{6,9},
{7,4},{7,5},{7,6},
{8,3},{8,9},
{9,3},{9,6},{9,8},
{12,1}], Record_Like);
Since we have a set of preferences or likes for each person, I was wondering how I could calculate the similarity between two persons based on their likes using, for example, the Jaccard coefficient. (The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of two sets.) I took a crack at this. While my solution may not be the best or efficient way to go about doing this in ECL, it seems to work.
First, I define the similarity matrix as:
Record_Similarity_Matrix := RECORD
Id1 := members_file_raw.Id;
Id2 := members_file_raw.Id;
STRING30 FullName1;
STRING30 FullName2;
SET OF UNSIGNED8 LikeIds1;
SET OF UNSIGNED8 LikeIds2;
DECIMAL4_3 Score := 0;
END;
The idea is to collect all the likes for each pair of person (Id1 and Id2) into the sets LikeIds1 and LikeIds2. The Score field will hold the computed Jaccard coefficient for sets LikeIds1 and LikeIds2.
Then, we initialize the matrix data set:
Record_Similarity_Matrix InitSimilarityMatrix(Record_Member_Raw L, Record_Member_Raw R) := TRANSFORM
SELF.Id1 := L.Id;
SELF.Id2 := R.Id;
SELF.LikeIds1 := [];
SELF.LikeIds2 := [];
SELF.FullName1 := TRIM(L.LastName) + IF (L.FirstName != '', ', ' + L.FirstName, '');
SELF.FullName2 := TRIM(R.LastName) + IF (R.FirstName != '', ', ' + R.FirstName, '');
END;
blank_similarity_matrix_file := JOIN(members_file_raw, members_file_raw, LEFT.Id >= RIGHT.Id, InitSimilarityMatrix(LEFT, RIGHT), ALL);
The blank_similarity_matrix_file recordset contains 12 * 12 = 144 records. The JOIN condition is limited to LEFT.Id >= RIGHT.Id to save computation time since the Jaccard coefficient J(A, B) is equal to J(B, A).
The next step is to gather up the likes for Id1.
Record_Similarity_Matrix AddLikeIds1(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
SELF.LikeIds1 := L.LikeIds1 + [R.TargetId];
SELF := L;
END;
temp_similarity_matrix_file := DENORMALIZE(blank_similarity_matrix_file, likes_file, LEFT.Id1 = RIGHT.SourceId, AddLikeIds1(LEFT, RIGHT));
For each record (LEFT) in blank_similarity_matrix_file, the DENORMALIZE action finds all the records (RIGHT) from likes_file such that LEFT.Id1 = RIGHT.SourceId. For example, if Id1 is 6, the matching records from likes_file with SourceId = 6 are {6,7} and {6,9}. The transform function AddLikeIds is called for each matching record from likes_file. The matching record is passed in as the RIGHT parameter to the function while the returned result from the previous call is passed in as the LEFT parameter.
We do a similar step to collect the likes for Id2.
Record_Similarity_Matrix AddIds2(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
SELF.LikeIds2 := L.LikeIds2 + [R.TargetId];
SELF := L;
END;
similarity_matrix_file := DENORMALIZE(temp_similarity_matrix_file, likes_file, LEFT.Id2 = RIGHT.SourceId, AddIds2(LEFT, RIGHT));
Now that LikeIds1 and LikeIds2 are populated, we can proceed to computing the Jaccard coefficients:
Record_Similarity_Matrix ComputeSimilarity(Record_Similarity_Matrix L, Record_Similarity_Matrix R) := TRANSFORM
IdSet1 := DATASET(R.LikeIds1, {UNSIGNED8 Id});
IdSet2 := DATASET(R.LikeIds2, {UNSIGNED8 Id});
CombinedIds := IdSet1 + IdSet2;
UnionIds := DEDUP(CombinedIds, LEFT.Id = RIGHT.Id, ALL);
IntersectIds := JOIN(IdSet1, IdSet2, LEFT.Id = RIGHT.Id);
SELF.Score := COUNT(IntersectIds) / COUNT(UnionIds);
SELF := R;
END;
similarity_file := ITERATE(similarity_matrix_file, ComputeSimilarity(LEFT, RIGHT), LOCAL);
The ITERATION action traverses the records in the similarity_matrix_file data set and invokes the ComputeSimilarity function on each record. I don't know if there is an easier way to compute set union and intersection so I use DEDUP and JOIN, respectively. They might be an overkill.
To see the result, we simply use the action:
OUTPUT(similarity_file(Id1 > Id2));
Here is a snippet of the output:
Id1 Id2 FullName1 FullName2 LikeIds1 LikeIds2 Score 11 1 O'Brien, Miles Picard, Jean-Luc
'2','3','4','5','6','7','8','9','10'
0 12 1 Guinan Picard, Jean-Luc '1' '2','3','4','5','6','7','8','9','10' 0 3 2 La Forge, Geordi Riker, William '8','9','11' '3','4','6','7','8' 0.143 ...
As I said earlier, this may not be the most efficient way to compute similarity matrix but it works!
No comments:
Post a Comment