Saturday, October 22, 2016

Joining Data Sets in ECL

So far, I have demonstrated how to initialize, load and parse data in HPCC ECL.  We have been working with a toy data set consisting of person id, names and dates of birth.  In this article, I'll add a second data set and show how to join multiple data sets.

The new data set describes who likes whom denoted by the person ID.  The record type of the data set is:

Record_Like := RECORD
    UNSIGNED8 SourceId;
    UNSIGNED8 TargetId;
END;

And here are some sample data:

likes_file := DATASET([
        {1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
        {2,3},{2,4},{2,6},{2,7},{2,8},
        {3,8},{3,9},{3,11},
        {4,5},
        {5,7},
        {6,7},{6,9},
        {7,4},{7,5},{7,6},
        {8,3},{8,9},
        {9,3},{9,6},{9,8},
        {12,1}], Record_Like);

It's hard to figure out who likes whom just by looking at a pair of IDs in the likes_file data set.  To make it easier, we'll combine the members_file data set we have seen in previous articles with the new likes_file data set so that the person names are spelled out.  The joined results will have the record type:

Record_Named_Like := RECORD
    SourceLastName   := members_file.LastName;
    SourceFirstName  := members_file.FirstName;
    Record_Like;
    TargetLastName   := members_file.LastName;
    TargetFirstName  := members_file.FirstName;
END;

Essentially, Record_Named_Like extends Record_Like with 4 additional fields.  The data types for the SourceLastName, SourceFirstName, TargetLastName and TargetFirstName fields are the same as LastName or FirstName from the members_file data set.

First, we are going to join the 2 data sets so that Record_Member.Id is Record_Like.SourceId.  We'll copy the last and first name from the corresponding Record_Member record from likes_file to a new Record_Named_Like record.

Record_Named_Like JoinMembersAndLikes(Record_Member L, Record_Like R) := TRANSFORM
    SELF.SourceLastName    := L.LastName;
    SELF.SourceFirstName   := L.FirstName;
    SELF.TargetLastName    := '';
    SELF.TargetFirstName   := '';    
    SELF := R;
END;

member_likes_file := JOIN(members_file, likes_file, LEFT.Id=RIGHT.SourceId, JoinMembersAndLikes(LEFT, RIGHT));

The JOIN action joins the two data sets based on the condition LEFT.Id=RIGHT.SourceId where LEFT refers to members_file and RIGHT refers to likes_file.  During joining, the transform function JoinMemberAndLikes is called to produce the resulting joint record of type Record_Named_Like.

We do another similar join such that Record_Member.Id is equal to Record_Named_Like.TargetId to identify the name of the liked person.

Record_Named_Like JoinLikesAndMembers(Record_Member L, Record_Named_Like R) := TRANSFORM
    SELF.TargetLastName    := L.LastName;
    SELF.TargetFirstName   := L.FirstName;
    SELF := R;
END;

member_likes_member_file := JOIN(members_file, member_likes_file, LEFT.Id = RIGHT.TargetId, JoinLikesAndMembers(LEFT, RIGHT));

After both joins, the final data set member_likes_member_file looks like:

SourceLastName   SourceFirstName   SourceId   Target   TargetLastName   TargetFirstName
Guinan                             12         1        Picard           Jean-Luc
Picard           Jean-Luc          1          2        Riker            William
...

Now, we can easily spot who likes whom by name.  Let's go a bit further by doing some summary statistics and reporting.  We want to generate a report that tells us how many likes there are for each person and of those likes how many are for Data and Worf.  Precisely, the record type for the report is:

Record_Likes_Stats := RECORD
    LastName    := member_likes_member_file.SourceLastName;
    FirstName   := member_likes_member_file.SourceFirstName;
    LikesData   := COUNT(GROUP, member_likes_member_file.TargetLastName = 'Data');
    LikesWorf   := COUNT(GROUP, member_likes_member_file.TargetLastName = 'Worf');
    TotalLikes  := COUNT(GROUP);
END;

In SQL parlance, the first 2 fields (LastName and FirstName) in Record_Likes_Stats are the keys for our upcoming group-by query while the remaining fields (LikesData, LikesWorf and TotalLikes) are the aggregate functions performed on each group.  For example, TotalLikes contains the count for each unique group of last and first name.  LikesData contains the portion of the group count satisfying the filter condition of TargetLastName = 'Data'.

The group-by query is done by the TABLE action:

member_likes_stats := TABLE(member_likes_member_file, Record_Likes_Stats, SourceLastName, SourceFirstName);

The parameters to the TABLE action are the input data set, the returning record type and the group-by keys.  Here is what the output looks like if you output member_likes_stats:

LastName   FirstName   LikesData   LikesWorf   TotalLikes
Crusher    Beverly     0           0           2
Crusher    Wesley      1           0           3
... 

To limit the output to only those who like Data or Worf, you can add a filter to the OUTPUT action:

OUTPUT(member_likes_stats(LikesData > 0 OR LikesWorf > 0));

In this article, I have introduced two new ECL actions which are JOIN and TABLE.  There are other ways of using these actions which I haven't described.  To learn more, you can check out the ECL Language Reference.





No comments:

Post a Comment