Saturday, October 22, 2016

Computing Similarity Matrix in ECL

In the previous introductory articles on ECL, I have created two toy sets.  The members_file_raw data set and the likes_file data set.  The members_file_raw data set contains 12 persons with their ID, name and date of birth.  The record type and data set are:

Record_Member_Raw := RECORD
    UNSIGNED8 Id;
    STRING15 LastName;
    STRING15 FirstName;
    STRING20 Birthdate;
END;

members_file_raw := DATASET([
    {1,'Picard','Jean-Luc','July 13, 2305'},
    {2,'Riker','William','2335'},
    {3,'La Forge','Geordi','February 16, 2335'},
    {4,'Yar','Tasha','2337'},
    {5,'Worf','','2340'},
    {6,'Crusher','Beverly','October 13, 2324'},
    {7,'Troi','Deanna','March 29, 2336'},
    {8,'Data','','February 2, 2338'},
    {9,'Crusher','Wesley','July 29, 2349'},
    {10,'Pulaski','Katherine','2309'},
    {11,'O\'Brien','Miles','September 2328'},
    {12,'Guinan','','1293'}], Record_Member_Raw);

The likes_file data set contains who each person likes based on their ID.  The record type and data set are:

Record_Like := RECORD
    UNSIGNED8 SourceId;
    UNSIGNED8 TargetId;
END;

likes_file := DATASET([
        {1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
        {2,3},{2,4},{2,6},{2,7},{2,8},
        {3,8},{3,9},{3,11},
        {4,5},
        {5,7},
        {6,7},{6,9},
        {7,4},{7,5},{7,6},
        {8,3},{8,9},
        {9,3},{9,6},{9,8},
        {12,1}], Record_Like);

Since we have a set of preferences or likes for each person, I was wondering how I could calculate the similarity between two persons based on their likes using, for example, the Jaccard coefficient.  (The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of two sets.)  I took a crack at this.  While my solution may not be the best or efficient way to go about doing this in ECL, it seems to work.

First, I define the similarity matrix as:

Record_Similarity_Matrix := RECORD
    Id1 := members_file_raw.Id;
    Id2 := members_file_raw.Id;
    STRING30 FullName1;
    STRING30 FullName2;
    SET OF UNSIGNED8 LikeIds1;
    SET OF UNSIGNED8 LikeIds2;
    DECIMAL4_3 Score := 0;
END;

The idea is to collect all the likes for each pair of person (Id1 and Id2) into the sets LikeIds1 and LikeIds2.  The Score field will hold the computed Jaccard coefficient for sets LikeIds1 and LikeIds2.

Then, we initialize the matrix data set:

Record_Similarity_Matrix InitSimilarityMatrix(Record_Member_Raw L, Record_Member_Raw R) := TRANSFORM
    SELF.Id1    := L.Id;
    SELF.Id2    := R.Id;
    SELF.LikeIds1   := [];
    SELF.LikeIds2   := [];
    SELF.FullName1  := TRIM(L.LastName) + IF (L.FirstName != '', ', ' + L.FirstName, '');
    SELF.FullName2  := TRIM(R.LastName) + IF (R.FirstName != '', ', ' + R.FirstName, '');
END;

blank_similarity_matrix_file := JOIN(members_file_raw, members_file_raw, LEFT.Id >= RIGHT.Id, InitSimilarityMatrix(LEFT, RIGHT), ALL);

The blank_similarity_matrix_file recordset contains 12 * 12 = 144 records.  The JOIN condition is limited to LEFT.Id >= RIGHT.Id to save computation time since the Jaccard coefficient J(A, B) is equal to J(B, A).

The next step is to gather up the likes for Id1.

Record_Similarity_Matrix AddLikeIds1(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
    SELF.LikeIds1 := L.LikeIds1 + [R.TargetId];
    SELF := L;
END;

temp_similarity_matrix_file := DENORMALIZE(blank_similarity_matrix_file, likes_file, LEFT.Id1 = RIGHT.SourceId, AddLikeIds1(LEFT, RIGHT));

For each record (LEFT) in blank_similarity_matrix_file, the DENORMALIZE action finds all the records (RIGHT) from likes_file such that LEFT.Id1 = RIGHT.SourceId.  For example, if Id1 is 6, the matching records from likes_file with SourceId = 6 are {6,7} and {6,9}.  The transform function AddLikeIds is called for each matching record from likes_file.  The matching record is passed in as the RIGHT parameter to the function while the returned result from the previous call is passed in as the LEFT parameter.

We do a similar step to collect the likes for Id2.

Record_Similarity_Matrix AddIds2(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
    SELF.LikeIds2 := L.LikeIds2 + [R.TargetId];
    SELF := L;
END;

similarity_matrix_file := DENORMALIZE(temp_similarity_matrix_file, likes_file, LEFT.Id2 = RIGHT.SourceId, AddIds2(LEFT, RIGHT));

Now that LikeIds1 and LikeIds2 are populated, we can proceed to computing the Jaccard coefficients:

Record_Similarity_Matrix ComputeSimilarity(Record_Similarity_Matrix L, Record_Similarity_Matrix R) := TRANSFORM
    IdSet1 := DATASET(R.LikeIds1, {UNSIGNED8 Id});
    IdSet2 := DATASET(R.LikeIds2, {UNSIGNED8 Id});
    CombinedIds   := IdSet1 + IdSet2;
    UnionIds      := DEDUP(CombinedIds, LEFT.Id = RIGHT.Id, ALL);
    IntersectIds  := JOIN(IdSet1, IdSet2, LEFT.Id = RIGHT.Id);
    SELF.Score    := COUNT(IntersectIds) / COUNT(UnionIds);
    SELF          := R;
END;

similarity_file := ITERATE(similarity_matrix_file, ComputeSimilarity(LEFT, RIGHT), LOCAL);

The ITERATION action traverses the records in the similarity_matrix_file data set and invokes the ComputeSimilarity function on each record.  I don't know if there is an easier way to compute set union and intersection so I use DEDUP and JOIN, respectively.  They might be an overkill.

To see the result, we simply use the action:

OUTPUT(similarity_file(Id1 > Id2));

Here is a snippet of the output:

Id1  Id2  FullName1         FullName2         LikeIds1      LikeIds2                              Score
11   1    O'Brien, Miles    Picard, Jean-Luc                '2','3','4','5','6','7','8','9','10'  0   
12   1    Guinan            Picard, Jean-Luc  '1'           '2','3','4','5','6','7','8','9','10'  0
3    2    La Forge, Geordi  Riker, William    '8','9','11'  '3','4','6','7','8'                   0.143
...

As I said earlier, this may not be the most efficient way to compute similarity matrix but it works!

Joining Data Sets in ECL

So far, I have demonstrated how to initialize, load and parse data in HPCC ECL.  We have been working with a toy data set consisting of person id, names and dates of birth.  In this article, I'll add a second data set and show how to join multiple data sets.

The new data set describes who likes whom denoted by the person ID.  The record type of the data set is:

Record_Like := RECORD
    UNSIGNED8 SourceId;
    UNSIGNED8 TargetId;
END;

And here are some sample data:

likes_file := DATASET([
        {1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
        {2,3},{2,4},{2,6},{2,7},{2,8},
        {3,8},{3,9},{3,11},
        {4,5},
        {5,7},
        {6,7},{6,9},
        {7,4},{7,5},{7,6},
        {8,3},{8,9},
        {9,3},{9,6},{9,8},
        {12,1}], Record_Like);

It's hard to figure out who likes whom just by looking at a pair of IDs in the likes_file data set.  To make it easier, we'll combine the members_file data set we have seen in previous articles with the new likes_file data set so that the person names are spelled out.  The joined results will have the record type:

Record_Named_Like := RECORD
    SourceLastName   := members_file.LastName;
    SourceFirstName  := members_file.FirstName;
    Record_Like;
    TargetLastName   := members_file.LastName;
    TargetFirstName  := members_file.FirstName;
END;

Essentially, Record_Named_Like extends Record_Like with 4 additional fields.  The data types for the SourceLastName, SourceFirstName, TargetLastName and TargetFirstName fields are the same as LastName or FirstName from the members_file data set.

First, we are going to join the 2 data sets so that Record_Member.Id is Record_Like.SourceId.  We'll copy the last and first name from the corresponding Record_Member record from likes_file to a new Record_Named_Like record.

Record_Named_Like JoinMembersAndLikes(Record_Member L, Record_Like R) := TRANSFORM
    SELF.SourceLastName    := L.LastName;
    SELF.SourceFirstName   := L.FirstName;
    SELF.TargetLastName    := '';
    SELF.TargetFirstName   := '';    
    SELF := R;
END;

member_likes_file := JOIN(members_file, likes_file, LEFT.Id=RIGHT.SourceId, JoinMembersAndLikes(LEFT, RIGHT));

The JOIN action joins the two data sets based on the condition LEFT.Id=RIGHT.SourceId where LEFT refers to members_file and RIGHT refers to likes_file.  During joining, the transform function JoinMemberAndLikes is called to produce the resulting joint record of type Record_Named_Like.

We do another similar join such that Record_Member.Id is equal to Record_Named_Like.TargetId to identify the name of the liked person.

Record_Named_Like JoinLikesAndMembers(Record_Member L, Record_Named_Like R) := TRANSFORM
    SELF.TargetLastName    := L.LastName;
    SELF.TargetFirstName   := L.FirstName;
    SELF := R;
END;

member_likes_member_file := JOIN(members_file, member_likes_file, LEFT.Id = RIGHT.TargetId, JoinLikesAndMembers(LEFT, RIGHT));

After both joins, the final data set member_likes_member_file looks like:

SourceLastName   SourceFirstName   SourceId   Target   TargetLastName   TargetFirstName
Guinan                             12         1        Picard           Jean-Luc
Picard           Jean-Luc          1          2        Riker            William
...

Now, we can easily spot who likes whom by name.  Let's go a bit further by doing some summary statistics and reporting.  We want to generate a report that tells us how many likes there are for each person and of those likes how many are for Data and Worf.  Precisely, the record type for the report is:

Record_Likes_Stats := RECORD
    LastName    := member_likes_member_file.SourceLastName;
    FirstName   := member_likes_member_file.SourceFirstName;
    LikesData   := COUNT(GROUP, member_likes_member_file.TargetLastName = 'Data');
    LikesWorf   := COUNT(GROUP, member_likes_member_file.TargetLastName = 'Worf');
    TotalLikes  := COUNT(GROUP);
END;

In SQL parlance, the first 2 fields (LastName and FirstName) in Record_Likes_Stats are the keys for our upcoming group-by query while the remaining fields (LikesData, LikesWorf and TotalLikes) are the aggregate functions performed on each group.  For example, TotalLikes contains the count for each unique group of last and first name.  LikesData contains the portion of the group count satisfying the filter condition of TargetLastName = 'Data'.

The group-by query is done by the TABLE action:

member_likes_stats := TABLE(member_likes_member_file, Record_Likes_Stats, SourceLastName, SourceFirstName);

The parameters to the TABLE action are the input data set, the returning record type and the group-by keys.  Here is what the output looks like if you output member_likes_stats:

LastName   FirstName   LikesData   LikesWorf   TotalLikes
Crusher    Beverly     0           0           2
Crusher    Wesley      1           0           3
... 

To limit the output to only those who like Data or Worf, you can add a filter to the OUTPUT action:

OUTPUT(member_likes_stats(LikesData > 0 OR LikesWorf > 0));

In this article, I have introduced two new ECL actions which are JOIN and TABLE.  There are other ways of using these actions which I haven't described.  To learn more, you can check out the ECL Language Reference.





Parsing and Extracting Data in ECL

As promised last time, I'll show how to pre-process data in ECL by parsing the date of birth from the format of "Month Day, Year" into separate month, day and year fields based on what I've learned so far about ECL.  For demonstration, I'll continue to use the toy data set (members_file_raw) we have created previously.

The date of birth will be processed in 2 steps.  The steps I'm about to show may not be the most efficient, cleverest or fastest way of extracting data but they work.  First, we declare the regular expressions for parsing the date of birth.

PATTERN regex_month := PATTERN('[a-z]+');
PATTERN regex_day := PATTERN('[0-9]{1,2}');
PATTERN regex_year := PATTERN('[0-9]{4}');
PATTERN regex_month_seg := regex_month [' '];
PATTERN regex_day_seg := regex_day [', '];
PATTERN regex_date := regex_month_seg? regex_day_seg? regex_year;
PATTERN regex_birthdate := regex_date?;

Basically, we're declaring that the date of birth consists of an optional month name, an optional day and a mandatory year.  The month name and day are delimited by an empty space while the day and year are delimited by a comma and empty space.

Next comes the data type for storing the processed record:

Record_Member := RECORD
    members_file_raw;
    STRING9 MonthStr := MATCHTEXT(regex_month[1]);
    UNSIGNED1 Month  := 0;
    UNSIGNED1 Day    := (INTEGER) MATCHTEXT(regex_day[1]);
    UNSIGNED2 Year   := (INTEGER) MATCHTEXT(regex_year[1]);
    UNSIGNED2 Age    := 0;
END;

In the previous article, we have declared record type Record_Member_Raw which consists of 4 fields - Id, LastName, FirstName and Birthdate, and members_file_raw is a data set composing of Record_Member_Raw records.  Here, we are declaring that Record_Member extends Record_Member_Raw by including all the fields from the latter.  In addition, Record_Member includes 5 new fields.  The MonthStr, Day and Year fields store the month name, day and year, respectively, to be parsed from a date of birth string such as "January 1, 2017".  The Month field stores the month as a number.  We'll be using the Age field to store the age later.

With the regular expressions and record type defined, we can proceed to parsing the data.

members_file_temp := PARSE(members_file_raw, Birthdate, regex_birthdate, Record_Member, MAX, NOCASE);

Here, we're instructing the PARSE function to parse the Birthdate field of the members_file_raw data set using the regex_birthdate pattern and output the results based on Record_Member record type. The NOCASE flag forces the pattern matching to ignore case.

To see what the parse result looks like, add the action:

OUTPUT(members_file_temp);

This will output something like:

id   lastname   firstname   birthdate       monthstr   month   day   year   age
1    Picard     Jean-Luc    July 13, 2305   July       0       13    2305   0
2    Riker      William     2335                       0       0     2335   0
...

Our next tasks are to convert the month name to number and calculate the age.  To do this, we define a dictionary which maps month name to number.

months_file := DATASET([{'Jan', 1},
                        {'Feb', 2},
                        {'Mar', 3},
                        {'Apr', 4},
                        {'May', 5},
                        {'Jun', 6},
                        {'Jul', 7},
                        {'Aug', 8},
                        {'Sep', 9},
                        {'Oct', 10},
                        {'Nov', 11},
                        {'Dec', 12}], {STRING3 name, UNSIGNED1 number});

MonthNameToNumber := DICTIONARY(months_file, {name => number});

You might have noticed that the full month name is not spelled out in the dictionary.  That's because we'll truncate the month name to the 3 first letters when we process the data:

CurrentYear := 2362;

CalculateAge(UNSIGNED2 birthYear) := CurrentYear - birthYear;

Record_Member NumerateMonth(Record_Member L, Record_Member R) := TRANSFORM
    MonthSubstr  := R.MonthStr[1..3];
    SELF.Month   := MonthNameToNumber[MonthSubstr].number;
    SELF.Age     := CalculateAge(R.Year);
    SELF         := R;
END;

members_file := ITERATE(members_file_temp, NumerateMonth(LEFT, RIGHT), LOCAL);

The ITERATE function traverses the members_file_temp data and invokes the NumerateMonth transform function on each record.  Each time the NumerateMonth is called, the to-be-transformed record is passed in as the RIGHT parameter while the previously transformed record as the LEFT parameter.  You can think of ITERATE function as having a memory of 1 record.  It remembers the result of the previous NumerateMonth call and passes it to the next NumerateMonth call.

The NumerateMonth transform function returns a Record_Member record with its fields initialized from the data of the given R record.  For example, the Month field is set by looking up the month number using the first 3 letters of R.MonthStr from the MonthNameToNumber dictionary.  The Age field is initialized by calling the CalculateAge function.   Any remaining uninitialized field values are copied from the corresponding fields in R.

To check that month number and age are extracted correctly, let's dump the result:

OUTPUT(members_file);

This will output something like:

id   lastname   firstname   birthdate       monthstr   month   day   year   age
1    Picard     Jean-Luc    July 13, 2305   July       7       13    2305   57
2    Riker      William     2335                       0       0     2335   27
...

Now that we have the month number and age calculated, let's do some simple data exploration:

total_members := COUNT(members_file);
OUTPUT(total_members);

older_members := members_file(age >= 50);
OUTPUT(older_members);

feb_or_jul_born_members := members_file(month = 2 OR month = 7);
OUTPUT(feb_or_jul_born_members);

So far, we have only worked with a single data set.  A data set in ECL is somewhat similar to a database table.  In the next article, I'll add a second data set and show how to join multiple data sets together.


Friday, October 21, 2016

Initialize, Load and Save Data in ECL

In a previous article, I introduced HPCC and ECL, the data-centric declarative programming language for HPCC.  In this first article on ECL, I'll share what I have learned about ECL after experimenting with it for a while.  As ECL and HPCC are about extract, transform and load data, I'll demonstrate using a toy data set consisting of person names and dates of birth. I'll continue to use this data set in the following articles on ECL.

Without further adieu, we first declare the format of the incoming data.

Record_Member_Raw := RECORD
    UNSIGNED8 Id;
    STRING15 LastName;
    STRING15 FirstName;
    STRING20 Birthdate;
END;

The syntax is rather intuitive. We're declaring a data row or record consisting of 4 fields and their data types are 8 bytes unsigned integer, 15 bytes string, 15 bytes string and 20 bytes string.  ECL strings are space padded and not null-terminated.  Also, ECL is not case sensitive so Record_Member_Raw is same as record_member_raw, LastName is same as lastname, and UNSIGNED8 is same as unsigned8.

In the real world, you will be getting your data from an existing data source but in this case, I'm hardcoding the data manually to members_file_raw:

members_file_raw := DATASET([
    {1,'Picard','Jean-Luc','July 13, 2305'},
    {2,'Riker','William','2335'},
    {3,'La Forge','Geordi','February 16, 2335'},
    {4,'Yar','Tasha','2337'},
    {5,'Worf','','2340'},
    {6,'Crusher','Beverly','October 13, 2324'},
    {7,'Troi','Deanna','March 29, 2336'},
    {8,'Data','','February 2, 2338'},
    {9,'Crusher','Wesley','July 29, 2349'},
    {10,'Pulaski','Katherine','2309'},
    {11,'O\'Brien','Miles','September 2328'},
    {12,'Guinan','','1293'}], Record_Member_Raw);

Our data set consists of 12 records.  The records are of type Record_Member_Raw.  To display or output the data set or recordset result, add the following code to your ECL script.

OUTPUT(members_file_raw);

OUTPUT(members_file_raw(lastname='Data' OR id=3));

OUTPUT(members_file_raw[1]);

OUTPUT(members_file_raw[1..3]); 

The first output dumps the entire data set.  The second selects only records meeting the filter condition.  The third outputs only the first record.  Note, ECL indexing starts with 1 and not 0.  Indexing can also be a range like in the last output which returns the first 3 records.  You can also save the output to file:

OUTPUT(members_file_raw, ,'~FOO::BAR::Members', OVERWRITE);

The OUTPUT action will be used frequently in debugging ECL code.  To learn about what each ECL action does, the ECL Language Reference is your best (and only) source of help.

Remember earlier I said that in the real world, you will be getting your data from an existing data source.  Well, now you have a data source which is the file you just created.  To load the file, the command is:

loaded_members_file := DATASET('~FOO::BAR::Members', Record_Member_Raw, THOR);

In this article, I showed how to initialize, load and save data in ECL.  In the next article, I'll pre-process the data by parsing the date of birth into separate month, day and year fields.


Thursday, October 20, 2016

First Encounter with HPCC and ECL

I have been trying out HPCC for a few days now.  HPCC is an open source big data platform developed by LexisNexis which can run on commodity computing clusters like Amazon AWS.  HPCC includes its own declarative programming language called ECL to extract, transform and load large scale data.  Being a declarative programming language, it belongs to the same class of languages as SQL as opposed to imperative languages like C/C++, Java, Python or PHP.  On first sight though, ECL looks a bit like C++.  Perhaps it has to do with the fine granularity of control permissible by ECL such as file pointers and bytes allocation as we shall see.


Setting up

So how does one get started with HPCC quickly and freely?  For me, the path of least resistance was to download and install the HPCC virtual image and run it with VirtualBox.  This will deploy a pre-configured linux based HPCC guest server running in your local machine for demo purposes.

Next, you'll need to install an IDE to interact with the server and expedite ECL programming.
Though an IDE is not required and command line interface tools are available for download, many of the online tutorials assume you're using an IDE, in particular, the ECL IDE for Windows.  I was lucky to have my old copy of Windows 7 lying around which I installed ECL IDE to.  So now I have both HPCC server and Windows 7 with ECL IDE running as guests on my Mac using VirtualBox and everything is working okay (after some hiccups).


Getting Acquainted

While there are various learning resources available on the HPCC website, they are scattered on different pages.  It can be a flustering experience not knowing which ones you should start with and in what order.  Also, some of the resources are seemingly locked away for customers only or require access credentials.  Hopefully, by the time you're reading this, the resources are better organized.

  1. In hindsight, I would start by reading Running HPCC in a Virtual Machine to help with the installation and usage of ECL IDE.

  2. To gain a little bit more insights into ECL, I read HPCC Data Tutorial and followed the short programming examples.

  3. Depending on your preference, you could watch some of the short tutorial videos.

  4. What helped me the most so far is the ECL Programmers Guide.  It's my Rosetta stone to ECL.  I hope they would continue to expand the guide and coverage with more examples.  When reading the guide, you would need to frequently consult the ECL Language Reference.

I haven't read everything there is yet and most likely, there are other useful resources I haven't stumbled upon yet.  Hopefully, these are enough to get you started with HPCC.  In my next article, I'll share what I've learned so far on programming with ECL.