Saturday, October 22, 2016

Parsing and Extracting Data in ECL

As promised last time, I'll show how to pre-process data in ECL by parsing the date of birth from the format of "Month Day, Year" into separate month, day and year fields based on what I've learned so far about ECL.  For demonstration, I'll continue to use the toy data set (members_file_raw) we have created previously.

The date of birth will be processed in 2 steps.  The steps I'm about to show may not be the most efficient, cleverest or fastest way of extracting data but they work.  First, we declare the regular expressions for parsing the date of birth.

PATTERN regex_month := PATTERN('[a-z]+');
PATTERN regex_day := PATTERN('[0-9]{1,2}');
PATTERN regex_year := PATTERN('[0-9]{4}');
PATTERN regex_month_seg := regex_month [' '];
PATTERN regex_day_seg := regex_day [', '];
PATTERN regex_date := regex_month_seg? regex_day_seg? regex_year;
PATTERN regex_birthdate := regex_date?;

Basically, we're declaring that the date of birth consists of an optional month name, an optional day and a mandatory year.  The month name and day are delimited by an empty space while the day and year are delimited by a comma and empty space.

Next comes the data type for storing the processed record:

Record_Member := RECORD
    members_file_raw;
    STRING9 MonthStr := MATCHTEXT(regex_month[1]);
    UNSIGNED1 Month  := 0;
    UNSIGNED1 Day    := (INTEGER) MATCHTEXT(regex_day[1]);
    UNSIGNED2 Year   := (INTEGER) MATCHTEXT(regex_year[1]);
    UNSIGNED2 Age    := 0;
END;

In the previous article, we have declared record type Record_Member_Raw which consists of 4 fields - Id, LastName, FirstName and Birthdate, and members_file_raw is a data set composing of Record_Member_Raw records.  Here, we are declaring that Record_Member extends Record_Member_Raw by including all the fields from the latter.  In addition, Record_Member includes 5 new fields.  The MonthStr, Day and Year fields store the month name, day and year, respectively, to be parsed from a date of birth string such as "January 1, 2017".  The Month field stores the month as a number.  We'll be using the Age field to store the age later.

With the regular expressions and record type defined, we can proceed to parsing the data.

members_file_temp := PARSE(members_file_raw, Birthdate, regex_birthdate, Record_Member, MAX, NOCASE);

Here, we're instructing the PARSE function to parse the Birthdate field of the members_file_raw data set using the regex_birthdate pattern and output the results based on Record_Member record type. The NOCASE flag forces the pattern matching to ignore case.

To see what the parse result looks like, add the action:

OUTPUT(members_file_temp);

This will output something like:

id   lastname   firstname   birthdate       monthstr   month   day   year   age
1    Picard     Jean-Luc    July 13, 2305   July       0       13    2305   0
2    Riker      William     2335                       0       0     2335   0
...

Our next tasks are to convert the month name to number and calculate the age.  To do this, we define a dictionary which maps month name to number.

months_file := DATASET([{'Jan', 1},
                        {'Feb', 2},
                        {'Mar', 3},
                        {'Apr', 4},
                        {'May', 5},
                        {'Jun', 6},
                        {'Jul', 7},
                        {'Aug', 8},
                        {'Sep', 9},
                        {'Oct', 10},
                        {'Nov', 11},
                        {'Dec', 12}], {STRING3 name, UNSIGNED1 number});

MonthNameToNumber := DICTIONARY(months_file, {name => number});

You might have noticed that the full month name is not spelled out in the dictionary.  That's because we'll truncate the month name to the 3 first letters when we process the data:

CurrentYear := 2362;

CalculateAge(UNSIGNED2 birthYear) := CurrentYear - birthYear;

Record_Member NumerateMonth(Record_Member L, Record_Member R) := TRANSFORM
    MonthSubstr  := R.MonthStr[1..3];
    SELF.Month   := MonthNameToNumber[MonthSubstr].number;
    SELF.Age     := CalculateAge(R.Year);
    SELF         := R;
END;

members_file := ITERATE(members_file_temp, NumerateMonth(LEFT, RIGHT), LOCAL);

The ITERATE function traverses the members_file_temp data and invokes the NumerateMonth transform function on each record.  Each time the NumerateMonth is called, the to-be-transformed record is passed in as the RIGHT parameter while the previously transformed record as the LEFT parameter.  You can think of ITERATE function as having a memory of 1 record.  It remembers the result of the previous NumerateMonth call and passes it to the next NumerateMonth call.

The NumerateMonth transform function returns a Record_Member record with its fields initialized from the data of the given R record.  For example, the Month field is set by looking up the month number using the first 3 letters of R.MonthStr from the MonthNameToNumber dictionary.  The Age field is initialized by calling the CalculateAge function.   Any remaining uninitialized field values are copied from the corresponding fields in R.

To check that month number and age are extracted correctly, let's dump the result:

OUTPUT(members_file);

This will output something like:

id   lastname   firstname   birthdate       monthstr   month   day   year   age
1    Picard     Jean-Luc    July 13, 2305   July       7       13    2305   57
2    Riker      William     2335                       0       0     2335   27
...

Now that we have the month number and age calculated, let's do some simple data exploration:

total_members := COUNT(members_file);
OUTPUT(total_members);

older_members := members_file(age >= 50);
OUTPUT(older_members);

feb_or_jul_born_members := members_file(month = 2 OR month = 7);
OUTPUT(feb_or_jul_born_members);

So far, we have only worked with a single data set.  A data set in ECL is somewhat similar to a database table.  In the next article, I'll add a second data set and show how to join multiple data sets together.


No comments:

Post a Comment