Saturday, October 22, 2016

Computing Similarity Matrix in ECL

In the previous introductory articles on ECL, I created two toy data sets: members_file_raw and likes_file.  The members_file_raw data set contains 12 persons with their ID, name and date of birth.  The record type and data set are:

Record_Member_Raw := RECORD
    UNSIGNED8 Id;
    STRING15 LastName;
    STRING15 FirstName;
    STRING20 Birthdate;
END;

members_file_raw := DATASET([
    {1,'Picard','Jean-Luc','July 13, 2305'},
    {2,'Riker','William','2335'},
    {3,'La Forge','Geordi','February 16, 2335'},
    {4,'Yar','Tasha','2337'},
    {5,'Worf','','2340'},
    {6,'Crusher','Beverly','October 13, 2324'},
    {7,'Troi','Deanna','March 29, 2336'},
    {8,'Data','','February 2, 2338'},
    {9,'Crusher','Wesley','July 29, 2349'},
    {10,'Pulaski','Katherine','2309'},
    {11,'O\'Brien','Miles','September 2328'},
    {12,'Guinan','','1293'}], Record_Member_Raw);

The likes_file data set contains who each person likes based on their ID.  The record type and data set are:

Record_Like := RECORD
    UNSIGNED8 SourceId;
    UNSIGNED8 TargetId;
END;

likes_file := DATASET([
        {1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
        {2,3},{2,4},{2,6},{2,7},{2,8},
        {3,8},{3,9},{3,11},
        {4,5},
        {5,7},
        {6,7},{6,9},
        {7,4},{7,5},{7,6},
        {8,3},{8,9},
        {9,3},{9,6},{9,8},
        {12,1}], Record_Like);

Since we have a set of preferences or likes for each person, I was wondering how I could calculate the similarity between two persons based on their likes using, for example, the Jaccard coefficient.  (The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of two sets.)  I took a crack at this.  While my solution may not be the best or most efficient way to go about it in ECL, it seems to work.
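Before translating it into ECL, the computation itself is easy to state.  Here is a small Python sketch of the Jaccard coefficient, purely for illustration (it is not part of the ECL solution):

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (taken as 0 when both sets are empty)."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# La Forge (3) likes {8, 9, 11}; Riker (2) likes {3, 4, 6, 7, 8}.
# One shared like (8) out of 7 distinct likes gives 1/7.
print(round(jaccard({8, 9, 11}, {3, 4, 6, 7, 8}), 3))  # 0.143
```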

First, I define the similarity matrix as:

Record_Similarity_Matrix := RECORD
    Id1 := members_file_raw.Id;
    Id2 := members_file_raw.Id;
    STRING30 FullName1;
    STRING30 FullName2;
    SET OF UNSIGNED8 LikeIds1;
    SET OF UNSIGNED8 LikeIds2;
    DECIMAL4_3 Score := 0;
END;

The idea is to collect all the likes for each pair of persons (Id1 and Id2) into the sets LikeIds1 and LikeIds2.  The Score field will hold the computed Jaccard coefficient for sets LikeIds1 and LikeIds2.

Then, we initialize the matrix data set:

Record_Similarity_Matrix InitSimilarityMatrix(Record_Member_Raw L, Record_Member_Raw R) := TRANSFORM
    SELF.Id1    := L.Id;
    SELF.Id2    := R.Id;
    SELF.LikeIds1   := [];
    SELF.LikeIds2   := [];
    SELF.FullName1  := TRIM(L.LastName) + IF (L.FirstName != '', ', ' + L.FirstName, '');
    SELF.FullName2  := TRIM(R.LastName) + IF (R.FirstName != '', ', ' + R.FirstName, '');
END;

blank_similarity_matrix_file := JOIN(members_file_raw, members_file_raw, LEFT.Id >= RIGHT.Id, InitSimilarityMatrix(LEFT, RIGHT), ALL);

The blank_similarity_matrix_file recordset contains 78 records rather than all 12 * 12 = 144 pairs because the JOIN condition is limited to LEFT.Id >= RIGHT.Id.  This saves computation time since the Jaccard coefficient is symmetric: J(A, B) is equal to J(B, A).

The next step is to gather up the likes for Id1.

Record_Similarity_Matrix AddLikeIds1(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
    SELF.LikeIds1 := L.LikeIds1 + [R.TargetId];
    SELF := L;
END;

temp_similarity_matrix_file := DENORMALIZE(blank_similarity_matrix_file, likes_file, LEFT.Id1 = RIGHT.SourceId, AddLikeIds1(LEFT, RIGHT));

For each record (LEFT) in blank_similarity_matrix_file, DENORMALIZE finds all the records (RIGHT) from likes_file such that LEFT.Id1 = RIGHT.SourceId.  For example, if Id1 is 6, the matching records from likes_file with SourceId = 6 are {6,7} and {6,9}.  The transform function AddLikeIds1 is called once for each matching record from likes_file.  The matching record is passed in as the RIGHT parameter, while the result returned from the previous call is passed in as the LEFT parameter.
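Conceptually, this use of DENORMALIZE behaves like a left fold over the matching child records.  A rough Python analogy (the field names mirror the ECL, but the code is only an illustration, not how ECL executes it):

```python
from functools import reduce

likes = [(1, 2), (6, 7), (6, 9), (9, 3)]  # (SourceId, TargetId) pairs

def add_like_id(left, right):
    # LEFT is the result of the previous call; RIGHT is the matching like record
    return {**left, "LikeIds1": left["LikeIds1"] + [right[1]]}

blank = {"Id1": 6, "Id2": 3, "LikeIds1": []}
matching = [r for r in likes if r[0] == blank["Id1"]]
row = reduce(add_like_id, matching, blank)
print(row["LikeIds1"])  # [7, 9]
```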

We do a similar step to collect the likes for Id2.

Record_Similarity_Matrix AddIds2(Record_Similarity_Matrix L, Record_Like R) := TRANSFORM
    SELF.LikeIds2 := L.LikeIds2 + [R.TargetId];
    SELF := L;
END;

similarity_matrix_file := DENORMALIZE(temp_similarity_matrix_file, likes_file, LEFT.Id2 = RIGHT.SourceId, AddIds2(LEFT, RIGHT));

Now that LikeIds1 and LikeIds2 are populated, we can proceed to computing the Jaccard coefficients:

Record_Similarity_Matrix ComputeSimilarity(Record_Similarity_Matrix L, Record_Similarity_Matrix R) := TRANSFORM
    IdSet1 := DATASET(R.LikeIds1, {UNSIGNED8 Id});
    IdSet2 := DATASET(R.LikeIds2, {UNSIGNED8 Id});
    CombinedIds   := IdSet1 + IdSet2;
    UnionIds      := DEDUP(CombinedIds, LEFT.Id = RIGHT.Id, ALL);
    IntersectIds  := JOIN(IdSet1, IdSet2, LEFT.Id = RIGHT.Id);
    SELF.Score    := COUNT(IntersectIds) / COUNT(UnionIds);
    SELF          := R;
END;

similarity_file := ITERATE(similarity_matrix_file, ComputeSimilarity(LEFT, RIGHT), LOCAL);

The ITERATE function traverses the records in the similarity_matrix_file data set and invokes the ComputeSimilarity transform on each record.  I don't know whether there is an easier way to compute set union and intersection, so I use DEDUP and JOIN, respectively.  They might be overkill.

To see the result, we simply use the action:

OUTPUT(similarity_file(Id1 > Id2));

Here is a snippet of the output:

Id1  Id2  FullName1         FullName2         LikeIds1      LikeIds2                              Score
11   1    O'Brien, Miles    Picard, Jean-Luc                '2','3','4','5','6','7','8','9','10'  0   
12   1    Guinan            Picard, Jean-Luc  '1'           '2','3','4','5','6','7','8','9','10'  0
3    2    La Forge, Geordi  Riker, William    '8','9','11'  '3','4','6','7','8'                   0.143
...

As I said earlier, this may not be the most efficient way to compute a similarity matrix, but it works!

Joining Data Sets in ECL

So far, I have demonstrated how to initialize, load and parse data in HPCC ECL.  We have been working with a toy data set consisting of person IDs, names and dates of birth.  In this article, I'll add a second data set and show how to join multiple data sets.

The new data set describes who likes whom denoted by the person ID.  The record type of the data set is:

Record_Like := RECORD
    UNSIGNED8 SourceId;
    UNSIGNED8 TargetId;
END;

And here are some sample data:

likes_file := DATASET([
        {1,2},{1,3},{1,4},{1,5},{1,6},{1,7},{1,8},{1,9},{1,10},
        {2,3},{2,4},{2,6},{2,7},{2,8},
        {3,8},{3,9},{3,11},
        {4,5},
        {5,7},
        {6,7},{6,9},
        {7,4},{7,5},{7,6},
        {8,3},{8,9},
        {9,3},{9,6},{9,8},
        {12,1}], Record_Like);

It's hard to figure out who likes whom just by looking at a pair of IDs in the likes_file data set.  To make it easier, we'll combine the members_file data set we have seen in previous articles with the new likes_file data set so that the person names are spelled out.  The joined result will have the record type:

Record_Named_Like := RECORD
    SourceLastName   := members_file.LastName;
    SourceFirstName  := members_file.FirstName;
    Record_Like;
    TargetLastName   := members_file.LastName;
    TargetFirstName  := members_file.FirstName;
END;

Essentially, Record_Named_Like extends Record_Like with 4 additional fields.  The data types for the SourceLastName, SourceFirstName, TargetLastName and TargetFirstName fields are the same as LastName or FirstName from the members_file data set.

First, we are going to join the 2 data sets on the condition that Record_Member.Id equals Record_Like.SourceId.  We'll copy the last and first name from the matching Record_Member record in members_file into a new Record_Named_Like record.

Record_Named_Like JoinMembersAndLikes(Record_Member L, Record_Like R) := TRANSFORM
    SELF.SourceLastName    := L.LastName;
    SELF.SourceFirstName   := L.FirstName;
    SELF.TargetLastName    := '';
    SELF.TargetFirstName   := '';    
    SELF := R;
END;

member_likes_file := JOIN(members_file, likes_file, LEFT.Id=RIGHT.SourceId, JoinMembersAndLikes(LEFT, RIGHT));

The JOIN operation combines the two data sets based on the condition LEFT.Id=RIGHT.SourceId, where LEFT refers to members_file and RIGHT refers to likes_file.  During the join, the transform function JoinMembersAndLikes is called to produce the resulting joint record of type Record_Named_Like.

We do another similar join such that Record_Member.Id is equal to Record_Named_Like.TargetId to identify the name of the liked person.

Record_Named_Like JoinLikesAndMembers(Record_Member L, Record_Named_Like R) := TRANSFORM
    SELF.TargetLastName    := L.LastName;
    SELF.TargetFirstName   := L.FirstName;
    SELF := R;
END;

member_likes_member_file := JOIN(members_file, member_likes_file, LEFT.Id = RIGHT.TargetId, JoinLikesAndMembers(LEFT, RIGHT));

After both joins, the final data set member_likes_member_file looks like:

SourceLastName   SourceFirstName   SourceId   TargetId   TargetLastName   TargetFirstName
Guinan                             12         1        Picard           Jean-Luc
Picard           Jean-Luc          1          2        Riker            William
...

Now, we can easily spot who likes whom by name.  Let's go a bit further by doing some summary statistics and reporting.  We want to generate a report that tells us how many likes there are for each person and of those likes how many are for Data and Worf.  Precisely, the record type for the report is:

Record_Likes_Stats := RECORD
    LastName    := member_likes_member_file.SourceLastName;
    FirstName   := member_likes_member_file.SourceFirstName;
    LikesData   := COUNT(GROUP, member_likes_member_file.TargetLastName = 'Data');
    LikesWorf   := COUNT(GROUP, member_likes_member_file.TargetLastName = 'Worf');
    TotalLikes  := COUNT(GROUP);
END;

In SQL parlance, the first 2 fields (LastName and FirstName) in Record_Likes_Stats are the keys for our upcoming group-by query while the remaining fields (LikesData, LikesWorf and TotalLikes) are the aggregate functions performed on each group.  For example, TotalLikes contains the count for each unique group of last and first name.  LikesData contains the portion of the group count satisfying the filter condition of TargetLastName = 'Data'.
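In Python terms, the same group-by with conditional counts could be sketched like this (the rows here are hypothetical, trimmed-down tuples, not the real joined data set):

```python
from collections import defaultdict

# (SourceLastName, SourceFirstName, TargetLastName) rows, trimmed down
rows = [
    ("Crusher", "Wesley", "La Forge"),
    ("Crusher", "Wesley", "Crusher"),
    ("Crusher", "Wesley", "Data"),
    ("Crusher", "Beverly", "Troi"),
    ("Crusher", "Beverly", "Crusher"),
]

stats = defaultdict(lambda: {"LikesData": 0, "LikesWorf": 0, "TotalLikes": 0})
for last, first, target in rows:
    g = stats[(last, first)]          # group keyed by last and first name
    g["TotalLikes"] += 1              # COUNT(GROUP)
    g["LikesData"] += (target == "Data")  # COUNT(GROUP, filter)
    g["LikesWorf"] += (target == "Worf")

print(stats[("Crusher", "Wesley")])  # {'LikesData': 1, 'LikesWorf': 0, 'TotalLikes': 3}
```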

The group-by query is done with the TABLE function:

member_likes_stats := TABLE(member_likes_member_file, Record_Likes_Stats, SourceLastName, SourceFirstName);

The parameters to TABLE are the input data set, the output record type and the group-by keys.  Here is what the output looks like if you OUTPUT member_likes_stats:

LastName   FirstName   LikesData   LikesWorf   TotalLikes
Crusher    Beverly     0           0           2
Crusher    Wesley      1           0           3
... 

To limit the output to only those who like Data or Worf, you can add a filter to the OUTPUT action:

OUTPUT(member_likes_stats(LikesData > 0 OR LikesWorf > 0));

In this article, I have introduced two new ECL operations: JOIN and TABLE.  There are other ways of using them which I haven't described.  To learn more, check out the ECL Language Reference.





Parsing and Extracting Data in ECL

As promised last time, I'll show how to pre-process data in ECL by parsing the date of birth in the "Month Day, Year" format into separate month, day and year fields, based on what I've learned so far about ECL.  For demonstration, I'll continue to use the toy data set (members_file_raw) we created previously.

The date of birth will be processed in 2 steps.  The steps I'm about to show may not be the most efficient, cleverest or fastest way of extracting data, but they work.  First, we declare the regular expressions for parsing the date of birth.

PATTERN regex_month := PATTERN('[a-z]+');
PATTERN regex_day := PATTERN('[0-9]{1,2}');
PATTERN regex_year := PATTERN('[0-9]{4}');
PATTERN regex_month_seg := regex_month [' '];
PATTERN regex_day_seg := regex_day [', '];
PATTERN regex_date := regex_month_seg? regex_day_seg? regex_year;
PATTERN regex_birthdate := regex_date?;

Basically, we're declaring that the date of birth consists of an optional month name, an optional day and a mandatory year.  The month name and day are separated by a space, while the day and year are separated by a comma and a space.
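For comparison, the same grammar (optional month, optional day, mandatory 4-digit year) can be expressed as one ordinary regular expression.  A Python sketch, illustrative only since ECL's PATTERN syntax composes patterns differently:

```python
import re

# month and day are optional; the 4-digit year is mandatory
DATE_RE = re.compile(
    r"^(?:(?P<month>[A-Za-z]+) )?(?:(?P<day>\d{1,2}), )?(?P<year>\d{4})$"
)

for s in ["July 13, 2305", "2335", "September 2328"]:
    m = DATE_RE.match(s)
    print(m.group("month"), m.group("day"), m.group("year"))
```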

Next comes the data type for storing the processed record:

Record_Member := RECORD
    members_file_raw;
    STRING9 MonthStr := MATCHTEXT(regex_month[1]);
    UNSIGNED1 Month  := 0;
    UNSIGNED1 Day    := (INTEGER) MATCHTEXT(regex_day[1]);
    UNSIGNED2 Year   := (INTEGER) MATCHTEXT(regex_year[1]);
    UNSIGNED2 Age    := 0;
END;

In the previous article, we declared the record type Record_Member_Raw, which consists of 4 fields - Id, LastName, FirstName and Birthdate - and members_file_raw is a data set composed of Record_Member_Raw records.  Here, we are declaring that Record_Member extends Record_Member_Raw by including all the fields from the latter.  In addition, Record_Member adds 5 new fields.  The MonthStr, Day and Year fields store the month name, day and year, respectively, parsed from a date of birth string such as "January 1, 2017".  The Month field stores the month as a number.  We'll use the Age field to store the age later.

With the regular expressions and record type defined, we can proceed to parsing the data.

members_file_temp := PARSE(members_file_raw, Birthdate, regex_birthdate, Record_Member, MAX, NOCASE);

Here, we're instructing the PARSE function to parse the Birthdate field of the members_file_raw data set using the regex_birthdate pattern and output the results based on Record_Member record type. The NOCASE flag forces the pattern matching to ignore case.

To see what the parse result looks like, add the action:

OUTPUT(members_file_temp);

This will output something like:

id   lastname   firstname   birthdate       monthstr   month   day   year   age
1    Picard     Jean-Luc    July 13, 2305   July       0       13    2305   0
2    Riker      William     2335                       0       0     2335   0
...

Our next tasks are to convert the month name to number and calculate the age.  To do this, we define a dictionary which maps month name to number.

months_file := DATASET([{'Jan', 1},
                        {'Feb', 2},
                        {'Mar', 3},
                        {'Apr', 4},
                        {'May', 5},
                        {'Jun', 6},
                        {'Jul', 7},
                        {'Aug', 8},
                        {'Sep', 9},
                        {'Oct', 10},
                        {'Nov', 11},
                        {'Dec', 12}], {STRING3 name, UNSIGNED1 number});

MonthNameToNumber := DICTIONARY(months_file, {name => number});

You might have noticed that the full month names are not spelled out in the dictionary.  That's because we'll truncate the month name to its first 3 letters when we process the data:

CurrentYear := 2362;

CalculateAge(UNSIGNED2 birthYear) := CurrentYear - birthYear;

Record_Member NumerateMonth(Record_Member L, Record_Member R) := TRANSFORM
    MonthSubstr  := R.MonthStr[1..3];
    SELF.Month   := MonthNameToNumber[MonthSubstr].number;
    SELF.Age     := CalculateAge(R.Year);
    SELF         := R;
END;

members_file := ITERATE(members_file_temp, NumerateMonth(LEFT, RIGHT), LOCAL);

The ITERATE function traverses the members_file_temp data set and invokes the NumerateMonth transform function on each record.  Each time NumerateMonth is called, the record to be transformed is passed in as the RIGHT parameter, while the previously transformed record is passed in as the LEFT parameter.  You can think of ITERATE as having a memory of 1 record: it remembers the result of the previous NumerateMonth call and passes it to the next one.

The NumerateMonth transform function returns a Record_Member record with its fields initialized from the data of the given R record.  For example, the Month field is set by looking up the month number using the first 3 letters of R.MonthStr from the MonthNameToNumber dictionary.  The Age field is initialized by calling the CalculateAge function.   Any remaining uninitialized field values are copied from the corresponding fields in R.
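The truncate-and-look-up step and the age calculation can be sketched outside ECL as well.  In Python (with CurrentYear hard-coded to the article's in-universe value):

```python
MONTHS = {name: i + 1 for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

CURRENT_YEAR = 2362  # the "current" year used in the article

def numerate(month_str, year):
    # truncate the month name to its first 3 letters before the lookup;
    # an unknown or empty month falls back to 0, as in the ECL output
    month = MONTHS.get(month_str[:3].title(), 0)
    return month, CURRENT_YEAR - year

print(numerate("July", 2305))  # (7, 57)
```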

To check that month number and age are extracted correctly, let's dump the result:

OUTPUT(members_file);

This will output something like:

id   lastname   firstname   birthdate       monthstr   month   day   year   age
1    Picard     Jean-Luc    July 13, 2305   July       7       13    2305   57
2    Riker      William     2335                       0       0     2335   27
...

Now that we have the month number and age calculated, let's do some simple data exploration:

total_members := COUNT(members_file);
OUTPUT(total_members);

older_members := members_file(age >= 50);
OUTPUT(older_members);

feb_or_jul_born_members := members_file(month = 2 OR month = 7);
OUTPUT(feb_or_jul_born_members);

So far, we have only worked with a single data set.  A data set in ECL is somewhat similar to a database table.  In the next article, I'll add a second data set and show how to join multiple data sets together.


Friday, October 21, 2016

Initialize, Load and Save Data in ECL

In a previous article, I introduced HPCC and ECL, the data-centric declarative programming language for HPCC.  In this first article on ECL, I'll share what I have learned about ECL after experimenting with it for a while.  As ECL and HPCC are about extract, transform and load data, I'll demonstrate using a toy data set consisting of person names and dates of birth. I'll continue to use this data set in the following articles on ECL.

Without further ado, we first declare the format of the incoming data.

Record_Member_Raw := RECORD
    UNSIGNED8 Id;
    STRING15 LastName;
    STRING15 FirstName;
    STRING20 Birthdate;
END;

The syntax is rather intuitive.  We're declaring a data row, or record, consisting of 4 fields whose data types are an 8-byte unsigned integer, two 15-byte strings and a 20-byte string.  ECL strings are space padded, not null-terminated.  Also, ECL is not case sensitive, so Record_Member_Raw is the same as record_member_raw, LastName is the same as lastname, and UNSIGNED8 is the same as unsigned8.

In the real world, you will be getting your data from an existing data source but in this case, I'm hardcoding the data manually to members_file_raw:

members_file_raw := DATASET([
    {1,'Picard','Jean-Luc','July 13, 2305'},
    {2,'Riker','William','2335'},
    {3,'La Forge','Geordi','February 16, 2335'},
    {4,'Yar','Tasha','2337'},
    {5,'Worf','','2340'},
    {6,'Crusher','Beverly','October 13, 2324'},
    {7,'Troi','Deanna','March 29, 2336'},
    {8,'Data','','February 2, 2338'},
    {9,'Crusher','Wesley','July 29, 2349'},
    {10,'Pulaski','Katherine','2309'},
    {11,'O\'Brien','Miles','September 2328'},
    {12,'Guinan','','1293'}], Record_Member_Raw);

Our data set consists of 12 records.  The records are of type Record_Member_Raw.  To display or output the data set or recordset result, add the following code to your ECL script.

OUTPUT(members_file_raw);

OUTPUT(members_file_raw(lastname='Data' OR id=3));

OUTPUT(members_file_raw[1]);

OUTPUT(members_file_raw[1..3]); 

The first output dumps the entire data set.  The second selects only the records meeting the filter condition.  The third outputs only the first record.  Note that ECL indexing starts at 1, not 0.  An index can also be a range, as in the last output, which returns the first 3 records.  You can also save the output to a file:

OUTPUT(members_file_raw, ,'~FOO::BAR::Members', OVERWRITE);

The OUTPUT action will be used frequently in debugging ECL code.  To learn about what each ECL action does, the ECL Language Reference is your best (and only) source of help.
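For readers coming from 0-indexed languages, the filtering and indexing OUTPUTs above map roughly onto Python list operations.  This is only an analogy; ECL evaluates filters and slices quite differently under the hood:

```python
members = [
    {"id": 1, "lastname": "Picard"},
    {"id": 2, "lastname": "Riker"},
    {"id": 3, "lastname": "La Forge"},
    {"id": 8, "lastname": "Data"},
]

print(members)  # OUTPUT(members_file_raw): the whole data set
print([m for m in members if m["lastname"] == "Data" or m["id"] == 3])  # filtered
print(members[0])    # ECL's members_file_raw[1]: indexing starts at 1 in ECL
print(members[0:3])  # ECL's members_file_raw[1..3]: an inclusive, 1-based range
```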

Remember, earlier I said that in the real world you will be getting your data from an existing data source.  Well, now you have one: the file you just created.  To load the file, the command is:

loaded_members_file := DATASET('~FOO::BAR::Members', Record_Member_Raw, THOR);

In this article, I showed how to initialize, load and save data in ECL.  In the next article, I'll pre-process the data by parsing the date of birth into separate month, day and year fields.


Thursday, October 20, 2016

First Encounter with HPCC and ECL

I have been trying out HPCC for a few days now.  HPCC is an open source big data platform developed by LexisNexis which can run on commodity computing clusters like Amazon AWS.  HPCC includes its own declarative programming language called ECL to extract, transform and load large scale data.  Being declarative, ECL belongs to the same class of languages as SQL, as opposed to imperative languages like C/C++, Java, Python or PHP.  At first sight, though, ECL looks a bit like C++.  Perhaps it has to do with the fine-grained control ECL permits, such as file pointers and byte allocation, as we shall see.


Setting up

So how does one get started with HPCC quickly and freely?  For me, the path of least resistance was to download the HPCC virtual image and run it with VirtualBox.  This deploys a pre-configured Linux-based HPCC guest server on your local machine for demo purposes.

Next, you'll need to install an IDE to interact with the server and expedite ECL programming.  Though an IDE is not required and command line tools are available for download, many of the online tutorials assume you're using one, in particular, the ECL IDE for Windows.  I was lucky to have an old copy of Windows 7 lying around, onto which I installed the ECL IDE.  So now I have both the HPCC server and Windows 7 with the ECL IDE running as guests on my Mac under VirtualBox, and everything works (after some hiccups).


Getting Acquainted

While there are various learning resources available on the HPCC website, they are scattered across different pages.  It can be a frustrating experience not knowing which ones to start with and in what order.  Also, some of the resources are seemingly locked away for customers only or require access credentials.  Hopefully, by the time you're reading this, the resources will be better organized.

  1. In hindsight, I would start by reading Running HPCC in a Virtual Machine to help with the installation and usage of ECL IDE.

  2. To gain a little bit more insights into ECL, I read HPCC Data Tutorial and followed the short programming examples.

  3. Depending on your preference, you could watch some of the short tutorial videos.

  4. What helped me the most so far is the ECL Programmers Guide.  It's my Rosetta stone to ECL.  I hope they would continue to expand the guide and coverage with more examples.  When reading the guide, you would need to frequently consult the ECL Language Reference.

I haven't read everything there is, and most likely there are other useful resources I haven't stumbled upon yet.  Hopefully, these are enough to get you started with HPCC.  In my next article, I'll share what I've learned so far about programming in ECL.

Thursday, September 1, 2016

Storing HTTP Sessions with Amazon Elastic File System (EFS)

Amazon has recently released the Elastic File System (EFS) to the general public after a long beta period.  According to Amazon, EFS is a distributed file system capable of high throughput, low latency and auto scaling, in addition to fault tolerance.  In the US region, EFS costs $0.30/GB-month at the time of writing.  You can NFS-mount and share an EFS file system with multiple EC2 instances in a Virtual Private Cloud (VPC) or, for non-VPC instances, via ClassicLink.

There are many applications that can take advantage of a distributed file system.  For example, if you are running a web application on multiple machines in a cluster, you can use EFS for storing user HTTP sessions.  One of the benefits is that EFS costs far less than running ElastiCache, the Memcached-compatible service offered by Amazon.

Setting up EFS in a VPC is easy, and Amazon provides a step-by-step tutorial on how to do it.  Once you have the EFS file system and security group set up, you can use cloud-init to mount the file system automatically at EC2 instance launch by following the Amazon instructions at

http://docs.aws.amazon.com/efs/latest/ug/mount-fs-auto-mount-onreboot.html

This involves adding a script during the Launch Instance wizard of the EC2 management console. The script installs the NFS client and writes an entry in the /etc/fstab file to mount the EFS file system on /mnt/efs with the following line:


 echo "$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone).file-system-id.efs.aws-region.amazonaws.com:/    /mnt/efs   nfs4    defaults" >> /etc/fstab


The mount target DNS name is dynamically constructed so that you don't accidentally mount your file system across the wrong Availability Zone.

For my own web application running on Amazon Linux nodes, I did not want to hassle with cloud-init so I opted for a different approach of auto-mounting EFS by appending the following line to /etc/rc.d/rc.local


 mount -t nfs4 -o nfsvers=4.1 $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone).file-system-id.efs.aws-region.amazonaws.com:/ /mnt/efs


Note, you must replace file-system-id and aws-region with your own values.  For example, aws-region could be us-east-1 depending on your EC2 region.

In case you're wondering: you cannot run scripts or use environment variables within /etc/fstab, so it's not possible to run the curl command for the mount target DNS name in /etc/fstab directly.  This is why I chose to append to /etc/rc.d/rc.local instead.


Got Too Many Session Files?

If your web application uses PHP and leaves a lot of expired session files behind, you should consider disabling the default session garbage collection and replacing it with a scheduled cron job, because with the default configuration there is a 1% chance that PHP will garbage collect session files on each request.  On EFS, file operations like directory listing are slower than on a local file system, so this can potentially make your web application less responsive every time the garbage collector kicks in.

To turn off session garbage collection, set session.gc_probability to 0 in php.ini and restart your web server.  Next, add the following cron job to garbage collect session files.


 */5 * * * * find /mnt/efs -type f -mmin +60 -delete &>/dev/null


This will run every 5 minutes and delete all files with modified time older than 60 minutes in /mnt/efs.  Now, you're good to go!

Monday, August 1, 2016

The Tortoise and The Hare - Array_splice vs Array_pop in PHP

Sometimes, the most succinct and elegant piece of code can fall short on other factors like speed of execution due to the hidden complexity of the underlying code powering a high level language like PHP.  This happened to me today while I was refactoring a data processing routine.  For a fleeting moment, I experienced a sense of accomplishment having managed to reduce the number of lines of code in the routine to make it more readable and maintainable.  Unfortunately, that gleeful moment only lasted until I ran the routine against a data set.  To my utter surprise, the code ran 900 times slower than before!

Through the process of elimination, I tracked the culprit down to the PHP function array_splice.


 array array_splice ( array &$input , int $offset [, int $length = count($input) [, mixed $replacement = array() ]] )


From the PHP manual, the array_splice function

"Removes the elements designated by offset and length from the input array, and replaces them with the elements of the replacement array, if supplied."

For my task, I needed to process the elements in an array starting from the end with a set of N elements at a time so array_splice seemed like the perfect function to use.  The code I had before refactoring was similar to the following stripped down snippet:


 // initialize arrays with 50,000 elements
 $arr = array_fill(0, 50000, 'abc');

 // remove last 5 elements at a time using array_pop
 while (count($arr)) {
    $subArr = [];
    for ($i = 0; $i < 5; $i++) {
        $subArr[] = array_pop($arr);
    }
 }


In the snippet above, N is set to 5 (i.e. remove 5 elements at a time), but in reality, N can change inside the while-loop so the PHP function array_chunk is not applicable.  Also, array_slice isn't appropriate because I needed the input array to be truncated inside the while-loop.

The following refactored version replaces the ugly inner for-loop with a single array_splice statement.


 // initialize an array with 50,000 elements
 $arr = array_fill(0, 50000, 'abc');

 // remove last 5 elements at a time using array_splice
 while (($len = count($arr)) > 0) {
    $subArr = array_splice($arr, $len - 5);
 }


Don't you agree the refactored version is simpler and easier to read?  At least I thought so.  Unfortunately, while the version before refactoring took less than 0.02 seconds to complete, the refactored version took 18 seconds in Zend PHP 5.6.  Furthermore, as the number of elements in the array scales up, the execution time of array_splice grows non-linearly while that of array_pop stays more or less linear.  This is illustrated in the following graph comparing the running time of the two code snippets above over an increasing array size.


[Graph: execution time of array_splice vs array_pop over increasing array size]


My first thought was that perhaps PHP arrays are implemented as singly linked lists, so every time array_splice() is called, the list has to be traversed from the start until the given offset near the end is reached.  That would explain the non-linear time performance.  If that were the case, using array_splice() to remove the first N elements instead of the last N, as in the following code snippet, should yield linear time performance.


 // initialize an array with 50,000 elements
 $arr = array_fill(0, 50000, 'abc');

 // remove first 5 elements at a time using array_splice
 while (($len = count($arr)) > 0) {
    $subArr = array_splice($arr, 0, 5);
 }


Strangely, when I ran this test, the run time was almost identical to the graph above, so I'm at a loss.  I have only tested this with Zend PHP 5.6.  I don't have Zend PHP 7 or HHVM at my disposal, but I wonder how they stack up.
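For what it's worth, the measurement pattern itself is easy to port to other languages.  Here is the harness sketched in Python (Python lists are contiguous arrays under the hood, so these numbers say nothing about PHP 5.6's behavior; the point is only the shape of the test):

```python
import time

def pop_last_n(arr, n=5):
    # remove the last n elements one at a time, like the array_pop loop
    while arr:
        sub = [arr.pop() for _ in range(n)]

def splice_last_n(arr, n=5):
    # remove the last n elements in one operation, like the array_splice call
    while arr:
        sub = arr[-n:]
        del arr[-n:]

for fn in (pop_last_n, splice_last_n):
    data = ["abc"] * 50_000
    start = time.perf_counter()
    fn(data)
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
```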

Monday, July 11, 2016

Audit ModSecurity Log Quickly and Systematically with Reconity

Having recently installed ModSecurity as my web application firewall, I have started to keep a regular eye on the audit logs it generates.  The audit log records web access events which set off any of the configured firewall rules.  For example, an event entry in the log may look like this (IP addresses have been masked for privacy):

--d9g76d43-A--
[27/Jun/2016:08:18:10 +0000] V3GlK2-sf88lhesfakqlUgAAI X.X.X.X 57596 Y.Y.Y.Y 80
--d9g76d43-B--
GET /../../../../../../../mnt/mtd/OCxW HTTP/1.1
Host: Z.Z.Z.Z
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36
Accept-Encoding: gzip
Connection: close

--d9g76d43-F--
HTTP/1.1 400 Bad Request
X-Content-Type-Options: nosniff
Content-Length: 226
Connection: close
Content-Type: text/html; charset=iso-8859-1

--d9g76d43-H--
Message: Warning. String match "Invalid URI in request" at WEBSERVER_ERROR_LOG. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_20_protocol_violations.conf"] [line "82"] [id "981227"] [rev "1"] [msg "Apache Error: Invalid URI in Request."] [data "GET /../../../../../../../mnt/mtd/OCxW HTTP/1.1"] [severity "WARNING"] [ver "OWASP_CRS/2.2.8"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/INVALID_REQ"] [tag "CAPEC-272"]
Apache-Error: [file "core.c"] [line 4306] [level 3] AH00126: Invalid URI in request %s
Stopwatch: 1467015490548268 657 (- - -)
Stopwatch2: 1467015490548268 657; combined=435, p1=213, p2=0, p3=1, p4=41, p5=131, sr=92, sw=49, l=0, gc=0
Producer: ModSecurity for Apache/2.8.0 (http://www.modsecurity.org/); OWASP_CRS/2.2.8.
Server: Apache
Engine-Mode: "ENABLED"

--d9g76d43-Z--

The above entry tells us that one of the installed rules caught and blocked an illegitimate attempt to access a private, protected resource on the server.  According to the entry, the source of the intrusion was X.X.X.X, the offending request was "GET /../../../../../../../mnt/mtd/OCxW", the response was "400 Bad Request" and the rule in violation was 981227.

Often, my audit log also contains false alarms or events pertaining to the internal workings of ModSecurity rather than an external offense, such as:

--9fj387gd-A--
[13/Jun/2016:15:48:05 +0000] V11sVtSgC7wd3k4LMd8eSAXAAAg 127.0.0.1 52076 127.0.0.1 80
--9fj387gd-B--
POST /foo HTTP/1.1
Host: localhost

--9fj387gd-F--
HTTP/1.1 200 OK
X-Content-Type-Options: nosniff
Content-Type: application/json; charset=utf-8

--9fj387gd-H--
Message: collections_remove_stale: Failed deleting collection (name "ip", key "X.X.X.X_c7ad53c03q60y9cdrf1e71t5e4k04bx21e589z6e"): Internal error
Apache-Handler: application/x-httpd-php
Stopwatch: 1465832885236582 714358 (- - -)
Stopwatch2: 1465832885236582 714358; combined=4576, p1=259, p2=0, p3=0, p4=0, p5=2162, sr=104, sw=2, l=0, gc=2153
Producer: ModSecurity for Apache/2.8.0 (http://www.modsecurity.org/); OWASP_CRS/2.2.8.
Server: Apache

--9fj387gd-Z--

This particular event is harmless and is caused by a bug in ModSecurity as discussed in a previous article.

A typical log of mine contains a wide variety of events like the ones above.  When the log is big, sifting through it becomes time-consuming, especially if I want to examine every single recorded event out of paranoia.  What I needed was a simple tool for exploratory analysis of the log.  Being a software engineer, I decided to spend some time building a tool to my specification.

And so Reconity was born, an online interactive log auditing tool for ModSecurity.  The tool extracts key event information from parts A, B, F and H of the log into an interactive table so I can inspect events quickly.  With a single mouse click, I can hide events like the aforementioned internal errors from view and switch my attention to other, more serious events.  The tool was built to help audit logs systematically by narrowing down important events quickly.
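To give a flavor of the extraction involved, here is a minimal Python sketch (not Reconity's actual code, which is JavaScript) that splits one serial-format audit entry into its lettered parts using the boundary lines:

```python
import re

def split_audit_entry(entry):
    """Split a serial-format audit entry into {part_letter: text},
    keyed on the boundary lines of the form --xxxxxxxx-A--."""
    parts = {}
    current = None
    for line in entry.splitlines():
        m = re.match(r'^--[0-9a-z]+-([A-Z])--$', line)
        if m:
            current = m.group(1)
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {k: '\n'.join(v).strip() for k, v in parts.items()}

sample = """--d9g76d43-A--
[27/Jun/2016:08:18:10 +0000] V3GlK2 X.X.X.X 57596 Y.Y.Y.Y 80
--d9g76d43-B--
GET /../../../../../../../mnt/mtd/OCxW HTTP/1.1
Host: Z.Z.Z.Z
--d9g76d43-Z--
"""
parts = split_audit_entry(sample)
```

From the resulting dictionary, fields such as the rule ID, the request line and the source IP can then be pulled out of parts A, B and H for tabulation.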

Reconity

I wrote the tool mostly in JavaScript so that it can run in a browser.  There is no manual to learn or read as the web user interface is intuitive.  Since processing is done within the browser, there is no need to upload the log file to a remote machine for processing or to invoke commands at the command line.  This means sensitive log data remains private and local to the browser's computer.  While JavaScript may not be the fastest programming language out there, I was able to audit 100,000 events (a ~200 MB log file) responsively with the tool.  With optimization, I should be able to extend the limit further if my log ever gets bigger than 200 MB.

As I have been using the tool for weeks now, I find that it saves me time on the laborious task of log auditing.  If auditing ModSecurity logs is one of your regular routines, you are welcome to simplify the chore with Reconity!

Tuesday, July 5, 2016

Web Application Vulnerability Scanners

In a previous article, I discussed how to set up ModSecurity for Apache 2.4 on Amazon Linux AMI to protect web applications against exploits and abuses.  Having recently set up ModSecurity for my own web application, I was curious how my firewall would hold up against the publicly available hacking tools out there.  I was in for a learning curve.

IMPORTANT:  The scanning tools introduced below may choke your web application with heavy traffic so proceed with caution and permission!


Kali Linux


So what's the easiest way for a layman to start?  I discovered that the path of least resistance is to download Kali Linux.  Kali is a freely distributed Debian-based Linux system pre-loaded with many vulnerability scanners and hacking tools, which are exactly what I was after.  Since I already use VirtualBox, the quickest way to get it up and running is to download the Kali Linux VirtualBox image from:

https://www.offensive-security.com/kali-linux-vmware-virtualbox-image-download/

The image is about 2 GB so be patient.  If you prefer the Kali Linux ISO image instead, it's also available at:

https://www.kali.org/downloads/

Once you have downloaded the VirtualBox image, you can create a new virtual machine in VirtualBox by going to File > Import Appliance.  The import took a few minutes on my Mac Pro, for example.

Now, launch the new virtual machine and log in as root.  The default root password is "toor".  There are many tools to choose from.  Here's a screenshot of what's under the Application menu.



Nikto


One of the scanning tools under Application > Vulnerability Analysis is Nikto.  From the official website:

"Nikto is an Open Source (GPL) web server scanner which performs comprehensive tests against web servers for multiple items, including over 6700 potentially dangerous files/programs, checks for outdated versions of over 1250 servers, and version specific problems on over 270 servers. It also checks for server configuration items such as the presence of multiple index files, HTTP server options, and will attempt to identify installed web servers and software. Scan items and plugins are frequently updated and can be automatically updated."

For usage information, please refer to the Nikto official website at https://cirt.net/nikto2-docs/

The command to start a Nikto scan is nikto -h www.my_website123.com.  While scanning, Nikto reports potential vulnerabilities to the command console as they are found:

- Nikto v2.1.6
---------------------------------------------------------------------------
+ Target IP:          X.X.X.X
+ Target Hostname:    www.my_website123.com
+ Target Port:        80
+ Start Time:         2016-05-17 12:27:25 (GMT-4)
---------------------------------------------------------------------------
+ Server: Apache
+ Retrieved x-powered-by header: PHP/4.1.1
+ The anti-clickjacking X-Frame-Options header is not present.
+ The X-XSS-Protection header is not defined. This header can hint to the user agent to protect against some forms of XSS
+ The X-Content-Type-Options header is not set. This could allow the user agent to render the content of the site in a different fashion to the MIME type

Different scanners may discover new problems other scanners have not reported so you should try to run a few of them.  It's also very likely that some of the reported problems are false positives or not exploitable by hackers so don't panic.


OpenVAS


There are also other scanners not bundled with Kali Linux.  One of them is OpenVAS.  To install OpenVAS on Kali Linux, please refer to:

https://www.kali.org/penetration-testing/openvas-vulnerability-scanning

If your scan finishes quickly (<1 minute) and returns without any problems found, chances are your server isn't responding to ICMP requests, rather than being free of vulnerabilities.  You should set "Alive Test" to "Consider Alive" for your scan target.

On the other hand, if your scan finishes quickly (<1 minute) and returns with an "Internal Error" and you are using OpenVAS Manager 6.0.5, it's a bug.  Find out which version you have by running the following at the terminal:

openvasmd --version

Then, check if the error log at /var/log/openvas/openvasmd.log shows:

md  main:WARNING:2016-05-17 19h06.25 UTC:26326: sql_prepare_internal: sqlite3_prepare failed: no such table: current_credentials

To fix this, manually add the missing table to the sqlite3 tasks.db file:

CREATE TABLE IF NOT EXISTS current_credentials (id INTEGER PRIMARY KEY, uuid text UNIQUE NOT NULL);
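Since the fix is a single DDL statement, any sqlite3 client will do.  Here is a sketch of the same fix using Python's built-in sqlite3 module; note that the location of tasks.db varies by installation, so the path below is a placeholder you must adjust (and it is prudent to stop openvasmd before touching its database):

```python
import sqlite3

# Placeholder path -- locate your actual OpenVAS Manager tasks.db first.
DB_PATH = 'tasks.db'

conn = sqlite3.connect(DB_PATH)
# Same statement as above; IF NOT EXISTS makes it safe to re-run.
conn.execute(
    'CREATE TABLE IF NOT EXISTS current_credentials '
    '(id INTEGER PRIMARY KEY, uuid text UNIQUE NOT NULL);'
)
conn.commit()
conn.close()
```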

Other Tools


In addition to OpenVAS, there are also other scanners readily available to download and install. Some of the ones I have found are:

Arachni  -  apt-get install arachni
Vega  -  apt-get install vega

At the time of writing, Vega is packaged with a graphical user interface but Arachni is not.  If you're trying Arachni, note that a scan may take a long time to finish (as in days or longer) because of its comprehensive nature, but it's possible to optimize for faster scans.

Hopefully, this will get you started on securing your web application.  Cheers!

Thursday, June 2, 2016

ModSecurity Failed Deleting Collection

Houston, we have a problem.  Looking at my ModSecurity audit log, I found quite a few log entries similar to the following (I have obfuscated the actual IP address for privacy reasons):

Failed deleting collection (name "ip", key "X.X.X.X_"): Internal error

I was concerned that if the collection file was not being cleaned up automatically by Apache or ModSecurity, the file may grow to an astronomical size over time causing other problems.  While I wasn't able to find a definitive cause for the error by searching online, I stumbled upon 3 proposed workaround solutions:
  1. Use memcache for collections
  2. Set SecCollectionTimeout to a small value such as 600 (default is 3600 seconds)
  3. Install and run modsec-sdbm-util in a separate process to clean up the collection file regularly
Option 1 requires ModSecurity 3.0 or git-clone the memcache_collections branch, a choice I didn't want to make hastily.  There are reports that option 2 may not always work for everybody.  Just to be on the safe side, I implemented option 3 in addition to option 2.

For the record, I took the steps below to install modsec-sdbm-util on Amazon Linux AMI running Apache 2.4 with ModSecurity 2.8.

First, install all pre-requisite libraries and tools:

sudo su
yum install libtool autoreconf autoheader automake autoconf apr-util-devel

Then, download and install modsec-sdbm-util to a directory.

git clone https://github.com/SpiderLabs/modsec-sdbm-util.git
cd modsec-sdbm-util
./autogen.sh
./configure
make install

Check that it’s installed successfully by running (assuming ip.pag file is in /tmp):

/usr/local/modsecurity/bin/modsec-sdbm-util -s /tmp/ip.pag

If everything is okay, it should output a status report similar to:

Opening file: /tmp/ip.pag
Database ready to be used.
 [|] 10 records so far.
Total of 17 elements processed.
0 elements removed.
Expired elements: 7, inconsistent items: 0
Fragmentation rate: 41.18% of the database is/was dirty data.

Set up a cron job to run modsec-sdbm-util every half hour or so to remove expired elements from the collection file.

*/30 * * * *  /usr/local/modsecurity/bin/modsec-sdbm-util -k /tmp/ip.pag > /dev/null 2>&1

This should do it!  (Cross my fingers.)

Friday, May 27, 2016

ModSecurity Anti-Automation Rule Set for Detecting Denial of Service (DOS) Attacks, Part 2

Continuing my experiment with ModSecurity running in detection mode on a web server, I discovered that a few of my users were triggering false alarms from the anti-DOS rules even though their number of submitted requests was well below the specified burst threshold of 100 requests per minute.  While searching online for a solution, I found that I was not the only one encountering this issue.  It seems that the false alarm occurs whenever web requests are submitted concurrently to the server by a user.  If the requests are submitted sequentially, the rules work as expected.  Unfortunately, I didn't find anybody posting a fix to the rules, and so I took a trip through the tunnel...


Race Condition


The crux of the problem seems to be rule 981048:

#
#
# Check DOS Counter
# If the request count is greater than or equal to user settings, we then set the burst counter
# 
SecRule IP:DOS_COUNTER "@ge %{tx.dos_counter_threshold}" "phase:5,id:'981048',t:none,nolog,pass,setvar:ip.dos_burst_counter=+1,expirevar:ip.dos_burst_counter=%{tx.dos_burst_time_slice},setvar:!ip.dos_counter"


The rule does not operate atomically, so rule actions such as unset(ip.dos_counter) are not guaranteed to finish before the rule is evaluated again for another request.  Consequently, when multiple concurrent requests activate the rule at the same time with the rule condition met, ip.dos_burst_counter is incremented once per concurrent request instead of once for all concurrent requests.  This means ip.dos_burst_counter can jump from 0 to ≥ 2.  Erroneously, the following rule 981049 then blocks the user because ip.dos_burst_counter is ≥ 2.

To illustrate with an example, suppose a user has sent 99 requests slowly and regularly over time, so ip.dos_counter is now at 99 while ip.dos_burst_counter is 0.  All of a sudden, he sends 2 requests simultaneously.  Because the rule is not atomic, ip.dos_burst_counter is incremented twice and is now set to 2.  Had the 2 requests arrived sequentially, ip.dos_burst_counter would be 1 because ip.dos_counter would have been unset properly after the first request.  Now, because ip.dos_burst_counter is 2, the user is blocked even though he never actually burst.
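The race is easy to reproduce outside ModSecurity.  The following Python sketch (a deliberately simplified model of rule 981048, not ModSecurity code) splits the rule into its check and its actions; when two simultaneous requests both pass the check before either runs the actions, the burst counter jumps straight from 0 to 2:

```python
# Shared per-IP state, as in ModSecurity's IP collection.
THRESHOLD = 100
state = {'dos_counter': 100, 'dos_burst_counter': 0}

def rule_981048_check(s):
    # Condition half of the rule: @ge %{tx.dos_counter_threshold}
    return s['dos_counter'] >= THRESHOLD

def rule_981048_actions(s):
    # Action half: increment the burst counter, unset the request
    # counter.  NOT atomic with the check above.
    s['dos_burst_counter'] += 1
    s['dos_counter'] = 0

# Two concurrent requests: both checks run before either action does.
checks = [rule_981048_check(state), rule_981048_check(state)]
for passed in checks:
    if passed:
        rule_981048_actions(state)

# The burst counter is now 2, so rule 981049 blocks the user.
```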


The Solution


To fix the problem, I propose replacing rule 981048 with two chained rules to ensure ip.dos_burst_counter doesn't skip from 0 to ≥ 2.

#
# Check DOS Counter (Tick)
# If the request count is greater than or equal to user settings, we then set the burst counter to 1 if it's 0
# 
SecRule IP:DOS_COUNTER "@ge %{tx.dos_counter_threshold}" "chain,phase:5,id:'981048',t:none,nolog,pass"
SecRule &IP:DOS_BURST_COUNTER "@eq 0" "setvar:ip.dos_burst_counter=1,expirevar:ip.dos_burst_counter=%{tx.dos_burst_time_slice},setvar:!ip.dos_counter"

#
# Check DOS Counter (Tock)
# If the request count is greater than or equal to user settings, we then set the burst counter to 2 if it's 1
# 
SecRule IP:DOS_COUNTER "@ge %{tx.dos_counter_threshold}" "chain,phase:5,id:'981050',t:none,nolog,pass"
SecRule &IP:DOS_BURST_COUNTER "@ge 1" "setvar:ip.dos_burst_counter=2,expirevar:ip.dos_burst_counter=%{tx.dos_burst_time_slice},setvar:!ip.dos_counter"


For the previous example, the "tick" rule ensures ip.dos_burst_counter is set to at most 1 even if the rule is executed concurrently.  The "tock" rule dictates that ip.dos_burst_counter can only be set to 2 if it was previously set to 1.  Together, the rules simulate the operation of the original rule while handling concurrency without triggering false positives.  As for false negatives, the rules may trigger slightly later than the specified threshold crossing (e.g. 100 requests per minute) for concurrent requests due to the inherent lack of atomicity in ModSecurity.
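The same toy model (again a sketch, not ModSecurity code) shows why the replacement behaves: the "tick" action assigns a constant instead of incrementing, so the worst-case interleaving can no longer skip a state, and only a genuine second threshold crossing advances the counter to 2:

```python
THRESHOLD = 100
state = {'dos_counter': 100, 'dos_burst_counter': 0}

def tick_check(s):
    # Rule 981048, condition half: over threshold AND burst counter == 0.
    return s['dos_counter'] >= THRESHOLD and s['dos_burst_counter'] == 0

def tick_actions(s):
    # Action half: ASSIGN 1 (rather than increment), reset the counter.
    s['dos_burst_counter'] = 1
    s['dos_counter'] = 0

def tock(s):
    # Rule 981050: advance 1 -> 2 only after the counter refills.
    if s['dos_counter'] >= THRESHOLD and s['dos_burst_counter'] >= 1:
        s['dos_burst_counter'] = 2
        s['dos_counter'] = 0

# Worst-case interleaving: both concurrent requests pass the tick
# check before either runs the tick actions.
checks = [tick_check(state), tick_check(state)]
for passed in checks:
    if passed:
        tick_actions(state)
after_first_burst = state['dos_burst_counter']   # lands on 1, not 2

# Only a genuine second burst (counter back over the threshold)
# advances the counter, letting rule 981049 block the user.
state['dos_counter'] = 100
tock(state)
```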

Friday, May 20, 2016

ModSecurity Anti-Automation Rule Set for Detecting Denial of Service (DOS) Attacks, Part 1

In a previous article, I introduced ModSecurity as an effective web application firewall.  One of the things I like about ModSecurity is that it comes with a set of core rules (CRS) for detecting and handling various nefarious web activities like XSS and SQL injection.  I was thrilled to learn that CRS also comes with a set of rules for detecting DOS attacks.  I was curious to learn how the anti-DOS rules work.  I have stared at them long enough now that I'm beginning to see the light at the end of the tunnel.  In this article, I'll give an overview of how the DOS rules work.  In the next article, I'll dive a bit deeper by patching a race condition bug in the rules.

For reference, here are the rules from modsecurity_crs_11_dos_protection.conf:

#
# Anti-Automation rule set for detecting Denial of Service Attacks. 
#
#
# Enforce an existing IP address block and log only 1-time/minute
# We don't want to get flooded by alerts during an attack or scan so
# we are only triggering an alert once/minute.  You can adjust how often
# you want to receive status alerts by changing the expirevar setting below.
#
SecRule IP:DOS_BLOCK "@eq 1" "chain,phase:1,id:'981044',drop,msg:'Denial of Service (DoS) Attack Identified from %{tx.real_ip} (%{tx.dos_block_counter} hits since last alert)',setvar:ip.dos_block_counter=+1"
SecRule &IP:DOS_BLOCK_FLAG "@eq 0" "setvar:ip.dos_block_flag=1,expirevar:ip.dos_block_flag=60,setvar:tx.dos_block_counter=%{ip.dos_block_counter},setvar:ip.dos_block_counter=0"

#
# Block and track # of requests but don't log
SecRule IP:DOS_BLOCK "@eq 1" "phase:1,id:'981045',t:none,drop,nolog,setvar:ip.dos_block_counter=+1"

#
# skipAfter Check
# There are different scenarios where we don't want to do checks -
# 1. If the current IP address has already been blocked due to high requests
# In this case, we skip doing the request counts.
#
SecRule IP:DOS_BLOCK "@eq 1" "phase:5,id:'981046',t:none,nolog,pass,skipAfter:END_DOS_PROTECTION_CHECKS"

#
# DOS Counter
# Count the number of requests to non-static resoures
# 
SecRule REQUEST_BASENAME "!\.(jpe?g|png|gif|js|css|ico)$" "phase:5,id:'981047',t:none,nolog,pass,setvar:ip.dos_counter=+1"

#
# Check DOS Counter
# If the request count is greater than or equal to user settings,
# we then set the burst counter
# 
SecRule IP:DOS_COUNTER "@gt %{tx.dos_counter_threshold}" "phase:5,id:'981048',t:none,nolog,pass,t:none,setvar:ip.dos_burst_counter=+1,expirevar:ip.dos_burst_counter=%{tx.dos_burst_time_slice},setvar:!ip.dos_counter"

#
# Check DOS Burst Counter and set Block
# Check the burst counter - if greater than or equal to 2, then we set the IP
# block variable for 5 mins and issue an alert.
#
SecRule IP:DOS_BURST_COUNTER "@ge 2" "phase:5,id:'981049',t:none,log,pass,msg:'Potential Denial of Service (DoS) Attack from %{tx.real_ip} - # of Request Bursts: %{ip.dos_burst_counter}',setvar:ip.dos_block=1,expirevar:ip.dos_block=%{tx.dos_block_timeout}"

SecMarker END_DOS_PROTECTION_CHECKS


You can tailor the burst detection patterns by editing the file at /etc/httpd/modsecurity.d/modsecurity_localrules.conf:

#
# -- [[ DoS Protection ]] ----------------------------------------------------------------
#
# If you are using the DoS Protection rule set, then uncomment the following
# lines and set the following variables:
# - Burst Time Slice Interval: time interval window to monitor for bursts
# - Request Threshold: request # threshold to trigger a burst
# - Block Period: temporary block timeout
#
SecAction \
  "id:'900015', \
  phase:1, \
  t:none, \
  setvar:'tx.dos_burst_time_slice=60', \
  setvar:'tx.dos_counter_threshold=100', \
  setvar:'tx.dos_block_timeout=600', \
  nolog, \
  pass"

The anti-DOS rules count the number of requests for non-static resources made by an IP address.  By default, if the count exceeds 100 requests / minute, then the IP address is blocked for 10 minutes.  During the block period, any requests made by the IP address are dropped.  The counter starts afresh when the block period expires.  For reporting, the IP address is logged when it has just been blocked.  During the block period, the number of new requests by the IP address is logged every minute.

To implement this logic, the rules use 5 internal variables.  The dos_counter variable counts the number of requests while the client is not blocked.  In pseudo-code, the variable with its default value is:

dos_counter = 0

The dos_block_counter variable counts the number of requests while the client is blocked:

dos_block_counter = 0

The dos_burst_counter variable keeps track of whether a burst has occurred (a value of 2 means a burst).  The value resets to 0 upon expiration.

struct dos_burst_counter
    value = 0
    expiration = +inf

The dos_block variable indicates whether an address is currently blocked (a value of 1 means blocked).  The value resets to 0 upon expiration.

struct dos_block
    value = 0
    expiration = +inf

The dos_block_flag variable controls the frequency of log updates during the block period.  The value resets to 0 upon expiration.

struct dos_block_flag
    value = 0
    expiration = +inf

The burst detection pattern is set with the 3 parameters:

DOS_COUNTER_THRESHOLD = 100
DOS_BURST_TIME_SLICE = 60
DOS_BLOCK_TIMEOUT = 600

In pseudo-code, the 6 anti-DOS rules are encapsulated in a function:

function process(request, response)
    //
    // Rule 981044 - log new requests by the blocked IP address every minute
    //
    if dos_block.value == 1 && dos_block_flag.value == 0
        dos_block_flag.value = 1
        dos_block_flag.expiration = unix_time() + 60
        print("DOS attack from %s %d hits since last alert", request.ip, dos_block_counter)
        dos_block_counter = 0

    //
    // Rule 981045 - drop request if IP address is blocked
    //
    if dos_block.value == 1
        dos_block_counter++
        response.drop()

    //
    // Rule 981046 - skip the following rules
    //
    if dos_block.value == 1
        return

    //
    // Rule 981047 - increment request counter
    //
    if request.non_static_resource
        dos_counter++

    //
    // Rule 981048 - check request counter against threshold
    //
    if dos_counter > DOS_COUNTER_THRESHOLD
        dos_burst_counter.value++
        dos_burst_counter.expiration = unix_time() + DOS_BURST_TIME_SLICE
        dos_counter = 0

    //
    // Rule 981049 - check if a burst had occurred
    //
    if dos_burst_counter.value >= 2
        dos_block.value = 1
        dos_block.expiration = unix_time() + DOS_BLOCK_TIMEOUT
        print("Potential DOS attack from %s - %d bursts", request.ip, dos_burst_counter.value)

Every time a web request comes in, the process() function is invoked.  In addition, every second or less, a background process calls the following function to reset any variables with pending expiration.

function expire()
    int time_now = unix_time()

    if time_now >= dos_block.expiration
        dos_block.value = 0
        dos_block.expiration = +inf

    if time_now >= dos_burst_counter.expiration
        dos_burst_counter.value = 0
        dos_burst_counter.expiration = +inf

    if time_now >= dos_block_flag.expiration
        dos_block_flag.value = 0
        dos_block_flag.expiration = +inf

Barring any issues with concurrency, the above pseudo-code should mimic the behavior of the anti-DOS rules but in a more readable and comprehensible manner.  In the next article, I'll dive a bit deeper by patching a race condition bug in the rules.
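To sanity-check my reading of the rules, the pseudo-code above can be transcribed into runnable Python.  This is a model with a simplified, manually advanced clock, not ModSecurity itself, and it folds rules 981044-981046 into one blocked-request branch:

```python
INF = float('inf')
THRESHOLD, BURST_SLICE, BLOCK_TIMEOUT = 100, 60, 600

class DosModel:
    def __init__(self):
        self.now = 0                    # manually advanced clock
        self.dos_counter = 0
        self.dos_block_counter = 0
        self.burst = {'value': 0, 'expiration': INF}
        self.block = {'value': 0, 'expiration': INF}
        self.block_flag = {'value': 0, 'expiration': INF}

    def process(self, non_static=True):
        """Handle one request; return True if it was dropped."""
        # Rules 981044-981046: while blocked, count and drop.
        if self.block['value'] == 1:
            if self.block_flag['value'] == 0:
                self.block_flag = {'value': 1, 'expiration': self.now + 60}
                self.dos_block_counter = 0
            self.dos_block_counter += 1
            return True
        # Rule 981047: count requests for non-static resources.
        if non_static:
            self.dos_counter += 1
        # Rule 981048: threshold crossed -> record a burst.
        if self.dos_counter > THRESHOLD:
            self.burst = {'value': self.burst['value'] + 1,
                          'expiration': self.now + BURST_SLICE}
            self.dos_counter = 0
        # Rule 981049: two bursts -> block the address.
        if self.burst['value'] >= 2:
            self.block = {'value': 1, 'expiration': self.now + BLOCK_TIMEOUT}
        return False

    def expire(self):
        # Background sweep resetting any variables past expiration.
        for var in (self.block, self.burst, self.block_flag):
            if self.now >= var['expiration']:
                var['value'], var['expiration'] = 0, INF

model = DosModel()
# Two rapid bursts of 101 requests each trip the block.
for _ in range(202):
    model.process()
```

Running the model confirms the behavior described earlier: the 202nd request sets the block, subsequent requests are dropped, and advancing the clock past the block timeout and calling expire() lifts the block.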


Tuesday, May 17, 2016

ModSecurity for Amazon Linux AMI with Apache 2.4 and ELB

UPDATE:  Check out my online interactive tool for auditing ModSecurity log quickly.

I have been searching for a free open source solution to protect my web application against prying hackers, malicious screen scrapers, illegitimate crawlers, rampant bots and abusive API users.  Besides being free and open source, the minimum requirement is that the solution can identify rogue user IP addresses and blacklist them if necessary.  Preferably, the solution can also protect (somewhat) against denial-of-service (DOS) attack and implement API rate limiting.

My system requirements are:
  • Amazon ELB
  • EC2 Nodes
  • Amazon Linux AMI
  • Apache 2.4 (MPM Event)

Blacklist IP Addresses


Since my EC2 nodes are behind a load balancer, a simple traditional solution like iptables, which operates below the application layer, will not work: it would pick up the IP address of the load balancer instead of the real user IP address, which the load balancer stores at the application layer in the X-Forwarded-For HTTP header.

To identify and log rogue user IP addresses, you can modify the Apache configuration file at /etc/httpd/conf/httpd.conf.  Edit the LogFormat lines to include "%{X-Forwarded-For}i".  For example, the edited lines may look like:

LogFormat "%{X-Forwarded-For}i %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%{X-Forwarded-For}i %h %l %u %t \"%r\" %>s %b" common

Then, restart Apache with sudo service httpd restart.

From now on, the user IP will be logged in the first field of each line of the access log at /var/log/httpd/access_log:

X.X.X.X, Z.Z.Z.Z 127.0.0.1 - - [16/May/2016:16:39:03 +0000] "POST / HTTP/1.1" 200 930 "-" "-"
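Pulling the real client IP out of such a line takes a little care because the X-Forwarded-For field can itself contain a comma-separated chain.  Here is a small Python sketch of the extraction, using RFC 5737 documentation addresses in place of the masked ones; note that earlier entries in the chain can be spoofed by the client, so the entry appended by the ELB (the last one) is the trustworthy one:

```python
import re

LOG_LINE = ('203.0.113.9, 198.51.100.7 127.0.0.1 - - '
            '[16/May/2016:16:39:03 +0000] "POST / HTTP/1.1" 200 930 "-" "-"')

def xff_chain(line):
    """Return the X-Forwarded-For chain from a log line produced by the
    modified LogFormat above (XFF field first, then %h)."""
    m = re.match(r'^((?:\S+,\s)*\S+)\s', line)
    return [ip.strip() for ip in m.group(1).split(',')] if m else []

chain = xff_chain(LOG_LINE)
# The last entry was appended by the load balancer itself; earlier
# entries arrived in the client's own X-Forwarded-For header.
real_ip = chain[-1]
```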

With the user IP address available in the log, you can blacklist a user in different ways.  Without installation of additional modules or tools, you can blacklist manually by adding the IP address directly to the .htaccess file.  The following lines blacklist two IP addresses - X.X.X.X and Y.Y.Y.Y:

SetEnvIF X-Forwarded-For "(,| |^)X\.X\.X\.X(,| |$)" IP_BLACKLIST
SetEnvIF X-Forwarded-For "(,| |^)Y\.Y\.Y\.Y(,| |$)" IP_BLACKLIST
deny from env=IP_BLACKLIST

To automate blacklisting, you can install tools like Fail2Ban with sudo yum install fail2ban.  Fail2Ban will monitor the Apache log, extract the user IP and ban based on defined rules.  For more information, check out the numerous online materials available.


ModSecurity


There are a few free solutions available for DOS protection.  For example, there is the Apache module mod_evasive.  To the best of my knowledge, mod_evasive by itself does not work for a web server located behind a load balancer or proxy because it cannot access the user IP from the X-Forwarded-For header field.  The additional installation of mod_rpaf is required to bypass the limitation.  At the time of writing, the only way to install both modules on Amazon Linux for Apache 2.4 is to download and compile the source code.  Furthermore, mod_evasive is only compatible with Apache running in prefork mode, so if your Apache is using MPM worker or event, you are out of luck.  To find out which MPM module Apache is using, check the configuration file at /etc/httpd/conf.modules.d/00-mpm.conf

The alternative solution I have been exploring is ModSecurity.  To install, run sudo yum install mod24_security.  ModSecurity is a web application firewall (WAF) designed to protect Apache, NGINX and IIS against common hacking exploits.  It works by examining web requests against a set of rules to identify malicious traffic pattern (e.g. HTTP header missing user-agent) and execute the corresponding actions (e.g. drop connection).  To make life easier, you can download a predefined set of generic attack detection rules called the OWASP ModSecurity Core Rule Set (CRS) via sudo yum install mod_security_crs.  You can take a look at what the rules look like at https://github.com/SpiderLabs/owasp-modsecurity-crs

The CRS rules are installed at /etc/httpd/modsecurity.d/activated_rules.  You may also add your own rules at /etc/httpd/modsecurity.d/local_rules.  Out of the box, the CRS rules will likely generate many false alarms for your particular website.  This means they may inadvertently shut your users out of your site if you are not careful.  For example, they may mistakenly identify a legitimate HTTP POST request with more than 255 parameters as an exploit even if your application expects it.

At the minimum, before you deploy ModSecurity to production use, find the following line from ModSecurity configuration file at /etc/httpd/conf.d/mod_security.conf

SecRuleEngine On

and change it to:

SecRuleEngine DetectionOnly

This sets ModSecurity to detection mode so it only reports potential exploits without enforcement.  Every time you make changes to the configuration or rules, you must restart Apache with sudo service httpd restart.  If everything goes well, your web application should function normally as before while ModSecurity checks every web request and logs potential problems to /var/log/httpd/modsec_audit.log.

You can control what information ModSecurity logs with the SecAuditLogParts directive in the configuration file at /etc/httpd/conf.d/mod_security.conf

SecAuditLogParts ABHZ

For example, you can set it to log the request headers, request body, response headers, etc.  A log entry for a potentially malicious web request may look like this:


--61080530-B--
POST /index HTTP/1.1
host: my_website123.com
Accept: application/json, text/javascript, */*; q=0.01
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Referer: https://my_website123/index
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
X-Requested-With: XMLHttpRequest
X-Forwarded-For: X.X.X.X
X-Forwarded-Port: 443
X-Forwarded-Proto: https
Content-Length: 12056
Connection: keep-alive

--61080530-H--
Message: Warning. Operator GT matched 255 at ARGS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_23_request_limits.conf"] [line "31"] [id "960335"] [rev "2"] [msg "Too many arguments in request"] [severity "WARNING"] [ver "OWASP_CRS/2.2.8"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/POLICY/SIZE_LIMIT"]
Apache-Handler: application/x-httpd-php
Stopwatch: 1463421036197726 318695 (- - -)
Stopwatch2: 1463421036197726 318695; combined=64071, p1=199, p2=63680, p3=0, p4=0, p5=120, sr=67, sw=72, l=0, gc=0
Producer: ModSecurity for Apache/2.8.0 (http://www.modsecurity.org/); OWASP_CRS/2.2.8.
Server: Apache
Engine-Mode: "DETECTION_ONLY"


In this example, ModSecurity reports that a web request to https://my_website123/index originating from X.X.X.X violates one of the CRS rules for having more than 255 HTTP POST parameters.  The rule ID is 960335.

You can disable a particular rule by adding the following line to /etc/httpd/modsecurity.d/local_rules/modsecurity_localrules.conf

SecRuleRemoveById 960335

Removing a rule may weaken the firewall you are trying to build.  Instead, you can modify the rule to match your traffic pattern.  For the HTTP POST parameter limit violation, you can also increase the tx.max_num_args parameter value in /etc/httpd/modsecurity.d/modsecurity_crs_10_config.conf.


Denial of Service (DOS) Protection


Getting back to my original purpose of setting up a firewall, I can now use ModSecurity to blacklist IPs as well as protect (somewhat) against denial of service attack by using the anti-automation rule set modsecurity_crs_11_dos_protection.conf:

#
# Anti-Automation rule set for detecting Denial of Service Attacks. 
#
#
# Enforce an existing IP address block and log only 1-time/minute
# We don't want to get flooded by alerts during an attack or scan so
# we are only triggering an alert once/minute.  You can adjust how often
# you want to receive status alerts by changing the expirevar setting below.
#
SecRule IP:DOS_BLOCK "@eq 1" "chain,phase:1,id:'981044',drop,msg:'Denial of Service (DoS) Attack Identified from %{tx.real_ip} (%{tx.dos_block_counter} hits since last alert)',setvar:ip.dos_block_counter=+1"
SecRule &IP:DOS_BLOCK_FLAG "@eq 0" "setvar:ip.dos_block_flag=1,expirevar:ip.dos_block_flag=60,setvar:tx.dos_block_counter=%{ip.dos_block_counter},setvar:ip.dos_block_counter=0"

#
# Block and track # of requests but don't log
SecRule IP:DOS_BLOCK "@eq 1" "phase:1,id:'981045',t:none,drop,nolog,setvar:ip.dos_block_counter=+1"

#
# skipAfter Check
# There are different scenarios where we don't want to do checks -
# 1. If the current IP address has already been blocked due to high requests
# In this case, we skip doing the request counts.
#
SecRule IP:DOS_BLOCK "@eq 1" "phase:5,id:'981046',t:none,nolog,pass,skipAfter:END_DOS_PROTECTION_CHECKS"

#
# DOS Counter
# Count the number of requests to non-static resoures
# 
SecRule REQUEST_BASENAME "!\.(jpe?g|png|gif|js|css|ico)$" "phase:5,id:'981047',t:none,nolog,pass,setvar:ip.dos_counter=+1"

#
# Check DOS Counter
# If the request count is greater than or equal to user settings,
# we then set the burst counter
# 
SecRule IP:DOS_COUNTER "@gt %{tx.dos_counter_threshold}" "phase:5,id:'981048',t:none,nolog,pass,t:none,setvar:ip.dos_burst_counter=+1,expirevar:ip.dos_burst_counter=%{tx.dos_burst_time_slice},setvar:!ip.dos_counter"

#
# Check DOS Burst Counter and set Block
# Check the burst counter - if greater than or equal to 2, then we set the IP
# block variable for 5 mins and issue an alert.
#
SecRule IP:DOS_BURST_COUNTER "@ge 2" "phase:5,id:'981049',t:none,log,pass,msg:'Potential Denial of Service (DoS) Attack from %{tx.real_ip} - # of Request Bursts: %{ip.dos_burst_counter}',setvar:ip.dos_block=1,expirevar:ip.dos_block=%{tx.dos_block_timeout}"

SecMarker END_DOS_PROTECTION_CHECKS


You can tailor the burst detection pattern by editing the file at /etc/httpd/modsecurity.d/modsecurity_localrules.conf:

#
# -- [[ DoS Protection ]] ----------------------------------------------------------------
#
# If you are using the DoS Protection rule set, then uncomment the following
# lines and set the following variables:
# - Burst Time Slice Interval: time interval window to monitor for bursts
# - Request Threshold: request # threshold to trigger a burst
# - Block Period: temporary block timeout
#
SecAction \
  "id:'900015', \
  phase:1, \
  t:none, \
  setvar:'tx.dos_burst_time_slice=60', \
  setvar:'tx.dos_counter_threshold=100', \
  setvar:'tx.dos_block_timeout=600', \
  nolog, \
  pass"


ModSecurity is a powerful tool to protect web applications and as such it comes with a learning curve.   I have only touched on the basics in this blog entry.  Hopefully, I can devote some more blog time to it as I pick up the tool myself.

To develop your own tunnel vision quickly as I have been doing, I recommend taking a look at the official documentation at:

https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual