Saturday, July 4, 2020

Identifying Forename and Surname for Different Ethnicities using Machine Learning

In the previous post, I used logistic regression and tf-idf from Python’s sklearn to predict ethnicity based on a person’s forename, middle name and surname.  In this post, I’ll continue to use the same Wikipedia dataset, but this time the task is to predict whether a name is a forename or a surname.  The motivation is to use such a classifier for input form validation and ETL.


To start with, let’s summarize the number of records available:


Ethnicity            Forenames   Surnames    Total
Nordic                   1,691      3,230    4,921
EastAsian                2,331      2,784    5,115
Germanic                 1,975      3,329    5,304
Africans                 2,969      3,170    6,139
Muslim                   3,539      4,557    8,096
Japanese                 3,864      4,236    8,100
IndianSubContinent       5,133      3,839    8,972
EastEuropean             2,888      6,627    9,515
Hispanic                 3,662      6,254    9,916
Jewish                   3,676      6,304    9,980
Italian                  4,315      8,781   13,096
French                   3,802      9,630   13,432
British                  6,933     16,062   22,995



We observe that:

  1.  Some datasets, like Nordic, have far fewer records than others
  2.  The French dataset has about 2.5 times more surnames than forenames


Our intuition tells us that a single general classifier to predict forename and surname for all ethnicities is probably a bad idea, as it would ignore the subtle differences in name spelling among ethnicities.  The confusion matrix from the previous post on ethnicity prediction speaks to this point.  Therefore, I’ll build a separate forename-surname classifier for each ethnicity.


I will use the same logistic regression approach from the previous post, which engineers bigram, trigram and quadgram features from names.  To address the data imbalance in observation #2, I’ll upsample the minority class using the handy Imblearn package.  To address the curse of dimensionality in observation #1, I’ll employ the following approaches to avoid overfitting the training data:

  1. Penalize large coefficients in logistic regression
  2. Reduce the number of dimensions using SVD
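Putting these pieces together, here is a minimal sketch of one per-ethnicity classifier.  The names below are made-up placeholders rather than the Wikipedia data, and sklearn’s resample stands in for Imblearn’s RandomOverSampler so the example is self-contained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

# Placeholder data for a single ethnicity (the real data comes from Wikipedia).
forenames = ["erik", "sven", "astrid", "ingrid"]
surnames = ["larsson", "johansson", "lindqvist", "berg", "nilsson", "dahl"]

# Upsample the minority class to match the majority class (the post uses
# Imblearn's RandomOverSampler; resample achieves the same effect here).
forenames = resample(forenames, replace=True,
                     n_samples=len(surnames), random_state=0)

X = forenames + surnames
y = ["forename"] * len(forenames) + ["surname"] * len(surnames)

# Character bigrams/trigrams/quadgrams weighted by tf-idf, then
# L2-penalized logistic regression (C controls the penalty strength).
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
model.fit(X, y)
print(model.predict(["olaf"]))
```

The same pipeline is then fit once per ethnicity, giving thirteen independent classifiers.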


Without further ado, here is the classification performance on a 20% holdout data using upsampling and L2 penalization:


Ethnicity            Precision   Recall      F1
Nordic                   79.0%    78.2%   78.4%
EastAsian                63.4%    62.8%   62.8%
Germanic                 76.9%    75.9%   76.1%
Africans                 59.7%    59.7%   59.7%
Muslim                   67.2%    66.4%   66.6%
Japanese                 60.4%    60.3%   60.3%
IndianSubContinent       60.6%    60.1%   60.3%
EastEuropean             79.6%    77.2%   77.9%
Hispanic                 71.1%    70.6%   70.8%
Jewish                   74.8%    74.0%   74.3%
Italian                  76.4%    74.0%   74.6%
French                   76.4%    75.0%   75.6%
British                  74.8%    73.7%   74.2%


The performance is about the same with SVD dimensionality reduction, keeping enough components to capture 80 to 90% of the variance in the data.  Generally, that reduces the number of features by about 80%.


Ethnicity            Precision   Recall      F1
Nordic                   78.4%    77.6%   77.8%
EastAsian                64.1%    63.4%   63.5%
Germanic                 77.0%    76.0%   76.2%
Africans                 60.0%    59.9%   59.9%
Muslim                   67.2%    66.4%   66.5%
Japanese                 61.3%    61.2%   61.2%
IndianSubContinent       60.6%    60.2%   60.3%
EastEuropean             80.2%    77.9%   78.5%
Hispanic                 71.8%    70.9%   71.2%
Jewish                   74.8%    73.9%   74.2%
Italian                  76.4%    73.7%   74.4%
French                   76.8%    75.1%   75.7%


The results are sorted by dataset size, from smallest to largest.  The result for British is missing because my old laptop ran out of memory during the SVD matrix factorization.
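For the SVD step, the number of components can be chosen by accumulating explained variance until the target threshold is reached.  A toy sketch of that selection, using made-up names rather than the Wikipedia data:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder names; the real matrix comes from the Wikipedia dataset.
names = ["larsson", "johansson", "lindqvist", "berg", "nilsson",
         "erik", "sven", "astrid", "ingrid", "dahl"]
X = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(names)

# Fit as many components as the toy matrix allows, then keep the smallest
# prefix of components whose cumulative explained variance reaches 80%.
svd = TruncatedSVD(n_components=min(X.shape) - 1, random_state=0)
svd.fit(X)
cumvar = np.cumsum(svd.explained_variance_ratio_)
n_components = min(int(np.searchsorted(cumvar, 0.80)) + 1, len(cumvar))
print(n_components, "components out of", X.shape[1], "tf-idf features")
```

On the real tf-idf matrices this prefix is typically around 20% of the original feature count, which is where the 80% reduction comes from.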


Interestingly, focusing on the F1 score, we see that poor performance is not necessarily correlated with small data.  For example, Nordic, the smallest dataset, yields a rather high F1 score.  The ethnicities with F1 scores below 70% are Africans, IndianSubContinent, Japanese, EastAsian and Muslim.


Some of the Muslim names misclassified are:

  • navi, rabia, bapsi, khar, ravi, fadela, szold


Some of the EastAsian names misclassified are:

  • guanqiu, huang, heshen, dai, eitel, gagnon, ming, kimble, liang, samata


Besides insufficient data being a possible cause of poor classification performance, it's conceivable that some of these misclassified names would be tricky even for humans to label as a forename or a surname.  For example, "Davis" can be either.


In the meantime, a classifier with 60% accuracy can still be useful.  For example, in input form validation where a forename and surname are entered and the goal is to check that they went into the right fields, we can use the classifier to predict whether the user accidentally swapped them.


A swap occurs if the classifier predicts that the input forename is a surname and the input surname is a forename.  For a classifier with 60% accuracy, the chance of erroneously predicting a swap and alerting the user is only 100 x (0.4 x 0.4) = 16%.
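A minimal sketch of the swap check and the false-alarm arithmetic, where predict is a stand-in for any trained forename-surname classifier:

```python
def swap_suspected(predict, forename_field, surname_field):
    # Flag a swap only when BOTH fields look wrong: the forename field is
    # classified as a surname AND the surname field as a forename.
    return (predict(forename_field) == "surname"
            and predict(surname_field) == "forename")

# With 60% per-field accuracy, a false alarm on a correctly filled form
# requires two independent misclassifications:
p_error = 1 - 0.6
false_alarm = p_error * p_error
print(false_alarm)  # 0.4 * 0.4 = 0.16, i.e. 16%
```

Requiring both predictions to disagree with the input is what drives the false-alarm rate below the error rate of a single prediction.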


Sunday, June 28, 2020

Predicting Ethnicity based on Person Name using Machine Learning

Recently, I stumbled upon a Python app called ethnicolr that uses deep learning to predict ethnicity based on a person’s forename and surname.  The app uses an LSTM on sequences of bigrams generated from the surname and forename as features.  Three different models were trained using data collected from the US Census, Florida voter registration and Wikipedia.  Of the three, the Wikipedia dataset appears the most interesting because it offers broader ethnicity categories than the others: East Asian, Japanese, Indian SubContinent, Africans, Muslim, British, East European, Jewish, French, Germanic, Hispanic, Italian, and Nordic.  The reported precision and recall of the LSTM model on the Wikipedia dataset are both approximately 73%.


Intrigued by the approach, I was curious how the performance would fare by using more traditional classifiers such as logistic regression which tends to take less time to train than LSTM.  The full Jupyter notebook is available here.


In summary, logistic regression produces slightly better precision and recall than the LSTM and also takes much less time to train (a few minutes versus about an hour on a 5-year-old computer).  At the time of writing, average precision is 75% and recall is 76%.  The full confusion matrix is:



Perhaps as expected, some races such as Africans and Jewish are harder to predict than others due to more mixing and displacements over generations.


The major difference between the LSTM and the regression model is that the former uses only bigrams while the latter uses bigrams, trigrams and quadgrams as features.  The regression model also weights the features by applying tf-idf.  In addition, the regression model includes the middle name whenever it’s available, though the performance improvement from adding it was observed to be negligible.


As a quick demo, the regression model was run on the following world leaders to predict their ethnicity:


Name                  Ethnicity
Abe Shinzo            Japanese
Conte Giuseppe        Italian
Johnson Boris         British
Modi Narendra         Indian SubContinent
Netanyahu Benjamin    Jewish
Obama Barack          Africans
Putin Vladimir        East European
Xi Jinping            East Asian


It's worth noting that a few other approaches were attempted to improve precision and recall, such as XGBoost and dataset balancing techniques.  At best, their performance was on par with logistic regression.  It is, of course, conceivable that more tuning might have been required for those methods to excel.