In the previous post, I used logistic regression and tf-idf from Python's sklearn to predict ethnicity from a person's forename, middle name and surname. In this post, I'll continue with the same Wikipedia dataset, but this time the task is to predict whether a name is a forename or a surname. The motivation is to use such a classifier for input form validation and ETL.
To start with, let’s summarize the number of records available:
| Ethnicity | Forenames | Surnames | Total |
|---|---|---|---|
| Nordic | 1,691 | 3,230 | 4,921 |
| EastAsian | 2,331 | 2,784 | 5,115 |
| Germanic | 1,975 | 3,329 | 5,304 |
| Africans | 2,969 | 3,170 | 6,139 |
| Muslim | 3,539 | 4,557 | 8,096 |
| Japanese | 3,864 | 4,236 | 8,100 |
| IndianSubContinent | 5,133 | 3,839 | 8,972 |
| EastEuropean | 2,888 | 6,627 | 9,515 |
| Hispanic | 3,662 | 6,254 | 9,916 |
| Jewish | 3,676 | 6,304 | 9,980 |
| Italian | 4,315 | 8,781 | 13,096 |
| French | 3,802 | 9,630 | 13,432 |
| British | 6,933 | 16,062 | 22,995 |
We observe that:
- Some datasets like Nordic have much fewer records than others
- The dataset for French has 2.5 times more surnames than forenames
Our intuition tells us that a general classifier to predict forename and surname for all ethnicities is probably a bad idea as it ignores the subtle differences in name spelling among ethnicities. The confusion matrix from the previous post on ethnicity prediction speaks to this point. Therefore, I’ll build a separate forename-surname classifier for each ethnicity.
I will use the same logistic regression approach from the previous post, which engineers character bigram, trigram and quadgram features from names. To address the data imbalance noted in observation #2, I'll upsample the minority class using the handy Imblearn package. To address the curse of dimensionality behind observation #1, I'll employ the following approaches to avoid overfitting the training data:
- Penalize large coefficients in logistic regression
- Reduce the number of dimensions using SVD
Without further ado, here is the classification performance on a 20% holdout set using upsampling and L2 penalization:
| Ethnicity | Precision | Recall | F1 |
|---|---|---|---|
| Nordic | 79.0% | 78.2% | 78.4% |
| EastAsian | 63.4% | 62.8% | 62.8% |
| Germanic | 76.9% | 75.9% | 76.1% |
| Africans | 59.7% | 59.7% | 59.7% |
| Muslim | 67.2% | 66.4% | 66.6% |
| Japanese | 60.4% | 60.3% | 60.3% |
| IndianSubContinent | 60.6% | 60.1% | 60.3% |
| EastEuropean | 79.6% | 77.2% | 77.9% |
| Hispanic | 71.1% | 70.6% | 70.8% |
| Jewish | 74.8% | 74.0% | 74.3% |
| Italian | 76.4% | 74.0% | 74.6% |
| French | 76.4% | 75.0% | 75.6% |
| British | 74.8% | 73.7% | 74.2% |
The performance is about the same with SVD dimensionality reduction capturing 80 to 90% of the variance in the data, which generally reduces the number of features by about 80%.
| Ethnicity | Precision | Recall | F1 |
|---|---|---|---|
| Nordic | 78.4% | 77.6% | 77.8% |
| EastAsian | 64.1% | 63.4% | 63.5% |
| Germanic | 77.0% | 76.0% | 76.2% |
| Africans | 60.0% | 59.9% | 59.9% |
| Muslim | 67.2% | 66.4% | 66.5% |
| Japanese | 61.3% | 61.2% | 61.2% |
| IndianSubContinent | 60.6% | 60.2% | 60.3% |
| EastEuropean | 80.2% | 77.9% | 78.5% |
| Hispanic | 71.8% | 70.9% | 71.2% |
| Jewish | 74.8% | 73.9% | 74.2% |
| Italian | 76.4% | 73.7% | 74.4% |
| French | 76.8% | 75.1% | 75.7% |
The results are sorted by dataset size, from smallest to largest. The result for British is missing because my old laptop ran out of memory during the SVD matrix factorization.
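The dimensionality reduction step can be sketched with sklearn's TruncatedSVD, keeping the smallest number of components whose cumulative explained variance crosses the target. The 80% threshold and the toy names below are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy names for illustration only
names = ["marie", "pierre", "claire", "jean", "luc", "anne",
         "dubois", "lefevre", "moreau", "bernard", "petit", "laurent"]
X = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(names)

# Fit as many components as the data can support, then keep the smallest
# prefix explaining at least 80% of the variance
n_max = min(X.shape) - 1
svd = TruncatedSVD(n_components=n_max, random_state=42)
X_reduced = svd.fit_transform(X)
cum_var = np.cumsum(svd.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.80)) + 1
print(f"keep {k} of {n_max} components ({X.shape[1]} original features)")
```

In practice you would then refit `TruncatedSVD(n_components=k)` inside the training pipeline and feed the reduced matrix to the logistic regression.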
Interestingly, focusing on the F1 score, we see that poor performance is not necessarily correlated with small data. For example, Nordic, the smallest dataset, yields a rather high F1 score. The ethnicities with F1 scores below 70% are Africans, IndianSubContinent, Japanese, EastAsian and Muslim.
Some of the Muslim names misclassified are:
- navi, rabia, bapsi, khar, ravi, fadela, szold
Some of the EastAsian names misclassified are:
- guanqiu, huang, heshen, dai, eitel, gagnon, ming, kimble, liang, samata
Besides insufficient data being a possible cause of poor classification performance, it's conceivable that some of these misclassified names would be tricky even for humans to label as a forename or a surname. For example, "Davis" can be either a forename or a surname.
Even so, a classifier with 60% accuracy can still be useful. For example, in input form validation, where a forename and a surname are entered and the goal is to check that they went into the right fields, we can use the classifier to detect whether the user accidentally swapped them.
A swap is flagged when the classifier predicts that the input forename is a surname and the input surname is a forename. For a classifier with 60% accuracy, erroneously alerting the user on a correctly filled form requires two independent wrong predictions, so the chance is only 0.4 × 0.4 = 16%.
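The swap check and its false-alert arithmetic can be sketched as follows; the label strings and function name are hypothetical:

```python
# Hypothetical swap check: flag a swap only when BOTH field predictions
# disagree with the field they were entered in.
def swap_suspected(forename_field_pred: str, surname_field_pred: str) -> bool:
    return forename_field_pred == "surname" and surname_field_pred == "forename"

# For a classifier with 60% accuracy, a false alert on a correctly filled
# form requires two independent wrong predictions: 0.4 * 0.4 = 0.16
accuracy = 0.6
false_alert_rate = (1 - accuracy) ** 2
print(round(false_alert_rate, 2))  # 0.16
```

Note the independence assumption: if the forename and surname classifiers make correlated errors (say, on the same ethnicity), the actual false-alert rate can differ from 16%.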