The Guide to Tunnel Vision: June 2020

Recently, I stumbled upon a Python package called ethnicolr that uses deep learning to predict ethnicity from a person's forename and surname. It trains an LSTM on sequences of bigrams generated from the surname and forename. Three different models were trained using data collected from the US Census, Florida voter registration and Wikipedia. Of the three, the Wikipedia dataset appears the most interesting because it offers a broader set of ethnicity categories than the others: East Asian, Japanese, Indian SubContinent, Africans, Muslim, British, East European, Jewish, French, Germanic, Hispanic, Italian, and Nordic. The reported precision and recall of the LSTM model on the Wikipedia dataset are both approximately 73%.
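
To make the bigram features concrete, here is a minimal sketch of how a name could be split into overlapping character bigrams. The helper names and the preprocessing (surname first, joined with a space, lower-cased) are my own assumptions for illustration; the library's actual preprocessing may differ.

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def name_to_bigrams(surname, forename):
    # Assumption: surname and forename are joined with a space and
    # lower-cased before being split. This is illustrative only.
    full_name = f"{surname} {forename}".lower()
    return char_ngrams(full_name, n=2)


print(name_to_bigrams("Abe", "Shinzo"))
# ['ab', 'be', 'e ', ' s', 'sh', 'hi', 'in', 'nz', 'zo']
```

Each bigram would then be mapped to an integer index so that the name becomes a sequence an LSTM can consume.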

Intrigued by the approach, I was curious how a more traditional classifier such as logistic regression, which tends to take less time to train than an LSTM, would perform. The full Jupyter notebook is available here.

In summary, logistic regression produced slightly better precision and recall than the LSTM and took much less time to train (a few minutes versus about an hour on a 5-year-old computer). At the time of writing, average precision is 75% and recall is 76%. The full confusion matrix is:

Perhaps as expected, some categories such as Africans and Jewish are harder to predict than others, likely due to more mixing and displacement over generations.
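
For readers who want to see how per-category figures like these are derived, the sketch below computes precision and recall for each class from pairs of true and predicted labels. The labels are made up purely for illustration and are not results from the notebook, and the function names are hypothetical. How the averages quoted above were computed (macro or weighted) is not something this sketch assumes.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    counts = confusion_counts(y_true, y_pred)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_pos = counts[(label, label)]
        # Column sum of the confusion matrix: everything predicted as `label`.
        predicted = sum(c for (_, p), c in counts.items() if p == label)
        # Row sum of the confusion matrix: everything that truly is `label`.
        actual = sum(c for (t, _), c in counts.items() if t == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


# Made-up labels for illustration only.
y_true = ["British", "British", "Jewish", "Jewish", "Italian"]
y_pred = ["British", "British", "British", "Jewish", "Italian"]

report = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example one Jewish name is misclassified as British, which lowers recall for Jewish and precision for British at the same time.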

The major difference between the LSTM and the regression model is that the former uses only bigrams while the latter uses bigrams, trigrams and quadgrams as features. The regression model also weights the features by applying tf-idf. In addition, the regression model includes the middle name whenever it's available, though the performance improvement from adding the middle name was observed to be negligible.
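
A setup along these lines can be sketched with scikit-learn. This is not the notebook's code: the choice of scikit-learn, the character-level analyzer, and all hyperparameters are my assumptions, and the training data is a toy set of eight names borrowed from the world-leaders demo in this post, one per category, which is far too small for a real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only: one name per category.
names = [
    "Abe Shinzo",
    "Conte Giuseppe",
    "Johnson Boris",
    "Modi Narendra",
    "Netanyahu Benjamin",
    "Obama Barack",
    "Putin Vladimir",
    "Xi Jinping",
]
labels = [
    "Japanese",
    "Italian",
    "British",
    "Indian SubContinent",
    "Jewish",
    "Africans",
    "East European",
    "East Asian",
]

model = make_pipeline(
    # Assumption: character n-grams of length 2 to 4, weighted by tf-idf.
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4), lowercase=True),
    # max_iter is raised so the solver has room to converge.
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)

# Predict on the training names just to show the call shape.
predictions = model.predict(names)
for name, predicted in zip(names, predictions):
    print(f"{name}: {predicted}")
```

Putting the vectorizer and the classifier in one pipeline keeps the n-gram extraction and tf-idf weighting identical at training and prediction time.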

As a quick demo, the regression model was run on the following world leaders to predict their ethnicity:

Name                  Ethnicity
Abe Shinzo            Japanese
Conte Giuseppe        Italian
Johnson Boris         British
Modi Narendra         Indian SubContinent
Netanyahu Benjamin    Jewish
Obama Barack          Africans
Putin Vladimir        East European
Xi Jinping            East Asian

It is worth noting that a few other approaches were attempted to improve precision and recall, such as XGBoost and dataset balancing techniques. At best, their performance was on par with logistic regression. It is, of course, conceivable that more tuning would have been needed for those methods to excel.

Sunday, June 28, 2020

Predicting Ethnicity based on Person Name using Machine Learning
