This repository contains minimal Python 3.0 code to infer religion from South Asian names, and race/ethnicity from names based on the North Carolina voter registration data in the United States. For complete details, refer to our paper, “It’s All in the Name: A Character-Based Approach to Infer Religion”.
The full replication repository, with a detailed readme file, is available on Harvard Dataverse at https://doi.org/10.7910/DVN/JOEVPN
The code folder comprises the following files:
- requirements.txt: lists the dependencies required for the code to work
- religion.ipynb: infers religion from South Asian names using a Support Vector Machine (SVM) based model:
  - Follow the same setup steps as for race.ipynb (described below). In this case, replace the “sample_data.csv” file with your own file containing each individual’s name (name) and/or their parent’s or spouse’s name (parent).
  - Set the option concat_model to True to obtain predictions based on the individual’s name plus the parent’s or spouse’s name, or to False to use the individual’s name only. Set n_way to “2class” to infer the religion as Muslim or non-Muslim only, or to “multiclass” to infer the religion as “Hindu”, “Muslim”, “Christian”, “Sikh”, “Jain”, or “Buddhist”.
  - Execute all the remaining cells. The final output is saved in the file “data\predictions\sample_data.csv”, with the predicted religion and the corresponding decision function scores returned by the SVM model.
- race.ipynb: infers race/ethnicity by combining population compositions at the zip code level with the first, middle, and last names of individuals, using a Convolutional Neural Network (CNN) based model. It can be run easily in Google Colaboratory after making a copy of the entire project folder:
  - Use the given code snippet to mount Google Drive and install the packages in requirements.txt, then restart the runtime (available in the Runtime tab).
  - Replace the file “sample_data_race.csv” in the data folder with your own list of names, along with the race composition at the zip code level for each individual. The sample file has the variables fullname (first, middle, and last names concatenated with spaces) and the proportions of Whites (pz_whi), Blacks (pz_bla), Hispanics (pz_his), Asians (pz_asi), and Others (pz_oth) in the zip code where the individual resides. Modify these variables to match the file you have.
  - Set the usegis option to True to use the zip code information, or to False to ignore it.
  - Execute all the subsequent cells. The final predictions are saved in the file “data\predictions\sample_data_race.csv”. The columns A, B, H, O, and W report the probability that the name belongs to an Asian, Black, Hispanic, Other, or White individual respectively.
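The input and output layout for race.ipynb described above can be sketched in a few lines of standard-library Python. This is only an illustration, not part of the notebooks: the helper names `make_input_row` and `most_likely_label` are hypothetical, the example person and shares are made up, and the check that the zip-code shares sum to about 1 assumes they are stored as fractions rather than percentages.

```python
import csv
import io

# Input columns documented in this README for sample_data_race.csv.
COMPOSITION_COLUMNS = ["pz_whi", "pz_bla", "pz_his", "pz_asi", "pz_oth"]
# Output columns documented in this README, mapped to readable labels.
LABELS = {"A": "Asian", "B": "Black", "H": "Hispanic", "O": "Other", "W": "White"}


def make_input_row(first, middle, last, composition):
    """Build one input row: space-separated full name plus zip-code shares."""
    parts = (first, middle, last)
    fullname = " ".join(p.strip() for p in parts if p and p.strip())
    # Assumption: shares are fractions in [0, 1] that sum to roughly 1.
    total = sum(composition[col] for col in COMPOSITION_COLUMNS)
    if abs(total - 1.0) > 0.01:
        raise ValueError(f"zip-code shares sum to {total:.3f}, expected about 1")
    row = {"fullname": fullname}
    row.update({col: composition[col] for col in COMPOSITION_COLUMNS})
    return row


def most_likely_label(prediction_row):
    """Return the label whose probability column (A, B, H, O, W) is largest."""
    best = max(LABELS, key=lambda col: float(prediction_row[col]))
    return LABELS[best]


# Made-up example person and zip-code composition.
row = make_input_row(
    "Jane", "", "Doe",
    {"pz_whi": 0.6, "pz_bla": 0.2, "pz_his": 0.1, "pz_asi": 0.05, "pz_oth": 0.05},
)

# Write the row as CSV text (to a file object in real use).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["fullname"] + COMPOSITION_COLUMNS)
writer.writeheader()
writer.writerow(row)

# Made-up prediction row in the documented output layout.
label = most_likely_label({"A": "0.1", "B": "0.2", "H": "0.05", "O": "0.05", "W": "0.6"})
```

Taking the column with the highest probability is one simple way to turn the five output columns into a single label; keeping the full probabilities is usually preferable when downstream analysis can use them.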
The religion inference models are trained using the Rural Economic and Demographic Survey collected by the National Council of Applied Economic Research. Instructions for obtaining this dataset are available at http://adfdell.pstc.brown.edu/arisreds_data/readme.txt. We also annotate the religion of 20,000 randomly selected household heads from rural Uttar Pradesh, largely comprising Hindus and Muslims. This dataset can be used to test for improvements in subsequent work.
The race inference models are trained using the North Carolina voter registration data in the United States available at https://www.ncsbe.gov/results-data/voter-registration-data
The models can make predictions using a single person’s name; however, the accuracy improves when we combine a person’s name with the name of a relative such as a parent or spouse.
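Before running religion.ipynb, it can help to check that the input file matches the chosen options. The sketch below is an assumption-laden illustration rather than code from the notebooks: the function `check_religion_input` is hypothetical, the names in the sample are made up, and it assumes that every row needs a non-empty name and that concat_model set to True needs a non-empty parent value as well.

```python
import csv
import io

# Values of n_way documented in this README.
VALID_N_WAY = {"2class", "multiclass"}


def check_religion_input(rows, concat_model, n_way):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    if n_way not in VALID_N_WAY:
        problems.append(f"n_way must be one of {sorted(VALID_N_WAY)}, got {n_way!r}")
    for index, row in enumerate(rows):
        # Assumption: the individual's name is always required.
        if not (row.get("name") or "").strip():
            problems.append(f"row {index}: missing name")
        # Assumption: the concatenated model needs the relative's name too.
        if concat_model and not (row.get("parent") or "").strip():
            problems.append(f"row {index}: concat_model=True but parent is empty")
    return problems


# Made-up sample in the documented layout: columns name and parent.
sample_csv = "name,parent\nAmit Kumar,Ram Kumar\nSalma Khatoon,\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

problems_concat = check_religion_input(rows, concat_model=True, n_way="2class")
problems_single = check_religion_input(rows, concat_model=False, n_way="multiclass")
```

With this sample, the second row has no parent name, so it is flagged only when concat_model is True; rows like this could be scored separately with concat_model set to False.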
The code is licensed under the GNU Affero General Public License version 3 (AGPL-3.0, see LICENCE).
If you find the work useful, please cite our paper and the dataverse entry as follows:
@article{chaturvedi_chaturvedi_2023,
title={It’s All in the Name: A Character-Based Approach to Infer Religion},
DOI={10.1017/pan.2023.6},
journal={Political Analysis},
publisher={Cambridge University Press},
author={Chaturvedi, Rochana and Chaturvedi, Sugat},
year={2023},
pages={1–16}
}
@data{DVN/JOEVPN_2023,
author = {Chaturvedi, Rochana and Chaturvedi, Sugat},
publisher = {Harvard Dataverse},
title = {{Replication Data for: It’s All in the Name: A Character Based Approach to Infer Religion}},
year = {2023},
version = {V1},
doi = {10.7910/DVN/JOEVPN},
url = {https://doi.org/10.7910/DVN/JOEVPN}
}
If you are an academic researcher, please feel free to write to us and we will be happy to answer any additional questions.
- Rochana Chaturvedi: rochana [dot] chaturvedi [at] gmail [dot] com
- Sugat Chaturvedi: sugat [dot] chaturvedi [at] gmail [dot] com