I have two sets of files: the first is a list of names, and the second is a directory of over 100 .txt files (the same data on first names in the US that I used in the last post). The problem to solve is how to check whether the names in the first file appear in the second file(s). I could search manually, or use the search function available for csv and text files, but as the amount of data or the number of files grows that becomes less workable. An alternative is to let Python and Pandas do the work: the code takes less than one second to run, so it is much faster than any manual search. Note: at this stage I don't care how many times a name appears in the files, only whether it appears at all.
import pandas as pd

# Neither kind of file has a header row, so supply the column names.
cols1 = ['name']
cols2 = ['name', 'gen', 'num']

ib_names = pd.read_csv('irishBoysNames.csv', names=cols1)
ig_names = pd.read_csv('irishGirlsNames.csv', names=cols1)
df = pd.read_csv('yob2010.txt', names=cols2)

# Split the 2010 data by gender.
boys = df[df.gen == 'M']
girls = df[df.gen == 'F']

# Keep only the name column so the check cannot match the gen or num values.
bnames = boys.name
gnames = girls.name

# Swap the commented lines in to check the boys' names instead.
for name in ig_names.name:
#for name in ib_names.name:
    exists = name in gnames.values
    #exists = name in bnames.values
    print(name + ': ' + str(exists))
The output for the yob2010.txt file:
The files have no header rows, so I define two lists of column names, one for each type of file. I then import the data and create new dataframes for the boys and the girls. Finally, I loop through the names in file one and, for each name, check whether it appears in file two (in the code above, file two is yob2010.txt).
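The same check could be extended to every file in the directory rather than a single year. The following is only a rough sketch, not the code that produced the output above. It assumes the other files follow the same yob*.txt naming pattern and three-column layout as yob2010.txt. The function name check_names and the small demo files are made up for illustration, and it uses the pandas isin method in place of the name-by-name loop.

```python
import glob
import os
import tempfile

import pandas as pd

COLS = ['name', 'gen', 'num']  # same column names as cols2


def check_names(names, data_dir, gender):
    """Return {name: True/False}: does the name appear, for the given
    gender, in any yob*.txt file in data_dir? (Hypothetical helper.)"""
    wanted = set(names)
    found = set()
    # Assumption: every data file matches the pattern yob*.txt.
    for path in sorted(glob.glob(os.path.join(data_dir, 'yob*.txt'))):
        df = pd.read_csv(path, names=COLS)
        in_file = df.loc[df.gen == gender, 'name']
        # isin() tests every row at once instead of looping name by name.
        found.update(in_file[in_file.isin(wanted)])
    return {name: name in found for name in names}


# Made-up demo data so the sketch runs without the real files.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'yob2009.txt'), 'w') as f:
    f.write('Aoife,F,50\nLiam,M,900\n')
with open(os.path.join(demo_dir, 'yob2010.txt'), 'w') as f:
    f.write('Niamh,F,30\nSean,M,400\n')

result = check_names(['Aoife', 'Niamh', 'Saoirse'], demo_dir, 'F')
for name, exists in result.items():
    print(name + ': ' + str(exists))
```

With the real data, the list of names would come from ig_names.name or ib_names.name and data_dir would point at the directory of .txt files. Collecting the matches in a set keeps the answer to "does it appear or not" without counting how many times or in which years.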