The dataset is available on Kaggle.
The data contains many missing values recorded as '?'. These cannot be passed to a machine learning algorithm, so I replaced them with NumPy's NaN.
df_raw = df_raw.replace('?', np.nan)
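If the file is loaded with pandas, the same substitution can also be done while reading it. This is only a sketch of that alternative, not the approach used above: the inline CSV, the `Age` column and the name `df_sample` are stand-ins for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle CSV (hypothetical values).
sample_csv = io.StringIO(
    "Age,Number of sexual partners\n"
    "18,4\n"
    "15,?\n"
    "34,1\n"
)

# na_values tells read_csv to treat '?' as missing while parsing,
# so the affected column is parsed as numeric rather than text.
df_sample = pd.read_csv(sample_csv, na_values="?")

print(df_sample.isna().sum())
print(df_sample.dtypes)
```

Handling the placeholder at load time means the later object-to-float conversion may not be needed for columns whose only non-numeric value was '?'.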
I also dropped some columns because they contained almost no data:
cols = ['STDs: Time since first diagnosis', 'STDs: Time since last diagnosis']
df_raw = df_raw.drop(cols, axis=1)
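One way to spot such columns is to look at the fraction of missing values in each one. The sketch below runs on made-up data; the 0.7 threshold and the names `df_demo`, `missing_fraction` and `sparse_cols` are chosen purely for illustration and are not the cut-off or names used in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one well-populated column, one almost empty.
df_demo = pd.DataFrame({
    "Age": [18, 15, 34, 52, 46],
    "STDs: Time since first diagnosis": [np.nan, np.nan, 3.0, np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_fraction = df_demo.isna().mean()

# Columns that are mostly empty (threshold chosen for illustration only).
sparse_cols = missing_fraction[missing_fraction > 0.7].index.tolist()
df_demo = df_demo.drop(columns=sparse_cols)

print(missing_fraction)
print(sparse_cols)
```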
I also changed the data type from object to float:
cols = ['Number of sexual partners',
'First sexual intercourse',
'Num of pregnancies',
'Hormonal Contraceptives (years)',
'STDs:pelvic inflammatory disease',
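The conversion step itself can be sketched as follows. It uses a small hypothetical frame and only two of the column names listed above; `astype` is my assumption about how the cast was done, and `df_demo` and `float_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose numeric values were read in as text.
df_demo = pd.DataFrame({
    "Number of sexual partners": ["4", "1", np.nan],
    "Num of pregnancies": ["1", np.nan, "3"],
})

float_cols = ["Number of sexual partners", "Num of pregnancies"]

# Cast the listed columns to float; NaN entries stay NaN.
df_demo[float_cols] = df_demo[float_cols].astype("float64")

print(df_demo.dtypes)
```

The cast only succeeds once every '?' has been replaced with NaN, which is why it comes after the replacement step.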
I then checked for correlations between the remaining columns and visualised them with this code:
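import matplotlib.pyplot as plt

def plot_corr(df, size=10):  # function name and default size are assumptions
    '''Function plots a graphical correlation matrix for each pair of columns in the dataframe.
    df: pandas DataFrame
    size: vertical and horizontal size of the plot'''
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)  # assumed: draws the matrix as a colour grid
    plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical')
    plt.yticks(range(len(corr.columns)), corr.columns)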
This can often make unusual correlations visible. For example, within this dataset the correlation between number of sexual partners and smoking is stronger than the correlation between number of sexual partners and STDs.
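A comparison like this can also be read directly from the correlation matrix rather than judged by eye. The sketch below uses made-up numbers only to show the lookup; the column names `Smokes` and `STDs` are assumptions about how the dataset labels those fields, and the values do not come from the real data.

```python
import pandas as pd

# Made-up numbers, only to show the lookup; not taken from the dataset.
df_demo = pd.DataFrame({
    "Number of sexual partners": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Smokes": [0.0, 0.0, 1.0, 1.0, 1.0],
    "STDs": [0.0, 1.0, 0.0, 0.0, 1.0],
})

# Pearson correlation between every pair of columns.
corr = df_demo.corr()

# Pull out the two coefficients being compared.
partners_vs_smoking = corr.loc["Number of sexual partners", "Smokes"]
partners_vs_stds = corr.loc["Number of sexual partners", "STDs"]

print(f"partners vs smoking: {partners_vs_smoking:.2f}")
print(f"partners vs STDs:    {partners_vs_stds:.2f}")
```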