Discovering Data

#100DaysOfDataScience

Day 10 - preparing data

7/15/2018

The dataset is available on Kaggle.

Many values in the data are recorded as '?'. These cannot be passed into a machine learning algorithm, so I replaced them with NumPy's NaN.

df_raw = df_raw.replace('?', np.nan)
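To check that the replacement did what I wanted, it helps to count the missing values per column afterwards. This is a minimal sketch on a made-up toy frame (the name toy and its values are assumptions for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_raw; the values are made up.
toy = pd.DataFrame({
    "Age": [18, 25, 34],
    "Smokes": ["0.0", "?", "1.0"],
    "IUD": ["?", "?", "0.0"],
})

# Same replacement as above: every '?' becomes a NaN.
toy = toy.replace("?", np.nan)

# Number of missing values in each column after the replacement.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same isna().sum() on df_raw should show which columns had the most '?' entries.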

I also dropped some columns because they contained almost no data:

cols = ['STDs: Time since first diagnosis', 'STDs: Time since last diagnosis']
df_raw = df_raw.drop(cols, axis=1)
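One way to find columns like these is to look at the share of missing values in each column. The sketch below uses a made-up toy frame, and the 0.5 cut-off is an assumption for illustration, not a value I used on the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are made up.
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 3.0],
    "mostly_full": [1.0, 2.0, np.nan, 4.0],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy.isna().mean()

# Assumed cut-off: treat a column as too sparse if more than half is missing.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()

# Drop the sparse columns, as in the drop call above.
toy = toy.drop(sparse_cols, axis=1)
print(sparse_cols)
```

The right cut-off depends on the dataset and on how much the column matters for the model.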

I also changed the data type of the following columns from object to float:

cols = ['Number of sexual partners',
        'First sexual intercourse',
        'Num of pregnancies',
        'Smokes',
        'Smokes (years)',
        'Smokes (packs/year)',
        'Hormonal Contraceptives',
        'Hormonal Contraceptives (years)',
        'IUD',
        'IUD (years)',
        'STDs',
        'STDs (number)',
        'STDs:condylomatosis',
        'STDs:cervical condylomatosis',
        'STDs:vaginal condylomatosis',
        'STDs:vulvo-perineal condylomatosis',
        'STDs:syphilis',
        'STDs:pelvic inflammatory disease',
        'STDs:genital herpes',
        'STDs:molluscum contagiosum',
        'STDs:AIDS',
        'STDs:HIV',
        'STDs:Hepatitis B',
        'STDs:HPV']
df_raw[cols] = df_raw[cols].apply(pd.to_numeric)
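If a column still holds a stray non-numeric string, pd.to_numeric raises an error by default. A possible variation, sketched on made-up data, is to pass errors="coerce" so that such values become NaN instead. The toy frame and the 'unknown' entry are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with numbers stored as strings; values are made up.
toy = pd.DataFrame({
    "Smokes": ["0.0", "unknown", "1.0"],
    "IUD (years)": ["2.5", "0.0", np.nan],
})

cols = ["Smokes", "IUD (years)"]

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

Printing dtypes afterwards is a quick way to confirm that no object columns are left.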

I then computed the pairwise correlations between the remaining columns and visualised them with this function:

import matplotlib.pyplot as plt
from matplotlib import cm

def plot_corr(df, size=4):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot in inches'''

    corr = df.corr()  # pairwise Pearson correlation between columns
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr, cmap=cm.Greys)  # darker cells mean higher correlation values
    # Label both axes with the column names
    plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical')
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.show()


plot_corr(df_raw, 16)
[Picture: correlation matrix plot produced by plot_corr(df_raw, 16)]
This can often reveal unexpected correlations. For example, in this dataset the number of sexual partners correlates more strongly with smoking than with STDs.
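A comparison like this can also be read off numerically by sorting one column of the correlation matrix. The sketch below uses a made-up toy frame; its column names and values are assumptions and say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "partners": [1, 2, 3, 4, 5],
    "smokes":   [0, 0, 1, 1, 1],
    "stds":     [0, 1, 0, 0, 1],
})

corr = toy.corr()

# Correlation of every other column with 'partners', strongest first.
partner_corr = (
    corr["partners"]
    .drop("partners")          # remove the trivial self-correlation of 1.0
    .sort_values(ascending=False)
)
print(partner_corr)
```

On the real data the same pattern would use corr = df_raw.corr() and the 'Number of sexual partners' column.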