The dataset is available on Kaggle. The data contains many values recorded as '?'. These cannot be passed into a machine learning algorithm, so I replaced them with numpy's NaN:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm

df_raw = df_raw.replace('?', np.nan)
```

I also dropped two columns because they contained almost no data:

```python
cols = ['STDs: Time since first diagnosis', 'STDs: Time since last diagnosis']
df_raw = df_raw.drop(cols, axis=1)
```

I also changed the data type from object to float:

```python
cols = ['Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies',
        'Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives',
        'Hormonal Contraceptives (years)', 'IUD', 'IUD (years)', 'STDs', 'STDs (number)',
        'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis',
        'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
        'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV',
        'STDs:Hepatitis B', 'STDs:HPV']
df_raw[cols] = df_raw[cols].apply(pd.to_numeric)
```

I then checked for correlation between the remaining columns and visualised it using this code:

```python
def plot_corr(df, size=4):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot
    '''
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr, cmap=cm.Greys)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical')
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.show()

plot_corr(df_raw, 16)
```

This can make unusual correlations visible: for example, within this dataset there is a stronger correlation between number of sexual partners and smoking than between number of sexual partners and STDs.
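The same cleaning steps can be sketched on a tiny self-contained frame (the column names and values here are made up for illustration, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data: '?' marks missing values.
df = pd.DataFrame({'Smokes': ['1.0', '?', '0.0'],
                   'Age': ['34', '25', '?']})

df = df.replace('?', np.nan)   # '?' -> NaN so numeric conversion can cope
df = df.apply(pd.to_numeric)   # object dtype -> float64 (NaN forces float)

print(df.dtypes.tolist())      # [dtype('float64'), dtype('float64')]
print(int(df.isna().sum().sum()))  # 2 missing values
```

Once everything is numeric, the missing values can be imputed or dropped before modelling.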
Youtube channel: DeepLearning.TV
online book: http://neuralnetworksanddeeplearning.com/

The learning rate is how quickly a network abandons old beliefs for new ones.
If a child sees 10 examples of cats and all of them have orange fur, it will think that cats have orange fur and will look for orange fur when trying to identify a cat. Now it sees a black cat and its parents tell it that this is a cat (supervised learning). With a large learning rate, it will quickly realise that orange fur is not the most important feature of cats. With a small learning rate, it will think that this black cat is an outlier and that cats are still orange. If the learning rate is too high, it might start to think that all cats are black even though it has seen more orange cats than black ones. In general, you want a learning rate that is low enough that the network converges to something useful, but high enough that you don't have to spend years training it.

Some definitions:

one epoch = one forward pass and one backward pass of all the training examples

batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory you'll need.

number of iterations = number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (the forward and backward passes are not counted as two different passes).

Simply put, dropout refers to ignoring units (i.e. neurons), chosen at random, during the training phase. By "ignoring", I mean these units are not considered during a particular forward or backward pass. More technically, at each training stage individual nodes are either dropped out of the net with probability 1 − p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

I found the notes below here.
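The epoch/batch/iteration arithmetic, and the dropout idea, can be sketched in plain numpy (the dataset size, batch size and keep probability here are made up):

```python
import numpy as np

# Iterations per epoch: one pass uses batch_size examples.
n_examples, batch_size = 1000, 50
iterations_per_epoch = n_examples // batch_size
print(iterations_per_epoch)  # 20

# Dropout: each unit is kept with probability p, dropped with 1 - p.
rng = np.random.default_rng(0)
p = 0.8
activations = np.ones(10)
mask = rng.random(10) < p      # True = keep the unit this pass
dropped = activations * mask   # dropped units contribute nothing
print(dropped)
```

On average a fraction 1 − p of the units are silenced on each pass, which is what forces the reduced networks to learn redundant representations.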
LDA is a way to reduce 'dimensionality' while at the same time preserving as much of the class discrimination information as possible. Basically, LDA helps you find the 'boundaries' around clusters of classes. It projects your data points onto a line so that your clusters are as separated as possible, with each cluster a relatively close distance to a centroid.

What was that stuff about dimensionality? Let's say you have a group of data points in 2 dimensions, and you want to group them into 2 groups. LDA reduces the dimensionality of your set like so: K (groups) = 2, and 2 − 1 = 1. Why? Because "the K centroids lie in an at most K − 1 dimensional affine subspace". What is an affine subspace? It's a geometric concept, or *structure*, that says "I am going to generalise the affine properties of Euclidean space". What are those affine properties of Euclidean space? Basically, it's the fact that we can represent a point with 3 coordinates in a 3 dimensional space (with a nod toward the fact that there may be more than 3 dimensions we are ultimately dealing with). So we should be able to represent a point with 2 coordinates in 2 dimensional space, and a point with 1 coordinate in 1 dimensional space. LDA reduced our 2 dimensional problem down to one dimension, so now we can get down to the serious business of listening to the data. We have 2 groups, and 2 points in any dimension can be joined by a line. How many dimensions does a line have? 1! Now we are cooking with Crisco!

So we get a bunch of these data points, represented by their 2d representation (x, y), and use LDA to group them into either group 1 or group 2. What it's actually doing:

1. Calculates the mean vectors of the data in all dimensions.
2. Calculates the scatter of the whole group (to determine separability).
3. Calculates the scatter among members of the same class (to determine 'sameness'), using the whole-group scatter as a normaliser.
4. Magical grouping around K centroids.

According to Wikipedia: LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements.
However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label).[3] Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

Further reading: a python example; LDA can be used for dimensionality reduction; how to implement it in python from scratch; other examples of dimensionality reduction.

It can be important to identify outliers because they can be noise that you want to remove from your analysis, or they could be exactly what you are looking for, for example when you want to identify suspicious activity on a network or identify a medical problem.
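A minimal sketch of LDA as dimensionality reduction with scikit-learn (the two synthetic clusters below are made up; with K = 2 classes the projection is at most K − 1 = 1 dimensional, as discussed above):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters standing in for two classes.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)   # 2-D points projected onto a single axis
print(X_1d.shape)                # (100, 1)
```

The fitted axis is the direction that maximises between-class scatter relative to within-class scatter, which is exactly the separability/sameness trade-off in the step list above.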
Good example of using KNN to detect outliers here, also using K means. Algorithms like KNN which measure the distances between data points can use several different approaches, for example the Euclidean distance:

d(p, q) = sqrt((q1 − p1)² + (q2 − p2)² + … + (qn − pn)²)

where d(p, q) is the distance between two data points in a Euclidean space. The Manhattan distance, according to Wiktionary, is: the distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies". The Manhattan distance is part of taxicab geometry.
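Both distances are easy to compute by hand; for the points p = (0, 0) and q = (3, 4):

```python
import numpy as np

p = np.array([0, 0])
q = np.array([3, 4])

euclidean = np.sqrt(np.sum((q - p) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(q - p))          # grid / taxicab distance
print(euclidean)  # 5.0
print(manhattan)  # 7
```

The Manhattan distance is always at least as large as the Euclidean one, since it forbids diagonal shortcuts.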
The scikit-learn KNN classes allow different distance metrics to be used via the get_metric class method and the metric string identifier. Metrics available: euclidean, manhattan, chebyshev, minkowski, wminkowski, seuclidean, mahalanobis.

K means clustering is an unsupervised algorithm. Say you have some data, but the data is not labelled: this means you can't train a supervised model on it. In this situation an unsupervised algorithm can be useful. Example using K means with scikit-learn. First generate some data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.rcParams['figure.figsize'] = (16, 9)

# Creating a sample dataset with 4 clusters
X, y = make_blobs(n_samples=800, n_features=2, centers=4)
fig = plt.figure()
ax = plt.axes()
ax.scatter(X[:, 0], X[:, 1])
```

This produces a scatter plot of the unlabelled points. Then:

```python
kmeans = KMeans(n_clusters=4)
kmeans = kmeans.fit(X)
labels = kmeans.predict(X)
C = kmeans.cluster_centers_
fig = plt.figure()
ax = plt.axes()
ax.scatter(X[:, 0], X[:, 1], c=labels)
ax.scatter(C[:, 0], C[:, 1], marker='+', c='#000000', s=1000)
```

The algorithm has identified the four clusters; the centroids are marked with crosses.
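One practical question the example above skips is how to choose n_clusters when the data really is unlabelled. A common heuristic (not from the original post) is to compare the inertia, the within-cluster sum of squares, across candidate values of k and look for the "elbow" where it stops dropping sharply:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known true cluster count of 4.
X, _ = make_blobs(n_samples=300, n_features=2, centers=4, random_state=0)

# Inertia always decreases as k grows, but flattens past the true k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (2, 3, 4, 5)}
print(inertias)
```

On this data the drop from k = 3 to k = 4 is large and the drop from 4 to 5 is small, which points at 4 clusters.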
The code from day one can also be used with the Random Forest algorithm:
```python
clf_rf = ske.RandomForestClassifier(n_estimators=50)
clf_rf = clf_rf.fit(X_train, y_train)
print(clf_rf.score(X_test, y_test))
```

This produces a score of around 55%. Another algorithm is Gradient Boosting:

```python
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
clf_gb = clf_gb.fit(X_train, y_train)
print(clf_gb.score(X_test, y_test))
```

which generates an accuracy of around 59%.

The Random Forest is/was popular because it is easy to use and can be used for both regression and classification problems. The problem described on day 1 was a classification problem; regression problems involve predicting a numeric value based on past patterns. A forest in this case is an ensemble of decision trees: decision trees are notorious for overfitting, and generating many trees and then averaging over them can reduce the overfitting problem.

Hyperparameters are parameters that can be adjusted to improve accuracy. For example, with the Random Forest you can increase the n_estimators parameter, which controls the number of trees built. K-nearest neighbours did not perform well on this problem (average accuracy = 49%). KNN does not work well with categorical data, for example gender, and it also performs less well as the number of data features increases.

Dataset: fatal police shootings in the US, available on Kaggle. Is it possible to use the K-nearest neighbours algorithm to predict the race of someone shot by the police based on some other data?

What is K-nearest neighbours? It is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data. It divides the data into groups by measuring the distance between points; you tell it how many neighbours you want to include. There are different ways to measure the distance between points, for example Euclidean or Manhattan.
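The hyperparameter point above (n_estimators controls the number of trees in the forest) can be sketched like this; the dataset here is synthetic, not the day-1 data, so the scores only illustrate the mechanism:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up classification data standing in for the real problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for n in (10, 50, 200):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf = clf.fit(X_train, y_train)
    print(n, round(clf.score(X_test, y_test), 3))
```

More trees usually means a more stable average and less overfitting, at the cost of training time; past some point the score plateaus.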
The data in the dataset looks like this (a preview of the dataframe: columns include id, name, date, manner_of_death, armed, age, gender, race, city and state). The code:

```python
import pandas as pd
from sklearn import tree, preprocessing
import sklearn.ensemble as ske
from sklearn.model_selection import train_test_split

df = pd.read_csv("../input/PoliceKillingsUS.csv", encoding="windows-1252")
```

It is necessary to prepare the data:
```python
df = df.drop(['manner_of_death', 'id', 'name', 'date', 'city'], axis=1)
df = df.dropna()

le = preprocessing.LabelEncoder()
df.armed = le.fit_transform(df.armed)
df.gender = le.fit_transform(df.gender)
df.state = le.fit_transform(df.state)
df.signs_of_mental_illness = le.fit_transform(df.signs_of_mental_illness)
df.threat_level = le.fit_transform(df.threat_level)
df.flee = le.fit_transform(df.flee)
df.body_camera = le.fit_transform(df.body_camera)

X = df.drop('race', axis=1)
y = df.race

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=7)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
```

The output shows precision, recall and f1-score per race. Precision is the accuracy of the predictions for the different races:

precision = TP / (TP + FP)

where TP is true positives and FP is false positives. Recall is:

recall = TP / (TP + FN)

where FN is false negatives. The f1 score is a measure of a test's accuracy: a score of 1 is excellent, zero is as bad as it gets. Results for race W were reasonable, but races A, N and O scored 0% with an f1 score of zero. This might be due to the number of members of these classes: races A, N and O have around 30 to 50 examples each, whereas race W has over 1,200 examples.
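With made-up counts for a single class, the three metrics work out like this:

```python
# Hypothetical counts for one class (not taken from the real output).
TP, FP, FN = 80, 20, 40

precision = TP / (TP + FP)  # 80 / 100 = 0.8
recall = TP / (TP + FN)     # 80 / 120 ~= 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~= 0.727
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because f1 is the harmonic mean of precision and recall, a class with zero true positives (like races A, N and O above) gets precision, recall and f1 of zero regardless of how well the other classes do.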
