Discovering Data

Plotting geojson data

10/14/2018

The dataset contained data points with numerous labels which I wanted to plot. First I had to process the labels manually because they were not uniform, for example 'Bridge (road over rail)', 'Bridge (rail over road)', 'Rail Bridge', 'Railway Bridge' and so on. Once I had identified all the labels I was interested in, I collected them in lists and used these to create dataframes that were subsets of the main dataframe.
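Roughly, the subsetting looked like this; a pandas sketch in which the labels, column names and rows are hypothetical stand-ins for the real dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset
df = pd.DataFrame({
    "label": ["Bridge (road over rail)", "Rail Bridge", "Tunnel", "Railway Bridge"],
    "longitude": [-0.12, -1.25, -2.58, -3.18],
    "latitude": [51.50, 52.95, 51.45, 55.95],
})

# Collect the non-uniform labels of interest into a list...
bridge_labels = [
    "Bridge (road over rail)", "Bridge (rail over road)",
    "Rail Bridge", "Railway Bridge",
]

# ...then use the list to create a subset dataframe
bridges = df[df["label"].isin(bridge_labels)]
print(bridges)
```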
Then I plotted this data onto a blank outline map.
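The plotting itself can be sketched with matplotlib, assuming the subset dataframes carry longitude/latitude columns; the coordinates below are made up, and a real outline map layer (e.g. via geopandas) would be drawn underneath the points:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical subset dataframe with coordinate columns
bridges = pd.DataFrame({
    "longitude": [-0.12, -1.25, -2.58],
    "latitude": [51.50, 52.95, 51.45],
})

fig, ax = plt.subplots()
ax.scatter(bridges["longitude"], bridges["latitude"], s=10, label="bridges")
ax.set_aspect("equal")
ax.legend()
fig.savefig("bridges_map.png")
```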

Pandas - group by

6/9/2018

The dataset contains climate data; the data structure is:
year   month   temp max   temp min   air frost   rain (mm)   hours of sunshine
1948   1       6.6        1.3        8           170.8       40.1
I want to visualise the total hours of sunshine for the summer months (June, July, August) per year. The steps involved in preparing the data were:
  1. keep only months 6, 7 and 8, so create a subset of the full dataframe
  2. drop unnecessary columns - reduces computing power required 
  3. group by year and sum
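The three steps above can be sketched with pandas; the rows below are illustrative, not real readings:

```python
import pandas as pd

# Illustrative rows in the structure described above
df = pd.DataFrame({
    "year": [1948, 1948, 1948, 1948, 1949, 1949, 1949],
    "month": [5, 6, 7, 8, 6, 7, 8],
    "rain (mm)": [40.0, 30.2, 22.1, 41.0, 25.5, 18.3, 30.9],
    "hours of sunshine": [180.1, 200.4, 210.2, 190.7, 195.0, 220.8, 205.3],
})

# 1. keep only months 6, 7 and 8
summer = df[df["month"].isin([6, 7, 8])]

# 2. drop unnecessary columns
summer = summer[["year", "hours of sunshine"]]

# 3. group by year and sum
totals = summer.groupby("year").sum()
print(totals)
```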
The data can then be plotted.


Plotting geographic data on a globe

4/29/2018

The following code uses plotly to create a heat map on a globe. The data is country GDP, and the globe generated is interactive (below is only an image, so not interactive).


Stemming and Lemmatizing

6/30/2017

Sometimes when analysing natural language it is necessary to normalise the text. For example, converting everything to lower case ensures that 'Inform' and 'inform' are not treated as different words. Sometimes it might also be necessary to treat 'inform' and 'informed' as the same word. In those cases stemming and/or lemmatizing will be necessary, but these processes can give unexpected results.

In natural language processing a stemmer removes affixes from a word, leaving only the stem. NLTK comes with a number of built-in stemmers. A comparison of the standard stemmers:
word            Porter       Lancaster    Snowball
fasten          fasten       fast         fasten
classify        classifi     class        classifi
awaken          awaken       awak         awaken
differentiate   differenti   differenty   differenti
duplicate       duplic       duply        duplic
specialise      specialis    spec         specialis
lying           lie          lying        lie
The Porter and Snowball stemmers generate the same results; the Lancaster stemmer produces very different ones.
Lemmatization is similar to stemming except that it always generates a valid word rather than a stem. For example:
classify gives classify rather than classifi
differentiate gives differentiate rather than differenti
in the case of words like believes it gives belief, whereas the Porter stemmer gives believ
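The comparison in the table can be reproduced with NLTK's built-in stemmers (lemmatizing with NLTK's WordNetLemmatizer additionally requires downloading the wordnet corpus, so it is left out here):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["fasten", "classify", "awaken", "differentiate",
         "duplicate", "specialise", "lying"]

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

# print each word alongside its three stems
for word in words:
    print(word, porter.stem(word), lancaster.stem(word), snowball.stem(word))
```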

Bag of words - example

6/24/2017

We start with four sentences:

there is a dog in the garden
john enjoys watching movies
jane enjoys taking her dog for a walk
there are no dogs in this movie


Computers don't handle natural language well. One way to overcome this is to create a 'bag of words' for each sentence. This can be thought of as a list of integer values, where each value is the number of occurrences of a word in the sentence.

The first step is to tokenize the sentences, then to create a set from this list of tokens. We can then iterate through each word in the set and count the number of times that word occurs in each sentence. The number is appended to a list. This process generates one list, or vector, per sentence.
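A sketch of such a script; note that iterating over a set gives an arbitrary order, so the columns of the vectors may come out in a different order from the output that follows:

```python
sentences = [
    "there is a dog in the garden",
    "john enjoys watching movies",
    "jane enjoys taking her dog for a walk",
    "there are no dogs in this movie",
]

# Tokenize each sentence, then build the combined word set
tokenized = [s.split() for s in sentences]
word_set = set(w for tokens in tokenized for w in tokens)

# Count occurrences of each word of the set in each sentence
vectors = []
for tokens in tokenized:
    vectors.append([tokens.count(word) for word in word_set])

for v in vectors:
    print(v)
```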
The output:
[1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1]

The word set is:


set(['a', 'dogs', 'no', 'garden', 'her', 'watching', 'this', 'movie', 'is', 
'there', 'dog', 'for', 'walk', 'movies', 'are', 'in', 'jane', 'taking', 'the', 
'john', 'enjoys'])

so 'a' occurs once in the first sentence, 'dogs' does not occur, 'no' does not occur, and so on.
The bag-of-words approach can also be used in computer vision for image recognition/classification.
A number of packages offer bag-of-words functionality; the above script is not intended to replace them, it is just to demonstrate what is meant by 'bag of words'.

Compare document similarities

6/16/2017

The following script will take a document and compare it to a set of documents to find the document similarities.

comparison document
  • The default installation directory for Icecream Ebook Reader is Icecream Ebook Reader with weak folder permissions that grants EVERYONE modify privileges to the contents of the 

the documents
  • directory and it's subfolders. This allows an attacker opportunity for their own code execution under any other user running the application."
  • "An insecure file permissions vulnerability has been discovered in the official Icecream Ebook Reader v4.53 software. The vulnerability allows local attackers with system user accounts to 
  • elevate the access to higher system privileges."
  • "A persistent cross site scripting web vulnerability has been discovered in the official Zenario v7.6 content management system."
  • "While performing network level testing of various Google applications, we discovered that the content for the application did not use SSL."
The Script
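A sketch of one way to do this, using scikit-learn's TfidfVectorizer and cosine similarity; the documents below are shortened stand-ins, and the original script may have used a different library:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Shortened stand-in documents
documents = [
    "local attackers can elevate privileges via weak file permissions",
    "a persistent cross site scripting vulnerability in the cms",
    "the application content did not use ssl",
]
comparison = "weak folder permissions grant attackers elevated privileges"

# Fit TF-IDF over all documents, with the comparison document as the last row
vect = TfidfVectorizer()
tfidf = vect.fit_transform(documents + [comparison])

# Similarity of the comparison document to each of the others
scores = cosine_similarity(tfidf[-1], tfidf[:-1])
print(scores)
```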
output of the script:
[ 0.48266575  0.          0.01086096  0.13409612  0.17690402]
So the comparison document most closely matches the first document. The least similar is the second document, with a score of zero.


Get word frequency distributions using NLTK

6/5/2017

Get word frequency distributions
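A minimal sketch with NLTK's FreqDist; splitting on whitespace avoids the extra tokenizer download that nltk.word_tokenize needs, and the text itself is just an illustration:

```python
from nltk import FreqDist

text = "the cat sat on the mat and the dog sat by the door"
tokens = text.lower().split()

# FreqDist counts how often each token occurs
fdist = FreqDist(tokens)
print(fdist.most_common(3))  # the three most frequent words
```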


Cleaning data

4/26/2017


Data set used - mental health in IT, available on Kaggle

Data can be messy. The data set above, for example, had a 'Gender' field which contained many variations on male and female, such as: Male, male, M, m, man, F, f, Female and so on. The first thing I wanted to do was set all values to either male or female. One way to do this is to map every variant onto one of two canonical values.
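A sketch of the mapping with pandas; the variant lists and sample values here are illustrative and would need extending for the real data set:

```python
import pandas as pd

# Illustrative sample; the real data set has many more variants
df = pd.DataFrame({"Gender": ["Male", "male", "M", "m", "man", "F", "f", "Female"]})

male_terms = ["male", "m", "man"]
female_terms = ["female", "f", "woman"]

def normalise(value):
    value = str(value).strip().lower()
    if value in male_terms:
        return "male"
    if value in female_terms:
        return "female"
    return value  # leave anything unrecognised untouched

df["Gender"] = df["Gender"].apply(normalise)
print(df["Gender"].unique())
```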


Use Python to get historical FOREX data

3/23/2017

Data is available from sources such as FRED.
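A sketch of the idea with pandas. FRED serves each series as CSV; the URL pattern and the sample rows below are illustrative assumptions, with DEXUSEU standing for the US dollar / euro exchange-rate series:

```python
import io
import pandas as pd

# Assumed URL pattern; check the FRED site for the current download API
def fred_csv_url(series_id):
    return f"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"

# pd.read_csv(fred_csv_url("DEXUSEU")) would download the series directly;
# here a small hard-coded sample stands in for the download
sample = io.StringIO(
    "DATE,DEXUSEU\n"
    "2017-03-20,1.0752\n"
    "2017-03-21,1.0805\n"
    "2017-03-22,1.0797\n"
)
rates = pd.read_csv(sample, parse_dates=["DATE"], index_col="DATE")
print(rates.head())
```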


How to generate a word cloud with python

2/15/2017

Word clouds give only limited insight, but they are popular. In Python the 'wordcloud' package is a common choice. Below is an example of how to generate a word cloud. Once again I'm using Trump's Twitter account as a source of text data.


    This blog includes:

    Scripts, mainly in Python with a few in R, covering NLP, Pandas, Matplotlib and others. See the home page for links to some of the scripts. Also includes some explanations of basic data science terminology.

