In the last post I wrote a short Python script that gathers URLs from which I can scrape data. In this post I wanted to explore one way of extracting data from those addresses, using R's rvest package. There is no special reason for using R other than that I had never used rvest before and wanted to compare it to Python's BeautifulSoup.
library(rvest)

# Page to scrape
link <- "https://www.propertypal.com/2-woodland-manor-saintfield-road-south-belfast-belfast/353644"
htmlpage <- read_html(link)

# CSS selector found with SelectorGadget
rawdata <- html_nodes(htmlpage, ".bx--content, td")
datatxt <- html_text(rawdata)

# Pull the individual fields out of the text with regular expressions
price     <- regmatches(datatxt, regexpr("[0-9]+,[0-9]+", datatxt))
rates     <- regmatches(datatxt, regexpr("[0-9]*,?[0-9]+[.]?[0-9]*", datatxt))
style     <- regmatches(datatxt, regexpr("[A-Z][a-z]+\\s?[a-z]+", datatxt))
bedrooms  <- regmatches(datatxt, regexpr("[0-9]+", datatxt))
bathrooms <- regmatches(datatxt, regexpr("[0-9]+", datatxt))
heating   <- regmatches(datatxt, regexpr("[A-Z][a-z]+", datatxt))
The above script is an initial attempt with rvest to get data from one of the addresses. I found it relatively easy to use once I had installed SelectorGadget in my Chrome browser; I used this tool to get the selector '.bx--content, td' passed to the html_nodes() function. You could find the same information by reading the page source code, but SelectorGadget makes the process much quicker.
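Before writing any regular expressions it can be worth checking what the selector actually returns. A quick, purely illustrative inspection in the R console (not part of the original script) might be:

length(rawdata)   # number of nodes matched by the selector
head(datatxt)     # first few pieces of extracted text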
The output for the example page included, for instance, the style variable:
 "Detached house"
The extracted data matches the data on the test page. To gather all the required data, the script would need to be modified to loop through every address in the links.txt file and repeat the scraping process for each one.
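A minimal sketch of that loop, assuming the per-page extraction is wrapped in a helper function and that links.txt holds one address per line, might look like the following. The scrape_property() name, the fields it returns and the output file are all illustrative rather than part of the original script:

library(rvest)

# Illustrative helper: scrape one listing and return selected fields as a one-row data frame
scrape_property <- function(link) {
  htmlpage <- read_html(link)
  datatxt  <- html_text(html_nodes(htmlpage, ".bx--content, td"))

  data.frame(
    url      = link,
    price    = regmatches(datatxt, regexpr("[0-9]+,[0-9]+", datatxt))[1],
    style    = regmatches(datatxt, regexpr("[A-Z][a-z]+\\s?[a-z]+", datatxt))[1],
    bedrooms = regmatches(datatxt, regexpr("[0-9]+", datatxt))[1],
    stringsAsFactors = FALSE
  )
}

# Read one address per line from the file produced in the previous post
links <- readLines("links.txt")

# Scrape each address in turn, pausing between requests, and stack the results
results <- do.call(rbind, lapply(links, function(link) {
  Sys.sleep(1)
  scrape_property(link)
}))

write.csv(results, "properties.csv", row.names = FALSE)

In practice the regular expressions would probably need tightening per field, since a generic pattern like "[0-9]+" can match more than just the bedroom count, and fields that fail to match will come back as NA.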