Happy New Year!
Hope you all had a lovely time over the Christmas and New Year period. My holiday started nicely with a visit to the Taj Mahal but ended badly with a very unpleasant fever, hence the delay in this post. So we wrapped up 2018 with an awesome session by James Dales (I totally recommend watching the recording – https://skillsmatter.com/skillscasts/12865-how-to-create-a-custom-visual-for-power-bi) and numerous fantastic submissions for LondonPugDataViz. This time I asked all submitters to write a bit about their submission and the experience; this is what I received:
Submission 1:
Samy
samydoreau@gmail.com
https://www.linkedin.com/in/samydoreau/
My aim for this Data Challenge was to create a choropleth showing statistics (average, median, min, max, count etc.) around Price Paid from the dataset per postcode – people relate quite well to postcodes, and I thought it would be better to show the data as coloured areas rather than points / circles on a map.
Data preparation: I initially attempted to load the entire file into Power BI but encountered memory issues, as the file has upwards of 32 million rows. There isn’t in fact a straightforward way to handle such big files, so I decided to reduce the data size by loading the data into a local Spark cluster via one of the R interfaces to Spark: sparklyr. The package has the immense advantage of being able to easily import huge CSVs and create aggregations efficiently by using all of the machine’s cores; above all, it performs distributed computation and therefore scales very well to large datasets. This allowed me to obtain the mean / min / max / median price paid and the count of sales by postcode.
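Samy did this aggregation in R with sparklyr on a Spark cluster; as a rough illustration of the same per-postcode summary, here is the logic in plain Python (the column names and sample rows are made up for the sketch):

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical excerpt of the Price Paid CSV, reduced to two columns.
sample = io.StringIO(
    "price,postcode\n"
    "250000,SW1A 1AA\n"
    "450000,SW1A 1AA\n"
    "180000,M1 1AE\n"
)

# Group the sale prices by postcode.
prices_by_postcode = defaultdict(list)
for row in csv.DictReader(sample):
    prices_by_postcode[row["postcode"]].append(float(row["price"]))

# The same summary statistics computed in sparklyr: count, min, max,
# mean and median per postcode.
summary = {
    pc: {
        "count": len(ps),
        "min": min(ps),
        "max": max(ps),
        "mean": statistics.mean(ps),
        "median": statistics.median(ps),
    }
    for pc, ps in prices_by_postcode.items()
}
print(summary["SW1A 1AA"]["mean"])  # 350000.0
```

With sparklyr the same shape of query runs distributed across cores, which is what made the 32-million-row file tractable.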
Data visualisation: I obtained geocoding files for UK postcodes from the OpenDoor Logistics website and converted the data into GeoJSON via the mapshaper web utility to use with Power BI’s native shape map visual – however, I encountered three issues. First, the visual was really slow; second, it was unable to hold all the data points in the dataset; and finally, the polygons weren’t overlaid on an actual map, which made it hard to see where each polygon was actually located.
I then decided to use the Mapbox visual and uploaded the shp, dbf, fix, prj and shx files from OpenDoor Logistics to the Mapbox website to create a custom tileset (see the links below to Mapbox and a blog tutorial for detailed instructions). After a bit of tweaking of the visual configuration, I obtained a cool visualisation that solved all three problems above: the viz is very responsive and smooth, the polygons are overlaid on an OpenStreetMap base map, and all data points are shown.
The report is very simple and allows you to filter the data on the first two letters of the postcode and the price range – the map shows the postcode areas with a higher average price in red and those with a lower average price in white / yellow, following an automatic gradient. See the attached gif and picture as illustrations.
Submission 2:
https://www.linkedin.com/in/twfoster/
My submission focused on the top 10 sales in each month, as I felt these were the most interesting sales and there was a strong story around these exceptional properties. If I had been able to spend more time on it, I would have liked to bring in more data around the specific sales. If you want an embedded version, I have published it to the web below.
Submission 3:
Jude Moore
jude_moore@outlook.com
https://www.linkedin.com/in/judemcbride/
Goal >
To allow forecasting of property prices down to postcode level, based on changes in the average price paid for the same property type during the last calendar year, for a specific region of the UK at a time.
Tools list >
1. PC with Windows
2. Power BI Desktop latest version installed
3. Microsoft R Open installed
4. R studio installed
5. The data itself downloaded in CSV format
Steps
1. Downloaded the annual price paid data from gov.uk (Land Registry) for the year to date (January to September 2018)
2. Used the below guide to install and allow R visuals to be implemented in Power BI Desktop. https://docs.microsoft.com/en-us/power-bi/desktop-r-visuals
3. Started a new Power BI pbix report
4. Set up two parameters to accept the region and the maximum budget for the user’s average property purchase price, also allowing an unlimited budget or all regions to be accepted.
5. Loaded the data in while filtering price and location based on the parameters.
- Discovery – tested by ensuring the averages were believable based on my experience, and by viewing the data across all regions and introducing maps.
- Removed outliers to prepare for data science. Outliers are very extreme values, like £55m, which can skew averages and predictions. One was found.
- Introduced R forecasting using price and dateoftransfer, following this guide’s R script: http://www.datamic.net/blog/dynamic-forecasting-with-power-bi-and-r
- Tested the forecasting in R.
- Produced the final report using R forecasting.
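The outlier-removal and forecasting steps above can be sketched as follows. This is a plain-Python illustration with made-up monthly figures: the write-up does not say which outlier rule was used, so a standard 1.5 × IQR fence stands in for it, and a naive linear trend stands in for the R forecast package used in the report.

```python
import statistics

# Hypothetical monthly average prices (Jan–Sep 2018), with one extreme
# value standing in for the £55m outlier found in the data.
prices = [240_000, 245_000, 251_000, 249_000, 255_000, 55_000_000,
          258_000, 262_000, 266_000]

# Outlier removal: keep only values inside the 1.5 × IQR fences.
q1, _, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
cleaned = [p for p in prices if q1 - 1.5 * iqr <= p <= q3 + 1.5 * iqr]

# Naive one-step forecast: fit a least-squares line through the cleaned
# series and extrapolate one month ahead (the R forecast package does
# far more than this, e.g. proper time-series models).
n = len(cleaned)
xs = range(n)
x_mean = statistics.mean(xs)
y_mean = statistics.mean(cleaned)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, cleaned))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
print(round(intercept + slope * n))  # next month's projected average
```

Note how a single £55m record would otherwise drag the fitted trend far upwards, which is exactly why the outlier step comes before the forecast.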
Known Limitations:
* Data size limitations – data used by the R visual for plotting is limited to 150,000 rows. If more than 150,000 rows are selected, only the top 150,000 rows are used and a message is displayed on the image.
* Calculation time limitation – if an R visual calculation exceeds five minutes the execution times out, resulting in an error.
* Single-record limitation – the R forecasting package returns an error if the dataset has only one record, i.e. streets or postcodes with only one result brought back in the report.
Submission 4:
Jese Navaranjan
jese.navaranjan@altius.co.uk
https://www.linkedin.com/in/jese-navaranjan-1a836356/
I am hoping to see more submissions in the future. Do check out previous London Power BI Meetup events here – https://prathy.com/londonpowerbi/
Till next time,
Prathy 🙂