
Most open data providers supply data in a compressed format, especially when files are large, and that is exactly what I found with Companies House data. I wanted to explore Companies House data using Power BI, but I was not looking forward to downloading each file and unzipping it by hand before importing it into the Power BI model. With many files to work with, I wanted to automate things, or at the very least unzip all the files using Power BI. Quite quickly, I realised that Power BI does not currently support all compressed formats. Chris Webb explains here how to work with compressed files and why some formats are not yet supported.

However, Power BI supports R scripts, which means I can easily automate my steps. Until Power BI Desktop supports unzip functionality natively, we can use the new R Script source in Power BI Desktop. The Power BI support documentation explains well how to run R scripts in Power BI. Using an R script, I can download and unzip the Companies House files.

R Script as a source in Power BI Desktop:

Microsoft recommends writing and testing R scripts in RStudio (or any other R tool) before using them in Power BI. Power BI does not have a rich R interface, but I found it does give reasonable error information. Currently, the R Script source imports only data frames into Power BI.

Here is a simple R data frame in RStudio and in Power BI Desktop:


R script in RStudio


R script data frame in RStudio


R script in Power BI
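The screenshots are not reproduced here, but the script behind them was nothing more than building a small data frame; a minimal sketch along those lines (the values are purely illustrative) looks like this:

# a trivial data frame that Power BI can pick up from the R Script source
df <- data.frame(Company = c("A Ltd", "B Ltd", "C Ltd"),
                 Employees = c(10, 25, 40))
df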

Power BI Desktop uses R.Execute to execute R Script.

Download and unzip from the web using Power BI and R Script:

I wanted to download several files whose URLs differ only by an incremental number, so I used a for loop to generate the file number for each download. With a few lines of R, I can easily download and unzip all of these files.

The R script I used in RStudio:

for (i in 0:4)
{
  # build the source URL; the files are numbered part1_5.zip to part5_5.zip
  URLFull <- "http://download.companieshouse.gov.uk/BasicCompanyData-2016-03-01-part"
  i <- i + 1
  SourceURL <- paste(URLFull, i, "_5.zip", sep = "")

  # local path for the downloaded zip file
  Destination <- "\\\\Mac\\Home\\Desktop\\Working\\csv\\"
  DestinationURL <- paste(Destination, i, ".zip", sep = "")
  download.file(url = SourceURL, destfile = DestinationURL, method = "auto", quiet = FALSE)

  # extract the zip into a separate folder
  unzipfolder <- "\\\\Mac\\Home\\Desktop\\Working\\csv\\CSVFiles"
  unzip(zipfile = DestinationURL, exdir = unzipfolder)
}

This script does not return a data frame. If I use the same script in Power BI Desktop, it still downloads and unzips the files, but I cannot import any data into Power BI, nor can I use it as a step in a Power BI query.

We can either return an empty data frame from the R script and create a separate query in Power BI to read the unzipped files, or start from a blank query instead of the R Script source and do all the operations in a single query.

The same script as above in Power BI Desktop, returning an empty data frame:

for (i in 0:1)
{
  URLFull <- "http://download.companieshouse.gov.uk/BasicCompanyData-2016-03-01-part"
  i <- i + 1
  SourceURL <- paste(URLFull, i, "_5.zip", sep = "")

  Destination <- "\\\\Mac\\Home\\Desktop\\Working\\csv\\"
  DestinationURL <- paste(Destination, i, ".zip", sep = "")
  download.file(url = SourceURL, destfile = DestinationURL, method = "auto", quiet = FALSE)

  unzipfolder <- "\\\\Mac\\Home\\Desktop\\Working\\csv\\CSVFiles"
  unzip(zipfile = DestinationURL, exdir = unzipfolder)
}

# return an empty data frame so Power BI has something to import
df <- data.frame()

Alternatively, here is the same R script without returning any data frame; this executes the R script and then does the remaining operations in Power Query:

let
    Source = R.Execute("
for(i in 0:4)#(lf)#(lf)
{#(lf)#(lf)
URLFull=""http://download.companieshouse.gov.uk/BasicCompanyData-2016-03-01-part""#(lf)#(lf)
i=i+1#(lf)#(lf)
SourceURL=paste(URLFull,i,""_5.zip"",sep="""")#(lf)#(lf)
Destination=""\\\\Mac\\Home\\Desktop\\Working\\csv\\""#(lf)#(lf)
DestinationURL=paste(Destination,i,"".zip"",sep="""")#(lf)#(lf)
download.file(url=SourceURL, destfile=DestinationURL, method = ""auto"", quiet=FALSE)#(lf)#(lf)
unzipfolder=""\\\\Mac\\Home\\Desktop\\Working\\csv\\CSVFiles""#(lf)#(lf)
unzip(zipfile=DestinationURL,exdir=unzipfolder)#(lf)
}
"),
    df1 = Folder.Files("\\Mac\Home\Desktop\Working\csv\CSVFiles")
in
    df1

Downloaded R Files in Power BI

That is all; as usual, we can now do additional operations in Power BI. However, there are a few things to consider:

  • There is a 30-minute timeout on all R scripts; if you are downloading many large files, Power BI may raise a timeout error.
  • Every time you refresh Power BI, it re-runs the R script. A workaround is to run the R script in one query and do the other Power BI operations in another query, so you can disable refresh on the query that uses the R script.

Beyond these, there are a few other things I noticed:

  • I always had to import a data frame. Otherwise, Power BI would still execute the R script, but I would have nothing in Power BI; in fact, I had to cancel the query, which did not save the R script either.
  • If I cancel a Power Query refresh, it locks the files or folders that were being processed – I need to close and reopen Power BI to unlock them!
  • If I change the R script before it finishes executing or before the Power BI refresh completes, Power BI caches the script and the values I gave it, which means I need to close and reopen Power BI. For example, after cancelling the query and changing the for loop condition, it still used the condition I had given before I cancelled.

Overall, it is exciting to see R scripts in Power BI. They enable many things that Power BI does not currently support. And you never know what the incredible Power BI team will deliver; a feature that is missing today may appear the next minute 🙂

Till next time,

Prathy 🙂


After years of hesitation, I think I am finally ready to write my first blog post, or I should say I am finally inclined to publish a public blog post. I chose SSIS & R as the subject because I like SSIS, and ever since Microsoft released Power Query I have been curious about extracting Twitter data, but it was never easy for me. Now SQL Server R Services has made it quite easy.

One of my favorite features among all the new SQL Server 2016 features is the integration with R. I don't want to mislead anyone by saying I understand it very well, but I can proudly say I am trying to learn, and trust me, it is a good language to learn 🙂 . So be warned, you may find some poorly written R in this post. Microsoft has decided to integrate R wherever feasible, so there are different ways of using it, but I chose SSIS so I can demonstrate how we can use R from an SSIS package.

To achieve our purpose, first things first: we need to enable R Services on our SQL Server 2016 instance. An excellent blog post by Henk van der Valk explains how to install and configure SQL Server R Services: http://henkvandervalk.com/introducing-microsoft-sql-server-2016-r-services. Once SQL Server R Services is enabled and ready to run, you can execute the following script to list all pre-installed packages and check whether R Services is working.

EXECUTE sp_execute_external_script
  @language = N'R'
, @script = N'str(OutputDataSet);
packagematrix <- installed.packages();
NameOnly <- packagematrix[,1];
OutputDataSet <- as.data.frame(NameOnly);'
, @input_data_1 = N'SELECT 1 as col'
WITH RESULT SETS ((PackageName nvarchar(250)));

Once we are convinced that R Services is working, we are ready to proceed! The only challenge I faced was figuring out how to install new packages. I used RStudio to install packages and then moved them across to the SQL Server 2016 library folder. Obviously, that is not the exact recommended process; this blog post explains how to install additional packages: http://henkvandervalk.com/introducing-microsoft-sql-server-2016-r-services. As we saw in the code above, we are doing nothing but executing a stored procedure which has an R script in it. So, I will use an Execute SQL Task in SSIS to run the SQL Server R script and extract data using one of the available open source R libraries, simple 🙂 . I will be using the twitteR package; there are many other free libraries available, but I think this is a good way to start, just because it is simple. To extract data using the Twitter API, first we need to create a Twitter app.
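As an aside on that package installation step: a rough sketch of installing a package from RStudio straight into the R Services library folder might look like the following (the lib path is the typical default-instance location and is an assumption here, so adjust it for your setup):

# Hypothetical example: install twitteR into the SQL Server R Services library folder.
# The lib path is the usual default-instance location - change it to match your instance.
install.packages("twitteR",
                 lib = "C:/Program Files/Microsoft SQL Server/MSSQL13.MSSQLSERVER/R_SERVICES/library",
                 dependencies = TRUE)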

Twitter API:

As per the Twitter documentation, there are four different objects you will encounter in the API: Tweets, Users, Entities and Places, and each object has several properties. For the sake of simplicity, I am looking at Tweets only: I just want to search Twitter for a hashtag and extract a certain number of tweets to show how it works. I created a Twitter app called SQLServerRServices. Once the app is created, we can get the api_key, api_secret, access_token and access_token_secret which are required to access the Twitter API. The next step is writing the R script.

R Script:

I have to admit that writing R in SQL Server Management Studio is not fun. It has no debugging capabilities, which made troubleshooting a tough task. To overcome this, I used RStudio to write the R script until I was sure it was working; it was a lot easier than SSMS! Here is the script I used (I replaced my API keys with xxxx).

R script using twitteR:

/*Stored procedure to execute R Script in SQL Server 2016*/
execute sp_execute_external_script
@language = N'R'

/*R Script*/
, @script = N'

#Load the necessary packages
library(base64enc);
library(twitteR);
library(ROAuth);
library(bitops);
library(RCurl);
library(httr);

#OAUTH
oauth_endpoints("twitter");

#API keys from Twitter APP
api_key <- "xxxxxxxxxxxxxxx";
api_secret <- "xxxxxxxxxxxxxxxxxxx";
access_token <- "xxxxxxxxxxxxxxxxxxxxxxxx";
access_token_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxx";
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret);

#searchTwitter - is TwitterR function to search Twitter
PowerBITweets = searchTwitter("PowerBI", n=10);

#Create a data frame with extracted data
df <- do.call("rbind", lapply(PowerBITweets , as.data.frame))

#Choose the selected columns
frame<-as.data.frame(df[,c(11,5,3,12,15,16)])
', @input_data_1 = N''

--Send the created data frame as the output result set
, @output_data_1_name = N'frame'
with result sets (([Handle] NVARCHAR(max),[DateTime] DATETIME,[FavCount] int,[RetweetCount] int,[Longitude] NVARCHAR(100),[Latitude] NVARCHAR(100) ));

Script in Detail:

We can execute R in SQL Server 2016 by using the stored procedure sp_execute_external_script. @language represents the language we want to use in the script (it looks like we may be able to run Python or other languages shortly 🙂 ). The next step is to load all the required packages.

After loading all the packages, it is time for the OAuth handshake. With the API access tokens and keys from the Twitter app, we can authenticate against Twitter using the following bit of code.

oauth_endpoints("twitter");

#API keys from Twitter APP
api_key <- "xxxxxxxxxxxxxx";
api_secret <- "xxxxxxxxxxxxxx";
access_token <- "xxxxxxxxxxxxxx-xxxxxxxxxxxxxx";
access_token_secret <- "xxxxxxxxxxxxxx";
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret);

The next step is to search Twitter by using the searchTwitter function, which works on the supplied search string. There are several arguments we can supply; here I am providing two: the search string and n, the maximum number of tweets to return.

PowerBITweets = searchTwitter("PowerBI", n=10);

The searchTwitter function returns a list of status objects. In the next step, I use three functions to convert the returned list into a data frame.

#Create a data frame with the extracted data
df <- do.call("rbind", lapply(PowerBITweets, as.data.frame))

This data frame contains many columns, but I am interested in only a few of them, so in the next step I select the columns I want.

#Choose the selected columns
frame <- as.data.frame(df[,c(11,5,3,12,15,16)])

I am not supplying any input data to the query; hence my @input_data_1 value is blank:

', @input_data_1 = N''

For the output, I want to return the selected data, so I give the data frame name as @output_data_1_name, and in the result sets I specify the column names and data types.

, @output_data_1_name = N'frame'
with result sets (([Handle] NVARCHAR(max),[DateTime] DATETIME,[FavCount] int,[RetweetCount] int,[Longitude] NVARCHAR(100),[Latitude] NVARCHAR(100) ));

It is hard to know all the column names and data types beforehand; that is when RStudio comes in handy. We can work this out with SSMS too, but RStudio makes it easy.

Identifying data types:

One of the beauties of RStudio is that we can see the environment values and data. In the following image, on the right-hand side, you can see Data and Values. Under Data, df is the data frame created when we executed the code: df <- do.call("rbind", lapply(PowerBITweets, as.data.frame))

When we click on the little table icon next to df, it opens the data frame in a new window, where we can see the data and the column names.


To identify the data types, we can use class or typeof to get the list of column names and their respective data types. This worked for me for this example, but if you want to know more about it, there is a debate on Stack Overflow; beware, it is quite an old post, but I found it interesting: http://tinyurl.com/jj9c3e9
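For instance, a quick way to list every column of the data frame alongside its class (a small sketch, using the df built above) is:

# show each column name with its R class
sapply(df, class)

# or inspect the structure of the whole data frame, including types and sample values
str(df)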


Finally, we are sure that the R script is working with the Twitter API. The next step is to execute and schedule this extract using SSIS.

Creating SSIS Package:

I created an SSIS control flow template to retrieve Twitter data. This template has one Execute SQL Task with 4 task scoped variables:

TwitterHashTag – Search String

NoOfTweets – Number of tweets I want to obtain based on my search

Script – I want to make the R script dynamic using SSIS variables, so I use another variable called Script with an expression. I could use a SQL statement as an expression directly in the Execute SQL Task, but using a variable makes it easier to document, and I can use BIDS Helper to create and modify variables.


I am going to use this template in my SSIS package. Creating an SSIS control flow template deserves a blog post of its own; there are some excellent posts about control flow templates by Koen Verbeeck. Editing a control flow template is easy. Here I wanted to use the extracted data/result set as a source in the Data Flow Task, so I changed the scope of the TwitterExtract control flow template variable to package level.


The next step is to prepare the Data Flow Task. In the data flow task, I used a Script Component as a source.


There is no point in me repeating how to use a System.Object variable as the source for a data flow task when Tim Mitchell has explained it all. I created all the columns in the result set as output columns in my source component, using the same data types I used for the R script output result set.


And here is the C# script I used in the Script Component source:

public override void CreateNewOutputRows()
{
    // Fill a DataTable from the ADO recordset stored in the SSIS object variable
    OleDbDataAdapter oleDA = new OleDbDataAdapter();
    DataTable dt = new DataTable();
    oleDA.Fill(dt, Variables.RResultSet);

    // Add one output row per tweet returned by the R script
    foreach (DataRow dr in dt.Rows)
    {
        ScriptOutputBuffer.AddRow();
        ScriptOutputBuffer.Handle = dr["Handle"].ToString();
        ScriptOutputBuffer.DateTime = DateTime.Parse(dr["DateTime"].ToString());
        ScriptOutputBuffer.FavCount = int.Parse(dr["FavCount"].ToString());
        ScriptOutputBuffer.RetweetCount = int.Parse(dr["RetweetCount"].ToString());
        ScriptOutputBuffer.Latitude = dr["Latitude"].ToString();
        ScriptOutputBuffer.Longitude = dr["Longitude"].ToString();
    }
}

The script component generates the output buffer, which can then be connected to an OLE DB destination with a SQL Server connection. That's all; now we can schedule the package and look at the data. We don't need SSIS just to populate one simple SQL Server table, but I am using it as an example.

Here we go: now we can see who is tweeting about PowerBI and whose tweets are getting retweeted and favorited the most. I have not tested this by going to each profile and checking how many of their tweets were really PowerBI tweets; my only test is that I can see familiar Twitter handles which I regularly follow ;). For this example, I extracted only 1000 tweets, but we can also extract tweets based on tweet time and schedule the package to get all tweets between given times (see the sketch below). Using some other twitteR functions, we can also get user profiles and their profile images, which could lead to an excellent interactive Power BI report. I tried the publish to web feature with my WordPress blog, but for some reason WordPress decided not to insert the iframe; until I figure out the workaround, here is the URL where you can see my quick and dirty report – https://app.powerbi.com/view?r=eyJrIjoiYzg5YTUwN2EtYmMzZi00YzRlLTlkZTgtNzQzZjJiOTNiMjUyIiwidCI6ImExMjczNjNlLTcxNzAtNDg0Yi1iZDY4LTAyYjgwZjIzZDI4MiIsImMiOjh9
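As a rough sketch of that time-bounded idea, the twitteR searchTwitter function accepts since and until arguments (dates in YYYY-MM-DD format); the dates below are purely illustrative:

# Illustrative only: pull PowerBI tweets posted between two dates
library(twitteR)
TimedTweets <- searchTwitter("PowerBI", n = 1000,
                             since = "2016-04-01", until = "2016-04-02")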

PowerBI File

Have fun,

Prathy

 
