After years of hesitation, I think I am finally ready to write my first blog post, or I should say I am finally inclined to publish one. I chose SSIS & R as the subject because I like SSIS, and ever since Microsoft released Power Query I have been curious about extracting Twitter data, but it was never easy for me. Now SQL Server R Services makes it quite easy.
One of my favorite features among all the new SQL Server 2016 features is the integration with R. I don't want to mislead you by saying I understand it very well, but I can proudly say I am trying to learn it, and trust me, it is a good language to learn 🙂 . So be warned, you may find poorly written R script in this post. Microsoft has decided to integrate R wherever feasible, so there are different ways of using R, but I chose SSIS so I can demonstrate how we can use R in SSIS.
To achieve our purpose, first things first: we need to enable R Services in our SQL Server 2016 instance. An excellent blog post by Henk van der Valk explains how to install and configure SQL Server R Services: http://henkvandervalk.com/introducing-microsoft-sql-server-2016-r-services . Once SQL Server R Services is enabled and ready to run, you can execute the following script to list all the pre-installed packages and check whether R Services is working.
EXECUTE sp_execute_external_script
      @language = N'R'
    , @script = N'str(OutputDataSet);
        packagematrix <- installed.packages();
        NameOnly <- packagematrix[,1];
        OutputDataSet <- as.data.frame(NameOnly);'
    , @input_data_1 = N'SELECT 1 as col'
WITH RESULT SETS ((PackageName nvarchar(250)));
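As a quick aside: the script above only works once external scripts are enabled on the instance. If they are not, the feature can be switched on with sp_configure (both the SQL Server service and the SQL Server Launchpad service need a restart afterwards):

```sql
-- Enable external scripts on the instance; restart the SQL Server
-- and SQL Server Launchpad services for the change to take effect
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
```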
Once we are convinced that R Services is working, we are ready to proceed! The only challenge I faced was figuring out how to install new packages. I used RStudio to install packages and moved them across to the SQL Server 2016 folder. Obviously, that is not the exact process; this blog post explains how to install additional packages: http://henkvandervalk.com/introducing-microsoft-sql-server-2016-r-services. As we saw in the code above, we are doing nothing but executing a stored procedure which has R script in it. So, I'll use an Execute SQL Task in SSIS to run the SQL Server R script and extract data using one of the available open-source R libraries, simple 🙂 . I will be using the twitteR package; there are many other free libraries available, but I think it's a good way to start, just because it is simple. To extract data using the Twitter API, we first need to create a Twitter app.
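For reference, the supported approach (rather than copying package folders across from RStudio) is to run the R.exe that ships with the SQL Server instance as administrator and install straight into the instance library. A rough sketch, assuming a default instance — the paths here are my assumption and will differ per setup:

```r
# Run inside the R.exe that ships with the SQL Server instance (as administrator),
# e.g. ...\MSSQL13.MSSQLSERVER\R_SERVICES\bin\R.exe
# Install into the instance library so sp_execute_external_script can see the package
lib.SQL <- "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library"
install.packages("twitteR", lib = lib.SQL)
```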
Twitter API:
As per the Twitter documentation, there are four different objects that you'll encounter in the API: Tweets, Users, Entities and Places, and each object has several properties. For the sake of simplicity, I am looking at Tweets only. I just want to search Twitter based on a hashtag and extract a certain number of tweets, just to show how it works. I created a Twitter app called SQLServerRServices. Once the app is created, we can get the api_key, api_secret, access_token and access_token_secret, which are required to access the Twitter API. The next step is writing the R script.
R Script:
I have to admit, writing R in SQL Server Management Studio is not fun. It doesn't have debugging capabilities, which made troubleshooting a tough task. To overcome this, I used RStudio to write the R script until I was sure it was working. It was a lot easier compared to SSMS! Here is the script I used (I replaced my API keys with xxxx):
R Script using twitteR:
/* Stored procedure to execute R script in SQL Server 2016 */
execute sp_execute_external_script
      @language = N'R'
    /* R script */
    , @script = N'
# Load the necessary packages
library(base64enc);
library(twitteR);
library(ROAuth);
library(bitops);
library(RCurl);
library(httr);

# OAuth
oauth_endpoints("twitter");

# API keys from the Twitter app
api_key <- "xxxxxxxxxxxxxxx";
api_secret <- "xxxxxxxxxxxxxxxxxxx";
access_token <- "xxxxxxxxxxxxxxxxxxxxxxxx";
access_token_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxx";
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret);

# searchTwitter is the twitteR function to search Twitter
PowerBITweets = searchTwitter("PowerBI", n=10);

# Create a data frame from the extracted data
df <- do.call("rbind", lapply(PowerBITweets, as.data.frame))

# Choose the selected columns
frame <- as.data.frame(df[, c(11, 5, 3, 12, 15, 16)])
'
    , @input_data_1 = N''
    -- Send the created data frame as the output result set
    , @output_data_1_name = N'frame'
with result sets (([Handle] NVARCHAR(max), [DateTime] DATETIME, [FavCount] int, [RetweetCount] int, [Longitude] NVARCHAR(100), [Latitude] NVARCHAR(100)));
Script in Detail:
We can execute R in SQL Server 2016 by using the stored procedure called sp_execute_external_script. @language specifies the language we want to use in the script (it looks like we will probably be able to run Python or other languages shortly 🙂 ). The next step is to load all the required packages.
After loading all the packages, it's time for the OAuth handshake. With the supplied API keys and access tokens, we can authenticate with Twitter using the following bit of code.
oauth_endpoints("twitter");

# API keys from the Twitter app
api_key <- "xxxxxxxxxxxxxx";
api_secret <- "xxxxxxxxxxxxxx";
access_token <- "xxxxxxxxxxxxxx-xxxxxxxxxxxxxx";
access_token_secret <- "xxxxxxxxxxxxxx";

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret);
The next step is to search Twitter using the searchTwitter function, which works on a supplied search string. There are several arguments we can supply; here I am providing two: "#PowerBI", the search string, and n, the maximum number of tweets.
PowerBITweets = searchTwitter("#PowerBI", n=10);
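searchTwitter also accepts date-bounding arguments, which is handy if you later want to schedule extracts for a time window. A small sketch — the dates here are placeholders, and the Search API only reaches back a limited number of days:

```r
# Sketch: date-bounded search; twitteR's searchTwitter accepts
# 'since' and 'until' in "YYYY-MM-DD" format (dates are placeholders)
PowerBITweets = searchTwitter("#PowerBI", n = 100,
                              since = "2016-05-01", until = "2016-05-02");
```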
The searchTwitter function returns a list of status objects. In the next step, I use three functions to convert the returned list into a data frame.
# Create a data frame from the extracted data
df <- do.call("rbind", lapply(PowerBITweets, as.data.frame))
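To see what this lapply/do.call pattern does, here is a tiny standalone illustration with toy data (not Twitter status objects):

```r
# lapply converts each list element to a one-row data frame,
# and do.call("rbind", ...) stacks those rows into a single data frame
items <- list(list(name = "a", n = 1), list(name = "b", n = 2))
df <- do.call("rbind", lapply(items, as.data.frame))
# df now has two rows and the columns 'name' and 'n'
```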
This data frame contains many columns, but I am interested in only a few, so I select those in the next step.
# Choose the selected columns
frame <- as.data.frame(df[, c(11, 5, 3, 12, 15, 16)])
I am not supplying any input data to the query, hence my @input_data_1 value is blank:
', @input_data_1 = N''
For the output, I want to return the selected data, so I gave the data frame name, and in the result sets I specified the column names and data types.
, @output_data_1_name = N'frame'
with result sets (([Handle] NVARCHAR(max), [DateTime] DATETIME, [FavCount] int, [RetweetCount] int, [Longitude] NVARCHAR(100), [Latitude] NVARCHAR(100)));
It's impossible to know all the column names and data types beforehand; that's where RStudio comes in handy. We can do this with SSMS too, but RStudio makes it easy.
Identifying data types:
One of the beauties of RStudio is that we can see the environment values and data. In the following image, on the right side, you can see Data and Values. Under Data, df is the data frame created when we executed the code: df <- do.call("rbind", lapply(PowerBITweets, as.data.frame))
When we click on the little table icon next to df, it opens the data frame in a new window, where we can see the data and the column names.
To identify the data types, we can use class() or typeof() to get the list of column names and their respective data types. This worked for me for this example, but if you want to know more about it, there is a debate on Stack Overflow. Beware, it's quite an old post, but I found it interesting: http://tinyurl.com/jj9c3e9
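For example, to list every column's class in one go, a standard R idiom (not specific to this script) is:

```r
# Returns a named character vector mapping column name -> class,
# e.g. telling you which columns are character, numeric or POSIXct
sapply(df, class)

# A single column can also be inspected directly:
class(df[, 5])
```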
Finally, we are sure that the R script works with the Twitter API. The next step is to execute and schedule this extract using SSIS.
Creating SSIS Package:
I created an SSIS control flow template to retrieve Twitter data. This template has one Execute SQL Task with four task-scoped variables:
TwitterHashTag – Search String
NoOfTweets – Number of tweets I want to obtain based on my search
Script – I want to make the R script dynamic using SSIS variables, so I am using another variable called Script with an expression. I could use a SQL statement as an expression in the SSIS Execute SQL Task directly, but using variables makes it easier to document, and I can use BIDS Helper to create and modify variables.
RResultSet – an Object variable that holds the full result set returned by the Execute SQL Task, so the extracted data can be consumed later in the Data Flow Task
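To give an idea of what the expression on the Script variable looks like, here is a rough, abbreviated sketch in the SSIS expression language (only the searchTwitter line of the R script is shown; the rest of the statement stays exactly as in the script above):

```
"execute sp_execute_external_script @language = N'R', @script = N' ... PowerBITweets = searchTwitter(\""
    + @[User::TwitterHashTag]
    + "\", n=" + (DT_WSTR, 10) @[User::NoOfTweets]
    + "); ... '"
```

The double quotes inside the R code are escaped with \", and the numeric NoOfTweets variable is cast to a string with DT_WSTR before concatenation.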
I am going to use this template in my SSIS package. Creating an SSIS control flow template deserves a blog post of its own; there are some excellent posts about control flow templates by Koen Verbeeck. Editing a control flow template is easy. Here I wanted to use the extracted data/result set in the Data Flow Task as a source, so I changed the scope of the TwitterExtract control flow template variable to package level.
The next step is to prepare the Data Flow Task. In the Data Flow Task, I used a Script Component as a source.
There is no point in me repeating how to use a System.Object variable as the source for a Data Flow Task when Tim Mitchell has explained it all. I created all the columns in the result set as output columns in my source component, using the same data types I used for the R script output parameter.
And here is the C# script I used in the Script Component source:
public override void CreateNewOutputRows()
{
    OleDbDataAdapter oleDA = new OleDbDataAdapter();
    DataTable dt = new DataTable();

    // Fill a DataTable from the Object variable holding the full result set
    oleDA.Fill(dt, Variables.RResultSet);

    foreach (DataRow dr in dt.Rows)
    {
        ScriptOutputBuffer.AddRow();
        ScriptOutputBuffer.Handle = dr["Handle"].ToString();
        ScriptOutputBuffer.DateTime = DateTime.Parse(dr["DateTime"].ToString());
        ScriptOutputBuffer.FavCount = int.Parse(dr["FavCount"].ToString());
        ScriptOutputBuffer.RetweetCount = int.Parse(dr["RetweetCount"].ToString());
        ScriptOutputBuffer.Latitude = dr["Latitude"].ToString();
        ScriptOutputBuffer.Longitude = dr["Longitude"].ToString();
    }
}
The Script Component generates the output buffer, which can be connected to an OLE DB destination with a SQL Server connection. That's all; now we can schedule this and see the data. We don't have to use SSIS just to populate one simple SQL Server table, but I am using this as an example.
Here we go: now we can see who is tweeting about Power BI and whose tweets got retweeted and favorited the most. I have not tested this by going to each profile to see how many of their tweets were Power BI tweets; my only sanity check is that I can see familiar Twitter handles which I regularly follow ;). For this example, I extracted only 1000 tweets, but we can extract tweets based on tweet time and schedule the package to get all tweets between given times. Using some of the other twitteR functions, we can also get user profiles and profile images, which could lead to an excellent interactive Power BI report. I tried the publish-to-web feature with my WordPress blog, but for some reason WordPress decided not to insert the iframe; until I figure out the workaround, here is the URL where you can see my quick and dirty report – https://app.powerbi.com/view?r=eyJrIjoiYzg5YTUwN2EtYmMzZi00YzRlLTlkZTgtNzQzZjJiOTNiMjUyIiwidCI6ImExMjczNjNlLTcxNzAtNDg0Yi1iZDY4LTAyYjgwZjIzZDI4MiIsImMiOjh9
Have fun,
Prathy