
Like many, I have been playing with Fabric, and many of my clients are excited about it and want to know more. As a solution architect in the consulting world, one of the most common questions I get asked is: “When will certain features be available? Where are they on the roadmap?” That’s what sparked the idea of scraping the Microsoft Fabric Roadmap and creating this Power BI report. It is based on a Direct Lake connection, so it has been a bit temperamental.

So, how did I do it? If you are not interested in the whole story, here is the Python code you can run to get the roadmap. If you are interested in my process, carry on reading 🙂

Data Integration

I am not proficient in Python, but I have done web scraping with Power BI many times, so I know my way around web scraping a bit. I started with ChatGPT, putting together a rough skeleton in Python.

url = "https://learn.microsoft.com/en-us/fabric/release-plan"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='main')
# print(results.prettify())

The above script gives me the entire page, and I am obviously not interested in all of it. I went to the browser developer tools and worked out which snippets I was interested in.
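For example, once the page is parsed, a quick peek at the links under the main content shows what there is to scrape. This is just an exploratory snippet, not part of the final script:

# preview the first few links under the main content area
for a in results.find_all('a', href=True)[:10]:
    print(a.text.strip(), '->', a['href'])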

It looked like I needed all the hrefs, and then I could filter down to just the ones I was interested in.

# get all hrefs under the main content
hrefs = results.find_all('a', href=True)

# keep each href's text and link in a list
href_list = []
for href in hrefs:
    href_list.append([href.text, href['href']])

# turn the list into a DataFrame with named columns
href_list = pd.DataFrame(href_list, columns=['name', 'url'])

# drop absolute links (keep only the relative release-plan pages)
href_list = href_list[~href_list['url'].str.contains("https")]

# add a new column with the full url
href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']
href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'

I was able to extract the hrefs from the web page using the above code snippet. It was a bit messy, but I managed to clean it up and get the list I wanted. Next, I had to iterate over the links and scrape the data from each page. The pages have different formats and layouts, so I needed to adjust my code accordingly. I asked ChatGPT for some help, and it suggested using pd.read_html to read the tables from the HTML pages.

table_list = []
# loop through href_list; if a page has tables, read them and append to table_list
for index, row in href_list.iterrows():
    try:
        table = pd.read_html(row['full_url'], extract_links="all")
    except ValueError:
        # skip pages that have no tables
        continue
    table[0]['product'] = row['name']
    table[0]['FullURL'] = row['full_url']
    table_list.append(table[0])

Pandas_df = pd.concat(table_list, ignore_index=True)
Pandas_df.columns = ['Feature', 'Estimated_release_timeline', 'Product','FullURL', 'Enabled_for', 'Public_Preview','GA']
Pandas_df = Pandas_df.astype({'Feature': 'string', 'Estimated_release_timeline': 'string', 'Product': 'string', 'Enabled_for': 'string', 'Public_Preview': 'string', 'GA': 'string'})

spark_df = spark.createDataFrame(Pandas_df)

# display(spark_df)

I decided to make some improvements to the data. I added extra columns to identify the products more easily, converted the data types to string for consistency, and gave the columns meaningful names, avoiding spaces that could cause problems later.

Next, I applied some more transformations with Spark. The Spark DataFrame turned out to be much faster than pandas for this, so I switched over to Spark.

Script for all transformations
from pyspark.sql.functions import split, col, regexp_replace, when, concat, lit
spark_df = spark_df.withColumn('Feature_Name', split(col('Feature'), '\',')[0])
spark_df = spark_df.withColumn('Feature_URL', split(col('Feature'), '\',')[1])
# display(spark_df)

spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\(\'', ''))
spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'\)', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', ' ', ''))
# display(spark_df)

# spark_df = spark_df.withColumn('Feature_URL', concat(col('FullURL'), col('Feature_URL')))
spark_df = spark_df.withColumn('Feature_URL', when(col('Product') == 'Power BI'
    , concat(lit('https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/'), col('Feature_URL')))
    .otherwise(concat(col('FullURL'), col('Feature_URL'))))

spark_df = spark_df.withColumn('Estimated_release_timeline', split(col('Estimated_release_timeline'), ',')[0])
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\(\'', ''))
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\'', ''))
spark_df = spark_df.drop('Feature')
# display(spark_df)
spark_df = spark_df.withColumn('Enabled_for', split(col('Enabled_for'), '\',')[0])
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\(\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\)', ''))

#split GA column into two columns
spark_df = spark_df.withColumn('GA', split(col('GA'), ',')[0])
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\(\'', ''))
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\'', ''))

#split Public_Preview column into two columns
spark_df = spark_df.withColumn('Public_Preview', split(col('Public_Preview'), ',')[0])
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\(\'', ''))
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\'', ''))

# spark_df.show()  
# display(spark_df)

Finally, I wrote all the data into a Spark Delta table.

Table_Name = 'Fabric_Release_Plan'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')

Then I went and scraped the description of each item as well, to display in the tooltips of my report. This was similar to the rest of the code, but more fun and challenging.

Script to get Descriptions

url = "https://learn.microsoft.com/en-us/fabric/release-plan"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='main')

# get all hrefs
hrefs = results.find_all('a', href=True)

# keep all href name and link in a list
href_list = []
for href in hrefs:
    href_list.append([href.text, href['href']])

# add a column name to the list with full url
href_list = pd.DataFrame(href_list, columns=['name', 'url'])

# remove rows containing https  
href_list = href_list[~href_list['url'].str.contains("https")]

# add a new column with full url    
href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']
href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'

Desc_List = []
# loop through href_list and collect the feature descriptions from each page
for index, row in href_list.iterrows():
    
    # also get all descriptions
    url = row['full_url']
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id='main')

    h3 = results.find_all('h3')
    h3_ids = []
    for h in h3:
        h3_ids.append(h.get('id'))
        h3_ids.append(h.get_text())
        h3_ids.append(h.find_next('p').get_text())
        h3_ids.append(h.find_next('p').find_next('p').get_text())
        # print(h3_ids)
    # chunk the flat list into rows of [id, name, timeline, full_description]
    h3_ids = [h3_ids[i:i+4] for i in range(0, len(h3_ids), 4)]
    h3_ids = pd.DataFrame(h3_ids, columns=['id', 'name', 'timeline', 'full_description'])
    h3_ids['url'] = url
    # add product name
    h3_ids['product'] = row['name']
    # add full url with concat of url and id
    h3_ids['full_url'] = url + '#' + h3_ids['id']
    Desc_List.append(h3_ids)

# concat all desc_list dataframes
Desc_List = pd.concat(Desc_List)
# display(Desc_List)

spark_df = spark.createDataFrame(Desc_List)
# spark_df.show()

Table_Name = 'Fabric_Release_Plan_with_desc'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')

Next, I ran a simple script to capture the notebook run time, so I can display the last refresh date and time on the report.

# get the notebook run time and save it in a table
from datetime import datetime
import pandas as pd

now = datetime.now()
current_time = now.strftime("%Y-%m-%d %H:%M:%S")

df = pd.DataFrame([current_time], columns=['Time'])
spark_df=spark.createDataFrame(df)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable("Notebook_runtime")

Modelling

Modelling was pretty easy. I created the relationships using the model view of the SQL endpoint.
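Before wiring up the relationship, a quick check back in the notebook confirms the keys line up. This is just a sketch, assuming the two tables relate on the feature URL (Feature_URL on the release plan table and full_url on the descriptions table):

# sanity check: how many release plan features have no matching description?
spark.sql("""
    SELECT COUNT(*) AS features_without_description
    FROM Fabric_Release_Plan AS rp
    LEFT JOIN Fabric_Release_Plan_with_desc AS d
        ON rp.Feature_URL = d.full_url
    WHERE d.full_url IS NULL
""").show()

If that count is not zero, the tooltip table is missing descriptions for some features.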

Then I chose which tables to show in my default semantic model.

This lineage view explains the flow.

Reporting

I created a simple Power BI report on the default semantic model using the Power BI service. I published it to the web, and you can view the report here – https://app.powerbi.com/view?r=eyJrIjoiOTYxMmM3NjYtMGIxZS00MDFmLTljOWMtYTYyMzBmYjA3ZGJjIiwidCI6IjM4YmU3MjU5LTMyMGQtNDZkYy04YzNiLWNkZDA5OTM1NGExMyIsImMiOjh9

Until next time,
Prathy 🙂


I had been following the hints and clues on Twitter, where some Power BI team members were changing their job titles to Synapse. I was curious and hopeful about what they were working on, but I also didn’t want Power BI to lose momentum. When I saw the leadership changes, I felt reassured that whatever they did would be amazing.

Fast forward to Build, where Microsoft announced the long-awaited Microsoft Fabric. What is it? It’s a SaaS data fabric platform with a common data lake called OneLake: a one-stop shop for all the analytics needs of every enterprise. With Fabric, there is no need to stitch together services from multiple vendors, and all data is saved as Delta Parquet files. No more vendor lock-in or proprietary data formats!

The data fabric concept is not new; as per Datanami, Noel Yuhanna has been at the forefront of Data Fabric. In his words, “A data fabric is essentially an abstraction layer that links a disparate collection of data tools that address key pain points in big data projects. A data fabric solution should deliver capabilities in the areas of data access, discovery, transformation, integration, security, governance, lineage, and orchestration. It should also provide self-service capabilities, as well as some graph capabilities to identify connected data.

By providing a way to bring these data management capabilities to bear on data spanning all these silos, a data fabric can help alleviate core data management challenges holding companies back from higher-level data use cases, including advanced analytics and AI in the cloud.”

Microsoft Fabric takes this concept to the next level by integrating it with Power BI, Microsoft Office and the Fluent UI framework. The integration with Office and Power BI shortens the time and the number of steps from data to insights, and Fluent UI provides a consistent, beautiful user experience across all artefacts. I think Microsoft Fabric is the ultimate solution for anyone who wants to leverage the power of data in the cloud. Whether you prefer a centralized or distributed approach, Microsoft Fabric can handle it. You don’t have to change your existing skills, and you can use the tool or language you are comfortable with. There is plenty of information on MS Docs, and I am sure many people will be talking about it. I want to give my take on Microsoft Fabric as an ex-Microsoft SQL Server BI developer.

You see, I have a long history with Microsoft SQL Server and BI. I used to work with SSIS, SSRS and SSAS to create amazing data solutions. I loved how I could integrate, transform and analyze data using these tools. I felt like I had everything I needed in one place: SQL Server. Ever since, no matter what tool I work with, I still somehow try to connect the dots back to MSBI. Maybe it helps me understand the topic better, or maybe it stops me from being intimidated by a changing world.

Fabric sounds like a game changer, but as an ex-MS SQL BI developer, I find it familiar. In my older days, I accumulated data in SQL Server databases. I built ETL or ELT using SQL stored procedures and my first love, SSIS. Most of my data integration tasks were done with SSIS, sometimes with a few bits of C# and hardcore SQL procedures. I used SSAS to create dimensional models, with all calculations traditionally written in MDX and later DAX.

I created beautiful SSRS reports with cool features like data-driven parameters, conditional formatting and lookup sets. I kept my end users on top of their KPIs using SSRS subscriptions. And not only end users: the hardcore IT guys knew when the server was hitting its limits or jobs were failing, and there were even occasional happy-birthday emails, all driven by SQL Server Agent, sometimes with a bit of help from the Windows Task Scheduler. Whatever I did, I moved, transformed and analysed all the data on my SQL Server. There were occasional intruders like Oracle, MySQL and the rest, but I was faithful to SQL Server.

But times have changed, and so has the data landscape. Now we have more data sources, more data types, and more data challenges. We need to be able to handle big data, streaming data, unstructured data and more. We need to be able to scale, secure and optimize our data pipelines. We need to be able to deliver insights faster and easier.

That’s where Microsoft Fabric comes in. It lets me connect all kinds of data sources and destinations, whether I have a nicely structured existing data platform or just a bunch of files on a server. Where I used to use linked servers, I would now use Shortcuts in Fabric; where I used to access data files through a mapped network drive on the server, I would now mount my storage on Fabric.

For data integration, I used a combination of SSIS and SQL Server functionality like stored procedures and functions. Now, on Fabric, I can use Data Pipelines and Dataflows Gen2 to create visual, intuitive workflows for my data transformations, or I can use Notebooks to write code in whichever language I want. I think I will mostly be using Pipelines and Dataflows Gen2; I am a more visual person.

I haven’t played with it enough yet, but it looks like I can create multiple dataflows and call them from Data Factory pipelines. That would be really useful for scenarios like a For Each loop.

Next come the methodologies, or frameworks, of data storage. The proven approach back then was ETL: extract raw data from the operational data sources, transform and store it in a transactional layer, then load it into a data warehouse in a presentable format for further analysis. I think the old ETL is now the medallion lakehouse architecture. And where we used to create database schemas or data marts for each department or domain in larger data warehouses, that is now the domain-based data mesh architecture.
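To make that mapping concrete, here is a minimal sketch of how the old ETL flow translates into medallion layers in a Fabric notebook; the file path, table names and columns are purely illustrative, not from a real project.

from pyspark.sql.functions import col

# Bronze: land the raw extract as-is (the old "extract" step)
raw_df = spark.read.option("header", "true").csv("Files/raw/sales.csv")
raw_df.write.format("delta").mode("overwrite").saveAsTable("bronze_sales")

# Silver: clean and conform - roughly the old transactional/staging layer
silver_df = (spark.table("bronze_sales")
             .dropDuplicates()
             .withColumn("Amount", col("Amount").cast("double"))
             .filter(col("Amount").isNotNull()))
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_sales")

# Gold: shape for analysis - roughly the old data warehouse load
gold_df = silver_df.groupBy("Region").sum("Amount")
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold_sales_by_region")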

Microsoft Fabric supports any data storage framework or architecture I need. I can use Workspaces and Domains to organize my data into different stages and domains. I can follow the Medallion Lakehouse approach or create custom data marts for each department or domain.

Next, data analytics. In the past, we had three options for creating an analytics layer: SQL views, SQL DW or an SSAS model. The choice depended on many factors, such as urgency, the use case, the budget and the developer’s skills. Fabric gives us more flexibility and power: we can use gold-level lakehouse tables for fast and reliable data access, build a Warehouse with custom measures and calculations, or create a Power BI dataset for interactive visualization.

In my opinion, the priorities would be time to insights and performance, so making the most of Delta Lake and the one-copy approach would be top of my list. But there are always best practices versus client scenarios, and it’s not always easy to follow a recommended framework.

But that’s not all. Analytics is not just about making nice reports with actionable insights; we are in the age of AI, and we want those insights to reach end users in the most effective way possible. In the past, I used SQL Server Agent and the Windows Task Scheduler to automate things, SSRS and Excel for further analysis, and I relied on support teams for ad hoc report requests. In the modern era, a handful of export options and self-service BI is not enough. We need scalable platforms with no proprietary lock-in, so the final output can be used for further analysis. Gold-level data is not the end. We may want to analyse the final dataset created by the self-service end user, or add the pixie dust of OpenAI to the final data frame created by a citizen developer. That’s where Fabric is perfect: I am no longer limited to the final Load stage of my ETL framework or the gold layer of the medallion lakehouse architecture. We are emerging into a world of platinum and diamond layers.

There are so many elements of Microsoft Fabric I don’t know yet, including the whole data science side, which I don’t understand yet. But as an ex-MS SQL BI developer, I find it familiar. It reminds me of how I used to work with SQL Server and the BI tools; it’s like a modern version of them, a new way of doing old things. It’s time to bring data platforms and analytics platforms back together. Together in the cloud 🙂

What do you think? Are you excited about Microsoft Fabric? Do you have any questions or comments? Let me know in the comments section below. 

Until next time, Prathy 🙂
