
#MicrosoftFabric Road Map and Web Scraping with Fabric

by Prathy Kamasani

Like many, I am playing with Fabric, and many of my clients are excited about it too and want to know more. As a solution architect in the consulting world, one of the most common questions I get asked is: “When will certain features be available? Where are they on the roadmap?” That’s what sparked the idea of scraping the Microsoft Fabric Roadmap and creating this Power BI report. It is based on a Direct Lake connection, so it has been a bit temperamental.

So, how did I do it? If you are not interested in the whole story, here is the Python code you can run to get the roadmap. If you are interested in my process, carry on reading 🙂

Data Integration

I am not proficient in Python, but I have done web scraping with Power BI many times, so I know my way around web scraping a bit. However, I started with ChatGPT to put together a rough skeleton in Python.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://learn.microsoft.com/en-us/fabric/release-plan"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='main')
# print(results.prettify())

The above script gives me the entire page, and I am obviously not interested in all of it. I went to the browser developer tools and figured out which snippets I was interested in.
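A quick way to see what is inside that main content area is to print a few of the anchors it contains. This is just an exploratory sketch on top of the snippet above, not part of the final script:

# exploratory only: list the first few links under the main content area
# to see which hrefs are worth keeping
for a in results.find_all('a', href=True)[:10]:
    print(a.text.strip(), '->', a['href'])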

It looked like I needed all the hrefs, and then I could filter down to only the ones I was interested in.

# get all hrefs
hrefs = results.find_all('a', href=True)

# keep all href name and link in a list
href_list = []
for href in hrefs:
    href_list.append([href.text, href['href']])

# convert the list to a DataFrame
href_list = pd.DataFrame(href_list, columns=['name', 'url'])

# remove absolute URLs (rows containing https), keeping only relative release-plan links
href_list = href_list[~href_list['url'].str.contains("https")]

# add a new column with full url    
href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']
href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'

I was able to extract the hrefs from the web page using the above code snippet. It was a bit messy, but I managed to clean it up and get the list I wanted. Next, I had to iterate over the links and scrape the data from each page. The pages have different formats and layouts, so I needed to adjust my code accordingly. I asked ChatGPT for help again, and it suggested using pd.read_html to read the tables from the HTML pages.

table_list = []
# loop through href_list; if a page has tables, append them to table_list
for index, row in href_list.iterrows():
    try:
        table = pd.read_html(row['full_url'], extract_links="all")
    except ValueError:
        # pd.read_html raises ValueError when a page has no tables - skip it
        continue
    table[0]['product'] = row['name']
    table[0]['FullURL'] = row['full_url']
    table_list.append(table[0])

Pandas_df = pd.concat(table_list, ignore_index=True)
Pandas_df.columns = ['Feature', 'Estimated_release_timeline', 'Product','FullURL', 'Enabled_for', 'Public_Preview','GA']
Pandas_df = Pandas_df.astype({'Feature': 'string', 'Estimated_release_timeline': 'string', 'Product': 'string', 'Enabled_for': 'string', 'Public_Preview': 'string', 'GA': 'string'})

spark_df = spark.createDataFrame(Pandas_df)

# display(spark_df)

I decided to make some improvements to the data: I added extra columns to identify the products more easily, converted the data types to string for consistency, and gave the columns meaningful names, avoiding any spaces that could cause problems later.

Next, I applied some more transformations with Spark. I learned that a Spark DataFrame is much faster than pandas here, so I switched to a Spark DataFrame.

Script for all transformations
from pyspark.sql.functions import split, col, regexp_replace, when, concat, lit
spark_df = spark_df.withColumn('Feature_Name', split(col('Feature'), '\',')[0])
spark_df = spark_df.withColumn('Feature_URL', split(col('Feature'), '\',')[1])
# display(spark_df)

spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\(\'', ''))
spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'\)', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', ' ', ''))
# display(spark_df)

# spark_df = spark_df.withColumn('Feature_URL', concat(col('FullURL'), col('Feature_URL')))
spark_df = spark_df.withColumn('Feature_URL', when(col('Product') == 'Power BI'
    , concat(lit('https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/'), col('Feature_URL')))
    .otherwise(concat(col('FullURL'), col('Feature_URL'))))

spark_df = spark_df.withColumn('Estimated_release_timeline', split(col('Estimated_release_timeline'), ',')[0])
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\(\'', ''))
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\'', ''))
spark_df = spark_df.drop('Feature')
# display(spark_df)
spark_df = spark_df.withColumn('Enabled_for', split(col('Enabled_for'), '\',')[0])
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\(\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\)', ''))

#split GA column into two columns
spark_df = spark_df.withColumn('GA', split(col('GA'), ',')[0])
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\(\'', ''))
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\'', ''))

#split Public_Preview column into two columns
spark_df = spark_df.withColumn('Public_Preview', split(col('Public_Preview'), ',')[0])
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\(\'', ''))
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\'', ''))

# spark_df.show()  
# display(spark_df)


Finally, I inserted all the data into a Spark Delta table.

Table_Name = 'Fabric_Release_Plan'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')
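
If you want to sanity-check the load, a quick Spark SQL query against the new table works; this is just a sketch I would run in the same notebook, not part of the original flow:

# optional sanity check: count features per product in the Delta table
check_df = spark.sql(
    "SELECT Product, COUNT(*) AS Features "
    "FROM Fabric_Release_Plan "
    "GROUP BY Product ORDER BY Features DESC"
)
check_df.show(truncate=False)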

Then I went and got the description of each item as well, to display in the tooltips of my report. This was similar to the rest of the code, but more fun and challenging.

Script to get Descriptions

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://learn.microsoft.com/en-us/fabric/release-plan"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='main')

# get all hrefs
hrefs = results.find_all('a', href=True)

# keep all href name and link in a list
href_list = []
for href in hrefs:
    href_list.append([href.text, href['href']])

# add a column name to the list with full url
href_list = pd.DataFrame(href_list, columns=['name', 'url'])

# remove rows containing https  
href_list = href_list[~href_list['url'].str.contains("https")]

# add a new column with full url    
href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']
href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'

Desc_List = []
# loop through href_list and collect the feature descriptions from each page
for index, row in href_list.iterrows():
    
    # also get all descriptions
    url = row['full_url']
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id='main')

    # each feature is an h3 heading followed by a timeline paragraph and a description paragraph
    h3 = results.find_all('h3')
    h3_ids = []
    for h in h3:
        h3_ids.append(h.get('id'))
        h3_ids.append(h.get_text())
        h3_ids.append(h.find_next('p').get_text())
        h3_ids.append(h.find_next('p').find_next('p').get_text())
    # regroup the flat list into rows of four: id, name, timeline, description
    h3_ids = [h3_ids[i:i+4] for i in range(0, len(h3_ids), 4)]
    h3_ids = pd.DataFrame(h3_ids, columns=['id', 'name', 'timeline', 'full_description'])
    h3_ids['url'] = url
    # add product name
    h3_ids['product'] = row['name']
    # add full url with concat of url and id
    h3_ids['full_url'] = url + '#' + h3_ids['id']
    Desc_List.append(h3_ids)

# concat all desc_list dataframes
Desc_List = pd.concat(Desc_List)
# display(Desc_List)

spark_df = spark.createDataFrame(Desc_List)
# spark_df.show()

Table_Name = 'Fabric_Release_Plan_with_desc'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')
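
For the report tooltips, the descriptions need to line up with the release-plan rows. One way to check that is to join the two Delta tables on the feature URL; this is only a sketch, and the URLs may need a bit of normalising (trimming, lower-casing the anchor) before the join matches cleanly:

from pyspark.sql.functions import col

plan_df = spark.table('Fabric_Release_Plan')
desc_df = spark.table('Fabric_Release_Plan_with_desc')

# sketch only: join the release plan to the descriptions on the feature URL
joined_df = plan_df.join(
    desc_df.select(col('full_url').alias('Feature_URL'), col('full_description')),
    on='Feature_URL',
    how='left'
)
# display(joined_df)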

Next, I ran a simple script to get the time the notebook was refreshed, so I can display the last refresh date and time on the report.

# get the notebook run time and save it in a table
from datetime import datetime
import pandas as pd

now = datetime.now()
current_time = now.strftime("%Y-%m-%d %H:%M:%S")

df = pd.DataFrame([current_time], columns=['Time'])
spark_df=spark.createDataFrame(df)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable("Notebook_runtime")

Modelling

Modelling was pretty easy. I created relationships using the SQL endpoint relationships view.

Then I chose which tables to show in my default semantic model.

This lineage view explains the flow.

Reporting

I created a simple Power BI report on top of the default semantic model using the Power BI service. I published it to the web, and you can view the report here – https://app.powerbi.com/view?r=eyJrIjoiOTYxMmM3NjYtMGIxZS00MDFmLTljOWMtYTYyMzBmYjA3ZGJjIiwidCI6IjM4YmU3MjU5LTMyMGQtNDZkYy04YzNiLWNkZDA5OTM1NGExMyIsImMiOjh9

Until next time,
Prathy 🙂
