Like many, I have been playing with Fabric, and many of my clients are also excited about it and want to know more. As a solution architect in the consulting world, one of the most common questions I get asked is: “When will certain features be available? Where are they on the roadmap?” That’s what sparked the idea of scraping the Microsoft Fabric Roadmap and creating this Power BI report. It is based on a Direct Lake connection, so it has been a bit temperamental.
So, how did I do it? If you are not interested in the whole story, the Python code below is all you need to run to get the roadmap. If you are interested in my process, carry on reading 🙂
Data Integration
I am not proficient in Python, but I have done web scraping with Power BI many times, so I know my way around web scraping a bit. I started with ChatGPT, putting together a rough skeleton in Python.
url = "https://learn.microsoft.com/en-us/fabric/release-plan" page = requests.get(url) soup = BeautifulSoup(page.content, 'html.parser') results = soup.find(id='main') # print(results.prettify())
The above script gives me the entire page, which is obviously more than I need. I went to the browser’s developer tools and worked out which snippets I was interested in.
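To double-check what the developer tools were showing, a quick loop over the anchors inside the main element (reusing the results variable from the snippet above) is enough to confirm that the product pages are linked with relative hrefs. This is purely an illustrative check:

# illustrative check only: list the first few links under the main element
for a in results.find_all('a', href=True)[:10]:
    print(a.text, '->', a['href'])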
It looked like I needed all the hrefs, and then I could filter down to only those I was interested in.
import pandas as pd

# get all hrefs
hrefs = results.find_all('a', href=True)

# keep each href name and link in a list
href_list = []
for href in hrefs:
    href_list.append([href.text, href['href']])

# turn the list into a DataFrame with named columns
href_list = pd.DataFrame(href_list, columns=['name', 'url'])

# remove rows containing https (external/absolute links)
href_list = href_list[~href_list['url'].str.contains("https")]

# add a new column with the full url
href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']
href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'
I was able to extract the hrefs from the web page using the above snippet. It was a bit messy, but I managed to clean it up and get the list I wanted. Next, I had to iterate over the links and scrape the data from each page. The pages have slightly different formats and layouts, so I needed to adjust my code accordingly. I asked ChatGPT for some help, and it suggested using pandas’ read_html to read the tables from the HTML pages.
table_list = []

# loop through href_list and read the release-plan table from each page
for index, row in href_list.iterrows():
    table = pd.read_html(row['full_url'], extract_links="all")
    table[0]['product'] = row['name']
    table[0]['FullURL'] = row['full_url']
    table_list.append(table[0])

Pandas_df = pd.concat(table_list, ignore_index=True)
Pandas_df.columns = ['Feature', 'Estimated_release_timeline', 'Product', 'FullURL', 'Enabled_for', 'Public_Preview', 'GA']
Pandas_df = Pandas_df.astype({'Feature': 'string', 'Estimated_release_timeline': 'string', 'Product': 'string',
                              'Enabled_for': 'string', 'Public_Preview': 'string', 'GA': 'string'})
spark_df = spark.createDataFrame(Pandas_df)
# display(spark_df)
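One thing worth knowing about extract_links="all" (and the reason for all the clean-up that follows): pandas returns each cell as a (text, link) tuple, and casting the column to string turns that tuple into a "('text', 'link')" string. A minimal sketch of that behaviour, using a made-up feature name:

import pandas as pd

# hypothetical (text, href) pair – the shape pd.read_html(..., extract_links="all") returns per cell
cell = ('Sample feature', 'sample-feature')
df = pd.DataFrame({'Feature': [cell]}).astype({'Feature': 'string'})
print(df.loc[0, 'Feature'])   # ('Sample feature', 'sample-feature')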
I decided to make some improvements to the data: I added extra columns to identify the products more easily, converted the data types to string for consistency, and gave the columns meaningful names, avoiding spaces that could cause problems later.
Next, I applied some more transformations with Spark. I learned that Spark DataFrames are much faster than pandas for this kind of work, so I switched to a Spark DataFrame.
Script for all transformations
from pyspark.sql.functions import split, col, regexp_replace, when, concat, lit

# split the Feature tuple-string into a name and a relative URL
spark_df = spark_df.withColumn('Feature_Name', split(col('Feature'), '\',')[0])
spark_df = spark_df.withColumn('Feature_URL', split(col('Feature'), '\',')[1])
# display(spark_df)

spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\(\'', ''))
spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'\)', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', ' ', ''))
# display(spark_df)

# build the full feature URL; Power BI features live under the Power Platform release plan
# spark_df = spark_df.withColumn('Feature_URL', concat(col('FullURL'), col('Feature_URL')))
spark_df = spark_df.withColumn('Feature_URL',
                               when(col('Product') == 'Power BI',
                                    concat(lit('https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/'), col('Feature_URL')))
                               .otherwise(concat(col('FullURL'), col('Feature_URL'))))

# clean up Estimated_release_timeline
spark_df = spark_df.withColumn('Estimated_release_timeline', split(col('Estimated_release_timeline'), ',')[0])
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\(\'', ''))
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\'', ''))

spark_df = spark_df.drop('Feature')
# display(spark_df)

# clean up Enabled_for
spark_df = spark_df.withColumn('Enabled_for', split(col('Enabled_for'), '\',')[0])
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\(\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\)', ''))

# clean up the GA column
spark_df = spark_df.withColumn('GA', split(col('GA'), ',')[0])
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\(\'', ''))
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\'', ''))

# clean up the Public_Preview column
spark_df = spark_df.withColumn('Public_Preview', split(col('Public_Preview'), ',')[0])
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\(\'', ''))
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\'', ''))
# spark_df.show()
# display(spark_df)

Table_Name = 'Fabric_Release_Plan'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')
Finally, I inserted all the data into a Spark Delta table.
Table_Name = 'Fabric_Release_Plan'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')
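As an aside, the same “strip the ('text', 'link') wrapper” pattern repeats for several columns in the script above, so it could be factored into a small helper. This is just a sketch, not part of the original notebook:

from pyspark.sql.functions import split, col, regexp_replace

def clean_tuple_column(df, column_name):
    # keep only the text part of a "('text', 'link')" style cell
    df = df.withColumn(column_name, split(col(column_name), '\',')[0])
    df = df.withColumn(column_name, regexp_replace(column_name, '\(\'', ''))
    df = df.withColumn(column_name, regexp_replace(column_name, '\'', ''))
    return df.withColumn(column_name, regexp_replace(column_name, '\)', ''))

for c in ['Estimated_release_timeline', 'Enabled_for', 'Public_Preview', 'GA']:
    spark_df = clean_tuple_column(spark_df, c)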
Then I went and got the description of each item as well, to display in the tooltips of my report. This was similar to the rest of the code, but more fun and challenging.
Script to get Descriptions
url = "https://learn.microsoft.com/en-us/fabric/release-plan" page = requests.get(url) soup = BeautifulSoup(page.content, 'html.parser') results = soup.find(id='main') # get all hrefs hrefs = results.find_all('a', href=True) # keep all href name and link in a list href_list = [] for href in hrefs: href_list.append([href.text, href['href']]) # add a column name to the list with full url href_list = pd.DataFrame(href_list, columns=['name', 'url']) # remove rows containing https href_list = href_list[~href_list['url'].str.contains("https")] # add a new column with full url href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url'] href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features' Desc_List=[] table_list = [] # loop through href_list, check if it has tables and if it does, append to table_list for index, row in href_list.iterrows(): # also get all descriptions url = row['full_url'] page = requests.get(url) soup = BeautifulSoup(page.content, 'html.parser') results = soup.find(id='main') h3 = results.find_all('h3') h3_ids = [] for h in h3: h3_ids.append(h.get('id')) h3_ids.append(h.get_text()) h3_ids.append(h.find_next('p').get_text()) h3_ids.append(h.find_next('p').find_next('p').get_text()) # print(h3_ids) # transpose list h3_ids = [h3_ids[i:i+4] for i in range(0, len(h3_ids), 4)] h3_ids = pd.DataFrame(h3_ids, columns=['id', 'name', 'timeline', 'full_description']) h3_ids['url'] = url # add product name h3_ids['product'] = row['name'] # add full url with concat of url and id h3_ids['full_url'] = url + '#' + h3_ids['id'] Desc_List.append(h3_ids) # concat all desc_list dataframes Desc_List = pd.concat(Desc_List) # display(Desc_List) spark_df = spark.createDataFrame(Desc_List) # spark_df.show() Table_Name = 'Fabric_Release_Plan_with_desc' # print(Table_Name) spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name) print('Table ' + Table_Name + ' has been created successfully')
Next, I ran a simple script to capture when the notebook ran, so I could display the last refresh date and time on the report.
# get the notebook run time and save it in a table
import time
from datetime import datetime
from pyspark.sql.functions import lit
import pandas as pd

now = datetime.now()
current_time = now.strftime("%Y-%m-%d %H:%M:%S")

df = pd.DataFrame([current_time], columns=['Time'])
spark_df = spark.createDataFrame(df)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable("Notebook_runtime")
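To sanity-check the value, the table can be read straight back in the notebook (just a quick check, not something the report needs):

spark.read.table("Notebook_runtime").show(truncate=False)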
Modelling
Modelling was pretty easy. I created relationships using the relationships view of the SQL endpoint.
Then I chose which tables to show in my default semantic model.
This lineage view explains the flow.
Reporting
I created a simple Power BI report on top of the default semantic model using the Power BI service. I published it to the web, and you can view the report here – https://app.powerbi.com/view?r=eyJrIjoiOTYxMmM3NjYtMGIxZS00MDFmLTljOWMtYTYyMzBmYjA3ZGJjIiwidCI6IjM4YmU3MjU5LTMyMGQtNDZkYy04YzNiLWNkZDA5OTM1NGExMyIsImMiOjh9
Until next time,
Prathy 🙂