Iterating over an unknown number of pages in Power Query – avoid endless loops

The topic of Iterating over an unknown number of pages in Power Query has been covered many times. The first one I know was Matt Masson, who used the technique of a pre-generated list of page numbers (PageRange = {1..10000}).

This technique works well but may result in a long query execution time or could even run in a time-out by hitting the throttling limit on certain pages.

Omri Avissar came up with a new solution. Its benefit is shorter execution time and a reduced chance to run in a time-out. A great solution!

…Infinite loops

But yesterday I had to learn that this solution doesn’t work on every page. I wrote a query to pull data from the Power BI Kudos Leaderboard (1 Day). The leaderboard spans over only 2 pages and 5’000 rows of HTML Code. The query was incredible slow!

After some testing and reverse engineering, I found out, that the page redirects to the last page, if the page number is greater than the number of the last page.
Calling
http://community.powerbi.com/t5/kudos/leaderboardpage/board-id/power-bi-designer/timerange/one_day/tab/authors/page/3
was redirected to
http://community.powerbi.com/t5/kudos/leaderboardpage/board-id/power-bi-designer/timerange/one_day/tab/authors/page/2).

The next call was page number 4, then 5 etc. Every time the result was the data of page number 2. An infinite loop! Thanks for that Microsoft 😉

Solution

So, I had to figure out a technique to compare the sent url with the received url. Fortunately, the site provides the url in the HTML Code in Row 13:

<meta content="Data shaping, modeling, and report building in the Power BI Desktop app." name="description"/><meta content="noindex" name="robots"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><link href="http://community.powerbi.com/t5/kudos/leaderboardpage/board-id/power-bi-designer/timerange/one_day/tab/authors/page/2" rel="canonical"/>

By using the brand-new Power Query command “Text.BetweenDelimiters” I could get the desired information out of the HTML Code:

Function_fnGetHTML

The function receives the url and the page number as input and calls the web page. As the page doesn’t provide any usable tables (………….), I had to import the HTML code, look for the desired line of code and compare the result with the original url. If it’s a match, the function gives back the whole HTML Code for further cleaning or otherwise the result is “false”.

All I had to do is to write a Query which uses the function.

Query_Desktop_24Hours

I adjusted the technique of Omri Avissar and boom, there you are. With a little function, I was able to retrieve the data in short time.

Happy Querying!