Matt Briggs | Retrieve Semantic Data from Frequently Asked Questions Published to the Web

Retrieve Semantic Data from Frequently Asked Questions Published to the Web

July 15, 2025

Learn.Microsoft.com has published more than 700 structured Frequently Asked Questions (FAQ) articles that contain both a JSON-LD payload and HTML. This post looks at where that FAQ content supports interoperable standards for accessible content and where it does not.

The content is available at public endpoints and can be programmatically queried, ingested, or read by a script without any additional work. In addition, the elements in each article map to a publicly available schema, the FAQPage schema on Schema.org, which can be used to disambiguate content parts, place them in context, and create a shared representation with anyone using Schema.org.

The semantic power of FAQPage is that it maps each question to the right answer.
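To make the question-to-answer mapping concrete, here is a minimal Python sketch. It assumes the payload follows the Schema.org FAQPage shape (a `mainEntity` list of `Question` items, each with a `name` and an `acceptedAnswer` whose `text` holds the answer). The sample payload is hand-written for illustration and is not copied from any Learn article, and `faq_pairs` is a hypothetical helper name.

```python
import json

# Hand-written, illustrative FAQPage payload (not taken from a real article).
SAMPLE_JSON_LD = """
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is a node pool?",
      "acceptedAnswer": {"@type": "Answer", "text": "A group of nodes."}
    },
    {
      "@type": "Question",
      "name": "Can I resize a cluster?",
      "acceptedAnswer": {"@type": "Answer", "text": "Yes, by scaling it."}
    }
  ]
}
"""


def faq_pairs(payload):
    """Return (question, answer) tuples from a FAQPage-shaped dict."""
    main_entity = payload.get("mainEntity", [])
    if isinstance(main_entity, dict):  # a single question may not be wrapped in a list
        main_entity = [main_entity]
    pairs = []
    for item in main_entity:
        if item.get("@type") != "Question":
            continue
        answer = item.get("acceptedAnswer") or {}
        if isinstance(answer, list):  # tolerate a list of answers, keep the first
            answer = answer[0] if answer else {}
        pairs.append((item.get("name", ""), answer.get("text", "")))
    return pairs


for question, answer in faq_pairs(json.loads(SAMPLE_JSON_LD)):
    print(f"{question} -> {answer}")
```

Because each answer is nested inside its question, the pairing comes from the data structure itself rather than from heading order or proximity in the HTML.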

JSON-LD annotates elements on a page, structuring the data, which can then be used by search engines to disambiguate elements and establish facts surrounding entities, which is then associated with creating a more organized, better web overall.
– A Guide to JSONLD for Beginners (Moz.com)

A first or third party using these endpoints could:

Integrate the content into their content experience.
Perform granular content reporting.
Create content chunks that do the following:
a. Grab deliberate chunks as intended by the content author.
b. Associate semantic metadata with each element of the document.
c. Associate the chunks in the article with other parts of the article.
This content is ingested by Google's and Bing's knowledge graphs, contributing to search engine AI features and knowledge graph results.
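The chunking idea in the list above can be sketched in a few lines of Python. This is only one possible shape, under stated assumptions: the `FaqChunk` fields, the `build_chunks` helper, and the `#q1`-style chunk identifiers are all hypothetical names invented for this illustration, and the question and answer text is made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FaqChunk:
    chunk_id: str               # hypothetical identifier: article URL plus "#qN"
    article_url: str            # ties the chunk back to its article
    schema_type: str            # semantic metadata: the Schema.org type
    position: int               # 1-based order within the article
    question: str
    answer: str
    previous_id: Optional[str]  # links to neighbouring chunks in the same article
    next_id: Optional[str]


def build_chunks(article_url: str, pairs: List[Tuple[str, str]]) -> List[FaqChunk]:
    """Turn (question, answer) pairs into one chunk per question."""
    ids = [f"{article_url}#q{i}" for i in range(1, len(pairs) + 1)]
    chunks = []
    for i, (question, answer) in enumerate(pairs):
        chunks.append(
            FaqChunk(
                chunk_id=ids[i],
                article_url=article_url,
                schema_type="Question",
                position=i + 1,
                question=question,
                answer=answer,
                previous_id=ids[i - 1] if i > 0 else None,
                next_id=ids[i + 1] if i < len(ids) - 1 else None,
            )
        )
    return chunks


# Illustrative pairs only; real pairs would come from the JSON-LD payload.
example_pairs = [
    ("What is a node pool?", "A group of nodes."),
    ("Can I resize a cluster?", "Yes, by scaling it."),
]
example_chunks = build_chunks("https://learn.microsoft.com/en-us/azure/aks/faq", example_pairs)
for chunk in example_chunks:
    print(chunk.position, chunk.chunk_id, chunk.question)
```

Each chunk boundary is the one the author chose (a question and its answer), and every chunk carries enough metadata to be traced back to its article and its neighbours.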

Use SEO Meta 1 Copy to review the semantic data

An extension allows easy checking and copying of SEO-related meta information. It can be used to compare semantic data availability across different topics on Learn.Microsoft.com.

Install SEO Meta 1 Copy. This is an extension that allows you to easily check SEO-related meta information and copy it with one click. You can get it as a Microsoft Edge Add-on or a Google Chrome Extension.
Navigate to any HTML page that contains JSON-LD. For example: https://learn.microsoft.com/en-us/azure/aks/faq
Click the Extension in your Browser Toolbar. Select structured and then select read more under FAQPage. Note there are 47 questions on this page. You will see the semantic structure of the article.
You can compare the availability of this semantic data to other topics. Nearly all of the topics on Learn.Microsoft.com lack semantic data.
On one of those topics, click the Extension in your Browser Toolbar and select structured. Note that while the breadcrumb is present, the semantic data payload is not present.

Use Schema.org to review the semantic data

Paste a URL into validator.schema.org, select Fetch URL, and Run Test. Review the semantic structure of the document.

Open https://validator.schema.org/
Paste one of the URLs with semantic data into the Test your structured data box. Select Fetch URL, and Run Test. For example: https://learn.microsoft.com/en-us/azure/aks/faq
Expand the FAQPage item and review the semantic structure of the document.
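The same check can be approximated locally. The sketch below uses only the Python standard library and assumes the structured data is embedded in `<script type="application/ld+json">` elements, which is the usual way JSON-LD is published in HTML. `JsonLdParser`, `extract_json_ld`, and `schema_types` are hypothetical names, and the two sample pages are hand-written stand-ins for a page with a FAQPage payload and a page with only a breadcrumb.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.payloads.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed payloads instead of failing the whole page


def extract_json_ld(html):
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


def schema_types(payloads):
    """List the @type values declared at the top level of each payload."""
    types = []
    for payload in payloads:
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            if isinstance(item, dict):
                declared = item.get("@type", [])
                types.extend(declared if isinstance(declared, list) else [declared])
    return types


# Hand-written sample pages for illustration.
FAQ_HTML = """<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "BreadcrumbList", "itemListElement": []}</script>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": []}</script>
</head><body><h1>FAQ</h1></body></html>"""

BREADCRUMB_ONLY_HTML = """<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "BreadcrumbList", "itemListElement": []}</script>
</head><body><h1>Overview</h1></body></html>"""

print("FAQPage" in schema_types(extract_json_ld(FAQ_HTML)))
print("FAQPage" in schema_types(extract_json_ld(BREADCRUMB_ONLY_HTML)))
```

This only reports which types a page declares. It does not validate the payload against Schema.org the way the validator does.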

Implementation

You can use a short Python script to download the files as JSON. The script scrapes a list of URLs, detects the JSON-LD FAQPage schema, and extracts the relevant data. When it detects an FAQPage, it also saves the FAQ-related HTML and JSON-LD data.

You can find the script in the retrieval-json-vs-md GitHub repository.

You can find a list of public URLs on Learn with FAQPage in the repository: public-faqpage-urls.txt
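For readers who want a feel for the flow without opening the repository, here is a rough sketch of the same idea. It is not the script from the repository. The function names, the output folder, the file naming, and the User-Agent string are all assumptions made for this illustration, and the regular expression is a simplification that an HTML parser would handle more robustly.

```python
import json
import re
from pathlib import Path
from urllib.request import Request, urlopen

# Simplified matcher for JSON-LD script elements (assumption: quoted type attribute).
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def fetch_html(url, timeout=30):
    """Download a page and decode it to text."""
    request = Request(url, headers={"User-Agent": "faq-jsonld-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_faqpage(html):
    """Return the first FAQPage payload in the page, or None."""
    for raw in JSON_LD_RE.findall(html):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and payload.get("@type") == "FAQPage":
            return payload
    return None


def slug_for(url):
    """Build a file-system-friendly name from a URL."""
    return re.sub(r"[^A-Za-z0-9]+", "-", url.split("://", 1)[-1]).strip("-")


def save_faq_pages(urls, out_dir, fetch=fetch_html):
    """Save the JSON-LD and HTML of every URL that exposes an FAQPage."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        html = fetch(url)
        payload = find_faqpage(html)
        if payload is None:
            continue  # no FAQPage detected, nothing to save
        slug = slug_for(url)
        (out / f"{slug}.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
        (out / f"{slug}.html").write_text(html, encoding="utf-8")
        saved.append(url)
    return saved


# Example usage (performs network requests, so it is left commented out):
# lines = Path("public-faqpage-urls.txt").read_text(encoding="utf-8").splitlines()
# save_faq_pages([line.strip() for line in lines if line.strip()], "faq-output")
```

The `fetch` parameter exists so the download step can be swapped out, for example with a cached copy of the pages or with a stub when testing.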

Conclusion

The content is available at public endpoints and can be programmatically queried, ingested, or read by a script without any additional work.

You were able to:

Access the semantic data at the public endpoints for Learn.Microsoft.com.
Validate the semantic data and its association with the public schema at Schema.org.
Use a program to download the data and use a query to access the content as data.