While we are already benefiting directly from various LLM chat applications, there are also immense possibilities for products built on LLM APIs. Such products can use LLM capabilities in imaginative ways that would have been cost-prohibitive in the past. Let us start with examples of three broad use cases that can benefit from LLM APIs.
Curation
Let's assume we want to create a product that makes learning material available to students of classes 8-12 in a somewhat unique delivery form (e.g. video, interactive, animation). We would have to perform the following steps:
- Get access to all the publicly available CBSE/ICSE materials (PDF, HTML, etc.) on the Internet.
- Note that the learning material comes in several standard forms - lessons, exercises, multiple-choice questions, etc. So, manually distill the learning intent from it, select what we want, and create our own content in the desired form. In other words, we manually curate our content.
- Develop our product that delivers this content to the users.
The first two steps put together make for quite an effort-intensive project.
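The distillation step is where an LLM API takes most of the effort out. Below is a minimal sketch assuming an OpenAI-style chat-completions client; the model name, prompt wording, and the naive `chunk_text` helper are all illustrative assumptions, not a tested pipeline.

```python
# Sketch: distill learning intent from scraped CBSE/ICSE material.
# Assumes an OpenAI-style chat-completions client; model name and
# prompt wording are illustrative assumptions.

DISTILL_PROMPT = (
    "You are a curriculum editor. From the raw material below, extract:\n"
    "1. the learning intent (one sentence),\n"
    "2. the standard form (lesson / exercise / multiple-choice question),\n"
    "3. a cleaned-up version of the content.\n\n"
    "Raw material:\n{raw}"
)

def chunk_text(raw: str, max_chars: int = 8000) -> list[str]:
    """Naive chunker so each request stays within the context window."""
    return [raw[i:i + max_chars] for i in range(0, len(raw), max_chars)]

def distill(raw: str, client, model: str = "gpt-4o-mini") -> list[str]:
    """Run the distillation prompt over each chunk and collect the outputs."""
    outputs = []
    for chunk in chunk_text(raw):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": DISTILL_PROMPT.format(raw=chunk)}],
        )
        outputs.append(resp.choices[0].message.content)
    return outputs
```

The outputs would then feed the delivery product in step 3, with a human pass to select what we actually want.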
Web scraping
Let us assume someone is a large public-equity investor who wants to be able to gather all the intelligence possible about the stocks they are interested in. This implies:
- Make a list of all the websites, blogs, research notes, etc that provide relevant and helpful public information.
- Scrape, structure and store this information.
- Develop a product that allows users to query this stored semi-structured information.
Scraping, structuring, and querying are all somewhat non-deterministic in the outcomes they produce, and again effort-intensive.
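The structuring step can be sketched as an LLM call that emits JSON, followed by validation - the validation matters precisely because the output is non-deterministic. The schema fields and prompt here are invented for illustration; the `response_format` option follows the OpenAI-style chat-completions API.

```python
import json

# Sketch: turn a scraped page into a semi-structured record via an LLM.
# The schema and prompt are assumptions for illustration; a real product
# would tune both per source.

SCHEMA_FIELDS = {"company", "ticker", "summary", "sentiment"}

EXTRACT_PROMPT = (
    "Extract a JSON object with keys company, ticker, summary, and sentiment "
    "(one of positive/neutral/negative) from this article:\n\n{page}"
)

def validate_record(raw_json: str) -> dict:
    """LLM output is non-deterministic, so validate before storing."""
    record = json.loads(raw_json)
    missing = SCHEMA_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["sentiment"] not in {"positive", "neutral", "negative"}:
        raise ValueError("unexpected sentiment value")
    return record

def extract(page_text: str, client, model: str = "gpt-4o-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # ask for strict JSON
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(page=page_text)}],
    )
    return validate_record(resp.choices[0].message.content)
```

Records that fail validation can simply be retried, which is one practical way to tame the non-determinism.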
Public Data API
Finally, let's say we want to develop a product that allows access to, and insights from, Indian village-level (gram panchayat) planning data. One option is for the makers of eGramSwaraj to develop an API for this data. Village planning data is semi-structured, hence the API may only be able to provide the planning documents themselves rather than clean, structured JSON.
LLM services can help in all three of these cases
Curation example from above
We can model the learning material in a hierarchical form: class, subject, chapter, topic, quiz, and assignment. Have a look at these prompts and their output as a sample of what is possible.
- Mathematics topics for an 8th-class student
- Learning topics in rational numbers
- Explain rational numbers
- Critique the explanation
- Present a quiz
The entire course material can be created programmatically by requesting structured output from the API and storing the response content. There are two interesting things here.
- The sources that have been used to create these outputs. Each prompt has drawn on several sources. This illustrates the complexity involved in manual curation if one had to reach a similar level of quality. This isn't to say that no human being can create better material - but it is unlikely we could do so at the same cost, even with qualified individuals.
- Agent orchestration patterns are an emergent space. While prompts 1 and 2 show a hierarchical organisation, prompts 3 and 4 are quite interesting and powerful for improving such content. Simply passing the output through a review process allows us to automatically find areas of improvement in the generated content. Other patterns can be deployed here too - e.g. a "quantitative evaluator agent" that strictly scores the output, so that the evaluator's score decides when to stop improving.
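The generate-critique-score loop above can be sketched as a plain control loop. The agents are passed in as callables so the pattern stays visible; in practice each callable would wrap an LLM API call, and the threshold and round limit are arbitrary assumptions.

```python
# Sketch of the generate -> critique -> score orchestration pattern.
# Each callable would wrap an LLM API call in a real system.

def improve_until(generate, critique, score, threshold=8, max_rounds=3):
    """Regenerate content until the evaluator agent's score clears a threshold."""
    draft = generate(None)           # first draft, no feedback yet
    for _ in range(max_rounds):
        if score(draft) >= threshold:
            break                    # evaluator says it is good enough
        feedback = critique(draft)   # review agent finds improvement areas
        draft = generate(feedback)   # regenerate using the critique
    return draft
```

The same skeleton accommodates other stopping rules, e.g. stopping when two consecutive scores stop improving.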
Web scraping
We have already seen in the previous example that Perplexity used several web sources to generate the output. Perplexity (and similar services) have already scraped a lot of the Internet to improve their responses. This raises a question for anyone wanting to scrape themselves - can one produce better output than this? Quite likely not. We also have the option of helping Perplexity further by providing sources or increasing the weight of preferred sources.
What about licensed content/data? Let me get this out of the way first - robots.txt and terms of usage are not fully respected, including by LLM providers in their data ingestion, by search engines, and by others (as the data becomes more and more valuable, scraping will get more and more difficult).
But what Perplexity is doing is quite remarkable. It is licensing data from major sources and making it accessible via its chat and API. Recently it made NSE and BSE (Indian stock exchanges) data available via its Perplexity Finance program. Perplexity also provides Statista, PitchBook, and Wiley data, and plans to do more.
Why is this significant? Consider BSE and NSE data. There are data distributors that charge close to 10-15 lakhs per year for access to this data. The price is so high because only bulk data is accessible; there is no pay-as-you-go option based on how much data you consume. Since I am developing a product in this domain, I know this is quite an entry barrier and a high cost for anyone wanting to build a new product. That model has now been disrupted by Perplexity Finance.
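Accessing this via the API looks like any chat-completions call. A minimal sketch: the endpoint and the "sonar" model name match Perplexity's public documentation at the time of writing, but treat them as assumptions to verify, and the question itself is just an example.

```python
import json
import urllib.request

# Sketch: querying Perplexity's OpenAI-compatible chat-completions API.
# Endpoint and model name are assumptions based on public docs.

API_URL = "https://api.perplexity.ai/chat/completions"

def build_payload(question: str, model: str = "sonar") -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }

def ask(question: str, api_key: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The pay-as-you-go pricing of such an API is exactly what the bulk-only distributors do not offer.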
Public Data API
As of now, existing LLM services do not provide a complete solution in this case, but they offer a way for the likes of eGramSwaraj to make their huge volume of semi-structured data a lot more accessible.
They can implement an MCP server that exposes the hierarchical information leading LLMs to the main content, which is the GP Plan Excel files. LLMs can then provide much better access to such data, including structuring it as per the needs of each user. There are several government public data sources that could be a lot more impactful than they are today - e.g. imagine the accessibility if this data could be provided on mobile apps down to the village and citizen level.
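Such an MCP server could be sketched as below. The hierarchy, tool names, and file paths are invented for illustration; the wiring uses the Model Context Protocol Python SDK's FastMCP helper (assumed installed as the `mcp` package), imported lazily so the data-access part runs on its own.

```python
# Sketch: exposing GP plan data via MCP tools that let an LLM walk the
# hierarchy down to the plan document. Data layout is hypothetical.

# Hypothetical hierarchy: state -> district -> block -> gram panchayat -> plan file
PLANS = {
    ("Karnataka", "Mysuru", "Hunsur", "Bilikere"): "plans/bilikere_2024.xlsx",
}

def list_gram_panchayats(state: str, district: str, block: str) -> list[str]:
    """List gram panchayats under a block, so the LLM can navigate down."""
    return [gp for (s, d, b, gp) in PLANS if (s, d, b) == (state, district, block)]

def get_plan_document(state: str, district: str, block: str, gp: str) -> str:
    """Return the GP plan Excel file for a gram panchayat."""
    return PLANS[(state, district, block, gp)]

def build_server():
    # Lazy import so the sketch runs even without the MCP SDK installed.
    from mcp.server.fastmcp import FastMCP
    server = FastMCP("eGramSwaraj-plans")
    server.tool()(list_gram_panchayats)
    server.tool()(get_plan_document)
    return server  # server.run() would then serve these tools to an LLM client
```

An LLM client connected to this server can navigate state, district, and block levels on a user's behalf and restructure the retrieved plan data as needed.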
Via large language models (LLMs) we can not only access almost the entire Internet and hook in any data for ease of access, but also turn this information into structured forms that can be consumed by products for their users.




