As I see complex information problems around me, my immediate response now is to take a theoretical stab at them - to see if a powerful model can solve them. The mental exercise is: what do I need to do so that I can leave the most difficult, core part of the problem to the model? Once the complexity that is hard to solve deterministically is narrowed down and handed over to the model, the rest of what I need to do is plain old software engineering.
I start from the humble acceptance that the model is far more capable than most programs I can even imagine writing myself. So I won't even try.
If the model can solve it, then I would like to be lazy, as there are many such problems to solve. I have several of them, but there is one set that I have worked on, which I will get into here. Let's start with the problem statements.
- I want to write a complete functional testing suite for an application, but that seems like a tall order in terms of time. It involves writing page objects and workflows before I can bring in coding agents to write the tests (discussed more here). If my application has 50 pages, that is a lot of resource commitment. There are deterministic solutions of varying degrees, but most are not robust, and they also involve a decent amount of work.
- I want to download my personal data related to loyalty, health, pets, travel, PF, gas, electricity, water - a very, very long list. There are mostly no APIs, since these are consumer/citizen apps. For the same reason, it is unlikely there will be agents or CLI tools for them any decade soon.
- I want to document the functionality of a complete software application - it is boring and difficult to keep up to date. In fact, for something like a low-code platform, where each application is created in a few days to weeks, no one even attempts to document it. The app is the documentation.
- Similar to above, I want to generate video demos of my application with transcripts.
Now imagine solving these using AI. We have something like Opus. The most important thing to do next is to break the problem down into its smallest components - components which I can give to the model and ask it to solve for me. Not just ask it, but expect it to do so. So what are the components for our scenario? (Not an exhaustive list, but close.)
- get accessibility tree
- get page HTML DOM
- ability to enumerate interactive elements in a page
- click
- navigate
- type
- create a page's visual fingerprint
- compare visual fingerprints of pages (so that it can tell whether two pages are the same)
That is mostly it. Now my job as an engineer is only to make the above available to the model. With the following tools, it is pretty straightforward.
- Playwright
- Playwright Chrome extension & WebSockets (if you want to solve for your own tabs)
- dHash, crypto (visual fingerprint, comparison)
- CDP (Chrome DevTools Protocol)
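To make the visual fingerprint idea concrete, here is a minimal sketch of a difference hash (dHash) in plain Python. It is not the code from the project. It assumes the screenshot has already been decoded into rows of 0-255 grayscale values, and it uses a crude nearest-neighbour resize where a real setup would likely lean on an image library. The names `resize_nearest`, `dhash`, `hamming` and `same_page`, and the `max_distance` threshold, are all hypothetical.

```python
from typing import List

Gray = List[List[int]]  # assumption: screenshot already decoded to grayscale rows


def resize_nearest(img: Gray, width: int, height: int) -> Gray:
    """Crude nearest-neighbour downscale; an image library would resample better."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(img: Gray, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    # hash_size + 1 columns give hash_size comparisons per row.
    small = resize_nearest(img, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def same_page(a: int, b: int, max_distance: int = 5) -> bool:
    """Treat two fingerprints as the same page if few bits differ.

    max_distance is an illustrative threshold, not a recommended value.
    """
    return hamming(a, b) <= max_distance


# Tiny synthetic "screenshots": a left-to-right gradient, a brighter copy,
# and a mirrored version.
gradient = [[x for x in range(90)] for _ in range(80)]
brighter = [[x + 20 for x in range(90)] for _ in range(80)]
mirrored = [[89 - x for x in range(90)] for _ in range(80)]

print(same_page(dhash(gradient), dhash(brighter)))
print(same_page(dhash(gradient), dhash(mirrored)))
```

Because the hash only records whether each pixel is brighter than its right-hand neighbour, a uniform brightness shift leaves it unchanged, while a different layout flips many bits. A cryptographic hash of the page content would only catch exact matches, which is why the two are useful together.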
Now, what can we do with it? You can find details here, where everything is explained. The code is here and the output is here. We can do the following.
1. Autonomously explore any website and capture every possible detail of each unique page. No user instructions - it just explores on its own, limited to a maximum number of pages. (output)
2. Do guided exploration (i.e. with some guiding instruction from the user).
3. Use 1 and 2 to document the entire website in detail, with screenshots (since we captured them). Also document page transitions and the action that triggers each one. (output)
4. Use 1 and 2 to generate all page object models, with their interconnections. If I don't like page objects, then I can generate procedural functions as well. (output)
5. Use 1 and 2 to run any custom instruction on demand. For example, download all of my pets' medical history - perhaps by saying just that much in English.
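The bookkeeping behind autonomous exploration can be sketched as follows. This is a simplification under loud assumptions: a plain breadth-first walk stands in for the model's choice of what to try next, the "browser" is an in-memory dictionary, and a SHA-256 of the page content stands in for the visual fingerprint. The names `explore`, `list_actions`, `perform`, `fingerprint` and `SITE` are hypothetical and are not taken from the project.

```python
import hashlib
from collections import deque
from typing import Callable, Dict, List, Tuple

Transition = Tuple[str, str, str]  # (from fingerprint, action, to fingerprint)


def explore(
    start: str,
    list_actions: Callable[[str], List[str]],
    perform: Callable[[str, str], str],
    fingerprint: Callable[[str], str],
    max_pages: int = 50,
) -> Tuple[Dict[str, str], List[Transition]]:
    """Walk a site, keeping one entry per unique page fingerprint."""
    pages = {fingerprint(start): start}
    transitions: List[Transition] = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for action in list_actions(current):
            target = perform(current, action)
            fp = fingerprint(target)
            if fp not in pages:
                if len(pages) >= max_pages:
                    continue  # page budget exhausted: ignore new pages
                pages[fp] = target
                queue.append(target)
            # Only transitions between recorded pages are kept.
            transitions.append((fingerprint(current), action, fp))
    return pages, transitions


# In-memory stand-in for a website: url -> (content, {action: next url}).
SITE = {
    "/home": ("Home", {"click Login": "/login", "click About": "/about"}),
    "/login": ("Login", {"submit": "/dashboard"}),
    "/about": ("About", {"click Home": "/home"}),
    "/dashboard": ("Dashboard", {"click Logo": "/home?ref=dash"}),
    # Different URL, same page: the fingerprint should deduplicate it.
    "/home?ref=dash": ("Home", {"click Login": "/login", "click About": "/about"}),
}


def site_fingerprint(url: str) -> str:
    return hashlib.sha256(SITE[url][0].encode()).hexdigest()[:12]


pages, transitions = explore(
    "/home",
    list_actions=lambda url: list(SITE[url][1]),
    perform=lambda url, action: SITE[url][1][action],
    fingerprint=site_fingerprint,
)
print(len(pages), "unique pages,", len(transitions), "transitions")
```

The recorded `pages` and `transitions` are the raw material for the later steps: documentation, page objects and the page transition graph all read from the same captured structure. With a real browser, `perform` would also have to get back to the source page before acting, which this sketch skips.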
I believe this is quite a repeatable process across the board. At the heart of solving many complex problems is something like the following:
- Take a complex problem
- Break it down into the smallest composable components of a solution
- Provide these working components to the model
- Design for an acceptable error rate, risk assessment & resolution in case of error, & where to put a human in the loop.
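The steps above can be sketched as a small agent loop. This is a sketch under assumptions, not a definitive implementation: `choose_step` is a hypothetical stand-in for the model call, `tools` is a plain dictionary of the working components, `risky` names the tools that need human sign-off, and errors are fed back into the history instead of crashing the run.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

Tool = Callable[[str], str]
Step = Dict[str, str]  # {"tool": name, "arg": value}


def run_agent(
    goal: str,
    tools: Dict[str, Tool],
    choose_step: Callable[[str, List[Step]], Optional[Step]],
    approve: Callable[[Step], bool],
    risky: FrozenSet[str] = frozenset(),
    max_steps: int = 20,
) -> List[Step]:
    """Run tool calls picked by choose_step until it returns None."""
    history: List[Step] = []
    for _ in range(max_steps):  # hard cap bounds the cost of a confused run
        step = choose_step(goal, history)
        if step is None:
            break
        name, arg = step["tool"], step["arg"]
        if name not in tools:
            result = f"error: unknown tool {name}"
        elif name in risky and not approve(step):
            result = "skipped: human declined"  # human in the loop
        else:
            try:
                result = tools[name](arg)
            except Exception as exc:  # surface the error to the next step
                result = f"error: {exc}"
        history.append({"tool": name, "arg": arg, "result": result})
    return history


# Scripted stand-in for the model: replays a fixed plan, one step per turn.
PLAN = [
    {"tool": "navigate", "arg": "/billing"},
    {"tool": "submit_payment", "arg": "100"},
]


def scripted_model(goal: str, history: List[Step]) -> Optional[Step]:
    return PLAN[len(history)] if len(history) < len(PLAN) else None


demo_tools = {
    "navigate": lambda url: f"at {url}",
    "submit_payment": lambda amount: f"paid {amount}",
}

demo_history = run_agent(
    "pay the bill",
    demo_tools,
    scripted_model,
    approve=lambda step: False,  # the human says no to every risky step
    risky=frozenset({"submit_payment"}),
)
for entry in demo_history:
    print(entry)
```

The design choice here is that the loop itself stays dumb: the model decides what to do, while the engineering around it decides what is allowed, what needs approval and how failures are reported back.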
A few other interesting problems like this are:
- Closing a month in a services business: using timesheets, contracts (unstructured docs) and project assignments (structured data + Slack messages) to send the right invoices with minimal, supervisory human effort.
- Top of the funnel sales (there are many products in this space).
- On a personal front - closing a quarter with an advance tax declaration. This involves resolving invoices, DEMAT account transactions, bank interest statements, TDS amounts and dividends. This is a two-part problem, and both parts require an agent loop: first getting all this data, and second calculating the taxes with proof.









