OneSevenTwoNine

Profiling Myself, But Locally

2026-06-27T00:00:00+00:00

I am building a personal data profiler: a system that collects parts of my digital footprint and helps me make sense of them. That sounds creepy, and it should. Most profiling systems are creepy because they are built by someone else, for someone else’s incentives. They turn our behaviours into predictions for ads, feeds, and retention loops. I’m interested in a different version - profiling as a way to understand myself, helping me gauge whether my intentions and my attention are aligned.

Because we use many different cloud services, our digital footprint is highly scattered: bank portals, the Chrome browser, Garmin, YouTube, Goodreads, etc. So the first challenge is simply getting the data into one place and normalizing it. Enriching and categorizing the free-form text in that data would traditionally have required supervised Natural Language Processing (NLP) models, and training those on your own data would have been very difficult. However, LLM technology has reached a point where these NLP tasks can often be handled with few-shot prompts. In particular, if we build the profiler using local LLMs, we get a system that stays within a boundary we control, and we can understand what it is doing. In the past I built an expense reporter based on my credit card and bank statements. Now I have worked on building insights from my browsing history.

At a high level, the architecture of the browsing history profiler is the same as the one I described in my post on Anatomy of a Data Product. The interesting part of this pipeline is that high-level topics are not assigned one page at a time. They are synthesized globally from the full browsing corpus, so the taxonomy reflects my actual library rather than a generic topic list. The four main stages of the data pipeline are:

Capture: A Chrome extension sends bookmarks and browsing history to the service.
Enrich: The service extracts article text, detailed topics, and vector embeddings.
Store: SQLite keeps captures, topics, high-level topics, and UMAP coordinates. ChromaDB stores embeddings and metadata.
Synthesize: A batch job builds a global taxonomy and computes the graph layout.

To explore my browsing and bookmarks data, I have a simple React frontend. Each node in the graph represents one page, nodes are clustered by the semantics of their text, and the high-level and low-level topics make it easy for me to zoom in. Looking at the graph, I can see that software engineering and AI dominate my reading. Entrepreneurship shows up as a meaningful secondary cluster. These top topics don’t surprise me much. But some areas I generally care about, like travelling, parenting, and tennis, are much thinner or non-existent. That gap is useful: the system is not just reflecting my interests, it is showing me where my attention and my intentions diverge.

To protect my privacy, the screenshot below uses a deliberately biased sample.

I would like to keep building profilers that make sense of our own digital footprint in different domains, aiming for meaningful insights and recommendations that balance likeness and serendipity. I also think many more interesting insights would emerge if we could cross-correlate data from different verticals: Finance, Health, Reading, Media, etc.

Personalization does not have to belong to platforms. With local models and local data, we can build profilers that serve self-understanding and can act as an assistant that keeps the user’s best interests in mind.

Snakes and Ladders - Simulation, Markov Chains, and Board Design

2026-02-28T00:00:00+00:00

Lately, my son has been obsessed with the game of Snakes and Ladders, so much so that he has made it his mission to win a game before he does anything else. Watching him play so many games, I got curious about the game’s mechanics and design. In this post I explore questions like: what is the expected number of turns in a game, what would a boring Snakes and Ladders board with extremely low variance look like, and conversely, how can we mathematically design a chaotic board where the variance is very high?

Simulating a Game

Let’s start by simulating a single game with two players on the board that we got from the store.
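As a rough illustration of such a simulation, here is a minimal sketch. The board layout in `JUMPS` is hypothetical (the store-bought board's layout is not listed in this post), and the rule that an overshoot past 100 leaves the token in place is an assumption; the names `move` and `play_game` are made up for this example.

```python
import random

# Hypothetical board for illustration only. Keys are the squares where a
# ladder or snake starts, values are the squares where it ends.
LADDERS = {4: 14, 9: 31, 21: 42, 28: 84, 51: 67, 72: 91, 80: 99}
SNAKES = {17: 7, 54: 34, 62: 19, 64: 60, 87: 36, 93: 73, 95: 75, 98: 79}
JUMPS = {**LADDERS, **SNAKES}
LAST_SQUARE = 100


def move(position, roll):
    """Advance by `roll` squares and follow any ladder or snake.

    Assumed rule: a roll that overshoots square 100 leaves the token in place.
    """
    target = position + roll
    if target > LAST_SQUARE:
        return position
    return JUMPS.get(target, target)


def play_game(num_players=2, rng=None):
    """Play one game; return (index of the winner, round in which they won)."""
    rng = rng or random.Random()
    positions = [0] * num_players  # everyone starts off the board
    rounds = 0
    while True:
        rounds += 1
        for player in range(num_players):
            positions[player] = move(positions[player], rng.randint(1, 6))
            if positions[player] == LAST_SQUARE:
                return player, rounds


winner, rounds = play_game(num_players=2, rng=random.Random(7))
print(f"Player {winner + 1} won in round {rounds}")
```

Passing a seeded `random.Random` makes a single game reproducible, which is handy when debugging the board definition.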

Understanding Game Statistics via Monte Carlo

Now, let’s run a hundred thousand simulations (the Monte Carlo method) on this board to estimate the expected number of turns it takes to win and to see the distribution of turns. First we will run this simulation with only one player!

This distribution suggests a well-balanced board overall, with a few unlucky players who got eaten by big snakes multiple times (100+ turns). Eyeballing the right-skewed distribution, we can see that most games finish in 15 to 50 turns, with a long tail of unlucky outliers.

Mean: 43.82096308326311, Variance: 752.33833145564, Standard Deviation: 27.42878654726891
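A single-player Monte Carlo estimate like this can be sketched in a few lines. The board in `JUMPS` is hypothetical, so its numbers will not match the stats of the store-bought board, and the overshoot rule (stay in place) is an assumption.

```python
import random
import statistics

# Hypothetical board (ladders and snakes merged), for illustration only.
JUMPS = {4: 14, 9: 31, 21: 42, 28: 84, 51: 67, 72: 91, 80: 99,
         17: 7, 54: 34, 62: 19, 64: 60, 87: 36, 93: 73, 95: 75, 98: 79}


def turns_to_finish(rng):
    """Number of dice rolls one player needs to reach square 100."""
    position, turns = 0, 0
    while position != 100:
        turns += 1
        target = position + rng.randint(1, 6)
        if target <= 100:  # assumed rule: an overshoot wastes the turn
            position = JUMPS.get(target, target)
    return turns


def monte_carlo(num_games=100_000, seed=0):
    """Estimate mean, variance and standard deviation of the game length."""
    rng = random.Random(seed)
    samples = [turns_to_finish(rng) for _ in range(num_games)]
    mean = statistics.fmean(samples)
    variance = statistics.pvariance(samples, mu=mean)
    return mean, variance, variance ** 0.5


mean, variance, std = monte_carlo(num_games=20_000)
print(f"Mean: {mean:.2f}, Variance: {variance:.2f}, Standard Deviation: {std:.2f}")
```

The estimate gets tighter as `num_games` grows, at the cost of run time; that trade-off is what motivates the exact approach later in the post.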

Would the game be quicker if there are more players?

Below you can see the average number of turns each player takes before someone wins. It makes intuitive sense that each player takes fewer turns: with more players, there is a higher chance that someone gets lucky (lands in the left-hand portion of the probability distribution). That doesn’t mean the game finishes sooner in terms of time, though! The total number of turns across all players is a multiple of the expected turns per player, plus the amount of drama increases exponentially with each player 😃
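One way to sketch the multi-player case: since players do not interact, the round in which the game ends is the minimum of each player's individual finishing time. The board in `JUMPS` is again hypothetical, the overshoot rule is an assumption, and `average_rounds` is a name made up for this example.

```python
import random
import statistics

# Hypothetical board, for illustration only.
JUMPS = {4: 14, 9: 31, 21: 42, 28: 84, 51: 67, 72: 91, 80: 99,
         17: 7, 54: 34, 62: 19, 64: 60, 87: 36, 93: 73, 95: 75, 98: 79}


def turns_to_finish(rng):
    """Number of dice rolls one player needs to reach square 100."""
    position, turns = 0, 0
    while position != 100:
        turns += 1
        target = position + rng.randint(1, 6)
        if target <= 100:  # assumed rule: an overshoot wastes the turn
            position = JUMPS.get(target, target)
    return turns


def average_rounds(num_players, num_games=3_000, seed=0):
    """Average turns per player until the first of `num_players` finishes."""
    rng = random.Random(seed)
    rounds = [
        min(turns_to_finish(rng) for _ in range(num_players))
        for _ in range(num_games)
    ]
    return statistics.fmean(rounds)


for players in (1, 2, 3, 4):
    print(players, "player(s):", round(average_rounds(players), 1), "turns each")
```

Taking the minimum over several independent draws pulls the average toward the lucky left side of the distribution, which is the intuition described above.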

Markov Chain Analysis

Monte Carlo simulations are great for estimating the expected number of turns, which helps us understand the odds on a board. But they are also computationally expensive, especially if one wants to try out different board variations to optimize for certain factors. Markov chains come to our rescue! Snakes and Ladders is a classic example of a Markov chain.

The game has a finite number of states (squares on the board).
The probability of moving from one square to another depends only on the current square, not the history of the game.
Say you are on square 10, there is a ladder from 12 to 24, and there is a snake from 15 to 6. The 10th row of the transition matrix then looks like (0 … 1/6 0 0 0 0 1/6 0 1/6 1/6 0 1/6 … 1/6 0 …): each of the transitions 10 -> {6, 11, 13, 14, 16, 24} has probability 1/6, and every other transition has probability 0.
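The row for square 10 in that example can be reproduced with a small sketch. The function name `transition_row` and the overshoot rule (stay in place) are assumptions for illustration; index 0 of the row stands for the off-board starting position.

```python
from fractions import Fraction


def transition_row(square, jumps, last_square=100):
    """Probabilities of moving from `square` to every square 0..last_square."""
    row = [Fraction(0)] * (last_square + 1)
    for roll in range(1, 7):
        target = square + roll
        if target > last_square:
            target = square  # assumed rule: an overshoot wastes the turn
        target = jumps.get(target, target)  # follow a ladder or snake, if any
        row[target] += Fraction(1, 6)
    return row


# Ladder 12 -> 24 and snake 15 -> 6, as in the example above.
row = transition_row(10, {12: 24, 15: 6})
nonzero = {square: prob for square, prob in enumerate(row) if prob}
print(nonzero)
```

Stacking one such row per square gives the full transition matrix of the board.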

By representing the board as a transition matrix and treating 100 as an absorbing square, we can calculate the exact expected number of turns and the variance without running any simulations! The math behind the calculation of the expected number of transitions and the variance in the number of transitions can be found here. Let’s define a board and compare the exact Markov chain statistics with our Monte Carlo simulation results.
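For readers who want to see the calculation, here is a minimal sketch using the standard absorbing Markov chain results: with Q the transitions among non-final squares and N = (I - Q)^-1, the expected turns are t = N·1 and the variance is (2N - I)·t - t². The board is hypothetical, the overshoot rule is an assumption, and the small pure-Python solver stands in for what would usually be a call to a linear algebra library.

```python
# Hypothetical board, for illustration only.
JUMPS = {4: 14, 9: 31, 21: 42, 28: 84, 51: 67, 72: 91, 80: 99,
         17: 7, 54: 34, 62: 19, 64: 60, 87: 36, 93: 73, 95: 75, 98: 79}


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            if factor:
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def exact_stats(jumps, last=100):
    """Exact mean, variance and std of the turns needed from square 0."""
    n = last  # transient states are squares 0 .. last-1
    i_minus_q = [[float(r == c) for c in range(n)] for r in range(n)]
    for square in range(n):
        for roll in range(1, 7):
            target = square + roll
            if target > last:
                target = square  # assumed rule: an overshoot wastes the turn
            target = jumps.get(target, target)
            if target != last:  # moves into the absorbing square are not in Q
                i_minus_q[square][target] -= 1 / 6
    t = solve(i_minus_q, [1.0] * n)  # t = N 1, expected turns per square
    s = solve(i_minus_q, t)          # s = N t
    mean = t[0]
    variance = 2 * s[0] - t[0] - t[0] ** 2  # ((2N - I) t - t^2) at square 0
    return mean, variance, variance ** 0.5


mean, variance, std = exact_stats(JUMPS)
print(f"Mean: {mean:.2f}, Variance: {variance:.2f}, Standard Deviation: {std:.2f}")
```

A quick sanity check is a one-square board, `exact_stats({}, last=1)`: only a roll of 1 finishes, so the game length is geometric with p = 1/6, giving a mean of 6 and a variance of 30.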

Stats calculated with Monte Carlo Simulation: Mean: 44.90337, Variance: 757.3198526431, Standard Deviation: 27.519444991552792
Stats calculated with Markov Chain Process: Mean: 43.82096308326311, Variance: 752.33833145564, Standard Deviation: 27.42878654726891

More power to the law of large numbers!! The mean, variance, and standard deviation calculated with the Markov chain model are almost the same as the ones we estimated from 100,000 Monte Carlo simulations.

Designing Snakes and Ladders Boards

Now comes the interesting part.

The board we have been analyzing so far is the one we got from the store. I’d say it’s a well-designed board that balances excitement with boredom. Now, how can we design boards optimized for special cases? Say we want a board that is very chaotic, where it’s hard to predict how many turns it will take to complete the game. Or how about a boring board, where no matter who plays, the number of turns, and therefore the time taken, is pretty much the same?

Here we design the boards for these extreme cases, demonstrating our optimization algorithm. Before we do though, a bit on the theory:

Optimization Algorithm

We want the algorithm to figure out where to place the snakes and ladders so as to achieve the objective we choose. This is a combinatorial problem, and given the number of possible boards, we can’t brute-force our way through it. Also, we aren’t trying to find a provably optimal solution here, just a solution that we can arguably verify is close to our objective.

Such an optimization algorithm typically has three ingredients:

1) Loss function

The loss function we choose measures how much the mean and standard deviation of the number of turns differ from their targets. So the closer the mean and std are to our targets, the smaller the loss. weight_mean and weight_std are hyperparameters to tune, indicating which component of the loss function is more important.

loss = weight_mean * abs(mean - target_mean) + weight_std * abs(std - target_std)
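Written as a function, the loss might look like the sketch below. The default weights of 1.0 are an assumption (the tuned values are not given in this post), and the example inputs are the rounded store-bought and boring board stats reported further down.

```python
def board_loss(mean, std, target_mean, target_std,
               weight_mean=1.0, weight_std=1.0):
    """Weighted distance of a board's (mean, std) from the targets."""
    return (weight_mean * abs(mean - target_mean)
            + weight_std * abs(std - target_std))


# Targets for a "boring" board: mean of 45 turns, standard deviation of 0.
store_loss = board_loss(mean=43.82, std=27.43, target_mean=45, target_std=0)
boring_loss = board_loss(mean=29.77, std=8.03, target_mean=45, target_std=0)
print(store_loss, boring_loss)
```

With equal weights the boring board scores lower than the store-bought one even though its mean is further from 45, because the reduction in standard deviation dominates; raising `weight_mean` would push the search to respect the target mean more strictly.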

2) Neighbour selection function

Given the current locations of the snakes and ladders, this function outputs altered positions for them, resulting in a new board. The alterations can be very wild: flipping the direction of a snake/ladder, adding a new snake/ladder, swapping a snake for a ladder or vice versa, etc. Or they can be minimal: making a snake/ladder longer or shorter by a few squares, sliding a snake/ladder while keeping its length the same, etc. In reinforcement learning terminology, these moves can be categorized as exploration and exploitation respectively. Exploration moves are especially useful at the beginning of the search, to literally explore possibilities and get into the neighbourhood of low “energy” regions. Exploitation moves, in contrast, are important for tuning the board to get closer and closer to the objective.

My Neighbour function picks an exploration move with high probability in the earlier iterations, and an exploitation move with high probability in the final stages.

Below, _T refers to temperature, which I will talk about as part of the next ingredient.

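import random

# Fraction of the starting temperature that remains: 1.0 at the start, near 0 at the end.
cooling_ratio = current_T / max_T

# Chance of an exploration move: about 60% when hot, falling to 5% when cold.
explore_pct = 0.05 + (0.55 * cooling_ratio)

explore_moves = ['add_delete', 'flip', 'swap']
exploit_moves = ['slide', 'shift', 'stretch']

# Pick the move family first, then a weighted move within that family.
if random.random() < explore_pct:
    move_type = random.choices(explore_moves, weights=[0.5, 0.3, 0.2])[0]
else:
    move_type = random.choices(exploit_moves, weights=[0.4, 0.4, 0.2])[0]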

I have also put some constraints on what the board should look like: it should contain at least 5 and at most 10 snakes/ladders, and a snake/ladder shouldn’t start and end in the same row. This is to avoid weird-looking boards.
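A check for these constraints could look like the sketch below. It reads "at least 5 and at most 10" as the combined count of snakes and ladders, which is my interpretation; the range and chaining checks are extra sanity checks I added, and the names `row_of` and `is_valid_board` are hypothetical.

```python
def row_of(square, width=10):
    """Zero-based row of a square on a board with `width` squares per row."""
    return (square - 1) // width


def is_valid_board(jumps, min_jumps=5, max_jumps=10, last_square=100):
    """Check a {start: end} mapping of snakes and ladders against the constraints."""
    if not min_jumps <= len(jumps) <= max_jumps:
        return False
    if set(jumps) & set(jumps.values()):
        return False  # added check: no jump may end where another one starts
    for start, end in jumps.items():
        in_range = 1 <= start < last_square and 1 <= end <= last_square
        if not in_range or start == end:
            return False  # added check: both ends must be on the board
        if row_of(start) == row_of(end):
            return False  # a snake/ladder may not start and end in one row
    return True


valid = {4: 14, 9: 31, 21: 42, 17: 7, 54: 34}
print(is_valid_board(valid))                # five jumps, all spanning rows
print(is_valid_board({**valid, 12: 18}))    # 12 and 18 sit in the same row
```

A Neighbour function can call such a check on every candidate board and simply retry when a move produces an invalid layout.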

3) Iterative Loop

This is where we orchestrate the overall algorithm. We start with a random initialization of the board and ask our Neighbour function to make a new board out of it. Then we measure the loss of the new board using the Markov chain evaluator described above. If the new board has a lower loss, we always accept it. If it has a higher loss than the previous board, we might still accept it, with probability math.exp(-delta_E / current_T), where delta_E is the increase in loss. This probability is higher when the temperature is high and lower when the new board is much worse. The reason to accept such bad boards is to potentially escape local minima and explore other possibilities, especially at the beginning of the algorithm.

This is the same temperature that we use to derive the exploration probability in the Neighbour function. Temperature can be thought of as a “risk tolerance” parameter that controls the “wildness” or randomness of the algorithm. Typically, we start with a very high temperature (max_T) and decay it at every step according to a cooling schedule. So in the earlier iterations of the algorithm we are very open to wild boards, and in later iterations we try to minimize risk.

One note on the cooling schedule: the faster the temperature decays, the sooner the algorithm finishes, at the risk of leaving a lot of candidate solutions unexplored.
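The loop described above is simulated annealing, and a generic version can be sketched as follows. The geometric cooling schedule, the default parameters, and the `neighbour(state, cooling_ratio, rng)` signature are assumptions for illustration, and the toy problem (finding the integer 42) merely stands in for the board search.

```python
import math
import random


def anneal(initial, neighbour, loss, max_T=10.0, min_T=0.01,
           cooling=0.995, rng=None):
    """Simulated annealing with geometric cooling (an assumed schedule)."""
    rng = rng or random.Random()
    current, current_loss = initial, loss(initial)
    best, best_loss = current, current_loss
    current_T = max_T
    while current_T > min_T:
        candidate = neighbour(current, current_T / max_T, rng)
        candidate_loss = loss(candidate)
        delta_E = candidate_loss - current_loss
        # Always accept improvements; accept worse states with a probability
        # that shrinks as the temperature drops or the loss gap grows.
        if delta_E <= 0 or rng.random() < math.exp(-delta_E / current_T):
            current, current_loss = candidate, candidate_loss
            if current_loss < best_loss:
                best, best_loss = current, current_loss
        current_T *= cooling
    return best, best_loss


def toy_neighbour(x, cooling_ratio, rng):
    """Take bigger steps while hot (exploration), smaller ones when cold."""
    step = 1 + int(10 * cooling_ratio)
    return x + rng.randint(-step, step)


best, best_loss = anneal(0, toy_neighbour, lambda x: abs(x - 42),
                         rng=random.Random(3))
print(best, best_loss)
```

For the board problem, `initial` would be a random board, `neighbour` the move selection described earlier, and `loss` the Markov chain evaluation of the board's mean and standard deviation. A `cooling` value closer to 1 decays the temperature more slowly, which means more iterations and more exploration.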

Boring board

To design a boring board, we set the target mean to something reasonable like 45 and the target standard deviation to 0.

After running ~2000 iterations, the algorithm produces a very good, by which I mean very boring, board with an average of ~30 turns and a standard deviation of only ~8 turns. That means most games should take 17 to 29 turns, compared to 15 to 50 turns on the original board.

Store-bought board stats: Mean: 43.82096308326311, Variance: 752.33833145564, Standard Deviation: 27.42878654726891
Boring board stats: Mean: 29.769746404989455, Variance: 64.5211715766194, Standard Deviation: 8.032507178746831

Looking at the boring board, it has only the bare minimum of snakes and ladders. In particular, the snakes are very small, and they seem to be placed not to hurt you but to satisfy the constraints!

Chaotic board

Maths says that to design a chaotic, highly uncertain board, where we can’t really tell how long the game is going to last, we need to target a very high standard deviation. So we choose a reasonable mean and a standard deviation that is higher than the mean itself.

So now let’s try the opposite of the boring board: a highly “chaotic” one. We’ll set the target mean to 45 turns again, but raise the target standard deviation to 50. This creates a highly unpredictable board where games could end extremely quickly or take forever!

Store-bought board stats: Mean: 43.82096308326311, Variance: 752.33833145564, Standard Deviation: 27.42878654726891
Chaotic board stats: Mean: 53.554156747278846, Variance: 4891.157442564327, Standard Deviation: 69.93681035452165

I really love this chaotic board! It’s like either you get one of those very early ladders at 15, 17, 18, 19, or you pretty much don’t get any help at all afterwards. Plus, the snakes at 67, 68, 69, 70, 71 are like “almost” traps that increase the number of turns in the game, thereby increasing the standard deviation. I’m not sure how the algorithm came up with these placements, but it is such a beautiful way to satisfy the given loss function.

Conclusion

We have used two extreme examples to illustrate the power of stochastic optimization, and a similar strategy could be used to design things for whatever value function you choose. While this is a powerful framework that can be applied across domains and problems, designing the Loss and Neighbour functions needs a deep understanding of the domain. Plus, in my opinion, coming up with these functions is more of an art. That makes this framework even more beautiful.

Source code for simulation: https://github.com/psrikanthm/snakes-and-ladder

Anatomy of a Data Product

2026-01-25T00:00:00+00:00

I define a Data Product as an application where the data itself is the primary feature, not just a byproduct. In these products, the UI serves mainly as a wrapper to filter, search, and visualize the underlying dataset. A few examples that fit this definition are Business Intelligence dashboards, web search interfaces, and credit scores in a banking app.

The key components I see in building a Data Product are:

Data Pipelines - Usually offline jobs that pull raw data from various sources and transform it into domain objects
Database - Stores the data
Relevance Engine - Algorithms that help the user discover relevant data. For example, Netflix’s recommendation algorithm surfaces a handful of titles from among, say, a million other options.
Service layer + UI - APIs that serve the data matching the user’s intent, and a UI that captures the user’s intent and presents relevant data.

I would like to talk about the above components by walking through the Categorized Expense Report that I wrote for my own use.

Data Pipelines

The offline jobs that I wrote for ingesting and processing my bank and credit card statements are:

1) CSV parsers that parse the different bank and credit card statements into a standardized data model

2) A keyword-based categorizer and a categorizer using a local LLM, which assign a pre-defined category to each financial transaction

3) A simple aggregator that condenses the spending in each category per month
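To make jobs 2 and 3 concrete, here is a minimal sketch of a keyword-based categorizer feeding a monthly aggregator. The categories, keywords, field names, and sample transactions are all hypothetical, and the local LLM categorizer is left out; uncategorized rows are where it could plausibly take over.

```python
from collections import defaultdict

# Hypothetical keyword lists; the real categories are not listed in this post.
CATEGORY_KEYWORDS = {
    "Groceries": ["costco", "loblaws"],
    "Transport": ["uber", "presto"],
}


def categorize(description):
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Uncategorized"  # candidates for an LLM-based categorizer


def monthly_totals(transactions):
    """Sum amounts per (YYYY-MM, category) from normalized transactions."""
    totals = defaultdict(float)
    for txn in transactions:
        month = txn["date"][:7]  # assumes ISO dates such as 2026-01-15
        totals[(month, categorize(txn["description"]))] += txn["amount"]
    return dict(totals)


transactions = [
    {"date": "2026-01-03", "description": "COSTCO WHOLESALE #123", "amount": 50.0},
    {"date": "2026-01-15", "description": "Loblaws", "amount": 25.5},
    {"date": "2026-01-20", "description": "UBER TRIP", "amount": 12.0},
    {"date": "2026-02-01", "description": "Mystery Shop", "amount": 9.0},
]
totals = monthly_totals(transactions)
print(totals)
```

The input dictionaries play the role of the standardized data model produced by the CSV parsers, and the output corresponds to the month-by-category aggregation.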

Database

In my toy project, I simply used CSV files for data storage. I followed the medallion architecture to organize my data:

Layer	Schema
Bronze	As is from CIBC, Amex, Scotia bank statement downloads
Silver (Normalized)	Parsed, Normalized, Deduped into standard transactions schema
Silver (Enriched)	Standard transactions schema with an additional column of business logic based expense category. Think of it as enriched data
Gold	Aggregation of expense by month + category

The Gold layer is ready to be presented to the user and offers the month-over-month (MoM) expense insights I was looking for. At the same time, organizing data this way makes it possible to expand to more use cases (anomaly detection, merchant-level analysis, etc.) starting from the same Silver layer. My understanding of the medallion architecture is this: Bronze is very wide and diverse, reflecting the wide variety of input data sources. Gold is also wide and diverse, reflecting the data use cases. Silver, in contrast, is the source of truth and is expected to be very dense; its schema closely resembles the business objects.

Interestingly, the medallion architecture closely resembles the three-tiered architecture of a standard backend service.

View Model (Gold) <-> Domain (Silver) <-> DAO (Bronze)

Service layer + UI

The expense report is consumed as a monthly PDF attachment emailed to me and my partner. So the service layer in this project constructs the PDF and sends the emails once a month. But one can imagine more complex use cases that demand a typical web server, and maybe even a transactional database, say if we want to support user corrections of expense categories.

You may have noticed that I skipped the Relevance Engine component in my mapping. Since the data and use case I was building for are very simple, they don’t warrant any fancy algorithms. However, I think it is a whole interesting domain of its own, and I hope to build a few projects that illustrate the wide space of these algorithms!!

Finally, here is how my expense report looks.

Analysis of text conversations with my wife

2020-03-29T00:00:00+00:00

I was recently going through some of the earliest text conversations between me and my wife. Apart from exposing our naivety and excitement in a cute way, I found those texts insightful about our personalities and relationship. So I tried to apply what I do in my day job to analyze the text data and quantify the conversations.

I exported the WhatsApp chat for the first two months of our relationship into a text file. Since we primarily texted each other on WhatsApp, this captures most of our text conversations, but doesn’t include phone calls or other media.

After parsing and preprocessing the text file, I looked at the number of messages and words. It seems she sent ~400 more messages than I did, and overall we exchanged ~4000 messages in two months! However, I sent those “lengthy” texts to make up for my smaller number of messages: I used 14k words in total compared to her 12k.

Looks like she sends one-word texts a lot more than I do (okay, okies, ohk, vokay, ok, k), and compensates for the words with higher usage of emojis. An emoji is worth 10 words!

I consider myself a curious person who asks a lot of questions. And since these messages are from the early phase of our relationship, it’s not surprising that we asked each other lots of questions. I categorized messages as questions using a simple rule-based filter: a message counts as a question if it ends with “?” or starts with “why”, “what”, “where”, “who”, “when”, or “how”.
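The rule described above fits in a few lines. This is a sketch rather than the code from the repository; in particular, matching the question word as a whole word (so that “whatever” or “however” do not count) is my own choice.

```python
import re

# A question word at the very start of the message, as a whole word.
QUESTION_START = re.compile(r"(why|what|where|who|when|how)\b")


def is_question(message):
    """Rule-based check: ends with '?' or starts with a question word."""
    text = message.strip().lower()
    if text.endswith("?"):
        return True
    return QUESTION_START.match(text) is not None


messages = ["Where are you", "ok?", "whatever", "What's up", "on my way"]
print([m for m in messages if is_question(m)])
```

A plain prefix test would be simpler, but it would also flag messages that merely begin with the same letters as a question word.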

Apart from core office working hours and sleeping times, we were pretty responsive to each other. The response times shown here are in seconds, and on average my wife has a faster response time than I do.

A message is considered a conversation starter if it is sent after a pre-defined time window has passed since the previous message, which in this case I set to 6 hours. Finally, a criterion where we have almost equal numbers!
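As a sketch of how conversation starters could be counted: the version below assumes the messages are sorted by time, measures the gap from the previous message by either person, and counts the very first message as a starter. Those details, and the sample data, are assumptions rather than the repository's exact logic.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_conversation_starters(messages, gap=timedelta(hours=6)):
    """Count, per sender, messages sent more than `gap` after the previous one.

    `messages` is an iterable of (timestamp, sender) pairs sorted by time.
    """
    starters = Counter()
    previous_time = None
    for timestamp, sender in messages:
        if previous_time is None or timestamp - previous_time > gap:
            starters[sender] += 1
        previous_time = timestamp
    return starters


sample = [
    (datetime(2020, 1, 1, 9, 0), "A"),
    (datetime(2020, 1, 1, 9, 5), "B"),
    (datetime(2020, 1, 1, 18, 0), "B"),  # almost 9 hours of silence before this
    (datetime(2020, 1, 2, 8, 0), "A"),   # 14 hours of silence before this
]
starters = count_conversation_starters(sample)
print(starters)
```

Changing `gap` is an easy way to see how sensitive the starter counts are to the chosen window.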

Next I explored the words used in the conversations; I have hidden some words and emojis for privacy reasons. Below you can see our word clouds, after removing standard stopwords like ‘and’, ‘or’, ‘the’, etc.

As expected, there are plenty of words that characterize millennial texting habits, like “haha”, “hahaha”, “yup”, “soo”. It is interesting to see words from three different languages sprinkled all over, as we are both trilingual. My wife had the habit of using weird short forms for words, such as “bcz”, “ryt”, “ppl”, a habit she has since shed ;)

One thing I like about these common words is that they hide some inside jokes and the ways we used to refer to each other #awww

This text analysis would be incomplete without looking at the emojis, especially when about 2000 emojis were exchanged in 4300 messages. Clearly each of us has our favourite emojis to respond with, though I was a bit conservative in throwing emojis around. Hey honey, stop winking so much 😉!!

Finally, I looked at the number of messages per day and plotted it below using Plotly’s time series plot. After loading the messages into a Pandas Series indexed by timestamp, I resampled the data at 24-hour intervals to convert the unevenly spaced time series into an evenly spaced one.

Since we also communicated outside WhatsApp, it’s hard to draw many conclusions from this plot. Still, I can identify when we had our “sparks”, and it’s also striking to see that we texted each other every day!!

If you would like to run this analysis on your own text conversations, feel free to clone the repository https://github.com/psrikanthm/WhatChat and follow the instructions. There are a few more stats available in the code that I haven’t used in this analysis.