layonsan
The capstone for my master’s in data science centered on fine-tuning large language models (LLMs) using Federated Learning (FL). I explored how the Flower framework can be used to fine-tune LLMs on a finance dataset via FL, a privacy-preserving training paradigm in which multiple parties collaboratively train a model under the coordination of a central server. A pre-trained LLM available on HuggingFace serves as the base model, with instruction tuning applied as the representative training procedure. Training with FL proceeds through four iterative steps: (1) global model updating (server), (2) local model training (client), (3) local model updating (client), and (4) global model aggregating.
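As a rough illustration of those four steps, here is a toy sketch of one federated round in plain Python. It is not the Flower API and not the capstone code: the `local_train` rule, the weight layout, and the sample data are all assumptions made up for this example, and the aggregation shown is a sample-weighted average in the style of FedAvg.

```python
from typing import Dict, List, Tuple

# Hypothetical weight layout: parameter name -> list of floats.
Weights = Dict[str, List[float]]


def local_train(global_weights: Weights, client_data: List[float],
                lr: float = 0.1) -> Tuple[Weights, int]:
    """Steps 2-3 (client): start from the global weights and take one toy
    update step that nudges every parameter towards the client's data mean.
    A real client would run instruction tuning on its private dataset."""
    target = sum(client_data) / len(client_data)
    updated = {
        name: [w + lr * (target - w) for w in values]
        for name, values in global_weights.items()
    }
    # Return the number of samples so the server can weight this update.
    return updated, len(client_data)


def fed_avg(client_results: List[Tuple[Weights, int]]) -> Weights:
    """Step 4 (server): average client weights, weighted by sample count."""
    total = sum(n for _, n in client_results)
    first_weights = client_results[0][0]
    return {
        name: [
            sum(weights[name][i] * n for weights, n in client_results) / total
            for i in range(len(values))
        ]
        for name, values in first_weights.items()
    }


def run_round(global_weights: Weights, clients: List[List[float]]) -> Weights:
    """One round: send the global model out (step 1), train locally,
    collect the updates, and aggregate them into a new global model."""
    results = [local_train(global_weights, data) for data in clients]
    return fed_avg(results)


if __name__ == "__main__":
    # Two made-up clients; raw data never leaves local_train.
    new_global = run_round({"w": [0.0]}, [[1.0, 1.0], [3.0]])
    print(new_global)
```

Only the updated weights and sample counts reach the server, which is the property that makes the approach privacy-preserving in spirit; a real deployment would add secure transport and a proper training loop.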
Here’s the development process of an end-to-end machine learning (ML) platform built around a batch-serving architecture. The project is part of my 2023 goal plan, which aims to extend my engineering skills into AI/ML deployment. It draws inspiration and insights from Paul Iusztin’s comprehensive Full Stack MLOps Guide. Rather than merely duplicating his project, I used a different dataset: PM2.5 data for Singapore, extracted through the Open Government Application Programming Interface (API). As a result, although the infrastructure stack and logic closely follow the reference guide, the components responsible for preprocessing, prediction, and inference differ notably. The source code can be found in this GitHub repository.
I have been working with Azure cloud services for the past one to two years and have earned two Microsoft certifications along the way: Azure Fundamentals and Azure Data Engineering. In this post, I highlight a few pivotal projects in which I played a central role. It is less a guide than a showcase of how these services were integrated to meet each project’s specific objectives.
This reflection and review for 2023 includes a more personal element rather than focusing purely on work topics.
Instead of working on analytical insights projects in 2022, I decided to try something different. I worked on two notable projects this year: Medium Articles and Churn Models on Streamlit.
What does your LinkedIn network really look like? I visualized my own connections using NetworkX and Plotly, turning a list of names into a living, breathing graph. Along the way, I explored concepts from network and graph theory—like centrality, clusters, and bridges—that reveal hidden patterns in how people are connected. Dive in to see how data visualization can turn something familiar into something surprisingly insightful.
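To make the centrality idea concrete, here is a minimal sketch of degree centrality in plain Python. The names and edges are invented for illustration and are not taken from my network; NetworkX offers `degree_centrality` with the same normalisation (degree divided by n - 1), so this only shows what such a call computes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        # Connections are treated as undirected.
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n <= 1:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


if __name__ == "__main__":
    # Hypothetical connections: Ana is a hub, Dee bridges Ana's group to Eli.
    edges = [("Ana", "Ben"), ("Ana", "Cy"), ("Ana", "Dee"), ("Dee", "Eli")]
    for name, score in sorted(degree_centrality(edges).items()):
        print(name, round(score, 2))
```

In this toy graph the Dee-Eli edge is a bridge: removing it would cut Eli off from everyone else, which is the kind of structure a network plot makes visible.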
Are all app reviews created equal? I put investment platforms — Syfe, StashAway, and Endowus — under the microscope using three lexicon-based sentiment tools: TextBlob, VADER, and SentiWordNet. Each one tells a slightly different story about what users love (or don’t), and together they reveal the hidden tone behind the feedback.
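For readers new to lexicon-based sentiment, the core idea can be sketched in a few lines. The tiny `LEXICON` below is made up for this example and is not the word list used by TextBlob, VADER, or SentiWordNet; tools such as VADER also account for negation and intensifiers, which this sketch ignores.

```python
import re
from typing import Dict

# Hypothetical polarity lexicon: word -> score in [-1, 1].
LEXICON: Dict[str, float] = {
    "love": 1.0, "great": 0.8, "easy": 0.5,
    "slow": -0.5, "bad": -0.8, "hate": -1.0,
}


def polarity(review: str) -> float:
    """Average the scores of the words found in the lexicon.
    Returns 0.0 (neutral) when no word in the review is known."""
    tokens = re.findall(r"[a-z']+", review.lower())
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    for text in ["Love the app, great and easy to use",
                 "Slow and bad experience",
                 "Opened an account yesterday"]:
        print(round(polarity(text), 2), text)
```

Because each tool ships a different lexicon and scoring rule, the same review can land on different scores, which is why the three tools tell slightly different stories.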
Behind every app lies thousands of user voices. I used Python to scrape reviews for Syfe, Endowus, and StashAway from both the Apple App Store and Google Play. In this three-part series, I walk through collecting reviews from each platform and then combining them into one dataset ready for analysis. This work builds on Apple Store Scraper and Google Play Store Scraper.
Predicting housing prices in Melbourne through regression analysis. This notebook walks through the full workflow: data cleaning, exploration, and modelling with linear and multiple regression. I also apply feature selection techniques (correlation and mutual information) and evaluate model performance using MAE, MSE, RMSE, and R². This notebook is adapted from Price Analysis and Linear Regression on Kaggle.
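As a quick reference for the four evaluation metrics, here is a minimal sketch that computes them from scratch. The function name and the sample numbers are assumptions for illustration, not values from the notebook; in practice a library such as scikit-learn provides equivalent metric functions.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE, and R² for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    # R² is undefined when the actual values are all identical.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


if __name__ == "__main__":
    # Made-up prices (in arbitrary units), not Melbourne data.
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```

MAE and RMSE are in the same units as the price, so they are the easiest to interpret, while R² expresses how much of the variation in price the model explains.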