layonsan

Finetuning LLMs using Federated Learning

My capstone for my master’s in data science centered on finetuning large language models (LLMs) using Federated Learning (FL). I explored how the Flower framework can be used to finetune LLMs on a finance dataset via FL, a privacy-preserving training paradigm in which multiple parties collaboratively train a model under the coordination of a central server. A pre-trained LLM available on HuggingFace serves as the base model, with instruction tuning as the representative training procedure. Training with FL proceeds through four iterative steps: (1) global model updating (server), (2) local model training (client), (3) local model updating (client) and (4) global model aggregating.
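The four steps can be sketched with a toy model. This is a minimal, dependency-free illustration and not the Flower API or the capstone code: the one-weight model `y = w * x`, the function names (`local_train`, `aggregate`, `run_federated`) and the client data are all assumptions made for the example, and the aggregation rule shown is a sample-size-weighted average in the style of FedAvg.

```python
from typing import List, Tuple

Weights = List[float]              # model parameters as a flat list (hypothetical layout)
Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one client


def local_train(global_weights: Weights, data: Dataset,
                lr: float = 0.1, epochs: int = 3) -> Weights:
    """Steps 2-3: a client starts from the global weights and updates them
    on its own data. Toy model: y = w * x, trained with plain SGD."""
    w = global_weights[0]
    for _ in range(epochs):
        for x, y in data:
            error = w * x - y
            w -= lr * error * x
    return [w]


def aggregate(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Step 4: the server averages client weights, weighted by sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def run_federated(clients: List[Dataset], rounds: int = 10) -> Weights:
    global_weights: Weights = [0.0]
    for _ in range(rounds):
        # Step 1: the server sends the current global model to every client.
        updates = [local_train(list(global_weights), data) for data in clients]
        # Step 4: only weights travel back to the server, never the raw data.
        global_weights = aggregate(updates, [len(data) for data in clients])
    return global_weights


# Made-up client data, both drawn from y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.5, 3.0), (3.0, 6.0)],
]
final_weights = run_federated(clients)
print(final_weights)
```

In a real setup the list of floats would be replaced by the LLM's trainable parameters (often only adapter weights), but the round structure stays the same.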

Implementing an end-to-end ML system using batch-serving architecture

This post walks through the development of an end-to-end machine learning (ML) platform built around a batch-serving architecture. The project is part of my 2023 goal plan to expand my engineering capabilities into AI/ML deployments, and it draws on Paul Iusztin’s Full Stack MLOps Guide. Rather than duplicating his project, I used a different dataset: PM2.5 data for Singapore, extracted through the Open Government Application Programming Interface (API). As a result, although the infrastructure stack and logic align closely with the reference guide, the components responsible for preprocessing, prediction, and inference differ notably. The source code can be found in this GitHub repository.

Data Systems using Azure

I have been working with Azure cloud services for the past 1-2 years and have earned two Microsoft certifications along the way: Azure Fundamentals and Azure Data Engineering. In this post, I highlight a few pivotal projects where I played a central role. It is less a guide than a showcase of how these services were integrated to meet each project’s specific objectives.

2023 Review

This review of 2023 includes a more personal element instead of focusing purely on work topics.

2022 Project Summary

Instead of working on analytical insights projects in 2022, I decided to try something different. I worked on two notable projects this year: Medium Articles and Churn Models on Streamlit.

LinkedIn Network Analysis

What does your LinkedIn network really look like? I visualized my own connections using NetworkX and Plotly, turning a list of names into a living, breathing graph. Along the way, I explored concepts from network and graph theory—like centrality, clusters, and bridges—that reveal hidden patterns in how people are connected. Dive in to see how data visualization can turn something familiar into something surprisingly insightful.
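To make centrality and bridges concrete, here is a small dependency-free sketch. It does not use NetworkX or the post's data: the names and edges are invented, and `build_graph`, `degree_centrality` and `is_bridge` are hypothetical helpers written for this example. The centrality formula (degree divided by n - 1) matches the normalization NetworkX uses in `nx.degree_centrality`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (person, person) pairs."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def degree_centrality(adj: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


def is_bridge(adj: Dict[str, Set[str]], a: str, b: str) -> bool:
    """An edge is a bridge if b cannot be reached from a once it is ignored."""
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if {node, nxt} == {a, b}:
                continue  # pretend the a-b edge has been removed
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return b not in seen


# Invented connections for illustration only.
edges = [("me", "ana"), ("me", "ben"), ("ana", "ben"), ("me", "chen"), ("chen", "dev")]
graph = build_graph(edges)
centrality = degree_centrality(graph)
print(centrality["me"])                 # most connected node in this toy graph
print(is_bridge(graph, "me", "chen"))   # removing it cuts off chen and dev
print(is_bridge(graph, "me", "ana"))    # ana is still reachable through ben
```

The brute-force bridge check is fine for a graph of this size; for a full network, a library routine is the better choice.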

Rule-based Sentiment Analysis on Syfe, StashAway and Endowus

Are all app reviews created equal? I put investment platforms — Syfe, StashAway, and Endowus — under the microscope using three lexicon-based sentiment tools: TextBlob, VADER, and SentiWordNet. Each one tells a slightly different story about what users love (or don’t), and together they reveal the hidden tone behind the feedback.
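The idea behind a lexicon-based approach can be shown with a toy scorer. This is only a sketch of the general technique, not how TextBlob, VADER or SentiWordNet work internally: the word list, the scores, the one-word negation rule and the threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1].
LEXICON = {
    "love": 1.0, "great": 0.8, "easy": 0.5,
    "slow": -0.5, "bad": -0.8, "hate": -1.0,
}
NEGATIONS = {"not", "no", "never"}


def sentiment_score(text: str) -> float:
    """Average polarity of lexicon words; a preceding negation flips the sign."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total, hits = 0.0, 0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = LEXICON[token]
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        total += value
        hits += 1
    return total / hits if hits else 0.0


def sentiment_label(score: float, threshold: float = 0.05) -> str:
    """Map a score to a coarse label; the threshold is an arbitrary choice."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for review in ["I love this app, great fees", "Not bad", "Withdrawals are slow"]:
    score = sentiment_score(review)
    print(review, "->", round(score, 2), sentiment_label(score))
```

Because each tool ships its own lexicon and rules, the same review can land on different scores, which is why comparing several of them is informative.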

Scraping App Reviews for popular robo-advisors in Singapore using Python

Behind every app lie thousands of user voices. I used Python to scrape reviews for Syfe, Endowus, and StashAway from both the Apple App Store and Google Play. In this three-part series, I walk through collecting reviews from each platform and then bringing them together into one dataset ready for analysis. This work draws on Apple Store Scraper and Google Play Store Scraper.

Predicting Housing Prices in Melbourne

This notebook predicts housing prices in Melbourne through regression analysis. It walks through the full workflow: data cleaning, exploration, and modelling with linear and multiple regression. I also apply feature selection techniques (correlation and mutual information) and evaluate model performance using MAE, MSE, RMSE, and R². The notebook is adapted from Price Analysis and Linear Regression on Kaggle.
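For reference, the four evaluation metrics can be computed in a few lines. This is a plain-Python sketch with made-up numbers, not the notebook's code or results; `regression_metrics` is a hypothetical helper name, and libraries such as scikit-learn provide equivalent functions.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return MAE, MSE, RMSE and R² for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean squared error
    rmse = math.sqrt(mse)                      # same units as the target
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up prices (in units of 100,000) purely for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the price, so they are the easiest to interpret; R² expresses how much of the variance in price the model explains.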