I graduated from my master’s course back in February 2025. It was an incredibly busy and intense two years, and I would like to share my experience from beginning to end.
The Very Beginning
In the very first semester, I decided to take things slow. At that time, I had only been in Korea for 6 months. If I had had the choice, I would have delayed my entry into the master’s. Korea was still quite new to me, and my command of the Korean language was still not particularly good. I had only finished the equivalent of TOPIK level 2 in the university’s Korean course. I was at a level where I could survive, but it was not optimal.
On the very first day of entering the lab, I recall the Professor being a little disappointed that I could not speak fluently. Also, I was essentially a Stats guy finding myself in a CS lab. I was a little worried that it might be more than I could chew.
Quantum Computing?
The very first project I was tasked with was to study quantum computing. It was quite jarring initially, as no one in the lab had any experience with it. I had only one job: to study quantum computing and report on what I learned. I studied from a book titled “Quantum Computation and Quantum Information” 1 and slowly worked through the chapters. I quickly found that understanding quantum states as continuous probability-like objects was much easier than trying to grapple with the weirdness of quantum properties. If I treated the state as a purely mathematical construct with clearly defined properties (like the squared amplitudes summing to 1), I could start to reason with it and slowly work out simple quantum computing operations. But this book was entirely theoretical, and my Professor wanted me to start flying when I had just learned how to crawl.
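To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the book) of treating a qubit as nothing more than a vector of amplitudes: applying a gate is just a matrix product, and the squared amplitudes always sum to 1.

```python
import numpy as np

# A qubit state is a length-2 complex vector whose squared
# amplitude magnitudes sum to 1. Start in |0>.
state = np.array([1.0, 0.0], dtype=complex)

# The Hadamard gate is a 2x2 unitary matrix; applying it is
# just a matrix-vector product, producing an equal superposition.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
state = H @ state

# Measurement probabilities are the squared magnitudes of the amplitudes.
probs = np.abs(state) ** 2
print(probs)        # -> approximately [0.5, 0.5]
print(probs.sum())  # -> approximately 1.0
```

Nothing quantum-mysterious is happening in the code: it is linear algebra with a normalization constraint, which is exactly the framing that made the material tractable for me.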
To put things into practice, I started to learn the Qiskit 2 Python quantum computing library. Everything was simulated on my “classical” computer. The library allowed me to play with a handful of qubits (7?) and to construct quantum circuits with simple gates to encode information and put qubits into superposition. Before long, the Professor wanted my team to come up with a research proposal. It was crazy, but we tried to produce something, anything. The idea was to study the use of quantum kernels as a layer within a neural network. We actually wrote a differentiable parameterized quantum circuit! But it was clear that with < 10 qubits this network was not going to do anything useful. And so the project came to an end.
Graph Drawing
Right after ending the quantum machine learning project, we received a new one: produce a program to output any graph as an “aesthetic” graph. For the first two weeks, we could not even define the actual requirements; we didn’t even know what “aesthetic” meant. The only thing the team lead said was that the edges needed to be straight lines that bend at 90 degrees. Everything else was up to us. We tried Google and the then brand-new GPT-3 to better define this problem, but they always came up short.
The first breakthrough came when I found the GraphViz 3 library. Using its Python wrapper PyGraphViz 4, I was able to produce hierarchical drawings of directed graphs with 90-degree bends. It looked great, and it seemed like it solved the problem. But a few discussions later, they wanted orthogonal graphs, not hierarchical graphs.
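For reference, the setup at this stage boiled down to asking Graphviz’s `dot` layout engine for orthogonal edge routing via the `splines=ortho` attribute. A minimal sketch (the node names are made up) that emits the DOT source; with PyGraphViz installed, something like `pygraphviz.AGraph(dot_src).draw("out.png", prog="dot")` would render it:

```python
# Build the Graphviz DOT description of a small directed graph,
# requesting a top-to-bottom hierarchy with 90-degree edge bends.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

lines = ["digraph G {", "  rankdir=TB;", "  splines=ortho;"]
for u, v in edges:
    lines.append(f"  {u} -> {v};")
lines.append("}")
dot_src = "\n".join(lines)
print(dot_src)
```

The heavy lifting (layering, ordering, coordinate assignment) all happens inside the `dot` engine; the Python side only describes the graph and the desired style.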
The second breakthrough came when I found the Open Graph algorithms and Data structures Framework (OGDF) 5. It had exactly what we needed: orthogonal graphs. And it delivered the exact results we needed. However, the higher-ups would not be satisfied with what amounts to an off-the-shelf solution from an open-source project. They wanted to know how it works, and so I had to deeply study how an orthogonal graph is produced. Frankly, it still seems like black magic to me, as the field of graph visualization seems to be locked behind an endless sea of theorems, but I could give a rough explanation of what the code does. It was really painful, as the resources were mainly past conference papers that assumed a high degree of prior knowledge. But we managed to deliver a product that could draw orthogonal graphs when given a graph definition.
Battery Anomaly Detection
By this point there was a trend: whenever one project ended, another would begin. The new project’s problem statement was as follows: given the time series of an industrial battery unit, identify anomalies, any anomalies. This project started in a very painful manner, as I was only given a zip file with limited documentation (a single .txt file). It turned out to be an InfluxDB data store, and I had to load the time series data from it. But the power of Arch Linux came in clutch: I installed InfluxDB from the AUR, ran a local instance of the InfluxDB server with some help from the documentation, and extracted the data into CSVs for my team to access.
This was the first time I encountered a dataset that was nearly unusable because of how messy it was. There were NAs everywhere, and values that exceeded the sensor ranges given in the documentation (because the logging machine would record erroneous values when the battery state was undefined). We spent a lot of time trying to clean up the data and extract any signal from it. The data was simply too big to eyeball for features, so I relied on a technique that uses LSTMs to reconstruct time series: any time the reconstruction error went above a threshold, the code would flag an anomaly. And… it was still inconclusive. The data was practically useless.
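The flagging step itself is simple once a model exists. A minimal NumPy sketch of the idea, assuming a trained LSTM autoencoder has already produced the reconstruction (here I just hard-code one for a toy signal):

```python
import numpy as np

def flag_anomalies(series, reconstruction, threshold):
    """Flag time steps where the reconstruction error exceeds a threshold.

    `reconstruction` would come from a trained LSTM autoencoder; the model
    learns to reproduce normal behavior, so spikes it cannot explain
    produce large errors.
    """
    errors = np.abs(series - reconstruction)
    return errors > threshold

# Toy example: the model reconstructs a flat signal, but the input
# contains one spike the model cannot explain.
series = np.array([1.0, 1.0, 1.0, 9.0, 1.0])
reconstruction = np.ones_like(series)
flags = flag_anomalies(series, reconstruction, threshold=3.0)
print(flags)  # only the spike at index 3 is flagged
```

The hard part in practice was never this thresholding logic; it was that on such messy data the reconstruction errors were noisy everywhere, so no threshold separated signal from garbage.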
Then one day I decided to read through the manufacturer’s battery guide carefully and saw that the battery already had dedicated anomaly logging functionality, which just wasn’t being logged into the InfluxDB store. After discussing with the higher-ups, we were simply not needed anymore. And just like that, the project ended.
Battery Lifetime Simulation
But since our project contract lasted longer, we were tasked to contribute anything we could to this battery project. I proposed that we predict the lifetime of the battery unit based on its known usage patterns. This is when things became frankly a bit crazy. First, we had to reverse engineer the specifications of the unit, from the arrangement of the cells down to the exact make and model of each cell. We then studied the usage patterns of the battery and created a Poisson process model to run the battery with variable patterns that, over time, average out to a mean we set in the model. We used a physics-based battery simulation library, BLAST 6. And… the results were again inconclusive. Simulating things is hard! We could not predict the knee-point (the point of rapid deterioration) of the battery. The simulation model was too sensitive to certain parameters, and tweaking them would cause the cell to either last forever or last only a few days. We had to shelve the project.
Graph Planarization
By the start of 2024, I felt like I had not rested at all, just slinging myself from one project to the next. But in order to graduate, I had to publish a paper in either a journal or a conference. I decided to focus on a conference paper, as the requirements seemed a little easier and the turnaround time was faster than a journal’s. The project that intrigued me was, oddly, from the graph drawing project. I decided to focus on a sub-problem, graph planarization, and got started testing some ideas I had. I benchmarked my methods against a reference implementation, and at first they would always lose. But as time went by, I kept writing new optimizations to make my implementation faster and faster. By the end of March, I found that across a test dataset, my method would start to win in many of the cases. I was confident this method would make a good paper, so I wrote one to submit to the 2024 Graph Drawing Conference. By the end of May, I submitted my conference paper and eagerly waited for the result on 20 July.
After I submitted my paper, I finally had some time to rest in June of 2024. I took plenty of rest and tried not to open my laptop. I was quite anxious waiting for the result of my conference paper, but I tried to focus on the other aspects of my life. When 20 July finally came, I did not receive any news. I started to have a bad feeling, and on the morning of the 21st, I received the email: “We regret to inform you that…” I was utterly distraught. Not only had I wasted the past half a year, but with only one semester left, I urgently needed to produce a publication.
Silicon Etching Arcing Anomaly Detection
Let’s wind back to March 2024. I started working on a time series anomaly detection problem from a Korean semiconductor company. They had a very specific problem: they wanted us to create a method to detect anomalies within their machine data. I applied the same code from the earlier battery anomaly detection project and found, again, that the model didn’t work well on the given data. It would predict too many false positives, making it almost useless, and scaling the threshold to be less sensitive would make it miss all the actual anomalies.
We did not make any progress for months, but sometime in May, the company contact suggested that we try converting the time series data into images via Gramian Angular Fields (GAF). I didn’t think it would help much, as I was too fixated on LSTM time series reconstruction. But my lab partner Daniel urged me to think about it carefully. Almost by chance, I had worked on unsupervised image learning in December 2023 for a class. I decided to apply that unsupervised objective to train an image embedding model on the image transforms of the time series data. The resulting embeddings were a great success: combined with K-means clustering, the images containing anomalies could be reliably flagged.
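The GAF transform itself is compact. A minimal NumPy sketch of the summation-field variant (my own reimplementation for illustration, not the project’s code): rescale the series to [-1, 1], interpret each value as an angle, and form a matrix from sums of angle pairs.

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field: turn a 1-D series into an
    n x n "image" that an image embedding model can consume."""
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1  # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))               # polar encoding
    return np.cos(phi[:, None] + phi[None, :])       # pairwise angle sums

series = np.sin(np.linspace(0, 4 * np.pi, 64))
image = gramian_angular_field(series)
print(image.shape)  # (64, 64): one grayscale image per time series window
```

Each window of the machine data becomes one such image, which is what let me reuse the unsupervised image embedding pipeline from the class project essentially unchanged.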
Coming back to July 2024, I was faced with a choice: either continue with my just-rejected paper, or write a new paper based on my previous project. I opted to write a new short paper for a local conference using the unsupervised anomaly detection method on GAF images. I removed the company data, replaced it with the UCR time series classification archive 7, and retrained all the models. The results were good enough to write a paper, and by September, my paper was accepted. I could finally graduate.
Hyundai Data Standardization
Again, let’s wind back to the start of 2024 (making this project no. 3 in the same time frame). I was introduced to a project from Hyundai which required me to map variable item names to standardized item names. Because of the ML hype, one of the project’s requirements was that it must use a language model to solve the data standardization problem. Before this, I had only dabbled in LSTMs and ResNets. It would be my first time using Transformers, though I had learned about them conceptually in class. The HuggingFace NLP/LLM course 8 was incredibly helpful in taking me from 0 to 100 in a matter of weeks. I could quickly give a demo of a fine-tuned T5 seq-to-seq model that achieved ~90% accuracy on the data standardization task.
However, the real challenge was identifying why the remaining 10% could not be solved by the model. Every week we would try different techniques to augment the pre-processing so the model could learn better, e.g. using special tokens to mark input boundaries.
This project allowed me to see the trouble of applying a pretrained model to a new problem domain: ensuring the reliability of predictions for the use case. We eventually had to use embedding comparisons to try to “correct” bad predictions. The idea is that for a given test input, we use embedding vector similarity to find the few closest training inputs; if the model’s prediction is reliable, it should agree with their labels. It is essentially a K-nearest-neighbors check that uses embedding cosine similarity to verify the predictions of the seq-to-seq model. This raised the model’s performance from ~90% to ~95%.
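The verification step can be sketched in a few lines of NumPy. This is a toy illustration of the idea, not the project’s code: the embeddings, the item names (`PUMP_PRESSURE` etc.), and `k` are all invented here.

```python
import numpy as np

def knn_agreement(test_emb, train_embs, train_preds, model_pred, k=3):
    """Check a seq-to-seq prediction against the labels of the k most
    similar training inputs under cosine similarity."""
    sims = (train_embs @ test_emb) / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(test_emb)
    )
    neighbors = np.argsort(sims)[-k:]            # indices of top-k matches
    votes = [train_preds[i] for i in neighbors]
    majority = max(set(votes), key=votes.count)  # neighbors' majority label
    return model_pred == majority                # False => flag for correction

# Toy embeddings: the test point sits among "PUMP_PRESSURE" examples,
# so a model that predicted "PUMP_TEMP" would be flagged as suspect.
train_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
train_preds = ["PUMP_PRESSURE", "PUMP_PRESSURE", "VALVE_STATE"]
test_emb = np.array([1.0, 0.05])
print(knn_agreement(test_emb, train_embs, train_preds, "PUMP_TEMP", k=2))
```

When the check fails, the neighbors’ majority label is a natural candidate for the corrected output, which is roughly how the statistical refinement earned the extra ~5%.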
I was able to co-author a paper on this method, published in IEEE Access under the title “Enhancing Maritime Data Integration for Platform Services with Sequence-to-Sequence Models and Statistical Refinement” 9. I’m glad to have published a paper from my time in my master’s.
Reflection
Looking back, I really poured my life into the lab projects. There were many moments when I had maybe 2-3 concurrent major projects, and I would continue to deliver results each week. I guess a good part of being able to do so was simply because I enjoyed the process. It was fun to come up with new ideas for real problems, and though most would not work, it was great in the times when they did.
But now that I’ve graduated, I’m consolidating my experience and planning what to do next. I hope that I can continue exploring and building things.
- Nielsen, M. A., & Chuang, I. L. (2010). Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge: Cambridge University Press. ↩︎
- H. Hwang, R. Wong, D. Lim, J. Kang and I. Joe, “Enhancing Maritime Data Integration for Platform Services With Sequence-to-Sequence Models and Statistical Refinement,” in IEEE Access, vol. 13, pp. 58636-58648, 2025, https://doi.org/10.1109/ACCESS.2025.3555272. ↩︎