KARN WONG

Platform engineer | HashiCorp Ambassador | AWS SA-Associate & GCP PCA Certified



FASTER SPARK WORKLOADS WITH COMET

For big data processing, Spark is still king. Over the years, many improvements
have been made to Spark performance. Databricks themselves created Photon, a
Spark engine that can accelerate Spark queries, but it is proprietary to
Databricks. Other alternatives do exist (see here for more details), but they
are not trivial to set up. With Apache Arrow DataFusion Comet, however, setup
takes surprisingly little time....

April 7, 2024 · 2 min · Karn Wong


SLIM DOWN PYTHON DOCKER IMAGE SIZE WITH POETRY AND PIP

Python package management is not straightforward, since the default package
manager (pip) does not behave like Node’s npm: it doesn’t track dependency
versions. This is why you should use Poetry to manage Python packages: it
creates a lock file, so you can be sure that every re-install produces the same
versions. However, this poses a challenge when you want to build a Docker image
with Poetry, because you need an extra pip install poetry step (unless you bake
this into your base Python image)....

April 7, 2024 · 2 min · Karn Wong


DATAFRAME WRITE PERFORMANCE TO POSTGRES

Previously, I talked about dataframe performance, but that didn’t cover writing
data to a destination. At a large scale, big data means you need to use Spark
for data processing (unless you prefer SQL, in which case this post is
irrelevant). But not many orgs need big data, so small data frameworks should
work, since they are easier to set up and use than Spark. Initially I wanted to
include pandas as well, but sadly it performs significantly worse than Polars,
so only Spark and Polars remain in the benchmark....

March 17, 2024 · 2 min · Karn Wong


HOW TO CONNECT TO CLOUD SQL FROM CLOUD RUN (NO, YOU DON'T NEED A VPC)

A minimal application architecture consists of a database and an application
backend. Serverless databases are still in their infancy, but thankfully
container-based runtimes are very much alive and doing well. On GCP, a
serverless container-based runtime does exist, known as Cloud Run.

Standard database access pattern

Per standard security practices, you should not expose your database to the
public. This means you should use a proxy/tunnel or private network to reach
your database....

February 10, 2024 · 3 min · Karn Wong


WHAT IS PLATFORM ENGINEERING?

Back in 2017-2018, everyone wanted to be a data scientist. Then reality hit:
they needed a data engineer for a successful machine learning project. Things
didn’t end there, since they also needed a machine learning engineer to create
production-ready code. Some people think you only need an MLE and suddenly your
ML project becomes a reality. Sadly, reality begs to differ, because you also
need someone to deploy and scale it: enter the DevOps engineer (who understands
ML, this is very important)....

January 21, 2024 · 2 min · Karn Wong