W03 - Insights on Troubleshooting Advanced Problems
We had a Webpack ghost issue that plagued us for nearly three years, and there was a major breakthrough last week. I want to use this opportunity to summarize the characteristics of strange, deep frontend problems and share lessons learned handling them.
Strange, deep problems can be divided into two categories depending on whether the code itself is faulty
The first category is code with very hidden design flaws. These issues stem from latent logical defects in the code design and often only surface under specific conditions. For example, incorrect assumptions about asynchrony, concurrency, or state can reveal themselves under low performance or specific execution orders. The most common real-world cases are timing race conditions, which I'll illustrate later.
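As a minimal sketch of the timing race mentioned above, consider two requests whose responses arrive out of order. The `LatestOnly` helper and the search scenario are hypothetical names chosen for illustration, not code from any of the projects discussed here. The idea is only that each request takes a ticket, and a response may update state only if its ticket is still the newest one.

```typescript
// Hypothetical helper: hands out increasing tickets so that only the most
// recently started request is allowed to update state.
class LatestOnly {
  private current = 0;

  // Call when a request starts; returns that request's ticket.
  begin(): number {
    this.current += 1;
    return this.current;
  }

  // Call when a response arrives; true only for the newest request.
  isLatest(ticket: number): boolean {
    return ticket === this.current;
  }
}

// Simulated scenario: request A starts first but its response arrives last.
function simulateOutOfOrder(): string {
  const guard = new LatestOnly();
  let shown = "";

  const ticketA = guard.begin(); // user searches "a"
  const ticketB = guard.begin(); // user then searches "ab"

  // Response B arrives first and is applied.
  if (guard.isLatest(ticketB)) shown = "results for ab";
  // Response A arrives late; without the guard it would overwrite B.
  if (guard.isLatest(ticketA)) shown = "results for a";

  return shown;
}
```

Without the guard, the late response for "a" would silently replace the newer result, and this kind of defect tends to surface only when the network or the device is slow enough to reorder responses.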
The second category often has root causes outside the code itself and requires searching for the truth in a system full of uncertainty. This is the higher-level skill of frontend troubleshooting: not only writing correct code, but also understanding the world in which that code will run. That world can be called an “uncertain runtime,” and its degree of uncertainty often makes observation and reproduction challenging.
Uncertain runtime
This can be understood as a technical assumption: assume everything except the code is unreliable. The goal is to prepare defenses.
The uncertain runtime effectively begins the moment the code is built.
Build (human choices) → Container (heavy customization) → OS & Hardware (high fragmentation) → User & Physical Context (uncontrollable).
1. Build: differences in infrastructure and build artifacts
Inconsistent polyfills, runtime libraries, or build configuration versions can cause new syntax to error in older environments or cause duplicate/incorrect injections that trigger exceptions.
A recent issue encountered by the monthly payments team occurred after switching from Webpack to Rspack: on low-end devices the Promise/thenable execution order sometimes behaved abnormally. The bundler itself might not have a bug, but differences in runtime behavior of the build output can trigger problems. Webpack, for example, has a complex runtime that uses Promise/thenable to coordinate module loading order.
On the merchant checkout side, a set-cookie failure was found on the KuaiLv client due to execution ordering between different threads in the client.
2. Container: differences in host apps / WebView engines
Different WebViews are not fully consistent in JS execution, CSS rendering, security, and API behavior; the shell may also inject or restrict APIs, causing the same H5 page to behave differently across containers.
A typical example is iOS interceptors: containers may intercept network requests or route navigations, and incorrect or duplicated interception logic causes issues. For example, at Huixiao we encountered set-cookie failures that invalidated tokens. We also hit problems related to offline capabilities on the merchant takeaway app.
On the merchant takeaway app, we once found an injected hook that automatically loaded knb.js version 1.9.1 whenever knb.js was missing. As a result, subscribe calls on Android intermittently appeared to succeed without ever executing their callbacks.
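One defensive option for the "reports success but never calls back" symptom is to track every subscription until its callback actually fires. The sketch below is an assumption about how such a watchdog could look; `CallbackWatchdog` and its methods are hypothetical and are not part of knb.js. Timestamps are passed in explicitly so the logic stays easy to test.

```typescript
// Tracks bridge subscriptions that reported success but whose callback has
// not fired yet. All names here are illustrative, not part of knb.js.
class CallbackWatchdog {
  // event name -> time (ms) at which subscribe reported success
  private pending: Record<string, number> = {};
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Record that subscribe returned success for this event.
  subscribed(event: string, nowMs: number): void {
    this.pending[event] = nowMs;
  }

  // Record that the callback actually ran.
  callbackFired(event: string): void {
    delete this.pending[event];
  }

  // Events whose callback is overdue and should be reported or retried.
  overdue(nowMs: number): string[] {
    const result: string[] = [];
    const events = Object.keys(this.pending);
    for (let i = 0; i < events.length; i++) {
      if (nowMs - this.pending[events[i]] >= this.timeoutMs) {
        result.push(events[i]);
      }
    }
    return result;
  }
}
```

In practice the overdue list would be sent to monitoring along with the container and bridge version, which is the information that was missing when this kind of issue first appeared.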
3. OS & Hardware: OS and device fragmentation
Differences in vendor ROMs, device models, resolutions, and system versions can produce device-specific crashes, layout anomalies, or performance issues.
On the merchant takeaway iOS app, we once saw a single user trigger the same yoda call many times in a row. The following year, while working on another issue, we realized the two could be investigated together, and ultimately traced both to insufficient available memory on low-end devices. After a memory warning, the keep-alive mechanism caused the WebView to call reload repeatedly, so a single user appeared to enter the page many times within a short period.
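A pattern like this is easier to spot when page-entry logs are checked for bursts from a single user. The function below is a sketch under assumptions: the name `isReloadBurst`, the window size, and the threshold are all hypothetical, and the timestamps are assumed to be one user's page-entry times in ascending order.

```typescript
// Returns true when at least `threshold` page entries fall inside any
// window of `windowMs` milliseconds. Timestamps must be sorted ascending.
function isReloadBurst(
  entryTimesMs: number[],
  windowMs: number,
  threshold: number
): boolean {
  let start = 0;
  for (let end = 0; end < entryTimesMs.length; end++) {
    // Shrink the window from the left until it spans at most windowMs.
    while (entryTimesMs[end] - entryTimesMs[start] > windowMs) {
      start++;
    }
    if (end - start + 1 >= threshold) {
      return true;
    }
  }
  return false;
}
```

For example, five entries one second apart count as a burst with a five-second window and a threshold of five, while entries a minute apart do not. A flagged burst is only a hint that a reload loop may be happening; it does not identify the cause by itself.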
4. User & Physical Context: differences in users’ real running environments
Users vary widely in how and where they use apps—weak networks, low battery, insufficient storage, etc.—creating many uncontrollable variables that produce edge-case anomalies. These issues lack repeatability and are challenging to diagnose, often requiring correlating instrumentation logs with user action records to find clues.
In third-party payment scenarios we encountered payment failures caused by running two instances of WeChat. Some users use system or third-party dual-app features to run two instances of the same app on one phone. This mechanism breaks assumptions about application singletons or IPC links and can cause unexpected issues.
I've fallen into many traps; here are some lessons and two case studies
1) Let go of perfection: not every deep problem needs an immediate “perfect explanation.” If an incident has already occurred, the first priority is to relieve pressure around you. Find a controllable workaround to contain the risk and shift attention away quickly. Most people don't care about the “ultimate root cause” in the moment—stabilize the situation first, then you can slowly pursue the truth.
2) Be persistent: face challenges bravely to build technical depth. Many problems are solved after long waiting. When you hit a dead end, leave the problem there, wait for opportunity or inspiration, fill in what’s missing, but don't give up.
Case 1: unstable Webpack build artifacts
The Webpack build artifact instability mentioned at the start took nearly three years to resolve, with each round of investigation going deeper and getting closer to the truth.
Symptom: Webpack build artifacts were sometimes unusable, causing sporadic white screens in production. When I first encountered this, we couldn't find the root cause, but we did improve the pipeline's release mechanism so that whitelist verification and full release used the same package.
Configuration analysis: Later, Hongze performed a dedicated Webpack configuration analysis and located an issue where ESM and CJS were mixed in third-party dependencies, causing build anomalies. We optimized the configuration accordingly, but the abnormal artifacts still occurred.
Social experiment and trade-off: Last year, Jianzuan ran a session on build anomalies as part of skills mentoring. We reproduced the unstable build artifact issue on-site. That social experiment deepened the team's understanding of builds and uncovered an effective config toggle: disabling async chunking eliminated the abnormal artifacts at the cost of build speed.
Final resolution: Last week Jianzuan provided a new solution. He identified the n.n() wrapper as a key signature in the build output and verified whether the CJS compression patterns in the artifacts matched expectations to ensure the correctness of production bundles. This can completely close the quality-risk loophole.
At this point I consider the issue closed. If there's extra bandwidth, the next step would be to help Webpack fully resolve CJS module issues in async builds and submit a PR.
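The artifact check from the final resolution could be approximated by a post-build step that scans each bundle for known signatures and fails the release when the counts look wrong. This is only a sketch of the general idea, not Jianzuan's actual implementation. The rule shape, the literal string "n.n(", and the count thresholds are assumptions, and the minified identifier will differ between builds.

```typescript
// One expectation about a bundle. All fields are illustrative assumptions.
interface SignatureRule {
  name: string;      // label used in the failure report
  signature: string; // literal text to look for in the bundle
  minCount: number;  // fewest occurrences considered healthy
  maxCount: number;  // most occurrences considered healthy
}

// Counts non-overlapping occurrences of a literal string.
function countOccurrences(source: string, signature: string): number {
  if (signature.length === 0) return 0;
  let count = 0;
  let index = source.indexOf(signature);
  while (index !== -1) {
    count++;
    index = source.indexOf(signature, index + signature.length);
  }
  return count;
}

// Returns the names of violated rules; an empty array means the check passed.
function checkBundle(source: string, rules: SignatureRule[]): string[] {
  const failures: string[] = [];
  for (let i = 0; i < rules.length; i++) {
    const rule = rules[i];
    const count = countOccurrences(source, rule.signature);
    if (count < rule.minCount || count > rule.maxCount) {
      failures.push(rule.name);
    }
  }
  return failures;
}
```

A pipeline step could read each emitted bundle as text, run `checkBundle` against rules derived from a known-good build, and block the release when any rule fails. The value of this approach is that it turns "the artifact is sometimes bad" into a deterministic gate, even before the underlying bundler behavior is fully explained.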
Case 2: Alpha project marketing refactor
Problems can be solved, but not every problem can receive a complete technical explanation. How you handle the problem matters more.
The Alpha project once saw a drop in account-opening conversion after a marketing refactor. It took more than two months to resolve and exposed a series of issues. In the end we never found a technical root cause, nor did we have the conditions to reproduce it again, so a perfect technical explanation will never exist.
But through the process we gained a much deeper appreciation of quality. During that period we improved real-time monitoring of business metrics and our A/B testing capabilities, made offline quality operations routine, and thoroughly rethought our rollback strategies and decision-making. This event also directly prompted the Alpha project's full refactor the following year, providing the team with a solid maintainability benchmark.