W45 - Impressions from Participating in a Data Compression Competition
I participated in a data compression competition hosted by a small large-model community and finished my submission in half a day. The quality remains to be judged against others' submissions, but the speed of completing it with AI Coding was impressive.
This time I used Windsurf's Codemap and quickly understood the repository. Most AI Coding products focus on the output side and overlook that humans read more slowly and less systematically. As generation speed approaches real time, input-side cognitive latency becomes the new performance bottleneck. Tools like Codemap and Deepwiki enhance the input side and help engineers build mental models quickly.
The theoretical foundation of data compression is information theory. Shannon's definition of information measures the reduction of uncertainty: the higher the uncertainty, the greater the information entropy. Only the parts that effectively remove uncertainty constitute true information payload. Predictable, repetitive parts are redundancy, and the goal of data compression is to reduce redundancy and improve coding efficiency.
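To make the entropy idea concrete, here is a minimal sketch that computes Shannon entropy per symbol: a repetitive string (mostly redundancy) scores low, while a string of equally likely symbols (pure information) scores high.

```python
from collections import Counter
import math

def entropy_bits(data: str) -> float:
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Highly repetitive data is predictable, hence low entropy and compressible.
print(entropy_bits("aaaaaaab"))   # ~0.54 bits/symbol
# Eight equally likely symbols carry the maximum log2(8) = 3 bits each.
print(entropy_bits("abcdefgh"))   # 3.0 bits/symbol
```

The gap between the raw size of the data and its entropy is exactly the redundancy a compressor can remove.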
After seeing the problem, I came up with three approaches corresponding to three levels of compression logic.
The first approach is semantic compression: structured summarization. Perform conventional analysis of the data and eliminate redundancy. My submitted version used this approach; the main strategies included:
List aggregation: aggregate long lists such as consumption, behavior, and location by time window or type, retaining only core metrics (order count, average order value, days of stay, etc.);
Semantic summarization: for text data (dialogs, searches, etc.), extract patterns with regex and templates to generate high-value signals like top_issues and top_keywords;
Unified cleaning: use general-purpose tools (safe_json_list, date parsing, field normalization) so that dirty data from different sources can be compressed and interpreted under a unified framework.
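The list-aggregation strategy above can be sketched in a few lines. The record fields (`ts`, `amount`) and the metric names are hypothetical stand-ins for the competition data, not the actual schema:

```python
from statistics import mean

def aggregate_orders(orders: list[dict]) -> dict:
    """Collapse a long order list into a few core metrics."""
    amounts = [o["amount"] for o in orders]
    days = {o["ts"][:10] for o in orders}   # distinct calendar days
    return {
        "order_count": len(orders),
        "avg_order_value": round(mean(amounts), 2),
        "active_days": len(days),
    }

orders = [
    {"ts": "2025-11-01T09:30", "amount": 42.0},
    {"ts": "2025-11-01T18:10", "amount": 18.0},
    {"ts": "2025-11-03T12:00", "amount": 60.0},
]
print(aggregate_orders(orders))
# {'order_count': 3, 'avg_order_value': 40.0, 'active_days': 2}
```

Three records collapse into three numbers, and the summary stays in plain language a downstream model can read without any decoding rules.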
The second approach is coding compression: symbol-level optimization. Generate a more efficient vocabulary mapping based on statistical probabilities, using shorter symbols to represent high-frequency content, similar to Huffman coding. The problem is that the compressed data must remain usable by large language models, and a model won't understand an unfamiliar symbol system. Even if you provide the encoding rules, the context overhead likely won't shrink, because you have effectively shifted the translation burden onto the LLM. Implementing this approach properly would require retraining a dedicated model, which is extremely costly.
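For reference, a compact Huffman construction (standard textbook algorithm, not the competition code) shows the idea: frequent symbols get shorter codes. The output, however, is a bitstring vocabulary that an LLM cannot read, which is exactly the obstacle described above.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict[str, str]:
    """Build a Huffman code; frequent symbols receive shorter bit strings."""
    counts = Counter(text)
    # Each heap entry: [total frequency, tiebreak id, [symbol, code], ...]
    heap = [[freq, i, [sym, ""]] for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]   # left branch
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]   # right branch
        heapq.heappush(heap, [lo[0] + hi[0], next_id] + lo[2:] + hi[2:])
        next_id += 1
    return {sym: code for sym, code in heap[0][2:]}

code = huffman_code("aaaabbc")
print(code)  # 'a' is most frequent, so it gets the shortest code
```

This works beautifully for byte streams, but a prompt full of Huffman bitstrings plus a decoding table costs the model more tokens and attention than the original text.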
The third approach is engineering compression: the direction of Context Engineering. In production, saving context space purely through data compression is only one strategy; framed under Context Engineering, more effective strategies emerge. Have the large model generate query statements and invoke tools such as RAG to obtain statistics as input, then apply a high-compression-rate algorithm; this combined workflow should amplify the final effect by multiples.
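A schematic of that combined workflow, where `run_query`, the query format, and the record layout are hypothetical stand-ins for an LLM-generated query and a RAG/statistics tool:

```python
import json

# Hypothetical raw records that would otherwise be pasted into the prompt.
RECORDS = [
    {"city": "Tokyo", "amount": 30.0},
    {"city": "Tokyo", "amount": 50.0},
    {"city": "Osaka", "amount": 20.0},
]

def run_query(query: dict) -> dict:
    """Execute a tiny aggregation 'query' the model might have generated."""
    if query["op"] == "sum_by":
        key = query["key"]
        out: dict[str, float] = {}
        for r in RECORDS:
            out[r[key]] = out.get(r[key], 0.0) + r["amount"]
        return out
    raise ValueError("unsupported op")

# Instead of the full record list, only the compact statistic enters context.
stats = run_query({"op": "sum_by", "key": "city"})
context_fragment = json.dumps(stats, separators=(",", ":"))
print(context_fragment)  # {"Tokyo":80.0,"Osaka":20.0}
```

The model never sees the raw rows; it sees only the answer to the question it asked, which is the highest-compression-rate move available.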