My Learning Reflections about Optimizing Neural Networks (Part 1)
Credit: Video generated by Google Veo 3.1 and Google AI Pro
Editor’s Note: This is the first of multiple focused articles covering my learning reflections and industry validations of neural network optimization from Cornell University’s eCornell “Designing and Building AI Solutions” program. Each article examines a specific optimization aspect alongside market observations, building toward a comprehensive understanding.
What I have been up to: I have been building the semantic layer and an LLM component for my overall personal News Analyst application, a locally-deployed system that transforms high-volume, fast-moving AI news into structured metadata for deeper pattern analysis (truly exciting times!).
This involved my Version 2.0 AI Industry News Enhancer (llama-tinker-lora-news-enhancer) project, which enhances formatted AI industry items with semantic metadata in my preferred structure. The work included:
Data Engineering: Extracting and parsing data from various APIs and MCP servers, building ETL functions to format data for training
Dataset Development & Annotation: Hand-annotating 460+ AI news examples (yes, inspecting each one individually)
Experimental Design for Cross-Model Comparison: Building Python configurations for three-model hyperparameter comparisons (Meta Llama 3.2 1B baseline model, Tinker fine-tuned model, and Unsloth fine-tuned model)
Training & Debugging: Conducting controlled experiments, debugging training pipelines, and running comprehensive evaluations
This hands-on work directly validates the neural network optimization principles covered in these articles.
Project repositories:
v2 AI Industry News Enhancer (Production): GitHub | Hugging Face | LinkedIn Post
v1 AI-Related News Enhancer (Proof of Concept): GitHub | Hugging Face
With v2 now live and delivering production-ready results, we’re back to regular programming!
The Art and Science of Neural Network Architecture
Following my exploration of neural network structures and complexity in my previous article, I now turn to the critical challenge of neural network optimization: finding the neural architecture and configuration that aligns with a specific personal, business, or societal objective while balancing performance with computational efficiency. As I continue reflecting on my 5-month journey through Cornell University’s eCornell “Designing and Building AI Solutions” certificate program, instructed by Lutz Finger, the next part of the fourth program module, “Expanding AI Power and Value Through Neural Networks,” revealed that building and optimizing neural networks is as much an art as it is a science.
The Iterative Journey of Neural Network Design
Neural network optimization follows familiar frameworks yet requires nuanced decision-making at each step. The process remains consistent: Data representation with labels, splitting into train/test sets, model training, and performance evaluation through confusion matrices, ROC curves, and AUC metrics. What distinguishes neural network development is the expanded set of design choices we must navigate.
The complexity of finding optimal configurations stems from four primary design levers that we can adjust. These include the number of layers in the network, the number of neurons per layer, the choice of activation functions, and the overall network architecture. Each decision impacts not only model performance but also computational cost and the risk of overfitting.
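To make these levers concrete, here is a minimal, illustrative sketch in PyTorch in which layer count, layer widths, and the activation function become explicit arguments. The input dimension and the two candidate architectures are arbitrary choices of mine for illustration, not configurations taken from the program.

```python
# A minimal sketch (illustrative only) showing the main design levers --
# number of layers, neurons per layer, and activation function -- as
# explicit arguments when defining a simple feed-forward network.
import torch.nn as nn

def build_mlp(input_dim: int, hidden_sizes: list[int], activation=nn.ReLU, output_dim: int = 1):
    """Build a feed-forward classifier from a list of hidden-layer widths."""
    layers, prev = [], input_dim
    for width in hidden_sizes:                      # one entry per hidden layer
        layers += [nn.Linear(prev, width), activation()]
        prev = width
    layers.append(nn.Linear(prev, output_dim))      # output layer (logits)
    return nn.Sequential(*layers)

# Two candidate architectures differing only in depth and width.
simple_net = build_mlp(input_dim=20, hidden_sizes=[64])       # one hidden layer
deeper_net = build_mlp(input_dim=20, hidden_sizes=[128, 64])  # two hidden layers
```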
As a coffee aficionado, Lutz Finger emphasized that data science, like making espresso, is fundamentally about finding balance. This analogy resonated throughout the program as we explored how more layers and neurons generally increase a network’s capacity to learn complex patterns. Yet too much complexity leads to the usual challenge of overfitting, where models perform well on training data but fail to generalize to new patterns or situations.
Balancing Complexity with Performance
Through our eCornell program exercises, I learned a crucial insight: adding architectural complexity does not always translate into proportional performance improvements. In one demonstration, Lutz Finger showed that a two-layer network with 128 and 64 neurons performed marginally better than a simpler 65-neuron network. Yet the minimal gains relative to the increased computational cost suggested we might be approaching the point of diminishing returns or even overfitting.
Computational cost emerges as a critical consideration when scaling neural networks. Every training iteration involves forward propagation, backpropagation, and weight updates through potentially millions or billions of parameters, making deeper and wider networks computationally expensive to train and deploy. The eCornell program emphasized that an efficient approach is to start with simpler architectures and gradually increase complexity while monitoring performance.
The relationship between architectural complexity and performance proved nonlinear. Single-layer networks frequently achieved performance within a few percentage points of far more complex multi-layer architectures, yet required substantially fewer computational resources for both training and inference. This observation reinforces that bigger may not always be better in neural network development.
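A hedged sketch of that “start simple and add complexity while monitoring performance” loop appears below. It uses scikit-learn on a synthetic dataset purely for illustration; the candidate architectures, dataset, and metrics are my own assumptions, not the eCornell exercise.

```python
# Illustrative sketch: train progressively larger networks and watch whether
# the validation AUC gain justifies the extra parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate architectures, ordered from simplest to most complex.
for hidden in [(64,), (128, 64), (256, 128, 64)]:
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300, random_state=0)
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    n_params = sum(w.size for w in clf.coefs_) + sum(b.size for b in clf.intercepts_)
    print(f"layers={hidden}  test AUC={auc:.3f}  parameters={n_params:,}")
# Stop adding depth/width once the AUC gain no longer justifies the parameter growth.
```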
Recent Industry Developments: Specialized Hardware for Neural Network Optimization
The year 2025 has seen massive investment in specialized AI hardware designed specifically for neural network optimization, validating the principle that computational infrastructure must evolve alongside model complexity. The growing focus on specialized hardware and massive infrastructure is driven by the strategic necessity of overcoming the physical constraints inherent in deep learning at scale.
In December 2024, Google announced general availability of Trillium TPUs for cloud customers, with Gemini 2.0 trained on these chips. This demonstrates the evolution of specialized hardware for efficient neural network training.
By April 2025, Google introduced the Ironwood TPU, its seventh generation TPU with improvements specifically optimized for inference rather than training. This specialization reflects the practical reality that deployment efficiency matters as much as training performance.
Apple and Broadcom began developing server chips code-named “Baltra” for AI applications, expected to reach mass production by 2026. Nvidia announced the Rubin CPX GPU (available at the end of 2026), designed for context windows larger than 1 million tokens, showing how hardware continues to evolve to support specific optimization challenges.
By October 2025, the scale of hardware commitments reached unprecedented levels. OpenAI secured a $300B compute deal with Oracle over five years starting in 2027, one of the largest cloud contracts ever. The company also partnered with Broadcom for 10 gigawatts of custom AI chips over four years and with AMD for 6 gigawatts of Instinct GPUs, with total computing capacity agreements reaching 26 gigawatts and targeting 250 gigawatts by 2033.
By November 2025, the infrastructure demands reached critical scaling thresholds. Anthropic committed to purchasing $30B of Azure compute from Microsoft. Google’s AI infrastructure team announced that the company must double its AI serving capacity every 6 months to keep up with growing demand for AI services. With over 100,000 AI accelerators already deployed, Google CEO Sundar Pichai emphasized that the company’s cloud business serves as a key driver of this exponential capacity requirement, validating the principle that computational infrastructure must evolve rapidly alongside model complexity.
Market validation of the specialized hardware thesis came through strong financial performance data. With a focus on fundamentals, specialized AI hardware and software provider NVIDIA reported fiscal year 2026 third quarter (Q3 FY2026) revenue of $57.0B, up 22% from the previous quarter (Q2 FY2026) and up 62% from the prior year (Q3 FY2025). This figure exceeded analyst expectations of $54.92B.
Data center sales (Blackwell and infrastructure for cloud GPUs) were the key driver of these results. NVIDIA generated a record $51.2 billion from its data center segment, a 62% increase year-over-year that surpassed analyst expectations of $49.09B. On November 19, 2025, CNBC reported that compute products, primarily GPUs (e.g., Blackwell Ultra GB300 chips), contributed $43B in sales, while networking components that connect and operate GPU clusters accounted for $8.2B in sales.
NVIDIA also provided strong forward guidance, expecting approximately $65B in sales for the current quarter versus analyst projections of $61.66B. CEO Jensen Huang stated in October 2025 that the company holds $500B in backlog orders for 2025 and 2026 combined, which will be burned down as recognized revenue. For the general audience: backlog order data is commonly reported for manufacturing or physical goods businesses.
Other NVIDIA business segments also showed growth. Gaming revenue reached $4.3B (Q3 FY2026), up 30% from the prior year (Q3 FY2025). The professional visualization business, which serves professional visual computing with hardware and software, reported $760M in sales (Q3 FY2026), up 56% from the prior year (Q3 FY2025). This segment included sales from DGX Spark, the company’s AI desktop computer announced earlier this year. Automotive and robotics sales totaled $592M (Q3 FY2026), up 32% from the prior year (Q3 FY2025).
Nvidia revealed that Blackwell achieved the highest performance and best overall efficiency in the SemiAnalysis InferenceMAX benchmarks. The architecture delivers 10x throughput per megawatt compared with the previous generation.
Major partnership deals reinforced demand. OpenAI committed to deploying at least 10 gigawatts of Nvidia systems for its next generation AI infrastructure. Anthropic announced it will run on Nvidia infrastructure for the first time, initially adopting 1 gigawatt (GW) of compute capacity with Grace Blackwell and Vera Rubin systems.
Google Cloud, Microsoft, Oracle, and xAI partnered to build AI infrastructure with hundreds of thousands of NVIDIA GPUs. Nvidia also celebrated a manufacturing milestone: the first Blackwell wafer produced on United States soil came from TSMC’s Arizona facility, marking Blackwell’s transition to volume production.
Architectural Breakthroughs: Efficiency at Scale
At the onset of 2025, the Chinese AI lab DeepSeek demonstrated training efficiency, achieving competitive parity at far lower development cost, through its open-source DeepSeek models. This came amid reports about the lab’s use of the widely known distillation technique to learn from larger models and its access to Nvidia H100 chips. DeepSeek claimed that its reasoning model DeepSeek R1 cost an estimated $5.6M to train. This is old news from January 2025, but it resulted in a significant (though brief) uproar and speculation about the unit economics of AI.
DeepSeek’s direct innovation on the Transformer architecture introduced key efficiency improvements. Multi-head Latent Attention uses low-rank compression to reduce the KV cache size without compromising model quality, outperforming grouped-query attention methods. The Mixture-of-Experts (MoE) language model DeepSeekMoE separates experts into shared experts that process every token and routed experts that activate selectively, avoiding routing collapse while enabling specialization. Multi-Token Prediction allows the model to predict multiple future tokens simultaneously, improving the training signal and enabling speculative decoding that nearly doubles inference speed. DeepSeek-V3 combined these architectures with FP8 training and a Multi-Plane Network Topology to maximize GPU efficiency and minimize GPU communication overhead. The R-series models inherited this architecture but added reinforcement learning to enhance reasoning capabilities.
Both open-source and frontier model developers have begun pursuing long-run inference efficiency through smaller model variations and task-specific architectures. Examples include:
Meta unveiled Llama 4 Scout and Maverick, two natively-multimodal models using a Mixture-of-Experts (MoE) architecture. Both activate 17 billion parameters per token, with Scout containing 109 billion total parameters across 16 experts and Maverick containing 400B total parameters across 128 experts.
OpenAI introduced its open-weight gpt-oss models, which use alternating full-context and sliding-window attention layers alongside MXFP4 quantization. This 4-bit quantization format significantly reduces memory requirements, enabling the 117 billion parameter gpt-oss-120b to run on a single 80GB GPU while the 21B parameter gpt-oss-20b operates within 16GB of memory.
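To make the shared-versus-routed expert distinction concrete, here is a drastically simplified PyTorch sketch. It is illustrative only, not DeepSeek’s implementation: the layer sizes, the number of routed experts, and the top-k value are arbitrary assumptions, and real MoE kernels dispatch tokens sparsely rather than looping over every expert.

```python
# Toy Mixture-of-Experts layer: one shared expert sees every token, while a
# router sends each token to its top-k routed experts (illustrative only).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_routed: int = 8, top_k: int = 2):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)      # shared expert: processes every token
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)     # scores the routed experts per token
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)        # (tokens, n_routed)
        topv, topi = scores.topk(self.top_k, dim=-1)   # each token keeps its top-k experts
        sparse_w = torch.zeros_like(scores).scatter(-1, topi, topv)
        # Dense loop over experts for clarity; non-selected experts get zero weight.
        routed = sum(sparse_w[:, e:e + 1] * expert(x) for e, expert in enumerate(self.experts))
        return self.shared(x) + routed

out = TinyMoE()(torch.randn(16, 64))                   # 16 tokens through the sparse layer
```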
Practical Application: My Neural Network Optimization Journey
As I learned these architectural principles, I applied them to my own project: building a LoRA fine-tuned news analyzer. My challenge was to process AI industry news on hardware with significant constraints: a Surface Pro laptop with only 16GB RAM, 128MB Intel integrated graphics, and an Intel Core i5-1135G7 processor running at 2.40GHz. These limitations became design drivers that shaped every architectural decision.
My project goal was to automate the enhancement of AI news with structured semantic metadata. The system needed to transform raw news articles into JSON outputs containing relevance scores (1-10 rating), key insights (critical takeaways), company names (mentioned organizations), and concise summaries. This structured data would enable deeper pattern analysis through Google NotebookLM and Gemini models. It also supports my preferred strategic frameworks, converting unstructured news into queryable intelligence.
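The sketch below shows a hypothetical example of that target structure as a Python dictionary serialized to JSON. The field names and values are illustrative; the actual schema and prompts live in the project repositories.

```python
# Hypothetical example of the enhanced-news metadata structure (illustrative;
# the real schema is defined in the llama-tinker-lora-news-enhancer repo).
import json

enhanced_item = {
    "relevance_score": 8,                      # 1-10 rating of relevance to my analysis
    "key_insights": [
        "Specialized inference hardware is becoming a distinct product category."
    ],
    "company_names": ["Google", "NVIDIA"],     # organizations mentioned in the article
    "summary": "Chipmakers are shipping accelerators tuned for inference workloads.",
}

print(json.dumps(enhanced_item, indent=2))     # structured output ready for NotebookLM/Gemini
```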
I chose Meta’s Llama 3.2-1B as the base model because it offered the best balance between capability and efficiency for my hardware constraints. Rather than using massive frontier models that would require cloud deployment, I applied LoRA (Low-Rank Adaptation) fine-tuning with rank 32. This architectural choice allowed me to adapt the model to my specific task while training only a small portion of the total parameters, making the entire workflow feasible on my local hardware without cloud costs. Furthermore, the base model is already pretrained, and I wanted to retain the majority of its weights to preserve learned patterns and representations that would be costly and impractical for me to recreate. Pretraining truly unlocks enormous initial value.
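For readers who want to see what a rank-32 LoRA setup looks like in code, here is a minimal sketch using the Hugging Face peft library. The rank comes from my project; the alpha, dropout, and target modules shown here are assumed illustrative values, not my exact Tinker or Unsloth configurations.

```python
# Minimal sketch: attach rank-32 LoRA adapters to Llama 3.2-1B so that only a
# small fraction of parameters are trained (illustrative configuration).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_cfg = LoraConfig(
    r=32,                                   # LoRA rank used in the v2 project
    lora_alpha=64,                          # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],    # adapt attention projections only (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # confirms only a small fraction of weights train
```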
My v2 project demonstrated the practical impact of these architectural decisions. Building on my v1 proof of concept with 101 examples, I expanded the dataset roughly 4.5 times to 460 hand-annotated training examples. This larger dataset enabled robust statistical comparison between two fine-tuning frameworks (Thinking Machines Lab’s Tinker API and Unsloth AI’s Unsloth framework) while maintaining the same architectural principles. The results validated the eCornell lesson that starting simpler and adding complexity incrementally produces better outcomes: I achieved 100% JSON validity and F1 scores of 0.603 and 0.612 across the two frameworks.
The baseline model without fine-tuning achieved only 5.2% JSON validity, proving that architectural optimization through LoRA was essential, not optional. This stark contrast demonstrated that the right architectural choices, even with modest compute and memory resources, can unlock capabilities that brute force scaling cannot match. The project repositories document the complete optimization journey:
v2 Project: GitHub | Hugging Face
v1 Project (context): GitHub | Hugging Face
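For readers curious how the headline metrics above can be computed, here is one plausible sketch of a JSON-validity rate and a set-based F1 over extracted fields such as company names. My actual evaluation scripts live in the repositories linked above; this simplified version is only illustrative.

```python
# Illustrative metrics: fraction of outputs that parse as JSON, and a
# set-based F1 between predicted and hand-annotated field values.
import json

def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)

def set_f1(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over two sets of values."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(json_validity_rate(['{"relevance_score": 8}', 'not json']))   # -> 0.5
print(set_f1({"NVIDIA", "Google"}, {"NVIDIA", "Meta"}))             # -> 0.5
```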
Looking Ahead
The art of neural network architecture lies in finding the sweet spot where model capacity aligns with data characteristics, business requirements, and hardware realities. Just as Lutz Finger compared data science to making the perfect espresso, success comes from finding the right balance: enough complexity to capture important patterns, but not so much that we lose sight of our objectives or create systems we cannot deploy practically.
The past year of industry developments demonstrates that optimization is not a one-time achievement but an ongoing process. From hardware innovations like Google’s Ironwood TPUs to architectural breakthroughs like DeepSeek-V3, the field continues to find new ways to achieve more with less. My own journey with the news enhancer project validated these principles at a practical level: thoughtful architectural choices, combined with efficient training methods and realistic hardware constraints, can produce production-ready systems without requiring massive computational resources. In the next article, I will explore the mathematics behind neural network learning through backpropagation and gradient optimization, building on the architectural foundations established here.
This article is Part 1 of a multi-part series on my learning reflections about neural network optimization from the “Expanding AI Power and Value Through Neural Networks” module of Cornell University’s eCornell “Designing and Building AI Solutions” program.

