The Evolution Road of SRE

**The Road of Evolution**

The Manual Era

In the early days, our frontend architecture was built on Layer 4 load balancing. Static resources were cached with Varnish and Squid, while dynamic requests were handled by a LAMP stack. At that time there were very few machines, minimal processes, and no clear separation between application operations and system maintenance. The team was small, and each person was responsible for the network, the servers, and the services running on them. Most work was done by hand, and there was no formal operations system in place. Many startups today still run on this kind of architecture.

Cloud Infrastructure

As the business evolved, so did our architecture. Especially with the rise of mobile traffic, the access layer expanded beyond just web resources to include numerous API services. Backend languages diversified, with Java, Python, and C++ being introduced based on service requirements. The entire system gradually shifted toward microservices. As the architecture changed, the underlying infrastructure also transformed. A major shift occurred around mid-2014 when all business operations moved to the cloud, as shown below.

One of the main benefits of moving to the cloud was the abstraction of the underlying host and network. This allowed the cloud platform to encapsulate tasks like host creation and network policy changes into a unified interface. Maintenance became more streamlined. At this stage, the SRE team was formed, and responsibilities were divided. The cloud computing team focused on hosts, networks, and systems, while SRE teams worked closely with service teams to optimize the environment and handle business-related issues.
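To make the shift concrete, here is a minimal Python sketch of the kind of unified interface such a cloud platform exposes; the class and method names are assumptions for illustration, not the platform's actual API.

```python
# Hypothetical sketch of a unified cloud interface that hides host
# provisioning and network changes behind two calls. Names are illustrative.
from dataclasses import dataclass


@dataclass
class HostSpec:
    service: str      # owning service, e.g. "order-api"
    cpu_cores: int
    memory_gb: int
    image: str        # base system image name


class CloudPlatform:
    """Single entry point that encapsulates host creation and network policy."""

    def create_host(self, spec: HostSpec) -> str:
        # A real platform would call its IaaS scheduler; here we only return
        # a fake host ID to illustrate the call shape.
        host_id = f"{spec.service}-host-001"
        print(f"provisioning {spec.cpu_cores}C/{spec.memory_gb}G host for {spec.service}")
        return host_id

    def apply_network_policy(self, host_id: str, allow_ports: list[int]) -> None:
        # Encapsulates what used to be manual switch and firewall changes.
        print(f"opening ports {allow_ports} on {host_id}")


if __name__ == "__main__":
    cloud = CloudPlatform()
    hid = cloud.create_host(HostSpec("order-api", cpu_cores=8, memory_gb=16, image="centos7-base"))
    cloud.apply_network_policy(hid, allow_ports=[80, 443])
```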

Problem & Solution

Next, we will discuss the challenges we faced during the construction of our cloud infrastructure and how we addressed them.

As shown in the figure, one of the key issues was resource isolation. Because VMs on the same host shared CPU and network interfaces, heavy traffic from test workloads could saturate the host's bandwidth and degrade the other VMs, which led to outages of critical applications. To solve this, we implemented per-VM network quotas and separated clusters according to business characteristics: offline services were placed in dedicated clusters, while online services were split into smaller groups by importance.
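A small sketch of those two mitigations, assuming a simple VM model; the quota value, field names, and cluster names are illustrative, not the real configuration.

```python
# Illustrative sketch (not the real scheduler) of a per-VM network quota
# and of routing VMs into separate clusters by business characteristics.
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    business_type: str   # "online" or "offline"
    importance: str      # e.g. "critical", "normal"
    egress_mbps: int     # current measured egress traffic


VM_EGRESS_QUOTA_MBPS = 300   # assumed per-VM cap; the real value is not given in the text


def pick_cluster(vm: VM) -> str:
    """Offline workloads get dedicated clusters; online ones are split by importance."""
    if vm.business_type == "offline":
        return "offline-cluster"
    return f"online-{vm.importance}-cluster"


def over_quota(vm: VM) -> bool:
    """True if the VM exceeds its bandwidth quota and should be rate-limited."""
    return vm.egress_mbps > VM_EGRESS_QUOTA_MBPS


if __name__ == "__main__":
    test_vm = VM("perf-test-01", "offline", "normal", egress_mbps=950)
    print(pick_cluster(test_vm))   # offline-cluster
    print(over_quota(test_vm))     # True -> throttle instead of starving neighbours
```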

Another challenge was VM fragmentation. Initially, large clusters made it difficult to isolate failures. However, after service decomposition, hundreds of services required smaller, more isolated clusters. We optimized the distribution of VMs across hosts, ensuring no more than two VMs of the same service per host. This significantly reduced the risk of service outages during peak times.
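The anti-affinity rule can be expressed as a simple placement check; this is a minimal sketch, assuming the scheduler tracks which services already run on each host.

```python
# Minimal sketch of the "no more than two VMs of the same service per host"
# rule. Function and data names are assumptions for illustration.
from collections import Counter


def place_vm(service: str, hosts: dict[str, list[str]], max_per_host: int = 2) -> str | None:
    """Return a host that can accept another VM of `service`, or None.

    `hosts` maps host name -> list of services of the VMs already on it.
    """
    for host, vm_services in hosts.items():
        if Counter(vm_services)[service] < max_per_host:
            return host
    return None  # no compliant host: scheduling fails rather than over-packing


if __name__ == "__main__":
    fleet = {
        "host-a": ["order-api", "order-api"],   # already at the limit
        "host-b": ["order-api", "search"],
        "host-c": [],
    }
    print(place_vm("order-api", fleet))  # host-b
```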

We also worked with the cloud team to improve VM scheduling, raising the scheduling success rate to over 90%.

Cloud Infrastructure

The diagram above shows our cloud infrastructure network. Traffic enters via BGP links and is routed through high-speed lines connecting multiple data centers. These connections have been tested and proven stable under heavy loads, such as during food delivery or group buying events. Our network is highly redundant, with each node having backup devices to ensure uninterrupted traffic flow. Custom components like MGW and NAT provide flexible traffic control.

Meituan itself is one of the largest users of Meituan Cloud. The benefits it receives include strong API support, customized resource isolation, fiber connections between multiple data centers, and higher resource utilization.

Operation and Maintenance Automation

With the rapid growth of orders and machines, automation became essential for efficient operations.

During the automation journey, we developed our own methodology. We simplified complex tasks by leveraging the cloud platform, which abstracted most of the management. We also standardized simple processes, creating naming conventions, system environments, and operational procedures. Once standards were in place, we automated workflows, reducing manual effort and improving efficiency.
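As a toy example of 'standardize, then automate': once host names follow a fixed convention, tools can parse them instead of relying on people. The convention used below (service-idc-index) is an assumption for illustration, not the actual naming rule.

```python
# Validate host names against an assumed naming convention and extract fields.
import re

HOSTNAME_PATTERN = re.compile(r"^(?P<service>[a-z][a-z0-9-]*)-(?P<idc>[a-z]{2})(?P<index>\d{2})$")


def parse_hostname(hostname: str) -> dict[str, str]:
    """Reject non-conforming names; return the parsed fields otherwise."""
    match = HOSTNAME_PATTERN.match(hostname)
    if match is None:
        raise ValueError(f"hostname {hostname!r} violates the naming convention")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_hostname("order-api-bj01"))   # {'service': 'order-api', 'idc': 'bj', 'index': '01'}
```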

This is the service tree, which maps cloud hosts, services, and service leaders. It allows us to visualize relationships and integrate peripheral systems like configuration management and monitoring platforms. We’ve also added cost tracking, enabling users to see costs per business unit easily.
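A minimal sketch of such a service-tree node, assuming each node records its leader, its directly attached host cost, and its children so that per-business-unit cost can be rolled up; field names and figures are illustrative.

```python
# Toy service-tree node with recursive cost rollup for per-BU billing views.
from dataclasses import dataclass, field


@dataclass
class ServiceNode:
    name: str
    leader: str = ""
    host_monthly_cost: float = 0.0          # cost of hosts attached directly to this node
    children: list["ServiceNode"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this node plus everything beneath it."""
        return self.host_monthly_cost + sum(c.total_cost() for c in self.children)


if __name__ == "__main__":
    tree = ServiceNode("takeaway", leader="alice", children=[
        ServiceNode("takeaway.order-api", leader="bob", host_monthly_cost=1200.0),
        ServiceNode("takeaway.search", leader="carol", host_monthly_cost=800.0),
    ])
    print(tree.total_cost())   # 2000.0 for the whole business unit
```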

The process of creating a machine begins with a technician initiating a request. The process center retrieves service information from the service tree, sends it to the operation platform, which then creates the machine on the cloud. Once created, the machine is added to the service node, and the configuration management system initializes the environment. Monitoring is automatically set up, followed by service deployment and registration with the service governance platform. This entire process is now fully automated.
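The whole flow can be sketched as a chain of stub functions; the real systems (process center, operation platform, configuration management, service governance) are only represented by name here, and every identifier is an assumption.

```python
# End-to-end sketch of the automated machine-creation flow described above.
def lookup_service(service_tree: dict, service: str) -> dict:
    return service_tree[service]                      # process center reads the service tree


def create_cloud_host(service_info: dict) -> str:
    return f"{service_info['name']}-host-001"         # operation platform calls the cloud API


def attach_to_service_node(host: str, service: str) -> None:
    print(f"{host} attached under service node {service}")


def init_environment(host: str) -> None:
    print(f"configuration management initialized {host}")   # base packages, users, agents


def setup_monitoring(host: str) -> None:
    print(f"monitoring enabled for {host}")


def deploy_and_register(host: str, service: str) -> None:
    print(f"{service} deployed on {host} and registered with service governance")


def provision(service_tree: dict, service: str) -> None:
    """Run the whole pipeline with no manual steps in between."""
    info = lookup_service(service_tree, service)
    host = create_cloud_host(info)
    attach_to_service_node(host, service)
    init_environment(host)
    setup_monitoring(host)
    deploy_and_register(host, service)


if __name__ == "__main__":
    provision({"order-api": {"name": "order-api", "leader": "alice"}}, "order-api")
```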

Data Operations

Today, the company has grown significantly. The cloud platform handles everything from the access layer to the infrastructure. SRE teams focus on stability and monitoring. Our new goal is data-driven operations. Fault management provides a centralized view of incidents, including timing, causes, and responsible parties. We classify faults by severity and ensure every issue is resolved.
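A minimal sketch of what a fault record on such a platform might carry; the severity thresholds below are invented for illustration and are not the real classification rules.

```python
# Toy fault record: timing, cause, owner, and a severity derived from impact.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Fault:
    title: str
    started_at: datetime
    resolved_at: datetime
    cause: str            # e.g. "change", "capacity", "hidden danger"
    owner: str            # responsible team or person
    impact_minutes: int   # duration of user-visible impact

    @property
    def severity(self) -> str:
        # Assumed thresholds, purely for illustration.
        if self.impact_minutes >= 30:
            return "P1"
        if self.impact_minutes >= 10:
            return "P2"
        return "P3"


if __name__ == "__main__":
    f = Fault("order placement errors", datetime(2016, 5, 1, 12, 0),
              datetime(2016, 5, 1, 12, 25), cause="change", owner="order-api", impact_minutes=25)
    print(f.severity)   # P2
```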

Through the fault platform, we analyze trends and identify recurring issues. We also perform data mining to understand service behavior, such as traffic patterns and response times.
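As a small example of that kind of mining, the sketch below computes a p99 response time from a list of request latencies; the sample data is made up.

```python
# Compute a 99th-percentile latency so capacity and SLA discussions use data.
import statistics


def p99_latency_ms(latencies_ms: list[float]) -> float:
    """99th-percentile latency; needs at least two samples."""
    return statistics.quantiles(latencies_ms, n=100)[98]


if __name__ == "__main__":
    sample = [12, 15, 14, 18, 22, 35, 16, 13, 240, 17] * 20   # mostly fast, a few slow outliers
    print(round(p99_latency_ms(sample), 1))
```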

Responsibility & Mission

Our mission has evolved from reactive firefighting to proactive stability. Through data operations, we drive improvements and maintain service reliability. Operation and maintenance involves both enhancing service quality through experience and meeting business needs with technical solutions.

Business Stability Guarantee Practices

Failure Causes & Examples

First, let's look at failure causes and examples. One common cause is change. Meituan makes over 300 releases a day, including network and service component updates. For example, a small Nginx configuration change once caused a critical outage because the order of directives was wrong; without a proper gray release, the problem went unnoticed until it hit the live service.
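A hedged sketch of the gray-release discipline this incident argued for: push the change to a small batch, verify, and only then continue; batch sizes and function names are illustrative.

```python
# Staged (gray) rollout: apply a change in growing batches and stop at the
# first unhealthy host instead of pushing everywhere at once.
def apply_change(host: str) -> None:
    print(f"config pushed to {host}")


def health_check(host: str) -> bool:
    # In practice: hit the service's health endpoint and compare error rates
    # before and after the change; here we simply pretend every host is healthy.
    return True


def gray_release(hosts: list[str], batch_sizes: tuple[int, ...] = (1, 5)) -> bool:
    """Roll out in growing batches; abort at the first unhealthy host."""
    remaining = list(hosts)
    batches = list(batch_sizes) + [len(remaining)]   # final batch covers everything left
    for size in batches:
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            apply_change(host)
            if not health_check(host):
                print(f"rollout stopped at {host}; rolling back")
                return False
        if not remaining:
            break
    return True


if __name__ == "__main__":
    gray_release([f"nginx-{i:02d}" for i in range(12)])
```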

Another cause is capacity. Large events or traffic spikes can overwhelm services. In one incident, the backend couldn’t handle the increased load, causing a complete service failure. We now focus on accurate capacity planning and monitoring to avoid such situations.
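A toy capacity check in the spirit of that lesson: compare expected event traffic with measured per-instance capacity and add headroom before the event; all numbers are made up.

```python
# Estimate how many instances an upcoming traffic peak requires.
import math


def instances_needed(expected_peak_qps: float, qps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve the peak with a safety headroom (default 30%)."""
    return math.ceil(expected_peak_qps * (1 + headroom) / qps_per_instance)


if __name__ == "__main__":
    current = 20
    needed = instances_needed(expected_peak_qps=50_000, qps_per_instance=1_500)
    print(f"need {needed} instances, scale out by {max(0, needed - current)}")
```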

Hidden dangers, such as design flaws or unmonitored components, are harder to detect. We conduct full-link exercises and use SLA metrics to improve stability. Each failure is analyzed to prevent recurrence and share lessons across teams.

Experience Summary

Prevention is key—standard SOPs, capacity assessments, and pressure testing help avoid problems. When incidents occur, fast containment is crucial. Clear communication and structured feedback reduce confusion and speed up resolution. Post-incident analysis ensures that mistakes aren’t repeated.

User Experience Optimization

From the user’s perspective, traffic flows from the public internet to the private cloud and finally to the server. Public network issues like hijacking and multi-operator routing are mitigated through BGP and HTTP DNS. We’re also adopting newer protocols like SPDY and HTTP/2 to improve performance and reduce latency.
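A minimal sketch of the HTTPDNS idea, assuming a hypothetical resolution endpoint: the client asks it over HTTPS for the domain's IPs instead of trusting the operator's local DNS, which defeats simple DNS hijacking. The endpoint URL and response format below are not a real provider's API.

```python
# Resolve a domain through an assumed HTTPDNS endpoint instead of local DNS.
import json
import urllib.request


def httpdns_resolve(domain: str, endpoint: str = "https://httpdns.example.com/resolve") -> list[str]:
    """Ask the (hypothetical) HTTPDNS endpoint for the A records of `domain`."""
    with urllib.request.urlopen(f"{endpoint}?host={domain}", timeout=2) as resp:
        return json.load(resp).get("ips", [])


def pick_ip(domain: str) -> str:
    ips = httpdns_resolve(domain)
    if not ips:
        raise RuntimeError(f"no HTTPDNS answer for {domain}, fall back to system DNS")
    return ips[0]   # a real client would also weigh operator and latency data


if __name__ == "__main__":
    # Would print an IP if the hypothetical endpoint existed.
    print(pick_ip("api.example.com"))
```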

Future Prospects

Technically, we’re focusing on automation and moving toward intelligent systems. AI is being explored for automatic fault detection and decision-making. Productization of our tools is underway, aiming to offer services to external users. Finally, we aim to create mature technical frameworks that can be used by others, helping advance the industry as a whole.

The cloud is the future. It abstracts low-level complexity, allowing us to focus on innovation and value creation.
