Figure 1 – When many subscribers consume the same event, they often have similar data enrichment requirements. Each subscriber therefore effectively runs the same queries against the underlying data stores. As a consequence, the load on the underlying infrastructure increases sharply as more and more subscribers request the same information. This can effectively cripple (parts of) a service inventory.
Figure 2 – Event handling is delegated to dedicated consumers which each have different data enrichment requirements.
Figure 3 – Instead of all consumers subscribing to the same event (1), service consumers with similar data enrichment requirements use that event indirectly. Every consumer with specific data enrichment functionality is delegated this specific task and has a requirement to republish the enriched event (1a, 1b, 1ac). Any consumer which needs similar data must use the republished event instead of the initial event.
Figure 4 – Overall Service Autonomy is increased because less access to underlying resources is necessary, which prevents those resources from becoming a bottleneck. Discoverability is impacted as more and more similar-looking events are generated. Reliability of the solution can be improved by introducing asynchronous queuing to overcome any subscriber availability issues. The application of this pattern is a form of logic centralization.

TelCo is a telecommunications company which started 15 years ago in a very competitive market. At the time, TelCo was the second operator to offer mobile communications in that country. The company thrived initially because it had chosen a strategic SOA approach with what was, for that time, a state-of-the-art application of an event-driven architecture.
The big benefit of this approach was that new functionality could be plugged in easily and non-intrusively, because events could be used by many new subscribers without affecting the code and functionality of existing subscribers.
As a basis for their efforts they had built a proprietary event management system with lots of specialized functionality. TelCo focuses on mobile communications, and because this market is very competitive, the lead time for implementing and releasing changes was crucial to TelCo. Whoever had the best proposition would have the biggest customer growth, and every operator in the country would try to offer an even better proposition by releasing new "key" features in quick succession. TelCo found itself in a race to deliver more and more innovations quickly, both to follow market developments and to try to stay in the lead.
After a short time (within the first two years of existence) there were so many new advancements, with releases sometimes more than twice per week, that the system began to slow down. More and more events would bring the system into a state where handling a single event would occupy the system significantly, and sometimes some of the underlying resources would be significantly overloaded.
As a result of an internal analysis, the system architects concluded the entire system was overloaded and there was a problem in the event processing. As such the architects decided to make three significant changes:
- split the system into partitions to spread the load
- enhance the capacity of certain underlying resources and back-ends
- replace their bespoke event management system with a COTS product from a commercial vendor, who had indicated that their system could easily process more than 50 times TelCo's current event volume
After a couple of months the system was partitioned and the bespoke event management system was replaced by the vendor product. The overall situation had improved, but not significantly. Upgrading the back-end systems had helped a bit as well, but the architects could foresee that similar problems would occur less than 8 months down the road if no drastic changes were made.
The system architects asked the designers and developers to build in extra logging, which slowed down the system even more but allowed the architects to dig deeper into the problematic system areas. What the system architects found is that a single event would cause an avalanche of back-end calls to the same critical back-ends supporting the organization.
As the company was still growing, they realized that the next problematic system load would be reached a lot sooner than initially estimated, perhaps even within the next 4-6 months. Either radical changes would be made immediately or TelCo would be forced to slow down its marketing and sales, something the company could not afford.
The system architects made the following fundamental change: event consumers were split into three logical categories:
- (1) event consumers that require a specific event sequence
- (2a) event consumers that can live with out-of-order execution of received events and need guaranteed event delivery
- (2b) event consumers that can live with out-of-order execution of received events and do not need guaranteed event delivery (i.e. events can be skipped and nothing serious would break)
An analysis was done, and fortunately fewer than 15% of the event consumers fell into the first category. This meant that the majority of event consumers would not need the special infrastructure required for maintaining the order of events. The other two categories (85%) were segmented as follows: 3% did not need delivery assurance, while the remaining 82% did.
Figure 5 – Classification of events after the analysis by architects.
This was actually good news to the architects as this meant that the really problematic areas (15% of the events) could be isolated in a relatively small area.
The first category (1) was split off within the infrastructure, and a new infrastructure segment was introduced for the remaining event processing. Because of this split, the effects of any problematic event load would remain somewhat isolated from the rest of the infrastructure.
Because only 15% of the events required significant extra hardware, a lot of the hardware purchased recently could be used to build the new segment of infrastructure.
Of the remaining 85%, only a very small subset would not need guaranteed event delivery and the amount of event processing in that area (2b) was so small that it made no sense to further split up the infrastructure for 2a/2b.
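The classification and the resulting infrastructure split can be illustrated with a minimal sketch. It assumes two yes/no questions per consumer (ordering and guaranteed delivery); the names `ConsumerCategory`, `classify`, `segment_for`, and the segment labels are hypothetical and do not come from TelCo's system.

```python
from enum import Enum


class ConsumerCategory(Enum):
    ORDERED = "1"                  # requires a specific event sequence
    UNORDERED_GUARANTEED = "2a"    # out-of-order is fine, delivery must be guaranteed
    UNORDERED_BEST_EFFORT = "2b"   # out-of-order is fine, events may be skipped


# Hypothetical segment labels: category 1 is isolated, while 2a and 2b
# share one segment because 2b is too small to justify its own.
SEGMENT_BY_CATEGORY = {
    ConsumerCategory.ORDERED: "ordered-segment",
    ConsumerCategory.UNORDERED_GUARANTEED: "general-segment",
    ConsumerCategory.UNORDERED_BEST_EFFORT: "general-segment",
}


def classify(needs_ordering: bool, needs_guaranteed_delivery: bool) -> ConsumerCategory:
    """Map a consumer's requirements onto one of the three categories."""
    if needs_ordering:
        return ConsumerCategory.ORDERED
    if needs_guaranteed_delivery:
        return ConsumerCategory.UNORDERED_GUARANTEED
    return ConsumerCategory.UNORDERED_BEST_EFFORT


def segment_for(needs_ordering: bool, needs_guaranteed_delivery: bool) -> str:
    """Return the infrastructure segment a consumer would be deployed to."""
    return SEGMENT_BY_CATEGORY[classify(needs_ordering, needs_guaranteed_delivery)]
```

Keeping the category separate from the segment mapping means a later decision to give 2b its own segment would only change the lookup table, not the classification rules.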
The second significant change the architects made is that they analyzed all the existing and planned event consumers to see whether they could find similar processing requirements in multiple event consumers. Once the analysis was complete, they found out that in fact many of the event consumers of the same event would use the same services and resources to enrich the data.
It was decided that for the events classified as (2a/2b) it made sense to appoint dedicated event consumers which would be solely responsible for enriching the event data. This meant a thorough redesign of the event architecture, but the benefit would definitely outweigh the cost.
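A dedicated enriching consumer of this kind might look like the following sketch. Everything here is an assumption made for illustration: the in-memory `EventBus`, the `CustomerStore` back-end stand-in, and the topic names are hypothetical and are not part of TelCo's event management product. The point is that the back-end is queried once per event, however many consumers use the enriched data.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class EventBus:
    """Tiny in-memory publish/subscribe bus, for illustration only."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in list(self._handlers[topic]):
            handler(event)


class CustomerStore:
    """Stand-in for a critical back-end; counts how often it is queried."""

    def __init__(self) -> None:
        self.lookups = 0

    def get_customer(self, customer_id: str) -> Dict[str, str]:
        self.lookups += 1
        return {"customer_id": customer_id, "plan": "prepaid"}


def register_enricher(bus: EventBus, store: CustomerStore,
                      source_topic: str, enriched_topic: str) -> None:
    """Subscribe one enricher that republishes the event with customer data."""

    def enrich(event: Event) -> None:
        customer = store.get_customer(event["customer_id"])
        bus.publish(enriched_topic, {**event, "customer": customer})

    bus.subscribe(source_topic, enrich)


bus = EventBus()
store = CustomerStore()
register_enricher(bus, store, "call.completed", "call.completed.enriched")

# Three downstream consumers use the republished event, not the initial one.
received = []
for name in ("billing", "loyalty", "fraud"):
    bus.subscribe(
        "call.completed.enriched",
        lambda e, n=name: received.append((n, e["customer"]["plan"])),
    )

bus.publish("call.completed", {"customer_id": "c-42", "duration_s": 180})
```

In this sketch the three downstream consumers together trigger a single back-end lookup, where subscribing each of them to the initial event would have triggered three.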
Because the redesign naturally increased reuse, the build part of the redesigned event architecture was delivered 3 weeks ahead of schedule. Because the number of software assets had decreased as well, the testing effort dropped drastically and the test results came in 1 week early.
During load testing it was observed that, despite all the structural changes, a few events would still cause problematic load. Fortunately the architects found that a selective caching strategy would resolve most of the remaining problem areas. Three event consumers were given the ability to cache retrieved event data in a central cache for 20 minutes after retrieval. Because many events revolve around the same subscribers and accounts within a relatively short period of time, this approach solved most of the problems.
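The caching behaviour described above can be sketched as a small time-to-live cache. This is an illustration under stated assumptions, not TelCo's bespoke caching framework: `TTLCache`, `get_or_load`, and `load_account` are hypothetical names, and the injectable clock exists only so that expiry can be demonstrated without waiting. Entries expire a fixed time after retrieval and are not refreshed when they are read.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values for a fixed time after they were retrieved."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve from the cache while the entry is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise hit the back-end and remember when the data was retrieved.
        value = loader(key)
        self._entries[key] = (now, value)
        return value


# Demonstration with a fake clock (seconds) and a counting back-end stub.
fake_now = [0.0]
calls = []


def load_account(account_id):
    calls.append(account_id)
    return {"account_id": account_id}


cache = TTLCache(ttl_seconds=20 * 60, clock=lambda: fake_now[0])
cache.get_or_load("acc-1", load_account)   # miss: loads from the back-end
fake_now[0] = 19 * 60
cache.get_or_load("acc-1", load_account)   # hit: still within 20 minutes
fake_now[0] = 21 * 60
cache.get_or_load("acc-1", load_account)   # expired: loads again
```

A fixed expiry measured from retrieval bounds how stale the enriched data can become, at the cost of a reload even for entries that are read frequently.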
Because the enterprise service bus that TelCo had purchased shortly after start-up supports message delivery assurances, no further development was required for guaranteed message delivery and in-order message delivery.
The caching would reuse the bespoke caching framework built by TelCo several years earlier, but given the benefits gained from the delivery assurance framework that came out-of-the-box, the architects decided that the caching framework would be re-evaluated in the near future.
The TelCo architects were happy with the new approach and changed the reference architecture to accommodate the new decisions, so that subsequent projects would be aware of the new event handling infrastructure.
Additionally, in the early phases of the software lifecycle, criteria were defined and introduced into the governance documentation to require classifying events and event consumers up-front into one of the defined categories.