Edge Cases and New Developments
This post considers the monitoring approaches, and modifications to standard practice, suggested by a number of specific cases, each driven by advances in technology that are increasingly adopted within target applications.
The following are briefly addressed:
- Single Page Applications
- HTTP/2 based applications
- Applications incorporating server push and/or service workers
- Internet of Things (IoT)
- Microservices based applications
- Bots and other non-design performance constraints
- Performance APIs
- Single Page Applications [SPAs]:
The issue that SPAs present from a monitoring perspective is that they minimise the network interactions between the user device and origin infrastructure. The historic ‘page based’ download paradigm (and the dependency upon it) is broken. This presents a particular problem for traditional synthetic monitoring, given that it is based on capturing and analysing just that ‘over the wire’ interaction.
User:site interactions (termed ‘soft’ navigations) and data delivery are independent of the standard W3C Navigation Timing milestones (DOM ready, onload, etc). Many interactions occur entirely within the client device.
Although nuances exist depending upon the detailed design of particular applications, unless a particular user interaction (eg a button click) is reliably associated with a network request, the primary (but still important) value of synthetic monitoring in this use case reduces to the monitoring of availability. That key metric is unavailable to ‘passive’ (site visitor based) tools, for obvious reasons.
Any interactions that are directly linked to a network call can (in most synthetic monitoring scripting utilities) be specifically ‘bracketed’ and their response patterns examined. Otherwise, monitoring best practice requires the use of RUM (Real User Monitoring) instrumentation.
Unfortunately, not all RUM tools are created equal. If it is likely that you will be squaring up to SPAs, it is important to check (and validate) that your RUM tool (whether an extension of your APM tooling or otherwise) offers the granularity of recording required. If not (and assuming that the APM vendor cannot provide realistic comfort regarding their roadmap), an alternative may be to integrate a ‘standalone’ RUM product such as SOASTA mPulse, which has been specifically modified to meet the SPA use case. Details are given in this blog post: http://www.soasta.com/blog/angularjs-real-user-monitoring-single-page-applications/. This is an evolving situation of direct business relevance; others will undoubtedly follow.
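Where no SPA-aware RUM product is in place, a rough-and-ready soft navigation timer can be hand-rolled by wrapping the History API. The following is a minimal sketch only; the function and property names are illustrative, and a production implementation would also need to handle popstate, replaceState and hash-based routing:

```javascript
// Sketch: time SPA 'soft navigations' by wrapping history.pushState.
// All names here are illustrative, not part of any RUM product.
var softNav = { route: null, startMs: 0, lastDurationMs: null };

function instrumentSoftNavigations(history, now) {
  var origPushState = history.pushState;
  history.pushState = function (state, title, url) {
    softNav.route = url;
    softNav.startMs = now(); // soft navigation begins at route change
    return origPushState.apply(history, arguments);
  };
}

// Called by application code once the new view has fully rendered.
function softNavigationComplete(now) {
  softNav.lastDurationMs = now() - softNav.startMs;
  return softNav.lastDurationMs; // eg beacon this to the RUM backend
}
```

In a browser this would be wired up as `instrumentSoftNavigations(window.history, function () { return performance.now(); })`, with the application calling `softNavigationComplete` from its render-complete hook.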
- HTTP/2 based applications
The HTTP/2 specification, an evolutionary formalisation of Google's SPDY protocol, has been available for some time. Adoption is now reported to be rapid, and this rate is expected to increase further with progressive server support.
HTTP/2 provides a number of transport efficiencies relative to HTTP/1.x. These include request multiplexing (ie effective handling of multiple requests over the same connection), compression of header components, and other design interventions to avoid repeated retransmission of header metadata.
These changes deliver considerable advantages, particularly in sites with large numbers of element requests and in delivery to users in high-latency conditions.
These changes also make it necessary to revisit interventions formerly regarded as ‘best practice’ for optimised performance.
Domain sharding, formerly adopted to increase the effective number of parallel connections, becomes an anti-pattern. Sharding carries the risk of request failure and consequent retransmission, particularly in conditions of limited connectivity (mobile delivery in rural locations, countries with poor internet infrastructure). It also undermines the inherent HTTP/2 efficiencies of header compression, transmission optimisation and resource prioritisation possible over a connection to a single domain. Sharding does not present monitoring or analysis challenges per se, but its removal can form part of optimisation recommendations.
Content concatenation, most prominently used in image spriting but also applied to other content, has the objective of reducing the number of roundtrip requests. It has, however, the disadvantage of forcing a refresh if any part of the grouped content changes. Revised best practice, driven by the transmission efficiencies inherent in HTTP/2, directs reduced individual object payloads and a more granular management of content at individual element level. This, for example, supports more appropriate cache settings, having regard to the specifics of particular objects.
It should be noted that, with the exception of increased adoption of ‘server push’ interactions (see following section), these changes involve modification of FEO interpretation and recommendation, rather than impacting monitoring practice.
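Whether HTTP/2 is actually being negotiated for a page's resources can be verified from RUM data via the Resource Timing nextHopProtocol attribute (a Level 2 addition whose browser support, at the time of writing, is incomplete). A minimal sketch:

```javascript
// Sketch: classify page resources by negotiated protocol, to confirm
// HTTP/2 ('h2') is in use before revisiting sharding/concatenation.
// Relies on the Resource Timing 'nextHopProtocol' attribute; entries
// lacking it are counted as 'unknown'.
function protocolBreakdown(entries) {
  var counts = {};
  entries.forEach(function (e) {
    var proto = e.nextHopProtocol || 'unknown';
    counts[proto] = (counts[proto] || 0) + 1;
  });
  return counts; // eg { h2: 74, 'http/1.1': 6 }
}

// In a browser:
// protocolBreakdown(performance.getEntriesByType('resource'));
```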
- Server Push content, Service Worker interactions:
Persistent duration server:client interactions are a core facet of modern applications. In certain cases this is driven by the nature of the application itself (eg delivery of live-update betting odds). Other drivers are the leverage of HTTP/2 efficiencies (see section above) and the development of ‘network independent’ mobile WebApps.
WebApps effectively co-exist with native mobile applications. They incorporate local device caching and store-and-forward capabilities that enable usage in unstable or ‘network off’ conditions. WebApps utilise Service Workers, which replace the more limited former AppCache-based approaches. Service Workers are event driven, and permit access to server push interactions.
Service Worker capability offers many attractive advantages in the creation of more business-centric mobile device based interactions.
The challenge to historic monitoring practice is that long duration connections distort the recorded page load endpoint in traditional synthetic monitoring tools. This must be identified and corrected for; otherwise incorrect performance inferences may be drawn, particularly in terms of recorded response variation.
Fortunately, identification of server push interactions is usually obvious from inspection of standard ‘waterfall’ charts. Correcting for it in an elegant manner is more difficult. Ignoring the validation approaches incorporated within certain synthetic monitoring product scripting (as they are not widely adopted), arguably the best approach to synthetic testing is simply to identify and then filter out the server push calls. Although somewhat of a blunt instrument, it does get around the problem.
A more elegant approach, based on RUM analysis, emerges with the availability of the new sendBeacon API. Its syntax is navigator.sendBeacon(url, data); the call returns true if the browser has queued the data for asynchronous transmission.
Use of this call enables granular instrumentation of application code to specifically record the response to particular events. It should be noted that this API is newly released (at the time of writing), so reliable cross-browser support is unlikely to be complete. However, I understand that the leading-edge performance team at the Financial Times in London report effective use of this API in production conditions (P Hamann, personal communication).
Example code instrumentation using the sendBeacon API
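By way of a sketch (the endpoint URL, metric names and timing value below are illustrative assumptions, not part of any particular RUM product):

```javascript
// Sketch: beacon a custom timing measurement to an analytics endpoint
// as the user leaves the page. '/rum-beacon' and the metric names are
// hypothetical.

// Build the analytics payload from a set of timing marks (pure, testable).
function buildTimingPayload(page, timings) {
  return JSON.stringify({ page: page, timings: timings, sentAt: Date.now() });
}

// Queue the payload with sendBeacon; the browser transmits it
// asynchronously, even while the page is unloading.
function reportTimings(url, page, timings) {
  var payload = buildTimingPayload(page, timings);
  if (typeof navigator !== 'undefined' && navigator.sendBeacon) {
    // Returns true if the user agent accepted the data for transfer.
    return navigator.sendBeacon(url, payload);
  }
  return false; // no sendBeacon support: fall back or record the gap
}

// Example: report a server-push-related timing on page unload.
if (typeof window !== 'undefined') {
  window.addEventListener('unload', function () {
    reportTimings('/rum-beacon', location.pathname, {
      oddsUpdateMs: 137 // illustrative: push frame received -> DOM updated
    });
  });
}
```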
- Internet of Things:
Sensor-based applications, collectively known as the ‘Internet of Things’ (IoT), have been slowly evolving since Coca-Cola introduced the self-reordering dispensing machine many decades ago. The area is now in danger of becoming one of the most hyped in new technology. Certainly, actual companies are now trading (in the UK, Hive and Nest to name but two). Regardless of whether the app is controlling your heating thermostats, reordering the contents of your fridge, or (in the future) ordering up a driverless car for your commute to work, it is important to be able to understand and validate performance in objective terms.
Although the companies offering the wet string (Cisco etc) are ready and waiting, full evolution will be accelerated by mass adoption of intermediation platform technologies such as Jasper and Apple HomeKit.
IoT application control panel & code (HIVE Home)
It may be asked “why do I want to check the performance of my smoke alarm remotely anyway?”. Well, clearly, the value of performance monitoring lies in the relevance of whatever is being tested. As such, monitoring may be more appropriate to Vendors of such services rather than individual domestic customers, but, again, it depends – and IoT system performance at individual level may become relevant to the smooth running of all our lives in the future.
Monitoring will probably be based around assurance of the successful completion of core control transactions, based on a (predominantly) mobile application interface. The core use-case is therefore more akin to availability monitoring. Depending upon how such IoT systems are architected, the effect of high traffic load on performance may become relevant.
IoT networks are fairly closed systems, but core mobile app monitoring principles apply. As closed systems, they are not accessible to scheduled synthetic external testing.
Two approaches are possible, either:
- Instrument the mobile application used to control the system [standard SDK-based techniques], timing specific response end-points (eg ‘temperature set’ flag or whatever).
- If available, monitor via the API – APM tooling can often provide webservices-based gateways. These can be custom developed today, and will undoubtedly become available off-the-shelf for the major providers as the market develops.
The former obviously only monitors the performance of the control application, not the IoT devices themselves (which are assumed to operate correctly if appropriate control application responses are received).
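By way of illustration, the gateway-based approach reduces to timing a control transaction against a target threshold. The response fields (‘temperatureSet’ etc) and the SLO value are hypothetical; a real vendor API will differ:

```javascript
// Sketch: evaluate a single IoT control transaction (eg 'set thermostat')
// for availability and response time. Field names are hypothetical.
function evaluateControlTransaction(response, startedAtMs, nowMs, sloMs) {
  var elapsed = nowMs - startedAtMs;
  return {
    // Available only if the gateway confirmed the control action.
    available: response.status === 'ok' && response.temperatureSet === true,
    responseMs: elapsed,
    withinSlo: elapsed <= sloMs
  };
}
```

A scheduled checker would call the vendor gateway, pass the parsed response and timestamps to this function, and alert on `available === false` or repeated SLO breaches.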
- Microservices based applications
Development and/or extension of applications based on microservices – ie discrete, containerised functional elements – is becoming very popular. Arguably this is being driven by the popularity and availability of open source platforms, particularly Docker, though alternatives exist.
The pros and cons of microservices adoption are outside both my core experience and the scope of this material. Suffice it to say that, despite the ownership advantages of highly granular functional elements from an agile development perspective, microservices-based applications introduce an additional layer of integration and management complexity from an operations perspective.
Performance understanding should be approached from both a back-end and an external perspective.
From the point of view of the containers themselves, the major APM Vendors are increasingly including specific support for these technologies. Currently, given the market dynamics, specific support starts with Docker, although other platforms are/will be explicitly supported moving forward. The extent of visibility offered by the various APM tools does vary, although it is likely that your choice will be made by other considerations (and therefore you will ‘get what you get’ with respect to container performance visibility).
Microservices container monitoring (RUXIT APM)
In terms of external monitoring practice, the core change is not the high-level approach or tooling mix, but the importance of ensuring that poor performance of core services and/or module interactions is targeted, such that interventions can be made rapidly. This is particularly apposite given that the nature of test and pre-production environments makes it likely that some issues will only emerge post release to production, when the application comes under real-world load and interaction complexity.
The take home message should therefore be to monitor with underlying services in mind. This implies a ‘subpage’ monitoring approach. Greater granularity of monitoring can be achieved, by, for example, (with synthetic tooling) scripting transactions to bracket key in-page interactions (reported as step timings), and (with RUM) using additional timing markers / beacons to achieve the same effect.
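With RUM, the User Timing API provides exactly this bracketing. A minimal sketch, assuming a hypothetical ‘search-suggest’ in-page interaction backed by a microservice:

```javascript
// Sketch: bracket a key in-page interaction with User Timing marks so
// its duration can be reported as a 'subpage' metric. The interaction
// name 'search-suggest' is hypothetical.
function timeInteraction(name, fn) {
  performance.mark(name + ':start');
  var result = fn(); // the interaction being measured
  performance.mark(name + ':end');
  performance.measure(name, name + ':start', name + ':end');
  var entries = performance.getEntriesByName(name, 'measure');
  // Duration of the most recent measurement, in milliseconds.
  return entries[entries.length - 1].duration;
}
```

RUM products that support User Timing will collect these measures automatically; otherwise the returned duration can be sent on via sendBeacon.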
Issues not specifically detected by these techniques should reveal themselves by changes to traffic flows/user behaviours. These are best detected by cultivating an approach to Web Analytics reports that is both absolute and intuitive.
- Bots:
Although not strictly associated with FEO, a few words on Bots are relevant to the consideration of third party related performance constraints. Bots (or web robots) are automated site interactions. Although the majority (ranging from SEO crawlers to synthetic testing and price aggregation) are not malicious in intent, they represent a huge proportion of total site traffic – over two thirds for typical retail sites, for example.
Global car rental site – UK traffic by unique IP per hour – total vs customer traffic
This represents a significant economic cost, both in maintaining otherwise unnecessary infrastructure and in reducing the effective capacity headroom of the site (and therefore its ability to benefit from peaks in ‘real customer’ traffic). The benefits of intervention can be extremely significant. One of our retail clients was able to reduce its IBM licence requirement for WebSphere Commerce Suite from 5 to 3 cores, generating a substantial ongoing annual cost saving.
Unfortunately, Bot effects are not confined to generating excess traffic. So-called “bad” bots have a range of negative effects, from inadvertent inefficiencies due to poorly written code, through spam and malicious hacks, to high volume Distributed Denial of Service (DDoS) attacks. According to the Anti-Phishing Working Group (report, Q1–Q3 2015), over one third of all computers worldwide are infected with malware.
Various approaches to mitigation are possible. These include:
- ID blocking
- CAPTCHA (largely regarded as compromised)
- Multi parameter traffic ‘fingerprinting’ and
- Bot ‘honeytraps’
From the point of view of performance practice/FEO, Bots are an indirect consideration, but one that should be borne in mind when making recommendations regarding overall performance enhancement. Seek to quantify the extent of the problem and identify potential interventions. The appropriate intervention is likely to depend upon the economics of the threat and existing vendor relationships. Options range from specialist targeted solutions (eg Distil Networks), through security extensions to firewalls (eg Barracuda) and added options from CDN or other performance vendors (eg Akamai, Radware), to focussed integrated traffic management solutions (eg Intechnica Alipta).
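As a toy illustration only of the multi-parameter ‘fingerprinting’ approach (real products combine far more signals – TLS fingerprints, behavioural analysis, JavaScript challenges – and the weights and thresholds here are invented):

```javascript
// Toy sketch of multi-parameter traffic 'fingerprinting': score a
// request on several weak signals. Weights and threshold are invented
// for illustration; production systems use far richer signal sets.
function botScore(req) {
  var score = 0;
  if (!req.userAgent || /bot|crawl|spider/i.test(req.userAgent)) score += 2;
  if (req.requestsPerMinute > 120) score += 2; // sustained high request rate
  if (!req.acceptsCookies) score += 1;          // no cookie support
  if (!req.executedJs) score += 1;              // no JS beacon received
  return score;                                 // eg treat >= 3 as suspect
}
```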
- Performance APIs
A few words on the use of performance-centric APIs. These include the ‘traditional’ navigation flags – DOM ready, page unload, etc – that have been around for a few years now, together with more leading-edge developments such as sendBeacon (already referenced with respect to monitoring service worker/push content), the Event.timeStamp property, and others.
The only negative to introducing timing APIs in this series of posts is that it moves us across the ‘dev’ spectrum and away from an introduction for day-to-day operations. Failure to exploit them will, however, prove a serious limitation to effective performance practice going forward, so awareness, and if possible adoption, is increasingly important.
Network timing attributes are collected for each page resource. Navigation and resource timers are delivered as standard in most modern browsers for components of ‘traditional’ page downloads. User interaction and more client centric design (eg SPAs), however, require event based timers.
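The standard Resource Timing attributes allow per-resource network phases to be derived directly. A sketch (the entry can be any object carrying the standard attributes, eg one returned by performance.getEntriesByType('resource')):

```javascript
// Sketch: derive per-resource network phases (in ms) from a Resource
// Timing entry, using the standard W3C attribute names.
function resourcePhases(entry) {
  return {
    dns:      entry.domainLookupEnd - entry.domainLookupStart,
    connect:  entry.connectEnd - entry.connectStart,
    ttfb:     entry.responseStart - entry.requestStart, // time to first byte
    download: entry.responseEnd - entry.responseStart,
    total:    entry.responseEnd - entry.startTime
  };
}
```

Note that cross-origin resources report zeroed detail attributes unless the server supplies a Timing-Allow-Origin header.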
Basic custom timers introduce a timing mark() at defined points within the page/code. Your RUM tooling should ideally support these, as they enable read-across between different tooling – eg aligning against visual endpoints (‘user experience’ endpoints/browser fill times) in synthetic measurements. Not all RUM products support them, however, so this is an important aspect to understand when making a product purchase decision.
Other APIs have been developed to support, for example, image rendering, and frame timing – important if seeking to ensure smooth jank-free user experiences.
Browser support cannot be taken for granted, particularly with the newer APIs. It is important to be aware of which browsers support a particular method, as you will be ‘blind’ with respect to the performance of users with non-supported technologies. In certain cases (eg Opera in Russia, or Safari for media-centric user bases), this can introduce serious distortions to results interpretation.
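One practical mitigation is to feature-detect and record what each visiting browser supports, so that gaps in the RUM data can be attributed rather than misread. A sketch (the flag names are illustrative):

```javascript
// Sketch: detect which timing APIs the current environment supports,
// so unsupported-browser traffic can be identified in RUM data rather
// than silently skewing results. Flag names are illustrative.
function timingSupport(global) {
  var p = global.performance;
  return {
    navigationTiming: !!(p && p.timing),
    resourceTiming:   !!(p && typeof p.getEntriesByType === 'function'),
    userTiming:       !!(p && typeof p.mark === 'function'),
    sendBeacon:       !!(global.navigator && global.navigator.sendBeacon)
  };
}

// In a browser: include timingSupport(window) in each beacon payload.
```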
A useful primer for Web Performance Timing APIs, which also contains links to further specialist information in this evolving area, can be found here: bit.ly/perf-timing-primer.
Browser support for resource timing API – May 2016 [caniuse.com]