Service Provider Blogs

One night in Beijing

Cloudflare Blog

As the old saying goes, good things come in pairs, 好事成双! The month of May marks a double celebration in China for our customers, partners and Cloudflare.

First and Foremost

A Beijing Customer Appreciation Cocktail was held in the heart of Beijing at Yintai Centre Xiu Rooftop Garden Bar on 10 May 2019, an RSVP event graced by our supportive group of partners and customers. We have been blessed with almost 10 years of strong growth at Cloudflare - sharing our belief in providing access to Internet security and performance to customers of all sizes and industries. This success has been the result of collaboration between our developers and our product team, represented at the event by our special guest Jen Taylor, our Global Head of Product, and by our business leaders Xavier Cai, Head of our China business, Aliza Knox, Head of our APAC business, and James Ball, our Head of Solutions Engineering for APAC - and, most importantly, of the trust and faith that our partners, such as Baidu, and our customers have placed in us.

Double Happiness, 双喜

In the same week, we embarked on another exciting journey in China with our grand office opening at WeWork. The Beijing team covers functions from Customer Development to Solutions Engineering and Customer Success, and is led by Xavier, Head of our China business. The team has doubled in size since it was established last year.

We continue to invest in China to grow our customer base and, importantly, our methods for supporting our customers here as well. Those of us who came from different parts of the world are also looking to learn from the wisdom and experience of our customers in this market. To that end, we look forward to many more years of openness, trust, and mutual success.

Thank you to all the customers and partners who took the time to join our Beijing cocktail reception - we are grateful for your strong support and lively conversation! (感谢所有花时间来参加我们这次北京鸡尾酒会的客户和合作伙伴，谢谢各位对此活动的大力支持与热烈交流！)

One more thing... new Speed Page

Cloudflare Blog

Congratulations on making it through Speed Week. In the last week, Cloudflare has: described how our global network speeds up the Internet, launched an HTTP/2 prioritisation model that will improve web experiences on all browsers, launched an image resizing service which will deliver the optimal image to every device, optimized live video delivery, detailed how to stream progressive images so that they render twice as fast - using the flexibility of our new HTTP/2 prioritisation model - and finally, prototyped a new over-the-wire format for JavaScript that could improve application start-up performance, especially on mobile devices. As a bonus, we're also rolling out one more new feature: "TCP Turbo", which automatically chooses the TCP settings to further accelerate your website.

As a company, we want to help every one of our customers improve web experiences. The growth of Cloudflare, along with the increase in features, has often made simple questions difficult to answer:

- How fast is my website?
- How should I be thinking about performance features?
- How much faster would the site be if I were to enable a particular feature?

This post will describe the exciting changes we have made to the Speed Page on the Cloudflare dashboard to give our customers a much clearer understanding of how their websites are performing and how they can be made even faster. The new Speed Page consists of:

- A visual comparison of your website loading on Cloudflare, with caching enabled, compared to connecting directly to the origin.
- The measured improvement expected if any performance feature is enabled.
- A report describing how fast your website is on desktop and mobile.

We want to simplify the complexity of making web experiences fast and give our customers control. Take a look - we hope you like it.

Why do fast web experiences matter?

Customer experience: No one likes slow service. Imagine if you go to a restaurant and the service is slow, especially when you arrive; you are not likely to go back or recommend it to your friends. It turns out the web works in the same way, and Internet customers are even more demanding. As many as 79% of customers who are "dissatisfied" with a website's performance are less likely to buy from that site again.

Engagement and revenue: There are many studies explaining how speed affects customer engagement, bounce rates and revenue.

Reputation: There is also brand reputation to consider, as customers associate an online experience with the brand. One study found that for 66% of those sampled, website performance influences their impression of the company.

Diversity: Mobile traffic has grown to be larger than its desktop counterpart over the last few years. Mobile customers have become increasingly demanding and expect seamless Internet access regardless of location. Mobile brings a new set of challenges, including the diversity of device specifications. When testing, be aware that the average mobile device is significantly less capable than the top-of-the-range models. For example, there can be orders-of-magnitude disparity in the time different mobile devices take to run JavaScript. Another challenge is the variance in mobile performance, as customers move from a strong, high-quality office network to mobile networks of different speeds (3G/5G) and quality within the same browsing session.

New Speed Page

There is compelling evidence that a faster web experience is important for anyone online.
Most of the major studies involve the largest tech companies, who have whole teams dedicated to measuring and improving web experiences for their own services. At Cloudflare we are on a mission to help build a better and faster Internet for everyone - not just the select few.

Delivering fast web experiences is not a simple matter. That much is clear. Knowing what to send and when requires a deep understanding of every layer of the stack, from TCP tuning, protocol-level prioritisation and content delivery formats through to the intricate mechanics of browser rendering. You will also need a global network that strives to be within 10 ms of every Internet user. The intrinsic value of such a network should be clear to everyone. Cloudflare has this network, but it also offers many additional performance features.

With the Speed Page redesign, we are emphasizing the performance benefits of using Cloudflare and the additional improvements possible from our features.

The de facto standard for measuring website performance has been WebPageTest. Having its creator in-house at Cloudflare encouraged us to use it as the basis for website performance measurement. So, what is the easiest way to understand how a web page loads? A list of statistics does not paint a full picture of the actual user experience. One of the cool features of WebPageTest is that it can generate a filmstrip of screen snapshots taken during a web page load, enabling us to quantify how a page loads, visually. This view makes it significantly easier to determine how long the page is blank for, and how long it takes for the most important content to render. Being able to look at the results in this way provides the ability to empathise with the user.

How fast on Cloudflare?

After moving your website to Cloudflare, you may have asked: how fast did this decision make my website? Well, now we provide the answer:

Comparison of website performance using Cloudflare.

As well as the increase in speed, we provide filmstrips of before and after, so that it is easy to compare and understand how differently a user will experience the website. If our tests are unable to reach your origin and you are already set up on Cloudflare, we will test with development mode enabled, which disables caching and minification.

Site performance statistics

How can we measure the user experience of a website?

Traditionally, page load was the important metric. Page load is a technical measurement used by browser vendors that has no bearing on the presentation or usability of a page. The metric reports on how long it takes not only to load the important content but also all of the 3rd party content (social network widgets, advertising, tracking scripts etc.). A user may very well not see anything until after all the page content has loaded, or they may be able to interact with a page immediately, while content continues to load.

A user will not decide whether a page is fast by a single measure or moment. A user will perceive how fast a website is from a combination of factors:

- when they see any response
- when they see the content they expect
- when they can interact with the page
- when they can perform the task they intended

Experience has shown that if you focus on one measure, it will likely be to the detriment of the others.

Importance of visual response

If an impatient user navigates to your site and sees no content, or no valuable content, for several seconds, they are likely to get frustrated and leave.

The paint timing spec defines a set of paint metrics - when content appears on a page - to measure the key moments in how a user perceives performance. First Contentful Paint (FCP) is the time when the browser first renders any DOM content. First Meaningful Paint (FMP) is the point in time when the page's "primary" content appears on the screen. This metric should relate to what the user has come to the site to see and is designed as the point in time when the largest visible layout change happens.

Speed Index attempts to quantify the value of the filmstrip rather than using a single paint timing. The Speed Index measures the rate at which content is displayed - essentially the area above the curve. In the chart below, from our progressive image feature, you can see that reaching 80% visual completeness happens much earlier for the parallelized (red) load than for the regular (blue) one.

Importance of interactivity

The same impatient user is now happy that the content they want to see has appeared. They will still become frustrated if they are unable to interact with the site. Time to Interactive is the time it takes for content to be rendered and for the page to be ready to receive input from the user. Technically this is defined as when the browser's main processing thread has been idle for several seconds after First Meaningful Paint.

The Speed Tab displays these key metrics for mobile and desktop.
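As an aside, the paint metrics described above can be observed directly in any page with the browser's standard Performance APIs, independently of the Speed Page. The sketch below is illustrative only (it is not a Cloudflare API); run it in a page's own script or in the DevTools console.

```javascript
// Log the paint metrics discussed above using the standard PerformanceObserver API.
const paintObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.name is 'first-paint' or 'first-contentful-paint'
    console.log(`${entry.name}: ${entry.startTime.toFixed(1)} ms`);
  }
});
paintObserver.observe({ type: 'paint', buffered: true });

// Navigation timing gives the traditional "page load" figures for comparison.
const [nav] = performance.getEntriesByType('navigation');
if (nav) {
  console.log(`DOMContentLoaded: ${nav.domContentLoadedEventEnd.toFixed(1)} ms`);
  console.log(`load event: ${nav.loadEventEnd.toFixed(1)} ms`);
}
```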
How much faster on Cloudflare?

The Cloudflare Dashboard provides a list of performance features which can, admittedly, be both confusing and daunting. What would be the benefit of turning on Rocket Loader, and on which performance metrics will it have the most impact? If you upgrade to Pro, what will be the value of the enhanced HTTP/2 prioritisation? The optimization section answers these questions.

Tests are run with each performance feature turned on and off. The values for the tests for the appropriate performance metrics are displayed, along with the improvement. You can enable or upgrade the feature from this view. Here are a few examples:

If Rocket Loader were enabled for this website, the render-blocking JavaScript would be deferred, causing first paint time to drop from 1.25s to 0.81s - an improvement of 32% on desktop.

Image-heavy sites do not perform well on slow mobile connections. If you enable Mirage, your customers on 3G connections would see meaningful content 1s sooner - an improvement of 29.4%.

So how about our new features? We tested the enhanced HTTP/2 prioritisation feature on an Edge browser on desktop and saw meaningful content display 2s sooner - an improvement of 64%.

This is a more interesting result, taken from the blog example used to illustrate progressive image streaming. At first glance the improvement of 29% in Speed Index is good. The filmstrip comparison shows a more significant difference. In this case the page with no images shown is already 43% visually complete for both scenarios after 1.5s. At 2.5s the difference is 77% compared to 50%.

This is a great example of how metrics do not tell the full story. They cannot completely replace viewing the page loading flow and understanding what is important for your site.

How to try

This is our first iteration of the new Speed Page and we are eager to get your feedback. We will be rolling this out to beta customers who are interested in seeing how their sites perform. To be added to the queue for activation of the new Speed Page, please click on the banner on the overview page, or click on the banner on the existing Speed Page.

EU election season and securing online democracy

Cloudflare Blog

It’s election season in Europe, as European Parliament seats are contested across the European Union by national political parties. With approximately 400 million people eligible to vote, this is one of the biggest democratic exercises in the world - second only to India - and it takes place once every five years. Over the course of four days, 23-26 May 2019, each of the 28 EU countries will elect a different number of Members of the European Parliament (“MEPs”) roughly mapped to population size and based on a proportional system. The 751 newly elected MEPs (a number which includes the UK’s allocation for the time being) will take their seats in July. These elections are not only important because the European Parliament plays a large role in the EU democratic system, being a co-legislator alongside the European Council, but as the French President Emmanuel Macron has described, these European elections will be decisive for the future of the continent.Election security: an EU political priorityPolitical focus on the potential cybersecurity threat to the EU elections has been extremely high, and various EU institutions and agencies have been engaged in a long campaign to drive awareness among EU Member States and to help political parties prepare. Last month for example, more than 80 representatives from the European Parliament, EU Member States, the European Commission and the European Agency for Network and Information Security (ENISA) gathered for a table-top exercise to test the EU's response to potential incidents. The objective of the exercise was to test the efficacy of EU Member States’ practices and crisis plans, to acquire an overview of the level of resilience across the EU, and to identify potential gaps and adequate mitigation measures. Earlier this year, ENISA published a paper on EU-wide election security which described how as a result of the large attack surface that is inherent to elections, the risks do not only concern government election systems but also extend to individual candidates and individual political campaigns. Examples of attack vectors that affect election processes can include spear phishing, data theft, online disinformation, malware, and DDoS attacks. ENISA went on to propose that election systems, processes and infrastructures be classified as critical infrastructure, and that a legal obligation be put in place requiring political organisations to deploy a high level of cybersecurity.Last September, in his State of the Union address, European Commission President Juncker announced a package of initiatives aimed at ensuring that the EU elections are organised in a free, fair and secure manner. EU Member States subsequently set up a national cooperation network of relevant authorities – such as electoral, cybersecurity, data protection and law enforcement authorities – and appointed contact points to take part in a European cooperation network for elections. In July 2018, the Cooperation Group set up under the EU NIS Directive (composed of Member States, the European Commission and ENISA) issued a detailed report, "Compendium on Cyber Security of Election Technology". The report outlined how election processes typically extend over a long life cycle, consisting of several phases, and the presentation layer is as important as the correct vote count and protection of the interface where citizens learn of the election results. 
Estonia - a country that is known to be a digital leader when it comes to eGovernment services - is currently the only EU country that offers its citizens the option to cast their ballot online. However, even electoral systems that rely exclusively on paper voting typically take advantage of digital tools and services in compiling voter rolls, candidate registration or result tabulation and communication. The report described various election/cyber incidents witnessed at EU Member State level and the methods used. As the electoral systems vary greatly across the EU, the NIS Cooperation Group ultimately recommended that tools, procedures, technologies and protection measures should follow a “pick and mix” approach which can include DDoS protection, network flow analysis and monitoring, and use of a CDN. Cloudflare provides all these services and more, helping to prevent the defacement of public-facing websites and Denial of Service attacks, and ensuring the high availability and performance of web pages which need to be capable of withstanding a significant traffic load at peak times. Cloudflare’s election security experienceCloudflare’s CTO John Graham-Cumming recently spoke at a session in Brussels which explored Europe’s cyber-readiness for the EU elections. He outlined that while sophisticated cyber attacks are on the rise, humans can often be the weakest link. Strong password protection, two factor authentication and a keen eye for phishing scams can go a long way in thwarting attackers’ attempts to penetrate campaign and voting web properties. John also described Cloudflare’s experience in running the Athenian Project, which provides free enterprise-level services to government election and voter registration websites. Source: PoliticoCloudflare has protected most of the major U.S Presidential campaign websites from cyberattacks, including the Trump/Pence campaign website, the website for the campaign of Senator Bernie Sanders, and websites for 14 of the 15 leading candidates from the two  political parties. We have also protected election websites in countries like Peru, Ecuador and, most recently, North Macedonia. Is Europe cyber-ready?Thanks to the high profile awareness campaign across the EU, Europeans have had time to prepare and to look for solutions according to their needs. Election interference is certainly not a new phenomenon, however, the scale of the current threat is unprecedented and clever disinformation campaigns are also now in play. Experts have recently identified techniques such as spear phishing and DDoS attacks as particular threats to watch for, and the European Commission has been monitoring industry progress under the Code of Practice on Disinformation which has encouraged platforms such as Google, Twitter and Facebook to take action to fight against malicious bots and fake accounts.What is clear is that this can only ever be a coordinated effort, with both governments and industry working together to ensure a robust response to any threats to the democratic process. For its part, Cloudflare is protecting a number of political group websites across the EU and we have been seeing Layer 4 and Layer 7 DDoS attacks, as well as pen testing and firewall probing attempts. Incidents this month have included attacks against Swedish, French, Spanish and UK web properties, with particularly high activity across the board around 8th May. 
As the elections approach, we can expect the volume/spread of attacks to increase.Further information about the European elections can be found here - and if you are based in Europe, don’t forget to vote!

Cloudflare architecture and how BPF eats the world

Cloudflare Blog

Recently at Netdev 0x13, the Conference on Linux Networking in Prague, I gave a short talk titled "Linux at Cloudflare". The talk ended up being mostly about BPF. It seems that, no matter the question, BPF is the answer. Here is a transcript of a slightly adjusted version of that talk.

At Cloudflare we run Linux on our servers. We operate two categories of data centers: large "Core" data centers, processing logs, analyzing attacks and computing analytics, and the "Edge" server fleet, delivering customer content from 180 locations across the world.

In this talk, we will focus on the "Edge" servers. It's here that we use the newest Linux features, optimize for performance and care deeply about DoS resilience.

Our edge service is special due to our network configuration - we make extensive use of anycast routing. Anycast means that the same set of IP addresses is announced by all our data centers.

This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Second, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.

Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers - our software stack is uniform across the edge servers. All software pieces are running on all the servers.

In principle, every machine can handle every task - and we run many diverse and demanding tasks. We have a full HTTP stack, the magical Cloudflare Workers, two sets of DNS servers - authoritative and resolver - and many other publicly facing applications like Spectrum and Warp.

Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of processing.

Let me walk you through the early stages of inbound packet processing:

(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many (at least 16) machines. This is used as a rudimentary load balancing technique.

(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric DoS mitigations, dropping packets belonging to very large layer 3 attacks.

(3) Then, still in XDP, we perform layer 4 load balancing. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.

(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.

(5) Finally, packets are received by an application. For example, HTTP connections are handled by a "protocol" server, responsible for performing TLS encryption and processing the HTTP, HTTP/2 and QUIC protocols.

It's in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:

- DoS handling
- Load balancing
- Socket dispatch

Let's discuss DoS handling in more detail.
As mentioned earlier, the first step after ECMP routing is Linux's XDP stack where, among other things, we run DoS mitigations.

Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read on about our adventures:

- L4Drop: XDP DDoS Mitigations
- xdpcap: XDP Packet Capture
- XDP based DoS mitigation talk by Arthur Fabre
- XDP in practice: integrating XDP into our DDoS mitigation pipeline (PDF)

During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that Facebook engineer Julia Kartseva had the same issues. In February this problem was addressed with the introduction of the bpf_spin_lock helper.

While our modern volumetric DoS defenses are done in the XDP layer, we still rely on iptables for layer 7 application mitigations. Here, higher-level firewall features are useful: connlimit, hashlimits and ipsets. We also use the xt_bpf iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:

- Lessons from defending the indefensible (PPT)
- Introducing the BPF tools

After XDP and iptables, we have one final kernel-side DoS defense layer.

Consider a situation when our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application UDP socket. This might overflow the socket, causing packet loss. This is problematic - both good and bad packets will be dropped indiscriminately. For applications like DNS it's catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn't affect the traffic to other server IP addresses.

Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IPs, and running that number of UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it - using the SO_ATTACH_BPF socket option. We talked about running eBPF on network sockets in past blog posts:

- eBPF, Sockets, Hop Distance and manually writing eBPF assembly
- SOCKMAP - TCP splicing of the future

The mentioned eBPF filter rate-limits the packets. It keeps the state - packet counts - in an eBPF map. We can be sure that a single flooded IP won't affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:

- eBPF can't count?!

I guess running eBPF on a UDP socket is not a common thing to do.

Apart from DoS mitigation, in XDP we also run a layer 4 load balancer. This is a new project, and we haven't talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.

The problem is relatively simple - our code needs to look up the "socket" kernel structure for a 5-tuple extracted from a packet. This is generally easy - there is a bpf_sk_lookup helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.

After the DoS and load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack.
Here we do a socket dispatch - for example, packets going to port 53 are passed onto a socket belonging to our DNS server.

We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.

Convincing Linux to route packets correctly is relatively easy with the "AnyIP" trick. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn't support this out of the box. You can call bind() either on a specific IP address or on all IPs (with 0.0.0.0).

In order to fix this, we developed a custom kernel patch which adds a SO_BINDTOPREFIX socket option. As the name suggests, it allows us to call bind() on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.

Then we run into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see our old war story blog), so we had to find another way. After some experiments we learned to utilize an obscure iptables module - TPROXY - for this purpose. Read about it here:

- Abusing Linux's firewall: the hack that allowed us to build Spectrum

This setup is working, but we don't like the extra firewall rules. We are working on solving this problem correctly - actually extending the socket dispatch logic. You guessed it - we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.

Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:

- SOCKMAP - TCP splicing of the future

This technique has great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.

Similarly, the new TCP-BPF aka BPF_SOCK_OPS hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.

Some Linux features didn't age well and we need to work around them. For example, we are hitting the limitations of networking metrics. Don't get me wrong - the networking metrics are awesome, but sadly they are not granular enough. Things like TcpExtListenDrops and TcpExtListenOverflows are reported as global counters, while we need to know them on a per-application basis.

Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called "ebpf_exporter" to facilitate this. Read on:

- Introducing ebpf_exporter
- https://github.com/cloudflare/ebpf_exporter

With "ebpf_exporter" we can generate all manner of detailed metrics. It is very powerful and has saved us on many occasions.

In this talk we discussed 6 layers of BPFs running on our edge servers:

- Volumetric DoS mitigations running on XDP eBPF
- iptables xt_bpf cBPF for application-layer attacks
- SO_ATTACH_BPF for rate limits on UDP sockets
- Load balancer, running on XDP
- eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements
- "ebpf_exporter" for granular metrics

And we're just getting started! Soon we will be doing more with eBPF-based socket dispatch, eBPF running on the Linux TC (Traffic Control) layer and more integration with cgroup eBPF hooks.
Our SRE team is also maintaining an ever-growing list of BCC scripts useful for debugging.

It feels like Linux has stopped developing new APIs, and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages: it's easier and safer to upgrade an eBPF program than to recompile a kernel module. Some things, like TCP-BPF exposing high-volume performance tracing data, would probably be impossible without eBPF.

Some say "software is eating the world"; I would say: "BPF is eating the software".

Join Cloudflare & Yandex at our Moscow meetup! Присоединяйтесь к митапу в Москве!

Cloudflare Blog

Photo by Serge Kutuzov / UnsplashAre you based in Moscow? Cloudflare is partnering with Yandex to produce a meetup this month in Yandex's Moscow headquarters.  We would love to invite you to join us to learn about the newest in the Internet industry. You'll join Cloudflare's users, stakeholders from the tech community, and Engineers and Product Managers from both Cloudflare and Yandex.Cloudflare Moscow MeetupTuesday, May 30, 2019: 18:00 - 22:00 Location: Yandex - Ulitsa L'va Tolstogo, 16, Moskva, Russia, 119021Talks will include "Performance and scalability at Cloudflare”, "Security at Yandex Cloud", and "Edge computing".Speakers will include Evgeny Sidorov, Information Security Engineer at Yandex, Ivan Babrou, Performance Engineer at Cloudflare, Alex Cruz Farmer, Product Manager for Firewall at Cloudflare, and Olga Skobeleva, Solutions Engineer at Cloudflare.Agenda:18:00 - 19:00 - Registration and welcome cocktail19:00 - 19:10 - Cloudflare overview19:10 - 19:40 - Performance and scalability at Cloudflare19:40 - 20:10 - Security at Yandex Cloud20:10 - 20:40 - Cloudflare security solutions and industry security trends20:40 - 21:10 - Edge computingQ&AThe talks will be followed by food, drinks, and networking.View Event Details & Register Here »We'll hope to meet you soon.Разработчики, присоединяйтесь к Cloudflare и Яндексу на нашей предстоящей встрече в Москве!Cloudflare сотрудничает с Яндексом, чтобы организовать мероприятие в этом месяце в штаб-квартире Яндекса. Мы приглашаем вас присоединиться к встрече посвященной новейшим достижениям в интернет-индустрии. На мероприятии соберутся клиенты Cloudflare, профессионалы из технического сообщества, инженеры из Cloudflare и Яндекса.Вторник, 30 мая: 18:00 - 22:00Место встречи: Яндекс, улица Льва Толстого, 16, Москва, Россия, 119021Доклады будут включать себя такие темы как «Решения безопасности Cloudflare и тренды в области безопасности», «Безопасность в Yandex Cloud», “Производительность и масштабируемость в Cloudflare и «Edge computing» от докладчиков из Cloudflare и Яндекса.Среди докладчиков будут Евгений Сидоров, Заместитель руководителя группы безопасности сервисов в Яндексе, Иван Бобров, Инженер по производительности в Cloudflare, Алекс Круз Фармер, Менеджер продукта Firewall в Cloudflare, и Ольга Скобелева, Инженер по внедрению в Cloudflare.Программа:18:00 - 19:00 - Регистрация, напитки и общение19:00 - 19:10 - Обзор Cloudflare19:10 - 19:40 - Производительность и масштабируемость в Cloudflare19:40 - 20:10 - Решения для обеспечения безопасности в Яндексе20:10 - 20:40 - Решения безопасности Cloudflare и тренды в области безопасности20:40 - 21:10 - Примеры Serverless-решений по безопасностиQ&AВслед за презентациям последует общение, еда и напитки.Посмотреть детали события и зарегистрироваться можно здесь »Ждем встречи с вами!

Faster script loading with BinaryAST?

Cloudflare Blog

JavaScript cold starts

The performance of applications on the web platform is becoming increasingly bottlenecked by startup (load) time. Large amounts of JavaScript code are required to create the rich web experiences that we've become used to. When we look at the total size of JavaScript requested on mobile devices from HTTPArchive, we see that an average page loads 350KB of JavaScript, while 10% of pages go over the 1MB threshold. The rise of more complex applications can push these numbers even higher.

While caching helps, popular websites regularly release new code, which makes cold start (first load) times particularly important. With browsers moving to separate caches for different domains to prevent cross-site leaks, the importance of cold starts is growing even for popular subresources served from CDNs, as they can no longer be safely shared.

Usually, when talking about cold start performance, the primary factor considered is raw download speed. However, on modern interactive pages one of the other big contributors to cold starts is JavaScript parsing time. This might seem surprising at first, but makes sense - before starting to execute the code, the engine has to first parse the fetched JavaScript, make sure it doesn't contain any syntax errors and then compile it to the initial bytecode. As networks become faster, parsing and compilation of JavaScript could become the dominant factor.

Device capability (CPU or memory performance) is the most important factor in the variance of JavaScript parsing times and, correspondingly, the time to application start. A 1MB JavaScript file will take on the order of 100 ms to parse on a modern desktop or high-end mobile device, but can take over a second on an average phone (Moto G4).

A more detailed post on the overall cost of parsing, compiling and executing JavaScript shows how JavaScript boot time can vary on different mobile devices. For example, in the case of news.google.com, it can range from 4s on a Pixel 2 to 28s on a low-end device.

While engines continuously improve raw parsing performance - with V8 in particular doubling it over the past year, as well as moving more things off the main thread - parsers still have to do lots of potentially unnecessary work that consumes memory and battery and might delay the processing of the useful resources.

The "BinaryAST" Proposal

This is where BinaryAST comes in. BinaryAST is a new over-the-wire format for JavaScript, proposed and actively developed by Mozilla, that aims to speed up parsing while keeping the semantics of the original JavaScript intact. It does so by using an efficient binary representation for code and data structures, as well as by storing and providing extra information to guide the parser ahead of time.

The name comes from the fact that the format stores the JavaScript source as an AST encoded into a binary file. The specification lives at tc39.github.io/proposal-binary-ast and is being worked on by engineers from Mozilla, Facebook, Bloomberg and Cloudflare.

"Making sure that web applications start quickly is one of the most important, but also one of the most challenging parts of web development. We know that BinaryAST can radically reduce startup time, but we need to collect real-world data to demonstrate its impact. Cloudflare's work on enabling use of BinaryAST with Cloudflare Workers is an important step towards gathering this data at scale." - Till Schneidereit, Senior Engineering Manager, Developer Technologies, Mozilla
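To get a feel for the parse-and-compile cost discussed above on your own device, here is a rough, engine-dependent sketch (not part of the BinaryAST tooling) that uses new Function() as a proxy for parse time. The script URL is a placeholder, and engines may still defer compiling inner function bodies, so treat the output as an illustration rather than a benchmark.

```javascript
// Rough sketch: time how long this device takes to parse + compile a script.
// new Function() parses and compiles the source without executing it.
async function measureParseTime(url) {
  const source = await (await fetch(url)).text();

  const t0 = performance.now();
  new Function(source);        // parse + compile only; the body never runs
  const t1 = performance.now();

  console.log(`Parsed ${(source.length / 1024).toFixed(0)} KB in ${(t1 - t0).toFixed(1)} ms`);
}

// Point it at any same-origin (or CORS-enabled) script; this path is a placeholder.
measureParseTime('/static/app.js');
```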
Parsing JavaScript

For regular JavaScript code to execute in a browser, the source is parsed into an intermediate representation known as an AST that describes the syntactic structure of the code. This representation can then be compiled into bytecode or native machine code for execution.

A simple example of adding two numbers can be represented in an AST as:

Parsing JavaScript is not an easy task; no matter which optimisations you apply, it still requires reading the entire text file character by character, while tracking extra context for syntactic analysis.

The goal of BinaryAST is to reduce the complexity and the amount of work the browser parser has to do overall, by providing additional information and context at the time and place where the parser needs it.

To execute JavaScript delivered as BinaryAST the only steps required are:

Another benefit of BinaryAST is that it makes it possible to parse only the critical code necessary for start-up, completely skipping over the unused bits. This can dramatically improve the initial loading time.

This post will now describe some of the challenges of parsing JavaScript in more detail, explain how the proposed format addresses them, and how we made it possible to run its encoder in Workers.

Hoisting

JavaScript relies on hoisting for all declarations - variables, functions, classes. Hoisting is a property of the language that allows you to declare items after the point where they're syntactically used.

Let's take the following example:

function f() { return g(); }
function g() { return 42; }

Here, when the parser is looking at the body of f, it doesn't know yet what g is referring to - it could be an already existing global function or something declared further down in the same file - so it can't finalise parsing of the original function and start the actual compilation.

BinaryAST fixes this by storing all the scope information and making it available upfront, before the actual expressions.

As shown by the difference between the initial AST and the enhanced AST in a JSON representation:

Lazy parsing

One common technique used by modern engines to improve parsing times is lazy parsing. It utilises the fact that lots of websites include more JavaScript than they actually need, especially for start-up.

Working around this involves a set of heuristics that try to guess when any given function body in the code can be safely skipped by the parser initially and delayed for later.
A common example of such a heuristic is immediately running the full parser for any function that is wrapped in parentheses:

(function(

Such a prefix usually indicates that the following function is going to be an IIFE (immediately-invoked function expression), and so the parser can assume that it will be compiled and executed ASAP, and wouldn't benefit from being skipped over and delayed for later.

(function() { … })();

These heuristics significantly improve the performance of the initial parsing and cold starts, but they're not completely reliable or trivial to implement.

One of the reasons is the same as in the previous section - even with lazy parsing, you still need to read the contents, analyse them and store additional scope information for the declarations.

Another reason is that the JavaScript specification requires reporting any syntax errors immediately during load time, and not when the code is actually executed. A class of these errors, called early errors, covers mistakes like usage of reserved words in invalid contexts, strict mode violations, variable name clashes and more. All of these checks require not only lexing the JavaScript source, but also tracking extra state even during lazy parsing.

Having to do such extra work means you need to be careful about marking functions as lazy too eagerly, especially if they actually end up being executed during the page load. Otherwise you're making cold start costs even worse, as now every function that is erroneously marked as lazy needs to be parsed twice - once by the lazy parser and then again by the full one.

Because BinaryAST is meant to be an output format of tools such as Babel and TypeScript, and bundlers such as Webpack, the browser parser can rely on the JavaScript having already been analysed and verified by the initial parser. This allows it to skip function bodies completely, making lazy parsing essentially free.

It reduces the cost of completely unused code - while including it is still a problem in terms of network bandwidth (don't do this!), at least it's not affecting parsing times anymore. These benefits apply equally to code that is used later in the page lifecycle (for example, invoked in response to user actions) but is not required during startup.

Last but not least, a benefit of this approach is that BinaryAST encodes lazy annotations as part of the format, giving tools and developers direct and full control over the heuristics. For example, a tool targeting the Web platform or a framework CLI can use its domain-specific knowledge to mark some event handlers as lazy or eager depending on the context and the event type.

Avoiding ambiguity in parsing

Using a text format for a programming language is great for readability and debugging, but it's not the most efficient representation for parsing and execution.

For example, parsing low-level types like numbers, booleans and even strings from text requires extra analysis and computation, which is unnecessary when you can just store them as native binary-encoded values in the first place and read them directly on the other side.

Another problem is ambiguity in the grammar itself. It was already an issue in the ES5 world, but could usually be resolved with some extra bookkeeping based on the previously seen tokens.
However, in ES6+ there are productions that can be ambiguous all the way through until they’re parsed completely.For example, a token sequence like:(a, {b: c, d}, [e = 1])...can start either a parenthesized comma expression with nested object and array literals and an assignment:(a, {b: c, d}, [e = 1]); // it was an expressionor a parameter list of an arrow expression function with nested object and array patterns and a default value:(a, {b: c, d}, [e = 1]) => … // it was a parameter listBoth representations are perfectly valid, but have completely different semantics, and you can’t know which one you’re dealing with until you see the final token.To work around this, parsers usually have to either backtrack, which can easily get exponentially slow, or to parse contents into intermediate node types that are capable of holding both expressions and patterns, with following conversion. The latter approach preserves linear performance, but makes the implementation more complicated and requires preserving more state.In the BinaryAST format this issue doesn't exist in the first place because the parser sees the type of each node before it even starts parsing its contents.Cloudflare ImplementationCurrently, the format is still in flux, but the very first version of the client-side implementation was released under a flag in Firefox Nightly several months ago. Keep in mind this is only an initial unoptimised prototype, and there are already several experiments changing the format to provide improvements to both size and parsing performance.On the producer side, the reference implementation lives at github.com/binast/binjs-ref. Our goal was to take this reference implementation and consider how we would deploy it at Cloudflare scale.If you dig into the codebase, you will notice that it currently consists of two parts.One is the encoder itself, which is responsible for taking a parsed AST, annotating it with scope and other relevant information, and writing out the result in one of the currently supported formats. This part is written in Rust and is fully native.Another part is what produces that initial AST - the parser. Interestingly, unlike the encoder, it's implemented in JavaScript.Unfortunately, there is currently no battle-tested native JavaScript parser with an open API, let alone implemented in Rust. There have been a few attempts, but, given the complexity of JavaScript grammar, it’s better to wait a bit and make sure they’re well-tested before incorporating it into the production encoder.On the other hand, over the last few years the JavaScript ecosystem grew to extensively rely on developer tools implemented in JavaScript itself. In particular, this gave a push to rigorous parser development and testing. There are several JavaScript parser implementations that have been proven to work on thousands of real-world projects.With that in mind, it makes sense that the BinaryAST implementation chose to use one of them - in particular, Shift - and integrated it with the Rust encoder, instead of attempting to use a native parser.Connecting Rust and JavaScriptIntegration is where things get interesting.Rust is a native language that can compile to an executable binary, but JavaScript requires a separate engine to be executed. To connect them, we need some way to transfer data between the two without sharing the memory.Initially, the reference implementation generated JavaScript code with an embedded input on the fly, passed it to Node.js and then read the output when the process had finished. 
That code contained a call to the Shift parser with an inlined input string and produced the AST back in JSON format.

This doesn't scale well when parsing lots of JavaScript files, so the first thing we did was to transform the Node.js side into a long-lived daemon. Now Rust could spawn the required Node.js process just once and keep passing inputs into it and getting responses back as individual messages.

Running in the cloud

While the Node.js solution worked fairly well after these optimisations, shipping both a Node.js instance and a native bundle to production requires some effort. It's also potentially risky and requires manual sandboxing of both processes to make sure we don't accidentally start executing malicious code.

On the other hand, the only thing we needed from Node.js was the ability to run the JavaScript parser code. And we already have an isolated JavaScript engine running in the cloud - Cloudflare Workers! By additionally compiling the native Rust encoder to Wasm (which is quite easy with the native toolchain and wasm-bindgen), we can even run both parts of the code in the same process, making cold starts and communication much faster than in the previous model.

Optimising data transfer

The next logical step is to reduce the overhead of data transfer. JSON worked fine for communication between separate processes, but with a single process we should be able to retrieve the required bits directly from the JavaScript-based AST.

To attempt this, first of all, we needed to move away from direct JSON usage to something more generic that would allow us to support various input formats. The Rust ecosystem already has an amazing serialisation framework for that - Serde.

Aside from allowing us to be more flexible in regard to the inputs, rewriting to Serde helped an existing native use case too. Now, instead of parsing JSON into an intermediate representation and then walking through it, all the native typed AST structures can be deserialized directly from the stdout pipe of the Node.js process in a streaming manner. This significantly improved both the CPU usage and memory pressure.

But there is one more thing we can do: instead of serializing and deserializing from an intermediate format (let alone a text format like JSON), we should be able to operate [almost] directly on JavaScript values, saving memory and repetitive work.

How is this possible? wasm-bindgen provides a type called JsValue that stores a handle to an arbitrary value on the JavaScript side. This handle internally contains an index into a predefined array.

Each time a JavaScript value is passed to the Rust side as a result of a function call or a property access, it's stored in this array and an index is sent to Rust. The next time Rust wants to do something with that value, it passes the index back and the JavaScript side retrieves the original value from the array and performs the required operation.

By reusing this mechanism, we could implement a Serde deserializer that requests only the required values from the JS side and immediately converts them to their native representation. It's now open-sourced at https://github.com/cloudflare/serde-wasm-bindgen.
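As an illustration of the handle mechanism described above, here is a simplified JavaScript model of it. This is not wasm-bindgen's actual code - just a sketch of the idea that values stay on the JavaScript side while only integer indices cross the Wasm boundary.

```javascript
// Simplified model of the handle table: JavaScript values never cross into
// Wasm memory; only small integer indices do.
const heap = [];
const freeSlots = [];

function addToHandleTable(value) {
  const idx = freeSlots.length ? freeSlots.pop() : heap.length;
  heap[idx] = value;
  return idx;                    // this integer is what Rust sees as a JsValue
}

function getFromHandleTable(idx) {
  return heap[idx];              // Rust passes the index back to operate on the value
}

function dropHandle(idx) {
  heap[idx] = undefined;
  freeSlots.push(idx);           // recycle slots so the table stays small
}

// Usage: the "Rust" side only ever holds the returned index, never the object itself.
const handle = addToHandleTable({ type: 'Script', directives: [] });
console.log(getFromHandleTable(handle).type); // "Script"
dropHandle(handle);
```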
At first, we got much worse performance out of this, due to the overhead of more frequent calls between 1) Wasm and JavaScript (SpiderMonkey has improved these recently, but other engines still lag behind) and 2) JavaScript and C++, which also can't be optimised well in most engines.

The JavaScript <-> C++ overhead comes from the usage of TextEncoder to pass strings between JavaScript and Wasm in wasm-bindgen, and, indeed, it showed up as the highest cost in the benchmark profiles. This wasn't surprising - after all, strings can appear not only in the value payloads, but also in property names, which have to be serialized and sent between JavaScript and Wasm over and over when using a generic JSON-like structure.

Luckily, because our deserializer doesn't have to be compatible with JSON anymore, we can use our knowledge of Rust types and cache all the serialized property names as JavaScript value handles just once, and then keep reusing them for further property accesses.

This, combined with some changes to wasm-bindgen which we have upstreamed, allows our deserializer to be up to 3.5x faster in benchmarks than the original Serde support in wasm-bindgen, while shaving ~33% off the resulting code size. Note that for string-heavy data structures it might still be slower than the current JSON-based integration, but the situation is expected to improve over time when the reference types proposal lands natively in Wasm.

After implementing and integrating this deserializer, we used the wasm-pack plugin for Webpack to build a Worker with both the Rust and JavaScript parts combined, and shipped it to some test zones.

Show me the numbers

Keep in mind that this proposal is in very early stages, and current benchmarks and demos are not representative of the final outcome (which should improve the numbers much further).

As mentioned earlier, BinaryAST can mark functions that should be parsed lazily ahead of time. By using different levels of lazification in the encoder (https://github.com/binast/binjs-ref/blob/b72aff7dac7c692a604e91f166028af957cdcda5/crates/binjs_es6/src/lazy.rs#L43) and running tests against some popular JavaScript libraries, we found the following speed-ups.

Level 0 (no functions are lazified)

With lazy parsing disabled in both parsers we got a raw parsing speed improvement of between 3 and 10%.

Name             Source size (KB)   JavaScript parse time (avg ms)   BinaryAST parse time (avg ms)   Diff (%)
React            20                 0.403                            0.385                           -4.56
D3 (v5)          240                11.178                           10.525                          -6.018
Angular          180                6.985                            6.331                           -9.822
Babel            780                21.255                           20.599                          -3.135
Backbone         32                 0.775                            0.699                           -10.312
wabtjs           1720               64.836                           59.556                          -8.489
Fuzzball (1.2)   72                 3.165                            2.768                           -13.383

Level 3 (functions up to 3 levels deep are lazified)

But with lazification set to skip nested functions of up to 3 levels deep, we see much more dramatic improvements in parsing time, between 90 and 97%. As mentioned earlier in the post, BinaryAST makes lazy parsing essentially free by completely skipping over the marked functions.

Name             Source size (KB)   JavaScript parse time (avg ms)   BinaryAST parse time (avg ms)   Diff (%)
React            20                 0.407                            0.032                           -92.138
D3 (v5)          240                11.623                           0.224                           -98.073
Angular          180                7.093                            0.680                           -90.413
Babel            780                21.100                           0.895                           -95.758
Backbone         32                 0.898                            0.045                           -94.989
wabtjs           1720               59.802                           1.601                           -97.323
Fuzzball (1.2)   72                 2.937                            0.089                           -96.970

All the numbers are from manual tests on a Linux x64 Intel i7 with 16 GB of RAM.

While these synthetic benchmarks are impressive, they are not representative of real-world scenarios.
Normally you will use at least some of the loaded JavaScript during startup. To check this scenario, we decided to test some realistic pages and demos on desktop and mobile Firefox, and found speed-ups in page loads too.

For a sample application (https://github.com/cloudflare/binjs-demo, https://serve-binjs.that-test.site/) which weighed in at around 1.2 MB of JavaScript, we got the following numbers for initial script execution:

Device                 JavaScript   BinaryAST
Desktop                338ms        314ms
Mobile (HTC One M8)    2019ms       1455ms

Here is a video that will give you an idea of the improvement as seen by a user on mobile Firefox (in this case showing the entire page startup time):

The next step is to start gathering data on real-world websites, while improving the underlying format.

How do I test BinaryAST on my website?

We've open-sourced our Worker so that it can be installed on any Cloudflare zone: https://github.com/binast/binjs-ref/tree/cf-wasm.

One thing to be currently wary of is that, even though the result gets stored in the cache, the initial encoding is still an expensive process, and might easily hit CPU limits on any non-trivial JavaScript files and fall back to the unencoded variant. We are working to improve this situation by releasing the BinaryAST encoder as a separate feature with more relaxed limits in the next few days.

Meanwhile, if you want to play with BinaryAST on larger real-world scripts, an alternative option is to use the static binjs_encode tool from https://github.com/binast/binjs-ref to pre-encode JavaScript files ahead of time. Then, you can use a Worker from https://github.com/cloudflare/binast-cf-worker to serve the resulting BinaryAST assets when supported and requested by the browser.

On the client side, you'll currently need to download Firefox Nightly, go to about:config and enable unrestricted BinaryAST support via the following options:

Now, when opening a website with either of the Workers installed, Firefox will get BinaryAST instead of JavaScript automatically.

Summary

The amount of JavaScript in modern apps is presenting performance challenges for all consumers. Engine vendors are experimenting with different ways to improve the situation - some are focusing on raw decoding performance, some on parallelizing operations to reduce overall latency, some are researching new optimised formats for data representation, and some are inventing and improving protocols for network delivery.

No matter which one it is, we all have a shared goal of making the Web better and faster. On Cloudflare's side, we're always excited about collaborating with all the vendors and combining various approaches to make that goal closer with every step.

Live video just got more live: Introducing Concurrent Streaming Acceleration

Cloudflare Blog

Today we're excited to introduce Concurrent Streaming Acceleration, a new technique for reducing the end-to-end latency of live video on the web when using Stream Delivery.

Let's dig into live-streaming latency, why it's important, and what folks have done to improve it.

How "live" is "live" video?

Live streaming makes up an increasing share of video on the web. Whether it's a TV broadcast, a live game show, or an online classroom, users expect video to arrive quickly and smoothly. And the promise of "live" is that the user is seeing events as they happen. But just how close to "real-time" is "live" Internet video?

Delivering live video on the Internet is still hard and adds lots of latency:

1. The content source records video and sends it to an encoding server;
2. The origin server transforms this video into a format like DASH, HLS or CMAF that can be delivered to millions of devices efficiently;
3. A CDN is typically used to deliver encoded video across the globe;
4. Client players decode the video and render it on the screen.

And all of this is under a time constraint - the whole process needs to happen in a few seconds, or video experiences will suffer. We call the total delay between when the video was shot and when it can be viewed on an end-user's device "end-to-end latency" (think of it as the time from the camera lens to your phone's screen).

Traditional segmented delivery

Video formats like DASH, HLS, and CMAF work by splitting video into small files, called "segments". A typical segment duration is 6 seconds.

If a client player needs to wait for a whole 6s segment to be encoded, sent through a CDN, and then decoded, it can be a long wait! It takes even longer if you want the client to build up a buffer of segments to protect against any interruptions in delivery. A typical player buffer for HLS is 3 segments:

Clients may have to buffer three 6-second chunks, introducing at least 18s of latency

When you consider encoding delays, it's easy to see why live streaming latency on the Internet has typically been about 20-30 seconds. We can do better.

Reduced latency with chunked transfer encoding

A natural way to solve this problem is to enable client players to start playing the chunks while they're downloading, or even while they're still being created. Making this possible requires a clever bit of cooperation to encode and deliver the files in a particular way, known as "chunked encoding." This involves splitting up segments into smaller, bite-sized pieces, or "chunks". Chunked encoding can typically bring live latency down to 5 or 10 seconds.

Confusingly, the word "chunk" is overloaded to mean two different things:

1. CMAF or HLS chunks, which are small pieces of a segment (typically 1s) that are aligned on key frames;
2. HTTP chunks, which are just a way of delivering any file over the web.

Chunked Encoding splits segments into shorter chunks

HTTP chunks are important because web clients have limited ability to process streams of data. Most clients can only work with data once they've received the full HTTP response, or at least a complete HTTP chunk. By using HTTP chunked transfer encoding, we enable video players to start parsing and decoding video sooner.

CMAF chunks are important so that decoders can actually play the bits that are in the HTTP chunks. Without encoding video in a careful way, decoders would have random bits of a video file that can't be played.
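As a rough illustration of the client side of this, here is a minimal sketch using the standard fetch and Streams APIs to consume a response incrementally as HTTP chunks arrive. A real player would append each chunk to a Media Source Extensions SourceBuffer instead of just counting bytes, and the segment URL below is a placeholder.

```javascript
// Minimal sketch: read an in-flight HTTP response chunk by chunk instead of
// waiting for the whole segment to finish downloading.
async function streamSegment(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  let received = 0;

  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;                       // the origin finished writing the segment
    received += value.byteLength;
    console.log(`got ${value.byteLength} bytes (total ${received}) before the segment completed`);
    // sourceBuffer.appendBuffer(value);   // where a real player would hand bytes to the decoder
  }
}

streamSegment('https://example.com/live/segment-1234.m4s');
```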
Part of what makes this technique great is that HTTP chunked encoding is widely supported by CDNs – it’s been part of the HTTP spec for 20 years. CDN support is critical because it allows low-latency live video to scale up and reach audiences of thousands or millions of concurrent viewers – something that’s currently very difficult to do with other, non-HTTP based protocols. Unfortunately, even if you enable chunking to optimise delivery, your CDN may be working against you by buffering the entire segment. To understand why, consider what happens when many people request a live segment at the same time: If the file is already in cache, great! CDNs do a great job at delivering cached files to huge audiences. But what happens when the segment isn’t in cache yet? Remember – this is the typical request pattern for live video! Typically, CDNs are able to “stream on cache miss” from the origin. That looks something like this: But again – what happens when multiple people request the file at once? CDNs typically need to pull the entire file into cache before serving additional viewers:
Only one viewer can stream video, while other clients wait for the segment to buffer at the CDN
This behavior is understandable. CDN data centers consist of many servers. To avoid overloading origins, these servers typically coordinate amongst themselves using a “cache lock” (mutex) that allows only one server to request a particular file from origin at a given time. A side effect of this is that while a file is being pulled into cache, it can’t be served to any user other than the first one that requested it. Unfortunately, this cache lock also defeats the purpose of using chunked encoding! To recap thus far:
Chunked encoding splits up video segments into smaller pieces
This can reduce end-to-end latency by allowing chunks to be fetched and decoded by players, even while segments are being produced at the origin server
Some CDNs neutralize the benefits of chunked encoding by buffering entire files inside the CDN before they can be delivered to clients
Cloudflare’s solution: Concurrent Streaming Acceleration
As you may have guessed, we think we can do better. Put simply, we now have the ability to deliver un-cached files to multiple clients simultaneously while we pull the file once from the origin server. This sounds like a simple change, but there’s a lot of subtlety involved in doing it safely. Under the hood, we’ve made deep changes to our caching infrastructure to remove the cache lock and enable multiple clients to be able to safely read from a single file while it’s still being written. The best part is – all of Cloudflare now works this way! There’s no need to opt in, or even make a config change to get the benefit. We rolled this feature out a couple of months ago and have been really pleased with the results so far. We measure success by the “cache lock wait time,” i.e. how long a request must wait for other requests – a direct component of Time To First Byte. One OTT customer saw this metric drop from 1.5s at P99 to nearly 0, as expected: This directly translates into a 1.5-second improvement in end-to-end latency. Live video just got more live!
Conclusion
New techniques like chunked encoding have revolutionized live delivery, enabling publishers to deliver low-latency live video at scale. 
Concurrent Streaming Acceleration helps you unlock the power of this technique at your CDN, potentially shaving precious seconds off end-to-end latency. If you’re interested in using Cloudflare for live video delivery, contact our enterprise sales team. And if you’re interested in working on projects like this and helping us improve live video delivery for the entire Internet, join our engineering team!
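As a footnote for developers: the sketch below is a minimal, illustrative example (not Cloudflare code) of how a browser could consume a live segment incrementally over HTTP chunked transfer encoding, instead of waiting for the whole response. The segment URL is hypothetical, and a real player such as hls.js would append the chunks to a media buffer rather than just timing them.

// Illustrative only: measure how soon usable bytes arrive when a live
// segment is delivered with HTTP chunked transfer encoding.
async function fetchSegmentIncrementally(url) {
  const start = performance.now();
  const response = await fetch(url);        // e.g. a CMAF/HLS segment URL
  const reader = response.body.getReader(); // stream the body instead of awaiting it all
  let received = 0;
  let firstChunkAt = null;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstChunkAt === null) firstChunkAt = performance.now() - start;
    received += value.length;
    // A real player would append `value` to a MediaSource buffer here,
    // so decoding starts long before the segment finishes downloading.
  }
  console.log('first chunk after (ms):', firstChunkAt,
              'bytes:', received,
              'total (ms):', performance.now() - start);
}

// Usage (hypothetical URL):
// fetchSegmentIncrementally("https://example.com/live/segment_123.m4s");

With segment-level buffering at the CDN, the first chunk tends to arrive close to the end of the download; with concurrent streaming on a cache miss, useful bytes tend to show up almost immediately.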

Announcing Cloudflare Image Resizing: Simplifying Optimal Image Delivery

CloudFlare Blog -

In the past three years, the amount of image data on the median mobile webpage has doubled. Growing images translate directly to users hitting data transfer caps, experiencing slower websites, and even leaving if a website doesn’t load in a reasonable amount of time. The crime is that many of these images are so slow because they are larger than they need to be, sending data over the wire which has absolutely no (positive) impact on the user’s experience. To provide a concrete example, let’s consider this photo of Cloudflare’s Lava Lamp Wall: On the left you see the photo, scaled to 300 pixels wide. On the right you see the same image delivered in its original high resolution, scaled in a desktop web browser. On a regular-DPI screen, they both look the same, yet the image on the right takes more than twenty times more data to load. Even for the best and most conscientious developers, resizing every image to handle every possible device geometry consumes valuable time, and it’s exceptionally easy to forget to do this resizing altogether. Today we are launching a new product, Image Resizing, to fix this problem once and for all.
Announcing Image Resizing
With Image Resizing, Cloudflare adds another important product to its suite of available image optimizations. This product allows customers to perform a rich set of the key actions on images.
Resize - The source image will be resized to the specified height and width. This action allows multiple different sized variants to be created for each specific use.
Crop - The source image will be resized to a new size that does not maintain the original aspect ratio and a portion of the image will be removed. This can be especially helpful for headshots and product images where different formats must be achieved by keeping only a portion of the image.
Compress - The source image will have its file size reduced by applying lossy compression. This should be used when slight quality reduction is an acceptable trade for file size reduction.
Convert to WebP - When the user’s browser supports it, the source image will be converted to WebP. Delivering a WebP image takes advantage of the modern, highly optimized image format.
By using a combination of these actions, customers store a single high quality image on their server, and Image Resizing can be leveraged to create specialized variants for each specific use case. Without any additional effort, each variant will also automatically benefit from Cloudflare’s global caching.
Examples
Ecommerce Thumbnails
Ecommerce sites typically store a high-quality image of each product. From that image, they need to create different variants depending on how that product will be displayed. One example is creating thumbnails for a catalog view. Using Image Resizing, if the high quality image is located here:
https://example.com/images/shoe123.jpg
This is how to display a 75x75 pixel thumbnail using Image Resizing:
<img src="/cdn-cgi/image/width=75,height=75/images/shoe123.jpg">
Responsive Images
When tailoring a site to work on various device types and sizes, it’s important to always use correctly sized images. This can be difficult when images are intended to fill a particular percentage of the screen. To solve this problem, <img srcset sizes> can be used. Without Image Resizing, multiple versions of the same image would need to be created and stored. 
In this example, a single high quality copy of hero.jpg is stored, and Image Resizing is used to resize for each particular size as needed.
<img width="100%"
     srcset=" /cdn-cgi/image/fit=contain,width=320/assets/hero.jpg 320w,
              /cdn-cgi/image/fit=contain,width=640/assets/hero.jpg 640w,
              /cdn-cgi/image/fit=contain,width=960/assets/hero.jpg 960w,
              /cdn-cgi/image/fit=contain,width=1280/assets/hero.jpg 1280w,
              /cdn-cgi/image/fit=contain,width=2560/assets/hero.jpg 2560w, "
     src="/cdn-cgi/image/width=960/assets/hero.jpg">
Enforce Maximum Size Without Changing URLs
Image Resizing is also available from within a Cloudflare Worker. Workers allow you to write code which runs close to your users all around the world. For example, you might wish to add Image Resizing to your images while keeping the same URLs. Your users and clients would be able to use the same image URLs as always, but the images will be transparently modified in whatever way you need. You can install a Worker on a route which matches your image URLs, and resize any images larger than a limit:
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  return fetch(request, {
    cf: {
      image: {
        width: 800,
        height: 800,
        fit: 'scale-down'
      }
    }
  });
}
As a Worker is just code, it is also easy to run this worker only on URLs with image extensions, or even to only resize images being delivered to mobile clients.
Cloudflare and Images
Cloudflare has a long history building tools to accelerate images. Our caching has always helped reduce latency by storing a copy of images closer to the user. Polish automates options for both lossless and lossy image compression to remove unnecessary bytes from images. Mirage accelerates image delivery based on device type. We are continuing to invest in all of these tools, as they all serve a unique role in improving the image experience on the web. Image Resizing is different because it is the first image product at Cloudflare to give developers full control over how their images will be served. You should choose Image Resizing if you are comfortable defining the sizes you wish your images to be served at in advance or within a Cloudflare Worker.
Next Steps and Simple Pricing
Image Resizing is available today for Business and Enterprise Customers. To enable it, log in to the Cloudflare Dashboard and navigate to the Speed Tab. There you’ll find the section for Image Resizing which you can enable with one click. This product is included in the Business and Enterprise plans at no additional cost with generous usage limits. Business Customers have a limit of 100k requests per month and will be charged $10 for each additional 100k requests per month. Enterprise Customers have a 10M request per month limit with discounted tiers for higher usage. Requests are defined as a hit on a URI that contains Image Resizing or a call to Image Resizing from a Worker. Now that you’ve enabled Image Resizing, it’s time to resize your first image.
Using your existing site, store an image here: https://yoursite.com/images/yourimage.jpg
Use this URL to resize that image: https://yoursite.com/cdn-cgi/image/width=100,height=100,quality=75/images/yourimage.jpg
Experiment with changing width=, height=, and quality=.
The instructions above use the Default URL Format for Image Resizing. For details on options, use cases, and compatibility, refer to our Developer Documentation.
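The post mentions that a Worker can limit resizing to image URLs, or only to mobile clients. Here is a hedged sketch of what that might look like; the extension list, the size caps, and the simple User-Agent check are illustrative assumptions, not recommended values.

// Illustrative Worker: only resize likely-image URLs, and pick a smaller
// target width for clients whose User-Agent looks like a mobile browser.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

const IMAGE_EXTENSIONS = /\.(jpe?g|png|gif|webp)$/i;

async function handleRequest(request) {
  const url = new URL(request.url);

  // Pass non-image requests through untouched.
  if (!IMAGE_EXTENSIONS.test(url.pathname)) {
    return fetch(request);
  }

  // Crude mobile detection, for illustration only.
  const userAgent = request.headers.get('User-Agent') || '';
  const isMobile = /Mobi|Android/i.test(userAgent);

  return fetch(request, {
    cf: {
      image: {
        width: isMobile ? 640 : 1280, // assumed size caps
        fit: 'scale-down'             // never upscale smaller originals
      }
    }
  });
}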

Parallel streaming of progressive images

CloudFlare Blog -

Progressive image rendering and HTTP/2 multiplexing technologies have existed for a while, but now we've combined them in a new way that makes them much more powerful. With Cloudflare’s progressive streaming, images appear to load in half of the time, and browsers can start rendering pages sooner. In HTTP/1.1 connections, servers didn't have any choice about the order in which resources were sent to the client; they had to send responses, as a whole, in the exact order they were requested by the web browser. HTTP/2 improved this by adding multiplexing and prioritization, which allows servers to decide exactly what data is sent and when. We’ve taken advantage of these new HTTP/2 capabilities to improve the perceived speed of loading of progressive images by sending the most important fragments of image data sooner. This feature is compatible with all major browsers, and doesn’t require any changes to page markup, so it’s very easy to adopt. Sign up for the Beta to enable it on your site!
What is progressive image rendering?
Basic images load strictly from top to bottom. If a browser has received only half of an image file, it can show only the top half of the image. Progressive images have their content arranged not from top to bottom, but from a low level of detail to a high level of detail. Receiving a fraction of image data allows browsers to show the entire image, only with a lower fidelity. As more data arrives, the image becomes clearer and sharper. This works great in the JPEG format, where only about 10-15% of the data is needed to display a preview of the image, and at 50% of the data the image looks almost as good as when the whole file is delivered. Progressive JPEG images contain exactly the same data as baseline images, merely reshuffled in a more useful order, so progressive rendering doesn’t add any cost to the file size. This is possible because JPEG doesn't store the image as pixels. Instead, it represents the image as frequency coefficients, which are like a set of predefined patterns that can be blended together, in any order, to reconstruct the original image. The inner workings of JPEG are really fascinating, and you can learn more about them from my recent performance.now() conference talk. The end result is that the images can look almost fully loaded in half of the time, for free! The page appears to be visually complete and can be used much sooner. The rest of the image data arrives shortly after, upgrading images to their full quality, before visitors have time to notice anything is missing.
HTTP/2 progressive streaming
But there's a catch. Websites have more than one image (sometimes even hundreds of images). When the server sends image files naïvely, one after another, the progressive rendering doesn’t help that much, because overall the images still load sequentially:
Having complete data for half of the images (and no data for the other half) doesn't look as good as having half of the data for all images.
And there's another problem: when the browser doesn't know image sizes yet, it lays the page out with placeholders instead, and re-lays out the page when each image loads. This can make pages jump during loading, which is inelegant, distracting and annoying for the user. Our new progressive streaming feature greatly improves the situation: we can send all of the images at once, in parallel. 
This way the browser gets size information for all of the images as soon as possible, can paint a preview of all images without having to wait for a lot of data, and large images don’t delay loading of styles, scripts and other more important resources. This idea of streaming of progressive images in parallel is as old as HTTP/2 itself, but it needs special handling in low-level parts of web servers, and so far this hasn't been implemented at a large scale. When we were improving our HTTP/2 prioritization, we realized it could also be used to implement this feature. Image files as a whole are neither high nor low priority. The priority changes within each file, and dynamic re-prioritization gives us the behavior we want:
The image header that contains the image size is very high priority, because the browser needs to know the size as soon as possible to do page layout. The image header is small, so it doesn't hurt to send it ahead of other data.
The minimum amount of data in the image required to show a preview of the image has a medium priority (we'd like to plug "holes" left for unloaded images as soon as possible, but also leave some bandwidth available for scripts, fonts and other resources).
The remainder of the image data is low priority. Browsers can stream it last to refine image quality once there's no rush, since the page is already fully usable.
Knowing the exact amount of data to send in each phase requires understanding the structure of image files, but it seemed weird to us to make our web server parse image responses and have a format-specific behavior hardcoded at a protocol level. By framing the problem as a dynamic change of priorities, we were able to elegantly separate low-level networking code from knowledge of image formats. We can use Workers or offline image processing tools to analyze the images, and instruct our server to change HTTP/2 priorities accordingly. The great thing about parallel streaming of images is that it doesn’t add any overhead. We’re still sending the same data, the same amount of data, we’re just sending it in a smarter order. This technique takes advantage of existing web standards, so it’s compatible with all browsers.
The waterfall
Here are waterfall charts from WebPageTest showing a comparison of regular HTTP/2 responses and progressive streaming. In both cases the files were exactly the same, the amount of data transferred was the same, and the overall page loading time was the same (within measurement noise). In the charts, blue segments show when data was transferred, and green shows when each request was idle. The first chart shows a typical server behavior that makes images load mostly sequentially. The chart itself looks neat, but the actual experience of loading that page was not great — the last image didn't start loading until almost the end. The second chart shows images loaded in parallel. The blue vertical streaks throughout the chart are image headers sent early followed by a couple of stages of progressive rendering. You can see that useful data arrived sooner for all of the images. You may notice that one of the images has been sent in one chunk, rather than split like all the others. 
That’s because at the very beginning of a TCP/IP connection we don't know the true speed of the connection yet, and we have to sacrifice some opportunity to do prioritization in order to maximize the connection speed.
The metrics compared to other solutions
There are other techniques intended to provide image previews quickly, such as low-quality image placeholders (LQIP), but they have several drawbacks. They add unnecessary data for the placeholders, usually interfere with browsers' preload scanner, and delay loading of full-quality images due to dependence on JavaScript needed to upgrade the previews to full images.
Our solution doesn't cause any additional requests, and doesn't add any extra data. Overall page load time is not delayed.
Our solution doesn't require any JavaScript. It takes advantage of functionality supported natively in the browsers.
Our solution doesn't require any changes to a page's markup, so it's very safe and easy to deploy site-wide.
The improvement in user experience is reflected in performance metrics such as the SpeedIndex metric and time to visually complete. Notice that with regular image loading the visual progress is linear, but with progressive streaming it quickly jumps to mostly complete:
Getting the most out of progressive rendering
Avoid ruining the effect with JavaScript. Scripts that hide images and wait until the onload event to reveal them (with a fade in, etc.) will defeat progressive rendering. Progressive rendering works best with the good old <img> element.
Is it JPEG-only?
Our implementation is format-independent, but progressive streaming is useful only for certain file types. For example, it wouldn't make sense to apply it to scripts or stylesheets: these resources are rendered as all-or-nothing. Prioritization of image headers (containing the image size) works for all file formats. The benefits of progressive rendering are unique to JPEG (supported in all browsers) and JPEG 2000 (supported in Safari). GIF and PNG have interlaced modes, but these modes come at a cost of worse compression. WebP doesn't even support progressive rendering at all. This creates a dilemma: WebP is usually 20%-30% smaller than a JPEG of equivalent quality, but progressive JPEG appears to load 50% faster. There are next-generation image formats that support progressive rendering better than JPEG, and compress better than WebP, but they're not supported in web browsers yet. In the meantime you can choose between the bandwidth savings of WebP or the better perceived performance of progressive JPEG by changing Polish settings in your Cloudflare dashboard.
Custom header for experimentation
We also support a custom HTTP header that allows you to experiment with and optimize streaming of other resources on your site. For example, you could make our servers send the first frame of animated GIFs with high priority and deprioritize the rest. Or you could prioritize loading of resources mentioned in the <head> of HTML documents before the <body> is loaded. The custom header can be set only from a Worker. The syntax is a comma-separated list of file positions with priority and concurrency. The priority and concurrency are the same as in the whole-file cf-priority header described in the previous blog post.
cf-priority-change: <offset in bytes>:<priority>/<concurrency>, ... 
For example, for a progressive JPEG we use something like this (a fragment of JS to use in a Worker):
let headers = new Headers(response.headers);
headers.set("cf-priority", "30/0");
headers.set("cf-priority-change", "512:20/1, 15000:10/n");
return new Response(response.body, {headers});
This instructs the server to use priority 30 initially, while it sends the first 512 bytes, then switch to priority 20 with some concurrency (/1), and finally, after sending 15000 bytes of the file, switch to low priority and high concurrency (/n) to deliver the rest of the file. We’ll try to split HTTP/2 frames to match the offsets specified in the header to change the sending priority as soon as possible. However, priorities don’t guarantee that data of different streams will be multiplexed exactly as instructed, since the server can prioritize only when it has data of multiple streams waiting to be sent at the same time. If some of the responses arrive much sooner from the upstream server or the cache, the server may send them right away, without waiting for other responses.
Try it!
You can use our Polish tool to convert your images to progressive JPEG. Sign up for the beta to have them elegantly streamed in parallel.
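For context, here is a hedged sketch of how the fragment above might sit inside a complete Worker. The content-type check is an illustrative assumption, and the byte offsets are only the example values from this post; in practice the offsets should come from analysing each image, as described earlier.

// Illustrative Worker wrapping the cf-priority-change fragment from this post.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const response = await fetch(request);

  // Only reprioritize JPEGs (assumed: the origin serves them as image/jpeg).
  const contentType = response.headers.get('Content-Type') || '';
  if (!contentType.includes('image/jpeg')) {
    return response;
  }

  const headers = new Headers(response.headers);
  headers.set('cf-priority', '30/0');                        // image header: high priority, sequential
  headers.set('cf-priority-change', '512:20/1, 15000:10/n'); // preview: medium; remainder: low, parallel
  return new Response(response.body, { status: response.status, headers });
}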

Better HTTP/2 Prioritization for a Faster Web

CloudFlare Blog -

HTTP/2 promised a much faster web and Cloudflare rolled out HTTP/2 access for all our customers long, long ago. But one feature of HTTP/2, Prioritization, didn’t live up to the hype. Not because it was fundamentally broken but because of the way browsers implemented it. Today Cloudflare is pushing out a change to HTTP/2 Prioritization that gives our servers control of prioritization decisions that truly make the web much faster. Historically the browser has been in control of deciding how and when web content is loaded. Today we are introducing a radical change to that model for all paid plans that puts control into the hands of the site owner directly. Customers can enable “Enhanced HTTP/2 Prioritization” in the Speed tab of the Cloudflare dashboard: this overrides the browser defaults with an improved scheduling scheme that results in a significantly faster visitor experience (we have seen 50% faster on multiple occasions). With Cloudflare Workers, site owners can take this a step further and fully customize the experience to their specific needs. Background Web pages are made up of dozens (sometimes hundreds) of separate resources that are loaded and assembled by the browser into the final displayed content. This includes the visible content the user interacts with (HTML, CSS, images) as well as the application logic (JavaScript) for the site itself, ads, analytics for tracking site usage and marketing tracking beacons. The sequencing of how those resources are loaded can have a significant impact on how long it takes for the user to see the content and interact with the page. A browser is basically an HTML processing engine that goes through the HTML document and follows the instructions in order from the start of the HTML to the end, building the page as it goes along. References to stylesheets (CSS) tell the browser how to style the page content and the browser will delay displaying content until it has loaded the stylesheet (so it knows how to style the content it is going to display). Scripts referenced in the document can have several different behaviors. If the script is tagged as “async” or “defer” the browser can keep processing the document and just run the script code whenever the scripts are available. If the scripts are not tagged as async or defer then the browser MUST stop processing the document until the script has downloaded and executed before continuing. These are referred to as “blocking” scripts because they block the browser from continuing to process the document until they have been loaded and executed. The HTML document is split into two parts. The <head> of the document is at the beginning and contains stylesheets, scripts and other instructions for the browser that are needed to display the content. The <body> of the document comes after the head and contains the actual page content that is displayed in the browser window (though scripts and stylesheets are allowed to be in the body as well). Until the browser gets to the body of the document there is nothing to display to the user and the page will remain blank so getting through the head of the document as quickly as possible is important. “HTML5 rocks” has a great tutorial on how browsers work if you want to dive deeper into the details. The browser is generally in charge of determining the order of loading the different resources it needs to build the page and to continue processing the document. 
In the case of HTTP/1.x, the browser is limited in how many things it can request from any one server at a time (generally 6 connections and only one resource at a time per connection) so the ordering is strictly controlled by the browser by how things are requested. With HTTP/2 things change pretty significantly. The browser can request all of the resources at once (at least as soon as it knows about them) and it provides detailed instructions to the server for how the resources should be delivered. Optimal Resource Ordering For most parts of the page loading cycle there is an optimal ordering of the resources that will result in the fastest user experience (and the difference between optimal and not can be significant - as much as a 50% improvement or more). As described above, early in the page load cycle before the browser can render any content it is blocked on the CSS and blocking JavaScript in the <head> section of the HTML. During that part of the loading cycle it is best for 100% of the connection bandwidth to be used to download the blocking resources and for them to be downloaded one at a time in the order they are defined in the HTML. That lets the browser parse and execute each item while it is downloading the next blocking resource, allowing the download and execution to be pipelined. The scripts take the same amount of time to download when downloaded in parallel or one after the other but by downloading them sequentially the first script can be processed and execute while the second script is downloading. Once the render-blocking content has loaded things get a little more interesting and the optimal loading may depend on the specific site or even business priorities (user content vs ads vs analytics, etc). Fonts in particular can be difficult as the browser only discovers what fonts it needs after the stylesheets have been applied to the content that is about to be displayed so by the time the browser knows about a font, it is needed to display text that is already ready to be drawn to the screen. Any delays in getting the font loaded end up as periods with blank text on the screen (or text displayed using the wrong font). Generally there are some tradeoffs that need to be considered: Custom fonts and visible images in the visible part of the page (viewport) should be loaded as quickly as possible. They directly impact the user’s visual experience of the page loading. Non-blocking JavaScript should be downloaded serially relative to other JavaScript resources so the execution of each can be pipelined with the downloads. The JavaScript may include user-facing application logic as well as analytics tracking and marketing beacons and delaying them can cause a drop in the metrics that the business tracks. Images benefit from downloading in parallel. The first few bytes of an image file contain the image dimensions which may be necessary for browser layout, and progressive images downloading in parallel can look visually complete with around 50% of the bytes transferred. Weighing the tradeoffs, one strategy that works well in most cases is: Custom fonts download sequentially and split the available bandwidth with visible images. Visible images download in parallel, splitting the “images” share of the bandwidth among them. When there are no more fonts or visible images pending: Non-blocking scripts download sequentially and split the available bandwidth with non-visible images Non-visible images download in parallel, splitting the “images” share of the bandwidth among them. 
That way the content visible to the user is loaded as quickly as possible, the application logic is delayed as little as possible and the non-visible images are loaded in such a way that layout can be completed as quickly as possible. Example For illustrative purposes, we will use a simplified product category page from a typical e-commerce site. In this example the page has: The HTML file for the page itself, represented by a blue box. 1 external stylesheet (CSS file), represented by a green box. 4 external scripts (JavaScript), represented by orange boxes. 2 of the scripts are blocking at the beginning of the page and 2 are asynchronous. The blocking script boxes use a darker shade of orange. 1 custom web font, represented by a red box. 13 images, represented by purple boxes. The page logo and 4 of the product images are visible in the viewport and 8 of the product images require scrolling to see. The 5 visible images use a darker shade of purple. For simplicity, we will assume that all of the resources are the same size and each takes 1 second to download on the visitor’s connection. Loading everything takes a total of 20 seconds, but HOW it is loaded can have a huge impact to the experience. This is what the described optimal loading would look like in the browser as the resources load: The page is blank for the first 4 seconds while the HTML, CSS and blocking scripts load, all using 100% of the connection. At the 4-second mark the background and structure of the page is displayed with no text or images. One second later, at 5 seconds, the text for the page is displayed. From 5-10 seconds the images load, starting out as blurry but sharpening very quickly. By around the 7-second mark it is almost indistinguishable from the final version. At the 10 second mark all of the visual content in the viewport has completed loading. Over the next 2 seconds the asynchronous JavaScript is loaded and executed, running any non-critical logic (analytics, marketing tags, etc). For the final 8 seconds the rest of the product images load so they are ready for when the user scrolls. Current Browser Prioritization All of the current browser engines implement different prioritization strategies, none of which are optimal. Microsoft Edge and Internet Explorer do not support prioritization so everything falls back to the HTTP/2 default which is to load everything in parallel, splitting the bandwidth evenly among everything. Microsoft Edge is moving to use the Chromium browser engine in future Windows releases which will help improve the situation. In our example page this means that the browser is stuck in the head for the majority of the loading time since the images are slowing down the transfer of the blocking scripts and stylesheets. Visually that results in a pretty painful experience of staring at a blank screen for 19 seconds before most of the content displays, followed by a 1-second delay for the text to display. Be patient when watching the animated progress because for the 19 seconds of blank screen it may feel like nothing is happening (even though it is): Safari loads all resources in parallel, splitting the bandwidth between them based on how important Safari believes they are (with render-blocking resources like scripts and stylesheets being more important than images). Images load in parallel but also load at the same time as the render-blocking content. 
While similar to Edge in that everything downloads at the same time, by allocating more bandwidth to the render-blocking resources Safari can display the content much sooner: At around 8 seconds the stylesheet and scripts have finished loading so the page can start to be displayed. Since the images were loading in parallel, they can also be rendered in their partial state (blurry for progressive images). This is still twice as slow as the optimal case but much better than what we saw with Edge. At around 11 seconds the font has loaded so the text can be displayed and more image data has been downloaded so the images will be a little sharper. This is comparable to the experience around the 7-second mark for the optimal loading case. For the remaining 9 seconds of the load the images get sharper as more data for them downloads until it is finally complete at 20 seconds. Firefox builds a dependency tree that groups resources and then schedules the groups to either load one after another or to share bandwidth between the groups. Within a given group the resources share bandwidth and download concurrently. The images are scheduled to load after the render-blocking stylesheets and to load in parallel but the render-blocking scripts and stylesheets also load in parallel and do not get the benefits of pipelining. In our example case this ends up being a slightly faster experience than with Safari since the images are delayed until after the stylesheets complete: At the 6 second mark the initial page content is rendered with the background and blurry versions of the product images (compared to 8 seconds for Safari and 4 seconds for the optimal case). At 8 seconds the font has loaded and the text can be displayed along with slightly sharper versions of the product images (compared to 11 seconds for Safari and 7 seconds in the Optimal case). For the remaining 12 seconds of the loading the product images get sharper as the remaining content loads. Chrome (and all Chromium-based browsers) prioritizes resources into a list. This works really well for the render-blocking content that benefits from loading in order but works less well for images. Each image loads to 100% completion before starting the next image. In practice this is almost as good as the optimal loading case with the only difference being that the images load one at a time instead of in parallel: Up until the 5 second mark the Chrome experience is identical to the optimal case, displaying the background at 4 seconds and the text content at 5. For the next 5 seconds the visible images load one at a time until they are all complete at the 10 second mark (compared to the optimal case where they are just slightly blurry at 7 seconds and sharpen up for the remaining 3 seconds). After the visual part of the page is complete at 10 seconds (identical to the optimal case), the remaining 10 seconds are spent running the async scripts and loading the hidden images (just like with the optimal loading case). Visual Comparison Visually, the impact can be quite dramatic, even though they all take the same amount of time to technically load all of the content: Server-Side Prioritization HTTP/2 prioritization is requested by the client (browser) and it is up to the server to decide what to do based on the request. A good number of servers don’t support doing anything at all with the prioritization but for those that do, they all honor the client’s request. 
Another option would be to decide on the best prioritization to use on the server-side, taking into account the client’s request. Per the specification, HTTP/2 prioritization is a dependency tree that requires full knowledge of all of the in-flight requests to be able to prioritize resources against each other. That allows for incredibly complex strategies but is difficult to implement well on either the browser or server side (as evidenced by the different browser strategies and varying levels of server support). To make prioritization easier to manage we have developed a simpler prioritization scheme that still has all of the flexibility needed for optimal scheduling. The Cloudflare prioritization scheme consists of 64 priority “levels” and within each priority level there are groups of resources that determine how the connection is shared between them: All of the resources at a higher priority level are transferred before moving on to the next lower priority level. Within a given priority level, there are 3 different “concurrency” groups:
0 : All of the resources in the concurrency “0” group are sent sequentially in the order they were requested, using 100% of the bandwidth. Only after all of the concurrency “0” group resources have been downloaded are other groups at the same level considered.
1 : All of the resources in the concurrency “1” group are sent sequentially in the order they were requested. The available bandwidth is split evenly between the concurrency “1” group and the concurrency “n” group.
n : The resources in the concurrency “n” group are sent in parallel, splitting the bandwidth available to the group between them.
Practically speaking, the concurrency “0” group is useful for critical content that needs to be processed sequentially (scripts, CSS, etc). The concurrency “1” group is useful for less-important content that can share bandwidth with other resources but where the resources themselves still benefit from processing sequentially (async scripts, non-progressive images, etc). The concurrency “n” group is useful for resources that benefit from processing in parallel (progressive images, video, audio, etc).
Cloudflare Default Prioritization
When enabled, the enhanced prioritization implements the “optimal” scheduling of resources described above. The specific prioritizations applied look like this: This prioritization scheme allows sending the render-blocking content serially, followed by the visible images in parallel and then the rest of the page content with some level of sharing to balance application and content loading. The “* If Detectable” caveat is that not all browsers differentiate between the different types of stylesheets and scripts but it will still be significantly faster in all cases. 50% faster by default, particularly for Edge and Safari visitors, is not unusual:
Customizing Prioritization with Workers
Faster-by-default is great, but where things get really interesting is that the ability to configure the prioritization is also exposed to Cloudflare Workers, so sites can override the default prioritization for resources or implement their own complete prioritization schemes. If a Worker adds a “cf-priority” header to the response, Cloudflare edge servers will use the specified priority and concurrency for that response. The format of the header is <priority>/<concurrency>, so something like response.headers.set('cf-priority', '30/0'); would set the priority to 30 with a concurrency of 0 for the given response. 
Similarly, “30/1” would set concurrency to 1 and “30/n” would set concurrency to n. With this level of flexibility a site can tweak resource prioritization to meet their needs. Boosting the priority of some critical async scripts for example or increasing the priority of hero images before the browser has identified that they are in the viewport. To help inform any prioritization decisions, the Workers runtime also exposes the browser-requested prioritization information in the request object passed in to the Worker’s fetch event listener (request.cf.requestPriority). The incoming requested priority is a semicolon-delimited list of attributes that looks something like this: “weight=192;exclusive=0;group=3;group-weight=127”. weight: The browser-requested weight for the HTTP/2 prioritization. exclusive: The browser-requested HTTP/2 exclusive flag (1 for Chromium-based browsers, 0 for others). group: HTTP/2 stream ID for the request group (only non-zero for Firefox). group-weight: HTTP/2 weight for the request group (only non-zero for Firefox). This is Just the Beginning The ability to tune and control the prioritization of responses is the basic building block that a lot of future work will benefit from. We will be implementing our own advanced optimizations on top of it but by exposing it in Workers we have also opened it up to sites and researchers to experiment with different prioritization strategies. With the Apps Marketplace it is also possible for companies to build new optimization services on top of the Workers platform and make it available to other sites to use. If you are on a Pro plan or above, head over to the speed tab in the Cloudflare dashboard and turn on “Enhanced HTTP/2 Prioritization” to accelerate your site.
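Before you head to the dashboard, here is a hedged sketch of the Workers hook described above: it reads the browser-requested prioritization and overrides the priority for a couple of resource types. The URL patterns and the specific priority/concurrency values are illustrative assumptions, not Cloudflare defaults.

// Illustrative Worker: boost a critical async script and a hero image
// using the cf-priority response header described in this post.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  // Browser-requested prioritization, e.g. "weight=192;exclusive=0;group=3;group-weight=127".
  const requested = (request.cf && request.cf.requestPriority) || '';
  console.log('browser requested:', requested);

  const response = await fetch(request);
  const headers = new Headers(response.headers);
  const path = new URL(request.url).pathname;

  if (path.endsWith('/app.js')) {
    headers.set('cf-priority', '40/0');   // assumed: critical script, sequential
  } else if (path.endsWith('/hero.jpg')) {
    headers.set('cf-priority', '35/n');   // assumed: hero image, shares bandwidth with peers
  } else {
    return response;                      // leave everything else at the defaults
  }
  return new Response(response.body, { status: response.status, headers });
}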

Argo and the Cloudflare Global Private Backbone

CloudFlare Blog -

Welcome to Speed Week! Each day this week, we’re going to talk about something Cloudflare is doing to make the Internet meaningfully faster for everyone. Cloudflare has built a massive network of data centers in 180 cities in 75 countries. One way to think of Cloudflare is a global system to transport bits securely, quickly, and reliably from any point A to any other point B on the planet.To make that a reality, we built Argo. Argo uses real-time global network information to route around brownouts, cable cuts, packet loss, and other problems on the Internet. Argo makes the network that Cloudflare relies on—the Internet—faster, more reliable, and more secure on every hop around the world.We launched Argo two years ago, and it now carries over 22% of Cloudflare’s traffic. On an average day, Argo cuts the amount of time Internet users spend waiting for content by 112 years!As Cloudflare and our traffic volumes have grown, it now makes sense to build our own private backbone to add further security, reliability, and speed to key connections between Cloudflare locations.Today, we’re introducing the Cloudflare Global Private Backbone. It’s been in operation for a while now and links Cloudflare locations with private fiber connections.This private backbone benefits all Cloudflare customers, and it shines in combination with Argo. Argo can select the best available link across the Internet on a per data center-basis, and takes full advantage of the Cloudflare Global Private Backbone automatically.Let’s open the hood on Argo and explain how our backbone network further improves performance for our customers.What’s Argo?Argo is like Waze for the Internet. Every day, Cloudflare carries hundreds of billions of requests across our network and the Internet. Because our network, our customers, and their end-users are well distributed globally, all of these requests flowing across our infrastructure paint a great picture of how different parts of the Internet are performing at any given time.Just like Waze examines real data from real drivers to give you accurate, uncongested (and sometimes unorthodox) routes across town, Argo Smart Routing uses the timing data Cloudflare collects from each request to pick faster, more efficient routes across the Internet.In practical terms, Cloudflare’s network is expansive in its reach. Some of the Internet links in a given region may be congested and cause poor performance (a literal traffic jam). By understanding this is happening and using alternative network locations and providers, Argo can put traffic on a less direct, but faster, route from its origin to its destination.These benefits are not theoretical: enabling Argo Smart Routing shaves an average of 33% off HTTP time to first byte (TTFB).One other thing we’re proud of: we’ve stayed super focused on making it easy to use. One click in the dashboard enables better, smarter routing, bringing the full weight of Cloudflare’s network, data, and engineering expertise to bear on making your traffic faster. Advanced analytics allow you to understand exactly how Argo is performing for you around the world. You can read a lot more about how Argo works in our original launch blog post. So far, we’ve been talking about Argo at a functional level: you turn it on and it makes requests that traverse the Internet to your origin faster. How does it actually work? 
Argo is dependent on a few things to make its magic happen: Cloudflare’s network, up-to-the-second performance data on how traffic is moving on the Internet, and machine learning routing algorithms.Cloudflare’s Global NetworkCloudflare maintains a network of data centers around the world, and our network continues to grow significantly. Today, we have more than 180 data centers in 75 countries. That’s an additional 69 data centers since we launched Argo in May 2017.In addition to adding new locations, Cloudflare is constantly working with network partners to add connectivity options to our network locations. A single Cloudflare data center may be peered with a dozen networks, connected to multiple Internet eXchanges (IXs), connected to multiple transit providers (e.g. Telia, GTT, etc), and now, connected to our own physical backbone. A given destination may be reachable over multiple different links from the same location; each of these links will have different performance and reliability characteristics. This increased network footprint is important in making Argo faster. Additional network locations and providers mean Argo has more options at its disposal to route around network disruptions and congestion. Every time we add a new network location, we exponentially grow the number of routing options available to any given request.Better routing for improved performanceArgo requires the huge global network we’ve built to do its thing. But it wouldn’t do much of anything if it didn’t have the smarts to actually take advantage of all our data centers and cables between them to move traffic faster.Argo combines multiple machine learning techniques to build routes, test them, and disqualify routes that are not performing as we expect.The generation of routes is performed on data using “offline” optimization techniques: Argo’s route construction algorithms take an input data set (timing data) and fixed optimization target (“minimize TTFB”), outputting routes that it believes satisfy this constraint.Route disqualification is performed by a separate pipeline that has no knowledge of the route construction algorithms. These two systems are intentionally designed to be adversarial, allowing Argo to be both aggressive in finding better routes across the Internet but also adaptive to rapidly changing network conditions.One specific example of Argo’s smarts is its ability to distinguish between multiple potential connectivity options as it leaves a given data center. We call this “transit selection”.As we discussed above, some of our data centers may have a dozen different, viable options for reaching a given destination IP address. It’s as if you subscribed to every available ISP at your house, and you could choose to use any one of them for each website you tried to access. Transit selection enables Cloudflare to pick the fastest available path in real-time at every hop to reach the destination. With transit selection, Argo is able to specify both:1) Network location waypoints on the way to the origin.2) The specific transit provider or link at each waypoint in the journey of the packet all the way from the source to the destination.To analogize this to Waze, Argo giving directions without transit selection is like telling someone to drive to a waypoint (go to New York from San Francisco, passing through Salt Lake City), without specifying the roads to actually take to Salt Lake City or New York. 
With transit selection, we’re able to give full turn-by-turn directions — take I-80 out of San Francisco, take a left here, enter the Salt Lake City area using SR-201 (because I-80 is congested around SLC), etc. This allows us to route around issues on the Internet with much greater precision.Transit selection requires logic in our inter-data center data plane (the components that actually move data across our network) to allow for differentiation between different providers and links available in each location. Some interesting network automation and advertisement techniques allow us to be much more discerning about which link actually gets picked to move traffic. Without modifications to the Argo data plane, those options would be abstracted away by our edge routers, with the choice of transit left to BGP. We plan to talk more publicly about the routing techniques used in the future.We are able to directly measure the impact transit selection has on Argo customer traffic. Looking at global average improvement, transit selection gets customers an additional 16% TTFB latency benefit over taking standard BGP-derived routes. That’s huge!One thing we think about: Argo can itself change network conditions when moving traffic from one location or provider to another by inducing demand (adding additional data volume because of improved performance) and changing traffic profiles. With great power comes great intricacy.Adding the Cloudflare Global Private BackboneGiven our diversity of transit and connectivity options in each of our data centers, and the smarts that allow us to pick between them, why did we go through the time and trouble of building a backbone for ourselves? The short answer: operating our own private backbone allows us much more control over end-to-end performance and capacity management.When we buy transit or use a partner for connectivity, we’re relying on that provider to manage the link’s health and ensure that it stays uncongested and available. Some networks are better than others, and conditions change all the time.As an example, here’s a measurement of jitter (variance in round trip time) between two of our data centers, Chicago and Newark, over a transit provider’s network:Average jitter over the pictured 6 hours is 4ms, with average round trip latency of 27ms. Some amount of latency is something we just need to learn to live with; the speed of light is a tough physical constant to do battle with, and network protocols are built to function over links with high or low latency.Jitter, on the other hand, is “bad” because it is unpredictable and network protocols and applications built on them often degrade quickly when jitter rises. Jitter on a link is usually caused by more buffering, queuing, and general competition for resources in the routing hardware on either side of a connection. As an illustration, having a VoIP conversation over a network with high latency is annoying but manageable. Each party on a call will notice “lag”, but voice quality will not suffer. Jitter causes the conversation to garble, with packets arriving on top of each other and unpredictable glitches making the conversation unintelligible.Here’s the same jitter chart between Chicago and Newark, except this time, transiting the Cloudflare Global Private Backbone:Much better! 
Here we see a jitter measurement of 536μs (microseconds), almost eight times better than the measurement over a transit provider between the same two sites.The combination of fiber we control end-to-end and Argo Smart Routing allows us to unlock the full potential of Cloudflare’s backbone network. Argo’s routing system knows exactly how much capacity the backbone has available, and can manage how much additional data it tries to push through it. By controlling both ends of the pipe, and the pipe itself, we can guarantee certain performance characteristics and build those expectations into our routing models. The same principles do not apply to transit providers and networks we don’t control.Latency, be gone!Our private backbone is another tool available to us to improve performance on the Internet. Combining Argo’s cutting-edge machine learning and direct fiber connectivity between points on our large network allows us to route customer traffic with predictable, excellent performance.We’re excited to see the backbone and its impact continue to expand.Speaking personally as a product manager, Argo is really fun to work on. We make customers happier by making their websites, APIs, and networks faster. Enabling Argo allows customers to do that with one click, and see immediate benefit. Under the covers, huge investments in physical and virtual infrastructure begin working to accelerate traffic as it flows from its source to destination.  From an engineering perspective, our weekly goals and objectives are directly measurable — did we make our customers faster by doing additional engineering work? When we ship a new optimization to Argo and immediately see our charts move up and to the right, we know we’ve done our job.Building our physical private backbone is the latest thing we’ve done in our need for speed.Welcome to Speed Week!Activate Argo now, or contact sales to learn more!
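As a rough illustration of the jitter metric discussed above (not Cloudflare's measurement code), the sketch below computes average round-trip latency and one common jitter estimate, the mean absolute difference between consecutive RTT samples. The sample values are hypothetical.

// Illustrative only: summarize latency and jitter from RTT samples (in ms).
function summarizeRtt(samples) {
  const avg = samples.reduce((sum, rtt) => sum + rtt, 0) / samples.length;

  // Simple jitter estimate: mean absolute difference between consecutive samples.
  let diffs = 0;
  for (let i = 1; i < samples.length; i++) {
    diffs += Math.abs(samples[i] - samples[i - 1]);
  }
  const jitter = diffs / (samples.length - 1);

  return { avg, jitter };
}

// Hypothetical samples between two data centers:
console.log(summarizeRtt([27.1, 26.8, 31.4, 27.0, 26.9, 29.8])); // ≈28 ms average, ≈2.5 ms jitter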

Welcome to Speed Week!

CloudFlare Blog -

Every year, we celebrate Cloudflare’s birthday in September when we announce the products we’re releasing to help make the Internet better for everyone. We’re always building new and innovative products throughout the year, and having to pick five announcements for just one week of the year is always challenging. Last year we brought back Crypto Week where we shared new cryptography technologies we’re supporting and helping advance to help build a more secure Internet. Today I’m thrilled to announce we are launching our first-ever Speed Week and we want to showcase some of the things that we’re obsessed with to make the Internet faster for everyone.How much faster is faster?When we built the software stack that runs our network, we knew that both security and speed are important to our customers, and they should never have to compromise one for the other. All of the products we’re announcing this week will help our customers have a better experience on the Internet with as much as a 50% improvement in page load times for websites, getting the  most out of HTTP/2’s features (while only lifting a finger to click the button that enables them), finding the optimal route across the Internet, and providing the best live streaming video experience. I am constantly amazed by the talented engineers that work on the products that we launch all year round. I wish we could have weeks like this all year round to celebrate the wins they’ve accumulated as they tackle the difficult performance challenges of the web. We’re never content to settle for the status quo, and as our network continues to grow, so does our ability to improve our flagship products like Argo, or how we support rich media sites that rely heavily on images and video. The sheer scale of our network provides rich data that we can use to make better decisions on how we support our customers’ web properties. We also recognize that the Internet is evolving. New standards and protocols such as HTTP/2, QUIC, TLS 1.3 are great advances to improve web performance and security, but they can also be challenging for many developers to easily deploy. HTTP/2 was introduced in 2015 by the IETF, and was the first major revision of the HTTP protocol. While our customers have always been able to benefit from HTTP/2, we’re exploring how we can make that experience even faster.All things SpeedWant a sneak peek at what we’re announcing this week? I’m really excited to see this week’s announcements unfold. 
Each day we’ll post a new blog where we’ll share product announcements and customer stories that demonstrate how we’re making life better for our customers.
Monday: An inside view of how we’re making faster, smarter routing decisions
Tuesday: HTTP/2 can be faster, we’ll show you how
Wednesday: Simplify image management and speed up load times on any device
Thursday: How we’re improving our network for faster video streaming
Friday: How we’re helping make JavaScript faster
For bonus points, sign up for a live stream webinar where Kornel Lesiński and I will be hosted by Dennis Publishing to discuss the many challenges of the modern web in “Stronger, Better, Faster: Solving the performance challenges of the modern web.” The event will be held on Monday, May 13th at 11:00 am BST and you can either register for the live event or sign up for one of the on-demand sessions later in the week. I hope you’re just as excited about our upcoming Speed Week as I am. Be sure to subscribe to the blog to get daily updates sent to your inbox, ’cause who knows… there may even be “one last thing”.

eBPF can't count?!

CloudFlare Blog -

Grant mechanical calculating machine, public domain image
It is unlikely we can tell you anything new about the extended Berkeley Packet Filter, eBPF for short, if you've read all the great man pages, docs, guides, and some of our blogs out there. But we can tell you a war story, and who doesn't like those? This one is about how eBPF lost its ability to count for a while1. They say in our Austin, Texas office that all good stories start with a "y'all ain't gonna believe this…" tale. This one, though, starts with a post to the Linux netdev mailing list from Marek Majkowski after what I heard was a long night: Marek's findings were quite shocking - if you subtract two 64-bit timestamps in eBPF, the result is garbage. But only when running as an unprivileged user. From root all works fine. Huh. If you've seen Marek's presentation from the Netdev 0x13 conference, you know that we are using BPF socket filters as one of the defenses against simple, volumetric DoS attacks. So potentially getting your packet count wrong could be a Bad Thing™, and affect legitimate traffic. Let's try to reproduce this bug with a simplified eBPF socket filter that subtracts two 64-bit unsigned integers passed to it from user-space through a BPF map. The input for our BPF program comes from a BPF array map, so that the values we operate on are not known at build time. This allows for easy experimentation and prevents the compiler from optimizing out the operations. Starting small, eBPF, what is 2 - 1? View the code on our GitHub.
$ ./run-bpf 2 1
arg0 2 0x0000000000000002
arg1 1 0x0000000000000001
diff 1 0x0000000000000001
OK, eBPF, what is 2^32 - 1?
$ ./run-bpf $[2**32] 1
arg0 4294967296 0x0000000100000000
arg1 1 0x0000000000000001
diff 18446744073709551615 0xffffffffffffffff
Wrong! But if we ask nicely with sudo:
$ sudo ./run-bpf $[2**32] 1
[sudo] password for jkbs:
arg0 4294967296 0x0000000100000000
arg1 1 0x0000000000000001
diff 4294967295 0x00000000ffffffff
Who is messing with my eBPF?
When computers stop subtracting, you know something big is up. We called for reinforcements. Our colleague Arthur Fabre quickly noticed something is off when you examine the eBPF code loaded into the kernel. It turns out the kernel doesn't actually run the eBPF it's supplied - it sometimes rewrites it first. Any sane programmer would expect 64-bit subtraction to be expressed as a single eBPF instruction:
$ llvm-objdump -S -no-show-raw-insn -section=socket1 bpf/filter.o
…
20: 1f 76 00 00 00 00 00 00 r6 -= r7
…
However, that's not what the kernel actually runs. Apparently after the rewrite the subtraction becomes a complex, multi-step operation. To see what the kernel is actually running we can use the little-known bpftool utility. First, we need to load our BPF:
$ ./run-bpf --stop-after-load 2 1
[2]+ Stopped ./run-bpf 2 1
Then list all BPF programs loaded into the kernel with bpftool prog list:
$ sudo bpftool prog list
…
5951: socket_filter name filter_alu64 tag 11186be60c0d0c0f gpl
 loaded_at 2019-04-05T13:01:24+0200 uid 1000
 xlated 424B jited 262B memlock 4096B map_ids 28786
The most recently loaded socket_filter must be our program (filter_alu64). Now we know its id is 5951 and we can list its bytecode with:
$ sudo bpftool prog dump xlated id 5951
…
33: (79) r7 = *(u64 *)(r0 +0)
34: (b4) (u32) r11 = (u32) -1
35: (1f) r11 -= r6
36: (4f) r11 |= r6
37: (87) r11 = -r11
38: (c7) r11 s>>= 63
39: (5f) r6 &= r11
40: (1f) r6 -= r7
41: (7b) *(u64 *)(r10 -16) = r6
…
bpftool can also display the JITed code with: bpftool prog dump jited id 5951. 
As the xlated dump above shows, the subtraction is replaced with a series of opcodes. That is, unless you are root. When running from root, all is good

$ sudo ./run-bpf --stop-after-load 0 0
[1]+ Stopped sudo ./run-bpf --stop-after-load 0 0
$ sudo bpftool prog list | grep socket_filter
659: socket_filter name filter_alu64 tag 9e7ffb08218476f3 gpl
$ sudo bpftool prog dump xlated id 659
…
31: (79) r7 = *(u64 *)(r0 +0)
32: (1f) r6 -= r7
33: (7b) *(u64 *)(r10 -16) = r6
…

If you've spent any time using eBPF, you must have experienced first-hand the dreaded eBPF verifier. It's a merciless judge of all eBPF code that will reject any programs that it deems not worthy of running in kernel-space. What perhaps nobody has told you before, and what might come as a surprise, is that the very same verifier will actually also rewrite and patch up your eBPF code as needed to make it safe. The problems with subtraction were introduced by an inconspicuous security fix to the verifier. The patch in question first landed in Linux 5.0 and was backported to the 4.20.6 stable and 4.19.19 LTS kernels. The commit message, over 2,000 words long, doesn't spare you any details on the attack vector it targets. The mitigation stems from the CVE-2019-7308 vulnerability discovered by Jann Horn at Project Zero, which exploits pointer arithmetic, i.e. adding a scalar value to a pointer, to trigger speculative memory loads from out-of-bounds addresses. Such speculative loads change the CPU cache state and can be used to mount a Spectre variant 1 attack. To mitigate it, the eBPF verifier rewrites any arithmetic operations on pointer values in such a way that the result is always a memory location within bounds. The patch demonstrates how arithmetic operations on pointers get rewritten, and we can spot a familiar pattern there. Wait a minute… What pointer arithmetic? We are just trying to subtract two scalar values. How come the mitigation kicks in? It shouldn't. It's a bug. The eBPF verifier keeps track of what kind of values the ALU is operating on, and in this corner case the state was ignored. Why is running BPF as root fine, you ask? If your program has CAP_SYS_ADMIN privileges, side-channel mitigations don't apply. As root you already have access to kernel address space, so nothing new can leak through BPF. After our report, the fix quickly landed in the v5.0 kernel and was backported to stable kernels 4.20.15 and 4.19.28. Kudos to Daniel Borkmann for getting the fix out fast. However, kernel upgrades are hard, and in the meantime we were left with code running in production that was not doing what it was supposed to.

32-bit ALU to the rescue
As one of the eBPF maintainers pointed out, 32-bit arithmetic operations are not affected by the verifier bug. This opens the door to a workaround. eBPF registers, r0..r10, are 64 bits wide, but you can also access just the lower 32 bits, which are exposed as subregisters w0..w10. You can operate on the 32-bit subregisters using the BPF ALU32 instruction subset. LLVM 7+ can generate eBPF code that uses this instruction subset.
Of course, you need to ask it nicely, with the trivial -Xclang -target-feature -Xclang +alu32 toggle:

$ cat sub32.c
#include "common.h"

u32 sub32(u32 x, u32 y)
{
    return x - y;
}

$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub32.c
$ llvm-objdump -S -no-show-raw-insn sub32.o
…
sub32:
0: bc 10 00 00 00 00 00 00 w0 = w1
1: 1c 20 00 00 00 00 00 00 w0 -= w2
2: 95 00 00 00 00 00 00 00 exit

The 0x1c opcode of instruction #1, which can be broken down as BPF_ALU | BPF_X | BPF_SUB (read more in the kernel docs), is the 32-bit subtraction between registers we are looking for, as opposed to the regular 64-bit subtract operation 0x1f = BPF_ALU64 | BPF_X | BPF_SUB, which will get rewritten. Armed with this knowledge we can borrow a page from bignum arithmetic and subtract 64-bit numbers using just 32-bit ops:

u64 sub64(u64 x, u64 y)
{
    u32 xh, xl, yh, yl;
    u32 hi, lo;

    xl = x;
    yl = y;
    lo = xl - yl;

    xh = x >> 32;
    yh = y >> 32;
    hi = xh - yh - (lo > xl); /* underflow? */

    return ((u64)hi << 32) | (u64)lo;
}

This code compiles as expected on normal architectures, like x86-64 or ARM64, but the BPF Clang target plays by its own rules:

$ clang -O2 -target bpf -Xclang -target-feature -Xclang +alu32 -c sub64.c -o - \
  | llvm-objdump -S -
…
13: 1f 40 00 00 00 00 00 00 r0 -= r4
14: 1f 30 00 00 00 00 00 00 r0 -= r3
15: 1f 21 00 00 00 00 00 00 r1 -= r2
16: 67 00 00 00 20 00 00 00 r0 <<= 32
17: 67 01 00 00 20 00 00 00 r1 <<= 32
18: 77 01 00 00 20 00 00 00 r1 >>= 32
19: 4f 10 00 00 00 00 00 00 r0 |= r1
20: 95 00 00 00 00 00 00 00 exit

Apparently the compiler decided it was better to operate on 64-bit registers and discard the upper 32 bits. Thus we weren't able to get rid of the problematic 0x1f opcode. Annoying - back to square one.

Surely a bit of IR will do?
The problem was in the Clang frontend - compiling C to IR. We know that the BPF "assembly" backend for LLVM can generate bytecode that uses ALU32 instructions. Maybe if we tweak the Clang compiler's output just a little we can achieve what we want. This means we have to get our hands dirty with the LLVM Intermediate Representation (IR). If you haven't heard of LLVM IR before, now is a good time to do some reading2. In short, the LLVM IR is what Clang produces and the LLVM BPF backend consumes. Time to write IR by hand! Here's a hand-tweaked IR variant of our sub64() function:

define dso_local i64 @sub64_ir(i64, i64) local_unnamed_addr #0 {
  %3 = trunc i64 %0 to i32   ; xl = (u32) x;
  %4 = trunc i64 %1 to i32   ; yl = (u32) y;
  %5 = sub i32 %3, %4        ; lo = xl - yl;
  %6 = zext i32 %5 to i64
  %7 = lshr i64 %0, 32       ; tmp1 = x >> 32;
  %8 = lshr i64 %1, 32       ; tmp2 = y >> 32;
  %9 = trunc i64 %7 to i32   ; xh = (u32) tmp1;
  %10 = trunc i64 %8 to i32  ; yh = (u32) tmp2;
  %11 = sub i32 %9, %10      ; hi = xh - yh
  %12 = icmp ult i32 %3, %5  ; tmp3 = xl < lo
  %13 = zext i1 %12 to i32
  %14 = sub i32 %11, %13     ; hi -= tmp3
  %15 = zext i32 %14 to i64
  %16 = shl i64 %15, 32      ; tmp2 = hi << 32
  %17 = or i64 %16, %6       ; res = tmp2 | (u64)lo
  ret i64 %17
}

It may not be pretty, but it does produce the desired BPF code when compiled3. You will likely find the LLVM IR reference helpful when deciphering it. And voilà! Our first working solution that produces correct results:

$ ./run-bpf -filter ir $[2**32] 1
arg0 4294967296 0x0000000100000000
arg1 1 0x0000000000000001
diff 4294967295 0x00000000ffffffff

Actually using this hand-written IR function from C is tricky. See our code on GitHub.

public domain image by Sergei Frolov

The final trick
Hand-written IR does the job.
The downside is that linking IR modules to your C modules is hard. Fortunately, there is a better way. You can persuade Clang to stick to 32-bit ALU ops in the generated IR. We've already seen the problem. To recap, if we ask Clang to subtract 32-bit integers, it will operate on 64-bit values and throw away the top 32 bits. Putting C, IR, and eBPF side by side helps visualize this. The trick to get around it is to declare the 32-bit variable that holds the result as volatile. You might already know the volatile keyword if you've written Unix signal handlers. It basically tells the compiler that the value of the variable may change under its feet, so it should not reorder loads (reads) from it; and that stores (writes) to it might have side-effects, so changing their order, or eliminating them by skipping the write to memory, is not allowed either. Using volatile makes Clang emit special loads and/or stores at the IR level, which on the eBPF level translates to writing/reading the value from memory (the stack) on every access. While this doesn't sound related to the problem at hand, there is a surprising side-effect: with volatile access the compiler doesn't promote the subtraction to 64 bits! Don't ask me why, although I would love to hear an explanation. For now, consider this a hack. One that does not come for free - there is the overhead of going through the stack on each read/write. However, if we play our cards right we just might reduce it a little. We don't actually need the volatile load or store to happen - we just want the side effect. So instead of declaring the value as volatile, which implies that both reads and writes are volatile, let's try to make only the writes volatile with the help of a macro:

/* Emits a "store volatile" in LLVM IR */
#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &(rhs) = (lhs))

If this macro looks strangely familiar, it's because it does the same thing as the WRITE_ONCE() macro in the Linux kernel. Applying it to our example gives another hacky but working solution (a sketch of what this looks like follows the footnotes below). Pick your poison.

CC BY-SA 3.0 image by ANKAWÜ

So there you have it - from C, to IR, and back to C to hack around a bug in the eBPF verifier and be able to subtract 64-bit integers again. Usually you won't have to dive into LLVM IR or assembly to make use of eBPF. But it does help to know a little about it when things don't work as expected. Did I mention that 64-bit addition is also broken? Have fun fixing it!

1 Okay, it was more like 3 months' time until the bug was discovered and fixed.
2 Some even think that it is better than assembly.
3 How do we know? The litmus test is to look for statements matching r[0-9] [-+]= r[0-9] in BPF assembly.
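Returning to the ST_V() macro above, applying it to sub64() looks roughly like the sketch below. The u32/u64 typedefs stand in for the common.h header used earlier, and whether Clang keeps the arithmetic in ALU32 form still depends on the compiler version and the +alu32 toggle, so treat this as illustrative rather than the definitive fix.

/* Sketch: sub64() rewritten to use the ST_V() volatile-store macro. */
typedef unsigned int u32;          /* stand-ins for the post's common.h */
typedef unsigned long long u64;

#define ST_V(rhs, lhs) (*(volatile typeof(rhs) *) &(rhs) = (lhs))

u64 sub64_v(u64 x, u64 y)
{
        u32 xh, xl, yh, yl;
        u32 hi, lo;

        xl = x;
        yl = y;
        ST_V(lo, xl - yl);             /* volatile store keeps this 32-bit */

        xh = x >> 32;
        yh = y >> 32;
        ST_V(hi, xh - yh - (lo > xl)); /* borrow from the low word */

        return ((u64)hi << 32) | (u64)lo;
}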

Unit Testing Workers, in Cloudflare Workers

CloudFlare Blog -

We recently wrote about unit testing Cloudflare Workers within a mock environment using CloudWorker (a Node.js based mock Cloudflare Worker environment created by Dollar Shave Club's engineering team). See Unit Testing Worker Functions. Even though Cloudflare Workers deploy globally within seconds, software developers often choose to use local mock environments to have the fastest possible feedback loop while developing on their local machines. CloudWorker is perfect for this use case, but as it is still a mock environment it does not guarantee an identical runtime or environment with all Cloudflare Worker APIs and features. This gap can make developers uneasy as they do not have 100% certainty that their tests will succeed in the production environment. In this post, we're going to demonstrate how to generate a Cloudflare Worker compatible test harness which can execute mocha unit tests directly in the production Cloudflare environment.

Directory Setup
Create a new folder for your project, change into it and run npm init to initialise the package.json file. Run mkdir -p src && mkdir -p test/lib && mkdir dist to create folders used by the next steps. Your folder should look like this:

.
./dist
./src/worker.js
./test
./test/lib
./package.json

npm install --save-dev mocha exports-loader webpack webpack-cli

This will install Mocha (the unit testing framework), Webpack (a tool used to package the code into a single Worker script) and Exports Loader (a tool used by Webpack to import the Worker script into the Worker-based Mocha environment).

npm install --save-dev git+https://github.com/obezuk/mocha-loader.git

This will install a modified version of Webpack's mocha loader. It has been modified to support the Web Worker environment type. We are excited to see Web Worker support merged into Mocha Loader, so please vote for our pull request here: https://github.com/webpack-contrib/mocha-loader/pull/77

Example Script
Create your Worker script in ./src/worker.js:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function addition(a, b) {
  return a + b
}

async function handleRequest(request) {
  const added = await addition(1,3)
  return new Response(`The Sum is ${added}!`)
}

Add Tests
Create your unit tests in ./test/test.test.js:

const assert = require('assert')

describe('Worker Test', function() {
  it('returns a body that says The Sum is 4', async function () {
    let url = new URL('https://worker.example.com')
    let req = new Request(url)
    let res = await handleRequest(req)
    let body = await res.text()
    assert.equal(body, 'The Sum is 4!')
  })

  it('does addition properly', async function() {
    let res = await addition(1, 1)
    assert.equal(res, 2)
  })
})

Mocha in Worker Test Harness
In order to execute mocha and unit tests within Cloudflare Workers we are going to build a Test Harness.
The Test Harness script looks a lot like a normal Worker script but integrates your ./src/worker.js and ./test/test.test.js into a script which is capable of executing the Mocha unit tests within the Cloudflare Worker runtime.Create the below script in ./test/lib/serviceworker-mocha-harness.js.import 'mocha'; import 'mocha-loader!../test.test.js'; var testResults; async function mochaRun() { return new Promise(function (accept, reject) { var runner = mocha.run(function () { testResults = runner.testResults; accept(); }); }); } addEventListener('fetch', event => { event.respondWith(handleMochaRequest(event.request)) }); async function handleMochaRequest(request) { if (!testResults) { await mochaRun(); } var headers = new Headers({ "content-type": "application/json" }) var statusCode = 200; if (testResults.failures != 0) { statusCode = 500; } return new Response(JSON.stringify(testResults), { "status": statusCode, "headers": headers }); } Object.assign(global, require('exports-loader?handleRequest,addition!../../src/worker.js')); Mocha Webpack ConfigurationCreate a new file in the project root directory called: ./webpack.mocha.config.js. This file is used by Webpack to bundle the test harness, worker script and unit tests into a single script that can be deployed to Cloudflare.module.exports = { target: 'webworker', entry: "./test/lib/serviceworker-mocha-harness.js", mode: "development", optimization: { minimize: false }, performance: { hints: false }, node: { fs: 'empty' }, module: { exprContextCritical: false }, output: { path: __dirname + "/dist", publicPath: "dist", filename: "worker-mocha-harness.js" } }; Your file structure should look like (excluding node_modules):. ./dist ./src/worker.js ./test/test.test.js ./test/lib/serviceworker-mocha-harness.js ./package.json ./package-lock.json ./webpack.mocha.config.js Customising the test harness.If you wish to extend the test harness to support your own test files you will need to add additional test imports to the top of the script:import 'mocha-loader!/* TEST FILE NAME HERE */' If you wish to import additional functions from your Worker script into the test harness environment you will need to add them comma separated into the last line:Object.assign(global, require('exports-loader?/* COMMA SEPARATED FUNCTION NAMES HERE */!../../src/worker.js')); Running the test harnessDeploying and running the test harness is identical to deploying any other Worker script with Webpack.Modify the scripts section of package.json to include the build-harness command."scripts": { "build-harness": "webpack --config webpack.mocha.config.js -p --progress --colors" } In the project root directory run the command npm run build-harness to generate and bundle your Worker script, Mocha and your unit tests into ./dist/worker-mocha-harness.js.Upload this script to a test Cloudflare workers route and run curl --fail https://test.example.org. 
If the unit tests are successful it will return a 200 response, and if the unit tests fail a 500 response.Integrating into an existing CI/CD pipelineYou can integrate Cloudflare Workers and the test harness into your existing CI/CD pipeline by using our API: https://developers.cloudflare.com/workers/api/.The test harness returns detailed test reports in JSON format:Example Success Response{ "stats": { "suites": 1, "tests": 2, "passes": 2, "pending": 0, "failures": 0, "start": "2019-04-23T06:24:33.492Z", "end": "2019-04-23T06:24:33.590Z", "duration": 98 }, "tests": [ { "title": "returns a body that says The Sum is 4", "fullTitle": "Worker Test returns a body that says The Sum is 4", "duration": 0, "currentRetry": 0, "err": {} }, { "title": "does addition properly", "fullTitle": "Worker Test does addition properly", "duration": 0, "currentRetry": 0, "err": {} } ], "pending": [], "failures": [], "passes": [ { "title": "returns a body that says The Sum is 4", "fullTitle": "Worker Test returns a body that says The Sum is 4", "duration": 0, "currentRetry": 0, "err": {} }, { "title": "does addition properly", "fullTitle": "Worker Test does addition properly", "duration": 0, "currentRetry": 0, "err": {} } ] } Example Failure Response{ "stats": { "suites": 1, "tests": 2, "passes": 0, "pending": 0, "failures": 2, "start": "2019-04-23T06:25:52.100Z", "end": "2019-04-23T06:25:52.170Z", "duration": 70 }, "tests": [ { "title": "returns a body that says The Sum is 4", "fullTitle": "Worker Test returns a body that says The Sum is 4", "duration": 0, "currentRetry": 0, "err": { "name": "AssertionError", "actual": "The Sum is 5!", "expected": "The Sum is 4!", "operator": "==", "message": "'The Sum is 5!' == 'The Sum is 4!'", "generatedMessage": true, "stack": "AssertionError: 'The Sum is 5!' == 'The Sum is 4!'\n at Context.<anonymous> (worker.js:19152:16)" } }, { "title": "does addition properly", "fullTitle": "Worker Test does addition properly", "duration": 0, "currentRetry": 0, "err": { "name": "AssertionError", "actual": "3", "expected": "2", "operator": "==", "message": "3 == 2", "generatedMessage": true, "stack": "AssertionError: 3 == 2\n at Context.<anonymous> (worker.js:19157:16)" } } ], "pending": [], "failures": [ { "title": "returns a body that says The Sum is 4", "fullTitle": "Worker Test returns a body that says The Sum is 4", "duration": 0, "currentRetry": 0, "err": { "name": "AssertionError", "actual": "The Sum is 5!", "expected": "The Sum is 4!", "operator": "==", "message": "'The Sum is 5!' == 'The Sum is 4!'", "generatedMessage": true, "stack": "AssertionError: 'The Sum is 5!' == 'The Sum is 4!'\n at Context.<anonymous> (worker.js:19152:16)" } }, { "title": "does addition properly", "fullTitle": "Worker Test does addition properly", "duration": 0, "currentRetry": 0, "err": { "name": "AssertionError", "actual": "3", "expected": "2", "operator": "==", "message": "3 == 2", "generatedMessage": true, "stack": "AssertionError: 3 == 2\n at Context.<anonymous> (worker.js:19157:16)" } } ], "passes": [] } This is really powerful and can allow you to execute your unit tests directly in the Cloudflare runtime, giving you more confidence before releasing your code into production. We hope this was useful and welcome any feedback.
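Because the harness returns both an HTTP status and the JSON report shown above, a CI step only needs to fetch the route and inspect stats.failures. As a rough illustration (the URL below is a placeholder for wherever you deployed worker-mocha-harness.js), a small Node script could gate the build like this:

// Fetch the deployed test harness and fail the CI job on any test failure.
const https = require('https')

const HARNESS_URL = 'https://test.example.org/' // placeholder route

https.get(HARNESS_URL, res => {
  let body = ''
  res.on('data', chunk => (body += chunk))
  res.on('end', () => {
    const report = JSON.parse(body)
    console.log(`passes: ${report.stats.passes}, failures: ${report.stats.failures}`)
    for (const failure of report.failures) {
      console.error(`FAIL ${failure.fullTitle}: ${failure.err.message}`)
    }
    // The harness also signals failure with a 500 status code.
    process.exit(res.statusCode === 200 && report.stats.failures === 0 ? 0 : 1)
  })
}).on('error', err => {
  console.error(err)
  process.exit(1)
})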

The Serverlist Newsletter: A big week of serverless announcements, serverless Rust with WASM, cloud cost hacking, and more

CloudFlare Blog -

Check out our fourth edition of The Serverlist. Get the latest scoop on the serverless space, get your hands dirty with new developer tutorials, engage in conversations with other serverless developers, and find upcoming meetups and conferences to attend. Sign up to have The Serverlist sent directly to your mailbox.

Rapid Development of Serverless Chatbots with Cloudflare Workers and Workers KV

CloudFlare Blog -

I'm the Product Manager for the Application Services team here at Cloudflare. We recently identified a need for a new tool around service ownership. In a fast-growing engineering organization, ownership of services changes fairly frequently. Many cycles get burned in chat with questions like "Who owns service x now?" Whilst it's easy to see how a tool like this saves a few seconds per day for the asker and askee, and saves on some mental context switches, the time saved is unlikely to add up to the cost of development and maintenance.

5 minutes per day x 260 work days = 1300 minutes per year; 1300 minutes / 60 ≈ 20 person-hours per year

So a 20-hour investment in that tool would pay itself back within a year, valuing everyone's time the same. While we've made great strides in improving the efficiency of building tools at Cloudflare, 20 hours is a stretch for an end-to-end build, deploy and operation of a new tool.

Enter Cloudflare Workers + Workers KV
The more I use Serverless and Workers, the more I'm struck with the benefits of:
1. Reduced operational overhead. When I upload a Worker, it's automatically distributed to 175+ data centers. I don't have to be worried about uptime - it will be up, and it will be fast.
2. Reduced dev time. With operational overhead largely removed, I'm able to focus purely on code.
A constrained problem space like this lends itself really well to Workers. I reckon we can knock this out in well under 20 hours.

Requirements
At Cloudflare, people ask these questions in Chat, so that's a natural interface to service ownership. Here's the spec:

Use Case | Input | Output
Add | @ownerbot add Jira IT http://chat.google.com/room/ABC123 | Service added
Delete | @ownerbot delete Jira | Service deleted
Question | @ownerbot Kibana | SRE Core owns Kibana. The room is: http://chat.google.com/ABC123
Export | @ownerbot export | [{name: "Kibana", owner: "SRE Core"...}]

Hello @ownerbot
Following the Hangouts Chat API Guide, let's start with a hello world bot. To configure the bot, go to the Publish page and scroll down to the Enable The API button:
1. Enter the bot name
2. Download the private key json file
3. Go to the API Console
4. Search for the Hangouts Chat API (Note: not the Google+ Hangouts API)
5. Click Configuration on the left menu
6. Fill out the form as per below [1]
7. Use a hard-to-guess URL. I generate a guid and use that in the url. The URL will be the route you associate with your Worker in the Dashboard
8. Click Save

So Google Chat should know about our bot now. Back in Google Chat, click in the "Find people, rooms, bots" textbox and choose "Message a Bot". Your bot should show up in the search: It won't be too useful just yet, as we need to create our Worker to receive the messages and respond!

The Worker
In the Workers dashboard, create a script and associate it with the route you defined in step #7 (the one with the guid). It should look something like below. [2] The Google Chatbot interface is pretty simple, but weirdly obfuscated in the Hangouts API guide IMHO. You have to reverse-engineer the Python example.
Basically, if we message our bot like @ownerbot-blog Kibana, we'll get a message like this: { "type": "MESSAGE", "message": { "argumentText": "Kibana" } } To respond, we need to respond with 200 OK and JSON body like this: content-length: 27 content-type: application/json {"text":"Hello chat world"} So, the minimum Chatbot Worker looks something like this: addEventListener('fetch', event => { event.respondWith(process(event.request)) }); function process(request) { let body = { text: "Hello chat world" } return new Response(JSON.stringify(body), { status: 200, headers: { "Content-Type": "application/json", "Cache-Control": "no-cache" } }); } Save and deploy that and we should be able message our bot: Success! Implementation OK, on to the meat of the code. Based on the requirements, I see a need for an AddCommand, QueryCommand, DeleteCommand and HelpCommand. I also see some sort of ServiceDirectory that knows how to add, delete and retrieve services. I created a CommandFactory which accepts a ServiceDirectory, as well as an implementation of a KV store, which will be Workers KV in production, but I'll mock out in tests. class CommandFactory { constructor(serviceDirectory, kv) { this.serviceDirectory = serviceDirectory; this.kv = kv; } create(argumentText) { let parts = argumentText.split(' '); let primary = parts[0]; switch (primary) { case "add": return new AddCommand(argumentText, this.serviceDirectory, this.kv); case "delete": return new DeleteCommand(argumentText, this.serviceDirectory, this.kv); case "help": return new HelpCommand(argumentText, this.serviceDirectory, this.kv); default: return new QueryCommand(argumentText, this.serviceDirectory, this.kv); } } } OK, so if we receive a message like @ownerbot add, we'll interpret it as an AddCommand, but if it's not something we recognize, we'll assume it's a QueryCommand like @ownerbot Kibana which makes it easy to parse commands. OK, our commands need a service directory, which will look something like this: class ServiceDirectory { get(serviceName) {...} async add(service) {...} async delete(serviceName) {...} find(serviceName) {...} getNames() {...} } Let's build some commands. Oh, and my chatbot is going to be Ultima IV themed, because... reasons. class AddCommand extends Command { async respond() { let cmdParts = this.commandParts; if (cmdParts.length !== 6) { return new OwnerbotResponse("Adding a service requireth Name, Owner, Room Name and Google Chat Room Url.", false); } let name = this.commandParts[1]; let owner = this.commandParts[2]; let room = this.commandParts[3]; let url = this.commandParts[4]; let aliasesPart = this.commandParts[5]; let aliases = aliasesPart.split(' '); let service = { name: name, owner: owner, room: room, url: url, aliases: aliases } await this.serviceDirectory.add(service); return new OwnerbotResponse(`My codex of knowledge has expanded to contain knowledge of ${name}. Congratulations virtuous Paladin.`); } } The nice thing about the Command pattern for chatbots, is you can encapsulate the logic of each command for testing, as well as compose series of commands together to test out conversations. Later, we could extend it to support undo. 
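The ServiceDirectory above is left as an interface sketch. One rough way it could sit on top of Workers KV might look like this; the single-key JSON layout and the kv.get/kv.put interface are my assumptions (in production kv would be the Workers KV namespace, in tests the mock store):

class ServiceDirectory {
  constructor(kv) {
    this.kv = kv
    this.services = {}
  }

  async init() {
    // Load the stored directory, if any.
    const stored = await this.kv.get('services')
    this.services = stored ? JSON.parse(stored) : {}
  }

  get(serviceName) {
    // Exact name match first, then fall back to alias lookup.
    return this.services[serviceName] || this.find(serviceName)
  }

  find(serviceName) {
    return Object.values(this.services).find(
      s => (s.aliases || []).includes(serviceName)
    )
  }

  async add(service) {
    this.services[service.name] = service
    await this.kv.put('services', JSON.stringify(this.services))
  }

  async delete(serviceName) {
    delete this.services[serviceName]
    await this.kv.put('services', JSON.stringify(this.services))
  }

  getNames() {
    return Object.keys(this.services)
  }
}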
Let's test the AddCommand:

it('requires all args', async function() {
  let addCmd = new AddCommand("add AdminPanel 'Internal Tools' 'Internal Tools'", dir, kv); //missing url
  let res = await addCmd.respond();
  console.log(res.text);
  assert.equal(res.success, false, "Adding with missing args should fail");
});

it('returns success for all args', async function() {
  let addCmd = new AddCommand("add AdminPanel 'Internal Tools' 'Internal Tools Room' 'http://chat.google.com/roomXYZ'", dir, kv);
  let res = await addCmd.respond();
  console.debug(res.text);
  assert.equal(res.success, true, "Should have succeeded with all args");
});

$ mocha -g "AddCommand"
AddCommand
  add
    ✓ requires all args
    ✓ returns success for all args
2 passing (19ms)

So far so good. But adding commands to our ownerbot isn't going to be so useful unless we can query them.

class QueryCommand extends Command {
  async respond() {
    let service = this.serviceDirectory.get(this.argumentText);
    if (service) {
      return new OwnerbotResponse(`${service.owner} owns ${service.name}. Seeketh thee room ${service.room} - ${service.url})`);
    }
    let serviceNames = this.serviceDirectory.getNames().join(", ");
    return new OwnerbotResponse(`I knoweth not of that service. Thou mightst asketh me of: ${serviceNames}`);
  }
}

Let's write a test that runs an AddCommand followed by a QueryCommand:

describe('QueryCommand', function() {
  let kv = new MockKeyValueStore();
  let dir = new ServiceDirectory(kv);
  // init() is async, so run it in a hook rather than awaiting at describe level
  before(async function() {
    await dir.init();
  });

  it('Returns added services', async function() {
    let addCmd = new AddCommand("add AdminPanel 'Internal Tools' 'Internal Tools Room' url 'alias' abc123", dir, kv);
    await addCmd.respond();
    let queryCmd = new QueryCommand("AdminPanel", dir, kv);
    let res = await queryCmd.respond();
    assert.equal(res.success, true, "Should have succeeded");
    assert(res.text.indexOf('Internal Tools') > -1, "Should have returned the team name in the query response");
  })
})

Demo
A lot of the code has been elided for brevity, but you can view the full source on GitHub. Let's take it for a spin!

Learnings
Some of the things I learned during the development of @ownerbot were:
Chatbots are an awesome use case for Serverless. You can deploy and not worry again about the infrastructure.
Workers KV extends the range of useful chat bots to include stateful bots like @ownerbot.
The Command pattern provides a useful way to encapsulate the parsing of, and responding to, commands in a chat bot.
In Part 2 we'll add authentication to ensure we're only responding to requests from our instance of Google Chat.

For simplicity, I'm going to use a static shared key, but Google have recently rolled out a more secure method for verifying the caller's authenticity, which we'll expand on in Part 2. ↩︎
This UI is the multiscript version available to Enterprise customers. You can still implement the bot with a single Worker; you'll just need to recognize and route requests to your chatbot code. ↩︎

We want to host your technical meetup at Cloudflare London

CloudFlare Blog -

Cloudflare recently moved to County Hall, the building just behind the London Eye. We have a very large event space which we would love to open up to the developer community. If you organize a technical meetup, we'd love to host you. If you attend technical meetups, please share this post with the meetup organizers. We're on the upper floor of County HallAbout the spaceOur event space is large enough to hold up to 280 attendees, but can also be used for a small group as well. There is a large entry way for people coming into our 6th floor lobby where check-in may be managed. Once inside the event space, you will see a large, open kitchen area which can be used to set up event food and beverages. Beyond that is Cloudflare's all-hands space, which may be used for your events. We have several gender-neutral toilets for your guests' use as well.LobbyYou may welcome your guests here. The event space is just to the left of this spot.Event spaceThis space may be used for talks, workshops, or large panels. We can rearrange seating, based on the format of your meetup.Food & beveragesCloudflare will gladly provide the light snacks and beverages including beer, wine or cider, and sodas or juices we have in our kitchen area. If your attendees would like to have additional food, you are welcome to order additional food. If your meetup is eligible, we may even be able to sponsor your additional food orders. Check out our pizza reimbursement rules for more details. Our kitchen area is attached to the event spaceHow to book the spaceIf this all sounds good to you and you're interested in hosting your technical meetup at Cloudflare London, please fill out this form with all the details of your event. If you'd like a tour of the space before booking it, I will gladly show you around and go through date options with you. Host at Cloudflare »You may also email me directly with any questions you have. I'll hope to meet and host you soon!Want to host an event at Cloudflare's San Francisco office?We also warmly welcome meetups in our San Francisco all-hands space. Please read and submit this form if your meetup is Bay Area-based.

xdpcap: XDP Packet Capture

CloudFlare Blog -

Our servers process a lot of network packets, be it legitimate traffic or large denial of service attacks. To do so efficiently, we’ve embraced eXpress Data Path (XDP), a Linux kernel technology that provides a high performance mechanism for low level packet processing. We’re using it to drop DoS attack packets with L4Drop, and also in our new layer 4 load balancer. But there’s a downside to XDP: because it processes packets before the normal Linux network stack sees them, packets redirected or dropped are invisible to regular debugging tools such as tcpdump.To address this, we built a tcpdump replacement for XDP, xdpcap. We are open sourcing this tool: the code and documentation are available on GitHub.xdpcap uses our classic BPF (cBPF) to eBPF or C compiler, cbpfc, which we are also open sourcing: the code and documentation are available on GitHub.CC BY 4.0 image by Christoph MüllerTcpdump provides an easy way to dump specific packets of interest. For example, to capture all IPv4 DNS packets, one could: $ tcpdump ip and udp port 53 xdpcap reuses the same syntax! xdpcap can write packets to a pcap file: $ xdpcap /path/to/hook capture.pcap "ip and udp port 53" XDPAborted: 0/0 XDPDrop: 0/0 XDPPass: 254/0 XDPTx: 0/0 (received/matched packets) XDPAborted: 0/0 XDPDrop: 0/0 XDPPass: 995/1 XDPTx: 0/0 (received/matched packets) Or write the pcap to stdout, and decode the packets with tcpdump: $ xdpcap /path/to/hook - "ip and udp port 53" | sudo tcpdump -r - reading from file -, link-type EN10MB (Ethernet) 16:18:37.911670 IP 1.1.1.1 > 1.2.3.4.21563: 26445$ 1/0/1 A 93.184.216.34 (56) The remainder of this post explains how we built xdpcap, including how /path/to/hook/ is used to attach to XDP programs. tcpdump To replicate tcpdump, we first need to understand its inner workings. Marek Majkowski has previously written a detailed post on the subject. Tcpdump exposes a high level filter language, pcap-filter, to specify which packets are of interest. Reusing our earlier example, the following filter expression captures all IPv4 UDP packets to or from port 53, likely DNS traffic: ip and udp port 53 Internally, tcpdump uses libpcap to compile the filter to classic BPF (cBPF). cBPF is a simple bytecode language to represent programs that inspect the contents of a packet. A program returns non-zero to indicate that a packet matched the filter, and zero otherwise. The virtual machine that executes cBPF programs is very simple, featuring only two registers, a and x. There is no way of checking the length of the input packet[1]; instead any out of bounds packet access will terminate the cBPF program, returning 0 (no match). The full set of opcodes are listed in the Linux documentation. Returning to our example filter, ip and udp port 53 compiles to the following cBPF program, expressed as an annotated flowchart: Example cBPF filter flowchart Tcpdump attaches the generated cBPF filter to a raw packet socket using a setsockopt system call with SO_ATTACH_FILTER. The kernel runs the filter on every packet destined for the socket, but only delivers matching packets. Tcpdump displays the delivered packets, or writes them to a pcap capture file for later analysis. 
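To make that attachment step concrete, a minimal sketch of the SO_ATTACH_FILTER mechanism looks roughly like this. The filter here is a trivial hand-written "accept everything" cBPF program rather than a compiled pcap-filter expression, and opening an AF_PACKET socket requires root or CAP_NET_RAW:

#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/if_ether.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
        /* BPF_RET | BPF_K with k = 0xffffffff: accept the whole packet. */
        struct sock_filter insns[] = {
                { BPF_RET | BPF_K, 0, 0, 0xffffffff },
        };
        struct sock_fprog prog = {
                .len    = sizeof(insns) / sizeof(insns[0]),
                .filter = insns,
        };

        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
                perror("socket");
                return 1;
        }
        if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
                perror("setsockopt(SO_ATTACH_FILTER)");
                return 1;
        }
        /* From here on, recv() on fd only sees packets the filter accepts. */
        return 0;
}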
xdpcap In the context of XDP, our tcpdump replacement should: Accept filters in the same filter language as tcpdump Dynamically instrument XDP programs of interest Expose matching packets to userspace XDP XDP uses an extended version of the cBPF instruction set, eBPF, to allow arbitrary programs to run for each packet received by a network card, potentially modifying the packets. A stringent kernel verifier statically analyzes eBPF programs, ensuring that memory bounds are checked for every packet load. eBPF programs can return: XDP_DROP: Drop the packet XDP_TX: Transmit the packet back out the network interface XDP_PASS: Pass the packet up the network stack eBPF introduces several new features, notably helper function calls, enabling programs to call functions exposed by the kernel. This includes retrieving or setting values in maps, key-value data structures that can also be accessed from userspace. Filter A key feature of tcpdump is the ability to efficiently pick out packets of interest; packets are filtered before reaching userspace. To achieve this in XDP, the desired filter must be converted to eBPF. cBPF is already used in our XDP based DoS mitigation pipeline: cBPF filters are first converted to C by cbpfc, and the result compiled with Clang to eBPF. Reusing this mechanism allows us to fully support libpcap filter expressions: Pipeline to convert pcap-filter expressions to eBPF via C using cbpfc To remove the Clang runtime dependency, our cBPF compiler, cbpfc, was extended to directly generate eBPF: Pipeline to convert pcap-filter expressions directly to eBPF using cbpfc Converted to eBPF using cbpfc, ip and udp port 53 yields: Example cBPF filter converted to eBPF with cbpfc flowchart The emitted eBPF requires a prologue, which is responsible for loading a pointer to the beginning, and end, of the input packet into registers r6 and r7 respectively[2]. The generated code follows a very similar structure to the original cBPF filter, but with: Bswap instructions to convert big endian packet data to little endian. Guards to check the length of the packet before we load data from it. These are required by the kernel verifier. The epilogue can use the result of the filter to perform different actions on the input packet. As mentioned earlier, we’re open sourcing cbpfc; the code and documentation are available on GitHub. It can be used to compile cBPF to C, or directly to eBPF, and the generated code is accepted by the kernel verifier. Instrument Tcpdump can start and stop capturing packets at any time, without requiring coordination from applications. This rules out modifying existing XDP programs to directly run the generated eBPF filter; the programs would have to be modified each time xdpcap is run. Instead, programs should expose a hook that can be used by xdpcap to attach filters at runtime. xdpcap’s hook support is built around eBPF tail-calls. XDP programs can yield control to other programs using the tail-call helper. Control is never handed back to the calling program, the return code of the subsequent program is used. For example, consider two XDP programs, foo and bar, with foo attached to the network interface. Foo can tail-call into bar: Flow of XDP program foo tail-calling into program bar The program to tail-call into is configured at runtime, using a special eBPF program array map. eBPF programs tail-call into a specific index of the map, the value of which is set by userspace. 
From our example above, foo’s tail-call map holds a single entry: index program 0 bar A tail-call into an empty index will not do anything, XDP programs always need to return an action themselves after a tail-call should it fail. Once again, this is enforced by the kernel verifier. In the case of program foo: int foo(struct xdp_md *ctx) { // tail-call into index 0 - program bar tail_call(ctx, &map, 0); // tail-call failed, pass the packet return XDP_PASS; } To leverage this as a hook point, the instrumented programs are modified to always tail-call, using a map that is exposed to xdpcap by pinning it to a bpffs. To attach a filter, xdpcap can set it in the map. If no filter is attached, the instrumented program returns the correct action itself. With a filter attached to program foo, we have: Flow of XDP program foo tail-calling into an xdpcap filter The filter must return the original action taken by the instrumented program to ensure the packet is processed correctly. To achieve this, xdpcap generates one filter program per possible XDP action, each one hardcoded to return that specific action. All the programs are set in the map: index program 0 (XDP_ABORTED) filter XDP_ABORTED 1 (XDP_DROP) filter XDP_DROP 2 (XDP_PASS) filter XDP_PASS 3 (XDP_TX) filter XDP_TX By tail-calling into the correct index, the instrumented program determines the final action: Flow of XDP program foo tail-calling into xdpcap filters, one for each action xdpcap provides a helper function that attempts a tail-call for the given action. Should it fail, the action is returned instead: enum xdp_action xdpcap_exit(struct xdp_md *ctx, enum xdp_action action) { // tail-call into the filter using the action as an index tail_call((void *)ctx, &xdpcap_hook, action); // tail-call failed, return the action return action; } This allows an XDP program to simply: int foo(struct xdp_md *ctx) { return xdpcap_exit(ctx, XDP_PASS); } Expose Matching packets, as well as the original action taken for them, need to be exposed to userspace. Once again, such a mechanism is already part of our XDP based DoS mitigation pipeline! Another eBPF helper, perf_event_output, allows an XDP program to generate a perf event containing, amongst some metadata, the packet. As xdpcap generates one filter per XDP action, the filter program can include the action taken in the metadata. A userspace program can create a perf event ring buffer to receive events into, obtaining both the action and the packet. This is true of the original cBPF, but Linux implements a number of extensions, one of which allows the length of the input packet to be retrieved. ↩︎ This example uses registers r6 and r7, but cbpfc can be configured to use any registers. ↩︎

The Climate and Cloudflare

CloudFlare Blog -

Power is the precursor to all modern technology. James Watt’s steam engine energized the factory, Edison and Tesla’s inventions powered street lamps, and now both fossil fuels and renewable resources power the trillions of transistors in computers and phones. In the words of anthropologist Leslie White: “Other things being equal, the degree of cultural development varies directly as the amount of energy per capita per year harnessed and put to work.”

Unfortunately, most of the traditional ways to generate power are simply not sustainable. Burning coal or natural gas releases carbon dioxide, which directly leads to global warming and threatens the habitats of global ecosystems, and by extension humans. If we can’t minimize the impact, our world will be dangerously destabilized -- mass extinctions will grow more likely, and mass famines, droughts, migration, and conflict will only be possible to triage rather than avoid. Is the Internet the primary source of this grave threat? No: all data centers globally accounted for 2-3% of total global power use in recent years, and power consumption isn’t the only contributor to human carbon emissions. Transportation (mostly oil use in cars, trucks, ships, trains, and airplanes) and industrial processing (steel, chemicals, heavy manufacturing, etc.) also account for similar volumes of carbon emissions. Within power use though, some internet industry analysts estimate that total data center energy (in kilowatt-hours, not percentage of global power consumption) may double every four years for the foreseeable future -- making internet energy use more than just rearranging deck chairs...

How does internet infrastructure like Cloudflare’s contribute to power consumption? Computing power resources are split into end users (like your phone or computer displaying this page) and network infrastructure. That infrastructure likewise splits into “network services” like content delivery and “compute services” like database queries. Cloudflare offers both types of services, and has a sustainability impact in both -- this post describes how we think about it.

Our Network
The Cloudflare Network has one huge advantage when power is considered. We run a homogeneous network of nearly identical machines around the world, all running the same code on similar hardware. The same servers respond to CDN requests, block massive DDoS attacks, execute customer code in the form of Workers, and even serve DNS requests to 1.1.1.1. When it is necessary to bring more capacity to a problem we are able to do it by adjusting our traffic’s routing through the Internet, not by requiring wasteful levels of capacity overhead in 175 locations around the world. Those factors combine to dramatically reduce the amount of waste, as they mean we don’t have large amounts of hardware sitting idle consuming energy without doing meaningful work. According to one study, servers within one public cloud average 4.15% to 16.6% CPU utilization, while Cloudflare’s edge operates significantly higher than that top end.

One of the functions Cloudflare performs for our customers is caching, where we remember previous responses our customers have given to requests. This allows the edge location closest to the visitor to respond to requests instantly, saving the request a trip through the Internet to the customer’s origin server. This immediately saves energy, as sending data through the Internet requires switches and routers to make decisions, which consumes power.
Serving a response from cache is as close to the lowest power requirement you can imagine to serve a web request; we are reading data from memory or disk and immediately returning it. In contrast, when a customer’s origin has to serve a request, there are two additional costs Cloudflare avoids: first, even getting the request to arrive at the origin often requires many hops over the Internet, each requiring CPU cycles and the energy they consume. Second, the request often requires large amounts of code to be executed and even database queries to be run. The savings are so great that we often have customers enable our caching to keep their servers running even when their request volume would overwhelm their capacity; if our caching were disabled they would almost immediately fail. This means we are not only saving CPU cycles on our customer’s origin, we are preventing them from having to buy and run multiple-X more servers with the proportionally greater energy use & environmental impact that entails. Our breadth on the Internet also means the performance optimizations we are able to perform have a disproportionate impact. When we speed up TLS or fix CPU stalls we are shaving off milliseconds of CPU from requests traveling to 13 million different websites. It would be virtually impossible to get all of these performance improvements integrated into every one of those origins, but with Cloudflare they simply see fewer requests and energy is saved.Our PlatformThe energy efficiency of using a wax or tallow candle to create light is on the order of 0.01%. A modern power plant burning gas to power an LED light bulb is nearly 10% efficient, an improvement of 1,000x. One of the most powerful things we can do to lower energy consumption, therefore, is to give people ways of performing the same work with less energy.Our connection to this concept lives not just in our network, but in the serverless computing platform we offer atop it, Cloudflare Workers. Many of the conventions that govern how modern servers and services operate descend directly from the mainframe era of computing, where a single large machine would run a single job. Unlike other platforms which are based on that legacy, we don’t sell customers servers, virtual machines, or containers; instead we use a technology called isolates. Isolates represent a lightweight way to run a piece of code which provides much of the same security guarantees with less overhead, allowing many thousands of different customer’s code to be executed on a small number of machines efficiently. A traditional computer system might be just as efficient running a single program, but as our world shifts into serverless computing with thousands of code files running on a single machine, isolates shine.In a conventional computer system the complex security dance between the operating system and the code being executed by a user can consume as much as 30% of the CPU power used. This has only gotten worse with the recent patches required to prevent speculative execution vulnerabilities. Isolates share a single runtime which can manage the security isolation required to run many thousands of customer scripts, without falling back to the operating system. 
We are able to simply eliminate much of that 30% overhead, using that capacity to execute useful code instead. Additionally, by being able to start our isolates using just a few milliseconds of CPU time rather than the hundreds required by conventional processes, we are able to dynamically scale rapidly, more efficiently using the hardware we do have. Isolates allow us to spend CPU cycles on only the code customers actually wish to execute, not wasteful overhead. These effects are in fact so dramatic that we have begun to rebuild parts of our own internal infrastructure as isolate-powered Cloudflare Workers, in part to save energy for ourselves and our customers.

Offsetting What’s Left
All that means that the energy we ultimately do use for our operations is only a fraction of what it would otherwise take to accomplish the same tasks. Last year, we took our first major step toward neutralizing the remaining carbon footprint from our operations by purchasing Renewable Energy Certificates (RECs) to match all of our electricity use in North America. This year, we have expanded our scope to include all of our operations around the world.

We currently have 175 data centers in more than 75 countries around the world, as well as 11 offices: San Francisco (our global HQ), London, Singapore, New York, Austin, San Jose, Champaign, Washington, D.C., Beijing, Sydney, and Munich. In order to reduce our carbon footprint, we have purchased RECs to match 100% of the power used in all those data centers and offices around the world as well.

As our colleague Jess Bailey wrote about last year, one REC is created for every megawatt-hour (MWh) of electricity generated from a renewable power source, like a wind turbine or solar panel. Renewable energy is dispersed into electricity transmission systems similar to how water flows in water distribution systems — each of them is mixed inextricably in its respective “pipes” and it’s not possible to track where any particular electron you use, or drop of water you drink, originally came from. RECs are a way to track the volume (and source) of renewable energy contributed to the grid, and act like a receipt for each MWh contributed.

As we noted last year, this action is an important part of our sustainability plan, joining our efforts to work with data centers that have superior Power Usage Effectiveness (PUE), and adding to the waste diversion and energy efficiency efforts we already employ in all of our offices. When combined with our ability to dramatically reduce the amount of data which has to flow through the Internet and the number of requests which have to reach our customers’ origins, we hope not just to be considered neutral, but to have a large-scale and long-term positive effect on the sustainability of the Internet itself.

Eating Dogfood at Scale: How We Build Serverless Apps with Workers

CloudFlare Blog -

You’ve had a chance to build a Cloudflare Worker. You’ve tried KV Storage and have a great use case for your Worker. You’ve even demonstrated the usefulness to your product or organization. Now you need to go from writing a single file in the Cloudflare Dashboard UI Editor to source-controlled code with multiple environments deployed using your favorite CI tool. Fortunately, we have a powerful and flexible API for managing your workers. You can customize your deployment to your heart’s content. Our blog has already featured many things made possible by that API:
The Wrangler CLI
CI/CD Pipeline
Github Actions
Worker bootstrap template
These tools make deployments easier to configure, but it still takes time to manage. The Serverless Framework Cloudflare Workers plugin removes that deployment overhead so you can spend more time working on your application and less on your deployment.

Focus on your application
Here at Cloudflare, we’ve been working to rebuild our Access product to run entirely on Workers. The move will allow Access to take advantage of the resiliency, performance, and flexibility of Workers. We’ll publish a more detailed post about that migration once complete, but the experience required that we retool some of our process to match our existing development experience as much as possible. To us this meant:
Git
Easily deploy
Different environments
Unit Testing
CI Integration
Typescript/Multiple Files

Everything Must Be Automated
The Cloudflare Access team looked at three options for automating all of these tools in our pipeline. All of the options will work and could be right for you, but custom scripting can be a chore to maintain and Terraform lacked some extensibility.
Custom Scripting
Terraform
Serverless Framework
We decided on the Serverless Framework. Serverless Framework provided a tool to mirror our existing process as closely as possible without too much DevOps overhead. Serverless is extremely simple and doesn’t interfere with the application code. You can get a project set up and deployed in seconds. It’s obviously less work than writing your own custom management scripts. But it also requires less boilerplate than Terraform because the Serverless Framework is designed for the “serverless” niche. However, if you are already using Terraform to manage other Cloudflare products, Terraform might be the best fit.

Walkthrough
Everything for the project happens in a YAML file called serverless.yml. Let’s go through the features of the configuration file. To get started, we need to install serverless from npm and generate a new project.

npm install serverless -g
serverless create --template cloudflare-workers --path myproject
cd myproject
npm install

If you are an enterprise client, you want to use the cloudflare-workers-enterprise template as it will set up more than one worker (but don’t worry, you can add more to any template). Also, I’ll touch on this later, but if you want to write your workers in Rust, use the cloudflare-workers-rust template. You should now have a project that feels familiar, ready to be added to your favorite source control.
In the project should be a serverless.yml file like the following.service: name: hello-world provider: name: cloudflare config: accountId: CLOUDFLARE_ACCOUNT_ID zoneId: CLOUDFLARE_ZONE_ID plugins: - serverless-cloudflare-workers functions: hello: name: hello script: helloWorld # there must be a file called helloWorld.js events: - http: url: example.com/hello/* method: GET headers: foo: bar x-client-data: value The service block simply contains the name of your service. This will be used in your Worker script names if you do not overwrite them.Under provider, name must be ‘cloudflare’  and you need to add your account and zone IDs. You can find them in the Cloudflare Dashboard.The plugins section adds the Cloudflare specific code.Now for the good part: functions. Each block under functions is a Worker script. name: (optional) If left blank it will be STAGE-service.name-script.identifier. If I removed name from this file and deployed in production stage, the script would be named production-hello-world-hello.script: the relative path to the javascript file with the worker script. I like to organize mine in a folder called handlers.events: Currently Workers only support http events. We call these routes. The example provided says that GET https://example.com/hello/<anything here> will  cause this worker to execute. The headers block is for testing invocations.At this point you can deploy your worker!CLOUDFLARE_AUTH_EMAIL=you@yourdomain.com CLOUDFLARE_AUTH_KEY=XXXXXXXX serverless deploy This is very easy to deploy, but it doesn’t address our requirements. Luckily, there’s just a few simple modifications to make.Maturing our YAML FileHere’s a more complex YAML file.service: name: hello-world package: exclude: - node_modules/** excludeDevDependencies: false custom: defaultStage: development deployVars: ${file(./config/deploy.${self:provider.stage}.yml)} kv: &kv - variable: MYUSERS namespace: users provider: name: cloudflare stage: ${opt:stage, self:custom.defaultStage} config: accountId: ${env:CLOUDFLARE_ACCOUNT_ID} zoneId: ${env:CLOUDFLARE_ZONE_ID} plugins: - serverless-cloudflare-workers functions: hello: name: ${self:provider.stage}-hello script: handlers/hello webpack: true environment: MY_ENV_VAR: ${self:custom.deployVars.env_var_value} SENTRY_KEY: ${self:custom.deployVars.sentry_key} resources: kv: *kv events: - http: url: "${self:custom.deployVars.SUBDOMAIN}.mydomain.com/hello" method: GET - http: url: "${self:custom.deployVars.SUBDOMAIN}.mydomain.com/alsohello*" method: GET We can add a custom section where we can put custom variables to use later in the file.defaultStage: We set this to development so that forgetting to pass a stage doesn’t trigger a production deploy. Combined with the stage option under provider we can set the stage for deployment.deployVars: We use this custom variable to load another YAML file dependent on the stage. This lets us have different values for different stages. In development, this line loads the file ./config/deploy.development.yml. Here’s an example file:env_var_value: true sentry_key: XXXXX SUBDOMAIN: dev kv: Here we are showing off a feature of YAML. If you assign a name to a block using the ‘&’, you can use it later as a YAML variable. This is very handy in a multi script account. We could have named this variable anything, but we are naming it kv since it holds our Workers Key Value storage settings to be used in our function below.Inside of the kv block we're creating a namespace and binding it to a variable available in your Worker. 
It will ensure that the namespace “users” exists and is bound to MYUSERS.

kv: &kv
  - variable: MYUSERS
    namespace: users

provider: The only new part of the provider block is stage.

stage: ${opt:stage, self:custom.defaultStage}

This line sets stage to either the command line option or custom.defaultStage if opt:stage is blank. When we deploy, we pass --stage=production to serverless deploy.

Now under our function we have added webpack, resources, and environment.
webpack: If set to true, it will simply bundle each handler into a single file for deployment. It will also take a string representing a path to a webpack config file so you can customize it. This is how we add Typescript support to our projects.
resources: This block is used to automate resource creation. In resources we’re linking back to the kv block we created earlier. Side note: If you would like to include WASM bindings in your project, it can be done in a very similar way to how we included Workers KV. For more information on WASM, see the documentation.
environment: This is the butter for the bread that is managing configuration for different stages. Here we can specify values to bind to variables to use in worker scripts. Combined with YAML magic, we can store our values in the aforementioned config files so that we deploy different values in different stages. With environments, we can easily tie into our CI tool. The CI tool has our deploy.production.yml. We simply run the following command from within our CI.

sls deploy --stage=production

Finally, I added a route to demonstrate that a script can be executed on multiple routes.

At this point I’ve covered (or hinted at) everything on our original list except Unit Testing. There are a few ways to do this. We have a previous blog post about Unit Testing that covers using CloudWorker, a great tool built by Dollar Shave Club. My team opted to use the classic node frameworks mocha and sinon. Because we are using Typescript, we can build for node or build for v8. You can also make mocha work for non-typescript projects if you use an experimental feature that adds import/export support to node: --experimental-modules.

We’re excited about moving more and more of our services to Cloudflare Workers, and the Serverless Framework makes that easier to do. If you’d like to learn even more or get involved with the project, see us on github.com. For additional information on using Serverless Framework with Cloudflare Workers, check out our documentation on the Serverless Framework.
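To tie the walkthrough together, a handler deployed by the YAML above might use its bindings roughly as follows; MYUSERS and MY_ENV_VAR come from the kv and environment blocks, while the lookup logic itself is purely illustrative:

// Illustrative handler using the MYUSERS KV binding and MY_ENV_VAR variable.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)
  const userId = url.searchParams.get('user') || 'anonymous'

  // MYUSERS is the Workers KV namespace bound in the kv block above.
  const profile = await MYUSERS.get(userId, 'json') // null if the key is missing

  return new Response(JSON.stringify({
    user: userId,
    profile: profile,
    envFlag: MY_ENV_VAR // value injected per stage at deploy time
  }), {
    headers: { 'content-type': 'application/json' }
  })
}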
