Netflix is going places

– Faisal Z Siddiqi. [This post first appeared on LinkedIn]

[Banner image: Cricket]

A personal glimpse into my first 100 days at Netflix

Netflix had always been a company I admired. For me, it was the quintessential disruptor – continually changing the entertainment landscape in America. Its video apps were synonymous with strong user focus, its Culture deck was a one-of-a-kind manifesto making the rounds in Silicon Valley, and its TechBlog was a resource I had enjoyed reading for years.

But as I evaluated the opportunity to lead an engineering team at Netflix, the pragmatist (read, skeptic) in me wondered how things would really be behind the scenes. How real was the culture of freedom and responsibility? Could a 17+ year-old “big” company operate nimbly? Given it was no longer an upstart with an open canvas, would there still be room for me to learn? How much growth potential did the company itself have? After some interesting probing conversations with the team at Netflix, I decided to take the plunge.

This is my personal perspective after 100 days at Netflix.

The week I joined was the week of the 360 reviews, where each employee provides feedback to anyone they have materially interacted with over the year. It’s a simple no-frills write-up on what they should continue and what they should change. This includes reviews for peers as well as folks up and down the management chain. What stood out, however, was the complete honesty expressed in the feedback, and the prompt follow-up for anything that appeared to be a surprise. Employees are encouraged to give continuous and direct feedback throughout the year, so there are no surprises come 360 review time; the intent of the 360s is to highlight the most meaningful feedback you have for your co-workers. And there is a lot of continuous direct feedback. It’s not unusual for an engineer to go up to their VP and tell them what they should be doing better. I found it to be a great course-correcting device as I started getting feedback from my team, both directly as well as indirectly through my manager.

This type of feedback system can only flourish when you have a foundational culture of “Freedom & Responsibility”. The culture is something you will hear about a lot in the interview process and throughout your time at Netflix. The focus on freedom and responsibility is the single most influential tenet of Netflix’s culture, and you can read more about it here. In essence, employees have freedom over vast areas of engagement, as long as they take the responsibility to own the process and use good judgment to make decisions that are in Netflix’s interest. This informs hiring decisions, as the company values culture fit very highly in a potential candidate. As an example, if you want to build a new tool using TheLatestAndGreatestProgrammingLanguage, you don’t need any permissions – you just go ahead and do it, but it’s your responsibility to think about the maintainability, support, and evolution of the tool in a way that is scalable. The “you don’t need any permissions” policy applies to a large number of things, and even for the littlest of them, it’s surprisingly liberating and efficiency-boosting.

Within my first thirty days, I experienced another unique Netflixism. Nobody told me this in the interview process, but every quarter there is a company meeting where newbies perform on stage in front of the whole company. Not the kind of creativity I thought was expected of me. I don’t know about you, but prancing around on stage as a no-nonsense Agent of S.H.I.E.L.D. in front of all your co-workers isn’t exactly my idea of getting introduced to my colleagues. In true Netflix Original fashion, though, I had a blast.

I manage the Personalization Infrastructure team at Netflix. The charter for my team is to accelerate the pace of video discovery innovation for Netflix. We build systems and frameworks for personalized video recommendations. Netflix is very data-driven and almost all facets of the user experience are A/B tested. You are always hearing about which cell is green or flat – that is, whether a sub-feature rolled out to a small percentage of the Netflix customer base shows a statistically significant improvement, or not. We have a lot of applied machine-learning (ML) researchers who come up with the latest ML algorithms, features, and models. It’s my team’s job to increase their efficiency by providing them with a platform that allows them to iterate fast. This includes putting together a best-of-breed collection of open source technologies to build a service that, say, allows the researchers to “turn back time” by using snapshots of various micro-services. Or it may mean building an orchestration engine for ML pipelines from the ground up. Netflix has a rich history of open-sourcing, and engineers are encouraged to think about open-sourcing opportunities as they go about their jobs. My team also builds and innovates on caching infrastructure for internal micro-services, built on top of memcached and optimized for public cloud usage across global regions. This service is one of the most heavily used pieces of software across engineering teams at Netflix and has been my portal into learning about the scale we deal with on a daily basis. Throughputs on the order of 300,000 requests per second on a single replicated production cluster, with latencies in the low milliseconds, are not unusual. This software is already open-sourced, but we are working on a major cleanup and will be doing a long-overdue round of updates to the open-sourced version in the coming months.
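
To make the “green or flat” reading a bit more concrete, here is a minimal, self-contained sketch of a two-proportion z-test comparing a control cell with a test cell. The cell sizes, success counts, and the metric itself are made-up illustrations; the actual A/B methodology at Netflix is far more sophisticated than this.

```java
/**
 * Illustrative sketch of the kind of read behind "which cell is green or flat":
 * a two-proportion z-test comparing a success metric between a control cell and
 * a test cell. The cell sizes and success counts are made-up numbers.
 */
public class AbCellSketch {
    public static void main(String[] args) {
        long controlN = 500_000, controlSuccess = 330_000; // hypothetical control cell
        long testN = 500_000, testSuccess = 333_500;       // hypothetical test cell

        double p1 = (double) controlSuccess / controlN;
        double p2 = (double) testSuccess / testN;
        double pooled = (double) (controlSuccess + testSuccess) / (controlN + testN);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / controlN + 1.0 / testN));
        double z = (p2 - p1) / se;

        // |z| > 1.96 corresponds to significance at the 5% level for a two-sided test.
        System.out.printf("control=%.4f test=%.4f z=%.2f -> %s%n",
                p1, p2, z, Math.abs(z) > 1.96 ? "significant" : "flat");
    }
}
```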

Like most teams here, the talented engineers on my team are honest, nimble, and have little patience for “process overhead”. It took a while to get used to being told “this is not the Netflix way”. What I have learned in my short time here is that a leader can be much more effective with a highly skilled team by setting appropriate context, facilitating the right tools, connecting the dots as new opportunities arise, and then getting out of the way. In the time I have been here, my team has hosted a Netflix Meetup engaging with the Apache Spark community, talked about our experience building machine learning pipelines at Spark Summit 2015, and invited several guest speakers, often budding entrepreneurs, to present their solutions and exchange ideas with us.

The team works on distributed systems at scale using a host of emerging technologies and is a heavy user of Scala and Java in the cloud. You get to own your service or tool from conception through requirements, design, development, testing, and deployment, to being on call for it. We have seen that this responsibility, combined with the motivation to not be paged in the middle of the night, often results in better self-healing software. Netflix has a great toolchain for monitoring and deploying software, and the Engineering Bootcamp provides a quick ramp-up on the Netflix ecosystem. The team is growing and we have big plans for the upcoming months.

As I have been interviewing candidates, once in a while someone brings up Netflix’s reviews on Glassdoor. Frankly, I was reading them myself when I was contemplating employment here. Now that I have had some time to see things from the inside, Quora’s non-anonymous responses seem far more balanced to me. As an example, there seems to be a perception that Netflix’s unlimited vacation policy is really an underhanded way to extract the most out of employees, since there is no set number of vacation days. Nothing could be further from the truth. It’s not uncommon for employees to take multi-week vacations, sometimes more than a month at a time. It’s amazing how effective well-rested employees can be.

It has not all been rosy, of course. I’ve had to grapple with reconciling partially overlapping tools built across teams, have tough conversations to give honest feedback to my team and management, and explain how an otherwise stable system could end up serving a certain employee a not-so-personalized video recommendation (let’s just say not everyone enjoys Barbie videos!). That is very much a part of the learning experience I was seeking, though, and ultimately, it’s how far you step out of your comfort zone that determines how much you grow. And it’s best done with amazingly talented, humble, and responsible colleagues.

As I look forward to my journey at Netflix, I often think about the impact the company has had around me. My 5-year-old daughter knows that Netflix is “Abba’s office”, but yesterday she surprised me with an “Oh, it’s also a button on my iPad… you press it to play kids shows”. I was promptly informed of her favorites: Wild Kratts, Sofia the First, Clifford, and Super Why!. She gave me an unsolicited demo, clicking through the Kids profile onto Wild Kratts in the Character Bar – a demo which would give many a product manager a run for their money. For someone who spent the last 15 years working in the “hard-to-explain-to-family-what-I-do” enterprise space, it’s a welcome change to work for a brand my pre-kindergartner can love and enjoy. It also bodes well for a service that wants to win more and more entertainment “moments of truth” for an increasingly global demographic.

Earlier this year, Reed talked about the intention to launch in 200 countries by the end of 2016. Today, Netflix is bustling with energy around the push to go global. We have folks working on dubs/subs localization, UI treatments, localized marketing, global cloud infrastructure, and globalization of personalization algorithms, among other areas.

Which brings me to the banner image of this post. No, it’s not an upcoming Netflix Original. It’s the image from a T-shirt I designed for Netflix engineers playing in a corporate cricket league. I chose it for this post because I now believe that, like the trailblazing ball in the image, Netflix is going places. Literally.

Scaling Conviva for FIFA 2014

[Banner image: FIFA 2014]

Scaling Conviva’s Real-time Platform for the 2014 FIFA World Cup: An Engineering Perspective

– Haijie Wu, Aaditya Ramesh and Faisal Zakaria Siddiqi, Conviva Engineering [This post first appeared on Conviva.com’s Engineering blog]

Soccer fans around the world have just enjoyed a full month of excitement brought about by the 2014 FIFA World Cup. At Conviva, we had our own share of excitement, with traffic building to a record crescendo by the time the USA v. Germany game came around.

Conviva’s real-time Intelligent Control Platform met the challenge of processing unprecedented levels of global traffic while maintaining liveness of data and high availability. In doing so, our platform reached several milestones over the course of the event, highlighting its scale and power:

  • 28+ billion viewer minutes
  • 700+ million unique viewers
  • 3.2 million peak concurrent plays

As video continues its push to the online medium, we have proved that Quality of Experience has a significant impact on user viewing behavior, as documented in our 2014 Viewer Experience Report. Conviva addresses the challenge of maintaining viewer engagement by providing visibility and optimization, ensuring a high-quality user experience for our video streaming customers’ end users.

In this post, we focus on one aspect of the scaling exercise that was undertaken to support the World Cup: how we leveraged a geo-diverse, hybrid-cloud deployment to meet the availability, throughput, and liveness requirements of what turned out to be a breakthrough event for live sports streaming.

(For a look at engagement trends highlighted by World Cup data, take a look here.)

Technical Overview

The Conviva platform can be abstracted into three major component groups:

  1. Gateway and messaging layer, which handles communication with the video players, logs QoS analytics, and responds with Optimization Decisions. This is the entry point to the platform and is both availability- and latency-sensitive.
  2. Aggregation and compute layer, which consists of a low-latency streaming map-reduce system, built in-house for real-time analytics, as well as a historical data computation workflow built on top of Apache Hadoop.
  3. Storage, query and presentation layer, which ensures that data is stored in a way that optimizes the query patterns of Pulse, our web portal.

 

Conviva’s real-time platform is unique in the industry and is a key differentiator when it comes to live-event streaming analytics. Our platform has been designed to meet the following, often conflicting, requirements:

  • High Availability – Critical-path services should stay up in the face of individual server failures, a datacenter/availability-zone failure, or a network partition. The gateway and messaging layer therefore needs to be highly resilient.
  • Horizontal Scalability – System capacity should scale linearly with traffic by adding more hardware.
  • High Throughput and Low Latency – Important for inter-datacenter transfers.
  • Traffic Bursts – The system needs to absorb unusual spikes in load generated by viral video events.
  • Disaster Recovery – Multi-tier data persistence is required at multiple geographical locations so we can recover from catastrophic failures.

 

Optimizing the Gateway and Messaging Layer

The gateway and messaging layer is the most sensitive to latency and reliability requirements. Any service-impacting issue at this layer may lead to irrecoverable data loss, since the data has not yet reached the persistence layers.

Because of how we manage session state across the platform, this layer fortunately also happens to be the easiest to scale out. In essence, the gateway layer’s main functionality can be expressed as a ‘Map’ function without a ‘Reduce’ step, to draw a parallel with the MapReduce concept; this makes it horizontally scalable as long as the data can still be partitioned.
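
As a rough illustration of that ‘Map without Reduce’ shape, here is a minimal sketch in Java: each heartbeat is transformed purely as a function of its own session, and a hash of the session ID decides which downstream partition it lands on. The class, field, and method names are hypothetical, not our actual gateway code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/**
 * Minimal sketch of the "Map without Reduce" shape of the gateway layer: each
 * heartbeat is transformed purely as a function of its own session, and a hash
 * of the session ID decides which downstream partition it lands on. All names
 * here are illustrative.
 */
public class GatewayMapSketch {
    record Heartbeat(String sessionId, long bufferingMs, long bitrateKbps) {}

    // Stateless per-heartbeat transformation: no aggregation across sessions.
    static String mapToEvent(Heartbeat hb) {
        return hb.sessionId() + "," + hb.bufferingMs() + "," + hb.bitrateKbps();
    }

    // Partitioning by session ID keeps one session's data together downstream.
    static int partitionFor(Heartbeat hb, int numPartitions) {
        byte[] key = hb.sessionId().getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }

    public static void main(String[] args) {
        Heartbeat hb = new Heartbeat("session-42", 350, 3200);
        System.out.println(partitionFor(hb, 240) + " <- " + mapToEvent(hb));
    }
}
```

Because nothing in this path aggregates across sessions, adding gateway nodes adds capacity roughly linearly.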

Conviva’s multi-tenant platform can be configured to partition traffic by various dimensions (e.g. traffic from different customers may be sent to different geographical datacenters). This capability allowed us to leverage a multi-datacenter, hybrid-cloud deployment tuned specifically for World Cup traffic.

For example, bringing the gateway layers closer to the end users in multiple major geographic regions allowed us to reduce the critical-path round-trip time between the video players and Conviva.

The low-latency requirement, however, required more work than simply choosing the appropriate type and location of cloud deployment. Data transmission over TCP between two physically distant locations is afflicted by the well-known long fat network (LFN) problem, where a high bandwidth-delay product can severely limit throughput unless the connection is carefully tuned.
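
To give a feel for why the tuning matters, here is a back-of-the-envelope bandwidth-delay-product calculation; the link capacity and round-trip time are hypothetical numbers, not the actual figures for our WAN links.

```java
/**
 * Back-of-the-envelope bandwidth-delay-product calculation for a long fat
 * network. The link capacity and round-trip time are hypothetical.
 */
public class BdpSketch {
    public static void main(String[] args) {
        double linkGbps = 1.0; // assumed provisioned inter-datacenter capacity
        double rttMs = 150.0;  // assumed intercontinental round-trip time

        // Bytes that must be "in flight" to keep the pipe full: bandwidth x RTT.
        double bdpBytes = (linkGbps * 1e9 / 8) * (rttMs / 1e3);
        System.out.printf("Bandwidth-delay product ~ %.1f MB%n", bdpBytes / 1e6);

        // With a classic 64 KB TCP window (no window scaling), one stream is capped at:
        double cappedMbps = (64 * 1024 * 8) / (rttMs / 1e3) / 1e6;
        System.out.printf("Un-tuned single TCP stream ~ %.1f Mbit/s%n", cappedMbps);
    }
}
```

A single un-tuned TCP stream tops out far below the provisioned capacity, which is why both kernel-level window tuning and multiple parallel application-level streams were needed.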

After evaluating some options, we decided to use Apache Kafka as our primary inter-datacenter messaging layer. This decision was informed by Kafka’s design choice of strong consistency and fault-tolerance guarantees, which were also important requirements for us. Kafka comes with a tool called MirrorMaker for mirroring data between two Kafka clusters, potentially across a long fat network. To maximize throughput over the LFN link, after the appropriate TCP tuning, we decided to parallelize the data streams at the application layer. We filled up the fat pipe by running multiple concurrent MirrorMaker streams, pulling from different Kafka partitions at the same time over separate TCP channels. After scale assessments, we settled on a configuration of 20 Kafka brokers, 240 partitions, a replication factor of 2, and about 120 MirrorMaker threads for each regional datacenter to support the barrage of control-plane traffic.
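
For readers unfamiliar with MirrorMaker, the sketch below shows, in plain Java with the standard Kafka client API, what a single mirroring thread conceptually does: consume from a regional cluster and re-publish to the aggregation cluster. The broker addresses and topic name are placeholders, and the real MirrorMaker (and the rewrite we discuss below) adds batching, offset management, and failure handling on top of this.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Minimal sketch of what a single mirroring thread does: consume from a
 * regional Kafka cluster and re-publish to the aggregation cluster.
 * Broker addresses and the topic name are placeholders.
 */
public class MirrorThreadSketch {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "regional-kafka:9092");    // placeholder
        consumerProps.put("group.id", "mirror-group");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "aggregation-kafka:9092"); // placeholder
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("acks", "all"); // favor durability over the WAN link

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("heartbeats")); // placeholder topic
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Preserve the key so downstream partitioning stays consistent.
                    producer.send(new ProducerRecord<>("heartbeats", record.key(), record.value()));
                }
            }
        }
    }
}
```

Running many such threads, each pulling from a disjoint subset of partitions over its own TCP connection, is what allowed the aggregate transfer to fill the long fat pipe.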

Because of the requirement to support bursty traffic – a highly popular live streaming event can have several times more traffic than the baseline – a dedicated datacenter was not really an option, so we implemented our regional datacenters across several availability zones in the AWS public cloud. This allowed us to dynamically scale the gateway and messaging layer based on traffic forecasts for upcoming games.
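
As a rough illustration of how a traffic forecast translates into a scaling decision, the arithmetic might look like the following; every number here (heartbeat interval, per-instance capacity, headroom) is a hypothetical placeholder rather than our actual sizing.

```java
/**
 * Back-of-the-envelope sizing for one game. Every number here is a
 * hypothetical placeholder.
 */
public class GatewayCapacitySketch {
    public static void main(String[] args) {
        double peakConcurrentPlays = 3_200_000;      // traffic forecast for the game
        double heartbeatsPerPlayPerSec = 1.0 / 20.0; // assume one heartbeat every 20 s
        double perInstanceRps = 4_000;               // assumed sustainable RPS per gateway node
        double headroom = 1.5;                       // margin for bursts and failover

        double peakRps = peakConcurrentPlays * heartbeatsPerPlayPerSec;
        int instances = (int) Math.ceil(peakRps * headroom / perInstanceRps);

        System.out.printf("Peak heartbeat RPS ~ %.0f, gateway instances to provision: %d%n",
                peakRps, instances);
    }
}
```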

High availability was implemented along several dimensions. Latency-based global DNS load balancing, as well as per-customer static configuration mappings, were used to partition traffic across the various datacenters. Multiple redundant transit connections were set up between each pair of regional and aggregation datacenters, with enough bandwidth to meet the peak traffic watermarks designated for each datacenter. To support aggregation of data from multiple regional datacenters, we ended up rewriting the MirrorMaker codebase from scratch. If a regional datacenter became unreachable, video player traffic would be redistributed over the remaining available regions. Additionally, in the case of a catastrophic failure, the Conviva client libraries had built-in failover logic to avoid video playback failures and thereby protect the viewer experience.
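
The client-side failover mentioned above might look roughly like the sketch below: try each regional gateway endpoint in order and fall back on failure, without ever blocking playback. The endpoint URLs are placeholders, and our client libraries’ actual logic is more involved.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

/**
 * Minimal sketch of client-side gateway failover: try each gateway endpoint in
 * order and fall back to the next one on failure. The endpoints are placeholders.
 */
public class GatewayFailoverSketch {
    private static final List<String> GATEWAYS = List.of(
            "https://gateway-us.example.com/report",   // placeholder endpoints
            "https://gateway-eu.example.com/report",
            "https://gateway-sa.example.com/report");

    static boolean sendHeartbeat(byte[] payload) {
        for (String gateway : GATEWAYS) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(gateway).openConnection();
                conn.setConnectTimeout(2_000);
                conn.setReadTimeout(2_000);
                conn.setDoOutput(true);
                conn.setRequestMethod("POST");
                conn.getOutputStream().write(payload);
                if (conn.getResponseCode() == 200) {
                    return true; // heartbeat delivered
                }
            } catch (IOException e) {
                // Fall through and try the next region; never fail playback over telemetry.
            }
        }
        return false; // all regions unreachable; drop the heartbeat rather than block playback
    }

    public static void main(String[] args) {
        sendHeartbeat("session=123&event=heartbeat".getBytes());
    }
}
```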

One of the ground realities we had to be ready for was the occasional network hiccup in the AWS fabric, as well as on the long-haul connection over the WAN. A loosely coupled architecture is very important in distributed systems, because losing a component to network issues should not leave another component hanging unless it has no alternative. Our gateway layers have been designed with this principle in mind, so the impact of network issues was limited. As for the messaging layers, local buffering of topic data in Kafka and the MirrorMakers helped, but the MirrorMakers periodically got into a hung state as a result of network hiccups and had to be restarted. By handling this at the messaging layer, we were able to shield the core applications running on top of this fabric from the effects of a flaky network.
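
One pragmatic way to handle the hung MirrorMakers is a simple progress watchdog. The sketch below illustrates the idea only; it is not the tooling we actually used, and both the progress source and the restart command are placeholders.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of a progress watchdog for a mirroring process: if the mirror stops
 * making progress for too long, restart it. Both the progress source and the
 * restart command below are placeholders.
 */
public class MirrorWatchdogSketch {
    private static final Duration STALL_THRESHOLD = Duration.ofMinutes(5);

    public static void main(String[] args) throws Exception {
        long lastCount = readMirroredRecordCount();
        Instant lastProgress = Instant.now();

        while (true) {
            Thread.sleep(30_000); // polling interval
            long count = readMirroredRecordCount();
            if (count > lastCount) {
                lastCount = count;
                lastProgress = Instant.now();
            } else if (Duration.between(lastProgress, Instant.now()).compareTo(STALL_THRESHOLD) > 0) {
                restartMirror();
                lastProgress = Instant.now();
            }
        }
    }

    // Placeholder: in practice this would come from JMX or a metrics endpoint.
    private static long readMirroredRecordCount() {
        return 0L;
    }

    // Placeholder: bounce the MirrorMaker process via whatever service manager is in use.
    private static void restartMirror() throws Exception {
        new ProcessBuilder("systemctl", "restart", "mirrormaker").inheritIO().start().waitFor();
    }
}
```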

Overall, the World Cup was a clear success for Conviva in general and the Engineering team in particular. As with all projects, this was the result of a significant team effort by a lot of passionate folks at Conviva. If you would like to work on such high-profile projects that literally impact everyone who can stream a video, anywhere in the world, please check out our Careers page. We are hiring and always looking for talented and passionate individuals to take on the next big problem in distributed systems, cloud computing, machine learning and video delivery.