Code

September 2, 2022

Rain, a KV store for Strava's scale

I worked on a critical project at Strava called Rain. It’s a key-value store for immutable batch Spark outputs. We can use it for all sorts of things, enabling immutable batch refresh on the fly, automatically, and deploy-free. I wrote a blog post about it. Check it out and give me claps!

September 2, 2022

Privacy counts in Metroview

Check out my blog post on how I used Hamming weights to efficiently compute privacy thresholds in Metroview

March 3, 2022

Scaling aligned edges

The Strava Metro product needs to adapt to changing transportation infrastructure. I wrote a blog post on scaling activity aligned edges in the Strava engineering blog. To learn more, check out the link above where we describe the technical challenges that come with updating the basemap, and the solution that we came up with to support the evolving infrastructure and the growing active transportation dataset.

December 23, 2020

Kafka server disconnected

The Wealthfront data engineering team is a heavy user of Apache Kafka - we’ve built many of the streaming applications (link to another blog post?) that power the Wealthfront client experience around it. The multiple-producer, multiple-consumer persistent model is especially valuable for streaming use cases where we want to build in analytics on the streamed data. For example, if we have third-party transactions processing through Kafka, we can use Kafka as a source for both application logic and batch ETLs for offline analytics.

We embarked on a journey recently to refactor our external account linking flow to support multiple third-party providers. Our existing linking provider, Quovo, is deprecated after its acquisition by Plaid. The linking team settled on Yodlee as our next vendor for linking external accounts. Not only was this a big project on its own, but it also gave us the opportunity to rethink our offline and online data flows. We decided to use the Yodlee project as a starting point for a much larger effort to delve deeper into using AWS. We’re excited about where

This blog post is focused on a small issue inherent in the process of migrating hosted Kafka producers to a massively parallel serverless processing environment like AWS Lambda.

We recently productionalized a project to handle Yodlee data updates through webhooks. These are essentially microbatches of new transactions coming in from Yodlee’s syncs with other financial institutions. Our linking backend translates these transactions from the Yodlee schema into a generic format and further downstream will classify them into categories like savings and spending. But first, we need to actually make requests to Yodlee’s API to retrieve the transactions in a parallel manner before sending them to Kafka. AWS Lambda is the most obvious solution for this - it’s super robust to quick changes in throughput.

If you have used Lambda before you may know that latency-sensitive applications often need to implement warming, since cold starts of Lambda functions can take on the order of several seconds before execution can begin. This is because Lambda initializes a container environment specific for your function invocation. If there is a large time gap between invocations, you may see another Lambda worker initialized for each invocation. This behavior is unfortunately nondeterministic, so you won’t know if your invocation occurs on an existing Lambda worker or a new one.

There are certainly many benefits to the Lambda worker model, besides the obvious benefit of a lower average latency for invocations. At Wealthfront, we write mostly Java Lambda functions, and we use Guice for dependency injection. Having long-standing Lambda workers means that injected members of a class can persist between invocations. One instance where this helps is the case of connecting to RDS, where we are much more likely to get rate limited if each Lambda invocation initializes its own connection. Saving connections between invocations helps us reach a much higher scale without worrying about rate limiting.

The persistence of Lambda workers between invocations is not always good, though. A few weeks ago, we productionalized our Yodlee data update requester Lambda function. This function is invoked once for every element in each microbatch of data updates. It then sends the Yodlee response via Kafka to our backend. When our linking team turned on Yodlee’s webhook requests and we started linking Yodlee accounts internally, we started to see this error approximately once a day in our exception router:

ERROR The server disconnected before a response was received. “org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.”
WARN [Producer clientId=169.254.121.157] Received invalid metadata error in produce request on partition prod-KAFKA_DIRECT-stream-link-external_api_requests-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now"

Since we productionalized the Yodlee flow after we deployed this function initially, we didn’t see this until a critical mass of transactions were flowing through the system. CloudWatch logs are notoriously difficult to search through, but luckily we have routed Lambda logs to our Kibana instance (I’ll document our approach to this in a future blog post). Kibana pointed at our new Lambda as well as another Lambda in an integration test we had separately written (that wasn’t throwing exceptions to our exception router).

When I located the CloudWatch log group with these error messages, I found that these errors are all from Lambda invocations occurring well after the log group’s previous logs. Specifically, more than ten minutes after. This was a clue: each Lambda worker creates its own CloudWatch log group. So these time gaps indicated that a single worker had requests spaced out by more than 10 minutes. After spot-checking some log groups that did not contain this error message, it was clear that the time gap was the root of the problem. I discovered that Kafka brokers have a configuration property, connections.max.idle.ms, that dictates when to drop a client connection. By default, if a Kafka client connection is idle for 10 minutes, the broker will drop the connection. Since Lambda worker reuse is nondeterministic (at least to the client), we were very occasionally reusing workers that had been idle for more than 10 minutes. Kafka dropped the connection, and we got the error message above.

Why didn’t this happen from our backend producers? We use singleton Kafka producers from our microservices, and these services are sending high-throughput data streams like user events through Kafka. Since all messages are routed through the same producer on the service (as opposed to Lambda, where each worker has its own producer), the likelihood of ten minutes of idle time is very low.

The solution to fix our Lambdas was to use an expiring provider - instead of providing a Kafka producer via Guice injection, we provide a provider to a Kafka producer that will go retrieve a new instance if the previous instance has a connection that has been dropped.

This pattern likely applies to a wide variety of other use cases - including the RDS use case above. We haven’t seen the issue since applying the fix. Hopefully this helps someone out there running into the same Kafka connection issue.

I found the following articles from the AWS blog fairly interesting: This post introduces the use of self-hosted Kafka topics as input triggers to Lambda. Our team hosts Kafka in EC2, and this could assist in some future use cases. This post introduces provisioned concurrency to reduce the impact of cold starts on latency-sensitive workloads. Our batch computation of data updates is not latency-sensitive, but other workloads may be.

April 25, 2018

A history of frameworks

Suppose you wanted to build a web scraping API that would present the structured HTML data of some web page as JSON for other sites to consume. Certain aspects of the structure of site are important when designing a scraper like this. For example, it could be helpful for the site to use human-readable (or at least meaningful) query parameters. It would be helpful if the meaningful data on the webpage was in a predictable place. The HTML file should be more than just this:

    <script src="./script.js"></script>

The point here is that web crawlers and scrapers are often limited to the data that is statically available to them on the page. Without running the script to completion, the crawler cannot determine what data is on the page. A program cannot determine in advance whether or not the script will run to completion at all (a case of the halting problem). As a result, indexing the Web and scraping from sites is more difficult, and sometimes even impossible. Furthermore, allowing arbitrary mobile code to be run on client machines creates a number of security holes that do not exist with plain HTML webpages. Since HTML is a declarative language, it in itself does not introduce security holes.

The rise of JavaScript frameworks has made Web even more complex than Tim Berners-Lee had perhaps ever dreamt of. Though Web programmers have a powerful toolbox in JavaScript, there are also a number of disadvantages. In this paper, we will discuss some of the important trade-offs that JavaScript frameworks have introduced. We will also observe how JavaScript aligns with the original vision of the Web.

History of JavaScript

The first version of JavaScript was famously (or perhaps infamously) created at Netscape by Brendan Eich in the span of ten days (1). The rushed development of the language and the immediate adoption meant that the poor design decisions made in those ten days still have lasting effects to the present. In many ways, the language did not plan for evolution. JavaScript was originally designed so that Web pages could behave dynamically, and users could interact with the content in front of them. Originally called Mocha, it was designed to be easy to use for small-scale scripts on the Netscape browser. Several design decisions made JavaScript accessible for early adopters. Firstly, simplicity supported the scalability of the language. The Java-like syntax gave first time users an immediate sense of familiarity. Furthermore, the use of functions as first-class objects provides a powerful construct for event-driven programming.

At first, the language was used for its original purpose: to provide simple dynamic content to users easily. Small-scale animations, games, and basic user interactions were built using JavaScript. Plain JavaScript code was often messy and difficult to parse in the early days, especially when standards did not exist, leading to buggy webpages. As developers became more experienced with the language, the ECMAScript standard was developed, to establish uniformity across implementations of the language as well as best practices for engineers.

JavaScript Frameworks

History

One of the earliest JavaScript frameworks is one that is still widely used today: jQuery. The spirit of jQuery was in line with that of JavaScript. The library simplified the manipulation of HTML tags. Combined with the new technology of AJAX in the early 2000s (Asynchronous JavaScript and XML), jQuery was a powerful new tool in the hands of developers who wanted to work on cross-browser applications. AJAX allowed Web pages to interact with the server outside of the page load time. jQuery was built by John Resig with two principles in mind: simplicity and compatibility (2).

On the other hand, the event-driven nature of jQuery led to buggy Web pages for many developers. The notion of a “single source of truth” could become lost in thousands of lines of jQuery, since components within pages were often related in ways that were difficult to keep track of. On sites where there may have been hundreds or thousands of interactive elements, jQuery codebases grew tremendously. Furthermore, due to (relatively) slow network speeds during the early years of jQuery, the 30kb library increased page load time significantly. Nevertheless, thanks to its cross-browser compatibility and simple syntax, jQuery became vastly popular, at one point a part of almost 90% of all Web sites (3).

After the introduction of jQuery, a trend began to change the Web into what we see today. New websites that developers wanted to create could not be done with jQuery alone. Faster networks meant that developers did not need to be concerned about the bloat that JavaScript libraries or frameworks added. Developers begin writing single page applications, which would require a more structured approach in the use of JavaScript. The model-view-controller design philosophy was introduced, in contrast to the model-view philosophy of HTML and CSS. While HTML and CSS would handle the content and the presentation respectively, JavaScript could handle interactions, AJAX requests, and dynamic content. New frameworks were built to automatically synchronize data between the controller and the model. Though single page applications may not have been in Tim’s original vision, there is no doubt that these applications have significantly changed the Web. Frameworks like AngularJS, Backbone, React, and Ember have tied JavaScript intimately with the DOM. Live data-rendering and interactive Web applications were made possible with these frameworks. At the same time, there were significant disadvantages to using frameworks, including visibility, security, and turnover time.

Issue: Search Engine Optimization and Crawlers

Among the biggest of these issues is search engine optimization. Some frameworks allowed Web pages to render on the client side, purely from JavaScript. This means that the served HTML body could be as little as a single script tag. As we discussed above, search engines like Google have trouble indexing pages that are dynamically generated. The static nature of HTML makes it simple to follow links as the Google crawler does. In the early days of JavaScript frameworks, Google discouraged their use (4). Google described the trade off in building single page AJAX based applications: “making your application more responsive has come at a huge cost: crawlers are not able to see any content that is created dynamically” (4). This post was from 2009.

Of course, Google could not stand still while JavaScript apps began to take over the Web. Google, in a sense, applied Postel’s law in their recommendation. In 2009, they were conservative with what sort of content they claimed to be able to index. Google used its leverage on the Web to govern the way that developers built their sites. At the time, Google asked for snapshots of dynamically generated sites so that their crawlers could parse the data. Even so, Google still worked to index AJAX sites and single page applications. They began rendering the JavaScript-generated pages on crawlers as brosers would (5). Google needed to stay ahead of the Web developers in order to keep their search ahead of the trend. They disincentivized building AJAX applications so that they could get a head start on indexing those sites.

Google would finally come out in 2015 to deprecate that recommendation (6). Google can now index most JavaScript pages, crawling them as a modern browser would. Notably, in the same post, Google encourages the principle of “progressive enhancement” for Web pages, which is related to the principles of partial understanding and backwards compatibility. Web developers are encouraged to present content first and foremost, with layers of complexity added corresponding to different complexities of browser implementations. With this principle in mind, users on very old browsers would see all the essentials of the document, possibly with a engaging or interactive presentation.

However, Google isn’t the only company that crawls the Web. Smaller search engines with less engineering resources may not have implemented the same full-scale crawling capability that Google has. Developers looking to scrape from Web sites would have a much harder time if the data is dynamically generated. For example, in my own tests, I’ve observed that Python’s urllib library does not run JavaScript before returning. If a developer wants to scrape text from a page written in AngularJS, they will need to somehow run JavaScript before collecting the data.

Even after Google’s 2015 announcement, issues still came up with SEO on sites running mostly JavaScript. For example, Google’s crawler failed to index shows on Hulu, the media streaming site. The site uses JavaScript to render much of its content, while at the same time preventing third parties from hosting their media (7). Another experiment showed that client-rendered Angular sites failed SEO tests. One Google employee was quoted as saying “if you care about SEO, you still need to have server-rendered content” (8).

Some front-end frameworks allow for client-side routing, implemented with JavaScript. That is, the URI changes on the client, without making an within the framework. Depending on how the framework implements routing, crawlers may or may not see the client-rendered routes. In one experiment, Google’s crawler gave up on using a React client-side router to render a page (9). This means that many client-routed websites are not properly indexed by Google. As more applications are designed with client-side routing, this issue could become more problematic for Google. Both Google and the website using client side routing lose value if these sites are not properly indexed.

There are a few takeaways from the issue of SEO with JavaScript frameworks. Developers who are focused on getting indexed in search engines should primarily render their content server-side. At the same time, developers can count on Google to use its resources to support indexing pages generated with popular frameworks, at least eventually. Also, Web developers should try to cater their pages to the largest possible set of users, whether those users are bots or people using an old version of Internet Explorer.

Issue: Page Bloat and Open Source

When dial-up connections were common in the 2000s, the 30kb jQuery package could take a significant amount of time to load. Since then, networks have become faster than ever, serving hundreds of megabits per second. However, with the rise of JavaScript frameworks, some Web pages now depend on many megabytes worth of JavaScript code. Developers have started to abuse the bandwidth available to them by introducing large libraries into applications that only need a tiny fraction of the library. Though this may not seem like a huge issue in itself, the embrace of large libraries leads to other problems.

JavaScript frameworks often encourage the use of third-party libraries to extend the features of the framework. For example, React in particular is known for its reliance on add-ons. Node package manager (NPM) makes it easier than ever to include open source code in websites. As a result, hundreds of thousands of lines of code could end up in relatively simple web applications. This code, gone uninspected, could be malicious or introduce bugs into the application. Some NPM packages are so heavily relied upon that changing them will “break the internet”. In one example from 2016, an NPM package was removed over a legal dispute about the name of the package (10). Web developers around the world stopped working after an eleven line NPM package (that React depended on) was unpublished. The package was a single function that left padded strings with a given character. For several hours on March 22, 2016, developers depending on the package were at a standstill.

On top of the lost productivity, security flaws could have easily been exploited. After the open source contributor responsible for the 11 line package unpublished all his NPM packages, global package names became available for registration. A malicious developer could acquire one of these global names, republish it, and introduce malicious code into sites that depend on the unpublished package (11). This is a huge security issue, and indicates a problem with blindly trusting that code will function as expected. This is more of a classic debate about open-source software: how much can we trust fellow developers? Whether or not we choose to trust them, there are security flaws to address with some modern frameworks.

Issue: Competition and Turnover

One difficulty of being a modern Web developer is the pace at which new technologies are developed. jQuery’s popularity decreased thanks to its lack of foresight into the single-page application era. Not all frameworks are designed with future applications in mind. As a result, new frameworks are created for to provide new functionality. Angular introduced bidirectional data binding, and React introduced immutable data (12). Trying to keep up with the next hot framework requires developers to be consistently on their toes. For many smaller frameworks, the developer community is too small to warrant using the framework. We can observe Metcalfe’s law at work: the most valuable frameworks are the ones with the most developers.

The framework landscape is constantly changing. Whereas HTML has remained mostly consistent (and backwards compatible) since inception, JavaScript frameworks quickly grow obsolete after a few years in the open (13). For example, right when AngularJS 1 was reaching its peak in popularity in 2016, Angular 2 was released. Angular 2 was essentially a full rewrite of the framework. Developers would need to learn the new framework or risk obsolescence.

Even so, new frameworks are adopted quickly, thanks to support from large tech companies (Angular and React are backed by Google and Facebook respectively) and network effects. Soon after release, JavaScript frameworks often develop a vibrant community of developers and a host of resources providing support. We can see the power of network effects in the growth of jQuery, Angular, and React.

Conclusion

Despite their flaws, we cannot deny the power of JavaScript and frameworks in Web pages. As mentioned before, single page interactive applications are only possible with AJAX and JavaScript frameworks. In the early days of JavaScript, codebases were sprawling and difficult to manage. In many cases, frameworks have introduced structure to JavaScript code. They have provided clean abstractions for manipulating the DOM and handling AJAX. Frameworks make building Web applications accessible for the average JavaScript developer. There is no shortage of developers on the Web that actively tout the advantages of using JavaScript frameworks.

Tim Berners-Lee advocated the “rule of least power” (14). That is, one should use the least powerful language suitable for a purpose. JavaScript has certainly evolved to do much more than its original purpose. The language offers powerful dynamic content and interactive experiences at the possible expense of simplicity and security.

Developers today are focused on building apps quickly that work across platforms both in the browser and natively. New frameworks like Electron, Ionic, and React Native make it easier for developers to do just that. The future of JavaScript may lie outside the browser.

How does the JavaScript community proceed into the future? Given the trends of the last decade, we should be fairly certain that frameworks will not disappear. However, designers can make choices to mitigate the issues they may run into when writing with frameworks. In terms of SEO, developers can choose to work with frameworks that play well with crawlers like the Googlebot. They can follow Google Webmaster standards to ensure their site is indexed correctly. To address security problems, central organizations like NPM should establish standards that protect Web pages from the negative effects of open source. The frameworks available will continue to change as long as the Web continues to evolve. These frameworks can be built for evolution, but may not anticipate the new ideas and technologies that developers want to implement.

May 4, 2017

Facebook's worst bug

A recent New York Times article points out the impact that Facebook has had on isolating people into their political corners.

This is probably one of the biggest challenges of AI today: personalization of the News Feed has gone to the next level, and Facebook is suddenly responsible for the reinforcement of political ideologies. We can blame the existence of partisanship today at least partially on artificial intelligence.

Maybe give the user some ability to select how much they are sheltered in their political bubble? Maybe a “hmm… I don’t agree, but tell me more” button.

It makes me a bit curious about how Facebook is using data generated from “reacts” to play into their News Feed algorithm. Just because someone “angry reacts” at something doesn’t necessarily mean they don’t want to see it. How to differentiate? And how can Facebook create bipartisan News Feeds that people actually want to see?

March 24, 2016

Javascript from an Imperative Programmer's Perspective

“JavaScript is a misunderstood language. You really need to get to know it before hating on it.” – some web programmer in denial

I think it’s pretty much universally agreed upon that JavaScript’s bad parts far outweigh the good parts.

I’ve most recently struggled with asynchronous callbacks within nested loops:

for (foodType in menu) {
    for (foodname in foodType) {
        checkForFood(foodname, function(food) {
            foodArr.push(food);
        });
    }
}
return foodArr;

JShint gives a warning on the callback within a loop, and for good reason. The functions start piling on top of each other as the loop continues, not executing synchronously, which means that foodArr will still be empty on return. I wasn’t able to figure out a solution without using an external library. Instead, I needed to use the async library and a whole bunch of extra callbacks just to make this thing run synchronously.

var async = require('async');
async.forEachOf(menu, function(typearr, type, callback1) {
    type = type.trim();
    async.each(typearr, function(foodname, callback2) {
        checkForFood(type, foodname, function(food) {
            foodArr.push(food);
            callback2();
        });
    }, function(err) {        
        callback1();
    });
}, function(err) {
    callback(foodArr);
});

So maybe this isn’t actually a complaint about JavaScript, but rather the whole idea of asynchronous languages. I suppose I’m more used to synchronous languages, so it may not be fair for me to judge JavaScript for it. However, even the most seasoned of programmers still have trouble dealing with the issue of so-called “callback hell” in JavaScript. That final pyramid of a bajillion of these: }); will make anyone cringe.

Nonetheless, I’m grudgingly working on my JavaScript skills. And someday, I’ll be that stubborn programmer with years of experience, blinded from the bad side of JavaScript, telling everyone else that JavaScript is so misunderstood.

March 21, 2016

Make things easier for computers who want to see your site

In building an API to hold Tufts dining menu data, there were certain things that I found more challenging than others. There are the things that you would expect to be difficult (things that I expected to be difficult, anyway), such as pushing my site to Heroku and learning and using MongoDB for the first time. On the other hand, there are other things that I would expect to be easier, including parsing retrieved HTML and formatting it. You know, because HTML data is supposed to be structured, and can basically be turned into JSON on the spot.

Nope. I don’t know where Tufts Dining gets its menu template from, but I will say that it is incredibly hard to read, and to parse. Though the menu appeared hierarchical, the HTML represented it as a table, meaning that every heading and menu item seemed to have the same level of significance, at least in the HTML. Most of the styles were embedded into the HTML, a major no-no in the world of web programming.

Another feature of Tufts dining menus is that the URL’s associated with a specific menu are extremely long. I understand that they need some query string parameters in order to display the menu, but even when using query parameters, the beauty of a URL is definitely something to think about. When accessing the ingredients and nutrition page for a specific menu item, the URL does not even display the food in question.

The best that I can hope for, now that I have a working version of the API, is that no one further convolutes the menu pages.

So I have a plead to web developers everywhere: keep the URLs short and sweet, and write HTML that is hierarchical and easy to follow. When some college student wants to build an API out of the data on your page, they will thank you.

January 28, 2016

Go and Artificial Intelligence

I came across this really interesting article about a program designed to beat computers at the Chinese game Go here. This breakthrough is reminiscent of IBM’s Deep Blue in the 1990’s, beating chess champion Gary Kasparov in a test of man vs computer.

Go is significantly harder for a computer to solve than chess, with approximately 10¹⁷⁰ board configurations.

The program utilizes deep neural networks to determine the best possible move from a given position. After learning from millions of expert games, the program played against itself, learning to improve along the way.

This is pretty big for Google (who designed the program), machine learning, and technology in general. Read more about it here.

January 24, 2016

Hello World!

So I decided to start my own blog on my new webpage. I hope to add stuff regularly, and I hope that the other sections of the site will soon be populated with more and more content. I’ll try to keep it updated regularly, depending on how busy life is going. In the meantime, you can check out my other sites here.

For this site, I utilized Jekyll, a templating tool that helps to build websites.

I’m not quite sure what this site will turn into. I’m expecting to post about code, about travel, about stuff going on in the world, and on things that I find cool. We’ll see what happens. I’m pretty excited about the layout as of now.

Here goes nothing.