The Wealthfront data engineering team is a heavy user of Apache Kafka: we’ve built many of the streaming applications that power the Wealthfront client experience around it. The multiple-producer, multiple-consumer persistent model is especially valuable for streaming use cases where we want to build in analytics on the streamed data. For example, if we have third-party transactions processing through Kafka, we can use Kafka as a source for both application logic and batch ETLs for offline analytics.
We embarked on a journey recently to refactor our external account linking flow to support multiple third-party providers. Our existing linking provider, Quovo, is deprecated after its acquisition by Plaid. The linking team settled on Yodlee as our next vendor for linking external accounts. Not only was this a big project on its own, but it also gave us the opportunity to rethink our offline and online data flows. We decided to use the Yodlee project as a starting point for a much larger effort to delve deeper into using AWS. We’re excited about where
This blog post is focused on a small issue inherent in the process of migrating hosted Kafka producers to a massively parallel serverless processing environment like AWS Lambda.
We recently productionalized a project to handle Yodlee data updates through webhooks. These are essentially microbatches of new transactions coming in from Yodlee’s syncs with other financial institutions. Our linking backend translates these transactions from the Yodlee schema into a generic format, and further downstream it classifies them into categories like savings and spending. But first, we need to make requests to Yodlee’s API to retrieve the transactions in parallel before sending them to Kafka. AWS Lambda is the most obvious solution for this: it scales quickly with sudden changes in throughput.
If you have used Lambda before you may know that latency-sensitive applications often need to implement warming, since cold starts of Lambda functions can take on the order of several seconds before execution can begin. This is because Lambda initializes a container environment specific for your function invocation. If there is a large time gap between invocations, you may see another Lambda worker initialized for each invocation. This behavior is unfortunately nondeterministic, so you won’t know if your invocation occurs on an existing Lambda worker or a new one.
There are certainly many benefits to the Lambda worker model, besides the obvious benefit of a lower average latency for invocations. At Wealthfront, we write mostly Java Lambda functions, and we use Guice for dependency injection. Having long-standing Lambda workers means that injected members of a class can persist between invocations. One instance where this helps is the case of connecting to RDS, where we are much more likely to get rate limited if each Lambda invocation initializes its own connection. Saving connections between invocations helps us reach a much higher scale without worrying about rate limiting.
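To make the worker model concrete, here is a minimal sketch in plain Java. It uses neither the AWS SDK nor Guice, and FakeConnection and ReusingHandler are hypothetical names invented for illustration. The assumption is that one handler object lives as long as the Lambda worker does, so a field initialized at cold start is shared by every later invocation on that worker.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for an expensive resource such as a database connection (hypothetical). */
final class FakeConnection {
    // Counts how many connections were ever opened in this JVM.
    static final AtomicInteger OPENED = new AtomicInteger();

    FakeConnection() {
        OPENED.incrementAndGet();
    }

    String query(String sql) {
        return "result of " + sql;
    }
}

/** Hypothetical handler: one instance is assumed to live as long as the worker. */
final class ReusingHandler {
    // Created once when the handler is constructed (the cold start),
    // then reused by every invocation that lands on this worker.
    private final FakeConnection connection = new FakeConnection();

    String handleRequest(String input) {
        return connection.query(input);
    }
}
```

Calling handleRequest three times on the same ReusingHandler opens a single FakeConnection, whereas constructing the connection inside handleRequest would open three.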
The persistence of Lambda workers between invocations is not always good, though. A few weeks ago, we productionalized our Yodlee data update requester Lambda function. This function is invoked once for every element in each microbatch of data updates. It then sends the Yodlee response via Kafka to our backend. When our linking team turned on Yodlee’s webhook requests and we started linking Yodlee accounts internally, we started to see this error approximately once a day in our exception router:
ERROR The server disconnected before a response was received. "org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received."
WARN [Producer clientId=169.254.121.157] Received invalid metadata error in produce request on partition prod-KAFKA_DIRECT-stream-link-external_api_requests-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
Since we productionalized the Yodlee flow after we deployed this function initially, we didn’t see this until a critical mass of transactions were flowing through the system. CloudWatch logs are notoriously difficult to search through, but luckily we have routed Lambda logs to our Kibana instance (I’ll document our approach to this in a future blog post). Kibana pointed at our new Lambda as well as another Lambda in an integration test we had separately written (that wasn’t throwing exceptions to our exception router).
When I located the CloudWatch log streams with these error messages, I found that the errors all came from Lambda invocations occurring well after the stream’s previous logs: specifically, more than ten minutes after. This was a clue, because each Lambda worker writes to its own CloudWatch log stream, so these time gaps indicated that a single worker had requests spaced out by more than 10 minutes. After spot-checking some log streams that did not contain this error message, it was clear that the time gap was the root of the problem. I discovered that Kafka brokers have a configuration property, connections.max.idle.ms, that dictates when to drop a client connection. By default, if a Kafka client connection is idle for 10 minutes, the broker will drop the connection. Since Lambda worker reuse is nondeterministic (at least to the client), we were very occasionally reusing workers that had been idle for more than 10 minutes. Kafka dropped the connection, and we got the error message above.
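The spot-check described above can be sketched as a small Java helper. This is an illustration rather than the tooling we actually used: IdleGapFinder and its method are hypothetical names, and the input is assumed to be one worker’s invocation timestamps, already sorted in ascending order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class IdleGapFinder {
    // Broker-side default for connections.max.idle.ms as described in the post: 10 minutes.
    static final Duration BROKER_IDLE_LIMIT = Duration.ofMinutes(10);

    /** Returns the invocations that followed a gap strictly longer than the limit. */
    static List<Instant> invocationsAfterIdleGap(List<Instant> sortedInvocations, Duration limit) {
        List<Instant> suspects = new ArrayList<>();
        for (int i = 1; i < sortedInvocations.size(); i++) {
            Duration gap = Duration.between(sortedInvocations.get(i - 1), sortedInvocations.get(i));
            if (gap.compareTo(limit) > 0) {
                // The worker sat idle long enough for the broker to drop the connection.
                suspects.add(sortedInvocations.get(i));
            }
        }
        return suspects;
    }
}
```

Any invocation this returns is one where the worker’s producer may have been holding a connection that the broker had already closed.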
Why didn’t this happen from our backend producers? We use singleton Kafka producers from our microservices, and these services are sending high-throughput data streams like user events through Kafka. Since all messages are routed through the same producer on the service (as opposed to Lambda, where each worker has its own producer), the likelihood of ten minutes of idle time is very low.
The fix for our Lambdas was to use an expiring provider: instead of injecting a Kafka producer directly via Guice, we inject a provider of Kafka producers, which retrieves a new instance if the previous instance’s connection has been dropped.
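A minimal sketch of the expiring-provider idea follows. It is not our production class, and it makes a few assumptions that should be read as such: it implements java.util.function.Supplier rather than Guice’s Provider so that it runs without dependencies, it treats "idle for longer than maxIdle" as a proxy for "the broker has dropped the connection", and the names ExpiringProvider, factory, and clockMillis are invented for illustration.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hands out a cached instance until it has sat unused for maxIdle,
 * then closes it (if closeable) and builds a fresh one.
 */
final class ExpiringProvider<T> implements Supplier<T> {
    private final Supplier<T> factory;      // builds a new instance, e.g. a Kafka producer
    private final long maxIdleMillis;       // assumed to be set below the broker's idle limit
    private final LongSupplier clockMillis; // injectable clock so the expiry logic is testable

    private T current;
    private long lastUsedMillis;

    ExpiringProvider(Supplier<T> factory, Duration maxIdle, LongSupplier clockMillis) {
        this.factory = factory;
        this.maxIdleMillis = maxIdle.toMillis();
        this.clockMillis = clockMillis;
    }

    @Override
    public synchronized T get() {
        long now = clockMillis.getAsLong();
        if (current == null || now - lastUsedMillis >= maxIdleMillis) {
            closeQuietly(current); // release the stale instance before replacing it
            current = factory.get();
        }
        lastUsedMillis = now;
        return current;
    }

    private static void closeQuietly(Object instance) {
        if (instance instanceof AutoCloseable) {
            try {
                ((AutoCloseable) instance).close();
            } catch (Exception ignored) {
                // The instance is being discarded anyway.
            }
        }
    }
}
```

In a handler, each invocation would call get() on the provider rather than holding a producer field directly. Choosing a maxIdle somewhat below the broker’s ten-minute default (nine minutes, say) is one plausible setting, not a recommendation we have measured.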
This pattern likely applies to a wide variety of other use cases - including the RDS use case above. We haven’t seen the issue since applying the fix. Hopefully this helps someone out there running into the same Kafka connection issue.
I found the following articles from the AWS blog fairly interesting: This post introduces the use of self-hosted Kafka topics as input triggers to Lambda. Our team hosts Kafka in EC2, and this could assist in some future use cases. This post introduces provisioned concurrency to reduce the impact of cold starts on latency-sensitive workloads. Our batch computation of data updates is not latency-sensitive, but other workloads may be.
Suppose you wanted to build a web scraping API that would present the structured HTML data of some web page as JSON for other sites to consume. Certain aspects of the structure of the site are important when designing a scraper like this. For example, it could be helpful for the site to use human-readable (or at least meaningful) query parameters. It would be helpful if the meaningful data on the webpage was in a predictable place. The HTML file should be more than just this:
The point here is that web crawlers and scrapers are often limited to the data that is statically available to them on the page. Without running the script to completion, the crawler cannot determine what data is on the page. A program cannot determine in advance whether or not the script will run to completion at all (a case of the halting problem). As a result, indexing the Web and scraping from sites is more difficult, and sometimes even impossible. Furthermore, allowing arbitrary mobile code to be run on client machines creates a number of security holes that do not exist with plain HTML webpages. Since HTML is a declarative language, it in itself does not introduce security holes.
On the other hand, the event-driven nature of jQuery led to buggy Web pages for many developers. The notion of a “single source of truth” could become lost in thousands of lines of jQuery, since components within pages were often related in ways that were difficult to keep track of. On sites where there may have been hundreds or thousands of interactive elements, jQuery codebases grew tremendously. Furthermore, due to (relatively) slow network speeds during the early years of jQuery, the 30kb library increased page load time significantly. Nevertheless, thanks to its cross-browser compatibility and simple syntax, jQuery became vastly popular, at one point a part of almost 90% of all Web sites (3).
Issue: Search Engine Optimization and Crawlers
However, Google isn’t the only company that crawls the Web. Smaller search engines with fewer engineering resources may not have implemented the same full-scale crawling capability that Google has. Developers looking to scrape from Web sites have a much harder time when the data is dynamically generated. For example, in my own tests, I’ve observed that Python’s
Issue: Page Bloat and Open Source
On top of the lost productivity, security flaws could have easily been exploited. After the open source contributor responsible for the 11-line package unpublished all of his NPM packages, global package names became available for registration. A malicious developer could acquire one of these global names, republish it, and introduce malicious code into sites that depend on the unpublished package (11). This is a huge security issue, and it indicates a problem with blindly trusting that code will function as expected. This is more of a classic debate about open-source software: how much can we trust fellow developers? Whether or not we choose to trust them, there are security flaws to address with some modern frameworks.
Issue: Competition and Turnover
One difficulty of being a modern Web developer is the pace at which new technologies are developed. jQuery’s popularity decreased thanks to its lack of foresight into the single-page application era. Not all frameworks are designed with future applications in mind. As a result, new frameworks are created to provide new functionality. Angular introduced bidirectional data binding, and React introduced immutable data (12). Trying to keep up with the next hot framework requires developers to be constantly on their toes. For many smaller frameworks, the developer community is too small to warrant using the framework. We can observe Metcalfe’s law at work: the most valuable frameworks are the ones with the most developers.
- “Making AJAX Applications Crawlable | AJAX Crawling (Deprecated) | Google Developers.” Google, Google, 7 Oct. 2009, developers.google.com/webmasters/ajax-crawling/docs/learn-more.
- “Understanding Web Pages Better.” Official Google Webmaster Central Blog, 23 May 2014, webmasters.googleblog.com/2014/05/understanding-web-pages-better.html.
- “Deprecating Our AJAX Crawling Scheme.” Official Google Webmaster Central Blog, 14 Oct. 2015, webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html.
- Hund, Patrick. “Testing a React-Driven Website's SEO Using ‘Fetch as Google.’” FreeCodeCamp, FreeCodeCamp, 4 Nov. 2016, medium.freecodecamp.org/using-fetch-as-google-for-seo-experiments-with-react-driven-websites-914e0fc3ab1.
- Collins, Keith. “How One Programmer Broke the Internet by Deleting a Tiny Piece of Code.” Quartz, Quartz, 1 Apr. 2016, qz.com/646467/how-one-programmer-broke-the-internet-by-deleting-a-tiny-piece-of-code/.
- The Fact That This Is Possible with NPM Is Dangerous | Hacker News, news.ycombinator.com/item?id=11341006.
- Edited by Tim Berners-Lee and Noah Mendelsohn, The Rule of Least Power, www.w3.org/2001/tag/doc/leastPower.html.
A recent New York Times article points out the impact that Facebook has had on isolating people into their political corners.
This is probably one of the biggest challenges of AI today: personalization of the News Feed has gone to the next level, and Facebook is suddenly responsible for the reinforcement of political ideologies. We can blame the existence of partisanship today at least partially on artificial intelligence.
Maybe give the user some ability to select how much they are sheltered in their political bubble? Maybe a “hmm… I don’t agree, but tell me more” button.
It makes me a bit curious about how Facebook is using data generated from “reacts” to play into their News Feed algorithm. Just because someone “angry reacts” at something doesn’t necessarily mean they don’t want to see it. How to differentiate? And how can Facebook create bipartisan News Feeds that people actually want to see?
I’ve most recently struggled with asynchronous callbacks within nested loops:
JSHint gives a warning on the callback within a loop, and for good reason. The callbacks pile up as the loop continues and do not run until later, which means that foodArr will still be empty on return. I wasn’t able to figure out a solution without using an external library. Instead, I needed to use the async library and a whole bunch of extra callbacks just to make this thing run in order.
}); will make anyone cringe.
In building an API to hold Tufts dining menu data, there were certain things that I found more challenging than others. There are the things that you would expect to be difficult (things that I expected to be difficult, anyway), such as pushing my site to Heroku and learning and using MongoDB for the first time. On the other hand, there are other things that I would expect to be easier, including parsing retrieved HTML and formatting it. You know, because HTML data is supposed to be structured, and can basically be turned into JSON on the spot.
Nope. I don’t know where Tufts Dining gets its menu template from, but I will say that it is incredibly hard to read, and to parse. Though the menu appeared hierarchical, the HTML represented it as a table, meaning that every heading and menu item seemed to have the same level of significance, at least in the HTML. Most of the styles were embedded into the HTML, a major no-no in the world of web programming.
Another feature of Tufts dining menus is that the URLs associated with a specific menu are extremely long. I understand that they need some query string parameters in order to display the menu, but even when using query parameters, the beauty of a URL is definitely something to think about. When accessing the ingredients and nutrition page for a specific menu item, the URL does not even display the food in question.
The best that I can hope for, now that I have a working version of the API, is that no one further convolutes the menu pages.
So I have a plea to web developers everywhere: keep the URLs short and sweet, and write HTML that is hierarchical and easy to follow. When some college student wants to build an API out of the data on your page, they will thank you.
I came across this really interesting article about a program designed to beat human players at the Chinese game Go here. This breakthrough is reminiscent of IBM’s Deep Blue in the 1990s, beating chess champion Garry Kasparov in a test of man vs computer.
Go is significantly harder for a computer to solve than chess, with approximately 10^170 board configurations.
The program utilizes deep neural networks to determine the best possible move from a given position. After learning from millions of expert games, the program played against itself, learning to improve along the way.
This is pretty big for Google (who designed the program), machine learning, and technology in general. Read more about it here.
So I decided to start my own blog on my new webpage. I hope that the other sections of the site will soon be populated with more and more content. I’ll try to keep it updated regularly, depending on how busy life gets. In the meantime, you can check out my other sites here.
For this site, I used Jekyll, a static site generator that builds websites from templates.
I’m not quite sure what this site will turn into. I’m expecting to post about code, about travel, about stuff going on in the world, and on things that I find cool. We’ll see what happens. I’m pretty excited about the layout as of now.
Here goes nothing.