Suppose you wanted to build a web scraping API that would present the structured HTML data of some web page as JSON for other sites to consume. Certain aspects of a site's structure matter when designing a scraper like this. For example, it helps if the site uses human-readable (or at least meaningful) query parameters, and if the meaningful data on the page lives in a predictable place. The HTML file should be more than just a bare skeleton (sketched below with a hypothetical bundle name):
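```html
<!-- A script-only page: the markup carries no data of its own;
     everything the user sees is injected by the (hypothetical) bundle.js. -->
<!DOCTYPE html>
<html>
  <head>
    <title>My Site</title>
  </head>
  <body>
    <div id="root"></div>
    <script src="bundle.js"></script>
  </body>
</html>
```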
The point here is that web crawlers and scrapers are often limited to the data that is statically available on the page. Without running the script to completion, a crawler cannot determine what data the page contains, and a program cannot even determine in advance whether the script will run to completion at all (an instance of the halting problem). As a result, indexing the Web and scraping from sites becomes more difficult, and sometimes impossible. Furthermore, allowing arbitrary mobile code to run on client machines creates a number of security holes that do not exist with plain HTML pages. Because HTML is a declarative language, it does not, in itself, introduce security holes.
On the other hand, the event-driven nature of jQuery led many developers to buggy Web pages. The notion of a “single source of truth” could easily get lost in thousands of lines of jQuery, because components within a page were often related in ways that were difficult to keep track of. On sites with hundreds or thousands of interactive elements, jQuery codebases grew tremendously. Furthermore, given the (relatively) slow network speeds of jQuery's early years, the roughly 30 kB library noticeably increased page load time. Nevertheless, thanks to its cross-browser compatibility and simple syntax, jQuery became vastly popular, at one point being used on almost 90% of all Web sites (3).
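As a small sketch of how this tended to happen (the element IDs and handlers here are hypothetical), consider a piece of state such as an item count that has no single owner: every handler that changes it must remember to update each place it appears.

```javascript
// Hypothetical example of state scattered across the DOM in jQuery.
// The item count lives only in #count and #cart-badge; whichever handler
// last touched it is the de facto "source of truth".
$('#add-item').on('click', function () {
  var count = parseInt($('#count').text(), 10) + 1;
  $('#count').text(count);
  $('#cart-badge').text(count);             // easy to forget one of these...
  $('#checkout').prop('disabled', count === 0);
});

$('#remove-item').on('click', function () {
  var count = parseInt($('#count').text(), 10) - 1;
  $('#count').text(count);
  // ...and a bug creeps in: #cart-badge and #checkout are never updated here.
});
```

Multiply this pattern across hundreds of interactive elements and it becomes hard to say where any given value is actually defined.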
Issue: Search Engine Optimization and Crawlers
However, Google isn't the only company that crawls the Web. Smaller search engines with fewer engineering resources may not have implemented the same full-scale crawling capability that Google has. Developers looking to scrape Web sites would also have a much harder time if the data were dynamically generated. For example, in my own tests, I've observed that Python's standard scraping tools see only the initial HTML response, with none of the content that client-side JavaScript later renders.
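As a rough sketch of the problem (the URL and element id are hypothetical), a plain HTTP fetch with requests and BeautifulSoup returns only the server's initial HTML; any element that a client-side script would later populate comes back empty:

```python
# Fetch a page the way a simple scraper would: no JavaScript is executed,
# so dynamically generated content never appears in the response.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/app")   # hypothetical URL
soup = BeautifulSoup(response.text, "html.parser")

root = soup.find(id="root")
print(root)  # likely just an empty <div id="root"></div>
```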
Issue: Page Bloat and Open Source
On top of the lost productivity, security flaws could easily have been exploited. After the open-source contributor responsible for the 11-line package unpublished all of his NPM packages, their names became available for anyone to register. A malicious developer could acquire one of those names, republish it, and introduce malicious code into every site that depends on the unpublished package (11). This is a serious security issue, and it points to a problem with blindly trusting that code will function as expected. It also touches on a classic debate about open-source software: how much can we trust fellow developers? Whether or not we choose to trust them, there are security flaws to address in some modern frameworks.
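As a concrete sketch (the package name below is hypothetical): a caret version range tells npm to accept any compatible newer release, so once the original package is unpublished and someone else republishes the name with a higher version, a fresh install pulls in the newcomer's code without any change to the depending project.

```json
{
  "dependencies": {
    "hypothetical-tiny-helper": "^1.0.0"
  }
}
```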
Issue: Competition and Turnover
One difficulty of being a modern Web developer is the pace at which new technologies appear. jQuery's popularity declined because it lacked foresight into the single-page-application era, and not all frameworks are designed with future applications in mind. As a result, new frameworks are created to provide new functionality: Angular introduced bidirectional data binding, and React introduced immutable data (12). Trying to keep up with the next hot framework requires developers to be constantly on their toes. For many smaller frameworks, the developer community is too small to justify adopting them. We can observe Metcalfe's law at work: the most valuable frameworks are the ones with the most developers.
- “Making AJAX Applications Crawlable | AJAX Crawling (Deprecated) | Google Developers.” Google, Google, 7 Oct. 2009, developers.google.com/webmasters/ajax-crawling/docs/learn-more.
- “Understanding Web Pages Better.” Official Google Webmaster Central Blog, 23 May 2014, webmasters.googleblog.com/2014/05/understanding-web-pages-better.html.
- “Deprecating Our AJAX Crawling Scheme.” Official Google Webmaster Central Blog, 14 Oct. 2015, webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html.
- Hund, Patrick. “Testing a React-Driven Website's SEO Using ‘Fetch as Google.’” FreeCodeCamp, FreeCodeCamp, 4 Nov. 2016, medium.freecodecamp.org/using-fetch-as-google-for-seo-experiments-with-react-driven-websites-914e0fc3ab1.
- Collins, Keith. “How One Programmer Broke the Internet by Deleting a Tiny Piece of Code.” Quartz, Quartz, 1 Apr. 2016, qz.com/646467/how-one-programmer-broke-the-internet-by-deleting-a-tiny-piece-of-code/.
- “The Fact That This Is Possible with NPM Is Dangerous.” Hacker News, news.ycombinator.com/item?id=11341006.
- Berners-Lee, Tim, and Noah Mendelsohn, editors. The Rule of Least Power. W3C, www.w3.org/2001/tag/doc/leastPower.html.