In April 2022 the U.S. Ninth Circuit Court of Appeals upheld a decision that scraping publicly available information from the web is legal, and not tantamount to computer hacking as LinkedIn had claimed in the case. Web scraping refers to the extraction of data from a website, and its exoneration is good news for the archivists, academics, researchers, and journalists who regularly use the practice. The implications for an aspiring Memeticist are huge: being able to scrape the web is akin to having access to the sum total of human knowledge. Pair web scraping with machine learning categorization of memes and you can spot trends, spy on competitors, and learn which memes work.
The term ‘web scraping’ doesn’t have to refer to programmatically accessing a website; it can also be done manually, usually by outsourced labor in low-cost corners of the globe. Typically, however, automated tools and scripts are used to acquire data at any reasonable scale. Good citizens in the web scraping world identify themselves by default. For example, Google’s web crawler indexes the internet for its search engine, and is welcome on most sites for the traffic it provides. Recommended practice is to sign up for the service’s official API (Application Programming Interface), where there is one, so that requests can be monitored and handled more efficiently.
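As a rough illustration of what identifying yourself can look like in practice, here is a minimal sketch using only Python's standard library. It checks a URL against a site's robots.txt rules and attaches a descriptive User-Agent header to a request. The robots.txt content, the bot name, and the example.com URLs are all made up for the example, and nothing is actually sent over the network.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download this
# from https://<site>/robots.txt before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical bot name with a contact URL, so site owners can see who is calling.
USER_AGENT = "example-research-bot/0.1 (+https://example.com/bot-info)"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url: str) -> Request:
    """Build (but do not send) a request that carries the identifying User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    for page in ("https://example.com/public/page", "https://example.com/private/data"):
        print(page, "allowed:", is_allowed(ROBOTS_TXT, USER_AGENT, page))
```

Checking robots.txt is a courtesy convention rather than a legal test, so treat this as a starting point and not a guarantee that a site is happy to be scraped.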
However, APIs aren’t always accessible, available, or affordable, so many turn to web scraping to get what they need. Website data can be valuable, and hitting a server hundreds of thousands of times costs the owner money, so historically this has been a game of cat and mouse. Black hat approaches tend to revolve around sophisticated techniques for simulating a real user’s behavior to evade anti-scraping defenses. The process is necessarily brittle: even websites that allow scraping are likely to change without warning, breaking the script used to parse their content.
Ultimately, if information is online, it’s possible to scrape. In making data available to normal users, website owners also make it possible to scrape. Techniques range from simple to complex, and web scraping is even available in Microsoft Excel via “Data > Get External Data > From Web”. There are also browser extensions that let you extract elements from a page, or download videos, even when the author doesn’t want them downloaded. The general advice I must insert here is that you shouldn’t infringe on anyone’s copyright, and should respect the site’s terms and conditions. More advanced work usually relies on one of a handful of Python libraries, such as Beautiful Soup, to parse HTML and write the rules for what to extract. There are also tools like Selenium, which automate real web browsers and can even do things like log in with your password. Whatever technique you use, web scraping is here to stay.
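To make "writing the rules for what to extract" a little more concrete, here is a minimal sketch that pulls every link out of a page. It uses Python's built-in `html.parser` as a dependency-free stand-in for a library like Beautiful Soup, which would do the same job in fewer lines. The sample HTML, the `LinkExtractor` class, and the `extract_links` helper are invented for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for each <a> tag that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list:
    """Parse an HTML string and return the links found in it."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page content standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <h1>Trending memes</h1>
  <ul>
    <li><a href="/memes/1">Distracted Boyfriend</a></li>
    <li><a href="/memes/2">Doge</a></li>
  </ul>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

Because pages change without warning, a real script would probably also want to raise an error when it finds no links at all, instead of quietly returning an empty list.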
Name | Type |
---|---|
Web scraping is legal, US appeals court reaffirms | Article |
Web Scraping using Selenium and Python | Blog |
What is Web Scraping? | Blog |