In April 2022 the U.S. Ninth Circuit Court of Appeals upheld a decision that scraping publicly available web data is legal and does not amount to computer hacking, as LinkedIn had claimed in the case. Web scraping refers to the extraction of data from a website, and its vindication is good news for archivists, academics, researchers, and journalists who regularly use the practice. The implication for an aspiring Memeticist is huge: being able to scrape the web comes close to having access to the sum total of human knowledge. Pair web scraping with machine learning categorization of memes and you can spot trends, spy on competitors, and learn which memes work.
Website data can be valuable, and hitting a server hundreds of thousands of times costs the owner money, so historically scraping has been a game of cat and mouse between scrapers and site owners. Although web scraping can be done manually, automated tools and scripts are typically used to acquire data at scale. Good citizens in the web scraping world identify themselves by default: Google’s web crawler, for example, crawls and indexes the web for its search engine and is generally welcome for the traffic it provides. Black hat approaches tend to revolve around sophisticated techniques for simulating a real user’s behavior to evade anti-scraping defences. The process is necessarily brittle, because even websites that allow scraping are likely to change without warning, breaking the script used to parse their content. Whether you’re using Excel’s “Data > Get External Data > From Web” or more sophisticated custom software, web scraping is here to stay.
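To make the "good citizen" idea concrete, here is a minimal sketch in Python using only the standard library. It is one possible approach, not a recommended production scraper. The bot name, contact address, and function names (`build_request`, `fetch`, `extract_links`) are all hypothetical and chosen for illustration. The scraper announces itself through the `User-Agent` header, and it treats an empty result as a hint that the page layout may have changed, which is the brittleness described above.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical identity: swap in your own project name and contact address
# so site owners can tell who is crawling and how to reach you.
USER_AGENT = "ExampleMemeResearchBot/0.1 (+mailto:you@example.com)"


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, skipping anchors without one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def build_request(url):
    """Build a request that identifies the scraper instead of hiding it."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Download a page as text; network and HTTP errors propagate to the caller."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # example.com is a placeholder; point this at a site you may scrape.
    links = extract_links(fetch("https://example.com/"))
    if not links:
        # An empty result often means the markup changed, not that
        # the page genuinely has no links.
        print("No links found; the page layout may have changed.")
    for link in links:
        print(link)
```

Keeping the parsing step (`extract_links`) separate from the download step (`fetch`) means the parser can be checked against saved HTML without touching the live site, which helps when a page changes and the script needs repair.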
Name | Link | Type |
---|---|---|
Web scraping is legal, US appeals court reaffirms | | Article |
What is Web Scraping? | | Blog |