Scrapy 1.x will be the last series supporting Python 2. Scrapy 2.0, planned for Q4 2019 or Q1 2020, will support Python 3 only.

As a result, when an item loader is initialized with an item, ItemLoader.load_item() once again makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data.
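A minimal sketch of the restored behavior (the dict item and the 'name' field are illustrative):

    from scrapy.loader import ItemLoader

    loader = ItemLoader(item={'name': 'example'})
    loader.load_item()               # returns the wrapped item
    loader.get_output_value('name')  # later calls now return empty data again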

Re-packaging of Scrapy 1.7.0, which was missing some changes in the PyPI package.

The following deprecated settings have also been removed (issue 3578):

How to split an item into multiple items in an item pipeline?
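The FAQ's suggested approach is a spider middleware, since item pipelines return a single item per input item. A minimal sketch (the class name and the 'values' field are hypothetical):

    import copy

    class SplitItemsMiddleware:
        # Spider middleware that explodes one scraped item into many.
        def process_spider_output(self, response, result, spider):
            for item in result:
                if isinstance(item, dict) and 'values' in item:
                    # Yield one output item per entry in the 'values' field.
                    for value in item.pop('values'):
                        new_item = copy.deepcopy(item)
                        new_item['value'] = value
                        yield new_item
                else:
                    # Requests and other items pass through unchanged.
                    yield item

Enable it via the SPIDER_MIDDLEWARES setting as usual.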

Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working. Scheduler priority queue classes now need to handle Request objects instead of arbitrary Python data structures.

Spider subclass instances were never meant to work, and they did not work as one would expect: instead of using the passed Spider subclass instance, their from_crawler method was called to generate a new instance.

Crawler, CrawlerRunner.crawl, and CrawlerRunner.create_crawler no longer accept a Spider subclass instance; they only accept a Spider subclass now.
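For example, with CrawlerProcess, a CrawlerRunner subclass (a minimal sketch):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = 'my_spider'

    process = CrawlerProcess()
    process.crawl(MySpider)      # pass the Spider subclass itself
    # process.crawl(MySpider())  # passing an instance now raises an error
    process.start()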

429 is now part of the RETRY_HTTP_CODES setting by default.

This change is backward incompatible. If you don't want to retry 429, you must override RETRY_HTTP_CODES accordingly.
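For example, to opt out of retrying 429, set RETRY_HTTP_CODES back to its previous default in settings.py (the list below is our assumption of the pre-1.7 default):

    # settings.py — leave 429 out to restore the pre-1.7 behavior
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]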

Make sure you install Scrapy 1.7.1. The Scrapy 1.7.0 package in PyPI is the result of an erroneous commit tagging and does not include all the changes described below.

See Module Relocations for more information, or use suggestions from Scrapy 1.5.x deprecation warnings to update your code.

Backward incompatible: Scrapy's telnet console now requires username and password. See Telnet Console for more details. This change fixes a security issue; see the Scrapy 1.5.2 (2019-01-22) release notes for details.
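Credentials can also be pinned explicitly in settings.py (the values here are placeholders):

    # settings.py — explicit telnet console credentials
    TELNETCONSOLE_USERNAME = 'scrapy'
    TELNETCONSOLE_PASSWORD = 'choose-a-strong-password'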

If you’re using custom Selector or SelectorList subclasses, a backward incompatible change in parsel may affect your code. See parsel changelog for a detailed description, as well as for the full list of improvements.

CSS selectors are cached in parsel >= 1.5, which makes them faster when the same CSS path is used many times. This is very common for Scrapy spiders: callbacks are usually called several times, on different pages.

Another useful new feature is the introduction of the Selector.attrib and SelectorList.attrib properties, which make it easier to get attributes of HTML elements. See Selecting element attributes.
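A quick illustration, using parsel directly:

    from parsel import Selector

    sel = Selector(text='<a href="https://example.com" rel="nofollow">link</a>')
    sel.css('a').attrib['href']   # 'https://example.com' (first matching element)
    sel.css('a')[0].attrib        # {'href': 'https://example.com', 'rel': 'nofollow'}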

There are currently no plans to deprecate .extract() and .extract_first() methods.

The most visible change is that the .get() and .getall() selector methods are now preferred over .extract_first() and .extract(). We feel that these new methods result in more concise and readable code. See extract() and extract_first() for more details.
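Side by side, the old and new spellings are equivalent:

    from parsel import Selector

    sel = Selector(text='<a href="/a">a</a><a href="/b">b</a>')
    sel.css('a::attr(href)').extract_first()  # '/a'  (old)
    sel.css('a::attr(href)').get()            # '/a'  (new)
    sel.css('a::attr(href)').extract()        # ['/a', '/b']  (old)
    sel.css('a::attr(href)').getall()         # ['/a', '/b']  (new)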

While these are not changes in Scrapy itself but in the parsel library, which Scrapy uses for XPath/CSS selectors, they are worth mentioning here. Scrapy now depends on parsel >= 1.5, and the Scrapy documentation has been updated to follow recent parsel API conventions.

The fix is backward incompatible: it enables telnet user-password authentication by default, with a randomly generated password. If you can't upgrade right away, please consider setting TELNETCONSOLE_PORT to a non-default value.

Security bugfix: the Telnet console extension can be easily exploited by rogue websites POSTing content to http://localhost:6023. We haven't found a way to exploit it from Scrapy, but it is very easy to trick a browser into doing so, which elevates the risk for local development environments.

This is a maintenance release with important bug fixes, but no new features:

This release brings small new features and improvements across the codebase. Some highlights:

Scrapy 1.4 does not bring that many breathtaking new features but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and password via the new FTP_USER and FTP_PASSWORD settings. And if you’re using Twisted version 17.1.0 or above, FTP is now available with Python 3.
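In settings.py (the values are illustrative):

    # settings.py — credentials used for FTP downloads
    FTP_USER = 'anonymous'
    FTP_PASSWORD = 'guest@example.com'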

There’s a new response.follow method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

it handles relative URLs;

it works properly with non-ASCII URLs on non-UTF-8 pages;

in addition to absolute and relative URLs, it supports Selectors; for <a> elements it can also extract their href values.

For example, instead of this:

    for href in response.css('li.page a::attr(href)').extract():
        url = response.urljoin(href)
        yield scrapy.Request(url, self.parse, encoding=response.encoding)

One can now write this:

    for a in response.css('li.page a'):
        yield response.follow(a, self.parse)

Link extractors are also improved. They work similarly to what a regular modern browser would do: leading and trailing whitespace is removed from attributes (think href="  http://example.com") when building Link objects. This whitespace-stripping also happens for action attributes with FormRequest.

Please also note that link extractors no longer canonicalize URLs by default. This was puzzling users every now and then, and it's not what browsers actually do, so we removed that extra transformation on extracted links.

For those of you wanting more control on the Referer: header that Scrapy sends when following links, you can set your own Referrer Policy . Prior to Scrapy 1.4, the default RefererMiddleware would simply and blindly set it to the URL of the response that generated the HTTP request (which could leak information on your URL seeds). By default, Scrapy now behaves much like your regular browser does. And this policy is fully customizable with W3C standard values (or with something really custom of your own if you wish). See REFERRER_POLICY for details.
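For example, to send the Referer header only on same-origin requests (one of the W3C standard values):

    # settings.py — a standard W3C policy name
    REFERRER_POLICY = 'same-origin'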

To make Scrapy spiders easier to debug, Scrapy logs more stats by default in 1.4: memory usage stats, detailed retry stats, detailed HTTP error code stats. A similar change is that the HTTP cache path is also visible in logs now.

Last but not least, Scrapy now has the option to make JSON and XML items more human-readable, with newlines between items and even custom indenting offset, using the new FEED_EXPORT_INDENT setting.
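For example:

    # settings.py — indent exported JSON/XML items by 4 spaces
    FEED_EXPORT_INDENT = 4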

Enjoy! (Or read on for the rest of the changes in this release.)