Automated tests now pass on the latest PyPy version for supported Python versions in our continuous integration system ( issue 4504 )

Automated tests now pass on Windows as part of our continuous integration system ( issue 4458 )

Renewed the localhost certificate used for SSL tests ( issue 4650 )

Simplified the code example in Working with dataclass items ( issue 4652 )

The processing of ANSI escape sequences is enabled in Windows 10.0.14393 and later, where it is required for colored output ( issue 4393 , issue 4403 )

Request.from_curl and curl_to_request_kwargs() now set the request method to POST when a request body is specified and no request method is specified ( issue 4612 )
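
For illustration, a minimal sketch of the new behavior (URL and data are hypothetical):

    from scrapy import Request

    # A body is given but no explicit method, so the method now defaults to POST:
    request = Request.from_curl("curl 'https://example.com/api' --data 'foo=bar'")
    assert request.method == "POST"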

The parse command now allows specifying an output file ( issue 4317 , issue 4377 )

The scrapy.utils.python.retry_on_eintr function is now deprecated ( issue 4683 )

Removed the following classes and their parent modules from scrapy.linkextractors :

The base implementation of item loaders has been moved into a separate library, itemloaders , allowing usage from outside Scrapy and a separate release schedule

It also serves as a workaround for delayed file delivery , which causes Scrapy to only start item delivery after the crawl has finished when using certain storage backends ( S3 , FTP , and now GCS ).

The new FEED_EXPORT_BATCH_ITEM_COUNT setting makes it possible to deliver output items in batches of up to the specified number of items.
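
As a sketch, batch delivery pairs with the FEEDS setting through the %(batch_id)d placeholder in each feed URI (file names and counts below are illustrative):

    # settings.py
    FEED_EXPORT_BATCH_ITEM_COUNT = 100  # start a new output file every 100 items
    FEEDS = {
        "items-%(batch_id)d.json": {"format": "json"},  # items-1.json, items-2.json, ...
    }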

The startproject command no longer makes unintended changes to the permissions of files in the destination folder, such as removing execution permissions ( issue 4662 , issue 4666 )

Made use of set literals in tests ( issue 4573 )

Configured Travis CI to also run the tests with Python 3.5.2 ( issue 4518 , issue 4615 )

Updated test requirements to reflect an incompatibility with pytest 5.4 and 5.4.1 ( issue 4588 )

You may now run the asyncio tests with Tox on any Python version ( issue 4521 )

Improved code sharing between the crawl and runspider commands ( issue 4548 , issue 4552 )

Removed backslashes preceding *args and **kwargs in some function and method signatures ( issue 4592 , issue 4596 )

It is again possible to download the documentation for offline reading ( issue 4578 , issue 4585 )

The display-on-hover behavior of internal documentation references now also covers links to commands , Request.meta keys, settings and signals ( issue 4495 , issue 4563 )

Removed a misleading import line from the scrapy.utils.log.configure_logging() code example ( issue 4510 , issue 4587 )

Removed from Coroutines the warning about the API being experimental ( issue 4511 , issue 4513 )

When FEEDS defines multiple URIs, log messages about items being stored now contain information from the corresponding feed, instead of always containing information about only one of the feeds ( issue 4619 , issue 4629 )

Fixed a KeyError exception that was sometimes raised from scrapy.utils.datatypes.LocalWeakReferencedCache ( issue 4597 , issue 4599 )

The startproject command now ensures that the generated project folders and files have the right permissions ( issue 4604 )

Spider callbacks defined using coroutine syntax no longer need to return an iterable, and may instead return a Request object, an item , or None ( issue 4609 )
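For example, a coroutine callback may now return a single item directly (spider name and selector are illustrative):

    import scrapy

    class SingleItemSpider(scrapy.Spider):
        name = "single_item"
        start_urls = ["https://example.com"]

        async def parse(self, response):
            # No iterable needed: a lone item, a Request, or None is also valid
            return {"title": response.css("title::text").get()}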

When FEEDS defines multiple URIs, FEED_STORE_EMPTY is False and the crawl yields no items, Scrapy no longer stops feed exports after the first URI ( issue 4621 , issue 4626 )

CookiesMiddleware no longer re-encodes cookies defined as bytes in the cookies parameter of the __init__ method of Request ( issue 2400 , issue 3575 )

scrapy.utils.misc.create_instance() now raises a TypeError exception if the resulting instance is None ( issue 4528 , issue 4532 )

Upgraded the pickle protocol that Scrapy uses from protocol 2 to protocol 4, improving serialization capabilities and performance ( issue 4135 , issue 4541 )

When using Google Cloud Storage for a media pipeline , a warning is now logged if the configured credentials do not grant the required permissions ( issue 4346 , issue 4508 )

The dictionaries in the result list of a media pipeline now include a new key, status , which indicates if the file was downloaded or, if the file was not downloaded, why it was not downloaded; see FilesPipeline.get_media_requests for more information ( issue 2893 , issue 4486 )

scrapy.item.BaseItem is now deprecated, use scrapy.item.Item instead ( issue 4534 )

Support for Python 3.5.0 and 3.5.1 has been dropped; Scrapy now refuses to run with a Python version lower than 3.5.2, which introduced typing.Type ( issue 4615 )

New bytes_received signal that allows canceling response download

Removed code that added support for old versions of Twisted that we no longer support ( issue 4472 )

Removed a warning about importing StringTransport from twisted.test.proto_helpers in Twisted 19.7.0 or newer ( issue 4409 )

Removed warnings about using old, removed settings ( issue 4404 )

Extended use of InterSphinx to link to Python 3 documentation ( issue 4444 , issue 4445 )

Removed references to the Guppy library, which only works in Python 2 ( issue 4285 , issue 4343 )

Covered the curl2scrapy service in the documentation ( issue 4206 , issue 4455 )

Our PyPI entry now includes links for our documentation, our source code repository and our issue tracker ( issue 4456 )

Improved the documentation about signals that allow their handlers to return a Deferred ( issue 4295 , issue 4390 )

Spider.make_requests_from_url , deprecated in Scrapy 1.4.0, now issues a warning when used ( issue 4412 )

zope.interface 5.0.0 and later versions are now supported ( issue 4447 , issue 4448 )

Zsh completion no longer allows options after arguments ( issue 4438 )

None values in allowed_domains no longer cause a TypeError exception ( issue 4410 )

Request serialization no longer breaks for callbacks that are spider attributes which are assigned a function with a different name ( issue 4500 )

Zsh completion now excludes used option aliases from the completion list ( issue 4438 )

A warning is now issued when a value in allowed_domains includes a port ( issue 50 , issue 3198 , issue 4413 )

The new Response.ip_address attribute gives access to the IP address that originated a response ( issue 3903 , issue 3940 )

The crawl and runspider commands now support specifying an output format by appending :<format> to the output file ( issue 1336 , issue 3858 , issue 4507 )

A new setting, FEEDS , allows configuring multiple output feeds with different settings each ( issue 1336 , issue 3858 , issue 4507 )

The FEED_FORMAT and FEED_URI settings have been deprecated in favor of the new FEEDS setting ( issue 1336 , issue 3858 , issue 4507 )
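
A sketch of the migration (file names and options are illustrative):

    # settings.py
    # Deprecated:
    # FEED_FORMAT = "json"
    # FEED_URI = "items.json"

    # Preferred, with independent options per feed:
    FEEDS = {
        "items.json": {"format": "json", "encoding": "utf8"},
        "items.csv": {"format": "csv"},
    }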

The MultiValueDict , MultiValueDictKeyError , and SiteNode classes have been removed from scrapy.utils.datatypes ( issue 4400 )

The spiders property has been removed from Crawler , use CrawlerRunner.spider_loader or instantiate SPIDER_LOADER_CLASS with your settings instead ( issue 4398 )

The ChunkedTransferMiddleware middleware has been removed, including the entire scrapy.downloadermiddlewares.chunked module; chunked transfers work out of the box ( issue 4431 )

The REDIRECT_MAX_METAREFRESH_DELAY setting is no longer supported, use METAREFRESH_MAXDELAY instead ( issue 4385 )

The LOG_UNSERIALIZABLE_REQUESTS setting is no longer supported, use SCHEDULER_DEBUG instead ( issue 4385 )

If you catch an AssertionError exception from Scrapy, update your code to catch the corresponding new exception.

AssertionError exceptions triggered by assert statements have been replaced by new exception types, to support running Python in optimized mode (see -O ) without changing Scrapy’s behavior in any unexpected ways.

New FEEDS setting to export to multiple feeds

Removed top-level reactor imports to prevent errors about the wrong Twisted reactor being installed when setting a different Twisted reactor using TWISTED_REACTOR ( issue 4401 , issue 4406 )

Response.follow_all now supports an empty URL iterable as input ( issue 4408 , issue 4420 )

A new pqueues attribute offers a mapping of downloader slot names to the corresponding instances of downstream_queue_cls .

The following changes affect specifically the DownloaderAwarePriorityQueue class and may affect subclasses:

The spider attribute has been removed. Use crawler.spider instead.

It is used in push() to make up for the removal of its priority parameter.

A new priority() method has been added which, given a request, returns request.priority * -1 .

The following changes affect specifically the ScrapyPriorityQueue class and may affect subclasses:

The serialize attribute has been removed (details above)

The following class attributes have been added:

The new key parameter displaced the startprios parameter 1 position to the right.

Instances of downstream_queue_cls should be created using the new ScrapyPriorityQueue.qfactory or DownloaderAwarePriorityQueue.pqfactory methods.

qfactory was instantiated with a priority value (integer).

downstream_queue_cls , which replaced qfactory , must be instantiated differently.

__init__ may still receive all parameters as positional parameters, however:

In the __init__ method, most of the changes described above apply.

The following changes affect specifically the ScrapyPriorityQueue and DownloaderAwarePriorityQueue classes from scrapy.core.scheduler and may affect subclasses:

The signature of the __init__ method is now __init__(self, crawler, key) .

The following changes may impact custom disk and memory queue classes:

The serialize parameter is no longer passed. The disk queue class must take care of request serialization on its own before writing to disk, using the request_to_dict() and request_from_dict() functions from the scrapy.utils.reqser module.
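
A minimal sketch of such serialization, assuming pickle as the storage format (the helper names dump_request / load_request are hypothetical):

    import pickle

    from scrapy.utils.reqser import request_to_dict, request_from_dict

    def dump_request(request, spider):
        # Replaces callback/errback references with names the spider can resolve later
        return pickle.dumps(request_to_dict(request, spider))

    def load_request(data, spider):
        return request_from_dict(pickle.loads(data), spider)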

The parameter for disk queues that contains data from the previous crawl, startprios or slot_startprios , is now passed as a keyword parameter named startprios .

A new keyword parameter has been added: key . It is always an empty string for memory queues and indicates the JOB_DIR value for disk queues.

The parameter that used to contain a factory function, qfactory , is now passed as a keyword parameter named downstream_queue_cls .

In the __init__ method or the from_crawler or from_settings class methods:

The following changes may impact custom priority queue classes:

The push method no longer receives a second positional parameter containing request.priority * -1 . If you need that value, get it from the first positional parameter, request , instead, or use the new priority() method in scrapy.core.scheduler.ScrapyPriorityQueue subclasses.

The following changes may impact any custom queue classes of all types:

Modified the tox configuration to allow running tests with any Python version, run Bandit and Flake8 tests by default, and enforce a minimum tox version programmatically ( issue 4179 )

Started reporting slowest tests, and improved the performance of some of them ( issue 4163 , issue 4164 )

We now use a recent version of Python to build the documentation ( issue 4140 , issue 4249 )

Fixed an inconsistency between code and output in Scrapy at a glance ( issue 4213 )

Improved consistency when referring to the __init__ method of an object ( issue 4086 , issue 4088 )

Fixed the signatures of the file_path method in media pipeline examples ( issue 4290 )

Improved the documentation about LinkExtractor.extract_links and simplified Link Extractors ( issue 4045 )

Cross-references within our documentation now display a tooltip when hovered ( issue 4173 , issue 4183 )

Links to nonexistent documentation pages now allow access to the sidebar ( issue 4152 , issue 4169 )

API documentation now links to an online, syntax-highlighted view of the corresponding source code ( issue 4148 )

Fixed a typo in the message of the ValueError exception raised when scrapy.utils.misc.create_instance() gets both settings and crawler set to None ( issue 4128 )

Adding items to a scrapy.utils.datatypes.LocalCache object without a limit defined no longer raises a TypeError exception ( issue 4123 )

Z shell auto-completion now looks for .html files, not .http files, and covers the -h command-line switch ( issue 4122 , issue 4291 )

RFPDupeFilter , the default DUPEFILTER_CLASS , no longer writes an extra \r character on each line in Windows, which made the size of the requests.seen file unnecessarily large on that platform ( issue 4283 )

The correct encoding is now used for attachment names in MailSender ( issue 4229 , issue 4239 )

Request no longer accepts invalid URL strings simply because they contain a colon ( issue 2552 , issue 4094 )

Redirects to URLs starting with 3 slashes ( /// ) are now supported ( issue 4032 , issue 4042 )

The first spider middleware (see SPIDER_MIDDLEWARES ) now also processes exceptions raised from callbacks that are generators ( issue 4260 , issue 4272 )

The crawl command now also exits with exit code 1 when an exception happens before the crawling starts ( issue 4175 , issue 4207 )

Download handlers (see DOWNLOAD_HANDLERS ) may now use the from_settings and from_crawler class methods that other Scrapy components already supported ( issue 4126 )

A new keep_fragments parameter of scrapy.utils.request.request_fingerprint() makes it possible to generate different fingerprints for requests with different fragments in their URL ( issue 4104 )
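
For illustration (URLs hypothetical):

    from scrapy import Request
    from scrapy.utils.request import request_fingerprint

    r1 = Request("https://example.com/page#a")
    r2 = Request("https://example.com/page#b")

    request_fingerprint(r1) == request_fingerprint(r2)  # True: fragments ignored by default
    request_fingerprint(r1, keep_fragments=True) == request_fingerprint(r2, keep_fragments=True)  # False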

BaseItemExporter subclasses may now use super().__init__(**kwargs) instead of self._configure(kwargs) in their __init__ method, passing dont_fail=True to the parent __init__ method if needed, and accessing kwargs at self._kwargs after calling their parent __init__ method ( issue 4193 , issue 4370 )

Spider objects now raise an AttributeError exception if they do not have a start_urls attribute nor reimplement start_requests , but have a start_url attribute ( issue 4133 , issue 4170 )

Scrapy logs a warning when it detects a request callback or errback that uses yield but also returns a value, since the returned value would be lost ( issue 3484 , issue 3869 )
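
A minimal sketch of code that triggers this warning:

    def parse(self, response):
        yield {"first": True}
        return {"second": True}  # a generator's return value is discarded, hence the warning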

A new request_left_downloader signal is sent when a request leaves the downloader ( issue 4303 )

item_error for exceptions raised during item processing by item pipelines

Request no longer requires a callback parameter when an errback parameter is specified ( issue 3586 , issue 4008 )

Item loader processors can now be regular functions; they no longer need to be methods ( issue 3899 )

The new Response.cb_kwargs attribute serves as a shortcut for Response.request.cb_kwargs ( issue 4331 )

Scheduler disk and memory queues may now use the class methods from_crawler or from_settings ( issue 3884 )

A new SCRAPER_SLOT_MAX_ACTIVE_SIZE setting allows configuring the existing soft limit that pauses request downloads when the total response data being processed is too high ( issue 1410 , issue 3551 )

The new Response.certificate attribute exposes the SSL certificate of the server as a twisted.internet.ssl.Certificate object for HTTPS responses ( issue 2726 , issue 4054 )

The new Response.follow_all method offers the same functionality as Response.follow but supports an iterable of URLs as input and returns an iterable of requests ( issue 2582 , issue 4057 , issue 4286 )
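
For example (CSS selector hypothetical):

    def parse(self, response):
        # Follow every matching link; follow_all yields one Request per URL
        yield from response.follow_all(css="li.page a", callback=self.parse)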

The next method of scrapy.utils.python.MutableChain is deprecated, use the global next() function or MutableChain.__next__ instead ( issue 4153 )

The noconnect query string argument of proxy URLs is deprecated and should be removed from proxy URLs ( issue 4198 )

scrapy.linkextractors.FilteringLinkExtractor is deprecated, use scrapy.linkextractors.LinkExtractor instead ( issue 4045 )

Using environment variables prefixed with SCRAPY_ to override settings is deprecated ( issue 4300 , issue 4374 , issue 4375 )

The following functions have been removed from scrapy.utils.python : isbinarytext , is_writable , setattr_default , stringify_dict ( issue 4362 )

The Scrapy shell no longer provides a sel proxy object, use response.selector instead ( issue 4347 )

Overridden settings are now logged in a different format. This is more in line with similar information logged at startup ( issue 4199 )

We have refactored the scrapy.core.scheduler.Scheduler class and related queue classes (see SCHEDULER_PRIORITY_QUEUE , SCHEDULER_DISK_QUEUE and SCHEDULER_MEMORY_QUEUE ) to make it easier to implement custom scheduler queue classes. See Changes to scheduler queue classes below for details.

Use the from_settings or from_crawler class methods to expose such a parameter to your custom download handlers.

The __init__ method of custom download handlers (see DOWNLOAD_HANDLERS ) or subclasses of the following downloader handlers no longer receives a settings parameter:

The HttpCompressionMiddleware now includes spaces after commas in the value of the Accept-Encoding header that it sets, following web browser behavior ( issue 4293 )

The METAREFRESH_IGNORE_TAGS setting is now an empty list by default, following web browser behavior ( issue 3844 , issue 4311 )

File extensions that LinkExtractor ignores by default now also include 7z , 7zip , apk , bz2 , cdr , dmg , ico , iso , tar , tar.gz , webm , and xz ( issue 1837 , issue 2067 , issue 4066 )

Retry gaveups (see RETRY_TIMES ) are now logged as errors instead of as debug information ( issue 3171 , issue 3566 )

Lower versions of these optional requirements may work, but it is not guaranteed ( issue 3892 )

Minimum versions of optional Scrapy requirements that are covered by continuous integration tests have been updated:

scrapy.item.DictItem is deprecated, use Item instead ( issue 3999 )

Use of the undocumented SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE environment variable is deprecated ( issue 3910 )

The documentation now covers how to define and configure a custom log format ( issue 3616 , issue 3660 )

A memory-handling and error-handling issue in scrapy.utils.ssl.get_temp_key_info() has been fixed ( issue 3920 )

FTP passwords in FEED_URI containing percent-escaped characters are now properly decoded ( issue 3941 )

When using botocore to persist files in S3, all botocore-supported headers are properly mapped now ( issue 3904 , issue 3905 )

A much improved completion definition is now available for Zsh ( issue 4069 )

Custom log formats can now drop messages by having the corresponding methods of the configured LOG_FORMATTER return None ( issue 3984 , issue 3987 )
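
A sketch of a formatter that silences the per-item "Scraped from ..." messages (the module path in LOG_FORMATTER is hypothetical):

    from scrapy import logformatter

    class QuietLogFormatter(logformatter.LogFormatter):
        def scraped(self, item, response, spider):
            return None  # returning None drops the log message entirely

    # settings.py
    LOG_FORMATTER = "myproject.logformatters.QuietLogFormatter"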

When a @scrapes spider contract fails, all missing fields are now reported ( issue 766 , issue 3939 )

Set the new DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING setting to True to enable debug-level messages about TLS connection parameters after establishing HTTPS connections ( issue 2111 , issue 3450 )

Use the new DOWNLOADER_CLIENT_TLS_CIPHERS setting to customize the TLS/SSL ciphers used by the default HTTP/1.1 downloader ( issue 3392 , issue 3442 )

A new ROBOTSTXT_USER_AGENT setting allows defining a separate user agent string to use for robots.txt parsing ( issue 3931 , issue 3966 )

See also Deprecation removals below.

This is needed to allow adding values to existing fields ( loader.add_value('field', 'value2') ).

ItemLoader now turns the values of its input item into lists:
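
For example, assuming MyItem is an Item subclass with a field field:

    >>> item = MyItem()
    >>> item["field"] = "value1"
    >>> loader = ItemLoader(item=item)
    >>> loader.load_item()
    {'field': ['value1']}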

JSONRequest is now called JsonRequest for consistency with similar classes ( issue 3929 , issue 3982 )

Python 3.4 is no longer supported, and some of the minimum requirements of Scrapy have also changed:

As a result, when an item loader is initialized with an item, ItemLoader.load_item() once again makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data.

Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.

It is now possible to generate an API documentation coverage report ( issue 3806 , issue 3810 , issue 3860 )

It is now possible to run all tests from the same tox environment in parallel; the documentation now covers this and other ways to run tests ( issue 3707 )

The scrapy.utils.gz.is_gzipped function is deprecated. Use scrapy.utils.gz.gzip_magic_number instead.

The scrapy.utils.datatypes.MergeDict class is deprecated for Python 3 code bases. Use ChainMap instead. ( issue 3878 )

The following modules are deprecated:

process_request callbacks passed to Rule that do not accept two arguments are deprecated.

The queuelib.PriorityQueue value for the SCHEDULER_PRIORITY_QUEUE setting is deprecated. Use scrapy.pqueues.ScrapyPriorityQueue instead.

The following deprecated settings have also been removed ( issue 3578 ):

_root (both the __init__ method argument and the object property, use root )

From both scrapy.selector and scrapy.selector.lxmlsel :

The following deprecated APIs have been removed ( issue 3578 ):

Updated the FAQ entry about crawl order to explain why the first few requests rarely follow the desired order ( issue 1739 , issue 3621 )

The documentation of Rule now covers how to access the text of a link when using CrawlSpider ( issue 3711 , issue 3712 )
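
A brief sketch of the documented approach, assuming the link_text meta key that CrawlSpider sets on the requests it generates:

    def parse_item(self, response):
        anchor_text = response.meta.get("link_text", "")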

Requests with private callbacks are now correctly deserialized from disk ( issue 3790 )

Fixed a memory leak in scrapy.pipelines.media.MediaPipeline affecting, for example, non-200 responses and exceptions from custom middlewares ( issue 3813 )

System exceptions like KeyboardInterrupt are no longer caught ( issue 3726 )

process_spider_exception() is now also invoked for generators ( issue 220 , issue 2061 )

A new redirect_reasons request meta key exposes the reason (status code, meta refresh) behind every followed redirect ( issue 3581 , issue 3687 )

A new METAREFRESH_IGNORE_TAGS setting allows overriding which HTML tags are ignored when searching a response for HTML meta tags that trigger a redirect ( issue 1422 , issue 3768 )

A new FEED_STORAGE_FTP_ACTIVE setting allows using FTP’s active connection mode for feeds exported to FTP servers ( issue 3829 )

A new FEED_STORAGE_S3_ACL setting allows defining a custom ACL for feeds exported to Amazon S3 ( issue 3607 )

A new restrict_text parameter for the LinkExtractor __init__ method allows filtering links by their link text ( issue 3622 , issue 3635 )
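
For example (pattern hypothetical):

    from scrapy.linkextractors import LinkExtractor

    # Only keep links whose text matches the given regular expression
    link_extractor = LinkExtractor(restrict_text=r"next\s+page")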

A process_request callback passed to the Rule __init__ method now receives the Response object that originated the request as its second argument ( issue 3682 )

A new JSONRequest class offers a more convenient way to build JSON requests ( issue 3504 , issue 3505 )
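
A minimal sketch (URL and payload hypothetical; the class was later renamed JsonRequest, as noted above):

    from scrapy.http import JsonRequest

    def start_requests(self):
        # Serializes `data` to a JSON body and sets the Content-Type header
        yield JsonRequest("https://example.com/api", data={"query": "books"})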

A new Request.cb_kwargs attribute provides a cleaner way to pass keyword arguments to callback methods ( issue 1138 , issue 3563 )
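
For illustration (URLs and parameter names hypothetical):

    def parse(self, response):
        yield scrapy.Request(
            "https://example.com/details",
            callback=self.parse_details,
            cb_kwargs={"source_url": response.url},  # delivered as a keyword argument
        )

    def parse_details(self, response, source_url):
        self.logger.info("reached from %s", source_url)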

A new scheduler priority queue, scrapy.pqueues.DownloaderAwarePriorityQueue , may be enabled for a significant scheduling improvement on crawls targeting multiple web domains, at the cost of no CONCURRENT_REQUESTS_PER_IP support ( issue 3520 )

See also Deprecation removals below.

For more information, see SCHEDULER .

An additional crawler parameter has been added to the __init__ method of the Scheduler class. Custom scheduler subclasses which don’t accept arbitrary parameters in their __init__ method might break because of this change.

Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working. Scheduler priority queue classes now need to handle Request objects instead of arbitrary Python data structures.

Spider subclass instances were never meant to work, and they were not working as one would expect: instead of using the passed Spider subclass instance, their from_crawler method was called to generate a new instance.

Crawler , CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a Spider subclass instance, they only accept a Spider subclass now.

This change is backward incompatible . If you don’t want to retry 429 , you must override RETRY_HTTP_CODES accordingly.

429 is now part of the RETRY_HTTP_CODES setting by default

A cleaner way to pass arguments to callbacks

Make sure you install Scrapy 1.7.1. The Scrapy 1.7.0 package in PyPI is the result of an erroneous commit tagging and does not include all the changes described below.

collections.deque is used to store MiddlewareManager methods instead of a list ( issue 3476 )

All Scrapy tests now pass on Windows; Scrapy testing suite is executed in a Windows environment on CI ( issue 3315 ).

Deprecated scrapy.interfaces.ISpiderManager is removed; please use scrapy.interfaces.ISpiderLoader.

See Module Relocations for more information, or use suggestions from Scrapy 1.5.x deprecation warnings to update your code.

improved links to beginner resources in the tutorial ( issue 3367 , issue 3468 );

Using your browser’s Developer Tools for scraping is a new tutorial which replaces old Firefox and Firebug tutorials ( issue 3400 ).

Docs are re-written to suggest .get/.getall API instead of .extract/.extract_first. Also, Selectors docs are updated and re-structured to match latest parsel docs; they now contain more topics, such as Selecting element attributes or Extensions to CSS Selectors ( issue 3390 ).

flags are now preserved when copying Requests ( issue 3342 );

proper handling of pickling errors in Python 3 when serializing objects for disk queues ( issue 3082 )

fixed issue with extra blank lines in .csv exports under Windows ( issue 3039 );

Referer header value is added to RFPDupeFilter log messages ( issue 3588 )

better error message when an exporter is disabled ( issue 3358 );

Link extraction improvements: “ftp” is added to scheme list ( issue 3152 ); “flv” is added to common video extensions ( issue 3165 )

non-zero exit code is returned from Scrapy commands when an error happens during spider initialization ( issue 3226 )

better validation of url argument in Response.follow ( issue 3131 )

a message is added to IgnoreRequest in RobotsTxtMiddleware ( issue 3113 )

INFO log level is used to show telnet host/port ( issue 3115 )

Fixed errback handling in contracts, e.g. for cases where a contract is executed for a URL that returns a non-200 response ( issue 3371 ).

request_cls attribute in Contract subclasses makes it possible to use different Request classes in contracts, for example FormRequest ( issue 3383 ).

dont_filter=True is used for contract requests, which makes it possible to test different callbacks with the same URL ( issue 3381 );

Exceptions in contracts code are handled better ( issue 3377 );

Lazy loading of Downloader Handlers is now optional; this enables better initialization error handling in custom Downloader Handlers ( issue 3394 ).

new SitemapSpider.sitemap_filter() method which makes it possible to select sitemap entries based on their attributes in SitemapSpider subclasses ( issue 3512 ).
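
A brief sketch of such a subclass (spider name, sitemap URL and filtering rule are illustrative):

    from scrapy.spiders import SitemapSpider

    class FilteredSitemapSpider(SitemapSpider):
        name = "filtered"
        sitemap_urls = ["https://example.com/sitemap.xml"]

        def sitemap_filter(self, entries):
            for entry in entries:  # each entry maps sitemap tags such as 'loc' and 'lastmod'
                if "/blog/" in entry["loc"]:
                    yield entry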

request_reached_downloader is fired when Downloader gets a new Request; this signal can be useful e.g. for custom Schedulers ( issue 3393 ).

item_error is fired when an error happens in a pipeline ( issue 3256 );

from_crawler support is added to dupefilters ( issue 2956 ); this makes it possible to access e.g. settings or a spider from a dupefilter.

from_crawler support is added to feed exporters and feed storages. This, among other things, makes it possible to access Scrapy settings from custom feed storages and exporters ( issue 1605 , issue 3348 ).

Backward incompatible : Scrapy’s telnet console now requires username and password. See Telnet Console for more details. This change fixes a security issue ; see Scrapy 1.5.2 (2019-01-22) release notes for details.

If you’re using custom Selector or SelectorList subclasses, a backward incompatible change in parsel may affect your code. See parsel changelog for a detailed description, as well as for the full list of improvements.

CSS selectors are cached in parsel >= 1.5, which makes them faster when the same CSS path is used many times. This is very common in the case of Scrapy spiders: callbacks are usually called several times, on different pages.

Another useful new feature is the introduction of Selector.attrib and SelectorList.attrib properties, which make it easier to get attributes of HTML elements. See Selecting element attributes .

There are currently no plans to deprecate .extract() and .extract_first() methods.

The most visible change is that the .get() and .getall() selector methods are now preferred over .extract_first() and .extract() . We feel that these new methods result in more concise and readable code. See extract() and extract_first() for more details.

While these are not changes in Scrapy itself but rather in the parsel library, which Scrapy uses for XPath/CSS selectors, these changes are worth mentioning here. Scrapy now depends on parsel >= 1.5, and the Scrapy documentation is updated to follow recent parsel API conventions.

various bug fixes, small new features and usability improvements across the codebase.

telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22) ;

better extensibility: item_error and request_reached_downloader signals; from_crawler support for feed exporters, feed storages and dupefilters.

big documentation improvements, including a switch from .extract_first() + .extract() API to .get() + .getall() API;

See telnet console documentation for more info

The fix is backward incompatible: it enables telnet user-password authentication by default with a randomly generated password. If you can’t upgrade right away, please consider setting TELNETCONSOLE_PORT to a non-default value.

Security bugfix: the Telnet console extension can be easily exploited by rogue websites POSTing content to http://localhost:6023 . We haven’t found a way to exploit it from Scrapy itself, but it is very easy to trick a browser into doing so, which elevates the risk for local development environments.

O(N^2) gzip decompression issue which affected Python 3 and PyPy is fixed ( issue 3281 );

This is a maintenance release with important bug fixes, but no new features:

A better example of ItemExporters usage ( issue 2989 )

Use pymongo.collection.Collection.insert_one() in MongoDB example ( issue 2781 )

Include references to Scrapy subreddit in the docs

Use getfullargspec behind the scenes on Python 3 to stop a DeprecationWarning ( issue 2862 )

Add verification to check if Request callback is callable ( issue 2766 )

Fix logging of settings overridden by custom_settings ; this is technically backward-incompatible because the logger changes from [scrapy.utils.log] to [scrapy.crawler] , so please update your log parsers if needed ( issue 1343 )

Show warning when a URL is added to Spider.allowed_domains instead of a domain ( issue 2250 ).

Better log messages for responses over DOWNLOAD_WARNSIZE and DOWNLOAD_MAXSIZE limits ( issue 2927 )

CrawlerProcess got an option to disable installation of root log handler ( issue 2921 )

Explicit message for NotImplementedError when parse callback not defined ( issue 2831 )

scrapy.mail.MailSender now works in Python 3 (it requires Twisted 17.9.0)

New --meta option of the “scrapy parse” command makes it possible to pass additional request.meta ( issue 2883 )

522 and 524 status codes are added to RETRY_HTTP_CODES ( issue 2851 )

LinkExtractor now ignores the m4v extension by default; this is a change in behavior.

Logging of settings overridden by custom_settings is fixed; this is technically backward-incompatible because the logger changes from [scrapy.utils.log] to [scrapy.crawler] . If you’re parsing Scrapy logs, please update your log parsers ( issue 1343 ).

Default Scrapy User-Agent now uses https link to scrapy.org ( issue 2983 ). This is technically backward-incompatible ; override USER_AGENT if you relied on old value.

Better default handling of HTTP 308, 522 and 524 status codes.

Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now supported officially, by running tests on CI.

scrapy parse command now makes it possible to set custom request meta via the --meta argument.

Warning, exception and logging messages are improved to make debugging easier.

Crawling with proxy servers becomes more efficient, as connections to proxies can be reused now.

Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.

This release brings small new features and improvements across the codebase. Some highlights:

Scrapy 1.4 does not bring that many breathtaking new features but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and password via the new FTP_USER and FTP_PASSWORD settings. And if you’re using Twisted version 17.1.0 or above, FTP is now available with Python 3.

There’s a new response.follow method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

it handles relative URLs;

it works properly with non-ascii URLs on non-UTF8 pages;

in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

For example, instead of this:

    for href in response.css('li.page a::attr(href)').extract():
        url = response.urljoin(href)
        yield scrapy.Request(url, self.parse, encoding=response.encoding)

One can now write this:

    for a in response.css('li.page a'):
        yield response.follow(a, self.parse)

Link extractors are also improved. They work similarly to what a regular modern browser would do: leading and trailing whitespace are removed from attributes (think href=" http://example.com" ) when building Link objects. This whitespace-stripping also happens for action attributes with FormRequest .

Please also note that link extractors do not canonicalize URLs by default anymore. This was puzzling users every now and then, and it’s not what browsers do in fact, so we removed that extra transformation on extracted links.

For those of you wanting more control on the Referer: header that Scrapy sends when following links, you can set your own Referrer Policy . Prior to Scrapy 1.4, the default RefererMiddleware would simply and blindly set it to the URL of the response that generated the HTTP request (which could leak information on your URL seeds). By default, Scrapy now behaves much like your regular browser does. And this policy is fully customizable with W3C standard values (or with something really custom of your own if you wish). See REFERRER_POLICY for details.

To make Scrapy spiders easier to debug, Scrapy logs more stats by default in 1.4: memory usage stats, detailed retry stats, detailed HTTP error code stats. A similar change is that HTTP cache path is also visible in logs now.

Last but not least, Scrapy now has the option to make JSON and XML items more human-readable, with newlines between items and even custom indenting offset, using the new FEED_EXPORT_INDENT setting.
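
For example, a 4-space indent for JSON or XML output (value illustrative):

    # settings.py
    FEED_EXPORT_INDENT = 4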

Enjoy! (Or read on for the rest of changes in this release.)