The long and the short of it is: we used CrowdFlower to get training data for a machine learning algorithm. It turned out to have quite a few functionality and usability problems, though not enough to make us walk away early. The pricing structure left us confused.

Our use-case for crowd labour

At Nestoria, we work with multiple partners who provide us with feeds of the listings on their portals. We aggregate the listings across these portals and present the user with a single search interface over them all. Our users appreciate this feature, but they naturally do not want to be shown duplicate listings of the same property, even if the duplicates live on different portals. So we developed a classifier that labels each pair of listings as duplicates or not. It is trained on a collection of pairs of listings that a trusted human has manually classified.

We recently decided that we needed better and larger training/validation/test sets of data, so we thought: why not experiment with crowd labour?

Why CrowdFlower?

The most well-known crowd labour provider is Amazon Mechanical Turk. Mechanical Turk requires its clients to have an address in the USA, which is problematic for us. So we decided to try CrowdFlower instead, especially given its international workforce. The other alternatives all had problems: they either didn’t have an international enough workforce, or they required you to commit to a project before experimenting, or they had bugs in their website so that I could not even sign up! Also, at the time, CrowdFlower were offering a free trial with $100 to spend, so CrowdFlower it was.

Tour of a CrowdFlower job

Phase 1: Data

The first task in creating a CrowdFlower job is uploading the data. The formats accepted by the web interface are .csv, .tsv, .xls, .xlsx and .ods. The formats accepted by the API are .csv or .json. At least, the documentation says it supports JSON. However, it fails to mention that it actually expects a text file where each line is a separate JSON object, like this:

{ "key1": "value1", "key2": "value2" }
{ "key1": "value3", "key2": "value4" }

This, of course, is not JSON at all, but a custom format that contains JSON. Better documentation could have solved this, even just an example; but although example parameters are provided, an example payload is not, so I had to discover this by emailing support. Every line needs to contain the same keys as the previous line, and as any previous uploads.
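For what it’s worth, here’s a minimal sketch of how one might build such a payload in Perl (units_to_payload is my own hypothetical helper, not part of any CrowdFlower tooling):

use JSON::XS;

# Each unit is a hash reference; every unit must carry the same keys,
# both within this payload and across any previous uploads to the job.
sub units_to_payload {
    my (@units) = @_;
    my $coder = JSON::XS->new->canonical;   # canonical() keeps key order stable
    return join("\n", map { $coder->encode($_) } @units) . "\n";
}

my $payload = units_to_payload(
    { key1 => "value1", key2 => "value2" },
    { key1 => "value3", key2 => "value4" },
);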

Something else that is not obvious is that CrowdFlower expects tabular data: it does not like you adding or removing columns, and it gives somewhat helpful error messages on the web interface (but not in the API response) when you try. You’re better off creating a new job if you want to add or delete a column, although support told me that there is a force parameter to work around this.

The other gotcha that is not documented in the API is that uploading units one by one causes the job to be automatically reclassified as a “survey” job, which behaves differently to a normal job. If you want the normal behaviour, make sure you upload more than one unit in the bulk upload operation. This, too, I only understood by emailing support.

You can view your data using the web UI in a normal spreadsheet-like format, which is nice, but it’s paginated, which is annoying. Searching for particular rows is inconvenient.

Phase 2: Build Job

The next phase is to build the interface that contributors will see, as you obviously don’t want them to simply view a line of CSV. There are many good templates offered here, and I based my template on one of the deduplication ones. Unfortunately, it turned out not to be suited for my needs, so it was simpler to create my own template from scratch.

The first section to fill out is the instructions to the contributors. This is where you explain what the job is, give examples and explain what to do in corner cases. The templates give good examples of thorough instructions to follow.

You write the job using Liquid, the templating language used by Shopify. It’s nice and basic; unfortunately, it’s too basic. For instance, you can have if statements, and you can use and and or operators, but there’s no way of specifying precedence using parentheses: you have to chain if statements together instead. To make things even trickier, CrowdFlower automatically converts null values to the string No data available, but leaves empty strings as they are.

I wanted to display a table row, but only if at least one of the listings had a value for that key. Here’s what I ended up having to code:

{% if l1_num_beds != "No data available" or l2_num_beds != "No data available" %}
{% if l1_num_beds != "" or l2_num_beds != "" %}
<tr><th>num_beds</th><td>{{l1_num_beds}}</td><td>{{l2_num_beds}}</td></tr>
{% endif %}
{% endif %}

As far as I know, there is no way to iterate over variables, and at this point I was much more willing to simply write a script that generated this template for me and to upload the result to CrowdFlower using the API.
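For the curious, here’s a rough sketch of what that generating script boils down to; the key list is only an illustrative sample, not our real schema:

# Emit the nested pair of Liquid "if" statements shown above for each
# listing key, since Liquid offers neither iteration over variables nor
# parentheses for precedence.
my @keys = qw(num_beds num_baths price property_type);

for my $key (@keys) {
    print <<"LIQUID";
{% if l1_$key != "No data available" or l2_$key != "No data available" %}
{% if l1_$key != "" or l2_$key != "" %}
<tr><th>$key</th><td>{{l1_$key}}</td><td>{{l2_$key}}</td></tr>
{% endif %}
{% endif %}
LIQUID
}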

The nice thing is that I can use seemingly any HTML, CSS or JavaScript I want, so I managed to highlight the rows of the comparison table where the two listings’ values differ.

For simpler tasks, there’s quite an intuitive WYSIWYG editor.

You can add controls, such as radio buttons or check boxes, with questions that you expect contributors to answer. It’s pretty straightforward; however, one gotcha is that whatever string you use in a multiple-choice answer will end up in the final results, so you can’t reword it without changing the results as well.

Phase 3: Test Questions

The next step is to create test questions. You simply grab some of the units that you uploaded and answer them yourself, specifying which answers are acceptable and typing in a reason why the given answers are correct. The contributors see this reason field after they fail a test question. They can also contest a test question if they feel it is wrong or unfair, entering a text message if they wish.

The need for test questions is obvious: you need some way to measure how well the contributors are answering questions, not to mention stopping them from cheating by answering questions randomly.

There is an optional quiz mode. If enabled, contributors are asked to complete a quiz before they can join the job. They are presented with 10 test questions (the 10 is unconfigurable, AFAICT), and they must answer a certain percentage (by default, 70%) correctly to be allowed into the job. If they pass the quiz, they are still considered “untrusted”, but they’re allowed to do the normal job.

During the normal job, contributors see a page with a number of rows/units that they have to answer. Some of these rows may actually be test questions in disguise; if contributors fail enough of these disguised test questions, they are kicked out of the job and their previous judgements are revoked. On the other hand, if they answer enough of them correctly, their judgements are considered “trusted”, and the job only finishes once enough trusted judgements have been collected.

Phase 4: Job settings

These are the basic settings that you set before launching a job. There are actually more settings than are displayed in this screenshot, once you enable “advanced” mode.

You can choose how many rows per page are displayed to the contributor. (CrowdFlower uses the terms “unit” and “row” interchangeably.) I did not initially see the purpose of this setting. Why not simply get rid of the idea of pages altogether, and just display units to the contributor one by one or in batches?

Only after I’d spent all my free dollars and some real money, and only after some panicked emailing of support, did I realise the significance of this setting. Every page contains exactly one test question. The documentation that I missed says that it contains one test question “by default”, but in actuality this is not configurable. So the “rows per page” setting actually determines what fraction of the units viewed by the contributor are disguised test questions: with five rows per page, for example, one judgement in every five you pay for is a test question. I had wondered why it wouldn’t allow me to set this to one! Had I understood this, I would probably have set the number of rows per page to a higher value.

The judgements per row setting indicates the number of contributors that you want to work on each unit. If the task is highly subjective, you’ll want to increase this number. CrowdFlower will stop asking for judgements once the limit has been reached, but you can configure it to keep asking for judgements until a level of consensus is reached instead.

The payment per page is how much you will pay for an individual page. This does not include the 20% fee that CrowdFlower takes on top. I’ll talk more about what CrowdFlower charges for what later on, but for now, I want to point out that this is all the information you get about pricing. Before trying CrowdFlower, I was expecting to be able to see the marketplace for crowd labour on CrowdFlower, to be able to judge what a good price for my tasks would be, and whether I was overpaying or underpaying. Instead, there is exactly zero visibility on this. I have no idea how contributors select my job, if they can do so at all. All I see is the aggregated results of a contributor survey, as mentioned later on.

Some of the advanced features are really interesting. You can share the job with your team internally if you’re willing to join the crowd workers yourself. You can view the list of sources of crowd labour, with names like ZoomBucks, SurveyMad, RewardSpot, Poin-web, KeepRewarding, Gold Tasks, Free Easy Prizes, Crowd Guru and BitcoinGet, and you can disable individual ones if you want to. Interestingly, Mechanical Turk was not on this list.

You can ask for contributors from certain regions of the world only, or with certain language-capabilities only. The list of languages is: Arabic, Bahasa, Chinese, French, German, Hindi, Italian, Portuguese, Russian, Spanish, Turkish and Vietnamese. The assumption is that everybody can speak English.

Phase 5: Launch

The next stage is the launch page. It estimates how much the job is going to cost, which is greater than just the product of the number of units, the number of judgements per row and the payment per judgement, because you need to pay for test questions, quizzes and untrusted judgements as well. (However, you don’t need to pay for failed quizzes.) You need to pay the estimate in advance so that the job doesn’t stop for lack of funds. So far, most of my jobs have cost more than the initial estimate, so the job stopped half-way through, waiting for me to top up. There doesn’t seem to be a way to pay more than the estimate initially to avoid this.
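To make the arithmetic concrete, here’s a back-of-the-envelope sketch combining the one-test-question-per-page behaviour and the 20% fee described above; the numbers are invented, and the real estimate also covers quiz pages and untrusted judgements:

my $units              = 1000;
my $judgements_per_row = 3;
my $rows_per_page      = 5;      # one of these is a disguised test question
my $payment_per_page   = 0.05;   # in dollars

# Only ($rows_per_page - 1) rows per page are real units, so we need
# proportionally more pages than units * judgements would suggest.
my $pages = $units * $judgements_per_row / ($rows_per_page - 1);
my $cost  = $pages * $payment_per_page * 1.20;   # CrowdFlower's 20% fee on top

printf "Roughly \$%.2f before quizzes and untrusted judgements\n", $cost;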

Phase 6: Monitor

This is the fun part. You get to watch this monitor page update in real-time:

I found that the English-language tasks attracted many contributors straight away, despite my not having made a deliberate effort to increase payments. As contributors arrive, you can watch them attempt the quiz, and see whether they pass or fail. You can review your test questions, see which ones cause contributors to fail most often, and which ones are contested most often, along with the explanatory comments. You can drill down on individual contributors and see their activity, their channel and their name. You can even award individual contributors bonuses or forgive past missed test questions. A good productivity tip: you can continue adding test questions after you’ve launched a job.

Contributors fill out a survey and rate the job on its instructions, its ease of use, the fairness of the test questions and the pay. This is all the feedback you get from contributors, apart from contested test questions. According to a support rep, contributors who failed the quiz also partake in the survey. I do not know what prevents contributors from rating my pay badly every time; there seems to be no deterrent against gaming the system in this manner.

The work was done faster than I needed it. (The screenshot is misleading because I paused the job and then resumed it the following day, and the “judgements per hour” figure does not know how to deal with this.) The job reached 99% completion quite quickly, but then the last few pending judgements took ages; I don’t know why.

It’s possible for your allocated funds to run out, at which point you’ll be prompted by email to add more. This happens almost every time, since the estimated cost is always off by 50%.

While the job is running, you can view various charts and stats about the results so far.

One gotcha is that you cannot add more data once you’ve launched a job, even if you pause or cancel it. The API and the web interface will allow you to upload the data, and the data will appear in the data section, but you simply cannot launch the units. I found this confusing and emailed support to tell them as much.

Another important gotcha is that contributors are kicked out of the job once they have completed a number of pages equal to the number of available test questions. This appears to be because CrowdFlower refuses to let contributors work on a page without a test question they haven’t yet seen. This seems to be how it works, judging by the maximum number of judgements the contributors made, but I couldn’t find it documented anywhere.

Yet another gotcha is that if you enable the quiz, the test questions used in the quiz will be reused as disguised test questions later on in the normal job! This seems to be a bug: after all, CrowdFlower doesn’t otherwise seem to reuse test questions within a job, just between the quiz and the job itself.

Phase 7: Results

Once you’re done, you can download the results as inconveniently zipped .CSV files. You can also access the results through the API, even while the job is running; however, I recommend against this, because the necessary pagination means you can get inconsistent results.

Journey of bugs

While I managed to get the job done with CrowdFlower eventually, I spent at least a week struggling with it. The number of bugs I encountered was astounding, more befitting an alpha product, I feel. Here’s a list of them, along with some usability problems:

Only accounts with “business” emails are allowed (not Gmail). You can invite people with Gmail accounts, but then they can’t sign up!

When you upload only one unit, the system assumes you’re running a survey.

There are a lot of unclear labels in the UI. For example, “judgements per row” should really be renamed to “trusted judgments per unit”.

At one point, CrowdFlower forgot the answers to my test questions, possibly because I edited the template, only to have the answers reappear later on.

If you create a test question from a unit, the corresponding unit does not disappear from the list of units to be answered! (Although I don’t think you pay for those.) Apparently, this is to make the numbers less confusing, but I find it more confusing this way round.

After launching and cancelling a job, you can upload new data, but you can’t actually get the new data answered, making it pointless to upload new data.

The CrowdFlower invoice does not specify whether any VAT was paid.

I received an email complaining about insufficient funds when I hadn’t enabled the external workforce, meaning the funds were not actually needed.

The “performance” section in the “launch job” page is greyed out before you click “enable external workforce”, but the “payment details” section isn’t. This led me to believe that “external workforce” meant something other than just any paid workforce.

The estimated date and time of completion is displayed without any timezone information; I had to email support to confirm that it was in UTC.

The API paginates, and the fact that it does so is not clearly documented at all! The return values do not contain the number of pages, so you are simply obliged to keep trying successive page numbers until you get an empty result (see the crowdflower_all_pages function at the end of this post). The limit is 1000 for some calls and 100 for others. Watch out for this if you’re using the API. Support told me that the version they have in the works solves this issue.

I found the process of adding people to your team quite confusing; there doesn’t seem to be a well-designed distinction between my account and a team. I don’t think it’s possible to be part of more than one team, for example.

The fact that every page contains exactly one test question was not obvious.

The fact that contributors are kicked out after completing a number of pages equal to the number of test questions was not obvious.

Test questions are reused after the quiz.

There is no page that fully explains where all the costs went on a particular job; you are left to guess how many test questions you paid for, and so on.

On the positive side, their product support is excellent: I always got a reply to my emails within 24 hours. Commercial support was another matter; they took over a week to get back to me about pricing.

Pricing

When I started my trial, this is what CrowdFlower’s pricing page looked like:

For some reason, I assumed that these subscription services were optional, and that you could choose to pay as you go if you wished. It turns out that this was incorrect. Your trial doesn’t end once your free $100 runs out; rather, it ends once you’ve ordered 5000 units, even if you have spent real money. Once you’ve reached that limit, you need to subscribe and pay a minimum of $5000 a month!

To me, the screenshot seems to imply that $5500 a month gets you 30000 data records, does it not? In actual fact, $5500 a month gets you access to the system, and you still need to pay contributors for each judgment, as well as a 20% CrowdFlower fee.

CrowdFlower have since changed their pricing structure, without warning and without any announcement on their part. Now there are simply two models: “data for everyone”, where you pay as you go and are obliged to make your data public, and “pro”, whose pricing is hidden. When we emailed them, we were quoted a substantial cost per unit. Surprisingly, this cost remains fixed no matter how many judgments per unit you ask for, which seems like an error to me.

Interacting with their API in Perl

Since their API is REST, it was quite simple to build a Perl interface to it. I won’t be publishing a CPAN module because, frankly, I don’t want to support a buggy and moving target, but to show how simple it is, here’s the bulk of it:

use LWP::UserAgent;
use HTTP::Request;
use JSON::XS;
use Encode;
use Log::Log4perl qw(:easy);    # provides DEBUG, INFO and LOGDIE

sub crowdflower {
    my ($self, $job_id, $rest_of_url, $page, $http_method, $mime, $payload) = @_;

    my $UA = LWP::UserAgent->new;
    $UA->default_header(Accept => "application/json");

    my $BASE_URL = "https://api.crowdflower.com/v1";
    # $API_KEY is defined elsewhere in the module
    my $url = "$BASE_URL/jobs/$job_id/$rest_of_url?key=$API_KEY";
    if (defined $page) {
        $url .= sprintf("&page=%d", $page);
    }

    my $Request = HTTP::Request->new($http_method => $url);
    # Really we should put "$mime ; charset=UTF-8" here, but CrowdFlower won't
    # accept it, and it assumes UTF-8.
    $Request->content_type($mime);

    if ($payload) {
        my $content;
        if ($mime eq 'application/json') {
            $content = JSON::XS->new->encode($payload);
        }
        # CSV and lines-of-JSON formats omitted for brevity
        $Request->content(
            Encode::encode("UTF-8", $content, Encode::FB_CROAK | Encode::LEAVE_SRC),
        );
    }

    my $Response = $UA->request($Request);
    if (! $Response->is_success) {
        LOGDIE "Response is not successful: " . $Response->status_line
             . " content: " . $Response->decoded_content;
    }

    my $data = $Response->decoded_content;
    my $rh_response = JSON::XS->new->decode($data);
    return $rh_response;
}

Well, that’s not quite enough, because if you want all pages for a particular response, you have to keep iterating through page numbers until you reach an empty result, like this:

sub crowdflower_all_pages {
    # Takes the same arguments as crowdflower(), minus the page number.
    my ($self, $job_id, $rest_of_url, $http_method, $mime, $payload) = @_;

    my $return_value;
    my $page = 1;

    PAGE: while (1) {
        DEBUG "Getting page number $page";
        my $page_response = $self->crowdflower(
            $job_id, $rest_of_url, $page, $http_method, $mime, $payload,
        );

        if (! defined($page_response)) {
            last PAGE;
        }
        elsif (! defined($return_value)) {
            DEBUG "First page response";
            $return_value = $page_response;
        }
        elsif (ref($page_response) ne ref($return_value)) {
            LOGDIE sprintf(
                "Previously got %s reference, but now %s reference is returned",
                ref($return_value), ref($page_response),
            );
        }
        elsif (ref($page_response) eq 'ARRAY') {
            DEBUG "Page response: array of " . scalar(@$page_response);
            last PAGE if ! @$page_response;
            push @$return_value, @$page_response;
        }
        elsif (ref($page_response) eq 'HASH') {
            DEBUG "Page response: hash of " . scalar(keys %$page_response);
            last PAGE if ! keys %$page_response;
            for my $key (keys %$page_response) {
                LOGDIE "Key $key already retrieved in a previous page"
                    if exists $return_value->{$key};
                $return_value->{$key} = $page_response->{$key};
            }
        }
        else {
            LOGDIE "Unsupported reference type: '" . ref($page_response) . "'";
        }
    }
    continue {
        $page++;
    }

    INFO "After looking at $page pages, returning";
    return $return_value;
}
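As a usage example, fetching every judgement for a job then looks something like this (judgments.json is the endpoint path as I remember it; treat it as illustrative rather than authoritative):

my $judgements = $self->crowdflower_all_pages(
    $job_id, "judgments.json", "GET", "application/json", undef,
);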

I hope this blog post was useful in your experiments with crowd labour.