Five Things To Consider Before Using Mechanical Turk

November 23, 2015 | by ZeroFox Team

Here at ZeroFox, we ingest data. Lots and lots of data. We pull from every major social network and a variety of additional social web sources and analyze it all for malicious activity. To do this, we use a variety of machine learning classifiers that separate the scam accounts from the average user, the malicious links from the benign and the bots from the humans.

Anyone with machine learning experience knows where we’re going with this: machine learning algorithms get better at classification the more data on which they are trained. But in order to train the algorithm in the first place, researchers need to provide a pre-classified, or “labeled,” training set. This means, within a limited set of data, labeling the training set manually -- thus solving the problem the algorithm was built to solve in the first place. It’s one of the classic data science “chicken-or-the-egg” conundrums.

Despite significant advances in computing and automation, labeling the training set requires a huge amount of manual effort -- after all, you wouldn’t build an automated solution if the manual work were easy or scalable. Labeling is a necessary evil, costing data scientists both money and time. The core issue with labeling is that humans are the best (and sometimes, the only) labelers, and you probably do not have enough of them in-house to label the data at scale.

Usually, labeling can only be tackled with copious amounts of coffee, energy drinks and patience (and sometimes alcohol). However, there are tools that give data scientists access to human workers in order to help perform these monotonous labeling tasks. One such tool is Mechanical Turk (MTurk). If you don't already know, MTurk is Amazon's platform for crowdsourcing data collection and labeling. Using Mechanical Turk allows you to pay people around the globe to solve hard-to-automate problems, such as labeling images, taking surveys and transcribing speech. By uploading your data (e.g. links to web pages, images or even text messages) and modifying some HTML templates, you can send hundreds or thousands of Human Intelligence Tasks (or HITs) to workers across the globe.

ZeroFox uses MTurk from time to time to scale up the amount of labeled data we use to train our models. Through these Turk exercises, we have gained some insight into the process of setting up a Turk job that is ‘just right.’

Here are our top five considerations when using Mechanical Turk.

1. Good requests take effort to create

One of the best features of Mechanical Turk is that you can have multiple workers solve the same task. Doing so allows you to determine whether or not the workers agree on the answer. Whenever workers disagree on a task, it’s wise to review it in-house. When we first started using MTurk, workers disagreed 27% of the time. This was unacceptable: it meant that more than 1/4 of our tasks would need to be reviewed in-house to prevent incorrect labeling. With this in mind, we knew that some workers weren't understanding what we were asking and that we would have to spend more time clarifying the wording of our requests.

Example of a typical HIT

We always use an iterative process from the beginning: we try small batches (about 100-200 samples) to get feedback and revise our language. A hundred or so individual tasks cost around $5, and the data is easy to review. This process is always extremely useful; we get a chance to reword our HITs and make them simpler and easier to comprehend. When it comes to MTurk, simplicity is key. This phase also allows you to clarify instructions if there is disagreement among the workers. You can judge the quality of your instructions and language from the frequency of HIT disagreements and from direct feedback. By addressing these issues early, you can save time and money in the long run.
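As a rough illustration of how the disagreement rate for a pilot batch might be measured, here is a minimal Python sketch. The `(hit_id, worker_id, label)` tuple layout, the function name and the sample data are all assumptions made for this example; they are not the format MTurk exports, so adapt the parsing to your own results file.

```python
from collections import defaultdict


def disagreement_rate(answers):
    """Return (rate, disputed_hit_ids) for a batch of worker answers.

    `answers` is an iterable of (hit_id, worker_id, label) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    labels_by_hit = defaultdict(list)
    for hit_id, _worker_id, label in answers:
        labels_by_hit[hit_id].append(label)

    # Only HITs answered by at least two workers can show disagreement.
    compared = {h: ls for h, ls in labels_by_hit.items() if len(ls) >= 2}
    disputed = sorted(h for h, ls in compared.items() if len(set(ls)) > 1)
    rate = len(disputed) / len(compared) if compared else 0.0
    return rate, disputed


# Made-up pilot batch: four HITs, two workers each.
batch = [
    ("hit-1", "w1", "scam"), ("hit-1", "w2", "scam"),
    ("hit-2", "w1", "benign"), ("hit-2", "w3", "scam"),
    ("hit-3", "w2", "benign"), ("hit-3", "w3", "benign"),
    ("hit-4", "w1", "scam"), ("hit-4", "w2", "scam"),
]
rate, needs_review = disagreement_rate(batch)
print(f"{rate:.0%} of HITs need in-house review: {needs_review}")
```

Tracking this number from one pilot batch to the next gives a simple signal of whether a rewording of the instructions actually helped.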

2. Work is cheap, but costs add up

The common conception of Mechanical Turk is that you can solve large-scale problems for pennies on the dollar. While an individual task is incredibly cheap, costs add up quickly.

For example, consider a batch of 10,000 HITs in which each HIT is set to $0.03. That already comes out to $300, but it’s not your entire cost. Amazon charges a 40-45% fee, broken down into 20% base, 20% for batches with more than 10 HITs and an additional 5% if you choose to use Masters, workers who have demonstrated competency on special qualification tasks. Finally, hiring 2 or 3 workers to do each HIT is standard practice, since disagreements on answers are extremely valuable information. As we mentioned before, a disagreement implies either that one worker responded incorrectly or that the HIT was controversial or confusing. After applying fees and redundancy to the base costs, the batch costs 3 to 4 times as much as the original estimate. And don’t forget, you’ll need several of these batches to finish your project.
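The arithmetic above can be sketched in a few lines of Python. The function and parameter names are made up for this example, and the 40% and 45% fee rates are simply the figures quoted in this post; Amazon's fee schedule may differ, so check the current pricing before budgeting.

```python
def batch_cost_dollars(num_hits, reward_cents, workers_per_hit, fee_percent):
    """Total cost of a batch in dollars.

    Works in integer cents to avoid floating-point drift.
    `fee_percent` is the platform fee as a whole number (assumed here
    to be the 40 or 45 quoted in the text).
    """
    hundredths_of_cents = num_hits * reward_cents * workers_per_hit * (100 + fee_percent)
    return hundredths_of_cents / 10_000


base = batch_cost_dollars(10_000, 3, 1, 0)    # reward only, one worker per HIT
low = batch_cost_dollars(10_000, 3, 2, 40)    # two workers, 40% fee
high = batch_cost_dollars(10_000, 3, 3, 45)   # three workers, Masters, 45% fee

print(f"base estimate: ${base:,.2f}")
print(f"realistic range: ${low:,.2f} to ${high:,.2f}")
print(f"multiplier: {low / base:.2f}x to {high / base:.2f}x")
```

Under these assumed rates, the $300 estimate grows to between $840 and $1,305, a multiplier of 2.8 to 4.35, which is broadly in line with the "3 to 4 times" rule of thumb.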

Because there is a bit of variance in the price of a HIT, requesters must experiment to find the optimal amount to pay workers. You can't simply post a 50-page survey for $0.01 and expect it to be completed anytime this year. Paying less than the going rate costs you in other ways: the project takes longer to complete and the quality of the labeling suffers. Comparing your project to similar projects can help with determining the HIT price, but in general, we advise leaning towards paying your workers more generously. The project will be completed more quickly and labeled more reliably, and you’ll build a relationship with your workers.

You can also learn a few tricks from experienced workers. We chatted with MTurk workers in an internet relay chat (IRC) channel and learned that embedding the page into the HIT (as opposed to simply linking to it) significantly decreases the effort each task requires. This decreases costs for worker and requester alike.

3. Don’t underestimate reputation

Worker and requester reputation is essential in the Mechanical Turk community. Score is important to workers: it allows them to view and obtain higher-paying HITs. Rejections directly hurt a worker’s score: if just over 2% of their HITs are rejected, they fall below the 98% acceptance rate necessary to see all MTurk HIT requests. If even a handful of requesters reject a worker, the worker’s account could be suspended.

Requesters also have a reputation score. Workers avoid requesters that consistently reject or block HIT results. Before accepting a HIT, many workers check Turkopticon, the de facto review site for Mechanical Turk requesters. A bad reputation can scare away high-quality workers, which in turn worsens the quality of results and increases the time it takes to complete a HIT. A good requester reputation means workers can trust you to pay, which increases the number of workers who want to complete your jobs. This, in turn, allows you to be more selective in who you accept.

Workers submit ratings to Turkopticon for the following categories: Fair, Fast, Pay and Communicativity. “Fairness” is how often you reject work. “Fastness” is how quickly you approve payments. “Pay” is how much you compensate workers for your HITs. “Communicativity” is whether you're willing to discuss questions and talk with workers whose work you have rejected. These are the categories that workers find important, and it's critical to keep these scores high.

Our rating for one of our HITs

4. Communication is not wasted time

You might be tempted to use Mechanical Turk as an automated tool for mass data labeling. But don’t forget the fundamentally human aspect of how MTurk operates. Communicating with workers is critical -- you can clarify key ideas and the design of your request. Doing so allows you to revise your HITs to avoid confusion in future batches. Ultimately, this saves time and money for everyone involved. After the HIT, workers might bring up interesting uses for your data and offer to help the broader project. We have even had workers offer to write us custom scripts to make our automation more efficient.

5. Screen your workers

Some Mechanical Turk workers simply won't be good fits for the jobs you give them. If you're doing a one-off project, then you might not need to worry too much about this; giving each HIT to multiple workers and manually reviewing any disagreements is the best way to ensure quality. But if you're considering multiple batches or more long-term projects, you want to build up a base of workers that you know and trust. When you make a request, Amazon recommends the use of MTurk Masters (where Amazon takes an extra cut of the cost). However, we've found that limiting the worker qualifications based on the number of approved HITs and HIT approval rate gets us better overall results than using Masters. In the end, it’s cheaper too. In order to restrict projects to certain workers, you can use the Mechanical Turk API to run a test and assign qualifications.
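To make the qualification-based filtering more concrete, here is a minimal Python sketch that builds the requirement list for a HIT. The structure follows the `QualificationRequirements` parameter accepted by `create_hit` in the boto3 MTurk client, and the two type IDs are, to the best of our knowledge, the system qualification IDs Amazon documents for approval rate and approved-HIT count; verify both against Amazon's documentation before relying on them. The function name and the default thresholds are assumptions for illustration, not the values we use in production. The sketch only builds the data and does not call the API.

```python
# System qualification type IDs (verify against Amazon's MTurk documentation).
PERCENT_ASSIGNMENTS_APPROVED = "000000000000000000L0"
NUMBER_HITS_APPROVED = "00000000000000000040"


def worker_requirements(min_approval_percent=98, min_approved_hits=1000):
    """Build qualification requirements restricting who can take a HIT.

    The default thresholds are hypothetical; tune them per project.
    """
    return [
        {
            "QualificationTypeId": PERCENT_ASSIGNMENTS_APPROVED,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval_percent],
        },
        {
            "QualificationTypeId": NUMBER_HITS_APPROVED,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approved_hits],
        },
    ]


requirements = worker_requirements()
for req in requirements:
    print(req["QualificationTypeId"], req["Comparator"], req["IntegerValues"])
```

In practice, a list like this would be passed as the `QualificationRequirements` argument when creating the HIT, alongside any custom qualification you assign to workers who pass your own test.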

In the end, you still need to actually review the results of your batch. For example, some workers use plugins and scripts to help answer questions, and in at least one case a plugin caused a worker to submit bad data. We have also caught scammers automatically submitting random answers to our questions, and we reject answers from workers who don’t play by the rules. To catch them, we added an extra answer option to our tasks: “Selecting this option will flag your answers for possible rejection.” This technique has helped us weed out scammers whose scripts select options at random.
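One way the trap answer could be used when processing results is sketched below in Python. The dictionary keys, the `TRAP_OPTION` value and the choice to set aside every answer from a worker who picked the trap even once are all assumptions for this example, not a description of our actual pipeline.

```python
# Hypothetical value recorded when a worker picks the trap answer.
TRAP_OPTION = "flag_for_rejection"


def split_flagged(assignments):
    """Separate answers from workers who ever chose the trap option.

    `assignments` is a list of dicts with the hypothetical keys
    "worker_id", "hit_id" and "answer".
    Returns (trusted, needs_review).
    """
    flagged_workers = {
        a["worker_id"] for a in assignments if a["answer"] == TRAP_OPTION
    }
    trusted = [a for a in assignments if a["worker_id"] not in flagged_workers]
    needs_review = [a for a in assignments if a["worker_id"] in flagged_workers]
    return trusted, needs_review


# Made-up results: worker w2 picked the trap option on one HIT.
results = [
    {"worker_id": "w1", "hit_id": "hit-1", "answer": "scam"},
    {"worker_id": "w2", "hit_id": "hit-1", "answer": TRAP_OPTION},
    {"worker_id": "w2", "hit_id": "hit-2", "answer": "benign"},
    {"worker_id": "w3", "hit_id": "hit-2", "answer": "benign"},
]
trusted, flagged = split_flagged(results)
print("workers to review:", sorted({a["worker_id"] for a in flagged}))
```

Because the option only flags answers for possible rejection, the flagged set is routed to manual review here rather than rejected automatically; an honest worker can misclick.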

Conclusion

Mechanical Turk offers a cheap and efficient way to curate large data sets in need of human classification. The problem with MTurk, as is the case with most human-related tasks, is communication. Most of our recommendations revolve around maintaining a good relationship between requester and worker. Without a clearly defined problem to solve, especially in a classification task, you run the risk of building a classifier that does a poor job of generalization in the real world. There is also an added benefit of receiving feedback from these workers, which can help you define your problem space to a more tightly knit domain.

We love Mechanical Turk. It’s a community of like-minded people who want to solve the kinds of problems that we want to solve. The best advice we can give is this: don’t treat Mechanical Turk as a place for cheap labor. Treat it as a powerful venue for collaborating on interesting, complex and data-driven challenges.