Features



Entity Resolution: the problem of linking multiple database representations of the same real-world "entity". SampleClean provides a library and programming API for constructing distributed entity resolution pipelines.
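As a rough illustration of what an entity resolution step computes, the sketch below (plain Scala with illustrative names; this is not the SampleClean API) tokenizes an attribute value, scores pairs of records with Jaccard similarity, and flags pairs above a threshold as likely duplicates:

```scala
// Illustrative sketch of similarity-based entity resolution
// (not SampleClean's implementation).
object ERSketch {
  // Tokenize an attribute value into a set of lower-cased words.
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  // Jaccard similarity between two token sets: |intersection| / |union|.
  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else a.intersect(b).size.toDouble / a.union(b).size

  // All pairs of records considered duplicates at the given threshold.
  def matches(records: Seq[String], threshold: Double): Seq[(String, String)] =
    for {
      i <- records.indices
      j <- (i + 1) until records.size
      if jaccard(tokens(records(i)), tokens(records(j))) >= threshold
    } yield (records(i), records(j))

  def main(args: Array[String]): Unit = {
    val rs = Seq("Ritz Carlton Cafe", "The Ritz-Carlton Cafe", "Panda Express")
    // Only the two Ritz-Carlton variants exceed the threshold.
    println(matches(rs, 0.5))
  }
}
```

A real pipeline adds a blocking step to avoid the quadratic pairwise comparison and a canonicalization step to merge the matched records into one representation.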




Crowd Sourcing: entity resolution tasks can be hard to automate, and for reliable results crowdsourcing is a preferred solution. SampleClean provides a library of crowdsourcing tools that also adaptively learn through active learning. To use crowdsourcing, a prerequisite is a running AMPCrowd server.
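To give a flavor of how multiple crowd votes per question can be combined into one reliable answer, here is a minimal majority-vote sketch in plain Scala (illustrative only; SampleClean's actual crowd components and active-learning strategy live behind the API shown in the quick start):

```scala
// Minimal sketch: combine several workers' yes/no votes on
// "are these two records the same entity?" questions by majority vote
// (illustrative, not SampleClean code).
object CrowdVoteSketch {
  // A question is identified by the pair of record ids being compared.
  type Pair = (String, String)

  // Majority vote: true iff strictly more than half the votes are "yes".
  def majority(votes: Seq[Boolean]): Boolean =
    2 * votes.count(identity) > votes.size

  // Aggregate raw (pair, vote) responses into one decision per pair.
  def decide(responses: Seq[(Pair, Boolean)]): Map[Pair, Boolean] =
    responses.groupBy(_._1).map { case (pair, vs) =>
      pair -> majority(vs.map(_._2))
    }

  def main(args: Array[String]): Unit = {
    val responses = Seq(
      (("r1", "r2"), true), (("r1", "r2"), true), (("r1", "r2"), false),
      (("r3", "r4"), false), (("r3", "r4"), false), (("r3", "r4"), true))
    // (r1,r2) resolves to a match, (r3,r4) to a non-match.
    println(decide(responses))
  }
}
```

Requesting more votes per question (compare the votesPerPoint parameter in the quick start) trades cost for reliability.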




Approximate Query Processing: We often want to know aggregate statistics of the database (SUM, COUNT, AVG), and to answer these queries with high accuracy it often suffices to clean a small sample of data. SampleClean provides the primitives to sample and extrapolate query results on the sample.
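The extrapolation behind these estimates is simple: for a uniform sample, SUM and COUNT scale by the inverse sampling ratio, while AVG needs no scaling. A minimal sketch in plain Scala (illustrative names, not the SampleClean API):

```scala
// Illustrative sketch of sample-based aggregate estimation
// (not SampleClean's implementation).
object AQPSketch {
  // Estimate SUM over the full table from a uniform sample:
  // scale the sample sum by the inverse sampling ratio.
  def estimateSum(sample: Seq[Double], samplingRatio: Double): Double =
    sample.sum / samplingRatio

  // AVG needs no scaling: the sample mean is an unbiased estimate.
  def estimateAvg(sample: Seq[Double]): Double =
    sample.sum / sample.size

  // Estimate COUNT by scaling the sample size.
  def estimateCount(sampleSize: Int, samplingRatio: Double): Double =
    sampleSize / samplingRatio

  def main(args: Array[String]): Unit = {
    val sample = Seq(10.0, 20.0, 30.0) // a 50% uniform sample
    println(estimateSum(sample, 0.5))        // 120.0
    println(estimateCount(sample.size, 0.5)) // 6.0
    println(estimateAvg(sample))             // 20.0
  }
}
```

Estimators of this form typically come with confidence intervals whose width shrinks as the cleaned sample grows, which is why cleaning a small sample can already give accurate answers.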


Programming With SampleClean

You can download the SampleClean jar to include with any Spark program, or clone our GitHub repository to check out the source code. We have provided a programming guide and Scala docs to help you get started.

Quick Start

We will walk through a basic tutorial on how to get SampleClean running using the Spark shell, either locally or on a cluster.

Pre-requisites

SampleClean provides a set of Scala libraries for Entity Resolution, Crowd Sourcing, and Approximate Query Processing. To run the quick start you will need:

1. Java Development Kit 7+
2. Scala 2.10.x

Spark and SampleClean Local Installation

1. First, create a new directory:

    mkdir sampleclean

2. Download Spark 1.2.x to this directory.

3. Untar Spark:

    tar xvzf spark-1.2.2.tgz

4. Build Spark:

    cd spark-1.2.2
    sbt/sbt -Phive assembly/assembly

5. Download SampleClean to the Spark directory.

6. To avoid permission issues on a local deployment, configure Hive with our default config. Download the config to the Spark directory.

7. Put the config in the Spark configuration folder:

    mv hive-site.xml.default conf/hive-site.xml

8. Download the example dataset to the Spark folder.

Testing Your Installation

9. Open the Spark shell:

    ./bin/spark-shell --jars sampleclean-v0.1.jar

10. Import SampleClean:

    import sampleclean.api.SampleCleanContext

11. Create a new SampleCleanContext and HiveContext:

    val scc = new SampleCleanContext(sc)

12. Load the example dataset:

    scc.hql("CREATE TABLE restaurant(id String, entity String, name String, category String, city String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
    scc.hql("LOAD DATA LOCAL INPATH 'restaurant.csv' OVERWRITE INTO TABLE restaurant")

13. Create a working set:

    scc.initialize("restaurant","restaurant_working")

14. Count the number of distinct restaurants:

    scc.hql("select count(distinct name) from restaurant").collect().foreach(println)

15. Do entity resolution:

    import sampleclean.clean.deduplication.EntityResolution
    val algorithm = EntityResolution.longAttributeCanonicalize(scc,"restaurant_working","name",0.7)
    algorithm.exec()
    scc.writeToParent("restaurant_working")

16. Count the number of distinct restaurants again:

    scc.hql("select count(distinct name) from restaurant").collect().foreach(println)

Using the Crowd

19. Configure crowd tasks (if you installed AMPCrowd earlier):

    import sampleclean.crowd._
    val crowdConfig = CrowdConfiguration(crowdName="internal", crowdServerHost="127.0.0.1", crowdServerPort=443)
    val taskParams = CrowdTaskConfiguration(votesPerPoint=1, maxPointsPerTask=10)

20. Add a crowd matching step to the entity resolution algorithm:

    val crowdMatcher = EntityResolution.createCrowdMatcher(scc,"name","restaurant_working")
    crowdMatcher.alstrategy.setCrowdParameters(crowdConfig)
    crowdMatcher.alstrategy.setTaskParameters(taskParams)
    val crowdAlgorithm = EntityResolution.longAttributeCanonicalize(scc,"restaurant_working","name",0.6)
    crowdAlgorithm.components.addMatcher(crowdMatcher)

21. Run the crowd-driven entity resolution (creating crowd tasks):

    crowdAlgorithm.exec()

22. Do some crowd tasks (navigate your browser to http://127.0.0.1:8000/crowds/internal/).

23. Persist the new results:

    scc.writeToParent("restaurant_working")

24. Count the number of distinct restaurants:

    scc.hql("select count(distinct name) from restaurant").collect().foreach(println)

25. Exit:

    exit

Cluster Installation

You can also use SampleClean on a Spark cluster using our provided scripts. Note that you must have valid AWS credentials to start your cluster; the scripts configure everything else that is required. See sampleclean-async/deploy/README to learn about deploying EC2 clusters for SampleClean. After starting the cluster, you can log in remotely and use SampleClean with spark-submit or the Spark shell (similar to the local usage mode). Remember to load your datasets into HDFS using ephemeral or persistent storage before running your application.