usaddress is a Python library for parsing unstructured US address strings into address components, using statistical NLP methods (a conditional random field model).

The parse method will split your address string into components, and label each component.

    >>> import usaddress
    >>> usaddress.parse('Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')
    [('Robie', 'BuildingName'),
     ('House,', 'BuildingName'),
     ('5757', 'AddressNumber'),
     ('South', 'StreetNamePreDirectional'),
     ('Woodlawn', 'StreetName'),
     ('Avenue,', 'StreetNamePostType'),
     ('Chicago,', 'PlaceName'),
     ('IL', 'StateName'),
     ('60637', 'ZipCode')]

The tag method will try to be a little smarter: it will merge consecutive components and strip commas, as well as return an address type (Street Address, Intersection, PO Box, or Ambiguous).

    >>> import usaddress
    >>> usaddress.tag('Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')
    (OrderedDict([
       ('BuildingName', 'Robie House'),
       ('AddressNumber', '5757'),
       ('StreetNamePreDirectional', 'South'),
       ('StreetName', 'Woodlawn'),
       ('StreetNamePostType', 'Avenue'),
       ('PlaceName', 'Chicago'),
       ('StateName', 'IL'),
       ('ZipCode', '60637')]),
     'Street Address')
    >>> usaddress.tag('State & Lake, Chicago')
    (OrderedDict([
       ('StreetName', 'State'),
       ('IntersectionSeparator', '&'),
       ('SecondStreetName', 'Lake'),
       ('PlaceName', 'Chicago')]),
     'Intersection')
    >>> usaddress.tag('P.O. Box 123, Chicago, IL')
    (OrderedDict([
       ('USPSBoxType', 'P.O. Box'),
       ('USPSBoxID', '123'),
       ('PlaceName', 'Chicago'),
       ('StateName', 'IL')]),
     'PO Box')

Because the tag method returns an OrderedDict with labels as keys, it will raise a RepeatedLabelError when non-adjacent parts of an address have the same label, and thus can’t be concatenated. When RepeatedLabelError is raised, it is likely that either (1) the input string is not a valid address, or (2) some tokens were labeled incorrectly.
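The merging behaviour and the repeated-label failure described above can be illustrated with a minimal sketch. This is not the library's implementation: `merge_components` is a hypothetical helper, it raises a plain `ValueError` rather than `RepeatedLabelError`, and it ignores intersection handling and address-type detection. It only shows why adjacent tokens with one label can be joined while non-adjacent ones cannot.

```python
from collections import OrderedDict


def merge_components(parsed):
    """Join adjacent tokens that share a label (hypothetical helper).

    `parsed` is a list of (token, label) pairs shaped like the output
    of the parse method.
    """
    merged = OrderedDict()
    last_label = None
    for token, label in parsed:
        if label == last_label:
            # Same label as the previous token: part of the same component.
            merged[label].append(token)
        elif label not in merged:
            merged[label] = [token]
        else:
            # The label was seen earlier, with other labels in between,
            # so there is no single string to store under this key.
            raise ValueError(f"label {label!r} repeats in non-adjacent positions")
        last_label = label
    # Join each component and strip trailing separators such as commas.
    return OrderedDict(
        (label, " ".join(tokens).strip(" ,;")) for label, tokens in merged.items()
    )


# Parse output copied from the example above.
parsed = [
    ("Robie", "BuildingName"), ("House,", "BuildingName"),
    ("5757", "AddressNumber"), ("South", "StreetNamePreDirectional"),
    ("Woodlawn", "StreetName"), ("Avenue,", "StreetNamePostType"),
    ("Chicago,", "PlaceName"), ("IL", "StateName"), ("60637", "ZipCode"),
]
merged = merge_components(parsed)
```

Here `merged["BuildingName"]` is `'Robie House'`, while an input such as `[("Main", "StreetName"), ("Chicago", "PlaceName"), ("Elm", "StreetName")]` raises, because the two street names are separated by another label.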

RepeatedLabelError has the attributes original_string (the input string) and parsed_string (the output of the parse method on the input string). You can use these attributes to write custom exception handling, for example:

    try:
        tagged_address, address_type = usaddress.tag(string)
    except usaddress.RepeatedLabelError as e:
        some_special_instructions(e.parsed_string, e.original_string)
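One possible form for such special instructions is a fallback that keeps every token, grouped by label, instead of discarding the address. The sketch below is an assumption about how you might structure this: `group_by_label` and `tag_or_group` are hypothetical names, and returning `None` as the address type in the fallback case is an arbitrary choice made for this example.

```python
from collections import defaultdict


def group_by_label(parsed_string):
    """Collect tokens under their labels, keeping repeats as list items."""
    grouped = defaultdict(list)
    for token, label in parsed_string:
        grouped[label].append(token.strip(" ,;"))
    return dict(grouped)


def tag_or_group(address):
    """Try tag first; fall back to grouped tokens on RepeatedLabelError."""
    import usaddress  # third-party; imported lazily so the helper above stands alone

    try:
        tagged_address, address_type = usaddress.tag(address)
        return dict(tagged_address), address_type
    except usaddress.RepeatedLabelError as e:
        # e.parsed_string holds the (token, label) pairs from parse.
        return group_by_label(e.parsed_string), None


# Hand-written (token, label) pairs shaped like parse output, for illustration only.
example = [
    ("123", "AddressNumber"), ("Main", "StreetName"),
    ("St,", "StreetNamePostType"), ("456", "AddressNumber"),
]
grouped = group_by_label(example)
```

With this fallback, callers can detect the ambiguous case by checking for a `None` address type and for list values in the returned dict.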

It is also possible to pass a mapping dict to the tag method to remap the labels to your own format. For example:
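A minimal sketch of such a remapping follows. It assumes the mapping is passed through the `tag_mapping` keyword argument of the tag method. The target keys (`building`, `address1`, `city`, and so on) are a hypothetical schema chosen for this example; only the label names on the left come from the output shown earlier.

```python
# Hypothetical target schema: several usaddress labels share one output key.
LABEL_MAPPING = {
    "BuildingName": "building",
    "AddressNumber": "address1",
    "StreetNamePreDirectional": "address1",
    "StreetName": "address1",
    "StreetNamePostType": "address1",
    "PlaceName": "city",
    "StateName": "state",
    "ZipCode": "zip_code",
}


def tag_with_mapping(address):
    """Tag an address and rename its labels according to LABEL_MAPPING."""
    import usaddress  # third-party; imported lazily

    # Assumption: the mapping dict is accepted as `tag_mapping`.
    return usaddress.tag(address, tag_mapping=LABEL_MAPPING)
```

Calling `tag_with_mapping('Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')` should return a dict keyed by the new names plus the address type. Consecutive components that map to the same key, such as the street number, directional, name, and type here, would be expected to merge into a single `address1` value.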