At SigParser we’re in the business of capturing data from unstructured emails. When we started building SigParser we tried all the open source solutions for parsing emails. None of them were accurate enough. So we built the most accurate email body parser in existence.

Say you have an email body like this…

Great talking with you. Let's catchup soon.

Thanks,
Mark Anderson
VP of Engineering
888-222-4444

On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:
> Let's talk at 11.
> Thanks
> Paul Johnson

And you want the first message only…

Great talking with you. Let's catchup soon.

Or maybe the second message body…

Let's talk at 11.

How do you do that easily? We’ll cover various programming solutions below.

Why is this hard?

We spent years building email parsers. There are a lot of issues that need to be solved when writing your own email parser:

Signature identification

Various formats for headers: "On Fri, Nov 19th…", "On 10/9/2018", headers that wrap across lines, and "From:, To:, Date:" style headers

Reply chains indicated by > or multiple >>>

Some lines look like signatures but aren’t

Corrupted email headers

Plain text emails commonly split reply headers across lines

Multi-language support is required even if no one on your team speaks another language

Header formats change over time

Email clients change over time
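To make the difficulty concrete, here is a minimal sketch of the naive approach, assuming a single English, single-line, Gmail-style "On ... wrote:" header. The function name `split_first_reply` and the regular expression are illustrative only and are not how SigParser works. It splits the example email from the top of this post, but it cannot strip signatures and it breaks on most of the cases listed above.

```python
import re

# Matches only one English, single-line, Gmail-style reply header.
# Wrapped headers, other languages and "From:/Sent:" blocks are NOT handled.
REPLY_HEADER = re.compile(r"^On .+ wrote:[ \t]*$", re.MULTILINE)


def split_first_reply(body: str) -> tuple[str, str]:
    """Return (newest message, quoted text with leading '>' markers removed)."""
    match = REPLY_HEADER.search(body)
    if match is None:
        return body.strip(), ""
    newest = body[:match.start()].strip()
    quoted_lines = body[match.end():].splitlines()
    quoted = "\n".join(re.sub(r"^>+ ?", "", line) for line in quoted_lines)
    return newest, quoted.strip()


email_body = (
    "Great talking with you. Let's catchup soon.\n"
    "\n"
    "Thanks,\n"
    "Mark Anderson\n"
    "VP of Engineering\n"
    "888-222-4444\n"
    "\n"
    "On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:\n"
    "> Let's talk at 11.\n"
    "> Thanks\n"
    "> Paul Johnson\n"
)

newest, quoted = split_first_reply(email_body)
print(newest)  # the first message, with Mark's signature still attached
print(quoted)  # the second message, with Paul's sign-off still attached
```

Both results still carry the signature lines, which is exactly the part that a single regular expression cannot solve.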

Still don’t believe us? Look at our change logs. We’re constantly finding new edge cases.

Because of this, we suggest not coding your own signature parsing algorithm. It is non-trivial. There are also a number of half-baked open source efforts out there. We’ve tried them all. Most of our users tried those first before using SigParser.

Our simple email parsing tools provide a consistent JSON result.

Clean email bodies of signatures and reply chains

Get email bodies for forwarded emails

Capture nested email chains in a single MIME message or .eml file

REST API option - POST https://api.sigparser.com/api/Mime/ParseString

Windows, Linux and AWS Lambda deployment options

.eml, .msg, or JSON format inputs

Frequent updates as email clients and patterns change

Usage based and unlimited plans available
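As a rough sketch of the REST option, the snippet below builds a small MIME message with Python's standard library and prepares a POST request for the endpoint listed above. Only the URL comes from this post. Sending the raw MIME text as the request body is an assumption, the sample addresses are made up, and authentication is not shown, so check the SigParser API documentation before relying on any of it. The request is built but deliberately not sent.

```python
import urllib.request
from email.message import EmailMessage

# Endpoint taken from the feature list above.
PARSE_URL = "https://api.sigparser.com/api/Mime/ParseString"


def build_mime_string() -> str:
    """Build a small sample message (addresses are placeholders)."""
    msg = EmailMessage()
    msg["From"] = "Mark Anderson <mark@example.com>"
    msg["To"] = "Paul Johnson <paul@example.com>"
    msg["Subject"] = "Re: Catching up"
    msg.set_content(
        "Great talking with you. Let's catchup soon.\n\nThanks,\nMark Anderson\n"
    )
    return msg.as_string()


def build_request(mime: str) -> urllib.request.Request:
    # ASSUMPTION: the raw MIME text is the request body. Authentication and
    # content-type requirements are not covered here.
    return urllib.request.Request(PARSE_URL, data=mime.encode("utf-8"), method="POST")


request = build_request(build_mime_string())
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would make the live call; it is left out here.
```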

The output structure will look like this.

{
  "CleanedBodyPlain": "Another response in the chain.\r\n\r\n",
  "CleanedBodyHtml": "<div dir=\"ltr\"><div dir=\"ltr\"><div>Another response in the chain. </div><div><br clear=\"all\"></div></div></div>",
  "IsSpammyLookingEmailMessage": false,
  "IsSpammyLookingSender": false,
  "EmailTypes": [ "NormalEmail" ],
  "Emails": [
    {
      "CleanedBodyPlain": "Another response in the chain.\r\n\r\n",
      "CleanedBodyHtml": "<div dir=\"ltr\"><div dir=\"ltr\"><div>Another response in the chain. </div><div><br clear=\"all\"></div></div></div>",
      "Subject": null,
      "Date": "2020-05-11T16:41:16+00:00",
      "FromEmailAddress": "paul@example.com",
      "FromName": "Paul Mendoza",
      "To": [ { "Name": "Outlook Tester", "EmailAddress": "outlook.tester@salesforceemail.com" } ],
      "Cc": []
    },
    {
      "CleanedBodyPlain": "This is a reply from the test account.\r\n\r\n",
      "CleanedBodyHtml": null,
      "Subject": null,
      "Date": "2020-05-11T09:40:00",
      "FromEmailAddress": "outlook.tester@salesforceemail.com",
      "FromName": "Outlook Tester",
      "To": [],
      "Cc": []
    },
    {
      "CleanedBodyPlain": null,
      "CleanedBodyHtml": null,
      "Subject": "One more test email at 3:25 PM",
      "Date": "2020-04-12T15:25:00",
      "FromEmailAddress": "paul@example.com",
      "FromName": "Paul Mendoza",
      "To": [ { "Name": "Outlook Tester", "EmailAddress": "outlook.tester@salesforceemail.com" } ],
      "Cc": []
    }
  ],
  "Subject": "Re: One more test email at 3:25 PM",
  "Date": "2020-05-11T16:41:16+00:00",
  "Headers": {
    "mime-version": "1.0",
    "date": "Mon, 11 May 2020 09:41:16 -0700",
    "references": "<CAL5Lp9VcCVNqeiw0Rry7BHQaTct46qv3BnUvR5-HNqWZO-Xxiw@mail.gmail.com>\r\n\t<BY5PR04MB6819EFA89CDABDFCB9D67D2F8AA10@BY5PR04MB6819.namprd04.prod.outlook.com>",
    "in-reply-to": "<BY5PR04MB6819EFA89CDABDFCB9D67D2F8AA10@BY5PR04MB6819.namprd04.prod.outlook.com>",
    "message-id": "<CAL5Lp9X0RjYNOo68Y_boL8OOw32gU-SWxLW3WjgYj93eTfUsyQ@mail.gmail.com>",
    "subject": "Re: One more test email at 3:25 PM",
    "from": "Paul Mendoza <paul@example.com>",
    "to": "Outlook Tester <outlook.tester@salesforceemail.com>",
    "content-type": "multipart/alternative; boundary=\"00000000000001bd4705a5620460\""
  },
  "FullPlainTextBody": "Another response in the chain.



*Paul Mendoza*, Founder

Mobile 760-917-3753

SigParser

paul@example.com

Schedule a meeting with me here <https://www.meetingbird.com/m/xxxxxx>



Listen to podcasts? I was recently on the *FutureTech Podcast*

<https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/>

talking about SigParser and use cases other customers are using it for.





On Mon, May 11, 2020 at 9:40 AM Outlook Tester <

outlook.tester@salesforceemail.com> wrote:



> This is a reply from the test account.

>

>

>

> *From:* Paul Mendoza <paul@example.com>

> *Sent:* Sunday, April 12, 2020 3:25 PM

> *To:* Outlook Tester <outlook.tester@salesforceemail.com>

> *Subject:* One more test email at 3:25 PM

>

>

>

>

> *Paul Mendoza, *Founder

>

> Mobile 760-917-3753

>

> SigParser

>

> paul@example.com

>

> Schedule a meeting with me here <https://www.meetingbird.com/m/xxxxxx>

>

> Listen to podcasts? I was recently on the *FutureTech Podcast*

> <https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/>

> talking about SigParser and use cases other customers are using it for.

>

", "FullHtmlBody": "<div dir=\"ltr\"><div dir=\"ltr\"><div>Another response in the chain. </div><div><br clear=\"all\"><div><div dir=\"ltr\" class=\"gmail_signature\" data-smartmail=\"gmail_signature\"><div dir=\"ltr\"><div><div dir=\"ltr\"><div><div dir=\"ltr\"><div><div dir=\"ltr\"><div dir=\"ltr\"><div dir=\"ltr\"><div dir=\"ltr\"><font color=\"#3d85c6\" face=\"tahoma, sans-serif\" style=\"font-size:12.8px\"><b>Paul Mendoza</b></font><font color=\"#3d85c6\" face=\"tahoma, sans-serif\" style=\"font-size:12.8px;font-weight:bold\">, </font><span style=\"font-size:12.8px;color:rgb(61,133,198);font-family:tahoma,sans-serif\">Founder</span><div style=\"font-size:12.8px\"><div><font color=\"#666666\" size=\"2\" face=\"arial narrow, sans-serif\">Mobile 760-917-3753</font></div><div><font color=\"#666666\" size=\"2\" face=\"arial narrow, sans-serif\">SigParser</font></div><div><a href=\"mailto:paul@example.com\" style=\"font-family:tahoma,sans-serif;font-size:12.8px;color:rgb(17,85,204)\" target=\"_blank\">paul@example.com</a><br></div><div><a href=\"https://www.meetingbird.com/m/xxxxxx\" target=\"_blank\">Schedule a meeting with me here</a></div><div><img src=\"https://drive.google.com/a/sigparser.com/uc?id=1GUhMvrGnJMCfkge1HMqyKFQCLSJNXcw-&export=download\" width=\"200\" height=\"90\"><br></div></div>Listen to podcasts? I was recently on the <a href=\"https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/\" target=\"_blank\"><b>FutureTech Podcast</b></a> talking about SigParser and use cases other customers are using it for. 
</div></div></div></div></div></div></div></div></div></div></div></div><br></div></div><br><div class=\"gmail_quote\"><div dir=\"ltr\" class=\"gmail_attr\">On Mon, May 11, 2020 at 9:40 AM Outlook Tester <<a href=\"mailto:outlook.tester@salesforceemail.com\">outlook.tester@salesforceemail.com</a>> wrote:<br></div><blockquote class=\"gmail_quote\" style=\"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex\">











<div lang=\"EN-US\">

<div class=\"gmail-m_-2662285044572695259WordSection1\">

<p class=\"MsoNormal\">This is a reply from the test account.<u></u><u></u></p>

<p class=\"MsoNormal\"><u></u> <u></u></p>

<div style=\"border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0in 0in\">

<p class=\"MsoNormal\"><b>From:</b> Paul Mendoza <<a href=\"mailto:paul@example.com\" target=\"_blank\">paul@example.com</a>> <br>

<b>Sent:</b> Sunday, April 12, 2020 3:25 PM<br>

<b>To:</b> Outlook Tester <<a href=\"mailto:outlook.tester@salesforceemail.com\" target=\"_blank\">outlook.tester@salesforceemail.com</a>><br>

<b>Subject:</b> One more test email at 3:25 PM<u></u><u></u></p>

</div>

<p class=\"MsoNormal\"><u></u> <u></u></p>

<div>

<p class=\"MsoNormal\"><br clear=\"all\">

<u></u><u></u></p>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<p class=\"MsoNormal\"><b><span style=\"font-size:9.5pt;font-family:Tahoma,sans-serif;color:rgb(61,133,198)\">Paul Mendoza, </span></b><span style=\"font-size:9.5pt;font-family:Tahoma,sans-serif;color:rgb(61,133,198)\">Founder</span><u></u><u></u></p>

<div>

<div>

<p class=\"MsoNormal\"><span style=\"font-size:10pt;font-family:"Arial Narrow",sans-serif;color:rgb(102,102,102)\">Mobile 760-917-3753</span><span style=\"font-size:9.5pt\"><u></u><u></u></span></p>

</div>

<div>

<p class=\"MsoNormal\"><span style=\"font-size:10pt;font-family:"Arial Narrow",sans-serif;color:rgb(102,102,102)\">SigParser</span><span style=\"font-size:9.5pt\"><u></u><u></u></span></p>

</div>

<div>

<p class=\"MsoNormal\"><span style=\"font-size:9.5pt\"><a href=\"mailto:paul@example.com\" target=\"_blank\"><span style=\"font-family:Tahoma,sans-serif;color:rgb(17,85,204)\">paul@example.com</span></a><u></u><u></u></span></p>

</div>

<div>

<p class=\"MsoNormal\"><span style=\"font-size:9.5pt\"><a href=\"https://www.meetingbird.com/m/xxxxxx\" target=\"_blank\">Schedule a meeting with me here</a><u></u><u></u></span></p>

</div>

<div>

<p class=\"MsoNormal\"><span style=\"font-size:9.5pt\"><img border=\"0\" width=\"200\" height=\"90\" style=\"width: 2.0833in; height: 0.9375in;\" id=\"gmail-m_-2662285044572695259_x0000_i1025\" src=\"https://ci6.googleusercontent.com/proxy/TTpjUlFcjmphqTPKcbTFGb7TsHUk5MzP3P1Wt2uZYLjMzlO0UPeF7MAgaUwFk4hqlFafCMhmzlmkc3FUbGH4ijNXkqx9DAsv-_3CFnCTmZaZhMlONJqrrR-oGfWMfwqGpDgk301HHsijRMhsymfOCkhNKg=s0-d-e1-ft#https://drive.google.com/a/sigparser.com/uc?id=1GUhMvrGnJMCfkge1HMqyKFQCLSJNXcw-&export=download\"></span><span style=\"font-size:9.5pt\"><u></u><u></u></span></p>

</div>

</div>

<p class=\"MsoNormal\">Listen to podcasts? I was recently on the <a href=\"https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/\" target=\"_blank\">

<b>FutureTech Podcast</b></a> talking about SigParser and use cases other customers are using it for.

<u></u><u></u></p>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>



</blockquote></div></div>

" }

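Once you have a result like the one above, walking the chain is a matter of iterating over the "Emails" array. The sketch below uses a trimmed copy of the sample output with the same field names. The helper `summarize_chain` is an illustrative name, not part of SigParser.

```python
import json

# A trimmed sample that reuses field names and values from the output above.
sample = '''
{
  "Subject": "Re: One more test email at 3:25 PM",
  "Emails": [
    {"FromName": "Paul Mendoza", "Date": "2020-05-11T16:41:16+00:00",
     "CleanedBodyPlain": "Another response in the chain.\\r\\n\\r\\n"},
    {"FromName": "Outlook Tester", "Date": "2020-05-11T09:40:00",
     "CleanedBodyPlain": "This is a reply from the test account.\\r\\n\\r\\n"},
    {"FromName": "Paul Mendoza", "Date": "2020-04-12T15:25:00",
     "CleanedBodyPlain": null}
  ]
}
'''


def summarize_chain(result: dict) -> list[str]:
    """Return one 'date sender: body' line per message in the chain."""
    lines = []
    for email in result.get("Emails", []):
        # CleanedBodyPlain can be null, as in the last message of the sample.
        body = (email.get("CleanedBodyPlain") or "").strip()
        lines.append(
            f'{email.get("Date")} {email.get("FromName")}: {body or "(no body captured)"}'
        )
    return lines


for line in summarize_chain(json.loads(sample)):
    print(line)
```

In this sample the newest message comes first, and the original message at the bottom of the chain has a subject but no captured body.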

Command Line (Linux or Windows)

Consume SigParser from any shell. Provide it with a JSON file of the email, an EML file, or an MSG file, and it will return a JSON structured response with the fields listed above. You can also tell it to write the output to a directory.

SigParser API called with Python

Here is an example of calling our assembly from Python. You’ll need to write the JSON out to the input.json file first.

import os

# Run the SigParser command line tool and read the JSON it prints
stream = os.popen('SigParserEmailUtils cleanedemail --filename input.json')
output = stream.read()
print(output)
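A slightly fuller sketch, under a few assumptions: the input JSON uses the same field names as the Lambda test email later in this post (FromEmailAddress, FromName, TextBody, HtmlBody), the tool prints its JSON result to standard output, and SigParserEmailUtils is on your PATH. The helper names are illustrative. It writes input.json first, then runs the same command through `subprocess.run` so that a non-zero exit code raises an error.

```python
import json
import subprocess
from pathlib import Path


def write_input(email: dict, path: str = "input.json") -> Path:
    """Write the email to the JSON file the command line tool reads."""
    target = Path(path)
    target.write_text(json.dumps(email), encoding="utf-8")
    return target


def build_command(input_path: Path) -> list[str]:
    # Same command as in the snippet above, as an argument list (no shell).
    return ["SigParserEmailUtils", "cleanedemail", "--filename", str(input_path)]


def clean_email(email: dict) -> dict:
    """Run the tool and parse its stdout as JSON (ASSUMES JSON on stdout)."""
    completed = subprocess.run(
        build_command(write_input(email)),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


email = {
    "FromEmailAddress": "mary.johnson@fake.com",
    "FromName": "Mary Johnson",
    "TextBody": "Hi John,\n\nLet's get coffee tomorrow.\n\nThanks\nMary Johnson",
    "HtmlBody": None,
}
# result = clean_email(email)  # uncomment once SigParserEmailUtils is installed
```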

Lambda Deployment Option

AWS Lambda is a great service for deploying SigParser’s email parsing tools. Each email gets its own dedicated RAM and CPU. Lambdas are kept warm for around 5 minutes, which reduces the startup time per email, and they scale really well.

Deploying Your Lambda

To configure, create a .NET Core 2.1 (C#/PowerShell) Lambda function. The name doesn’t matter.

In the Function code section, set the Handler to SigParser.EmailParsing.Lambda::SigParser.EmailParsing.Lambda.Function::GetCleanedEmailAsync

Upload the SigParser.EmailParsing.Utils.Lambda.zip file.

Set the SigParserLicenseKey environment variable to your Cryptolens license key. Contact us to get one.

Set the Memory to 2048MB of RAM. SigParser needs quite a bit of RAM to run all the machine learning systems quickly.

Click Save, then click Test and use the test email below. It should return a JSON result. The first invocation can be slow, but after that it tends to be fast.

{
  "FromEmailAddress": "mary.johnson@fake.com",
  "FromName": "Mary Johnson",
  "TextBody": null,
  "HtmlBody": "<p>Hi John,</p>\r\n\r\n<p>Let's get coffee tomorrow.</p>\r\n\r\n<p>Thanks Mary Johnson</p>"
}

Invoke Lambda Function
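One way to invoke the deployed function from code is the AWS SDK for Python (boto3). This is a sketch under stated assumptions: the function name `sigparser-email-parser` is hypothetical (use whatever you named your Lambda), the event uses the same fields as the test email above, and the response payload is assumed to be the JSON result. boto3 is a third-party package and AWS credentials must be configured for the call to work.

```python
import json

# Hypothetical function name; replace with the name you gave your Lambda.
FUNCTION_NAME = "sigparser-email-parser"


def build_payload(from_email: str, from_name: str, html_body: str) -> bytes:
    """Serialize an event with the same fields as the test email above."""
    event = {
        "FromEmailAddress": from_email,
        "FromName": from_name,
        "TextBody": None,
        "HtmlBody": html_body,
    }
    return json.dumps(event).encode("utf-8")


def invoke_parser(payload: bytes, function_name: str = FUNCTION_NAME) -> dict:
    """Call the Lambda synchronously and decode its JSON response."""
    import boto3  # imported here so the rest of the sketch runs without it

    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=payload,
    )
    return json.loads(response["Payload"].read())


payload = build_payload(
    "mary.johnson@fake.com",
    "Mary Johnson",
    "<p>Hi John,</p>\r\n\r\n<p>Let's get coffee tomorrow.</p>",
)
# result = invoke_parser(payload)  # requires boto3, credentials and a deployed function
```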

RAM Usage Explained

SigParser needs 2048MB of RAM per email to execute safely without running out of memory. The average real human email needs 962MB of RAM. The 99th percentile needs 1605MB.

The SigParser email parser is incredibly CPU intensive. In AWS, the more RAM you give a Lambda, the more CPU speed AWS gives it. So allocating lots of RAM isn’t wasteful, since the function executes faster.

Mailgun vs SigParser Parsing Libraries

We get compared to Mailgun’s open source email parsing library, but the two libraries differ considerably in what they do and how well they perform.

| | SigParser | Mailgun |
| --- | --- | --- |
| Accuracy: estimated accuracy for signature line identification | 99.9% | 92% |
| Strip Signatures Off Emails | Yes | Yes |
| Supported Languages: how many languages can it split emails for? | English, German, Spanish, French, Portuguese, Russian, Dutch, Norwegian, Korean, Chinese, Turkish, Swedish, Czech | English |
| Forward Extraction: capture forwarded messages | Yes | No |
| ML Knowledge: how much machine learning knowledge do you need? | None | Some. You'll need to find your own training data too, since the 200 email samples they give you aren't a very robust set. |
| Deliverables: what do you get? | Linux assembly, Windows assembly, Lambda zip file, NuGet package | Python source code |