Design data to text generation hybrid templating system using SimpleNLG.

Dec 9, 2020

More than a year ago, I wrote about designing a hybrid templating engine using a natural language generation library (SimpleNLG). I was surprised that so many people found the article helpful and asked for a hands-on tutorial on using SimpleNLG for data-to-text conversion. I have finally put together a simple tutorial on generating sentences with SimpleNLG.

Some pre-requisites: Python, Pandas, PysimpleNLG

I have divided this article into the following parts:

Some introduction of SimpleNLG.
Using some data, a walkthrough to generate text.
Some helpful links.

What is SimpleNLG?

SimpleNLG is a Java-based library for generating natural language. As mentioned in its documentation, SimpleNLG is intended to function as a “realisation engine” for natural language generation architectures. It is mainly divided into two parts:

Lexicon: computes the morphological realization.
Realizer: generates texts from a syntactic form with adequate grammar coverage.
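
To make the two roles concrete, here is a toy sketch in plain Python. It does not use SimpleNLG and is far simpler than the real library; the names TOY_LEXICON, inflect_present, and realise are invented for this illustration. The lexicon part supplies word forms, and the realiser part assembles the words and applies sentence-level orthography.

```python
# Toy illustration only: not SimpleNLG's API or implementation.

# A tiny "lexicon": base verb -> third-person singular present form.
TOY_LEXICON = {"be": "is", "have": "has", "seek": "seeks"}


def inflect_present(verb, plural=False):
    """Return the present-tense form of `verb` for a third-person subject."""
    if plural:
        # Plural subjects take the base form ("be" is the one exception here).
        return "are" if verb == "be" else verb
    # Fall back to a naive "+s" rule for verbs missing from the toy lexicon.
    return TOY_LEXICON.get(verb, verb + "s")


def realise(subject, verb, complement, plural=False):
    """Assemble the parts, capitalise the first letter, and add a full stop."""
    words = [subject, inflect_present(verb, plural), complement]
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


print(realise("my dog", "be", "happy"))
print(realise("my dogs", "seek", "attention", plural=True))
```

The real library covers far more grammar (tense, negation, agreement, irregular forms), which is the reason to use it instead of hand-written rules like these.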

Generate text using SimpleNLG

Let’s go ahead and define our lexicon and realizer.

import simplenlg
from simplenlg.framework import *
from simplenlg.lexicon import *
from simplenlg.realiser.english import *
from simplenlg.phrasespec import *
from simplenlg.features import *

lexicon = Lexicon.getDefaultLexicon()
nlgFactory = NLGFactory(lexicon)
realiser = Realiser(lexicon)

Now, let’s create some simple sentences.

# Sample example for creating a sentence
s1 = nlgFactory.createSentence("my dog is happy")

# Once the sentence is created, realise it to get the text
output = realiser.realiseSentence(s1)

output will contain the grammatically correct sentence with proper punctuation: My dog is happy.

Let’s see how we can use this simple tool to convert raw data into text. For this tutorial, I am using Kaggle’s mental health data. We use the Age and treatment columns for illustration.

# Build a feature from the treatment and Age columns.
# `survey` is assumed to be the Kaggle survey already loaded as a pandas DataFrame.
survey_age = survey[['treatment', 'Age']].copy()

# Divide the ages into age groups
def create_age_group(age):
    if 18 <= age < 25:
        return "Early 20s"
    if 25 <= age < 30:
        return "Late 20s"
    if 30 <= age < 35:
        return "Early 30s"
    if 35 <= age < 40:
        return "Late 30s"
    if 40 <= age < 45:
        return "Early 40s"
    if 45 <= age < 50:
        return "Late 40s"
    if 50 <= age < 70:
        return "50s"
    # Any other age falls through and returns None

# Apply the age grouping to Age
survey_age['age_group'] = survey_age['Age'].apply(create_age_group)

# Count the treatment responses for each age group
final_df = survey_age.groupby(['age_group', 'treatment']).size().reset_index()
final_df = final_df.rename(columns={0: "count"})

Our final dataframe final_df contains three columns: age_group, treatment, count
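
As a sanity check on the grouping logic, the sketch below reproduces the same kind of counts in plain Python, without pandas. The rows are made-up values for illustration, not taken from the Kaggle survey, and count_by_group is a hypothetical helper name. Ages outside the defined bands are skipped, which mirrors how groupby drops missing keys by default.

```python
from collections import Counter


def create_age_group(age):
    """Map an age to the article's group labels; None if outside 18-69."""
    bands = [
        (18, 25, "Early 20s"), (25, 30, "Late 20s"),
        (30, 35, "Early 30s"), (35, 40, "Late 30s"),
        (40, 45, "Early 40s"), (45, 50, "Late 40s"),
        (50, 70, "50s"),
    ]
    for low, high, label in bands:
        if low <= age < high:
            return label
    return None


def count_by_group(rows):
    """Count (age_group, treatment) pairs, skipping ages with no group."""
    counts = Counter()
    for age, treatment in rows:
        group = create_age_group(age)
        if group is not None:
            counts[(group, treatment)] += 1
    return counts


# Made-up sample rows: (Age, treatment)
sample_rows = [(22, "Yes"), (24, "No"), (24, "Yes"), (31, "No"), (72, "Yes")]
counts = count_by_group(sample_rows)

# Each entry corresponds to one row of final_df: age_group, treatment, count
for (group, treatment), count in sorted(counts.items()):
    print(group, treatment, count)
```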

Once we have this information with us, let’s convert it into a sample text.

If treatment == Yes, text is: [count] people in Age group [age_group] seeks help for mental illness.

If treatment == No, text is: [count] people in Age group [age_group] does not seek help for mental illness.

To generate the sentences above, we need to break them down into their grammatical components:

Noun Phrase: People

Pre Modifier: [count]

Post Noun Modifier: in Age Group [age_group]

Subject Phrase: [count] people in Age group [age_group]

Verb: seek

Complement: help for mental illness

If treatment is No, we need to negate this sentence.
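
The breakdown above can be written down as a small data structure before any NLG library is involved. This is a sketch in plain Python; SentencePlan, plan_from_row, and to_template_text are hypothetical names, and the string assembly is a naive stand-in for the grammar handling that the SimpleNLG realiser provides. The verb forms are copied from the two target sentences stated earlier.

```python
from dataclasses import dataclass


@dataclass
class SentencePlan:
    """Components of one sentence, mirroring the breakdown in the text."""
    noun: str
    pre_modifier: str
    post_modifier: str
    verb: str
    complement: str
    negated: bool = False


def plan_from_row(row):
    """Build a plan from a dict with 'count', 'age_group', 'treatment' keys."""
    return SentencePlan(
        noun="people",
        pre_modifier=str(row["count"]),
        post_modifier="in Age group " + row["age_group"],
        verb="seek",
        complement="help for mental illness",
        negated=(row["treatment"] == "No"),
    )


def to_template_text(plan):
    """Naive string assembly; a realiser would handle the grammar instead."""
    subject = f"{plan.pre_modifier} {plan.noun} {plan.post_modifier}"
    # Hard-coded verb forms matching the target sentences in the text.
    verb = f"does not {plan.verb}" if plan.negated else f"{plan.verb}s"
    return f"{subject} {verb} {plan.complement}."


# Hypothetical row for illustration.
row = {"count": 30, "age_group": "Late 20s", "treatment": "No"}
print(to_template_text(plan_from_row(row)))
```

Keeping the plan separate from the string assembly is what makes the approach hybrid: the fixed parts stay template-like, while negation and inflection can be handed to the realiser.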

Once we have divided the sentence into its components, it’s time to generate a sentence for each row of the DataFrame.

"""
Required text: 1. [20] people in Age group ---- seeks help for mental illness.
               2. [30] people in Age group ---- does not seek help for mental illness.

               In order to create these sentences, let's define a small rule:

               Noun phrase: people
               Premodifier: []
               Postmodifier: in Age Group + []

               Subject: Noun phrase
               Verb: seek
               Complement: help for mental illness
"""

def create_descriptions(row):
    # Subject: "[count] People in Age Group [age_group]"
    noun_phrase = nlgFactory.createNounPhrase("People")
    noun_phrase.addPreModifier(str(row['count']))
    post_modifier = "in Age Group " + row['age_group']
    noun_phrase.addPostModifier(post_modifier)

    sentence = nlgFactory.createClause()
    sentence.setSubject(noun_phrase)
    sentence.setVerb("seek")

    if row['treatment'] == 'No':
        # This will negate the sentence
        sentence.setFeature(Feature.NEGATED, True)

    sentence.addComplement("help for mental illness")

    return realiser.realiseSentence(sentence)

# Generate one sentence per row
final_df['text'] = final_df.apply(create_descriptions, axis=1)

And there you go: we have built our first data-to-text pipeline. We can also generate more complex sentences using multiple clauses and phrases.

Some helpful links

SimpleNLG provides a great tutorial on using its APIs, with a very practical walkthrough for generating complex sentences as well. All the Java APIs work fine with the Python wrapper.
Parsing tool to analyze your text and break it into morphs.
Complete tutorial is available on GitHub.
Previous article link.

Written by Somya Anand