Natural Language Processing for General Ledger

NLP often has a limited but very powerful role to play in General Ledger and broader ERP business processes.

Typical use cases include questions such as:

  • What is the impact of a market research report on the company's operating revenue?
  • What does a change in federal tax rates mean for the company's cash flow?
  • How do the company's financial statements react (if at all) to Twitter feeds, political events, or climate change?

Other use cases apply to transaction data:

  • Is there any confidential data?
  • Did a vendor submit duplicate invoices?
  • Which city or state incurred the expenses?
  • How can we identify whether finance transactions contain employee or personal information?

julia> using DataFrames, TextAnalysis

julia> df_str_sample = DataFrame(sentences = [
           "Amit Shukla lives in Los Angeles California.",
           "Most of Techie people live in North California.",
           "Elon Musk thinks, How does it matter who is living where?",
           "It doesn't matter to Bill Gates, sharing live zip code information 90210 is harmless.",
           "Jeff Bezos is here to see GeneralLedger.jl, NOT LIVING conditions, getting headache now.",
           "I am already took pills, says Jack Ma.",
           "This data does not make any sense to John Doe." ])
7×1 DataFrame
 Row │ sentences
     │ String
─────┼───────────────────────────────────
   1 │ Amit Shukla lives in Los Angeles…
   2 │ Most of Techie people live in No…
   3 │ Elon Musk thinks, How does it ma…
   4 │ It doesn't matter to Bill Gates,…
   5 │ Jeff Bezos is here to see Genera…
   6 │ I am already took pills, says Ja…
   7 │ This data does not make any sens…

julia> str1 = TextAnalysis.StringDocument("Amit Shukla lives in Los Angeles California.")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: Amit Shukla lives in Los Angeles California.

julia> str2 = TextAnalysis.StringDocument("Most of Techie people live in North California.")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: Most of Techie people live in North California.

julia> str3 = TextAnalysis.StringDocument("Elon Musk thinks, How does it matter who is living where?")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: Elon Musk thinks, How does it matter who is living

julia> str4 = TextAnalysis.StringDocument("It doesn't matter to Bill Gates, sharing live zip code information 90210 is harmless.")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: It doesn't matter to Bill Gates, sharing live zip

julia> str5 = TextAnalysis.StringDocument("Jeff Bezos is here to see GeneralLedger.jl, NOT LIVING conditions, getting headache now.")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: Jeff Bezos is here to see GeneralLedger.jl, NOT LI

julia> str6 = TextAnalysis.StringDocument("I am already took pills, says Jack Ma.")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: I am already took pills, says Jack Ma.

julia> str7 = StringDocument("This data does not make any sense to John Doe.")
A TextAnalysis.StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This data does not make any sense to John Doe.

julia> TextAnalysis.stem!(str7)

julia> crpstr2 = Corpus([str1,str2,str3,str4,str5,str6])
A Corpus with 6 documents:
 * 6 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crpstr2)

julia> update_inverse_index!(crpstr2)

julia> # lexicon(crpstr1), lexicon(crpstr2), inverse_index(crpstr1), inverse_index(crpstr2), crps2["live"]
       # TextAnalysis.text(str1),
       # TextAnalysis.text(str2),
       # TextAnalysis.text(str3),
       # TextAnalysis.text(str4),
       # TextAnalysis.text(str5),
       # TextAnalysis.text(str6),
       # TextAnalysis.text(str7)
       df_str_sample[:,:sentences]
7-element Vector{String}:
 "Amit Shukla lives in Los Angeles California."
 "Most of Techie people live in North California."
 "Elon Musk thinks, How does it matter who is living where?"
 "It doesn't matter to Bill Gates" ⋯ 23 bytes ⋯ " information 90210 is harmless."
 "Jeff Bezos is here to see Gener" ⋯ 26 bytes ⋯ "nditions, getting headache now."
 "I am already took pills, says Jack Ma."
 "This data does not make any sense to John Doe."

What is the text talking about?

("live", 5)

(lives, live, living, live, LIVING) => 5 occurrences
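
The count above depends on folding case and reducing each word to a common stem. A minimal sketch of that idea in plain Julia follows. It assumes a deliberately crude suffix stripper in place of a real stemmer such as the one behind `stem!`; the helper names `crude_stem` and `stem_counts` are hypothetical and not part of TextAnalysis.

```julia
# Sample sentences copied from df_str_sample above.
sentences = [
    "Amit Shukla lives in Los Angeles California.",
    "Most of Techie people live in North California.",
    "Elon Musk thinks, How does it matter who is living where?",
    "It doesn't matter to Bill Gates, sharing live zip code information 90210 is harmless.",
    "Jeff Bezos is here to see GeneralLedger.jl, NOT LIVING conditions, getting headache now.",
    "I am already took pills, says Jack Ma.",
    "This data does not make any sense to John Doe.",
]

# Hypothetical helper: lowercases a word and strips one common suffix.
# This is an assumption for illustration only, NOT a real stemming algorithm.
function crude_stem(word::AbstractString)
    w = lowercase(word)
    for suffix in ("ing", "es", "s", "e")
        # Keep at least three characters so short words are left alone.
        if endswith(w, suffix) && length(w) > length(suffix) + 2
            return String(chop(w, tail = length(suffix)))
        end
    end
    return w
end

# Hypothetical helper: counts how often each crude stem occurs across all texts.
function stem_counts(texts)
    counts = Dict{String,Int}()
    for text in texts, m in eachmatch(r"[A-Za-z]+", text)
        key = crude_stem(m.match)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

counts = stem_counts(sentences)

# "lives", "live", "living", "live" and "LIVING" all reduce to the same stem.
println("live => ", counts[crude_stem("live")])   # prints: live => 5
```

A production pipeline would more likely rely on a proper stemmer or lemmatizer, since suffix stripping of this kind merges unrelated words as the vocabulary grows.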

Which places are mentioned?

("North", "North California", "Los Angeles", "California", "90210")
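
One simple way to arrive at a list like this is a gazetteer lookup combined with a pattern for US ZIP codes. The sketch below assumes a small hand-written place list (`known_places`) and a hypothetical helper `find_places`; a real system would more likely use a named-entity recognizer or the location master data held in the ERP.

```julia
# Sample sentences copied from df_str_sample above.
sentences = [
    "Amit Shukla lives in Los Angeles California.",
    "Most of Techie people live in North California.",
    "Elon Musk thinks, How does it matter who is living where?",
    "It doesn't matter to Bill Gates, sharing live zip code information 90210 is harmless.",
    "Jeff Bezos is here to see GeneralLedger.jl, NOT LIVING conditions, getting headache now.",
    "I am already took pills, says Jack Ma.",
    "This data does not make any sense to John Doe.",
]

# Assumption: a tiny hand-written gazetteer, for illustration only.
known_places = ["North California", "Los Angeles", "California", "North"]

# Five-digit US ZIP code, optionally followed by a four-digit extension.
const ZIP_PATTERN = r"\b\d{5}(?:-\d{4})?\b"

# Hypothetical helper: returns each gazetteer entry or ZIP code found in the
# texts, once, in order of first appearance.
function find_places(texts, gazetteer)
    found = String[]
    for text in texts
        for place in gazetteer
            # Plain substring match; it would also hit "North" inside "Northern".
            if occursin(place, text) && !(place in found)
                push!(found, place)
            end
        end
        for m in eachmatch(ZIP_PATTERN, text)
            zip = String(m.match)
            zip in found || push!(found, zip)
        end
    end
    return found
end

places = find_places(sentences, known_places)
println(places)
```

On the sample sentences this finds the same five entries as listed above, although in a different order.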

Is there any personal data?

("Amit Shukla", "Elon Musk", "Bill Gates", "Jeff Bezos", "Jack Ma", "John Doe")
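
Person names can be approximated with a capitalization heuristic: take every pair of adjacent capitalized words and discard the pairs that are known place names. The sketch below is a rough illustration under that assumption. The pattern, the `known_places` set and the helper `find_person_names` are all hypothetical, and the heuristic would misfire on company names, titles or single-word names.

```julia
# Sample sentences copied from df_str_sample above.
sentences = [
    "Amit Shukla lives in Los Angeles California.",
    "Most of Techie people live in North California.",
    "Elon Musk thinks, How does it matter who is living where?",
    "It doesn't matter to Bill Gates, sharing live zip code information 90210 is harmless.",
    "Jeff Bezos is here to see GeneralLedger.jl, NOT LIVING conditions, getting headache now.",
    "I am already took pills, says Jack Ma.",
    "This data does not make any sense to John Doe.",
]

# Two adjacent words, each a capital letter followed by lowercase letters.
const NAME_CANDIDATE = r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"

# Assumption: two-word place names to exclude, taken from the places listed earlier.
known_places = Set(["Los Angeles", "North California"])

# Hypothetical helper: returns capitalized word pairs that are not known places,
# once each, in order of first appearance.
function find_person_names(texts, places)
    names = String[]
    for text in texts, m in eachmatch(NAME_CANDIDATE, text)
        candidate = String(m.match)
        if !(candidate in places) && !(candidate in names)
            push!(names, candidate)
        end
    end
    return names
end

person_names = find_person_names(sentences, known_places)
println(person_names)
```

In an ERP setting, matching against the employee or vendor master list is usually more reliable than a pattern like this, because the set of names that matter is already known.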

Is there any Protected Health Information (PHI)?

("Bill Gates", "90210"), ("Jeff Bezos", "headache"), ("Jack Ma", "pills")
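
Pairs like these can be produced by checking, sentence by sentence, whether a known person name appears together with a sensitive term. The following sketch assumes the names found earlier, a hypothetical keyword list `health_terms`, and a five-digit ZIP pattern; `flag_sensitive` is an illustrative helper, not an existing API, and a keyword list this small would miss most real cases.

```julia
# Sample sentences copied from df_str_sample above.
sentences = [
    "Amit Shukla lives in Los Angeles California.",
    "Most of Techie people live in North California.",
    "Elon Musk thinks, How does it matter who is living where?",
    "It doesn't matter to Bill Gates, sharing live zip code information 90210 is harmless.",
    "Jeff Bezos is here to see GeneralLedger.jl, NOT LIVING conditions, getting headache now.",
    "I am already took pills, says Jack Ma.",
    "This data does not make any sense to John Doe.",
]

# Names as listed under the personal-data question above.
person_names = ["Amit Shukla", "Elon Musk", "Bill Gates", "Jeff Bezos", "Jack Ma", "John Doe"]

# Assumption: a tiny keyword list standing in for a medical vocabulary.
health_terms = ["headache", "pills"]

# Five-digit US ZIP code.
const ZIP_CODE = r"\b\d{5}\b"

# Hypothetical helper: pairs each name with every sensitive term or ZIP code
# that occurs in the same sentence.
function flag_sensitive(texts, names, terms)
    flagged = Tuple{String,String}[]
    for text in texts
        lowered = lowercase(text)
        hits = String[]
        for term in terms
            occursin(term, lowered) && push!(hits, term)
        end
        for m in eachmatch(ZIP_CODE, text)
            push!(hits, String(m.match))
        end
        for name in names, hit in hits
            occursin(name, text) && push!(flagged, (name, hit))
        end
    end
    return flagged
end

flagged = flag_sensitive(sentences, person_names, health_terms)
foreach(println, flagged)
```

Sentence-level co-occurrence is a coarse signal: it flags the Bill Gates sentence even though the text calls the ZIP code harmless, so results like these are best treated as candidates for review.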