working with Data

In this section, we will review and create, ERP data sets required for system analytics.

Please see, most of the datasets discussed below, adhere to ERD discussed earlier. These datasets are not real but is very close to real life datasets.

These scripts do NOT copy any real organization dataset, any resemblance to any real organization working data is pure coincidental.

ERP Systems

eCommerce.ai package supports following ERP systems data structures.

Oracle, PeopleSoft, SAP, Tally, Intuit, QuickBooks etc. I will cover examples from ERP Domains like GL (General Ledger), AP (Accounts Payable), AR (Account Receivables), B2P (Buy to Pay), Expense, Travel & Time, HCM Human Capital Management, CRM etc.

Note

GitHub doesn't allow to load GBs/volumes data into GitHub repository.

While most of the scripts used in this project/package is tested on 20+GBs of data. You'll find sample datasets in ../../assets/sampleData directory.

Also, you can use following scripts to generate more data by changing sampleSize variable in following scripts.

Let's get started, and create ERP commerce dataset.

SUPPLY CHAIN

you will need following packages.

using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("CategoricalArrays")
Pkg.add("Interact")
Pkg.add("WebIO")
Pkg.add("CSV")
Pkg.add("XLSX")
Pkg.add("DelimitedFiles")
Pkg.add("Distributions")
Pkg.build("WebIO")
Pkg.status();

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions

    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
    Building WebIO → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/c9529be473e97fa0b3b2642cdafcd0896b4c9494/build.log`
      Status `~/.julia/environments/v1.7/Project.toml`
  [336ed68f] CSV v0.10.3
  [54eefc05] Cascadia v1.0.1
  [324d7699] CategoricalArrays v0.10.5
  [8f4d0f93] Conda v1.7.0
  [a93c6f00] DataFrames v1.3.2
  [31c24e10] Distributions v0.25.53
  [e30172f5] Documenter v0.27.15
  [8f5d6c58] EzXML v1.1.0
  [708ec375] Gumbo v0.8.0
  [cd3eb016] HTTP v0.9.17
  [7073ff75] IJulia v1.23.2
  [c601a237] Interact v0.10.4
  [0f8b85d8] JSON3 v1.9.4
  [b9914132] JSONTables v1.0.3
  [4d0d745f] PDFIO v0.1.13
  [c3e4b0f8] Pluto v0.18.4
  [2dfb63ee] PooledArrays v1.4.0
  [438e738f] PyCall v1.93.1
  [88034a9c] StringDistances v0.11.2
  [a2db99b7] TextAnalysis v0.7.3
  [05625dda] WebDriver v0.1.2
  [0f1e0344] WebIO v0.8.17
  [fdbf4ff8] XLSX v0.7.9
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles

rest of this blog, I will assume, you have added all packages and imported in current namespace/notebook scope.

Note

All of Finance and supply chain data discussed here is also uploaded in GitHub repo under sampleData folder. This same script can be used to produce more voluminous data.

Chartfields

Item master, Item Attribs, Item Costing
UNSPSC: The United Nations Standard Products and Services Code® (UNSPSC®) is a global classification system of products and services. These codes are used to classify products and services.
GTIN: Global Trade Item Number (GTIN) can be used by a company to uniquely identify all of its trade items. GS1 defines trade items as products or services that are priced, ordered or invoiced at any point in the supply chain.
Vendor master, Vendor Attribs, Vendor Costing Customer/Buyer/Procurement Officer Attribs shipto, warehouse, storage & inventory locations

Transactions

PurchaseOrder
MSR - Material Service
Voucher
Invoice
Receipt
Shipment
Sales, Revenue
Travel, Expense, TimeCard
Accounting Lines

Item Master from UNSPSC

import Pkg
Pkg.add("XLSX")
Pkg.add("CSV")
using XLSX, CSV, DataFrames
###############################
## create SUPPLY CHAIN DATA ###
###############################
# Item master, Item Attribs, Item Costing ##
#       UNSPSC, GTIN
############################################

##########
# UNSPSC #
##########
# UNSPSC file can be downloaded from this link https://www.ungm.org/Public/UNSPSC
# xf = XLSX.readxlsx("../../assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx")
# xf will display names of sheets and rows with data
# let's read this data in to a DataFrame

# using below command will read xlsx data into DataFrame but will not render column labels
# df = DataFrame(XLSX.readdata("../../assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC", "A1:D12988"), :auto)
dfUNSPSC = DataFrame(XLSX.readtable("../../assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC")...)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame

# replace missing values with an integer 99999
replace!(dfUNSPSC."Parent key", missing => 99999)

# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfUNSPSC)
# empty!(dfUNSPSC)
# Base.summarysize(dfUNSPSC)

first(dfUNSPSC, 5)

5 rows × 4 columns

	Key	Parent key	Code	Title
	Any	Any	Any	Any
1	100	99999	A	Raw Materials, Chemicals, Paper, Fuel
2	101	99999	B	Industrial Equipment & Tools
3	102	99999	C	Components & Supplies
4	103	99999	D	Construction, Transportation & Facility Equipment & Supplies
5	104	99999	E	Medical, Laboratory & Test Equipment & Supplies & Pharmaceuticals

GTIN data

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
##########
# GTIN ###
##########

# xf = XLSX.readxlsx("../../assets/sampleData/DS_GTIN_ALL.xlsx")
# xf will display names of sheets and rows with data
# let's read this data in to a DataFrame

# using below command will read xlsx data into DataFrame but will not render column labels
# df = DataFrame(XLSX.readdata("../../assets/sampleData/DS_GTIN_ALL.xlsx", "Worksheet", "A14:E143403   "), :auto)
dfGTIN = DataFrame(XLSX.readtable("../../assets/sampleData/DS_GTIN_ALL.xlsx", "Worksheet";first_row=14)...)
first(dfGTIN,5)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame

# replace missing values with an integer 99999
# replace!(dfUNSPSC."Parent key", missing => 99999)
# size(dfUNSPSC)

# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)
# readdir(pwd())

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGTIN)
# empty!(dfGTIN)
# Base.summarysize(dfGTIN)

5 rows × 5 columns

	150621010	10603295507444	ATTUNE FB TIB BASE SZ 10 POR	Orthopaedics	EA
	Any	Any	Any	Any	Any
1	254505412-12	10603295480709	ATTUNE RP PS TRL SZ 9 7MM	Orthopaedics	EA
2	L20409-13	10603295258117	BROACH CORAIL AMT 9	Orthopaedics	EA
3	257004050-12	10603295490449	C-STEM AMT SZ0-1 HI NECK	Orthopaedics	EA
4	254505411-12	10603295480693	ATTUNE RP PS TRL SZ 9 6MM	Orthopaedics	EA
5	L20416-13	10603295258186	BROACH CORAIL AMT 16	Orthopaedics	EA

Vendor Master

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
#################
# Vendor master #
#################

#####
## This is big data set (<5GB), uncomments and run only once ##
#####
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# show(first(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]],5), allcols=true)

# create Vendor Master from GUDID dataset
# show(first(dfGUDIDdevice,5), allcols=true)
# names(dfGUDIDdevice)
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor = unique(dfGUDIDdevice[:,[:companyName]]) # 7574 unique vendors

# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor is a good dataset, have 216k rows for 7574 unique vendors

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfVendor)
# empty!(dfVendor)
# Base.summarysize(dfVendor)

Locations

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
#### Location Master

data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
dfLocation = DataFrame(data, vec(header))
first(dfLocation[:,Not(:zips)],5)

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfLocation)
# empty!(dfLocation)
# Base.summarysize(dfLocation)

5 rows × 16 columns

	city	city_ascii	state_id	state_name	county_fips	county_name	lat	lng	population	density	source	military	incorporated	timezone	ranking	id
	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any
1	New York	New York	NY	New York	36061	New York	40.6943	-73.9249	18713220	10715	polygon	FALSE	TRUE	America/New_York	1	1840034016
2	Los Angeles	Los Angeles	CA	California	6037	Los Angeles	34.1139	-118.407	12750807	3276	polygon	FALSE	TRUE	America/Los_Angeles	1	1840020491
3	Chicago	Chicago	IL	Illinois	17031	Cook	41.8373	-87.6862	8604203	4574	polygon	FALSE	TRUE	America/Chicago	1	1840000494
4	Miami	Miami	FL	Florida	12086	Miami-Dade	25.7839	-80.2102	6445545	5019	polygon	FALSE	TRUE	America/New_York	1	1840015149
5	Dallas	Dallas	TX	Texas	48113	Dallas	32.7936	-96.7662	5743938	1526	polygon	FALSE	TRUE	America/Chicago	1	1840019440

ORG

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
dfOrgMaster = DataFrame(
    ORG=repeat(["Galp Inc"], inner=8),
    ENTITY=repeat(["HeadOffice"], inner=8),
    GROUP=repeat(["Operations"], inner=8),
    DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
    UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
first(dfOrgMaster,5)

5 rows × 5 columns

	ORG	ENTITY	GROUP	DEPARTMENT	UNIT
	String	String	String	String	String
1	Galp Inc	HeadOffice	Operations	Procurement	Sourcing
2	Galp Inc	HeadOffice	Operations	Procurement	Sourcing
3	Galp Inc	HeadOffice	Operations	Procurement	Maintenance
4	Galp Inc	HeadOffice	Operations	Procurement	Support
5	Galp Inc	HeadOffice	Operations	Procurement	Services

now since we created Supply chain attribute, chartfields / dimensions

item master
vendor master
location master
org Hierarchy

using above chartfields, let's create following Supply Chain Transactions

MSR - Material Service request
PurchaseOrder
Voucher
Invoice
Receipt
Shipment
Sales, Revenue
Travel, Expense, TimeCard
Accounting Lines

MSR - Material Service request

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed

# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# df GUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
# dfOrgMaster = DataFrame(
#    ENTITY=repeat(["HeadOffice"], inner=8),
#    GROUP=repeat(["Operations"], inner=8),
#    DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
#    UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])

# dfMSR = DataFrame(
#    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
#    MSR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
#    FROM_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
#    TO_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
#    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
#    QTY = rand(dfOrgMaster.UNIT, sampleSize));
# first(dfMSR, 5)

PO - Purchase Order

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed

# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
    ENTITY=repeat(["HeadOffice"], inner=8),
    GROUP=repeat(["Operations"], inner=8),
    DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
    UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);

# dfPO = DataFrame(
#    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
#    PO_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
#    VENDOR=rand(unique(dfVendor.companyName), sampleSize),
#    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
#    QTY = rand(1:150, sampleSize),
#    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
#    );
# show(first(dfPO, 5),allcols=true)

8 rows × 4 columns

	ENTITY	GROUP	DEPARTMENT	UNIT
	String	String	String	String
1	HeadOffice	Operations	Procurement	Sourcing
2	HeadOffice	Operations	Procurement	Sourcing
3	HeadOffice	Operations	Procurement	Maintenance
4	HeadOffice	Operations	Procurement	Support
5	HeadOffice	Operations	Procurement	Services
6	HeadOffice	Operations	HR	Helpdesk
7	HeadOffice	Operations	HR	ServiceCall
8	HeadOffice	Operations	MFG	IT

Invoice - Voucher Invoice

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions

sampleSize = 1000 # number of rows, scale as needed

# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
    ENTITY=repeat(["HeadOffice"], inner=8),
    GROUP=repeat(["Operations"], inner=8),
    DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
    UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);

# dfVCHR = DataFrame(
#    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
#    VCHR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
#    STATUS=rand(["Closed","Paid","Open","Cancelled","Exception"], sampleSize),
#    VENDOR_INVOICE_NUM = rand(10001:9999999, sampleSize),
#    VENDOR=rand(unique(dfVendor.companyName), sampleSize),
#    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
#    QTY = rand(1:150, sampleSize),
#    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
#    );
# show(first(dfVCHR, 5),allcols=true)

8 rows × 4 columns

	ENTITY	GROUP	DEPARTMENT	UNIT
	String	String	String	String
1	HeadOffice	Operations	Procurement	Sourcing
2	HeadOffice	Operations	Procurement	Sourcing
3	HeadOffice	Operations	Procurement	Maintenance
4	HeadOffice	Operations	Procurement	Support
5	HeadOffice	Operations	Procurement	Services
6	HeadOffice	Operations	HR	Helpdesk
7	HeadOffice	Operations	HR	ServiceCall
8	HeadOffice	Operations	MFG	IT

Sales - Revenue Register

using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed

# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
    ENTITY=repeat(["HeadOffice"], inner=8),
    GROUP=repeat(["Operations"], inner=8),
    DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
    UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);

# dfREVENUE = DataFrame(
#    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
#    SALES_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
#    STATUS=rand(["Sold","Pending","Hold","Cancelled","Exception"], sampleSize),
#    SALES_RECEIPT_NUM = rand(10001:9999999, sampleSize),
#    CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
#    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
#    QTY = rand(1:150, sampleSize),
#    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
#    );
# show(first(dfREVENUE, 5),allcols=true)

8 rows × 4 columns

	ENTITY	GROUP	DEPARTMENT	UNIT
	String	String	String	String
1	HeadOffice	Operations	Procurement	Sourcing
2	HeadOffice	Operations	Procurement	Sourcing
3	HeadOffice	Operations	Procurement	Maintenance
4	HeadOffice	Operations	Procurement	Support
5	HeadOffice	Operations	Procurement	Services
6	HeadOffice	Operations	HR	Helpdesk
7	HeadOffice	Operations	HR	ServiceCall
8	HeadOffice	Operations	MFG	IT

Shipment - Receipt

# using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
# sampleSize = 1000 # number of rows, scale as needed

# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
# dfOrgMaster = DataFrame(
    # ENTITY=repeat(["HeadOffice"], inner=8),
    # GROUP=repeat(["Operations"], inner=8),
    # DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
    # UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])

# dfSHIPRECEIPT = DataFrame(
   # UNIT = rand(dfOrgMaster.UNIT, sampleSize),
   # SHIP_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
   # STATUS=rand(["Shipped","Returned","In process","Cancelled","Exception"], sampleSize),
   # SHIPMENT_NUM = rand(10001:9999999, sampleSize),
   # CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
   # GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
   # QTY = rand(1:150, sampleSize),
   # UNIT_PRICE = rand(Normal(100, 2), sampleSize)
   # );
# show(first(dfSHIPRECEIPT, 5),allcols=true)

FINANCE Data model

Chart of accounts (organized hierarchy of account groups in tree form), Location/Department or Product based hierarchy allows businesses to group and report organization activities based on business processes.

These hierarchical grouping help capture monetary and statistical values of organization in finance statements.

To create Finance Data model and Ledger Cash-flow or Balance Sheet like statements, We need associated dimensions (chartfields like chart of accounts).

Accounts chartfield

using Pkg, DataFrames, CategoricalArrays, PooledArrays, Dates

# here CLASSIFICATION column vector stores 3500 distinct values in an array
CLASSIFICATION=repeat(["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=500)

cl = categorical(CLASSIFICATION)
levels(cl)

# using PooledArrays
pl = categorical(CLASSIFICATION)
levels(pl)

# show values in tabular format
# run one command at a time
df = DataFrame(Dict("Descr" => "CLASSIFICATION...ARR...", "Value" => size(CLASSIFICATION)[1]))
push!(df,("CAT...ARR...",size(cl)[1]))
push!(df,("CAT...ARR..COMPRESS.",size(compress(cl))[1]))
push!(df,("POOL...ARR...",size(pl)[1]))
push!(df,("POOL...ARR..COMPRESS.",size(compress(pl))[1]))
push!(df,("CAT...LEVELs...",size(levels(cl))[1]))
push!(df,("POOL...LEVELs...",size(levels(pl))[1]))
push!(df,("CLASSIFICATION...MEMSIZE", Base.summarysize(CLASSIFICATION)))
push!(df,("CAT...ARR...MEMSIZE", Base.summarysize(cl)))
push!(df,("POOL...ARR...MEMSIZE", Base.summarysize(pl)))
push!(df,("CAT...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(cl))))
push!(df,("POOL...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(pl))))

first(df,5)

accountsDF = DataFrame(
    ENTITY = "Galp Retail Inc.",
    AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
    ID = 11000:1000:45000,
    CLASSIFICATION=repeat([
        "OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=5),
    CATEGORY=[
        "Travel","Payroll","non-Payroll","Allowance","Cash",
        "Facility","Supply","Services","Investment","Misc.",
        "Depreciation","Gain","Service","Retired","Fault.",
        "Receipt","Accrual","Return","Credit","ROI",
        "Cash","Funds","Invest","Transfer","Roll-over",
        "FTE","Members","Non_Members","Temp","Contractors",
        "Sales","Merchant","Service","Consulting","Subscriptions"],
    STATUS="A",
    DESCR=repeat([
    "operating expenses","non-operating expenses","ASSETS","liability","net-worth","stats","revenue"], inner=5),
    ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5))
accountsDF[collect(1:5),:]

5 rows × 8 columns

	ENTITY	AS_OF_DATE	ID	CLASSIFICATION	CATEGORY	STATUS	DESCR	ACCOUNT_TYPE
	String	Date	Int64	String	String	String	String	String
1	Galp Retail Inc.	1900-01-01	11000	OPERATING_EXPENSES	Travel	A	operating expenses	E
2	Galp Retail Inc.	1900-01-01	12000	OPERATING_EXPENSES	Payroll	A	operating expenses	E
3	Galp Retail Inc.	1900-01-01	13000	OPERATING_EXPENSES	non-Payroll	A	operating expenses	E
4	Galp Retail Inc.	1900-01-01	14000	OPERATING_EXPENSES	Allowance	A	operating expenses	E
5	Galp Retail Inc.	1900-01-01	15000	OPERATING_EXPENSES	Cash	A	operating expenses	E

Department chartfield

using DataFrames, Dates
## create Accounts chartfield
# DEPARTMENT Chartfield
deptDF = DataFrame(
    AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
    ID = 1100:100:1500,
    CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
    CATEGORY=["sales","human_resource","IT_Staff","business","others"],
    STATUS="A",
    DESCR=[
    "Sales & Marketing","Human Resource","Information Technology","Business leaders","other temp"
        ],
    DEPT_TYPE=["S","H","I","B","O"]);
deptDF[collect(1:5),:]

5 rows × 7 columns

	AS_OF_DATE	ID	CLASSIFICATION	CATEGORY	STATUS	DESCR	DEPT_TYPE
	Date	Int64	String	String	String	String	String
1	2000-01-01	1100	SALES	sales	A	Sales & Marketing	S
2	2000-01-01	1200	HR	human_resource	A	Human Resource	H
3	2000-01-01	1300	IT	IT_Staff	A	Information Technology	I
4	2000-01-01	1400	BUSINESS	business	A	Business leaders	B
5	2000-01-01	1500	OTHERS	others	A	other temp	O

Location chartfield

using DataFrames, Dates
locationDF = DataFrame(
    AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
    ID = 11:1:22,
    CLASSIFICATION=repeat([
        "Galp Region A","Galp Region B", "Galp Region C"], inner=4),
    CATEGORY=repeat([
        "Galp Region A","Galp Region B", "Galp Region C"], inner=4),
    STATUS="A",
    DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
    LOC_TYPE="Physical");
locationDF[:,:]

12 rows × 7 columns

	AS_OF_DATE	ID	CLASSIFICATION	CATEGORY	STATUS	DESCR	LOC_TYPE
	Date	Int64	String	String	String	String	String
1	2000-01-01	11	Galp Region A	Galp Region A	A	Boston	Physical
2	2000-01-01	12	Galp Region A	Galp Region A	A	New York	Physical
3	2000-01-01	13	Galp Region A	Galp Region A	A	Philadelphia	Physical
4	2000-01-01	14	Galp Region A	Galp Region A	A	Cleveland	Physical
5	2000-01-01	15	Galp Region B	Galp Region B	A	Richmond	Physical
6	2000-01-01	16	Galp Region B	Galp Region B	A	Atlanta	Physical
7	2000-01-01	17	Galp Region B	Galp Region B	A	Chicago	Physical
8	2000-01-01	18	Galp Region B	Galp Region B	A	St. Louis	Physical
9	2000-01-01	19	Galp Region C	Galp Region C	A	Minneapolis	Physical
10	2000-01-01	20	Galp Region C	Galp Region C	A	Kansas City	Physical
11	2000-01-01	21	Galp Region C	Galp Region C	A	Dallas	Physical
12	2000-01-01	22	Galp Region C	Galp Region C	A	San Francisco	Physical

Ledger RECORD

## pu
using DataFrames, Dates
accountsDF = DataFrame(
    ENTITY = "Galp Retail Inc.",
    AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
    ID = 11000:1000:45000,
    CLASSIFICATION=repeat([
        "OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=5),
    CATEGORY=[
        "Travel","Payroll","non-Payroll","Allowance","Cash",
        "Facility","Supply","Services","Investment","Misc.",
        "Depreciation","Gain","Service","Retired","Fault.",
        "Receipt","Accrual","Return","Credit","ROI",
        "Cash","Funds","Invest","Transfer","Roll-over",
        "FTE","Members","Non_Members","Temp","Contractors",
        "Sales","Merchant","Service","Consulting","Subscriptions"],
    STATUS="A",
    DESCR=repeat([
    "operating expenses","non-operating expenses","ASSETS","liability","net-worth","stats","revenue"], inner=5),
    ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5))
# DEPARTMENT Chartfield
deptDF = DataFrame(
    AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
    ID = 1100:100:1500,
    CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
    CATEGORY=["sales","human_resource","IT_Staff","business","others"],
    STATUS="A",
    DESCR=[
    "Sales & Marketing","Human Resource","Information Technology","Business leaders","other temp"
        ],
    DEPT_TYPE=["S","H","I","B","O"]);
size(deptDF),deptDF[collect(1:5),:]

locationDF = DataFrame(
    AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
    ID = 11:1:22,
    CLASSIFICATION=repeat([
        "Galp Region A","Galp Region B", "Galp Region C"], inner=4),
    CATEGORY=repeat([
        "Galp Region A","Galp Region B", "Galp Region C"], inner=4),
    STATUS="A",
    DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
    LOC_TYPE="Physical");
locationDF[:,:]

# creating Ledger
ledgerDF = DataFrame(
            LEDGER = String[], FISCAL_YEAR = Int[], PERIOD = Int[], ORGID = String[],
            OPER_UNIT = String[], ACCOUNT = Int[], DEPT = Int[], LOCATION = Int[],
            POSTED_TOTAL = Float64[]
            );

# create 2020 Period 1-12 Actuals Ledger
l = "Actuals";
fy = 2020;
for p = 1:12
    for i = 1:10^5
        push!(ledgerDF, (l, fy, p, "Galp Inc.", rand(locationDF.CATEGORY),
            rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
    end
end

# create 2021 Period 1-4 Actuals Ledger
l = "Actuals";
fy = 2021;
for p = 1:4
    for i = 1:10^5
        push!(ledgerDF, (l, fy, p, "Galp Inc.", rand(locationDF.CATEGORY),
            rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
    end
end

# create 2021 Period 1-4 Budget Ledger
l = "Budget";
fy = 2021;
for p = 1:12
    for i = 1:10^5
        push!(ledgerDF, (l, fy, p, "Galp Inc.", rand(locationDF.CATEGORY),
            rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
    end
end

# here is ~3 million rows ledger dataframe
first(ledgerDF,5)

5 rows × 9 columns

	LEDGER	FISCAL_YEAR	PERIOD	ORGID	OPER_UNIT	ACCOUNT	DEPT	LOCATION	POSTED_TOTAL
	String	Int64	Int64	String	String	Int64	Int64	Int64	Float64
1	Actuals	2020	1	Galp Inc.	Galp Region B	26000	1100	19	3.23507e7
2	Actuals	2020	1	Galp Inc.	Galp Region B	22000	1400	14	3.8979e7
3	Actuals	2020	1	Galp Inc.	Galp Region B	11000	1300	13	8.63023e7
4	Actuals	2020	1	Galp Inc.	Galp Region C	43000	1300	15	7.56502e7
5	Actuals	2020	1	Galp Inc.	Galp Region A	25000	1100	16	7.85415e7