Stages of Batch Processing — WAeUP.Kofa 1.8.2.dev0 Documentation (2024)

See also

Data Center Doctests

The term ‘data import’ actually understates the range of functions importers really have. As already stated, many importers do not only restore data once backed up by exporters or, in other words, take values from CSV files and write them one-to-one into the database. The data undergo a complex staged data processing algorithm. Therefore, we prefer calling them ‘batch processors’ instead of importers. The stages of the import process are as follows.

Stage 1: File Upload

Users with permission waeup.manageDataCenter are allowed to access the data center and also to use the upload page. On this page they can access an overview of all available batch processors. When clicking on a processor name, required, optional and non-schema fields show up in a modal window. This window also provides a CSV file template, which can be filled in and uploaded to avoid header errors.

Many importer fields are of type ‘Choice’, which means only defined keywords (tokens) are allowed, see schema fields. An overview of all sources and vocabularies, which feed the choices, can also be accessed from the datacenter upload page and shows up in a modal window. Sources and vocabularies of the base package can be viewed here.

Data center managers can upload any kind of CSV file from their local computer. The uploader does not check the integrity of the content but the validity of its CSV encoding (see check_csv_charset). It also checks the filename extension and allows only a limited number of files in the data center.

DatacenterUploadPage.max_files = 20
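The three upload checks described above can be sketched as follows. This is only an illustration, not Kofa's code: the function names, the messages, the assumption that only the .csv extension is accepted and the assumption that the expected encoding is UTF-8 are all hypothetical. Only the limit of 20 files is taken from the text.

```python
import os

MAX_FILES = 20  # mirrors DatacenterUploadPage.max_files


def first_bad_line(raw_lines):
    """Return the 1-based number of the first line that is not valid
    UTF-8 (assumed encoding), or None if every line decodes cleanly."""
    for number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return number
    return None


def upload_problem(filename, raw_lines, files_in_storage):
    """Return a reason for rejecting the upload, or None if acceptable.

    Hypothetical helper: the content itself is not inspected, only the
    extension, the number of stored files and the encoding.
    """
    if os.path.splitext(filename)[1].lower() != ".csv":
        return "only .csv files are accepted"
    if files_in_storage >= MAX_FILES:
        return "too many files in the data center"
    bad = first_bad_line(raw_lines)
    if bad is not None:
        return "encoding problem in line %d" % bad
    return None
```

Note that a file with nonsensical content still passes these checks, which matches the statement that the uploader does not check the integrity of the content.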

If the upload succeeded, the uploader sends an email to all import managers (users with role waeup.ImportManager) of the portal that a new file was uploaded.

The uploader changes the filename. An uploaded file foo.csv will be stored as foo_USERNAME.csv where USERNAME is the user id of the currently logged-in user. Spaces in the filename are replaced by underscores. Pending data filenames remain unchanged (see below).
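The renaming rule can be illustrated with a few lines of Python. The function name stored_filename is hypothetical, and the sketch does not cover the exception for pending data files.

```python
import os


def stored_filename(filename, user_id):
    """Return the name under which an uploaded file would be stored:
    spaces become underscores and the user id is appended to the
    base name, in front of the extension."""
    base, ext = os.path.splitext(filename.replace(" ", "_"))
    return "%s_%s%s" % (base, user_id, ext)
```

With this sketch, stored_filename("foo.csv", "bob") gives "foo_bob.csv".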

After file upload the data center manager can click the ‘Process data’ button to open the page where files can be selected for import (import step 1). After selecting a file the data center manager can preview the header and the first three records of the uploaded file (import step 2). If the preview fails or the header contains duplicate column titles, an error message is raised. The user cannot proceed but is requested to replace the uploaded file. If the preview succeeds the user is able to proceed to the next step (import step 3) by selecting the appropriate processor and an import mode. In create mode new objects are added to the database, in update mode existing objects are modified and in remove mode existing objects are deleted.
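A minimal sketch of the preview in import step 2, assuming the file content is already available as text. The function name preview and the error messages are hypothetical. The sketch reads the header, rejects duplicate column titles and returns at most the first three records.

```python
import csv
import io
from collections import Counter
from itertools import islice


def preview(csv_text, max_records=3):
    """Return (header, first records) or raise ValueError if the file
    is empty or the header contains duplicate column titles."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty")
    duplicates = sorted(t for t, n in Counter(header).items() if n > 1)
    if duplicates:
        raise ValueError("duplicate column titles: " + ", ".join(duplicates))
    # islice stops after max_records rows without reading the whole file.
    return header, list(islice(reader, max_records))
```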

Stage 2: File Header Validation

Import step 3 is the stage where the file content is assessed for the first time and checked whether the column titles correspond to the fields of the processor chosen. The page shows the header and the first record of the uploaded file. The page allows the user to change column titles or to ignore entire columns during import. It might have happened that one or more column titles are misspelled or that the person who created the file ignored the case-sensitivity of field names. Then the data import manager can easily fix this by selecting the correct title and clicking the ‘Set headerfields’ button. Setting the column titles is temporary; it does not modify the uploaded file. Consequently, it does not make sense to set new column titles if the file is not imported afterwards.

The page also calls the checkHeaders method of the batch processor which checks for required fields. If a required column title is missing, a warning message is raised and the user can’t proceed to the next step (import step 4).
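The header handling of import step 3 can be sketched as two small functions. Both function names and the field names in the usage example are hypothetical; this is not the signature of the real checkHeaders method. The first function applies temporary title changes without touching the uploaded file, the second reports required fields that are missing from a header.

```python
def set_header_fields(header, changes):
    """Return a new header list with selected titles replaced.

    changes maps an original title to its new title, or to None to
    ignore that column during import. The input list is not modified.
    """
    return [changes.get(title, title) for title in header]


def missing_required(header, required_fields):
    """Return the required field names that do not occur in header.
    Field names are compared case-sensitively."""
    return [name for name in required_fields if name not in header]
```

For example, a header ["Student_ID", "level"] fails the check for the required field "student_id" because of the different case. After set_header_fields(header, {"Student_ID": "student_id"}) the check passes.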

Important

Data center managers who are only charged with uploading files but not with the import of files are requested to proceed up to import step 3 and verify that the data format meets all the import criteria and requirements of the batch processor.

Stage 3: Data Validation and Import

Import step 4 is the actual data import. The import is started by clicking the ‘Perform import’ button. This action requires the waeup.importData permission. If data managers don’t have this permission, they will be redirected to the login page.

Kofa does not validate the data in advance. It tries to import the data row-by-row while reading the CSV file. The reason is that import files very often contain thousands or even tens of thousands of records. It is not feasible for data managers to edit import files until they are error-free. Very often such an error is not really a mistake made by the person who compiled the file. Example: The import file contains course results although the student has not yet registered the courses. Then the import of this single record has to wait, i.e. it has to be marked pending, until the student has added the course ticket. Only then can it be edited by the batch processor.

The core import method is:

BatchProcessor.doImport()

In contrast to most other methods, doImport is not supposed to be customized, neither in custom packages nor in derived batch processor classes. Therefore, this is the only place where we do import data.

Before this method starts creating or updating persistent data, it prepares two more files in a temporary folder of the filesystem: (1) a file for pending data with file extension .pending and (2) a file for successfully processed data with file extension .finished. Then the method starts iterating over all rows of the CSV file. Each row is treated as follows:

  1. An empty row is skipped.

  2. Empty strings or lists ([]) in the row are replaced by ignore markers.

  3. The BatchProcessor.checkConversion method validates and converts all values in the row. Conversion means the transformation of strings into Python objects. For instance, number expressions have to be transformed into integers, dates into datetime objects, phone number expressions into phone number objects, etc. The converter returns a dictionary with converted values or, if the validation of one of the elements fails, an appropriate warning message. If the conversion fails, a pending record is created and stored in the pending data file together with the warning message the converter has raised.

  4. In create mode only:

    The parent object must be found and a child object with the same object id must not exist. Otherwise the row is skipped, a corresponding warning message is raised and a record is stored in the pending data file.

    The BatchProcessor.checkCreateRequirements method checks additional requirements the parent object must fulfill before a new subobject is added. These requirements are not imposed by the data type but by the context of the object. For example, the course results of graduated students must not be changed by import, neither by creating nor by updating or removing course tickets.

    Now doImport tries to add the new object with the data from the conversion dictionary. In some cases this may fail and a DuplicationError is raised. For example, a new payment ticket is created but the same payment for the same session has already been made. In this case the object id is unique, no other object with the same id exists, but making the ‘same’ payment twice does not make sense. The import is skipped and a record is stored in the pending data file.

  5. In update mode only:

    If the object can’t be found, the row is skipped, a ‘no such entry’ warning message is raised and a record is stored in the pending data file.

    The BatchProcessor.checkUpdateRequirements method checks additional requirements the object must fulfill before being updated. These requirements are not imposed by the data type but by the context of the object. For example, post-graduate students have a different registration workflow. With this method we forbid certain workflow transitions or states.

    Finally, doImport updates the existing object with the data from the conversion dictionary.

  6. In remove mode only:

    If the object can’t be found, the row is skipped, a ‘no such entry’ warning message is raised and a record is stored in the pending data file.

    The BatchProcessor.checkRemoveRequirements method checks additional requirements the object must fulfill before being removed. These requirements are not imposed by the data type but by the context of the object. For example, the course results of graduated students must not be changed by import, neither by creating nor by updating or removing course tickets.

    Finally, doImport removes the existing object.
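The row treatment listed above can be sketched in a strongly simplified form. This is not the real doImport: the function do_import, the IGNORE marker object, the toy converter, the in-memory store dictionary and the column names student_id and level are all assumptions made for illustration. The context checks (checkCreateRequirements and its siblings) and the parent lookup are left out, and the two lists stand in for the .finished and .pending files.

```python
import csv
import io

IGNORE = object()  # stand-in for the ignore marker
MODES = ("create", "update", "remove")


class ConversionError(Exception):
    """Raised by the toy converter when a value cannot be converted."""


def convert(row):
    """Toy converter: turn the 'level' string into an integer."""
    converted = dict(row)
    value = row.get("level", IGNORE)
    if value is not IGNORE:
        try:
            converted["level"] = int(value)
        except ValueError:
            raise ConversionError("level: invalid integer")
    return converted


def do_import(csv_text, mode, store):
    """Process rows one by one; return (finished, pending) lists.

    store maps an object id to a dict of values and is modified in
    place. Each pending entry is a tuple (row, warning message).
    """
    if mode not in MODES:
        raise ValueError("unknown import mode: %r" % mode)
    finished, pending = [], []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        if not any(raw.values()):
            continue  # step 1: skip empty rows
        # step 2: replace empty strings and empty lists by the marker
        row = {k: (IGNORE if v in ("", "[]", None) else v)
               for k, v in raw.items()}
        try:
            values = convert(row)  # step 3: validate and convert
        except ConversionError as err:
            pending.append((raw, str(err)))
            continue
        data = {k: v for k, v in values.items() if v is not IGNORE}
        key = raw["student_id"]
        if mode == "create":
            if key in store:
                pending.append((raw, "object already exists"))
                continue
            store[key] = data
        elif key not in store:
            pending.append((raw, "no such entry"))
            continue
        elif mode == "update":
            store[key].update(data)
        else:  # remove mode
            del store[key]
        finished.append(raw)
    return finished, pending
```

The point of the design is visible even in this toy version: a failing row never aborts the run. It is set aside together with its warning message, so that the pending rows can be imported later once the missing precondition is met.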

Stage 4: Post-Processing

The data import is finalized by calling distProcessedFiles. This method moves the .pending and .finished files as well as the originally imported file from their temporary to their final location in the storage path of the filesystem from where they can be accessed through the browser user interface.
