See also
Data Center Doctests
The term ‘data import’ actually understates the range of functions importers really have. As already stated, many importers do not only restore data once backed up by exporters or, in other words, take values from CSV files and write them one-on-one into the database. The data undergo a complex staged data processing algorithm. Therefore, we prefer calling them ‘batch processors’ instead of importers. The stages of the import process are as follows.
Stage 1: File Upload
Users with permission waeup.manageDataCenter are allowed to access the data center and also to use the upload page. On this page they can access an overview of all available batch processors. When clicking on a processor name, required, optional and non-schema fields show up in a modal window. Also a CSV file template, which can be filled and uploaded to avoid header errors, is provided in this window.
Many importer fields are of type ‘Choice’, which means only defined keywords (tokens) are allowed; see schema fields. An overview of all sources and vocabularies which feed the choices can also be accessed from the data center upload page and shows up in a modal window. Sources and vocabularies of the base package can be viewed here.
Data center managers can upload any kind of CSV file from their local computer. The uploader does not check the integrity of the content but the validity of its CSV encoding (see check_csv_charset). It also checks the filename extension and allows only a limited number of files in the data center.
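An encoding check of this kind might look roughly like the following sketch. The name check_csv_charset comes from the text above, but the body here is an illustrative assumption, not Kofa's actual implementation:

```python
import csv

def check_csv_charset(raw_bytes):
    """Return None if raw_bytes looks like a decodable CSV file,
    else an error message.  Illustrative sketch only; the real
    check_csv_charset in Kofa may differ."""
    try:
        text = raw_bytes.decode('utf-8')
    except UnicodeDecodeError as err:
        return 'Invalid encoding: %s' % err
    try:
        # Try to parse the first few rows as CSV.
        for i, row in enumerate(csv.reader(text.splitlines())):
            if i > 2:
                break
    except csv.Error as err:
        return 'Invalid CSV file: %s' % err
    return None
```

Note that only decodability and basic CSV syntax are checked here; as the text says, the integrity of the content itself is deliberately left to the later import stages.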
DatacenterUploadPage.max_files = 20
If the upload succeeded, the uploader sends an email to all import managers (users with role waeup.ImportManager) of the portal that a new file was uploaded.
The uploader changes the filename. An uploaded file foo.csv will be stored as foo_USERNAME.csv, where USERNAME is the user id of the currently logged-in user. Spaces in the filename are replaced by underscores. Pending data filenames remain unchanged (see below).
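The renaming rule can be sketched in a few lines. The helper name uploaded_filename is hypothetical; the real Kofa code may differ in detail:

```python
import os

def uploaded_filename(filename, username):
    """Derive the stored name for an uploaded file: insert the
    uploader's user id before the extension and replace spaces
    with underscores.  Sketch of the rule described above, not
    Kofa's actual implementation."""
    base, ext = os.path.splitext(filename)
    name = '%s_%s%s' % (base, username, ext)
    return name.replace(' ', '_')
```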
After file upload the data center manager can click the ‘Process data’ button to open the page where files can be selected for import (import step 1). After selecting a file the data center manager can preview the header and the first three records of the uploaded file (import step 2). If the preview fails or the header contains duplicate column titles, an error message is raised. The user cannot proceed but is requested to replace the uploaded file. If the preview succeeds the user is able to proceed to the next step (import step 3) by selecting the appropriate processor and an import mode. In create mode, new objects are added to the database; in update mode, existing objects are modified; and in remove mode, they are deleted.
Stage 2: File Header Validation
Import step 3 is the stage where the file content is assessed for the first time and checked whether the column titles correspond with the fields of the processor chosen. The page shows the header and the first record of the uploaded file. The page allows to change column titles or to ignore entire columns during import. It might have happened that one or more column titles are misspelled or that the person who created the file ignored the case-sensitivity of field names. Then the data import manager can easily fix this by selecting the correct title and clicking the ‘Set headerfields’ button. Setting the column titles is temporary; it does not modify the uploaded file. Consequently, it does not make sense to set new column titles if the file is not imported afterwards.
The page also calls the checkHeaders method of the batch processor, which checks for required fields. If a required column title is missing, a warning message is raised and the user can’t proceed to the next step (import step 4).
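A checkHeaders-style validation can be sketched as follows. Function name, signature, and messages are assumptions for illustration, not Kofa's actual API:

```python
def check_headers(headers, required):
    """Return a list of warnings for a CSV header row: missing
    required columns and duplicate column titles.  Illustrative
    sketch of what a checkHeaders-style method does."""
    warnings = []
    for field in required:
        if field not in headers:
            warnings.append('Missing required column: %s' % field)
    seen = set()
    for title in headers:
        if title in seen:
            warnings.append('Duplicate column title: %s' % title)
        seen.add(title)
    return warnings
```

An empty result means the user may proceed; any warning blocks the next import step, mirroring the behaviour described above.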
Important
Data center managers, who are only charged with uploading files but not with the import of files, are requested to proceed up to import step 3 and verify that the data format meets all the import criteria and requirements of the batch processor.
Stage 3: Data Validation and Import
Import step 4 is the actual data import. The import is started by clicking the ‘Perform import’ button. This action requires the waeup.importData permission. If data managers don’t have this permission, they will be redirected to the login page.
Kofa does not validate the data in advance. It tries to import the data row-by-row while reading the CSV file. The reason is that import files very often contain thousands or even tens of thousands of records. It is not feasible for data managers to edit import files until they are error-free. Very often such an error is not really a mistake made by the person who compiled the file. Example: The import file contains course results although the student has not yet registered the courses. Then the import of this single record has to wait, i.e. it has to be marked pending, until the student has added the course ticket. Only then can it be edited by the batch processor.
The core import method is:
BatchProcessor.doImport()
In contrast to most other methods, doImport is not supposed to be customized, neither in custom packages nor in derived batch processor classes. Therefore, this is the only place where we do import data.
Before this method starts creating or updating persistent data, it prepares two more files in a temporary folder of the filesystem: (1) a file for pending data with file extension .pending and (2) a file for successfully processed data with file extension .finished. Then the method starts iterating over all rows of the CSV file. Each row is treated as follows: An empty row is skipped.
Empty strings or lists ([]) in the row are replaced by ignore markers. The BatchProcessor.checkConversion method validates and converts all values in the row. Conversion means the transformation of strings into Python objects. For instance, number expressions have to be transformed into integers, dates into datetime objects, phone number expressions into phone number objects, etc. The converter returns a dictionary with converted values or, if the validation of one of the elements fails, an appropriate warning message. If the conversion fails, a pending record is created and stored in the pending data file together with the warning message the converter has raised.
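The conversion step just described can be sketched like this. The names convert_row, IGNORE_MARKER, and the converters mapping are illustrative assumptions, not Kofa's actual API:

```python
IGNORE_MARKER = '<IGNORE>'

def convert_row(row, converters):
    """Convert one CSV row (a dict of strings) into Python objects.
    `converters` maps field names to callables that return a
    converted value or raise ValueError.  Returns (converted,
    errors); a non-empty errors list means the record is pending.
    Sketch of the checkConversion idea, not Kofa's real code."""
    converted, errors = {}, []
    for field, value in row.items():
        if value in ('', '[]'):
            # Empty strings or lists are replaced by ignore markers.
            converted[field] = IGNORE_MARKER
            continue
        try:
            converted[field] = converters.get(field, str)(value)
        except ValueError as err:
            errors.append('%s: %s' % (field, err))
    return converted, errors
```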
In create mode only:
The parent object must be found and a child object with the same object id must not exist. Otherwise the row is skipped, a corresponding warning message is raised and a record is stored in the pending data file.
The BatchProcessor.checkCreateRequirements method checks additional requirements the parent object must fulfill before a new subobject is being added. These requirements are not imposed by the data type but by the context of the object. For example, the course results of graduated students must not be changed by import, neither by creating nor updating nor removing course tickets.
Now doImport tries to add the new object with the data from the conversion dictionary. In some cases this may fail and a DuplicationError is raised. For example, a new payment ticket is created but the same payment for the same session has already been made. In this case the object id is unique, no other object with the same id exists, but making the ‘same’ payment twice does not make sense. The import is skipped and a record is stored in the pending data file.
In update mode only:
If the object can’t be found, the row is skipped, a ‘no such entry’ warning message is raised and a record is stored in the pending data file.
The BatchProcessor.checkUpdateRequirements method checks additional requirements the object must fulfill before being updated. These requirements are not imposed by the data type but by the context of the object. For example, post-graduate students have a different registration workflow. With this method we forbid certain workflow transitions or states.
Finally, doImport updates the existing object with the data from the conversion dictionary.
In remove mode only:
If the object can’t be found, the row is skipped, a ‘no such entry’ warning message is raised and a record is stored in the pending data file.
The BatchProcessor.checkRemoveRequirements method checks additional requirements the object must fulfill before being removed. These requirements are not imposed by the data type but by the context of the object. For example, the course results of graduated students must not be changed by import, neither by creating nor updating nor removing course tickets.
Finally, doImport removes the existing object.
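The per-row control flow of the three import modes can be summarized in a short sketch. All names here are hypothetical stand-ins for Kofa's internal machinery; only the dispatch logic described above is illustrated:

```python
def do_import_row(row, mode, find_object, create, update, remove, pending):
    """Dispatch one converted row according to the import mode.
    `find_object` looks up an existing object for the row; `create`,
    `update` and `remove` apply the change; `pending` records a
    skipped row with a warning message.  Returns True on success."""
    obj = find_object(row)
    if mode == 'create':
        if obj is not None:
            # An object with the same id already exists.
            pending(row, 'object already exists')
            return False
        create(row)
    elif mode in ('update', 'remove'):
        if obj is None:
            pending(row, 'no such entry')
            return False
        update(row) if mode == 'update' else remove(row)
    return True
```

The checkCreateRequirements / checkUpdateRequirements / checkRemoveRequirements hooks and the DuplicationError handling described above would slot into the corresponding branches; they are omitted here for brevity.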
Stage 4: Post-Processing
The data import is finalized by calling distProcessedFiles. This method moves the .pending and .finished files as well as the originally imported file from their temporary to their final location in the storage path of the filesystem, from where they can be accessed through the browser user interface.