CSV File Validation In PHP (Part III)

This is the third and last part of the explanation of how to implement CSV file validation in PHP, using the Strategy Pattern.

If you remember, in the first post we started by implementing simple validations for each column, and in the second post we wrote the main class that takes care of processing the whole CSV file.

Now it’s time to implement the missing piece that joins both parts. Basically, we need a class that acts as the interface between our analyzer and the validators, applying a different validation implementation depending on the CSV field.

In order to do that, I turn the field’s name into the name of the class that takes care of validating that field. For example, if the field is called “short_description”, then its validator class should be ShortDescription (UpperCamelCase). To separate the validators by phase, I prepend the phase name followed by an underscore, so the previous example eventually becomes Lexical_ShortDescription and Semantic_ShortDescription.
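As a minimal sketch of that naming convention on its own, the conversion can be tried in isolation. The function name validatorClassName is hypothetical and only used here for illustration; it is not part of the validator's code.

```php
<?php
// Hypothetical helper (not part of the original classes): builds the
// validator class name from the analysis phase and the CSV field name.
function validatorClassName($phase, $field)
{
    // "short_description" -> "short description" -> "Short Description" -> "ShortDescription"
    $camel = str_replace(' ', '', ucwords(str_replace('_', ' ', $field)));

    // Prepend the phase, capitalised, followed by an underscore
    return ucfirst($phase) . '_' . $camel;
}

echo validatorClassName('lexical', 'short_description'), "\n";  // Lexical_ShortDescription
echo validatorClassName('semantic', 'sku'), "\n";               // Semantic_Sku
```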

Then, all I have to do is instantiate the class, call its “validate” method with the given input, and return the result.

This is what it looks like:

class ValidatorContext {

    private $strategy;
    private $warnings;

    /**
     * Create the instance of the validator by building the class name with the UpperCamelCase format
     * Generate a warning if the validator wasn't found, and create a generic validator instance
     *
     * @param $type string type of analysis
     * @param $strategy string validator strategy name (the CSV field name)
     * @param $profile string profile of the task (import or update)
     * @param $optData array optional data passed to the semantic validators
     * @return void
     */
    function __construct($type, $strategy, $profile, $optData = null) {
        $this->warnings = array();
        //Build the class name from the strategy, e.g. lexical + short_description => Lexical_ShortDescription
        $validator = ucfirst($type) . '_' . str_replace(' ', '', ucwords(str_replace('_', ' ', $strategy)));
        if(class_exists($validator)) {
            $this->strategy = new $validator($profile, $optData);
        } else {
            $this->strategy = new Generic_Validator();
            $this->warnings[] = sprintf('Strategy %s (%s) is not implemented for the %s analyzer ' . "\n", $strategy, $validator, $type);
        }
    }

    /**
     * Remove elements from memory
     * @return void
     */
    function __destruct() {
        unset($this->strategy);
    }

    /**
     * getTokens
     * Return the tokens retrieved from the validator
     * @return array
     */
    public function getTokens() {
        return $this->strategy->getTokens();
    }

    /**
     * getErrors
     * Return the errors found during the validation
     * @return array
     */
    public function getErrors() {
        return $this->strategy->getErrors();
    }

    /**
     * getWarnings
     * Return the warnings found during the validation
     * @return array
     */
    public function getWarnings() {
        return $this->warnings;
    }

    /**
     * getErrorMsg
     * Return the error description to help the user to solve the problem
     * @param $input string|array
     * @return string
     */
    public function getErrorMsg($input) {
        return $this->strategy->getErrorMsg($input);
    }

    /**
     * validate
     * @access public
     * @param $input mixed input data of the csv
     * @return bool whether this data is valid or not for current validator
     */
    public function validate(&$input) {
        //Uncomment the line below to debug
        //echo get_class($this->strategy) . ": " . $input . "\n";
        return $this->strategy->validate($input);
    }

}

And basically, that’s about it. You might throw an exception if no validator is found for a certain field, but in my case it’s enough to just bypass the validation, so the Generic_Validator class always returns true.
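The Generic_Validator class is not shown in this post, so the following is only a guess at its shape. It assumes the fallback needs nothing more than the four methods ValidatorContext calls on a strategy, and that returning empty tokens and errors is acceptable.

```php
<?php
// Assumed shape of the fallback validator: it exposes the same methods
// ValidatorContext calls on a strategy and accepts every input.
class Generic_Validator
{
    // Any value is considered valid, so the field is effectively bypassed
    public function validate(&$input)
    {
        return true;
    }

    // Assumption: a bypassed field produces no tokens
    public function getTokens()
    {
        return array();
    }

    // Nothing can fail, so there are never errors to report
    public function getErrors()
    {
        return array();
    }

    public function getErrorMsg($input)
    {
        return '';
    }
}

$validator = new Generic_Validator();
$value = 'anything goes';
var_dump($validator->validate($value)); // bool(true)
```

If you would rather fail loudly, the else branch of the constructor is the place to throw an exception instead of creating this class.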

I hope this helps you to implement your own CSV validation in PHP. If you want to download the source code of the whole application, I’ll put up the link, with some working examples of usage included, during this week.

Updated on 14/04/2013

I’ve created a git repository with the source code of the CSV validator. I’ve included an example of usage with some demo files as well. All you have to do is clone the repository and execute “example.php” from php-cli.

You can get the source code of the validator from here.

CSV File Validation In PHP (Part II)

If you remember my previous post regarding CSV validation, I explained how to implement a different validator for each field and phase, using the Strategy Design Pattern. For instance, in the lexical phase, I was checking the correctness of the SKU field regarding the length of the string and the absence of forbidden characters. In the semantic phase, depending on whether it’s an update or an import task, I was checking for the existence or absence of the provided list of SKUs in the database.

One of the big advantages of this implementation is that it’s very easy to write unit tests for each validator, as they are almost totally independent from each other. I said almost because the return values of the lexical validators are the input of the semantic ones, so there is some kind of dependency there. For instance, the “store” field used to be a string containing the code of the store, but eventually we allowed several comma-separated store codes in the field. This required changes not only in the regular expressions of the lexical validator, but also in any semantic validator that depended on the store tokens, as they would be expecting a string as input but would get an array instead.
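To make the “store” example more concrete, here is a rough sketch of what such a lexical validator could look like. The class body and the allowed character set (lowercase letters, digits and underscores) are assumptions made for illustration, not the real validator.

```php
<?php
// Hypothetical lexical validator for the "store" field.
// Assumption: a store code contains only lowercase letters, digits and underscores.
class Lexical_Store
{
    private $tokens = array();

    public function validate(&$input)
    {
        // Split the comma-separated list and trim every code
        $codes = array_map('trim', explode(',', $input));

        foreach ($codes as $code) {
            // An empty or malformed code invalidates the whole field
            if (!preg_match('/^[a-z0-9_]+$/', $code)) {
                return false;
            }
        }

        // The tokens are now an array, which is what the semantic phase receives
        $this->tokens = $codes;
        return true;
    }

    public function getTokens()
    {
        return $this->tokens;
    }
}

$validator = new Lexical_Store();

$input = 'en, fr,de';
var_dump($validator->validate($input));            // bool(true)
echo implode('|', $validator->getTokens()), "\n";  // en|fr|de

$bad = 'en,,fr';
var_dump($validator->validate($bad));              // bool(false)
```

Because the class has no dependencies, checks like the ones at the bottom translate directly into unit tests.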

I could keep on talking about the pros and cons of this approach, but that’s out of the scope of this post. Our goal is to implement a CSV file validation, so let’s focus on the implementation of the main class, which will take care of reading the CSV file and executing the proper validator for each column.

For the sake of readability on the blog, I’ve un-indented the methods’ code by one level. This is what the lexical phase looks like:

class Analyzer {

    const LEXICAL = 'lexical';
    const SEMANTIC = 'semantic';

    /**
     * Allowed profiles
     */
    private static $_PROFILES = array('import', 'update');

    private $_validators;
    private $_columnIndexes;
    private $_requiredFields;
    private $_profile;
    private $_errors;

function __construct(array $required, $profile) {
    $this->_validators = array();
    $this->_columnIndexes = array();
    $this->_errors = array();
    $this->_requiredFields = $required;
    $this->_profile = $profile;
    if(!in_array($profile, self::$_PROFILES)) {
        die('Unknown profile specified.');
    }
}
/**
 * Perform the lexical analysis over the CSV
 * Iterate over the file and exit if an error is found
 *
 * @param $file string full path of the csv file
 * @return bool whether the file is valid (from a lexical point of view) or not
 */
protected function lexical($file) {
    if (!file_exists($file))  {
        return false;
    } else {
        $delimiter = self::getDelimiter($file);
    }

    $handle = fopen($file, 'r');

    //Parse the first row, instantiate all the validators
    $valid = $this->parseFirstRow(fgetcsv($handle, 0, $delimiter));
    //Number of columns specified on the header
    $num_columns = sizeOf($this->_columnIndexes);
    //line number count
    $i = 1;

    while(($data = fgetcsv($handle, 0, $delimiter)) !== FALSE && $valid) {
        $errors = array();

        //For each column
        foreach ($data as $key => $value) {

            //Skip all columns without header
            if($key >= $num_columns) {
                break;
            }

            $value = trim($value);

            //Validate
            $errors[$this->_columnIndexes[$key]] = $this->_validators[$key]->validate($value);

            $valid = $valid && $errors[$this->_columnIndexes[$key]];
        }

        //If any error was found, exit
        if(!$valid) {
            $filtered_errors = array_keys($errors, false);
            if(count($filtered_errors) > 0) {
                //Store the errors found on the current line
                $this->_errors[$i] = $filtered_errors;
            }
            break;
        }
        $i++;
    }
    fclose($handle);

    return $valid;
}
/**
 * parseFirstRow
 * Check that the column names aren't duplicated
 * Ensure all required fields are present
 * Create the instances of each validator.
 *
 * @param array $data
 * @access protected
 * @return bool
 */
protected function parseFirstRow(array $data) {
    $valid = true;
    //Clean the data
    $data = array_filter(array_map('trim', array_map('strtolower', $data)));

    //Ensure that there aren't duplicated columns
    $dupes = array_diff_key($data, array_unique($data));
    if(!empty($dupes)) {
        $this->_errors[] = sprintf('The following columns are duplicated on the CSV: "%s".', implode('", "', $dupes));
        $valid = false;
    }

    //Ensure all required columns are present
    if($valid &&
        //The number of columns is lower than the required fields, we don't need to keep checking, some columns are missing.
        (count($data) < count($this->_requiredFields) ||
        //The number of optional fields must match with the number of fields that are not required, otherwise something is missing.
        count(array_diff($data, $this->_requiredFields)) !== (count($data) - count($this->_requiredFields)) ||
        //If the operation is an import, either categories or category_ids must be present
        ($this->_profile == 'import' && !(in_array('categories', $data) || in_array('category_ids', $data))))) {

            $required = implode('", "', array_diff($this->_requiredFields, $data));
            if($this->_profile == 'import' && !in_array('category_ids', $data) && !in_array('categories', $data)) {
                if($required) {
                    $required .= '" and "categories" or "category_ids';
                } else {
                    $required = 'categories" or "category_ids';
                }
            }
            $this->_errors[] = sprintf('The following columns are missing on the CSV: "%s".', $required);
            $valid = false;
        }

    if($valid) {
        //Instantiate all the lexical validators
        foreach ($data as $key => $value) {
            $this->_validators[$key] = new ValidatorContext(Analyzer::LEXICAL, $value, $this->_profile);
            $this->_columnIndexes[$key] = $value;
        }
    }
    return $valid;
}
/**
 * getDelimiter
 * Try to detect the delimiter character on a CSV file, by reading the first row.
 *
 * @param mixed $file
 * @access public
 * @return string
 */
public static function getDelimiter($file) {
    $delimiter = false;
    $line = '';
    if($f = fopen($file, 'r')) {
        $line = fgets($f); // read until first newline
        fclose($f);
    }
    if(strpos($line, ';') !== FALSE && strpos($line, ',') === FALSE) {
        $delimiter = ';';
    } else if(strpos($line, ',') !== FALSE && strpos($line, ';') === FALSE) {
        $delimiter = ',';
    } else {
        die('Unable to find the CSV delimiter character. Make sure you use "," or ";" as delimiter and try again.');
    }
    return $delimiter;
}
}

I hope that the code is pretty much self-explanatory, but anyway I’ll summarize the process.

It all begins with the instantiation of the Analyzer class, where you provide the profile, and the required fields for this task.
Then, the lexical method is called with the full path of the CSV file to be processed. If the file exists, we first try to detect the delimiter used, which should be a comma or a semicolon.

Afterwards, the first row is parsed by cleaning the data and checking the correctness of the fields, ensuring that there are no duplicated fields and that all the required fields are present. If everything went as expected, we call the ValidatorContext class providing the field name, and we’ll get the proper validator instantiated.
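The header checks are easier to follow on a small sample. This sketch runs the same cleaning and duplicate detection calls as parseFirstRow on a made-up header row; the sample columns and the list of required fields are invented for the example.

```php
<?php
// Header row as fgetcsv() might return it (made-up sample data)
$header = array(' SKU', 'name', 'Price ', 'sku', '');

// Same cleaning steps as parseFirstRow: lowercase, trim, drop empty cells
$data = array_filter(array_map('trim', array_map('strtolower', $header)));

// array_unique() keeps the first occurrence of each value,
// so the keys it drops belong to the duplicated columns
$dupes = array_diff_key($data, array_unique($data));

// Hypothetical list of required fields
$required = array('sku', 'name', 'qty');
$missing = array_diff($required, $data);

echo 'Duplicated: ', implode(', ', $dupes), "\n";  // Duplicated: sku
echo 'Missing: ', implode(', ', $missing), "\n";   // Missing: qty
```

Note that array_filter() preserves the original keys, which is why the duplicate is reported under its own column index.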

Finally, we iterate over the CSV, ensuring that all the validations are passed, stopping the process in case of error, and saving the field and the line where the problem was found.

That’s all for now. In the next post I’ll show the last piece of the puzzle, which is the implementation of the ValidatorContext class, where we will be able to unify the whole process.