CSV File Validation In PHP (Part II)

If you remember my previous post regarding CSV validation, I explained how to implement a different validator for each field, and phase, using the Strategy Design Pattern. For instance, in the lexical phase, I was checking for the correctness of the SKU field regarding the lenght of the string, and the absence of forbidden characters. In the semantic phase, depending on if it’s an update or an import task, I was checking for the existence or absence of the provided SKUs list in the database.

One of the big advantages of this implementation, is that it’s very easy to implement unitary tests for each validator, as they are almost totally independent from each other. I said almost because the return values of the lexical validators are the input of the semantic ones, so there is some kind of dependency there. For instance, the “store” field used to be a string containing the code of the store, but eventually, we allowed to put several comma-separated store codes on the field. This required changes not only in the regular expressions of the lexical validator, but also on any semantic validator that depended on the store tokens, as they’ll be expecting an input string, but instead, they’ll get an array.

I could keep on talking about the pros and cons of this approach, but it’s out of the scope of this post. Our goal is to implement a CSV file validation, so, let’s focus on the implementation of the main class which will take care or reading the CSV file, and execute the proper validator for each column.

For the shake of readability on the blog, I’ve un-indented one level the methods’ code. This is how the lexical phase looks like:

class Analyzer {

    const LEXICAL = 'lexical';
    const SEMANTIC = 'semantic';

    /**
     * Allowed profiles
     */
    private static $_PROFILES = array('import', 'update');

    private $_validators;
    private $_columnIndexes;
    private $_requiredFields;
    private $_profile;
    private $_errors;

function __construct(array $required, $profile) {
    $this->_validators = array();
    $this->_columnIndexes = array();
    $this->_errors = array();
    $this->_requiredFields = $required;
    $this->_profile = $profile;
    if(!in_array($profile, self::$_PROFILES)) {
        die('Unknown profile specified.');
    }
}
/*
 * Perform the lexical analysis over the CSV
 * Iterate over the file and exit if an error is found
 *
 * @param $file string full path of the csv file
 * @return bool whether the file is valid (from a lexical point of view) or not
 */
protected function lexical($file) {
    if (!file_exists($file))  {
        return false;
    } else {
        $delimiter = self::getDelimiter($file);
    }

    $handle = fopen($file, 'r');

    //Parse the first row, instantiate all the validators
    $valid = $this->parseFirstRow(fgetcsv($handle, 0, $delimiter));
    //Number of columns specified on the header
    $num_columns = sizeOf($this->_columnIndexes);
    //line number count
    $i = 1;

    while(($data = fgetcsv($handle, 0, $delimiter)) !== FALSE && $valid) {
        $errors = array();

        //For each column
        foreach ($data as $key => $value) {

            //Skip all columns without header
            if($key >= $num_columns) {
                break;
            }

            $value = trim($value);

            //Validate
            $errors[$this->_columnIndexes[$key]] = $this->_validators[$key]->validate($value);

            $valid = $valid && $errors[$this->_columnIndexes[$key]];
        }

        //If any error was found, exit
        if(!$valid) {
            $filtered_errors = array_keys($errors, false);
            if(count($filtered_errors) > 0) {
		//Store the errors founds on the current line
                $this->_errors[$i] = $filtered_errors;
            }
            break;
        }
        $i++;
    }
    fclose($handle);

    return $valid;
}
/**
 * parseFirstRow
 * Check that the column names aren't duplicated
 * Ensure all required fields are present
 * Create the instances of each validator.
 *
 * @param array $data
 * @access protected
 * @return bool
 */
protected function parseFirstRow(array $data) {
    $valid = true;
    //Clean the data
    $data = array_filter(array_map('trim', array_map('strtolower', $data)));

    //Ensure that there aren't duplicated columns
    $dupes = array_diff_key($data, array_unique($data));
    if(!empty($dupes)) {
        $this->_errors[] = sprintf('The following columns are duplicated on the CSV: "%s".', implode($dupes, '", "'));
        $valid = false;
    }

    //Ensure all required columns are present
    if($valid &&
        //The number of columns is lower than the required fields, we don't need to keep checking, some columns are missing.
        (count($data) < count($this->_requiredFields) ||
        //The number of optional fields must match with the number of fields that are not required, otherwise something is missing.
        count(array_diff($data, $this->_requiredFields)) !== (count($data) - count($this->_requiredFields)) ||
        //If the operation is an import, either categories or category_ids must be present
        ($this->_profile == 'import' && !(in_array('categories', $data) || in_array('category_ids', $data))))) {

            $required = implode(array_diff($this->_requiredFields, $data), '", "');
            if($this->_profile == 'import' && !in_array('category_ids', $data) && !in_array('categories', $data)) {
                if($required) {
                    $required .= '" and "categories" or "category_ids';
                } else {
                    $required = 'categories" or "category_ids';
                }
            }
            $this->_errors[] = sprintf('The following columns are missing on the CSV: "%s".', $required);
            $valid = false;
        }

    if($valid) {
        //Instantiate all the lexical validators
        foreach ($data as $key => $value) {
            $this->_validators[$key] = new ValidatorContext(Analyzer::LEXICAL, $value, $this->_profile);
            $this->_columnIndexes[$key] = $value;
        }
    }
    return $valid;
}
/**
 * getDelimiter
 * Try to detect the delimiter character on a CSV file, by reading the first row.
 *
 * @param mixed $file
 * @access public
 * @return string
 */
public static function getDelimiter($file) {
    $delimiter = false;
    $line = '';
    if($f = fopen($file, 'r')) {
        $line = fgets($f); // read until first newline
        fclose($f);
    }
    if(strpos($line, ';') !== FALSE && strpos($line, ',') === FALSE) {
        $delimiter = ';';
    } else if(strpos($line, ',') !== FALSE && strpos($line, ';') === FALSE) {
        $delimiter = ',';
    } else {
        die('Unable to find the CSV delimiter character. Make sure you use "," or ";" as delimiter and try again.');
    }
    return $delimiter;
}
}

I hope that the code is pretty much self-explanatory, but anyway I’ll summarize the process.

It all begins with the instantiation of the Analyzer class, where you provide the profile, and the required fields for this task.
Then, the lexical method is called with the full path of the CSV file to be processed. If the file exists, we firstly try to detect the delimiter used, which should be a comma or a semicolon.

Afterwards, the first row is parsed by cleaning the data, and checking for the correctness of the fields, ensuring there are not duplicated fields, and that all the required fields are present. If everything went as expected, we call the ValidatorContext class providing the fieldname, and we’ll get the proper validator instatiated.

Finally, we iterate over the CSV, ensuring that all the validations are passed, stopping the process in case of error, and saving the field and the line where the problem was found.

That’s all from now, in the next post I’ll show the last piece of the puzzle, which is the implementation of the ValidatorContext class, where we will be able to unify the whole process.