CSV File Validation In PHP (Part II)

If you remember my previous post regarding CSV validation, I explained how to implement a different validator for each field, and phase, using the Strategy Design Pattern. For instance, in the lexical phase, I was checking for the correctness of the SKU field regarding the lenght of the string, and the absence of forbidden characters. In the semantic phase, depending on if it’s an update or an import task, I was checking for the existence or absence of the provided SKUs list in the database.

One of the big advantages of this implementation, is that it’s very easy to implement unitary tests for each validator, as they are almost totally independent from each other. I said almost because the return values of the lexical validators are the input of the semantic ones, so there is some kind of dependency there. For instance, the “store” field used to be a string containing the code of the store, but eventually, we allowed to put several comma-separated store codes on the field. This required changes not only in the regular expressions of the lexical validator, but also on any semantic validator that depended on the store tokens, as they’ll be expecting an input string, but instead, they’ll get an array.

I could keep on talking about the pros and cons of this approach, but it’s out of the scope of this post. Our goal is to implement a CSV file validation, so, let’s focus on the implementation of the main class which will take care or reading the CSV file, and execute the proper validator for each column.

For the shake of readability on the blog, I’ve un-indented one level the methods’ code. This is how the lexical phase looks like:

class Analyzer {

    const LEXICAL = 'lexical';
    const SEMANTIC = 'semantic';

    /**
     * Allowed profiles
     */
    private static $_PROFILES = array('import', 'update');

    private $_validators;
    private $_columnIndexes;
    private $_requiredFields;
    private $_profile;
    private $_errors;

function __construct(array $required, $profile) {
    $this->_validators = array();
    $this->_columnIndexes = array();
    $this->_errors = array();
    $this->_requiredFields = $required;
    $this->_profile = $profile;
    if(!in_array($profile, self::$_PROFILES)) {
        die('Unknown profile specified.');
    }
}
/*
 * Perform the lexical analysis over the CSV
 * Iterate over the file and exit if an error is found
 *
 * @param $file string full path of the csv file
 * @return bool whether the file is valid (from a lexical point of view) or not
 */
protected function lexical($file) {
    if (!file_exists($file))  {
        return false;
    } else {
        $delimiter = self::getDelimiter($file);
    }

    $handle = fopen($file, 'r');

    //Parse the first row, instantiate all the validators
    $valid = $this->parseFirstRow(fgetcsv($handle, 0, $delimiter));
    //Number of columns specified on the header
    $num_columns = sizeOf($this->_columnIndexes);
    //line number count
    $i = 1;

    while(($data = fgetcsv($handle, 0, $delimiter)) !== FALSE && $valid) {
        $errors = array();

        //For each column
        foreach ($data as $key => $value) {

            //Skip all columns without header
            if($key >= $num_columns) {
                break;
            }

            $value = trim($value);

            //Validate
            $errors[$this->_columnIndexes[$key]] = $this->_validators[$key]->validate($value);

            $valid = $valid && $errors[$this->_columnIndexes[$key]];
        }

        //If any error was found, exit
        if(!$valid) {
            $filtered_errors = array_keys($errors, false);
            if(count($filtered_errors) > 0) {
		//Store the errors founds on the current line
                $this->_errors[$i] = $filtered_errors;
            }
            break;
        }
        $i++;
    }
    fclose($handle);

    return $valid;
}
/**
 * parseFirstRow
 * Check that the column names aren't duplicated
 * Ensure all required fields are present
 * Create the instances of each validator.
 *
 * @param array $data
 * @access protected
 * @return bool
 */
protected function parseFirstRow(array $data) {
    $valid = true;
    //Clean the data
    $data = array_filter(array_map('trim', array_map('strtolower', $data)));

    //Ensure that there aren't duplicated columns
    $dupes = array_diff_key($data, array_unique($data));
    if(!empty($dupes)) {
        $this->_errors[] = sprintf('The following columns are duplicated on the CSV: "%s".', implode($dupes, '", "'));
        $valid = false;
    }

    //Ensure all required columns are present
    if($valid &&
        //The number of columns is lower than the required fields, we don't need to keep checking, some columns are missing.
        (count($data) < count($this->_requiredFields) ||
        //The number of optional fields must match with the number of fields that are not required, otherwise something is missing.
        count(array_diff($data, $this->_requiredFields)) !== (count($data) - count($this->_requiredFields)) ||
        //If the operation is an import, either categories or category_ids must be present
        ($this->_profile == 'import' && !(in_array('categories', $data) || in_array('category_ids', $data))))) {

            $required = implode(array_diff($this->_requiredFields, $data), '", "');
            if($this->_profile == 'import' && !in_array('category_ids', $data) && !in_array('categories', $data)) {
                if($required) {
                    $required .= '" and "categories" or "category_ids';
                } else {
                    $required = 'categories" or "category_ids';
                }
            }
            $this->_errors[] = sprintf('The following columns are missing on the CSV: "%s".', $required);
            $valid = false;
        }

    if($valid) {
        //Instantiate all the lexical validators
        foreach ($data as $key => $value) {
            $this->_validators[$key] = new ValidatorContext(Analyzer::LEXICAL, $value, $this->_profile);
            $this->_columnIndexes[$key] = $value;
        }
    }
    return $valid;
}
/**
 * getDelimiter
 * Try to detect the delimiter character on a CSV file, by reading the first row.
 *
 * @param mixed $file
 * @access public
 * @return string
 */
public static function getDelimiter($file) {
    $delimiter = false;
    $line = '';
    if($f = fopen($file, 'r')) {
        $line = fgets($f); // read until first newline
        fclose($f);
    }
    if(strpos($line, ';') !== FALSE && strpos($line, ',') === FALSE) {
        $delimiter = ';';
    } else if(strpos($line, ',') !== FALSE && strpos($line, ';') === FALSE) {
        $delimiter = ',';
    } else {
        die('Unable to find the CSV delimiter character. Make sure you use "," or ";" as delimiter and try again.');
    }
    return $delimiter;
}
}

I hope that the code is pretty much self-explanatory, but anyway I’ll summarize the process.

It all begins with the instantiation of the Analyzer class, where you provide the profile, and the required fields for this task.
Then, the lexical method is called with the full path of the CSV file to be processed. If the file exists, we firstly try to detect the delimiter used, which should be a comma or a semicolon.

Afterwards, the first row is parsed by cleaning the data, and checking for the correctness of the fields, ensuring there are not duplicated fields, and that all the required fields are present. If everything went as expected, we call the ValidatorContext class providing the fieldname, and we’ll get the proper validator instatiated.

Finally, we iterate over the CSV, ensuring that all the validations are passed, stopping the process in case of error, and saving the field and the line where the problem was found.

That’s all from now, in the next post I’ll show the last piece of the puzzle, which is the implementation of the ValidatorContext class, where we will be able to unify the whole process.

CSV File Validation In PHP (Part I)

Last week, I had to implement a CSV file validation in PHP. It’s a potentially very big file with several columns, where each one has its custom restrictions. For example, the “SKU” field must be an alphanumeric string no longer than 64 characters, the “IMAGES” field must contain image filename(s) and/or image URL(s) (comma separated values), and so on.

Eventually, I came up with a solution, which consists on using a strategy pattern, implementing a custom validation strategy for each field.

In order to fully ensure that the input data is correct, two kinds of validations must be performed:

  • Restrictions regarding the format, like allowed characters or maximun length.
  • Not only the data must be in the proper format, but also it must be valid from a semantical point of view. For instance, the SKUs must exist on the database, or each image URL specified must actually be an online image (an html, or an offline server shouldn’t pass the validation).

So, I split the requirements in two kind of validators: lexical and semantic.

This is how it looks like, hopefully it can be useful if you are in a similar situation:

Base validator:


abstract class Abstract_Validator {

/*
 * Tell whether the data is correct or not
 * Might change (correct) the $input value
 *
 * @return bool
 */
 abstract function validate(&$input);
 /*
 * Return a string with the details of the error and/or the allowed values for the current field.
 *
 * @param $input string|array
 * @return string
 */
 abstract function getErrorMsg($input);

 }

Base lexical validator with an implementation example:

abstract class Abstract_Lexical extends Abstract_Validator {

 protected $tokens;

 function __construct() {
 $this->tokens = array();
 }

 public function getTokens() {
 return $this->tokens;
 }
 public function validate(&$input) {
 $this->tokens[] = $input;
 }

/*
* Return the last part of the error message
* Can be called from a child class to complete the returned message.
*
* @param $input string
* @return string
*/
 public function getErrorMsg($input) {
   return sprintf('but "%s" provided.', $input);
 }
}

/*
 * EAN is an optional field, but if it's not empty, it must be a valid code.
 */
class Lexical_Ean extends Abstract_Lexical {
 protected function checkEAN($fullcode) {
 $code = substr($fullcode, 0, -1);
 $checksum = 0;
 foreach (str_split(strrev($code)) as $pos => $val) {
 $checksum += $val * (3 - 2 * ($pos % 2));
 }
 return (10 - ($checksum % 10)) % 10 == substr($fullcode,-1);
 }
 public function validate(&$input) {
 parent::validate(&$input);
 return strlen($input) == 0 || $this->checkEAN($input);
 }

 public function getErrorMsg($input) {
 return 'allowed values: valid EAN code (13 digits number), ' . parent::getErrorMsg($input);
 }
 }

Base Semantic validator, with an implementation example:

abstract class Abstract_Semantic extends Abstract_Validator {
 protected $dbLink;
 protected $errors;
 protected static $FIELD;
 protected static $QUERY;
 protected static $TYPE = 'validateExisting';

 function __construct($dbLink = null) {
 $errors = array();
 if($dbLink) {
 $this->dbLink = $dbLink;
 }
 }

 /*
 * Default validator
 * Ensure the extended class has defined the required values
 * Perform the query and pass the result to the proper method
 *
 * @param $input simple array with data
 * @return bool
 */
 public function validate(&$input) {

 if (!$this->dbLink || !static::$QUERY || !static::$FIELD || !static::$TYPE) {
 $this->errors[] = 'Error validating ' . get_class($this) . ', incomplete data was provided.';
 return false;
 }
 $field = static::$FIELD;
 $type = static::$TYPE;
 $list = $this->dbLink->query(sprintf(static::$QUERY, implode($input, '","')));
 return $this->$type($list, $field, $input);
 }

 /*
 * Ensure all requested items where present in the database
 * Otherwise, log all not found items
 *
 * @param list mysqli_result
 * @param $field field to check
 * @param $input array with data
 * @return bool
 */
 protected function validateExisting($list, $field, $input) {
 $ret = $list->num_rows == count($input);
 if(!$ret) {
 $found = array();
 while($obj = $list->fetch_object()){
 $found[] = $obj->$field;
 }
 $this->errors = array_diff($input, $found);
 }
 return $ret;
 }

 /*
 * Return the errors found during the validation
 * @return array
 */
 public function getErrors() {
 return $this->errors;
 }

 /**
 * Return the description message for the error.
 *
 * @param $input array
 * @return string
 */
 public function getErrorMsg($input) {
 if (static::$TYPE == 'validateExisting') {
 $exist = 'must exist';
 } else $exist = 'cannot exist';

 return sprintf('%s(s) "%s" %s on the database.', str_replace('Semantic_', '', get_class($this)), implode($input, ', '), $exist);
 }
 }

/**
 * Store field must correspond to an existing store in the database.
 */
class Semantic_Store extends Abstract_Semantic {
 protected static $QUERY = 'SELECT code from stores where code in("%s")';
 protected static $FIELD = 'code';
 }

That’s all for the moment, on next post I’ll explain how I did join all this, iterating over the CSV file, stay tuned!

SOAP IP Address in PHP

This is an example about how to perform SOAP IP Address petitions over a certain IP address in php.

It can be useful in several situations. For instance, if you have a bunch of server behind a load balancer, all of them hosting the same sites (virtual hosts), and you need to comunicate directly to each of them. If you know the server’s ip address, you can comunicate directly to them by specifying the ip address and the host.
Basically, we need to extend the SoapClient class, overriding the __doRequest method in a way that allows us to specify a custom header in the HTTP petitions.

class HostSoapClient extends SoapClient {

 const DEBUG = true;

/*
 * Override the method __doRequest to perform a custom CURL petition, specifying a host on the header
 *
 * @param $request string
 * @param $location sting
 * @param $action string contain the host and the action in the format url#action
 * @param $version int
 * @param $one_way int
 * @return string
 */
 public function __doRequest($request, $location, $action, $version, $one_way = 0) {
 $url_parts = parse_url($location);
 $ch = curl_init();
 list($host, $action) = explode('#', $action);
 if(self::DEBUG) {
 echo 'Performing request to: ' . $location . "\n";
 echo 'Host specified on the header: ' . $host . "\n";
 echo 'Request:' . $request . "\n";
 }
 curl_setopt($ch,CURLOPT_URL, $location);
 curl_setopt($ch,CURLOPT_POSTFIELDS, $request);
 curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: ' . $host, 'SoapAction: ' . $action));
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 $response = curl_exec($ch);
 curl_close($ch);
 return $response;
 }
}

Usage example:

$server = '11.22.33.44'; //The ip address where the page is hosted
$url = $server . '/api/index/index'; //Path of the api (tested on both v1 and v2)
$host = 'www.yourhost.com'; //An existing hostname on the $server ip
$client = new HostSoapClient(null,
		array('trace' => 1,
			'exceptions' => 1,
			'cache_wsdl' => 0,
			'location'=> $url,
			'uri' => $host));
$session = $client->login('user', 'pass');
//Now you are connected and you can call any resource, ie.
$client->call($session, 'api.method', array('param', 'value'));