Introduction
In this blog post you will see how to remove invalid characters from XML using SSIS. We will use search and replace feature of Advanced File System Task.
Remove Invalid characters from XML
Xml file specification have restriction about which characters can be part of XML data and which should be avoided. If you use certain invalid character then XML Parser can throw error saying invalid character found by parser. Here is a very useful link which explains which characters are allowed in XML and not allowed in XML.
Using SSIS and Regex to clean invalid XML Characters
To clean invalid characters we can use below regular expression. Which basically removes any characters outside allowed character range.
1 |
[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF] |
Above expression can be good enough for most cases but still not the best solution. If you really want most accurate expression as per XML Specs then use below expression. It can be SLOW for large XML files.
1 |
[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff]) |
Remove any characters
If you don’t have range for valid characters but you know list of invalid characters then use below expression (without ^). For example if you want to remove A, C and E then use below
1 2 3 4 5 |
[ACE] --OR-- Use hex codes [\x41\x43\x45] |
You can also use range
[0-9A-Za-z]Step-By-Step Example
- Download and Install SSIS PowerPack from here
- Create new SSIS Project
- Drag ZS Advanced File System Task from the SSIS Toolbox to control flow designer
- Double click task to edit properties like below. Enter the following expression for Regular Expression
1[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF] - Save the package and execute. After you run the task it will remove (replace with blank) any invalid character in your XML file.
Here is the screenshot of Task configuration.