Introduction
In this blog post, you will see how to remove invalid characters from XML using SSIS. We will use the search and replace feature of the Advanced File System Task.
Remove Invalid characters from XML.
The XML file specification specifies which characters can be part of XML data and which should be avoided. If you use an invalid character, the XML Parser can throw an error indicating that it found an invalid character. Here is a very useful link that explains which characters are allowed in XML and which are not.
Using SSIS and Regex to clean invalid XML Characters
To clean invalid characters, we can use the following regular expression. This basically removes any characters outside the allowed character range.
|
1 |
[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF] |
The above expression is sufficient for most cases, but it is not the best solution. If you really want the most accurate expression per the XML Specs, use the expression below. It can be SLOW for large XML files.
|
1 |
[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff]) |
Remove any characters
If you don’t have a range of valid characters but you know a list of invalid characters, use the expression below (without the ^). For example, if you want to remove A, C, and E; then use the below
|
1 2 3 4 5 |
[ACE] --OR-- Use hex codes [\x41\x43\x45] |
You can also use a range
[0-9A-Za-z]Step-By-Step Example
- Download and install SSIS PowerPack from here.
- Create a new SSIS Project.
- Drag the ZS Advanced File System Task from the SSIS Toolbox to the control flow designer.
- Double-click the task to edit its properties, as shown below. Enter the following expression for a Regular Expression.
1[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF] - Save the package and execute. After you run the task, it will remove (replace with blank) any invalid character in your XML file.
Here is the screenshot of the Task configuration.

SSIS Advanced File System Task – Search and Replace Option -Remove invalid XML characters using Regex (Regular Expression)



