An In-Depth Look at XML Document Attack Vectors

By OPSWAT

Aug 15, 2017 Last updated: Jun 10, 2024

In June, we published a short announcement about the beta release of XML document data sanitization (CDR), in which we briefly explained why it matters:

"The flexibility of XML has resulted in its widespread usage, including within Microsoft Office documents and SOAP messages. However, XML documents have many security vulnerabilities that can be targeted for different types of attacks, such as file retrieval, server side request forgery, port scanning, or brute force attacks."

This blog post is for the technical reader who would like to see more details about XML-based attacks with some examples.

We will cover the following threats in this article. OPSWAT MetaDefender data sanitization (CDR) addresses all of these threats.

XML injection
XSS/CDATA Injection
Oversized payloads or XML bombs
Recursive payloads
VBA macros
JavaScript

XML Injection

XML injection can be exploited to deliver attacks targeting XML applications that do not escape reserved characters.

In an XML document, "<" and ">" are reserved characters used to specify the beginning or the end of an XML tag. If one wants to use these reserved characters, one "escapes" their predefined meaning by using XML entities.

For example, suppose we want an element with the following content: "If the size of a stream < 4096, this sector is considered as a mini stream." We then have to escape "<" as "&lt;". "&lt;" is an XML entity; an XML parser automatically converts it back to "<".
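The escape-and-parse round trip described above can be sketched in Python with the standard library. This is only an illustration; the element name "note" is a placeholder and does not come from the original post.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

text = "If the size of a stream < 4096, this sector is considered as a mini stream."

# escape() replaces &, < and > with their predefined XML entities.
escaped = escape(text)

# "note" is a hypothetical element name used only for this illustration.
document = f"<note>{escaped}</note>"

# The parser converts "&lt;" back to "<" when it reads the element content.
parsed_text = ET.fromstring(document).text
```

After parsing, parsed_text is identical to the original text, which is exactly why escaped markup can reappear as live markup further down the pipeline.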

XML injection attacks typically occur in this way:

An attacker injects malicious JavaScript markup code as escaped text in an XML document. Because the code is escaped, malware filtering may not detect it.
The XML document is then parsed by an XML application. In this step, the attacker targets XML applications that do not properly escape reserved characters when they serialize the parsed content, so the markup comes out unescaped.
Later, the content of the XML element that contains the malicious JavaScript markup is used as input data for a website. When an innocent user loads this website, the malicious code is executed.

Sample code:

[Figure: XML injection attack sample code, a "temperature" element]

The above sample is an XML element named "temperature." Its content can be uploaded to a website, and the JavaScript code is run when a user opens the website.

Solution: To mitigate XML injection, we check if XML documents contain unescaped reserved characters that are used to inject malicious JavaScript code. If they do, we remove this code.
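The attack steps and the mitigation can be sketched as follows. The sample element below is a hypothetical stand-in for the sample described above, and SCRIPT_RE and sanitize_text are illustrative names; this is a minimal sketch of the idea, not MetaDefender's implementation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the markup is escaped, so the raw file never contains "<script>".
SAMPLE = "<temperature>25&lt;script&gt;alert('XSS')&lt;/script&gt;</temperature>"

# Assumption: only <script>...</script> blocks are treated as malicious here.
SCRIPT_RE = re.compile(r"<script\b.*?>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

def sanitize_text(root):
    """Strip script markup from element text after parsing; return the number of removals."""
    removed = 0
    for element in root.iter():
        if element.text:
            element.text, count = SCRIPT_RE.subn("", element.text)
            removed += count
    return removed

root = ET.fromstring(SAMPLE)
decoded = root.text              # the parser has already unescaped the markup
removed = sanitize_text(root)    # root.text no longer contains the script block
```

The sketch inspects element text only. A fuller check would also cover attribute values and the text that follows child elements.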

XSS/CDATA Injection

A CDATA section in an XML document is used to escape text. It can also be used to inject malicious JavaScript code, leading to a web service attack. XML parsers pass every character in a CDATA section through unchanged. Hence, CDATA can be used to convey JavaScript markup code.

CDATA injection occurs in a scenario similar to XML injection.

An attacker injects malicious JavaScript markup code in CDATA sections in an XML document. Malware filtering may not detect the malicious code because it is escaped by CDATA tags.
Later, the XML document is parsed by an XML application. XML applications that do not properly escape reserved characters when they serialize the parsed content emit the markup unescaped.
Finally, the content of the XML element that contains the malicious JavaScript markup is uploaded to a website. When a user opens this website, the malicious code is executed.

Sample code:

[Figure: XML CDATA injection attack sample code]

Solution: To prevent CDATA injection, we check if XML documents contain a CDATA section and reserved characters inside that are used to inject malicious JavaScript code. If so, we remove the code.
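A minimal sketch of this check, assuming that only script blocks need to be stripped. The sample document and the names CDATA_RE, SCRIPT_RE and strip_scripts_in_cdata are hypothetical and are not taken from MetaDefender.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the script is wrapped in CDATA, so the parser returns it verbatim.
SAMPLE = "<comment><![CDATA[Nice post <script>alert('XSS')</script>]]></comment>"

CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
SCRIPT_RE = re.compile(r"<script\b.*?>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

def strip_scripts_in_cdata(xml_text):
    """Remove script markup inside CDATA sections and keep the rest of each section."""
    def clean(match):
        return "<![CDATA[" + SCRIPT_RE.sub("", match.group(1)) + "]]>"
    return CDATA_RE.sub(clean, xml_text)

parsed_before = ET.fromstring(SAMPLE).text   # contains the raw script markup
cleaned = strip_scripts_in_cdata(SAMPLE)
parsed_after = ET.fromstring(cleaned).text   # script markup is gone
```

Working on the raw text keeps the CDATA section itself in place, so legitimate content inside it survives sanitization.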

XML Bombs

XML bomb attacks are designed to exhaust the resources of a web server. When processing an XML document that contains an XML bomb, the XML parser consumes very large amounts of memory and processing power to parse the document. XML bombs are also known as "billion laughs" attacks.

An XML bomb attack is made possible by exploiting XML entities. There are both predefined XML entities and user-defined XML entities. A user can define an XML entity as follows:

[Figure: XML entity definition sample code]

In the above entity definition, "name" is the entity name and "replacement text" is its value. The entity value is then inserted into element content or an attribute value by writing "&name;".
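As an illustration, the following Python sketch declares an entity with the standard ENTITY syntax and lets the standard-library parser expand it. The document, the element name "note" and the attribute name "attr" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "name" is declared in the internal DTD subset
# and referenced in both an attribute value and element content.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY name "replacement text">
]>
<note attr="&name;">&name;</note>"""

root = ET.fromstring(DOCUMENT)
content = root.text           # entity reference expanded in element content
attribute = root.get("attr")  # entity reference expanded in the attribute value
```

Both values come back as the replacement text, because the parser substitutes the entity wherever "&name;" appears.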

An entity value can itself reference another entity, and this opens the door for an attacker to deliver an XML bomb attack.

Let's take a look at the following XML sample.

[Figure: XML bomb sample code]

When completely parsed, the content of the "Example" element would contain 2^127 words "ha," which is about 3.4 x 10^26 terabytes. It is impossible to parse this document. As a result, the XML parser will crash.
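The size estimate can be checked with a few lines of arithmetic. The sketch assumes a count of 2^127 words, two bytes per "ha" (UTF-8), and 1 terabyte = 10^12 bytes; the variable names are illustrative.

```python
# Assumptions: 2**127 copies of "ha", UTF-8 encoding, 1 TB = 10**12 bytes.
words = 2 ** 127
bytes_per_word = len("ha".encode("utf-8"))

total_bytes = words * bytes_per_word
terabytes = total_bytes / 10 ** 12

# Scientific notation with one decimal place, for comparison with the figure in the text.
summary = f"{terabytes:.1e} TB"
```

Under these assumptions the expanded content is 2^128 bytes, which matches the order of magnitude quoted above.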

Solution: To deal with XML bomb attacks, the XML parser is configured to limit expansion of user-defined XML entities. When the expansion exceeds a certain depth level, the parser will raise an exception.
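One way to picture a depth limit is a pre-check that measures how deeply entity declarations nest before anything is expanded. The sketch below is an assumption-laden illustration, not the parser configuration used by MetaDefender: check_entity_depth, EntityExpansionError, the max_depth default and both sample documents are hypothetical, and the regular expression only handles double-quoted internal entities.

```python
import re

# Matches simple internal entity declarations such as <!ENTITY name "value">.
ENTITY_DECL_RE = re.compile(r'<!ENTITY\s+([\w.-]+)\s+"([^"]*)"\s*>')
ENTITY_REF_RE = re.compile(r"&([\w.-]+);")

class EntityExpansionError(ValueError):
    """Raised when entity nesting is recursive or exceeds the allowed depth."""

def check_entity_depth(xml_text, max_depth=3):
    """Return the deepest entity nesting level, or raise if it exceeds max_depth."""
    entities = dict(ENTITY_DECL_RE.findall(xml_text))
    cache = {}  # memoized depths, so shared references are not re-walked

    def depth(name, path):
        if name not in entities:
            return 0  # predefined or undeclared entity
        if name in path:
            raise EntityExpansionError(f"recursive entity: {name}")
        if name not in cache:
            refs = set(ENTITY_REF_RE.findall(entities[name]))
            cache[name] = 1 + max((depth(r, path | {name}) for r in refs), default=0)
        if cache[name] > max_depth:
            raise EntityExpansionError(f"entity {name} nests {cache[name]} levels deep")
        return cache[name]

    return max((depth(name, frozenset()) for name in entities), default=0)

SAFE = '<!DOCTYPE a [<!ENTITY x "hi">]><a>&x;</a>'
BOMB = ('<!DOCTYPE a [<!ENTITY l0 "ha"><!ENTITY l1 "&l0;&l0;">'
        '<!ENTITY l2 "&l1;&l1;"><!ENTITY l3 "&l2;&l2;">'
        '<!ENTITY l4 "&l3;&l3;">]><a>&l4;</a>')

safe_depth = check_entity_depth(SAFE)
try:
    check_entity_depth(BOMB)
    bomb_rejected = False
except EntityExpansionError:
    bomb_rejected = True
```

Because the check works on declarations rather than expanded text, it rejects the nested document without ever allocating the expanded content.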

Visual Basic Macro

XML is a well-known format not only for saving text but also for use by Microsoft Office applications. Attackers can utilize Microsoft Office XML files to hide malicious macros. This method gives an attack a greater chance of success because many users will expect XML files to be harmless text files.

When a Microsoft Word document is converted to XML format, Visual Basic for Applications (VBA) macros are compressed and encoded in base64. On a Windows machine with Microsoft Office installed, Word documents saved in XML format are recognized as Word documents.

When the XML file is double-clicked, Microsoft Word opens automatically and may run embedded VBA macros.

Attackers carry out VBA macro attacks with XML in the following way:

Create a Microsoft Word document and add a malicious VBA macro
Convert the document to XML format
Send the XML document to victims, for example by email
The victim clicks on the XML file, then Microsoft Word opens the XML file and runs the VBA macro
The VBA macro downloads another harmful program and executes it

How to Detect and Remove VBA Macros

The following elements and attributes in Microsoft Word XML files help identify VBA macros.

<?mso-application progid="Word.Document"?> tells us that the XML document is a Microsoft Word document
<w:wordDocument ... w:macrosPresent="Yes" ...> shows that the document contains a VBA macro
<w:binData w:name="editdata.mso"></w:binData> is where VBA macro content is stored in the document

Solution: By checking the above elements and attributes, we can remove VBA macros from an XML document.
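The checks listed above can be sketched in Python as follows. This is an illustration under stated assumptions: the "w" prefix is assumed to map to the Word 2003 XML namespace, the sample document (including the w:docSuppData wrapper and the base64 placeholder) is hypothetical, and the function names are not taken from MetaDefender.

```python
import xml.etree.ElementTree as ET

# Assumption: the "w" prefix maps to the Word 2003 XML (WordprocessingML) namespace.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W_NS)

SAMPLE = f"""<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="{W_NS}" w:macrosPresent="yes">
  <w:docSuppData>
    <w:binData w:name="editdata.mso">QmFzZTY0IHBsYWNlaG9sZGVy</w:binData>
  </w:docSuppData>
  <w:body/>
</w:wordDocument>"""

def is_word_xml(xml_text):
    """ElementTree drops processing instructions outside the root, so check the raw text."""
    return '<?mso-application progid="Word.Document"?>' in xml_text

def remove_vba_macros(xml_text):
    """Return (sanitized_xml, removed_count) for a Word XML document."""
    root = ET.fromstring(xml_text)
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            is_bin_data = child.tag == f"{{{W_NS}}}binData"
            if is_bin_data and child.get(f"{{{W_NS}}}name") == "editdata.mso":
                parent.remove(child)
                removed += 1
    # Drop the flag that announces macro content.
    root.attrib.pop(f"{{{W_NS}}}macrosPresent", None)
    return ET.tostring(root, encoding="unicode"), removed

word_document = is_word_xml(SAMPLE)
sanitized, removed = remove_vba_macros(SAMPLE)
```

Note that the serialized result no longer carries the mso-application processing instruction, so a real sanitizer would need to write it back if the file should still open in Word.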

JavaScript

Malicious JavaScript code can be injected into XML documents by using XML injection and CDATA injection techniques.

MetaDefender XML data sanitization implicitly detects and removes JavaScript code while processing XML injection and CDATA injection. JavaScript code that is injected directly into XML documents by using script tags is removed too.
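Removing script elements that appear directly in the document tree might look like the sketch below. The sample document and the function name remove_script_elements are hypothetical, and the approach shown is only one possible way to do it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a script element placed directly in the XML tree.
SAMPLE = "<page><title>Report</title><script>alert('XSS')</script><body>Text</body></page>"

def remove_script_elements(root):
    """Delete every element whose local name is "script"; return the number removed."""
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if not isinstance(child.tag, str):
                continue  # skip comments and processing instructions
            # Compare the local name so namespaced script tags are caught too.
            local_name = child.tag.rsplit("}", 1)[-1].lower()
            if local_name == "script":
                # Text that directly follows the element (its tail) is dropped as well.
                parent.remove(child)
                removed += 1
    return removed

root = ET.fromstring(SAMPLE)
removed = remove_script_elements(root)
sanitized = ET.tostring(root, encoding="unicode")
```

The rest of the tree is left untouched, so the sanitized document keeps its original structure minus the script element.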

XML Data Sanitization Demos

Below are links to MetaDefender.com scanning results for the example files that we created for each of these attacks, along with scanning results for the sanitized versions of those files. (Since our sample XML documents contain examples of the exploits but do not actually perform any malicious actions, MetaDefender.com engines do not detect them as malicious.)

You can also download a ZIP archive containing the original and sanitized files by clicking here.

No.	Sample	After Sanitizing	Notes
1	VBA macro sample	Sanitized VBA macro sample	XML containing VBA macro and sanitized result
2	XML bomb	Sanitized XML bomb	XML bomb and sanitized result
3	JavaScript in XML	Sanitized JavaScript in XML	XML has script tag and sanitized result
4	XML injection sample	Sanitized XML injection sample	XML injection and sanitized result
5	CDATA injection sample	Sanitized CDATA injection sample	CDATA injection and sanitized result

How to Utilize MetaDefender Data Sanitization with XML Documents

[Figure: XML document data sanitization (CDR)]

References

Research and content development assistance provided by OPSWAT data sanitization team.
