An In-Depth Look at XML Document Attack Vectors

In June, we published a short announcement about the beta release of XML document data sanitization (CDR), in which we briefly explained why it matters:

"The flexibility of XML has resulted in its widespread usage, including within Microsoft Office documents and SOAP messages. However, XML documents have many security vulnerabilities that can be targeted for different types of attacks, such as file retrieval, server side request forgery, port scanning, or brute force attacks."

This blog post is for the technical reader who would like to see more details about XML-based attacks with some examples.

We will cover the following threats in this article. OPSWAT Metadefender data sanitization (CDR) addresses all of these threats.

  • XML injection
  • XSS/CDATA Injection
  • Oversized payloads or XML bombs
  • Recursive payloads
  • VBA macros
  • JavaScript

XML Injection

XML injection can be exploited to deliver attacks targeting XML applications that do not escape reserved characters.

In an XML document, "<" and ">" are reserved characters used to specify the beginning or the end of an XML tag. If one wants to use these reserved characters, one "escapes" their predefined meaning by using XML entities.

For example, suppose we want an element with the following content: "If the size of a stream < 4096, this sector is considered as a mini stream." We then have to escape "<" to be "&lt;". "&lt;" is an XML entity; it will be changed to "<" automatically by an XML parser.
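The escaping round trip can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not part of the original post; the element name "note" is a hypothetical choice.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

text = "If the size of a stream < 4096, this sector is considered as a mini stream."

# escape() replaces &, < and > with the predefined entities &amp;, &lt; and &gt;.
escaped = escape(text)

# "note" is a hypothetical element name used only for this illustration.
document = f"<note>{escaped}</note>"

# The parser turns &lt; back into "<" when it reads the element content.
parsed = ET.fromstring(document)

print(escaped)
print(parsed.text == text)
```

The escaped form is what is stored in the file; the original "<" only reappears after parsing.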

XML injection attacks typically occur in this way:

  • An attacker injects malicious JavaScript markup code as escaped text in an XML document. Because the code is escaped, malware filtering may not detect it.
  • The XML document is then parsed by an XML application. In this step, the attacker targets XML applications that do not properly serialize reserved characters, meaning that the reserved characters restored by the parser are not escaped again on output.
  • Later, content of the XML element that contains malicious JavaScript markup code is used as input data for a website. When an innocent user loads this website, the malicious code is executed.

Sample code:

[Figure: XML injection attack sample code for a "temperature" element]

The above sample is an XML element named "temperature." Its content can be uploaded to a website, and the JavaScript code is run when a user opens the website.
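The same flow can be sketched in Python. This is an illustrative example rather than the sample referred to above: the payload, the element content, and the HTML template are all assumptions made for the sketch.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical payload: the script markup is escaped, so the raw file
# contains no literal "<script" for a naive filter to match.
document = (
    "<temperature>"
    "25 &lt;script&gt;alert('injected')&lt;/script&gt;"
    "</temperature>"
)
print("<script" in document)

# Parsing restores the reserved characters in the element content.
value = ET.fromstring(document).text

# Unsafe: writing the parsed value into HTML as-is yields a live script tag.
unsafe_html = f"<p>Temperature: {value}</p>"

# Safer: escape the value again for the output context (HTML here).
safe_html = f"<p>Temperature: {html.escape(value)}</p>"

print(unsafe_html)
print(safe_html)
```

The point of the sketch is that escaping has to be reapplied every time the value crosses into a new output context; having been escaped once in the source file is not enough.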

Solution: To mitigate XML injection, we check if XML documents contain unescaped reserved characters that are used to inject malicious JavaScript code. If they do, we remove this code.
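One way to approximate this kind of check is sketched below in Python. It is not the Metadefender implementation: the function name and the regular expression are assumptions, and the pattern only covers literal script blocks, whereas a real sanitizer has to consider many more ways of smuggling executable markup.

```python
import re
import xml.etree.ElementTree as ET

# Matches <script ...> ... </script> once the parser has unescaped the text.
SCRIPT_BLOCK = re.compile(
    r"<\s*script\b.*?<\s*/\s*script\s*>", re.IGNORECASE | re.DOTALL
)


def strip_script_text(xml_text: str) -> str:
    """Remove script blocks hidden in element text, then re-serialize."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.text:
            element.text = SCRIPT_BLOCK.sub("", element.text)
        if element.tail:
            element.tail = SCRIPT_BLOCK.sub("", element.tail)
    # ElementTree escapes any remaining reserved characters on output.
    return ET.tostring(root, encoding="unicode")


sample = "<temperature>25 &lt;script&gt;alert('injected')&lt;/script&gt;</temperature>"
print(strip_script_text(sample))
```

Because the document is parsed first, the check sees the text as a downstream application would see it, not in its escaped form.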

XSS/CDATA Injection

A CDATA section in an XML document marks text that the parser must not interpret as markup. XML parsers pass every character in a CDATA section through unchanged, so a CDATA section can carry JavaScript markup code and be used to attack a web service.

CDATA injection occurs in a scenario similar to XML injection.

  • An attacker injects malicious JavaScript markup code in CDATA sections in an XML document. Malware filtering may not detect the malicious code because it is escaped by CDATA tags.
  • Later, the XML document is parsed by an XML application. XML applications that do not properly serialize reserved characters leave them unescaped on output.
  • Finally, content of the XML element that contains malicious JavaScript markup code is uploaded to a website. When this website is opened by a user, the malicious code is executed.

Sample code:

[Figure: XML CDATA injection attack sample code]
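As a rough illustration in Python (the "comment" element and the payload are hypothetical, not the sample from the original post), the following shows that a parser hands back CDATA content verbatim, and that re-serializing through the library escapes it again:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the markup sits inside a CDATA section.
document = "<comment><![CDATA[<script>alert('injected')</script>]]></comment>"

root = ET.fromstring(document)

# The CDATA wrapper is gone after parsing; the markup itself is intact.
print(root.text)

# Serializing with the library escapes the reserved characters again.
reserialized = ET.tostring(root, encoding="unicode")
print(reserialized)
```

An application that builds its output by string concatenation instead of a serializer would skip that second escaping step, which is the weakness the attack relies on.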

Solution: To prevent CDATA injection, we check if XML documents contain a CDATA section and reserved characters inside that are used to inject malicious JavaScript code. If so, we remove the code.

XML Bombs

XML bomb attacks are designed to exhaust the resources of a web server. When processing a document that contains an XML bomb, the XML parser consumes very large amounts of memory and processing time. XML bombs are also known as "billion laughs" attacks.

An XML bomb attack is made possible by exploiting XML entities. There are both predefined XML entities and user-defined XML entities. A user can define an XML entity as follows:

<!ENTITY name "replacement text">

In the above entity definition, "name" is the entity name and "replacement text" is its value. The entity value is then inserted into an element content or attribute as "&name;".
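A short Python sketch shows the substitution at work. The "example" element and the surrounding DOCTYPE are assumptions added to make the snippet a complete document.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and referenced as &name;.
document = """<!DOCTYPE example [
  <!ENTITY name "replacement text">
]>
<example>&name;</example>"""

root = ET.fromstring(document)

# The parser has already replaced the reference with the entity value.
print(root.text)
```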

An entity value can itself reference another entity, and this opens the door for an attacker to mount an XML bomb attack.

Let's take a look at the following XML sample.

[Figure: XML bomb sample code]

When completely parsed, the content of the "Example" element would contain 2^127 copies of the string "ha," which is about 3.4 x 10^26 terabytes. No parser can hold an expansion of that size, so a parser that attempts it will exhaust its resources and crash.
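The arithmetic behind that figure can be checked with a few lines of Python. The sketch assumes a bomb in which each entity references the previous one twice over 127 levels of nesting; that structure is inferred from the 2^127 figure, not taken from the sample itself.

```python
def expanded_count(levels: int, fanout: int) -> int:
    """Copies of the base string after `levels` rounds of nesting,
    when every entity references the previous one `fanout` times."""
    return fanout ** levels


count = expanded_count(levels=127, fanout=2)  # 2**127 copies of "ha"
size_bytes = count * len("ha")                # 2 bytes per copy in UTF-8
size_terabytes = size_bytes / 10**12          # 1 TB taken as 10**12 bytes

print(f"{count:.3e} copies, {size_terabytes:.2e} TB")
```

The source document stays tiny because each level adds only one short declaration, while the expanded size doubles with every level.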

Solution: To deal with XML bomb attacks, the XML parser is configured to limit expansion of user-defined XML entities. When the expansion exceeds a certain depth level, the parser will raise an exception.
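The idea of a depth limit can be sketched as a pre-parse check in Python. This is a simplified illustration under stated assumptions, not how any particular parser implements its limit: the function names, the default limit of 3, and the regular expressions (which only recognize internal entities declared with double quotes) are all choices made for the sketch.

```python
import re

# Simplification: only internal entities declared with double quotes are recognized.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"&(\w+);")


class EntityDepthError(ValueError):
    """Raised when entity nesting is recursive or deeper than allowed."""


def entity_depth(xml_text: str) -> int:
    """Return the longest chain of entity-to-entity references in the DTD."""
    entities = dict(ENTITY_DECL.findall(xml_text))
    cache: dict[str, int] = {}

    def depth(name: str, chain: tuple[str, ...]) -> int:
        if name in chain:
            raise EntityDepthError(f"recursive entity reference: {name}")
        if name not in cache:
            refs = [r for r in ENTITY_REF.findall(entities[name]) if r in entities]
            cache[name] = 1 + max(
                (depth(r, chain + (name,)) for r in refs), default=0
            )
        return cache[name]

    return max((depth(name, ()) for name in entities), default=0)


def check_entity_depth(xml_text: str, max_depth: int = 3) -> int:
    """Raise EntityDepthError if nesting exceeds max_depth, else return the depth."""
    found = entity_depth(xml_text)
    if found > max_depth:
        raise EntityDepthError(f"entity nesting {found} exceeds limit {max_depth}")
    return found


def build_nested_entities(levels: int) -> str:
    """Build a hypothetical test document whose entities each double the previous one."""
    decls = ['<!ENTITY a0 "ha">']
    for i in range(1, levels + 1):
        decls.append(f'<!ENTITY a{i} "&a{i - 1};&a{i - 1};">')
    body = "\n".join(decls)
    return f"<!DOCTYPE Example [\n{body}\n]>\n<Example>&a{levels};</Example>"


print(check_entity_depth(build_nested_entities(2)))
try:
    check_entity_depth(build_nested_entities(127))
except EntityDepthError as error:
    print("rejected:", error)
```

The check never expands anything, so its cost grows with the number of declarations rather than with the size of the expansion. The recursive helper is acceptable for a sketch; a hardened version would walk the references iteratively and would also cap the total expanded size.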

Visual Basic Macro

XML is a well-known format not only for saving text but also for use by Microsoft Office applications. Attackers can utilize Microsoft Office XML files to hide malicious macros. This method gives an attack a greater chance of success because many users will expect XML files to be harmless text files.

When a Microsoft Word document is converted to XML format, Visual Basic for Applications (VBA) macros are compressed and encoded in base64. On a Windows machine with Microsoft Office installed, Word documents saved in XML format are recognized as Word documents.

When the XML file is double-clicked, Microsoft Word opens automatically and may run embedded VBA macros.

Attackers carry out VBA macro attacks with XML in the following way:

  • Create a Microsoft Word document and add a malicious VBA macro
  • Convert the document to XML format
  • Send the XML document to victims, for example by email
  • The victim clicks on the XML file, then Microsoft Word opens the XML file and runs the VBA macro
  • The VBA macro downloads another harmful program and executes it

How to Detect and Remove VBA Macros

The following elements and attributes in Microsoft Word XML files help identify VBA macros.

  • <?mso-application progid="Word.Document"?> tells us that the XML document is a Microsoft Word document
  • <w:wordDocument ... w:macrosPresent="Yes" ...> shows that the document contains a VBA macro
  • <w:binData w:name="editdata.mso"></w:binData> is where VBA macro content is stored in the document

Solution: By checking the above elements and attributes, we can remove VBA macros from an XML document.
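A minimal Python sketch of such a check follows. It is an illustration, not the Metadefender implementation: the function names, the namespace URI in the sample, the docSuppData wrapper, and the placeholder payload are assumptions, and matching is done on local names so the sketch does not depend on a specific namespace.

```python
import re
import xml.etree.ElementTree as ET

WORD_PI_TEXT = '<?mso-application progid="Word.Document"?>'
WORD_PI = re.compile(r'<\?mso-application\s+progid="Word\.Document"\s*\?>')


def local_name(name: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tags and attributes."""
    return name.rsplit("}", 1)[-1]


def is_macro_storage(element: ET.Element) -> bool:
    """True for a binData element whose name attribute ends in '.mso'."""
    return local_name(element.tag) == "binData" and any(
        local_name(key) == "name" and value.lower().endswith(".mso")
        for key, value in element.attrib.items()
    )


def strip_vba_macros(xml_text: str) -> tuple[bool, str]:
    """Return (macro_found, sanitized_xml) for a Word XML document."""
    if not WORD_PI.search(xml_text):
        return False, xml_text  # not marked as a Word document; left untouched
    root = ET.fromstring(xml_text)
    found = False
    for key in list(root.attrib):
        if local_name(key) == "macrosPresent" and root.attrib[key].lower() == "yes":
            root.attrib[key] = "no"
            found = True
    for parent in list(root.iter()):
        for child in list(parent):
            if is_macro_storage(child):
                parent.remove(child)
                found = True
    # ElementTree drops the processing instruction, so it is written back here.
    return found, WORD_PI_TEXT + ET.tostring(root, encoding="unicode")


# Hypothetical sample; the namespace URI and payload are assumptions.
sample = (
    '<?mso-application progid="Word.Document"?>'
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' w:macrosPresent="yes">'
    '<w:docSuppData><w:binData w:name="editdata.mso">AAAA</w:binData></w:docSuppData>'
    "<w:body/>"
    "</w:wordDocument>"
)
found, cleaned = strip_vba_macros(sample)
print(found)
print(cleaned)
```

ElementTree may rename the namespace prefix on output (for example "w" becoming "ns0"); the result is equivalent XML, but a production tool would preserve the original prefixes.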


JavaScript

Malicious JavaScript code can be injected into XML documents by using XML injection and CDATA injection techniques.

Metadefender XML data sanitization implicitly detects and removes JavaScript code while processing XML injection and CDATA injection. JavaScript code that is injected directly into XML documents by using script tags is removed too.
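Removing directly embedded script elements can be sketched as follows in Python. This is an assumption-laden illustration rather than the product's logic: the function name and the sample document are hypothetical, and only elements literally named "script" are handled.

```python
import xml.etree.ElementTree as ET


def remove_script_elements(xml_text: str) -> str:
    """Drop every element whose local name is 'script' and re-serialize."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.rsplit("}", 1)[-1].lower() == "script":
                # ElementTree stores text that follows an element in its
                # .tail, so that trailing text is removed along with it.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = "<report><title>Daily</title><script>alert('injected')</script></report>"
print(remove_script_elements(sample))
```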

XML Data Sanitization Demos

Below are links to Metadefender.com scanning results for the example files that we created for each of these attacks, along with scanning results for the sanitized versions of those files. (Since our sample XML documents contain examples of the exploits but do not actually perform any malicious actions, Metadefender.com engines do not detect them as malicious.)

You can also download a ZIP archive containing the original and sanitized files by clicking here.

No. | Sample | After Sanitizing | Notes
1 | VBA macro sample | Sanitized VBA macro sample | XML containing VBA macro and sanitized result
2 | XML bomb | Sanitized XML bomb | XML bomb and sanitized result
3 | JavaScript in XML | Sanitized JavaScript in XML | XML has script tag and sanitized result
4 | XML injection sample | Sanitized XML injection sample | XML injection and sanitized result
5 | CDATA injection sample | Sanitized CDATA injection sample | CDATA injection and sanitized result


How to Utilize Metadefender Data Sanitization with XML Documents

[Figure: XML document data sanitization (CDR)]



Research and content development assistance provided by OPSWAT data sanitization team.

Taeil Goh
Chief Technical Officer

Taeil Goh joined OPSWAT in 2008 as a software engineer. Taeil has been involved in Metadefender product development from the early stages, and his huge contributions were reflected in his promotion to CTO in 2016. He is now more focused on mentoring product managers for new innovative OPSWAT technology, investing a lot of time in joint solutions with technical partners and in identifying new technology areas to focus on. He is also responsible for product usability and enterprise security. Taeil spends his free time playing tennis or flying a Cessna 172.
