Remove Extra New Lines and Whitespace Using Java Regex

Extracting text from another source can introduce formatting problems, such as extra blank lines or leading/trailing whitespace on each line.

This commonly occurs when extracting from HTML elements or XML documents.

The regular expressions below, used with String.replaceAll, fix both problems.

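(?m) = multi-line mode, in which ^ and $ match at the start and end of each line instead of only at the start and end of the whole string.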
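To see what the flag changes, here is a minimal sketch that prefixes every position matched by ^ with a marker, once with (?m) and once without. The class and method names (MultilineFlagDemo, markLineStarts) are made up for illustration.

```java
public class MultilineFlagDemo {
    // Inserts "> " at every position where ^ matches.
    // With (?m), ^ matches at the start of each line; without it, only at the start of the input.
    static String markLineStarts(String text, boolean multiline) {
        String regex = multiline ? "(?m)^" : "^";
        return text.replaceAll(regex, "> ");
    }

    public static void main(String[] args) {
        String text = "first\nsecond";
        System.out.println(markLineStarts(text, false));
        System.out.println("---");
        System.out.println(markLineStarts(text, true));
    }
}
```

The same behavior is available by compiling the pattern with the Pattern.MULTILINE flag instead of embedding (?m) in the expression.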

The following removes the leading/trailing whitespace from each line in the string.

node.getTextContent().replaceAll(  
        "(?m)^[\\s&&[^\n]]+|[\\s&&[^\n]]+$", "");
Example:  
   The quick brown fox
      jumps over
         the lazy dog.

Result:  
The quick brown fox  
jumps over  
the lazy dog.  
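The trimming expression can be tried on a plain string without a DOM node. The sketch below assumes the input uses \n line endings; the names TrimLines and trimLines are illustrative only.

```java
public class TrimLines {
    // Strips whitespace at the start and end of every line.
    // [\s&&[^\n]] is a character-class intersection: any whitespace character except the newline,
    // so the line breaks themselves are left in place.
    static String trimLines(String text) {
        return text.replaceAll("(?m)^[\\s&&[^\n]]+|[\\s&&[^\n]]+$", "");
    }

    public static void main(String[] args) {
        String input = "   The quick brown fox  \n"
                     + "      jumps over\n"
                     + "         the lazy dog.   ";
        System.out.println(trimLines(input));
    }
}
```

Excluding the newline from the class is what keeps the lines separate; a plain \s+ would also swallow the line breaks next to the whitespace it removes.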

The following removes extra blank lines from the string.

node.getTextContent().replaceAll("(?m)^[ \t]*\r?\n", "");  
Example:  
The quick brown fox


jumps over



the lazy dog.

Result:  
The quick brown fox  
jumps over  
the lazy dog.  
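A similar standalone sketch for the blank-line expression follows. It treats a line as blank when it holds only spaces or tabs, and accepts both \n and \r\n line endings; RemoveBlankLines and removeBlankLines are hypothetical names.

```java
public class RemoveBlankLines {
    // Deletes every line that is empty or contains only spaces/tabs,
    // together with its line terminator (\n or \r\n).
    static String removeBlankLines(String text) {
        return text.replaceAll("(?m)^[ \t]*\r?\n", "");
    }

    public static void main(String[] args) {
        String input = "The quick brown fox\n\n\n"
                     + "jumps over\n \t\n\r\n"
                     + "the lazy dog.";
        System.out.println(removeBlankLines(input));
    }
}
```

Because the pattern requires a line terminator, a whitespace-only final line with no trailing newline is left as-is.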

References
Regular.Expressions.info - Specifying Modes Inside The Regular Expression

