Search MS Word files in a directory for specific content in Linux

I have a directory structure full of MS Word files, and I need to search the directory for a particular string. Until now I was using the following command to search for data in a directory:

I am a translator and know nothing about scripting, but I was so pissed off about grep not being able to look inside Word .doc files that I worked out how to write this little shell script, which uses catdoc and grep to search a directory of .doc files for a given input string.

The open-source command-line utility crgrep will search many MS document formats.

Replace “string_to_search” in the above command with your text. This command prints the file name(s) of the files containing “string_to_search”.

The best solution I came across was to use unoconv to convert the Word documents to HTML. It also has a .txt output, but that dropped content in my case.

Here’s a way to use “unzip” to print the entire contents to standard output, then pipe to “grep -q” to detect whether the desired string is present in the output. It works for docx format files.
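For anyone who would rather script this, here is a rough Python sketch of the same idea using only the standard library instead of unzip and grep. It assumes the body text lives in word/document.xml inside the archive (the usual place in a .docx) and it does not look at headers, footers, or comments. The function names docx_text and docx_contains are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A phrase can be split across several <w:t> runs, so join them
        # before searching; otherwise a known phrase may not match.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)


def docx_contains(path, needle):
    """True if needle occurs in the document body (case-sensitive)."""
    return needle in docx_text(path)
```

Calling docx_contains("report.docx", "string_to_search") on each file and printing the names that return True gives roughly the same result as the unzip pipeline. A file that is not a valid zip archive raises zipfile.BadZipFile, so a loop over a mixed directory should catch that.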

In a .doc file the text is generally present and can be found by grep, but that text is broken up and interspersed with field codes and formatting information, so searching for a phrase you know is there may not match. A search for something very short has a better chance of matching.

The command is not perfect, because it behaves oddly on small files (the result can be unreliable): for some reason antiword emits this text.

If it is only a few files, you can write a script that uses something like catdoc: loop over each file, run catdoc and grep on it, store the result in a bash variable, and output it if it matches.
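The loop described above could look something like this in Python rather than bash. This is only a sketch: it assumes catdoc is installed and on the PATH, and that its output can be decoded as UTF-8 (undecodable bytes are replaced). The names catdoc_text and find_doc_files, and the extract parameter, are invented for the example.

```python
import subprocess
from pathlib import Path


def catdoc_text(path):
    """Extract plain text from a .doc file by running the catdoc program."""
    result = subprocess.run(
        ["catdoc", str(path)],
        capture_output=True,
        check=False,  # a failed conversion just yields little or no text
    )
    # Assumption: catdoc output is UTF-8; bad bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


def find_doc_files(directory, needle, extract=catdoc_text):
    """Return the .doc files under directory whose text contains needle."""
    matches = []
    for path in sorted(Path(directory).rglob("*.doc")):
        text = extract(path)
        if needle in text:
            matches.append(path)
    return matches
```

Printing each entry of find_doc_files(".", "string_to_search") lists the matching files. The extract parameter is there so another converter, antiword for instance, can be swapped in without touching the loop.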

Perhaps the junk characters are not always the same. It would be great if someone could write a utility that takes all this into account. On my Windows machine the same files respond well to searches.

A .docx file is actually a zip archive collecting several files together in a directory structure (try renaming a .docx to .zip and then unzipping it!). With zip compression it’s unlikely that grep will find anything.
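To see this for yourself without renaming anything, a small Python sketch like the following lists what is inside the archive and checks whether a string is visible in the raw bytes on disk, which is roughly what plain grep gets to look at. The function names are made up for illustration.

```python
import zipfile


def docx_parts(path):
    """List the files stored inside a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def found_in_raw_bytes(path, needle):
    """Check the bytes on disk, without decompressing anything."""
    with open(path, "rb") as handle:
        return needle.encode("utf-8") in handle.read()
```

On a typical .docx, docx_parts should show entries such as word/document.xml, while found_in_raw_bytes will usually return False even for text you know is in the document, because the XML is stored compressed.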

