I am trying to extract a string that is located between the first and second comma in a specific line in a series of text files (subtitle files). The text files are formatted this way:
Subtitles01.txt
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour
Style: Default, Estrangelo Edessa, 57, &H00FFFFFF
Style: Title1, Arno Pro, 65, &H00606066
Subtitles02.txt
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour
Style: OP Eng, Arno Pro, 45, &H00100F11
Style: ED Romaji, Nueva Std Cond, 46, &H00FFFFFF
Subtitles03.txt
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour
Style: OP Eng, Estrangelo Edessa, 45, &H00100F11
Style: Default, Arno Pro, 45, &H00100F11
Style: ED Romaji, Nueva Std Cond, 46, &H00FFFFFF
What I want to achieve here is to extract the Fontname from each line that starts with “Style: ” and then determine which subtitle files contain the fonts I want, without repeats. So essentially the end result would be output to a text file like the following:
Subtitles01.txt: Estrangelo Edessa
Subtitles01.txt: Arno Pro
Subtitles02.txt: Arno Pro
Subtitles02.txt: Nueva Std Cond
Subtitles03.txt: Estrangelo Edessa
Subtitles03.txt: Arno Pro
Subtitles03.txt: Nueva Std Cond
Only Subtitles03.txt is needed.
Since Subtitles03.txt contains all the fonts in Subtitles01.txt and Subtitles02.txt, only Subtitles03.txt is needed. The goal is to use the fewest files possible to cover the unique fonts found across all the files. I have come up with the following batch script using findstr to extract the lines starting with “Style: ” but I am stuck beyond that.
@echo off
findstr /B /C:"Style:" *.txt > results.txt
if %errorlevel%==0 (
echo Found! logged files into results.txt
) else (
echo No matches found
)
Any help would be appreciated. Thank you guys!
I imagine it would be much easier to use some other language besides batch, or at least use non-native utilities. But here is a pure native batch solution.
I don’t see how FINDSTR regex will help with this problem. Unlike many non-native regex utilities, it can only report whole matching lines; it cannot extract a portion of a matching line.
You can use FOR /F to extract the fonts from each file:
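A minimal sketch of that extraction, assuming the font name is always the second comma-separated field of a "Style:" line and that neither file names nor font names contain commas. The loop variable names are arbitrary choices for illustration:

```batch
@echo off
setlocal disableDelayedExpansion

rem Visit each subtitle file in the current folder
for %%F in (*.txt) do (
  rem Keep only lines that begin with Style: and take the second comma-delimited field
  for /f "tokens=2 delims=," %%A in ('findstr /b /c:"Style:" "%%F"') do (
    rem tokens=* drops the leading space that follows the comma
    for /f "tokens=*" %%B in ("%%A") do echo %%F: %%B
  )
)
```

With the sample files this should print lines of the form `Subtitles01.txt: Estrangelo Edessa`. If the output is redirected to a file, it is safer to give that file an extension other than .txt so a later run does not scan it.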
You can use environment variables to come up with a list of unique fonts. Define variables with the font name in the variable name, all prefixed with font_. Only one variable can be defined for a given name. The assigned value does not matter. You can then use set font_ to list all the unique font names. The number of unique names can be counted, or the actual font name can be parsed out (remove the font_ prefix).
The tricky part is establishing the minimum set of files required to cover the complete set of unique font names. I imagine someone could come up with an efficient solution. I’ve just employed a brute force recursive permutation method: I count the number of unique fonts found in each permutation and compare the number to the total number of unique fonts. I have added a few shortcuts to not proceed down a particular permutation path if I’ve already found a smaller complete set than the current set.
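The font_ variable trick for collecting unique names might look roughly like this. It is a sketch under a few assumptions: font names contain no `=` or `!` characters, and the variable names `name` and `count` are made up for the example:

```batch
@echo off
setlocal enableDelayedExpansion

rem Remove any inherited font_ variables so the listing starts clean
for /f "delims==" %%V in ('set font_ 2^>nul') do set "%%V="

rem One variable per font name; a repeated name just redefines the same variable
for %%F in (*.txt) do (
  for /f "tokens=2 delims=," %%A in ('findstr /b /c:"Style:" "%%F"') do (
    for /f "tokens=*" %%B in ("%%A") do set "font_%%B=1"
  )
)

rem List the names with the 5-character font_ prefix removed, counting as we go
set /a count=0
for /f "delims==" %%V in ('set font_ 2^>nul') do (
  set "name=%%V"
  echo !name:~5!
  set /a count+=1
)
echo Unique fonts: !count!
```

Because environment variable names are case-insensitive, fonts that differ only in letter case collapse into one entry, which is probably what you want for font names.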
The code could be simpler if I used SETLOCAL in my recursion, but batch is limited to only 32 levels of SETLOCAL. I wanted a solution that could support more than 32 files, although I’m a bit worried about performance with that many files.
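As a rough illustration of the brute-force idea, here is a simplified, non-recursive variant. It is not the :permuteFiles routine from this answer: it enumerates every subset of files with a bitmask, has none of the pruning shortcuts, and rescans the files with findstr for each subset, so it is only practical for a handful of files. It also assumes fewer than about 30 files (the masks are 32-bit integers) and no `!` or `=` in file or font names. The prefixes `file_` and `seen_` and the other variable names are hypothetical:

```batch
@echo off
setlocal enableDelayedExpansion

rem Drop inherited variables that would collide with the prefixes used below
for %%P in (font_ seen_ file_) do (
  for /f "delims==" %%V in ('set %%P 2^>nul') do set "%%V="
)

rem Number the files and record every font name under a font_ variable
set /a fileCount=0
for %%F in (*.txt) do (
  set /a fileCount+=1
  set "file_!fileCount!=%%F"
  for /f "tokens=2 delims=," %%A in ('findstr /b /c:"Style:" "%%F"') do (
    for /f "tokens=*" %%B in ("%%A") do set "font_%%B=1"
  )
)

rem Count the unique fonts across all files
set /a total=0
for /f "delims==" %%V in ('set font_ 2^>nul') do set /a total+=1
if %total% equ 0 (
  echo No Style lines found
  exit /b 1
)

rem Try every non-empty subset of files; bit I-1 of the mask selects file I
set /a "last=(1<<fileCount)-1, best=fileCount+1"
set "bestMask="
for /l %%M in (1,1,%last%) do (
  for /f "delims==" %%V in ('set seen_ 2^>nul') do set "%%V="
  set /a size=0
  for /l %%I in (1,1,%fileCount%) do (
    set /a "bit=(%%M>>(%%I-1))&1"
    if !bit! equ 1 (
      set /a size+=1
      for %%N in ("!file_%%I!") do (
        for /f "tokens=2 delims=," %%A in ('findstr /b /c:"Style:" %%N') do (
          for /f "tokens=*" %%B in ("%%A") do set "seen_%%B=1"
        )
      )
    )
  )
  set /a covered=0
  for /f "delims==" %%V in ('set seen_ 2^>nul') do set /a covered+=1
  rem Keep the subset if it covers every font using fewer files than the best so far
  if !covered! equ %total% if !size! lss !best! (
    set /a best=size
    set "bestMask=%%M"
  )
)

rem Report the files in the smallest covering subset found
echo Smallest covering set: !best! of %fileCount% files
for /l %%I in (1,1,%fileCount%) do (
  set /a "bit=(bestMask>>(%%I-1))&1"
  if !bit! equ 1 echo !file_%%I!
)
```

This version never nests SETLOCAL, so the 32-level limit does not apply, but the number of subsets doubles with each extra file, which is the performance concern mentioned above.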
Edit - I fixed a bug in my :permuteFiles routine that surfaced once I had more than 3 files.
Here are the results using your example input: