InDesign CC Automation Operation

Strip HTML tags and load the text into a text frame

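■Program description (source code explanation)
To strip the tags from an HTML file and load its data into a text frame, use regular expressions. Regular expressions remove the unwanted tags and similar markup. Entities such as &amp; can be replaced easily with replace(). The sample also handles content that spans multiple lines and removes unnecessary whitespace; if you do not need those steps, delete the corresponding lines. The comment beside each line describes what that line does.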
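The tag-stripping step described above can be sketched as a standalone function. This is a minimal sketch, not the script itself: the name stripTags and the sample string are hypothetical, and the patterns use [\s\S] so that a match can span line breaks.

```javascript
// Hypothetical helper: strips markup from an HTML string with regular expressions.
function stripTags(html) {
    var txt = html;
    txt = txt.replace(/<head[\s\S]*?<\/head[^>]*>/gi, "");     // head section
    txt = txt.replace(/<script[\s\S]*?<\/script[^>]*>/gi, ""); // script elements
    txt = txt.replace(/<!--[\s\S]*?-->/g, "");                 // comments
    txt = txt.replace(/<[^>]*>/g, "");                         // any remaining tags
    return txt;
}

// Sample input invented for illustration.
var sample = "<html><head><title>t</title></head><body><!-- note -->" +
             "<p>Hello <b>world</b></p><script>var a = 1;</script></body></html>";
var result = stripTags(sample); // "Hello world"
```

The head, script, and comment patterns run before the generic tag pattern so that their inner text (titles, script code, comment text) is removed together with the markup rather than left behind.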

■Source code
var pageObj = app.documents.add(); // create a new document
var txtObj = pageObj.textFrames.add();
txtObj.visibleBounds = ["2cm","2cm","10000cm","180cm"];
var htmlFile = new File("/id_text/0.html");
txtObj.place(htmlFile); // place the HTML file into the frame as text
var txt = txtObj.contents;
var str = new RegExp("<head[\\s\\S]*?</head[^>]*>","gi"); // remove the head section ([\s\S] also matches line breaks)
txt = txt.replace(str,"");
str = new RegExp("<script[\\s\\S]*?</script[^>]*>","gi"); // remove script elements
txt = txt.replace(str,"");
str = new RegExp("<!--[\\s\\S]*?-->","g"); // remove comments
txt = txt.replace(str,"");
str = new RegExp("<[^>]*>","g"); // remove the remaining HTML tags
txt = txt.replace(str,"");
str = new RegExp(" +","g"); // collapse runs of half-width spaces
txt = txt.replace(str," ");
str = new RegExp("　+","g"); // collapse runs of full-width spaces
txt = txt.replace(str,"　");
str = new RegExp(String.fromCharCode(9)+"+","g"); // collapse runs of TABs into one full-width space
txt = txt.replace(str,"　");
str = new RegExp(String.fromCharCode(13)+"+","g"); // collapse consecutive line breaks
txt = txt.replace(str,String.fromCharCode(13));
str = new RegExp("　"+String.fromCharCode(13),"g"); // remove a full-width space followed by a line break
txt = txt.replace(str,"");
txt = txt.replace(/&lt;/g,"<"); // convert to <
txt = txt.replace(/&gt;/g,">"); // convert to >
txt = txt.replace(/&nbsp;/g," "); // convert to a half-width space
txt = txt.replace(/&#x?[0-9a-f]+;/gi,""); // delete numeric references such as &#000;
txt = txt.replace(/&amp;/g,"&"); // convert to & (last, so that e.g. &amp;lt; is not decoded twice)
txtObj.contents = txt;
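The order of the entity replacements matters. The sketch below isolates that step in a hypothetical function, decodeEntities, and decodes &amp; last so that text such as &amp;lt; (a literal "&lt;" in the page) is not decoded twice. Like the script, it simply deletes numeric references rather than converting them.

```javascript
// Hypothetical helper: decodes the handful of entities the script handles.
function decodeEntities(txt) {
    txt = txt.replace(/&lt;/g, "<");
    txt = txt.replace(/&gt;/g, ">");
    txt = txt.replace(/&nbsp;/g, " ");
    txt = txt.replace(/&#x?[0-9a-f]+;/gi, ""); // numeric references are dropped, not converted
    txt = txt.replace(/&amp;/g, "&");          // last, to avoid double decoding
    return txt;
}

var decoded = decodeEntities("a &lt; b &amp;&amp; c &gt; d"); // "a < b && c > d"
var literal = decodeEntities("&amp;lt;");                      // "&lt;"
```

If &amp; were decoded first, the second example would become "<" instead of the literal text "&lt;".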

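■Key points
For Japanese text, save the HTML file in Shift JIS. Use LF-only or CR-only line endings; the script does not work with Windows (DOS) CR+LF line endings. Convert the file with Ruby, Perl, nkf, or a similar tool before running this script.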
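The line-ending conversion described above amounts to the following transformation, sketched here in JavaScript on a string. The function name normalizeLineEndings is hypothetical. Whether applying it inside the script (for example to txtObj.contents) is enough to avoid the CR+LF problem is an assumption that has not been verified in InDesign; converting the file beforehand, as recommended above, remains the documented route.

```javascript
// Hypothetical helper: turns CR+LF and LF line endings into a single CR,
// the character the script itself uses (String.fromCharCode(13)).
function normalizeLineEndings(txt) {
    return txt.replace(/\r\n/g, "\r").replace(/\n/g, "\r");
}

var normalized = normalizeLineEndings("a\r\nb\nc\rd"); // "a\rb\rc\rd"
```

CR+LF pairs are replaced first so that each pair yields one CR instead of two.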

■Download the actual script (sample.jsx.zip)