Ë «q±iZ<ãó¤—dZddlZddlZddlZddlmZddlmZmZddl m Z e je«Z ddiZGd „d «ZGd„de«ZdgZy) z"Tokenization class for model MyT5.éN)Údefaultdicté)Ú AddedTokenÚPreTrainedTokenizer)ÚloggingÚ vocab_filezbyte_maps.jsoncóÊ—eZdZdZdZdeeeefzfd„Zdeeeeezfdedefd„Z deeefd eeeeezffd „Z deed deezfd „Zddeed eefd„Zy)ÚByteRewriteraZ Byte rewriter class for MyT5 tokenizer. This class is used to rewrite bytes using a hash tree. The hash tree is constructed from a set of rewriting rules. Args: rewriting_rules (`str` or `dict[str, str]`): A path to a json file containing the rewriting rules or a dictionary containing the rewriting rules. z[LEAF]Úrewriting_rulescóŠ—t|t«r+t|d«5}tj|«}ddd«n't|t «st dt|«›«‚|j|«|_ |j«Dcic]\}}||“Œ }}}|j|«|_y#1swYŒYxYwcc}}w)NÚrzDrewriting_rules should be either a path to json file or a dict, got )Ú isinstanceÚstrÚopenÚjsonÚloadÚdictÚ TypeErrorÚtypeÚconstruct_hash_treeÚ hash_treeÚitemsÚreverse_hash_tree)ÚselfrÚfÚkÚvÚreverse_rewriting_ruless ú\/opt/pipecat/venv/lib/python3.12/site-packages/transformers/models/myt5/tokenization_myt5.pyÚ__init__zByteRewriter.__init__,sº€Üo¤sÔ+Üo sÓ+ð /¨qÜ"&§)¡)¨A£,÷ /ð /ä˜O¬TÔ2ÜØVÔW[Ð\kÓWlÐVmÐnóð ð×1Ñ1°/ÓBˆŒØ4C×4IÑ4IÓ4K×"L©D¨A¨q 1 a¡4Ð"LÐÑ"LØ!%×!9Ñ!9Ð:QÓ!RˆÕ÷ /ð /üó#MsB3Â B?Â3B<rÚbyte_in_sequenceÚbyte_out_sequencecó”—|jd«}|jd«}|}|D]}||vri||<||}Œ|||j<y)zL Add a leaf with the output byte sequence to the hash tree. ú N)ÚsplitÚLEAF)rrr!r"Úbyte_in_listÚ byte_out_listÚtree_pointerÚbs rÚadd_leafzByteRewriter.add_leaf9sb€ð(×-Ñ-¨cÓ2ˆØ)×/Ñ/°Ó4ˆ à ˆØò +ˆAØ˜Ñ$Ø"$˜Q‘Ø'¨™?‰Lð +ð #0ˆT—Y‘YÒóÚreturncóÎ—tt«}d„td«D«D]}|g|||j<Œ|j «D]\}}|j|||«Œ|S)zE Construct a hash tree for rewritten byte sequences. c3ó$K—|]}|d›–—Œ yw)Ú02xN©)Ú.0Úxs rú z3ByteRewriter.construct_hash_tree..Msèø€Ò1 QsG“*Ñ1ùs‚é)rrÚranger&rr+)rrrr*Úin_sequenceÚout_sequences rrz ByteRewriter.construct_hash_treeHss€ô ¤Ó%ˆ Ù1¤e¨C£jÔ1ò *ˆAØ'( cˆIa‰L˜Ÿ™Ò#ð *ð*9×)>Ñ)>Ó)@ò @Ñ%ˆK˜ØM‰M˜) [°,Õ?ð @ðÐr,Ú byte_sequenceNcó\—|j}|D] }||vr||}Œ y||jS)zW Search the hash tree and return the rewritten byte sequence if found. N)rr&)rr9r)r*s rÚsearch_hash_treezByteRewriter.search_hash_treeUsA€ð—~‘~ˆØò ˆAØLÑ Ø+¨A™‘áð ð˜DŸI™IÑ&Ð&r,Úin_bytescóZ—g}d}d}|t|«kr–|s|jn|j}t|t|««D]?}||}||vr||}n||k(r|g} |}n$n"|j|vsŒ/||j} |}ŒA|j «|dz}|t|«krŒ–|S)a6 Rewrite a sequence of bytes using the hash tree. Args: in_bytes (`list[str]`): A list of bytes to be rewritten. reverse (`bool`): If True, decoding is performed with the reverse hash tree. Returns: `list[str]`: The rewritten byte sequence. ré)Úlenrrr6r&Úextend) rr<ÚreverseÚ out_bytesÚb_startÚb_endr)Újr*Úcur_leafs rÚ rewrite_byteszByteRewriter.rewrite_bytesbsÍ€ðˆ ØˆØˆàœ˜H› Ò%Ù18˜4Ÿ>š>¸d×>TÑ>TˆLÜ˜7¤C¨£MÓ2ò Ø˜Q‘KØ˜Ñ$Ø#/°¡?‘LØ˜'’\Ø !˜sHØEÙáØ—9‘9 Ò,Ø+¨D¯I©IÑ6HØ‘Eð ð ×Ñ˜XÔ&Ø˜a‘iˆGð!œ˜H› Ó%ð$Ðr,)F) Ú__name__Ú __module__Ú__qualname__Ú__doc__r&rrr Úlistr+rr;rGr1r,rr r sÅ„ñð€DðS¨¨d°3¸°8©nÑ(<óSð 0 $ s¨D°4¸±9Ñ,<Ð'<Ñ"=ð 0ÐQTð 0Ðiló 0ð°4¸¸S¸±>ðÀdÈ3ÐPTÐW[Ð\_ÑW`ÑP`ÐK`ÑFaóð'¨d°3©ið'¸DÀ4ÈÁ9Ñ"`): The end of sequence token. unk_token (`str`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. pad_token (`str`, *optional*, defaults to `""`): The token used for padding, for example when batching sequences of different lengths. extra_ids (`int`, *optional*, defaults to 125): Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are accessible as "" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary up to beginning ("" is the last token in the vocabulary like in ByT5 preprocessing see [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)). additional_special_tokens (`list[str]`, *optional*): Additional special tokens used by the tokenizer. Ú input_idsÚattention_maskNr-c óì•—|dkDr|€t|«Dcgc]}d|›d‘Œ }}nK|dkDrF|Dt|«dkDr6tttd„|«««} | |k7rt d|›d|›d«‚t|t«rt|dd¬ «n|}t|t«rt|dd¬ «n|}t|t«rt|dd¬ «n|}|||d œ|_t|j«|_ d|_ tjt|d««|_t|jd «|_t|jd«|_t%‰ |Ld|||d|dœ|¤Žycc}w)Nrz có.—tdt|«v«S)NÚextra_id)Úboolr)r3s rúz(MyT5Tokenizer.__init__..²s€´D¸ÄsÈ1ÃvÐ9MÓ4N€r,zBoth extra_ids (z!) and additional_special_tokens (zm) are provided to MyT5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokensT)ÚlstripÚrstrip)rr>ér5r Ú decompose_mapÚ merge_map)Ú eos_tokenÚ unk_tokenÚ pad_tokenÚ extra_idsÚadditional_special_tokensr1)r6r?ÚsetÚfilterÚ ValueErrorrrrÚ_added_tokens_decoderÚoffsetÚ_utf_vocab_sizerrrÚ byte_mapsr Údecompose_rewriterÚmerge_rewriterÚsuperr )rrr\r]r^r_r`ÚkwargsÚiÚextra_tokensÚ __class__s €rr zMyT5Tokenizer.__init__£sŽø€ðqŠ=Ð6Ð>ÜDIÈ)ÓDTÖ(U¸q¨:°a°S¸Ò):Ð(UÐ%Ñ(UØ ˜Š]Ð8ÐDÌÐMfÓIgÐjkÒIkäœs¤6Ñ*NÐPiÓ#jÓkÓlˆLØ˜yÒ(Ü Ø& y kÐ1RÐSlÐRmðn(ð(óðôHRÐR[Ô]`ÔGa”J˜y°¸dÕCÐgpˆ ÜGQÐR[Ô]`ÔGa”J˜y°¸dÕCÐgpˆ ÜGQÐR[Ô]`ÔGa”J˜y°¸dÕCÐgpˆ à)2°yÀYÑ%OˆÔ"Ü˜$×4Ñ4Ó5ˆŒØ#ˆÔôŸ™¤4¨ °CÓ#8Ó9ˆŒä".¨t¯~©~¸oÑ/NÓ"OˆÔÜ*¨4¯>©>¸+Ñ+FÓGˆÔä ‰Ñð ØØØØØ&?ñ ðó ùò3)Vs– E1có—|jS©N)rf)rs rÚ vocab_sizezMyT5Tokenizer.vocab_sizeÑs€à×#Ñ#Ð#r,cóÄ—t|j|jz«Dcic]}|j|«|“Œ}}|j |j «|Scc}wrp)r6rqreÚconvert_ids_to_tokensÚupdateÚadded_tokens_encoder)rrlÚvocabs rÚ get_vocabzMyT5Tokenizer.get_vocabÖsW€Ü;@ÀÇÁÐSW×S^ÑS^ÑA^Ó;_Ö`°a×+Ñ+¨AÓ.°Ñ1Ð`ˆÐ`Ø ‰T×.Ñ.Ô/Øˆùòas¥AÚtoken_ids_0Útoken_ids_1Úalready_has_special_tokenscó¤•—|rt‰|||d¬«S|€dgt|«zdgzSdgt|«zdgzdgt|«zzdgzS)aÄ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method. Args: token_ids_0 (`list[int]`): List of IDs. token_ids_1 (`list[int]`, *optional*): Optional second list of IDs for sequence pairs. already_has_special_tokens (`bool`, *optional*, defaults to `False`): Whether or not the token list is already formatted with special tokens for the model. Returns: `list[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. T)rxryrzrr>)rjÚget_special_tokens_maskr?)rrxryrzrns €rr|z%MyT5Tokenizer.get_special_tokens_maskÜsyø€ñ$&Ü‘7Ñ2Ø'°[Ð]að3óð ð ÐØCœ#˜kÓ*Ñ*¨q¨cÑ1Ð1Ø”c˜+Ó&Ñ&¨1¨#Ñ-°!°´s¸;Ó7GÑ1GÑHÈAÈ3ÑNÐNr,Ú token_idscó¬—t|«dkDr7|d|jk(r%tjd|j›d«|S||jgzS)z.Do not add eos again if user already added it.réÿÿÿÿzThis sequence already has zQ. In future versions this behavior may lead to duplicated eos tokens being added.)r?Úeos_token_idÚwarningsÚwarnr\)rr}s rÚ_add_eos_if_not_presentz%MyT5Tokenizer._add_eos_if_not_presentøs]€äˆy‹>˜AÒ )¨B¡-°4×3DÑ3DÒ"DÜM‰MØ,¨T¯^©^Ð,<ð=+ð+ô ðÐà × 1Ñ 1Ð2Ñ2Ð2r,cót—|jg}|€t||z«dgzSt||z|z|z«dgzS)aÉ Create a mask from the two sequences passed to be used in a sequence-pair classification task. MyT5 does not make use of token type ids, therefore a list of zeros is returned. Args: token_ids_0 (`list[int]`): List of IDs. token_ids_1 (`list[int]`, *optional*): Optional second list of IDs for sequence pairs. Returns: `list[int]`: List of zeros. r)r€r?)rrxryÚeoss rÚ$create_token_type_ids_from_sequencesz2MyT5Tokenizer.create_token_type_ids_from_sequencessP€ð × Ñ Ð!ˆàÐÜ{ SÑ(Ó)¨Q¨CÑ/Ð/Ü; Ñ$ {Ñ2°SÑ8Ó9¸Q¸CÑ?Ð?r,cóX—|j|«}|€|S|j|«}||zS)a‚ Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format: - single sequence: `X ` - pair of sequences: `A B ` Args: token_ids_0 (`list[int]`): List of IDs to which the special tokens will be added. token_ids_1 (`list[int]`, *optional*): Optional second list of IDs for sequence pairs. Returns: `list[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. )rƒ)rrxrys rÚ build_inputs_with_special_tokensz.MyT5Tokenizer.build_inputs_with_special_tokenss;€ð&×2Ñ2°;Ó?ˆØÐØÐà×6Ñ6°{ÓCˆKØ Ñ,Ð,r,Útextcór—|jd«Dcgc]}|d›‘Œ}}|j|«}|Scc}w)z‡Take as input a string and return a list of strings (tokens) for words/sub-words. Represents tokens in two character hex formatúutf-8r0)ÚencodeÚmorphological_encode)rr‰rkrlÚtokenss rÚ _tokenizezMyT5Tokenizer._tokenize4s@€ð'+§k¡k°'Ó&:Ö; QsG‘*Ð;ˆÐ;Ø×*Ñ*¨6Ó2ˆØˆ ùòÓ?Ä#Àd×F_ÑF_ÓB`Ñ`ˆ Øò 0ˆEØ˜ Ñ%Øœ5 ¨Ó0Ñ0‘àœ5Ÿ=™=¨Ó/Ñ/‘ð 0ð —‘ °Ó9ˆØˆ r,Úsave_directoryÚfilename_prefixcón—tjj|«r2tjj||r|dzndtdz«}n|r|dznd|z}t|dd¬«5}|j tj|jdd¬ ««ddd«|fS#1swY|fSxYw) Nú-ÚrÚwr‹)ÚencodingrYF)ÚindentÚensure_ascii) ÚosÚpathÚisdirÚjoinÚVOCAB_FILES_NAMESrÚwriterÚdumpsrg)rrr®rÚwriters rÚsave_vocabularyzMyT5Tokenizer.save_vocabularyns¢€Ü 7‰7=‰=˜Ô(ÜŸ™Ÿ™Ø¹/ °3Ò!6ÈrÔUfÐgsÑUtÑ tó‰Jñ4C˜/¨CÒ/ÈÈnÑ\ˆJÜ *˜c¨GÔ 4ð S¸ØL‰LœŸ™ D§N¡N¸1È5ÔQÔR÷ Sàˆ}Ð÷ Sàˆ}ÐúsÁ,2B)Â)B4)zzzé}N)r-N)NFrp)rHrIrJrKÚmodel_input_namesrºÚvocab_files_namesr ÚpropertyrqrwrLr’rUr|rƒr†rˆrrr•r˜rržr¬Útupler¾Ú __classcell__)rns@rrNrN…s¢ø„ñð4%Ð&6Ð7ÐØ)Ðð ØØØØ"&ð, ð õ, ð\ñ$óð$òðpuñOØ ™9ðOØ37¸±9¸tÑ3CðOØhlðOà ˆc‰õOð8 3°°c±ð 3¸tÀC¹yó 3ðGKñ@Ø ™9ð@Ø37¸±9¸tÑ3Cð@à ˆc‰ó@ð0GKñ-Ø ™9ð-Ø37¸±9¸tÑ3Cð-à ˆc‰ó-ð4˜cð°°S± óòòð ¨D°©Ið¸$¸s¹)óð¨D°©Ið¸$¸s¹)óòñ. ¨cð ÀCÈ$ÁJð ÐZ_Ð`cÑZd÷ r,rN)rKrr¶rÚcollectionsrÚtokenization_pythonrrÚutilsrÚ get_loggerrHÚloggerrºr rNÚ__all__r1r,rúrËseðñ)ãÛ ÛÝ#çBÝð ˆ× Ñ ˜HÓ %€ð"Ð#3Ð4Ð÷cñcôLrÐ'ôrðjÐ r,