Metadata-Version: 1.1
Name: Tempita
Version: 0.5.1
Summary: A very small text templating language
Home-page: http://pythonpaste.org/tempita/
Author: Ian Bicking
Author-email: ianb@colorstudy.com
License: MIT
Description-Content-Type: UNKNOWN
Description: Tempita is a small templating language for text substitution.
        
        This isn't meant to be the Next Big Thing in templating; it's just a
        handy little templating language for when your project outgrows
        ``string.Template`` or ``%`` substitution.  It's small, it embeds
        Python in strings, and it doesn't do much else.
        
        You can read about the `language
        <http://pythonpaste.org/tempita/#the-language>`_, the `interface
        <http://pythonpaste.org/tempita/#the-interface>`_, and there's nothing
        more to learn about it.
        
        You can install from the `svn repository
        <http://svn.pythonpaste.org/Tempita/trunk#Tempita-dev>`__ with
        ``easy_install Tempita==dev``.
        
Keywords: templating template language html
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Text Processing
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3

tempita

setup.cfg
setup.py
Tempita.egg-info/PKG-INFO
Tempita.egg-info/SOURCES.txt
Tempita.egg-info/dependency_links.txt
Tempita.egg-info/top_level.txt
Tempita.egg-info/zip-safe
tempita/__init__.py
tempita/_looper.py
tempita/compat3.py
[console_scripts]
chardetect = chardet.cli.chardetect:main

Metadata-Version: 2.1
Name: chardet
Version: 3.0.4
Summary: Universal encoding detector for Python 2 and 3
Home-page: https://github.com/chardet/chardet
Author: Mark Pilgrim
Author-email: mark@diveintomark.org
Maintainer: Daniel Blanchard
Maintainer-email: dan.blanchard@gmail.com
License: LGPL
Keywords: encoding,i18n,xml
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
License-File: LICENSE

Chardet: The Universal Character Encoding Detector
--------------------------------------------------

.. image:: https://img.shields.io/travis/chardet/chardet/stable.svg
   :alt: Build status
   :target: https://travis-ci.org/chardet/chardet

.. image:: https://img.shields.io/coveralls/chardet/chardet/stable.svg
   :target: https://coveralls.io/r/chardet/chardet

.. image:: https://img.shields.io/pypi/v/chardet.svg
   :target: https://warehouse.python.org/project/chardet/
   :alt: Latest version on PyPI

.. image:: https://img.shields.io/pypi/l/chardet.svg
   :alt: License


Detects
 - ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
 - Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
 - EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
 - EUC-KR, ISO-2022-KR (Korean)
 - KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
 - ISO-8859-5, windows-1251 (Bulgarian)
 - ISO-8859-1, windows-1252 (Western European languages)
 - ISO-8859-7, windows-1253 (Greek)
 - ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
 - TIS-620 (Thai)

.. note::
   Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily
   disabled until we can retrain the models.

Requires Python 2.6, 2.7, or 3.3+.

Installation
------------

Install from `PyPI <https://pypi.python.org/pypi/chardet>`_::

    pip install chardet

Documentation
-------------

For users, docs are now available at https://chardet.readthedocs.io/.

Command-line Tool
-----------------

chardet comes with a command-line script which reports on the encodings of one
or more files::

    % chardetect somefile someotherfile
    somefile: windows-1252 with confidence 0.5
    someotherfile: ascii with confidence 1.0
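
A similar report can be produced from Python. The sketch below assumes the
third-party ``chardet`` package is installed and uses only its
``chardet.detect()`` function, which takes a bytes object and returns a
dictionary with ``encoding`` and ``confidence`` keys. ``report_encoding`` is a
hypothetical helper name, not part of chardet.

```python
def report_encoding(data):
    """Return a chardetect-style summary for a bytes object."""
    try:
        # Third-party dependency; assumed installed via `pip install chardet`.
        import chardet
    except ImportError:
        return "chardet is not installed"
    result = chardet.detect(data)
    return "{0} with confidence {1}".format(result["encoding"],
                                            result["confidence"])


print(report_encoding(b"plain ASCII text"))
```

The reported confidence is a heuristic score, so short inputs may yield a low
value or no encoding at all.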

About
-----

This is a continuation of Mark Pilgrim's excellent chardet. Previously, two
versions needed to be maintained: one that supported Python 2.x and one that
supported Python 3.x.  We've recently merged with `Ian Cordasco <https://github.com/sigmavirus24>`_'s
`charade <https://github.com/sigmavirus24/charade>`_ fork, so now we have one
coherent version that works for Python 2.6+.

:maintainer: Dan Blanchard



chardet
LICENSE
MANIFEST.in
NOTES.rst
README.rst
setup.cfg
setup.py
test.py
chardet/__init__.py
chardet/big5freq.py
chardet/big5prober.py
chardet/chardistribution.py
chardet/charsetgroupprober.py
chardet/charsetprober.py
chardet/codingstatemachine.py
chardet/compat.py
chardet/cp949prober.py
chardet/enums.py
chardet/escprober.py
chardet/escsm.py
chardet/eucjpprober.py
chardet/euckrfreq.py
chardet/euckrprober.py
chardet/euctwfreq.py
chardet/euctwprober.py
chardet/gb2312freq.py
chardet/gb2312prober.py
chardet/hebrewprober.py
chardet/jisfreq.py
chardet/jpcntx.py
chardet/langbulgarianmodel.py
chardet/langcyrillicmodel.py
chardet/langgreekmodel.py
chardet/langhebrewmodel.py
chardet/langhungarianmodel.py
chardet/langthaimodel.py
chardet/langturkishmodel.py
chardet/latin1prober.py
chardet/mbcharsetprober.py
chardet/mbcsgroupprober.py
chardet/mbcssm.py
chardet/sbcharsetprober.py
chardet/sbcsgroupprober.py
chardet/sjisprober.py
chardet/universaldetector.py
chardet/utf8prober.py
chardet/version.py
chardet.egg-info/PKG-INFO
chardet.egg-info/SOURCES.txt
chardet.egg-info/dependency_links.txt
chardet.egg-info/entry_points.txt
chardet.egg-info/top_level.txt
chardet/cli/__init__.py
chardet/cli/chardetect.py
docs/.gitignore
docs/Makefile
docs/README.md
docs/conf.py
docs/faq.rst
docs/how-it-works.rst
docs/index.rst
docs/make.bat
docs/supported-encodings.rst
docs/usage.rst
docs/api/chardet.rst
docs/api/modules.rst
tests/README.txt
tests/Big5/0804.blogspot.com.xml
tests/Big5/_chromium_Big5_with_no_encoding_specified.html
tests/Big5/_ude_1.txt
tests/Big5/blog.worren.net.xml
tests/Big5/carbonxiv.blogspot.com.xml
tests/Big5/catshadow.blogspot.com.xml
tests/Big5/coolloud.org.tw.xml
tests/Big5/digitalwall.com.xml
tests/Big5/ebao.us.xml
tests/Big5/fudesign.blogspot.com.xml
tests/Big5/kafkatseng.blogspot.com.xml
tests/Big5/ke207.blogspot.com.xml
tests/Big5/leavesth.blogspot.com.xml
tests/Big5/letterlego.blogspot.com.xml
tests/Big5/linyijen.blogspot.com.xml
tests/Big5/marilynwu.blogspot.com.xml
tests/Big5/myblog.pchome.com.tw.xml
tests/Big5/oui-design.com.xml
tests/Big5/sanwenji.blogspot.com.xml
tests/Big5/sinica.edu.tw.xml
tests/Big5/sylvia1976.blogspot.com.xml
tests/Big5/tlkkuo.blogspot.com.xml
tests/Big5/unoriginalblog.com.xml
tests/Big5/upsaid.com.xml
tests/Big5/willythecop.blogspot.com.xml
tests/Big5/ytc.blogspot.com.xml
tests/CP932/hardsoft.at.webry.info.xml
tests/CP932/www2.chuo-u.ac.jp-suishin.xml
tests/CP932/y-moto.com.xml
tests/CP949/ricanet.com.xml
tests/EUC-JP/_mozilla_bug426271_text-euc-jp.html
tests/EUC-JP/_mozilla_bug431054_text.html
tests/EUC-JP/_mozilla_bug620106_text.html
tests/EUC-JP/_ude_1.txt
tests/EUC-JP/aivy.co.jp.xml
tests/EUC-JP/akaname.main.jp.xml
tests/EUC-JP/arclamp.jp.xml
tests/EUC-JP/aristrist.s57.xrea.com.xml
tests/EUC-JP/artifact-jp.com.xml
tests/EUC-JP/atom.ycf.nanet.co.jp.xml
tests/EUC-JP/azito.under.jp.xml
tests/EUC-JP/azoz.org.xml
tests/EUC-JP/blog.kabu-navi.com.atom.xml
tests/EUC-JP/blog.kabu-navi.com.xml
tests/EUC-JP/bphrs.net.xml
tests/EUC-JP/ch.kitaguni.tv.xml
tests/EUC-JP/club.h14m.org.xml
tests/EUC-JP/contents-factory.com.xml
tests/EUC-JP/furusatonoeki.cutegirl.jp.xml
tests/EUC-JP/manana.moo.jp.xml
tests/EUC-JP/mimizun.com.xml
tests/EUC-JP/misuzilla.org.xml
tests/EUC-JP/overcube.com.atom.xml
tests/EUC-JP/overcube.com.xml
tests/EUC-JP/pinkupa.com.xml
tests/EUC-JP/rdf.ycf.nanet.co.jp.xml
tests/EUC-JP/siesta.co.jp.aozora.xml
tests/EUC-JP/tls.org.xml
tests/EUC-JP/yukiboh.moo.jp.xml
tests/EUC-KR/_chromium_windows-949_with_no_encoding_specified.html
tests/EUC-KR/_mozilla_bug9357_text.html
tests/EUC-KR/_ude_euc1.txt
tests/EUC-KR/_ude_euc2.txt
tests/EUC-KR/acnnewswire.net.xml
tests/EUC-KR/alogblog.com.xml
tests/EUC-KR/arts.egloos.com.xml
tests/EUC-KR/birder.egloos.com.xml
tests/EUC-KR/blog.bd-lab.com.xml
tests/EUC-KR/blog.empas.com.xml
tests/EUC-KR/blog.rss.naver.com.xml
tests/EUC-KR/calmguy.egloos.com.xml
tests/EUC-KR/chisato.info.xml
tests/EUC-KR/console.linuxstudy.pe.kr.xml
tests/EUC-KR/critique.or.kr.xml
tests/EUC-KR/epitaph.egloos.com.xml
tests/EUC-KR/ittrend.egloos.com.xml
tests/EUC-KR/jely.egloos.com.xml
tests/EUC-KR/jely.pe.kr.xml
tests/EUC-KR/jowchung.oolim.net.xml
tests/EUC-KR/kina.egloos.com.xml
tests/EUC-KR/lennon81.egloos.com.xml
tests/EUC-KR/oroll.egloos.com.xml
tests/EUC-KR/poliplus.egloos.com.xml
tests/EUC-KR/scarletkh2.egloos.com.xml
tests/EUC-KR/siwoo.org.xml
tests/EUC-KR/sparcs.kaist.ac.kr.xml
tests/EUC-KR/tori02.egloos.com.xml
tests/EUC-KR/willis.egloos.com.xml
tests/EUC-KR/xenix.egloos.com.xml
tests/EUC-KR/yunho.egloos.com.xml
tests/EUC-KR/zangsalang.egloos.com.xml
tests/EUC-TW/_ude_euc-tw1.txt
tests/GB2312/14.blog.westca.com.xml
tests/GB2312/2.blog.westca.com.xml
tests/GB2312/_chromium_gb18030_with_no_encoding_specified.html.xml
tests/GB2312/_mozilla_bug171813_text.html
tests/GB2312/acnnewswire.net.xml
tests/GB2312/bbs.blogsome.com.xml
tests/GB2312/cappuccinos.3322.org.xml
tests/GB2312/chen56.blogcn.com.xml
tests/GB2312/cindychen.com.xml
tests/GB2312/cnblog.org.xml
tests/GB2312/coverer.com.xml
tests/GB2312/eighthday.blogspot.com.xml
tests/GB2312/godthink.blogsome.com.xml
tests/GB2312/jjgod.3322.org.xml
tests/GB2312/lily.blogsome.com.xml
tests/GB2312/luciferwang.blogcn.com.xml
tests/GB2312/pda.blogsome.com.xml
tests/GB2312/softsea.net.xml
tests/GB2312/w3cn.org.xml
tests/GB2312/xy15400.blogcn.com.xml
tests/IBM855/_ude_1.txt
tests/IBM855/aif.ru.health.xml
tests/IBM855/aug32.hole.ru.xml
tests/IBM855/aviaport.ru.xml
tests/IBM855/blog.mlmaster.com.xml
tests/IBM855/forum.template-toolkit.ru.1.xml
tests/IBM855/forum.template-toolkit.ru.4.xml
tests/IBM855/forum.template-toolkit.ru.6.xml
tests/IBM855/forum.template-toolkit.ru.8.xml
tests/IBM855/forum.template-toolkit.ru.9.xml
tests/IBM855/greek.ru.xml
tests/IBM855/intertat.ru.xml
tests/IBM855/janulalife.blogspot.com.xml
tests/IBM855/kapranoff.ru.xml
tests/IBM855/money.rin.ru.xml
tests/IBM855/music.peeps.ru.xml
tests/IBM855/newsru.com.xml
tests/IBM855/susu.ac.ru.xml
tests/IBM866/_ude_1.txt
tests/IBM866/aif.ru.health.xml
tests/IBM866/aug32.hole.ru.xml
tests/IBM866/aviaport.ru.xml
tests/IBM866/blog.mlmaster.com.xml
tests/IBM866/forum.template-toolkit.ru.1.xml
tests/IBM866/forum.template-toolkit.ru.4.xml
tests/IBM866/forum.template-toolkit.ru.6.xml
tests/IBM866/forum.template-toolkit.ru.8.xml
tests/IBM866/forum.template-toolkit.ru.9.xml
tests/IBM866/greek.ru.xml
tests/IBM866/intertat.ru.xml
tests/IBM866/janulalife.blogspot.com.xml
tests/IBM866/kapranoff.ru.xml
tests/IBM866/money.rin.ru.xml
tests/IBM866/music.peeps.ru.xml
tests/IBM866/newsru.com.xml
tests/IBM866/susu.ac.ru.xml
tests/KOI8-R/_chromium_KOI8-R_with_no_encoding_specified.html
tests/KOI8-R/_ude_1.txt
tests/KOI8-R/aif.ru.health.xml
tests/KOI8-R/aug32.hole.ru.xml
tests/KOI8-R/aviaport.ru.xml
tests/KOI8-R/blog.mlmaster.com.xml
tests/KOI8-R/forum.template-toolkit.ru.1.xml
tests/KOI8-R/forum.template-toolkit.ru.4.xml
tests/KOI8-R/forum.template-toolkit.ru.6.xml
tests/KOI8-R/forum.template-toolkit.ru.8.xml
tests/KOI8-R/forum.template-toolkit.ru.9.xml
tests/KOI8-R/greek.ru.xml
tests/KOI8-R/intertat.ru.xml
tests/KOI8-R/janulalife.blogspot.com.xml
tests/KOI8-R/kapranoff.ru.xml
tests/KOI8-R/koi.kinder.ru.xml
tests/KOI8-R/money.rin.ru.xml
tests/KOI8-R/music.peeps.ru.xml
tests/KOI8-R/newsru.com.xml
tests/KOI8-R/susu.ac.ru.xml
tests/MacCyrillic/_ude_1.txt
tests/MacCyrillic/aif.ru.health.xml
tests/MacCyrillic/aug32.hole.ru.xml
tests/MacCyrillic/aviaport.ru.xml
tests/MacCyrillic/blog.mlmaster.com.xml
tests/MacCyrillic/forum.template-toolkit.ru.4.xml
tests/MacCyrillic/forum.template-toolkit.ru.6.xml
tests/MacCyrillic/forum.template-toolkit.ru.8.xml
tests/MacCyrillic/forum.template-toolkit.ru.9.xml
tests/MacCyrillic/greek.ru.xml
tests/MacCyrillic/intertat.ru.xml
tests/MacCyrillic/kapranoff.ru.xml
tests/MacCyrillic/koi.kinder.ru.xml
tests/MacCyrillic/money.rin.ru.xml
tests/MacCyrillic/music.peeps.ru.xml
tests/MacCyrillic/newsru.com.xml
tests/MacCyrillic/susu.ac.ru.xml
tests/SHIFT_JIS/10e.org.xml
tests/SHIFT_JIS/1affliate.com.xml
tests/SHIFT_JIS/_chromium_Shift-JIS_with_no_encoding_specified.html
tests/SHIFT_JIS/_ude_1.txt
tests/SHIFT_JIS/_ude_2.txt
tests/SHIFT_JIS/_ude_3.txt
tests/SHIFT_JIS/_ude_4.txt
tests/SHIFT_JIS/accessories-brand.com.xml
tests/SHIFT_JIS/amefoot.net.xml
tests/SHIFT_JIS/andore.com.inami.xml
tests/SHIFT_JIS/andore.com.money.xml
tests/SHIFT_JIS/andore.com.xml
tests/SHIFT_JIS/blog.inkase.net.xml
tests/SHIFT_JIS/blog.paseri.ne.jp.xml
tests/SHIFT_JIS/bloglelife.com.xml
tests/SHIFT_JIS/brag.zaka.to.xml
tests/SHIFT_JIS/celeb.lalalu.com.xml
tests/SHIFT_JIS/clickablewords.com.xml
tests/SHIFT_JIS/do.beginnersrack.com.xml
tests/SHIFT_JIS/dogsinn.jp.xml
tests/SHIFT_JIS/grebeweb.net.xml
tests/SHIFT_JIS/milliontimes.jp.xml
tests/SHIFT_JIS/moon-light.ne.jp.xml
tests/SHIFT_JIS/nextbeaut.com.xml
tests/SHIFT_JIS/ooganemochi.com.xml
tests/SHIFT_JIS/perth-on.net.xml
tests/SHIFT_JIS/sakusaka-silk.net.xml
tests/SHIFT_JIS/setsuzei119.jp.xml
tests/SHIFT_JIS/tamuyou.haun.org.xml
tests/SHIFT_JIS/yasuhisa.com.xml
tests/TIS-620/_mozilla_bug488426_text.html
tests/TIS-620/opentle.org.xml
tests/TIS-620/pharmacy.kku.ac.th.analyse1.xml
tests/TIS-620/pharmacy.kku.ac.th.centerlab.xml
tests/TIS-620/pharmacy.kku.ac.th.healthinfo-ne.xml
tests/TIS-620/trickspot.boxchart.com.xml
tests/UTF-16/bom-utf-16-be.srt
tests/UTF-16/bom-utf-16-le.srt
tests/UTF-32/bom-utf-32-be.srt
tests/UTF-32/bom-utf-32-le.srt
tests/ascii/_chromium_iso-8859-1_with_no_encoding_specified.html
tests/ascii/_mozilla_bug638318_text.html
tests/ascii/howto.diveintomark.org.xml
tests/iso-2022-jp/_ude_1.txt
tests/iso-2022-kr/_ude_iso1.txt
tests/iso-2022-kr/_ude_iso2.txt
tests/iso-8859-1/_ude_1.txt
tests/iso-8859-1/_ude_2.txt
tests/iso-8859-1/_ude_3.txt
tests/iso-8859-1/_ude_4.txt
tests/iso-8859-1/_ude_5.txt
tests/iso-8859-1/_ude_6.txt
tests/iso-8859-2-hungarian/auto-apro.hu.xml
tests/iso-8859-2-hungarian/cigartower.hu.xml
tests/iso-8859-2-hungarian/escience.hu.xml
tests/iso-8859-2-hungarian/hirtv.hu.xml
tests/iso-8859-2-hungarian/honositomuhely.hu.xml
tests/iso-8859-2-hungarian/saraspatak.hu.xml
tests/iso-8859-2-hungarian/shamalt.uw.hu.mk.xml
tests/iso-8859-2-hungarian/shamalt.uw.hu.mr.xml
tests/iso-8859-2-hungarian/shamalt.uw.hu.mv.xml
tests/iso-8859-2-hungarian/shamalt.uw.hu.xml
tests/iso-8859-2-hungarian/ugyanmar.blogspot.com.xml
tests/iso-8859-5-bulgarian/aero-bg.com.xml
tests/iso-8859-5-bulgarian/bbc.co.uk.popshow.xml
tests/iso-8859-5-bulgarian/bpm.cult.bg.2.xml
tests/iso-8859-5-bulgarian/bpm.cult.bg.4.xml
tests/iso-8859-5-bulgarian/bpm.cult.bg.9.xml
tests/iso-8859-5-bulgarian/bpm.cult.bg.medusa.4.xml
tests/iso-8859-5-bulgarian/bpm.cult.bg.xml
tests/iso-8859-5-bulgarian/debian.gabrovo.com.news.xml
tests/iso-8859-5-bulgarian/debian.gabrovo.com.xml
tests/iso-8859-5-bulgarian/doncho.net.comments.xml
tests/iso-8859-5-bulgarian/ecloga.cult.bg.xml
tests/iso-8859-5-bulgarian/ide.li.xml
tests/iso-8859-5-bulgarian/linux-bg.org.xml
tests/iso-8859-5-cyrillic/_chromium_ISO-8859-5_with_no_encoding_specified.html
tests/iso-8859-5-cyrillic/aif.ru.health.xml
tests/iso-8859-5-cyrillic/aug32.hole.ru.xml
tests/iso-8859-5-cyrillic/aviaport.ru.xml
tests/iso-8859-5-cyrillic/blog.mlmaster.com.xml
tests/iso-8859-5-cyrillic/forum.template-toolkit.ru.1.xml
tests/iso-8859-5-cyrillic/forum.template-toolkit.ru.4.xml
tests/iso-8859-5-cyrillic/forum.template-toolkit.ru.6.xml
tests/iso-8859-5-cyrillic/forum.template-toolkit.ru.8.xml
tests/iso-8859-5-cyrillic/forum.template-toolkit.ru.9.xml
tests/iso-8859-5-cyrillic/greek.ru.xml
tests/iso-8859-5-cyrillic/intertat.ru.xml
tests/iso-8859-5-cyrillic/janulalife.blogspot.com.xml
tests/iso-8859-5-cyrillic/kapranoff.ru.xml
tests/iso-8859-5-cyrillic/money.rin.ru.xml
tests/iso-8859-5-cyrillic/music.peeps.ru.xml
tests/iso-8859-5-cyrillic/newsru.com.xml
tests/iso-8859-5-cyrillic/susu.ac.ru.xml
tests/iso-8859-6-arabic/_chromium_ISO-8859-6_with_no_encoding_specified.html
tests/iso-8859-7-greek/_chromium_ISO-8859-7_with_no_encoding_specified.html
tests/iso-8859-7-greek/_ude_greek.txt
tests/iso-8859-7-greek/disabled.gr.xml
tests/iso-8859-7-greek/hotstation.gr.xml
tests/iso-8859-7-greek/naftemporiki.gr.bus.xml
tests/iso-8859-7-greek/naftemporiki.gr.cmm.xml
tests/iso-8859-7-greek/naftemporiki.gr.fin.xml
tests/iso-8859-7-greek/naftemporiki.gr.mrk.xml
tests/iso-8859-7-greek/naftemporiki.gr.mrt.xml
tests/iso-8859-7-greek/naftemporiki.gr.spo.xml
tests/iso-8859-7-greek/naftemporiki.gr.wld.xml
tests/iso-8859-9-turkish/divxplanet.com.xml
tests/iso-8859-9-turkish/subtitle.srt
tests/iso-8859-9-turkish/wikitop_tr_ISO-8859-9.txt
tests/utf-8/_chromium_UTF-8_with_no_encoding_specified.html
tests/utf-8/_mozilla_bug306272_text.html
tests/utf-8/_mozilla_bug426271_text-utf-8.html
tests/utf-8/_ude_1.txt
tests/utf-8/_ude_2.txt
tests/utf-8/_ude_3.txt
tests/utf-8/_ude_5.txt
tests/utf-8/_ude_greek.txt
tests/utf-8/_ude_he1.txt
tests/utf-8/_ude_he2.txt
tests/utf-8/_ude_he3.txt
tests/utf-8/_ude_russian.txt
tests/utf-8/anitabee.blogspot.com.xml
tests/utf-8/balatonblog.typepad.com.xml
tests/utf-8/boobooo.blogspot.com.xml
tests/utf-8/linuxbox.hu.xml
tests/utf-8/pihgy.hu.xml
tests/utf-8/weblabor.hu.2.xml
tests/utf-8/weblabor.hu.xml
tests/utf-8-sig/_ude_4.txt
tests/utf-8-sig/bom-utf-8.srt
tests/windows-1250-hungarian/bbc.co.uk.hu.forum.xml
tests/windows-1250-hungarian/bbc.co.uk.hu.learningenglish.xml
tests/windows-1250-hungarian/bbc.co.uk.hu.pressreview.xml
tests/windows-1250-hungarian/bbc.co.uk.hu.xml
tests/windows-1250-hungarian/objektivhir.hu.xml
tests/windows-1250-hungarian/torokorszag.blogspot.com.xml
tests/windows-1251-bulgarian/bbc.co.uk.popshow.xml
tests/windows-1251-bulgarian/bpm.cult.bg.2.xml
tests/windows-1251-bulgarian/bpm.cult.bg.3.xml
tests/windows-1251-bulgarian/bpm.cult.bg.4.xml
tests/windows-1251-bulgarian/bpm.cult.bg.9.xml
tests/windows-1251-bulgarian/bpm.cult.bg.medusa.4.xml
tests/windows-1251-bulgarian/bpm.cult.bg.xml
tests/windows-1251-bulgarian/debian.gabrovo.com.news.xml
tests/windows-1251-bulgarian/debian.gabrovo.com.xml
tests/windows-1251-bulgarian/doncho.net.comments.xml
tests/windows-1251-bulgarian/doncho.net.xml
tests/windows-1251-bulgarian/ecloga.cult.bg.xml
tests/windows-1251-bulgarian/ide.li.xml
tests/windows-1251-bulgarian/informator.org.xml
tests/windows-1251-bulgarian/linux-bg.org.xml
tests/windows-1251-bulgarian/rinennor.org.xml
tests/windows-1251-cyrillic/_chromium_windows-1251_with_no_encoding_specified.html
tests/windows-1251-cyrillic/_ude_1.txt
tests/windows-1251-cyrillic/aif.ru.health.xml
tests/windows-1251-cyrillic/anthropology.ru.xml
tests/windows-1251-cyrillic/aug32.hole.ru.xml
tests/windows-1251-cyrillic/aviaport.ru.xml
tests/windows-1251-cyrillic/blog.mlmaster.com.xml
tests/windows-1251-cyrillic/forum.template-toolkit.ru.1.xml
tests/windows-1251-cyrillic/forum.template-toolkit.ru.4.xml
tests/windows-1251-cyrillic/forum.template-toolkit.ru.6.xml
tests/windows-1251-cyrillic/forum.template-toolkit.ru.8.xml
tests/windows-1251-cyrillic/forum.template-toolkit.ru.9.xml
tests/windows-1251-cyrillic/greek.ru.xml
tests/windows-1251-cyrillic/intertat.ru.xml
tests/windows-1251-cyrillic/janulalife.blogspot.com.xml
tests/windows-1251-cyrillic/kapranoff.ru.xml
tests/windows-1251-cyrillic/money.rin.ru.xml
tests/windows-1251-cyrillic/music.peeps.ru.xml
tests/windows-1251-cyrillic/newsru.com.xml
tests/windows-1252/_mozilla_bug421271_text.html
tests/windows-1252/github_bug_9.txt
tests/windows-1254-turkish/_chromium_windows-1254_with_no_encoding_specified.html
tests/windows-1255-hebrew/_chromium_ISO-8859-8_with_no_encoding_specified.html
tests/windows-1255-hebrew/_chromium_windows-1255_with_no_encoding_specified.html
tests/windows-1255-hebrew/_ude_he1.txt
tests/windows-1255-hebrew/_ude_he2.txt
tests/windows-1255-hebrew/_ude_he3.txt
tests/windows-1255-hebrew/carshops.co.il.xml
tests/windows-1255-hebrew/exego.net.2.xml
tests/windows-1255-hebrew/hagada.org.il.xml
tests/windows-1255-hebrew/halemo.net.edoar.xml
tests/windows-1255-hebrew/hevra.org.il.xml
tests/windows-1255-hebrew/hydepark.hevre.co.il.7957.xml
tests/windows-1255-hebrew/info.org.il.xml
tests/windows-1255-hebrew/infomed.co.il.xml
tests/windows-1255-hebrew/law.co.il.xml
tests/windows-1255-hebrew/maakav.org.xml
tests/windows-1255-hebrew/neviim.net.xml
tests/windows-1255-hebrew/notes.co.il.50.xml
tests/windows-1255-hebrew/notes.co.il.6.xml
tests/windows-1255-hebrew/notes.co.il.7.xml
tests/windows-1255-hebrew/notes.co.il.8.xml
tests/windows-1255-hebrew/pcplus.co.il.xml
tests/windows-1255-hebrew/sharks.co.il.xml
tests/windows-1255-hebrew/whatsup.org.il.xml
tests/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html

from __future__ import absolute_import, division, unicode_literals

import re
import warnings

from .constants import DataLossWarning

baseChar = """
[#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] |
[#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] |
[#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] |
[#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 |
[#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] |
[#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3] |
[#x0401-#x040C] | [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] |
[#x0490-#x04C4] | [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] |
[#x04EE-#x04F5] | [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 |
[#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] |
[#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] |
[#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D |
[#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] |
[#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] |
[#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] |
[#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] |
[#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] |
[#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8] |
[#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 |
[#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30] |
[#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D | [#x0B5C-#x0B5D] |
[#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] | [#x0B92-#x0B95] |
[#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] |
[#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] |
[#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] |
[#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] |
[#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] |
[#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] |
[#x0D60-#x0D61] | [#x0E01-#x0E2E] | #x0E30 | [#x0E32-#x0E33] |
[#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88] | #x0E8A |
#x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 |
#x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3] |
#x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] |
[#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] |
[#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C |
#x113E | #x1140 | #x114C | #x114E | #x1150 | [#x1154-#x1155] | #x1159 |
[#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 | [#x116D-#x116E] |
[#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF] |
[#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 | #x11F9 |
[#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] | [#x1F18-#x1F1D] |
[#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] | #x1F59 | #x1F5B |
#x1F5D | [#x1F5F-#x1F7D] | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE |
[#x1FC2-#x1FC4] | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] |
[#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 |
[#x212A-#x212B] | #x212E | [#x2180-#x2182] | [#x3041-#x3094] |
[#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3]"""

ideographic = """[#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]"""

combiningCharacter = """
[#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] |
[#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF | [#x05C1-#x05C2] | #x05C4 |
[#x064B-#x0652] | #x0670 | [#x06D6-#x06DC] | [#x06DD-#x06DF] |
[#x06E0-#x06E4] | [#x06E7-#x06E8] | [#x06EA-#x06ED] | [#x0901-#x0903] |
#x093C | [#x093E-#x094C] | #x094D | [#x0951-#x0954] | [#x0962-#x0963] |
[#x0981-#x0983] | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] |
[#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02 |
#x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48] |
[#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC |
[#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03] |
#x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D] |
[#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] |
[#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44] |
[#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83] |
[#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6] |
[#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] |
#x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1 |
[#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19] |
#x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84] |
[#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 | [#x0F99-#x0FAD] |
[#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] |
#x3099 | #x309A"""

digit = """
[#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] |
[#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] |
[#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] |
[#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]"""

extender = """
#x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | #x0E46 | #x0EC6 | #x3005 |
[#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE]"""

letter = " | ".join([baseChar, ideographic])

# Without the
name = " | ".join([letter, digit, ".", "-", "_", combiningCharacter,
                   extender])
nameFirst = " | ".join([letter, "_"])

reChar = re.compile(r"#x([0-9A-F]{4})")
reCharRange = re.compile(r"\[#x([0-9A-F]{4})-#x([0-9A-F]{4})\]")


def charStringToList(chars):
    """Parse a "|"-separated character spec into sorted, merged
    [start, end] code point ranges."""
    # Split on "|" alone: the specs above wrap lines after a trailing "|",
    # so the separator is not always surrounded by spaces.
    charRanges = [item.strip() for item in chars.split("|")]
    rv = []
    for item in charRanges:
        foundMatch = False
        for regexp in (reChar, reCharRange):
            match = regexp.match(item)
            if match is not None:
                rv.append([hexToInt(group) for group in match.groups()])
                if len(rv[-1]) == 1:
                    # A single character is stored as a one-element range.
                    rv[-1] = rv[-1] * 2
                foundMatch = True
                break
        if not foundMatch:
            # Anything else must be a literal character such as "." or "_".
            assert len(item) == 1

            rv.append([ord(item)] * 2)
    rv = normaliseCharList(rv)
    return rv


def normaliseCharList(charList):
    charList = sorted(charList)
    for item in charList:
        assert item[1] >= item[0]
    rv = []
    i = 0
    while i < len(charList):
        j = 1
        rv.append(charList[i])
        while i + j < len(charList) and charList[i + j][0] <= rv[-1][1] + 1:
            # Keep the larger end so a fully contained range cannot shrink it.
            rv[-1][1] = max(rv[-1][1], charList[i + j][1])
            j += 1
        i += j
    return rv
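

# A minimal sketch of the range-merging step, assuming inclusive
# [start, end] integer pairs as used above. `merge_ranges` is a hypothetical
# stand-alone helper written for illustration; it is not part of this module
# and does not mutate its input.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent inclusive [start, end] ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend that range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Copy the pair so the caller's lists are left untouched.
            merged.append([start, end])
    return merged


# a-z, A-Z, the six characters between them, and 0-9
print(merge_ranges([[0x61, 0x7A], [0x41, 0x5A], [0x5B, 0x60], [0x30, 0x39]]))
```

# The three ranges covering 0x41-0x7A touch each other and collapse into one,
# while the digit range stays separate.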

# We don't really support characters above the BMP :(
max_unicode = int("FFFF", 16)


def missingRanges(charList):
    rv = []
    if charList[0][0] != 0:
        rv.append([0, charList[0][0] - 1])
    for i, item in enumerate(charList[:-1]):
        rv.append([item[1] + 1, charList[i + 1][0] - 1])
    if charList[-1][1] != max_unicode:
        rv.append([charList[-1][1] + 1, max_unicode])
    return rv
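

# A minimal sketch of computing the complement of a range list within the
# Basic Multilingual Plane, assuming the input is sorted and non-overlapping.
# `complement_ranges` and `MAX_BMP` are hypothetical names used only for this
# illustration.

```python
MAX_BMP = 0xFFFF


def complement_ranges(ranges, upper=MAX_BMP):
    """Return the gaps left by sorted, non-overlapping inclusive ranges
    within [0, upper]."""
    gaps = []
    next_free = 0  # lowest code point not yet covered by any range
    for start, end in ranges:
        if start > next_free:
            gaps.append([next_free, start - 1])
        next_free = end + 1
    if next_free <= upper:
        gaps.append([next_free, upper])
    return gaps


# Everything that is neither an ASCII digit nor an ASCII capital letter
print(complement_ranges([[0x30, 0x39], [0x41, 0x5A]]))
```

# Tracking the next uncovered code point avoids emitting empty gaps when a
# range starts at 0 or ends at the upper bound.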


def listToRegexpStr(charList):
    rv = []
    for item in charList:
        if item[0] == item[1]:
            rv.append(escapeRegexp(chr(item[0])))
        else:
            rv.append(escapeRegexp(chr(item[0])) + "-" +
                      escapeRegexp(chr(item[1])))
    return "[%s]" % "".join(rv)


def hexToInt(hex_str):
    return int(hex_str, 16)


def escapeRegexp(string):
    specialCharacters = (".", "^", "$", "*", "+", "?", "{", "}",
                         "[", "]", "|", "(", ")", "-")
    for char in specialCharacters:
        string = string.replace(char, "\\" + char)

    return string
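

# A minimal sketch of turning a range list into a regular-expression
# character class. It swaps in the standard library's `re.escape` for the
# hand-written escaper above, and `ranges_to_char_class` is a hypothetical
# name used only for this illustration.

```python
import re


def ranges_to_char_class(ranges):
    """Build a regex character class from inclusive [start, end] ranges."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(re.escape(chr(start)))
        else:
            parts.append(re.escape(chr(start)) + "-" + re.escape(chr(end)))
    return "[" + "".join(parts) + "]"


# Strip ASCII digits and underscores from a string
pattern = re.compile(ranges_to_char_class([[0x30, 0x39], [0x5F, 0x5F]]))
print(pattern.sub("", "a1_b2"))
```

# Escaping both endpoints matters for ranges that start or end on characters
# such as "[", "]" or "^", which are otherwise special inside a class.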

# output from the above
nonXmlNameBMPRegexp = re.compile('[\x00-,/:-@\\[-\\^`\\{-\xb6\xb8-\xbf\xd7\xf7\u0132-\u0133\u013f-\u0140\u0149\u017f\u01c4-\u01cc\u01f1-\u01f3\u01f6-\u01f9\u0218-\u024f\u02a9-\u02ba\u02c2-\u02cf\u02d2-\u02ff\u0346-\u035f\u0362-\u0385\u038b\u038d\u03a2\u03cf\u03d7-\u03d9\u03db\u03dd\u03df\u03e1\u03f4-\u0400\u040d\u0450\u045d\u0482\u0487-\u048f\u04c5-\u04c6\u04c9-\u04ca\u04cd-\u04cf\u04ec-\u04ed\u04f6-\u04f7\u04fa-\u0530\u0557-\u0558\u055a-\u0560\u0587-\u0590\u05a2\u05ba\u05be\u05c0\u05c3\u05c5-\u05cf\u05eb-\u05ef\u05f3-\u0620\u063b-\u063f\u0653-\u065f\u066a-\u066f\u06b8-\u06b9\u06bf\u06cf\u06d4\u06e9\u06ee-\u06ef\u06fa-\u0900\u0904\u093a-\u093b\u094e-\u0950\u0955-\u0957\u0964-\u0965\u0970-\u0980\u0984\u098d-\u098e\u0991-\u0992\u09a9\u09b1\u09b3-\u09b5\u09ba-\u09bb\u09bd\u09c5-\u09c6\u09c9-\u09ca\u09ce-\u09d6\u09d8-\u09db\u09de\u09e4-\u09e5\u09f2-\u0a01\u0a03-\u0a04\u0a0b-\u0a0e\u0a11-\u0a12\u0a29\u0a31\u0a34\u0a37\u0a3a-\u0a3b\u0a3d\u0a43-\u0a46\u0a49-\u0a4a\u0a4e-\u0a58\u0a5d\u0a5f-\u0a65\u0a75-\u0a80\u0a84\u0a8c\u0a8e\u0a92\u0aa9\u0ab1\u0ab4\u0aba-\u0abb\u0ac6\u0aca\u0ace-\u0adf\u0ae1-\u0ae5\u0af0-\u0b00\u0b04\u0b0d-\u0b0e\u0b11-\u0b12\u0b29\u0b31\u0b34-\u0b35\u0b3a-\u0b3b\u0b44-\u0b46\u0b49-\u0b4a\u0b4e-\u0b55\u0b58-\u0b5b\u0b5e\u0b62-\u0b65\u0b70-\u0b81\u0b84\u0b8b-\u0b8d\u0b91\u0b96-\u0b98\u0b9b\u0b9d\u0ba0-\u0ba2\u0ba5-\u0ba7\u0bab-\u0bad\u0bb6\u0bba-\u0bbd\u0bc3-\u0bc5\u0bc9\u0bce-\u0bd6\u0bd8-\u0be6\u0bf0-\u0c00\u0c04\u0c0d\u0c11\u0c29\u0c34\u0c3a-\u0c3d\u0c45\u0c49\u0c4e-\u0c54\u0c57-\u0c5f\u0c62-\u0c65\u0c70-\u0c81\u0c84\u0c8d\u0c91\u0ca9\u0cb4\u0cba-\u0cbd\u0cc5\u0cc9\u0cce-\u0cd4\u0cd7-\u0cdd\u0cdf\u0ce2-\u0ce5\u0cf0-\u0d01\u0d04\u0d0d\u0d11\u0d29\u0d3a-\u0d3d\u0d44-\u0d45\u0d49\u0d4e-\u0d56\u0d58-\u0d5f\u0d62-\u0d65\u0d70-\u0e00\u0e2f\u0e3b-\u0e3f\u0e4f\u0e5a-\u0e80\u0e83\u0e85-\u0e86\u0e89\u0e8b-\u0e8c\u0e8e-\u0e93\u0e98\u0ea0\u0ea4\u0ea6\u0ea8-\u0ea9\u0eac\u0eaf\u0eba\u0ebe-\u0ebf\u0ec5\u0ec7\u0ece-\u0ecf\u0eda-\u0f17\u0f1a-\u0f1f\u0f2a-\u0f34\u0f36\u0f38\u0f3a-\u0f3d\u0f48\u0f6a-\u0f70\u0f85\u0f8c-\u0f8f\u0f96\u0f98\u0fae-\u0fb0\u0fb8\u0fba-\u109f\u10c6-\u10cf\u10f7-\u10ff\u1101\u1104\u1108\u110a\u110d\u1113-\u113b\u113d\u113f\u1141-\u114b\u114d\u114f\u1151-\u1153\u1156-\u1158\u115a-\u115e\u1162\u1164\u1166\u1168\u116a-\u116c\u116f-\u1171\u1174\u1176-\u119d\u119f-\u11a7\u11a9-\u11aa\u11ac-\u11ad\u11b0-\u11b6\u11b9\u11bb\u11c3-\u11ea\u11ec-\u11ef\u11f1-\u11f8\u11fa-\u1dff\u1e9c-\u1e9f\u1efa-\u1eff\u1f16-\u1f17\u1f1e-\u1f1f\u1f46-\u1f47\u1f4e-\u1f4f\u1f58\u1f5a\u1f5c\u1f5e\u1f7e-\u1f7f\u1fb5\u1fbd\u1fbf-\u1fc1\u1fc5\u1fcd-\u1fcf\u1fd4-\u1fd5\u1fdc-\u1fdf\u1fed-\u1ff1\u1ff5\u1ffd-\u20cf\u20dd-\u20e0\u20e2-\u2125\u2127-\u2129\u212c-\u212d\u212f-\u217f\u2183-\u3004\u3006\u3008-\u3020\u3030\u3036-\u3040\u3095-\u3098\u309b-\u309c\u309f-\u30a0\u30fb\u30ff-\u3104\u312d-\u4dff\u9fa6-\uabff\ud7a4-\uffff]')  # noqa

nonXmlNameFirstBMPRegexp = re.compile('[\x00-@\\[-\\^`\\{-\xbf\xd7\xf7\u0132-\u0133\u013f-\u0140\u0149\u017f\u01c4-\u01cc\u01f1-\u01f3\u01f6-\u01f9\u0218-\u024f\u02a9-\u02ba\u02c2-\u0385\u0387\u038b\u038d\u03a2\u03cf\u03d7-\u03d9\u03db\u03dd\u03df\u03e1\u03f4-\u0400\u040d\u0450\u045d\u0482-\u048f\u04c5-\u04c6\u04c9-\u04ca\u04cd-\u04cf\u04ec-\u04ed\u04f6-\u04f7\u04fa-\u0530\u0557-\u0558\u055a-\u0560\u0587-\u05cf\u05eb-\u05ef\u05f3-\u0620\u063b-\u0640\u064b-\u0670\u06b8-\u06b9\u06bf\u06cf\u06d4\u06d6-\u06e4\u06e7-\u0904\u093a-\u093c\u093e-\u0957\u0962-\u0984\u098d-\u098e\u0991-\u0992\u09a9\u09b1\u09b3-\u09b5\u09ba-\u09db\u09de\u09e2-\u09ef\u09f2-\u0a04\u0a0b-\u0a0e\u0a11-\u0a12\u0a29\u0a31\u0a34\u0a37\u0a3a-\u0a58\u0a5d\u0a5f-\u0a71\u0a75-\u0a84\u0a8c\u0a8e\u0a92\u0aa9\u0ab1\u0ab4\u0aba-\u0abc\u0abe-\u0adf\u0ae1-\u0b04\u0b0d-\u0b0e\u0b11-\u0b12\u0b29\u0b31\u0b34-\u0b35\u0b3a-\u0b3c\u0b3e-\u0b5b\u0b5e\u0b62-\u0b84\u0b8b-\u0b8d\u0b91\u0b96-\u0b98\u0b9b\u0b9d\u0ba0-\u0ba2\u0ba5-\u0ba7\u0bab-\u0bad\u0bb6\u0bba-\u0c04\u0c0d\u0c11\u0c29\u0c34\u0c3a-\u0c5f\u0c62-\u0c84\u0c8d\u0c91\u0ca9\u0cb4\u0cba-\u0cdd\u0cdf\u0ce2-\u0d04\u0d0d\u0d11\u0d29\u0d3a-\u0d5f\u0d62-\u0e00\u0e2f\u0e31\u0e34-\u0e3f\u0e46-\u0e80\u0e83\u0e85-\u0e86\u0e89\u0e8b-\u0e8c\u0e8e-\u0e93\u0e98\u0ea0\u0ea4\u0ea6\u0ea8-\u0ea9\u0eac\u0eaf\u0eb1\u0eb4-\u0ebc\u0ebe-\u0ebf\u0ec5-\u0f3f\u0f48\u0f6a-\u109f\u10c6-\u10cf\u10f7-\u10ff\u1101\u1104\u1108\u110a\u110d\u1113-\u113b\u113d\u113f\u1141-\u114b\u114d\u114f\u1151-\u1153\u1156-\u1158\u115a-\u115e\u1162\u1164\u1166\u1168\u116a-\u116c\u116f-\u1171\u1174\u1176-\u119d\u119f-\u11a7\u11a9-\u11aa\u11ac-\u11ad\u11b0-\u11b6\u11b9\u11bb\u11c3-\u11ea\u11ec-\u11ef\u11f1-\u11f8\u11fa-\u1dff\u1e9c-\u1e9f\u1efa-\u1eff\u1f16-\u1f17\u1f1e-\u1f1f\u1f46-\u1f47\u1f4e-\u1f4f\u1f58\u1f5a\u1f5c\u1f5e\u1f7e-\u1f7f\u1fb5\u1fbd\u1fbf-\u1fc1\u1fc5\u1fcd-\u1fcf\u1fd4-\u1fd5\u1fdc-\u1fdf\u1fed-\u1ff1\u1ff5\u1ffd-\u2125\u2127-\u2129\u212c-\u212d\u212f-\u217f\u2183-\u3006\u3008-\u3020\u302a-\u3
040\u3095-\u30a0\u30fb-\u3104\u312d-\u4dff\u9fa6-\uabff\ud7a4-\uffff]')  # noqa

# Simpler things
nonPubidCharRegexp = re.compile("[^\x20\x0D\x0Aa-zA-Z0-9\\-'()+,./:=?;!*#@$_%]")


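class InfosetFilter(object):
    # Matches the "UXXXXX" escapes produced by escapeChar
    # ("U" followed by five uppercase hexadecimal digits).
    replacementRegexp = re.compile(r"U[\dA-F]{5,5}")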

    def __init__(self,
                 dropXmlnsLocalName=False,
                 dropXmlnsAttrNs=False,
                 preventDoubleDashComments=False,
                 preventDashAtCommentEnd=False,
                 replaceFormFeedCharacters=True,
                 preventSingleQuotePubid=False):

        self.dropXmlnsLocalName = dropXmlnsLocalName
        self.dropXmlnsAttrNs = dropXmlnsAttrNs

        self.preventDoubleDashComments = preventDoubleDashComments
        self.preventDashAtCommentEnd = preventDashAtCommentEnd

        self.replaceFormFeedCharacters = replaceFormFeedCharacters

        self.preventSingleQuotePubid = preventSingleQuotePubid

        self.replaceCache = {}

    def coerceAttribute(self, name, namespace=None):
        if self.dropXmlnsLocalName and name.startswith("xmlns:"):
            warnings.warn("Attributes cannot begin with xmlns", DataLossWarning)
            return None
        elif (self.dropXmlnsAttrNs and
              namespace == "http://www.w3.org/2000/xmlns/"):
            warnings.warn("Attributes cannot be in the xmlns namespace", DataLossWarning)
            return None
        else:
            return self.toXmlName(name)

    def coerceElement(self, name):
        return self.toXmlName(name)

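    def coerceComment(self, data):
        # Note: the trailing-dash check below only runs when
        # preventDoubleDashComments is set; preventDashAtCommentEnd is
        # stored in __init__ but is not consulted here.
        if self.preventDoubleDashComments: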
            while "--" in data:
                warnings.warn("Comments cannot contain adjacent dashes", DataLossWarning)
                data = data.replace("--", "- -")
            if data.endswith("-"):
                warnings.warn("Comments cannot end in a dash", DataLossWarning)
                data += " "
        return data

    def coerceCharacters(self, data):
        if self.replaceFormFeedCharacters:
            for _ in range(data.count("\x0C")):
                warnings.warn("Text cannot contain U+000C", DataLossWarning)
            data = data.replace("\x0C", " ")
        # Other characters that are not valid in XML are not handled here.
        return data

    def coercePubid(self, data):
        dataOutput = data
        for char in nonPubidCharRegexp.findall(data):
            warnings.warn("Coercing non-XML pubid", DataLossWarning)
            replacement = self.getReplacementCharacter(char)
            dataOutput = dataOutput.replace(char, replacement)
        if self.preventSingleQuotePubid and dataOutput.find("'") >= 0:
            warnings.warn("Pubid cannot contain single quote", DataLossWarning)
            dataOutput = dataOutput.replace("'", self.getReplacementCharacter("'"))
        return dataOutput

    def toXmlName(self, name):
        nameFirst = name[0]
        nameRest = name[1:]
        m = nonXmlNameFirstBMPRegexp.match(nameFirst)
        if m:
            warnings.warn("Coercing non-XML name", DataLossWarning)
            nameFirstOutput = self.getReplacementCharacter(nameFirst)
        else:
            nameFirstOutput = nameFirst

        nameRestOutput = nameRest
        replaceChars = set(nonXmlNameBMPRegexp.findall(nameRest))
        for char in replaceChars:
            warnings.warn("Coercing non-XML name", DataLossWarning)
            replacement = self.getReplacementCharacter(char)
            nameRestOutput = nameRestOutput.replace(char, replacement)
        return nameFirstOutput + nameRestOutput

    def getReplacementCharacter(self, char):
        # Reuse a cached escape if one exists; escapeChar fills the cache.
        if char in self.replaceCache:
            replacement = self.replaceCache[char]
        else:
            replacement = self.escapeChar(char)
        return replacement

    def fromXmlName(self, name):
        # Reverse toXmlName: turn every "UXXXXX" escape back into its character.
        for item in set(self.replacementRegexp.findall(name)):
            name = name.replace(item, self.unescapeChar(item))
        return name

    def escapeChar(self, char):
        # Encode as "U" plus the code point in uppercase hex,
        # zero-padded to at least five digits.
        replacement = "U%05X" % ord(char)
        self.replaceCache[char] = replacement
        return replacement

    def unescapeChar(self, charcode):
        # Drop the leading "U" and parse the remaining hex digits.
        return chr(int(charcode[1:], 16))
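
# A minimal sketch of the escape/unescape round trip that InfosetFilter
# appears to rely on. The helper names below (ESCAPE_RE, escape_char,
# unescape_char, from_xml_name) are hypothetical stand-ins written for this
# illustration; they mirror escapeChar, unescapeChar and fromXmlName above
# but are not part of the class, and the caching step is left out.

```python
import re

# Same pattern as InfosetFilter.replacementRegexp (assumed equivalent).
ESCAPE_RE = re.compile(r"U[\dA-F]{5,5}")


def escape_char(char):
    # "U" plus the code point in uppercase hex, zero-padded to five digits.
    return "U%05X" % ord(char)


def unescape_char(code):
    # Strip the leading "U" and parse the rest as hexadecimal.
    return chr(int(code[1:], 16))


def from_xml_name(name):
    # Replace each distinct escape found in the name with its character.
    for item in set(ESCAPE_RE.findall(name)):
        name = name.replace(item, unescape_char(item))
    return name


escaped = "xlink" + escape_char(":") + "href"
print(escaped)                 # xlinkU0003Ahref
print(from_xml_name(escaped))  # xlink:href
```

# One consequence of this scheme: a name that already contains text shaped
# like an escape (for example "U0003A") is also converted on the way back,
# so the round trip is only lossless for names that do not contain such text.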
from __future__ import absolute_import, division, unicode_literals

import string

EOF = None

E = {
    "null-character":
        "Null character in input stream, replaced with U+FFFD.",
    "invalid-codepoint":
        "Invalid codepoint in stream.",
    "incorrectly-placed-solidus":
        "Solidus (/) incorrectly placed in tag.",
    "incorrect-cr-newline-entity":
        "Incorrect CR newline entity, replaced with LF.",
    "illegal-windows-1252-entity":
        "Entity used with illegal number (windows-1252 reference).",
    "cant-convert-numeric-entity":
        "Numeric entity couldn't be converted to character "
        "(codepoint U+%(charAsInt)08x).",
    "illegal-codepoint-for-numeric-entity":
        "Numeric entity represents an illegal codepoint: "
        "U+%(charAsInt)08x.",
    "numeric-entity-without-semicolon":
        "Numeric entity didn't end with ';'.",
    "expected-numeric-entity-but-got-eof":
        "Numeric entity expected. Got end of file instead.",
    "expected-numeric-entity":
        "Numeric entity expected but none found.",
    "named-entity-without-semicolon":
        "Named entity didn't end with ';'.",
    "expected-named-entity":
        "Named entity expected. Got none.",
    "attributes-in-end-tag":
        "End tag contains unexpected attributes.",
    'self-closing-flag-on-end-tag':
        "End tag contains unexpected self-closing flag.",
    "expected-tag-name-but-got-right-bracket":
        "Expected tag name. Got '>' instead.",
    "expected-tag-name-but-got-question-mark":
        "Expected tag name. Got '?' instead. (HTML doesn't "
        "support processing instructions.)",
    "expected-tag-name":
        "Expected tag name. Got something else instead",
    "expected-closing-tag-but-got-right-bracket":
        "Expected closing tag. Got '>' instead. Ignoring '</>'.",
    "expected-closing-tag-but-got-eof":
        "Expected closing tag. Unexpected end of file.",
    "expected-closing-tag-but-got-char":
        "Expected closing tag. Unexpected character '%(data)s' found.",
    "eof-in-tag-name":
        "Unexpected end of file in the tag name.",
    "expected-attribute-name-but-got-eof":
        "Unexpected end of file. Expected attribute name instead.",
    "eof-in-attribute-name":
        "Unexpected end of file in attribute name.",
    "invalid-character-in-attribute-name":
        "Invalid character in attribute name",
    "duplicate-attribute":
        "Dropped duplicate attribute on tag.",
    "expected-end-of-tag-name-but-got-eof":
        "Unexpected end of file. Expected = or end of tag.",
    "expected-attribute-value-but-got-eof":
        "Unexpected end of file. Expected attribute value.",
    "expected-attribute-value-but-got-right-bracket":
        "Expected attribute value. Got '>' instead.",
    'equals-in-unquoted-attribute-value':
        "Unexpected = in unquoted attribute",
    'unexpected-character-in-unquoted-attribute-value':
        "Unexpected character in unquoted attribute",
    "invalid-character-after-attribute-name":
        "Unexpected character after attribute name.",
    "unexpected-character-after-attribute-value":
        "Unexpected character after attribute value.",
    "eof-in-attribute-value-double-quote":
        "Unexpected end of file in attribute value (\").",
    "eof-in-attribute-value-single-quote":
        "Unexpected end of file in attribute value (').",
    "eof-in-attribute-value-no-quotes":
        "Unexpected end of file in attribute value.",
    "unexpected-EOF-after-solidus-in-tag":
        "Unexpected end of file in tag. Expected >",
    "unexpected-character-after-solidus-in-tag":
        "Unexpected character after / in tag. Expected >",
    "expected-dashes-or-doctype":
        "Expected '--' or 'DOCTYPE'. Not found.",
    "unexpected-bang-after-double-dash-in-comment":
        "Unexpected ! after -- in comment",
    "unexpected-space-after-double-dash-in-comment":
        "Unexpected space after -- in comment",
    "incorrect-comment":
        "Incorrect comment.",
    "eof-in-comment":
        "Unexpected end of file in comment.",
    "eof-in-comment-end-dash":
        "Unexpected end of file in comment (-)",
    "unexpected-dash-after-double-dash-in-comment":
        "Unexpected '-' after '--' found in comment.",
    "eof-in-comment-double-dash":
        "Unexpected end of file in comment (--).",
    "eof-in-comment-end-space-state":
        "Unexpected end of file in comment.",
    "eof-in-comment-end-bang-state":
        "Unexpected end of file in comment.",
    "unexpected-char-in-comment":
        "Unexpected character in comment found.",
    "need-space-after-doctype":
        "No space after literal string 'DOCTYPE'.",
    "expected-doctype-name-but-got-right-bracket":
        "Unexpected > character. Expected DOCTYPE name.",
    "expected-doctype-name-but-got-eof":
        "Unexpected end of file. Expected DOCTYPE name.",
    "eof-in-doctype-name":
        "Unexpected end of file in DOCTYPE name.",
    "eof-in-doctype":
        "Unexpected end of file in DOCTYPE.",
    "expected-space-or-right-bracket-in-doctype":
        "Expected space or '>'. Got '%(data)s'",
    "unexpected-end-of-doctype":
        "Unexpected end of DOCTYPE.",
    "unexpected-char-in-doctype":
        "Unexpected character in DOCTYPE.",
    "eof-in-innerhtml":
        "XXX innerHTML EOF",
    "unexpected-doctype":
        "Unexpected DOCTYPE. Ignored.",
    "non-html-root":
        "html needs to be the first start tag.",
    "expected-doctype-but-got-eof":
        "Unexpected end of file. Expected DOCTYPE.",
    "unknown-doctype":
        "Erroneous DOCTYPE.",
    "expected-doctype-but-got-chars":
        "Unexpected non-space characters. Expected DOCTYPE.",
    "expected-doctype-but-got-start-tag":
        "Unexpected start tag (%(name)s). Expected DOCTYPE.",
    "expected-doctype-but-got-end-tag":
        "Unexpected end tag (%(name)s). Expected DOCTYPE.",
    "end-tag-after-implied-root":
        "Unexpected end tag (%(name)s) after the (implied) root element.",
    "expected-named-closing-tag-but-got-eof":
        "Unexpected end of file. Expected end tag (%(name)s).",
    "two-heads-are-not-better-than-one":
        "Unexpected start tag head in existing head. Ignored.",
    "unexpected-end-tag":
        "Unexpected end tag (%(name)s). Ignored.",
    "unexpected-start-tag-out-of-my-head":
        "Unexpected start tag (%(name)s) that can be in head. Moved.",
    "unexpected-start-tag":
        "Unexpected start tag (%(name)s).",
    "missing-end-tag":
        "Missing end tag (%(name)s).",
    "missing-end-tags":
        "Missing end tags (%(name)s).",
    "unexpected-start-tag-implies-end-tag":
        "Unexpected start tag (%(startName)s) "
        "implies end tag (%(endName)s).",
    "unexpected-start-tag-treated-as":
        "Unexpected start tag (%(originalName)s). Treated as %(newName)s.",
    "deprecated-tag":
        "Unexpected start tag %(name)s. Don't use it!",
    "unexpected-start-tag-ignored":
        "Unexpected start tag %(name)s. Ignored.",
    "expected-one-end-tag-but-got-another":
        "Unexpected end tag (%(gotName)s). "
        "Missing end tag (%(expectedName)s).",
    "end-tag-too-early":
        "End tag (%(name)s) seen too early. Expected other end tag.",
    "end-tag-too-early-named":
        "Unexpected end tag (%(gotName)s). Expected end tag (%(expectedName)s).",
    "end-tag-too-early-ignored":
        "End tag (%(name)s) seen too early. Ignored.",
    "adoption-agency-1.1":
        "End tag (%(name)s) violates step 1, "
        "paragraph 1 of the adoption agency algorithm.",
    "adoption-agency-1.2":
        "End tag (%(name)s) violates step 1, "
        "paragraph 2 of the adoption agency algorithm.",
    "adoption-agency-1.3":
        "End tag (%(name)s) violates step 1, "
        "paragraph 3 of the adoption agency algorithm.",
    "adoption-agency-4.4":
        "End tag (%(name)s) violates step 4, "
        "paragraph 4 of the adoption agency algorithm.",
    "unexpected-end-tag-treated-as":
        "Unexpected end tag (%(originalName)s). Treated as %(newName)s.",
    "no-end-tag":
        "This el
